Flaky Test Finder
Runs the suite again and again across days, records every result, and surfaces the tests that fail without any code change.
Distinguish genuinely flaky tests from real failures by gathering pass/fail data across many identical runs.
After N runs (e.g. 15 over 3 days), produce a flakiness report; or run continuously and report on a cadence.
Fixed interval (0 */4 * * *) · Autonomous
How one iteration works
discover → plan → execute → verify → escalate
- 1. Discover
Note the current commit SHA so runs on unchanged code are comparable.
- 2. Plan
Decide to run the full suite (or a target subset) this cycle.
- 3. Execute
Run the tests; append each test's result, the SHA, and timestamp to the results store.
- 4. Verify
Only count a test as flaky when it both passed and failed on the SAME commit — never flag a test that only failed after code changed.
- 5. Escalate
When a test crosses a flakiness threshold, write it to the report with its pass/fail ratio.
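The Verify and Escalate steps can be sketched as a small grouping pass over the recorded results. This is a minimal illustration, not a reference implementation: the function name `find_flaky`, the record field names, and the `threshold` parameter are assumptions made for the example, and outcomes are assumed to be the literal strings "pass" and "fail".

```python
from collections import defaultdict


def find_flaky(records, threshold=0.0):
    """Return per-(test, SHA) entries that show mixed outcomes on one commit.

    records: iterable of dicts with "test_id", "commit_sha", "outcome"
    (field names are assumptions for this sketch).
    """
    # Count passes and failures separately for every (test_id, commit_sha) pair.
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for r in records:
        if r["outcome"] in ("pass", "fail"):
            counts[(r["test_id"], r["commit_sha"])][r["outcome"]] += 1

    flaky = []
    for (test_id, sha), c in counts.items():
        # A test that only passed or only failed on this SHA is never flagged.
        if c["pass"] == 0 or c["fail"] == 0:
            continue
        ratio = c["fail"] / (c["pass"] + c["fail"])
        if ratio >= threshold:
            flaky.append({
                "test_id": test_id,
                "commit_sha": sha,
                "passes": c["pass"],
                "fails": c["fail"],
                "fail_ratio": ratio,
            })
    # Highest fail ratio first; ties broken by test name for stable output.
    return sorted(flaky, key=lambda e: (-e["fail_ratio"], e["test_id"]))
```

Because the grouping key includes the SHA, a test that passed on one commit and failed on the next never produces an entry, which is the property the Verify step asks for.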
The prompt
The tool-agnostic spec the loop runs each pass — copy it, then wire it to your tool below.
Record the current commit SHA, then run the test suite. For every test, append its outcome (pass/fail), the SHA, and the timestamp to the results log. Do not modify any code or tests. A test counts as flaky only if it has BOTH passed and failed on the same SHA. After this run, update the flakiness report: list each flaky test with its fail rate and the SHAs it flaked on, sorted by fail rate, highest first. If nothing is newly flaky, say so.
/loop 4h run the suite, append results, and update the flakiness report
while true; do SHA=$(git rev-parse HEAD); run_tests --json >> results.ndjson; agent -p "update flakiness report from results.ndjson; the SHA for this run is $SHA"; sleep 14400; done
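The report step described in the prompt could look roughly like the following. It is a sketch under stated assumptions: the name `render_report`, the input shape, the "(new)" marker, and the exact wording of the output lines are all invented for illustration, and "sorted by fail rate" is assumed to mean highest first.

```python
def render_report(flaky, already_reported=frozenset()):
    """Build the report text.

    flaky: {test_id: {"fail_rate": float, "shas": [str, ...]}}
    already_reported: test ids that appeared in an earlier report.
    Both shapes are assumptions for this sketch.
    """
    lines = ["Flakiness report"]
    # Rank by fail rate, highest first; break ties by test name.
    ranked = sorted(flaky.items(), key=lambda kv: (-kv[1]["fail_rate"], kv[0]))
    for test_id, info in ranked:
        marker = "" if test_id in already_reported else " (new)"
        shas = ", ".join(sorted(info["shas"]))
        lines.append(
            f"{test_id}: fail rate {info['fail_rate']:.0%}, flaked on {shas}{marker}"
        )
    # The prompt asks for an explicit statement when nothing new turned up.
    if all(test_id in already_reported for test_id in flaky):
        lines.append("No newly flaky tests.")
    return "\n".join(lines)
```

Passing the set of previously reported tests lets the same function serve both the first report and later updates.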
Memory contract
Append-only results log keyed by (test_id, commit_sha, timestamp, outcome). The report is derived from it; nothing is overwritten.
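One way to honor this contract is a newline-delimited JSON file that is only ever opened in append mode. The sketch below assumes that layout; the helper names `append_result` and `load_results`, the JSON field names, and the ISO 8601 timestamp format are illustrative choices rather than anything the spec prescribes.

```python
import json
from datetime import datetime, timezone


def append_result(path, test_id, commit_sha, outcome, timestamp=None):
    """Append one result as a single JSON line; existing lines are never touched."""
    if outcome not in ("pass", "fail"):
        raise ValueError(f"unexpected outcome: {outcome!r}")
    record = {
        "test_id": test_id,
        "commit_sha": commit_sha,
        # Default to the current UTC time when the caller supplies none.
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "outcome": outcome,
    }
    # Mode "a" writes at the end of the file and creates it if missing.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_results(path):
    """Read every record back; the report is derived from this list."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the report as a pure function of `load_results` output means it can be regenerated at any time without touching the log.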
Verification & guardrails
How it checks itself. Flakiness is asserted only from mixed outcomes on an identical SHA; a single failure is not enough to flag.
- Read-only with respect to code — it only runs tests and appends data
- Never edits or deletes a test on its own
- Compares within the same commit so code changes can't masquerade as flakiness
Failure modes
- Calls a test flaky when the failures actually came from changed code — always key by SHA
- Results file grows unbounded — rotate or summarize old runs
- Misses time-of-day flakiness if it always runs at the same minute — vary the schedule
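Two of the mitigations above can be sketched briefly: collapsing old runs into counts, and adding jitter to the interval. Both functions are hypothetical helpers. The 14400-second base matches the 4-hour interval used elsewhere in this recipe, while the 1800-second jitter window is an arbitrary assumption.

```python
import random
from collections import Counter


def summarize(records):
    """Collapse raw records into one count per (test_id, commit_sha, outcome).

    Timestamps are dropped, so write the summary to a separate archive
    rather than over the live log.
    """
    counts = Counter(
        (r["test_id"], r["commit_sha"], r["outcome"]) for r in records
    )
    return [
        {"test_id": t, "commit_sha": s, "outcome": o, "count": n}
        for (t, s, o), n in sorted(counts.items())
    ]


def next_delay(base_seconds=14400, jitter_seconds=1800, rng=random):
    """Return the base interval shifted by a random offset in either direction."""
    return base_seconds + rng.randint(-jitter_seconds, jitter_seconds)
```

The summary keeps enough information to recompute mixed outcomes per SHA, which is all the flakiness rule needs; per-run detail such as time of day is lost, so summarize only runs old enough that this no longer matters.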
Variations
- Targeted. Only re-run the subset of tests already suspected flaky to save time, widening occasionally to catch new ones.
- Quarantine proposer. When a test crosses a high threshold, have it open a PR proposing a quarantine/retag — still human-approved.
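The Targeted variation comes down to a per-cycle choice between the suspect subset and the full suite. A minimal sketch, assuming a hypothetical `choose_targets` helper and an arbitrary default of widening every sixth cycle:

```python
def choose_targets(cycle, suspects, all_tests, widen_every=6):
    """Pick the tests to run this cycle.

    cycle: integer run counter.
    suspects: test ids already suspected flaky.
    all_tests: every test id currently in the suite.
    widen_every: how often to fall back to the full suite (assumed value).
    """
    # Ignore suspects that no longer exist in the suite.
    targets = set(suspects) & set(all_tests)
    # Run everything when there is nothing to target or a widening cycle is due.
    if not targets or cycle % widen_every == 0:
        return sorted(all_tests)
    return sorted(targets)
```

Smaller `widen_every` values catch newly flaky tests sooner at the cost of more full-suite runs.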
Example run
Run 11/15 at SHA a1b2c3d. 0 real failures. 'test_websocket_reconnect' failed this run but passed runs 3, 5, and 7 on the same SHA -> flaky, fail rate 27%. Report updated.