Fix retry queue to re-run failed tests on Buildkite rebuild#392
Merged
Conversation
thadcraft-shopify
approved these changes
Apr 8, 2026
When Buildkite rebuilds a job (new BUILDKITE_BUILD_ID), workers start
with an empty per-worker log in the new Redis namespace. Previously,
retry_queue would find no matching failures and fall through to "The
retry queue does not contain any failure" — leaving error-reports
populated and the summary step failing indefinitely.
Fix: when the per-worker log intersection yields no failures, fall back
to redis.hkeys(key('error-reports')) so any worker can retry failures
from any other worker. Also handles automatic LSO retries
(BUILDKITE_RETRY_TYPE=automatic) via the existing elsif retry? branch.
Additional fixes:
- Retry#acknowledge queues a Redis SADD so BuildRecord#record_success
gets correct multi-exec result indices for stats delta correction
- reset_stats now purges error-report-deltas for the resetting worker
to prevent stats underflow on same-worker retries
- QueueEntry.format raises ArgumentError when file_path is nil/empty
and canonicalizes paths via File.expand_path
- SingleExample#queue_entry resolves file_path via source_location
- RSpec SingleExample and BuildStatusRecorder use example.file_path
0aea578 to
9c91f74
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Situation
When Buildkite rebuilds a job (new
BUILDKITE_BUILD_ID), workers start fresh in a new Redis namespace — so their per-worker logs are empty.retry_queuebuilds its test list by intersecting the per-worker log witherror-reports, and with an empty log that intersection is always empty. Workers would print "The retry queue does not contain any failure, we'll process the main queue instead", find the main queue exhausted too, and exit 0. The summary step would then read the still-populatederror-reportsand fail indefinitely.The same problem affects automatic LSO retries (
BUILDKITE_RETRY_TYPE=automatic), already fixed in #390, which relied on the same intersection.Execution
Worker#retry_queue(worker.rb): When the per-worker log intersection yields nothing, fall back toredis.hkeys(key('error-reports'))directly. Any worker in the retry batch can now pick up failures from any other worker.Retry#acknowledge(retry.rb):BuildRecord#record_successruns inside aredis.multiblock and destructures the result array by position.Static#acknowledgewas returning a plain Ruby value instead of queuing a Redis command, shifting all subsequent indices and breaking the stats delta correction. Now queues aSADDtoprocessedso the positions stay correct.reset_stats(build_record.rb): Added a purge oferror-report-deltasentries belonging to the resetting worker. Without this,apply_error_report_delta_correctionwould subtract from already-zeroed counters on a same-worker retry, producing negative failure counts.QueueEntry.format(queue_entry.rb): Now raisesArgumentErrorwhenfile_pathis nil or empty, and canonicalizes viaFile.expand_path. Previously, nil file paths were silently accepted and persisted into Redis — making entry keys non-reproducible across different working directories.SingleExample#queue_entryand RSpec counterparts: Resolvefile_pathviasource_location/example.file_pathinstead of hardcoding nil.