
Fix retry queue to re-run failed tests on Buildkite rebuild #392

Merged
ianks merged 2 commits into main from ianks/fix-retry-queue-fallback on Apr 8, 2026


@ianks ianks commented Apr 8, 2026

Situation

When Buildkite rebuilds a job (new BUILDKITE_BUILD_ID), workers start fresh in a new Redis namespace — so their per-worker logs are empty. retry_queue builds its test list by intersecting the per-worker log with error-reports, and with an empty log that intersection is always empty. Workers would print "The retry queue does not contain any failure, we'll process the main queue instead", find the main queue exhausted too, and exit 0. The summary step would then read the still-populated error-reports and fail indefinitely.

Automatic LSO retries (BUILDKITE_RETRY_TYPE=automatic), already fixed in #390, relied on the same intersection and were affected by the same problem.

Execution

Worker#retry_queue (worker.rb): When the per-worker log intersection yields nothing, fall back to redis.hkeys(key('error-reports')) directly. Any worker in the retry batch can now pick up failures from any other worker.
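
As a rough illustration of the fallback (not the actual worker.rb code), the selection logic can be sketched with plain Ruby collections. The method name `retry_candidates` and the hash standing in for the `error-reports` Redis hash are assumptions made for this example only.

```ruby
# Hypothetical stand-in for the fallback described above.
# `worker_log` models the per-worker log; `error_reports` models the
# error-reports hash, keyed by test id. Neither name is the gem's API.
def retry_candidates(worker_log, error_reports)
  # Original behaviour: tests this worker ran that also have an error report.
  failures = worker_log & error_reports.keys
  # Fallback: on a rebuild the log is empty, so take every reported failure.
  failures = error_reports.keys if failures.empty?
  failures
end

error_reports = {
  "FooTest#test_a" => "payload",
  "BarTest#test_b" => "payload",
}

# Fresh namespace after a rebuild: the empty log falls back to all failures.
rebuilt = retry_candidates([], error_reports)

# Same-build retry: the intersection is non-empty, so no fallback happens.
same_build = retry_candidates(["FooTest#test_a", "BazTest#test_c"], error_reports)
```

Here `rebuilt` contains both failing test ids, while `same_build` contains only `FooTest#test_a`.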

Retry#acknowledge (retry.rb): BuildRecord#record_success runs inside a redis.multi block and destructures the result array by position. Static#acknowledge was returning a plain Ruby value instead of queuing a Redis command, shifting all subsequent indices and breaking the stats delta correction. Now queues a SADD to processed so the positions stay correct.
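
The index shift can be reproduced with a toy model of a transaction, in which only queued commands contribute a slot to the result array. `FakeTransaction`, the simplified `record_success`, and the order of the queued commands are all assumptions for illustration; the real BuildRecord code may queue different commands in a different order.

```ruby
# Toy model of MULTI/EXEC: each queued command adds one result slot.
class FakeTransaction
  attr_reader :results

  def initialize
    @results = []
  end

  # Pretend to queue a command whose eventual reply is `reply`.
  def queue(reply)
    @results << reply
    nil
  end
end

# Hypothetical caller that destructures results by position.
def record_success(tx, acknowledge)
  acknowledge.call(tx) # expected to queue exactly one command
  tx.queue(3)          # assumed: lookup of the stats delta
  tx.queue("OK")       # assumed: some trailing command
  ack, delta, _rest = tx.results
  { acknowledged: ack, delta: delta }
end

plain_value = ->(_tx) { true }       # returns a Ruby value, queues nothing
queued_sadd = ->(tx) { tx.queue(1) } # queues a command, as SADD would

broken = record_success(FakeTransaction.new, plain_value)
fixed  = record_success(FakeTransaction.new, queued_sadd)
```

In `broken`, the delta slot holds `"OK"` because every index moved up by one; in `fixed`, `delta` is `3` as the caller expects.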

reset_stats (build_record.rb): Added a purge of error-report-deltas entries belonging to the resetting worker. Without this, apply_error_report_delta_correction would subtract from already-zeroed counters on a same-worker retry, producing negative failure counts.
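
A minimal sketch of the underflow, assuming failure counters keyed by worker and delta entries that remember which worker they belong to. The data layout and the `purge_deltas:` flag are invented for this example and do not reflect the real Redis schema.

```ruby
# Zero the worker's counter and, optionally, drop its pending deltas.
def reset_stats(counters, deltas, worker_id, purge_deltas:)
  counters[worker_id] = 0
  deltas.reject! { |entry| entry[:worker] == worker_id } if purge_deltas
end

# Subtract every remaining delta from its worker's counter.
def apply_delta_correction(counters, deltas)
  deltas.each { |entry| counters[entry[:worker]] -= entry[:amount] }
end

# Simulate a same-worker retry and return the resulting failure count.
def failures_after_retry(purge_deltas:)
  counters = { "worker-1" => 1 }
  deltas = [{ worker: "worker-1", amount: 1 }]
  reset_stats(counters, deltas, "worker-1", purge_deltas: purge_deltas)
  apply_delta_correction(counters, deltas)
  counters["worker-1"]
end

without_purge = failures_after_retry(purge_deltas: false)
with_purge    = failures_after_retry(purge_deltas: true)
```

Without the purge the stale delta is subtracted from a zeroed counter and `without_purge` is `-1`; with the purge, `with_purge` stays at `0`.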

QueueEntry.format (queue_entry.rb): Now raises ArgumentError when file_path is nil or empty, and canonicalizes via File.expand_path. Previously, nil file paths were silently accepted and persisted into Redis — making entry keys non-reproducible across different working directories.
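
The validation and canonicalization might look roughly like the following. The module name `QueueEntrySketch`, the argument order, and the tab separator are assumptions; only the `ArgumentError` on a nil or empty path and the use of `File.expand_path` come from the description above.

```ruby
# Hypothetical sketch; the real entry layout may differ.
module QueueEntrySketch
  def self.format(test_id, file_path)
    if file_path.nil? || file_path.empty?
      raise ArgumentError, "file_path is required for #{test_id}"
    end

    # File.expand_path resolves the path against the current working
    # directory and returns an absolute path.
    "#{test_id}\t#{File.expand_path(file_path)}"
  end
end

entry = QueueEntrySketch.format("FooTest#test_a", "test/foo_test.rb")

rejected = begin
  QueueEntrySketch.format("FooTest#test_a", nil)
  false
rescue ArgumentError
  true
end
```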

SingleExample#queue_entry and RSpec counterparts: Resolve file_path via source_location / example.file_path instead of hardcoding nil.
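
For the Minitest side, Ruby's built-in `Method#source_location` returns the file and line where a method was defined, which is presumably what the change relies on. The class and helper below are illustrative only and are not taken from the diff.

```ruby
# Illustrative test class; any method defined in Ruby source works.
class ExampleTest
  def test_passes; end
end

# Hypothetical helper: look up the file that defines a test method.
def file_path_for(klass, method_name)
  file, _line = klass.instance_method(method_name).source_location
  file
end

resolved_path = file_path_for(ExampleTest, :test_passes)
```

`resolved_path` is the path of the file containing `test_passes`, so it can be passed along instead of `nil`.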

@ianks ianks requested a review from thadcraft-shopify April 8, 2026 01:57
ianks added 2 commits April 7, 2026 22:04
When Buildkite rebuilds a job (new BUILDKITE_BUILD_ID), workers start
with an empty per-worker log in the new Redis namespace. Previously,
retry_queue would find no matching failures and fall through to "The
retry queue does not contain any failure" — leaving error-reports
populated and the summary step failing indefinitely.

Fix: when the per-worker log intersection yields no failures, fall back
to redis.hkeys(key('error-reports')) so any worker can retry failures
from any other worker. Also handles automatic LSO retries
(BUILDKITE_RETRY_TYPE=automatic) via the existing elsif retry? branch.

Additional fixes:
- Retry#acknowledge queues a Redis SADD so BuildRecord#record_success
  gets correct multi-exec result indices for stats delta correction
- reset_stats now purges error-report-deltas for the resetting worker
  to prevent stats underflow on same-worker retries
- QueueEntry.format raises ArgumentError when file_path is nil/empty
  and canonicalizes paths via File.expand_path
- SingleExample#queue_entry resolves file_path via source_location
- RSpec SingleExample and BuildStatusRecorder use example.file_path
@ianks ianks force-pushed the ianks/fix-retry-queue-fallback branch from 0aea578 to 9c91f74 Compare April 8, 2026 02:04
@ianks ianks merged commit 637d7b7 into main Apr 8, 2026
22 checks passed
@ianks ianks deleted the ianks/fix-retry-queue-fallback branch April 8, 2026 02:09
