datasetworker: skip pack jobs orphaned by deleted preparation #679
Open
JAG-UK wants to merge 1 commit into
Deleting a preparation sets jobs.attachment_id to NULL (SET NULL cascade) and leaves the reaper to delete the rows later. findJob did not exclude those rows, so an orphaned pack job got claimed and pack.Pack dereferenced the nil Attachment, panicking the whole worker in a crash loop. Skip orphans when claiming, and guard pack.Pack defensively.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parkan added a commit that referenced this pull request on Jun 29, 2026
## Problem
A dataset worker can nil-deref `*job.Attachment` and panic, taking down
the service. Reported as a pack-side panic at `pack/pack.go:91`
(`job.Attachment.Preparation.GetMinPieceSize()`), but the same shape
exists at the run-loop dispatch in `datasetworker.go` for all three job
types -- a `unified.db` from a user's `prep start-scan` run shows the
scan path panicking with the same root cause.
With `--exit-on-error` it becomes a crash loop: the panic unwinds past
the normal error handler so the job's state is never written to `Error`,
and on restart the healthcheck resets the dead worker's claimed job back
to `Ready`. Same job, same panic, again.
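
The crash loop described above can be modelled in a few lines. This is a minimal sketch, not the worker's real code: the `State` values, `Job`, `Attachment`, `runOnce`, and `healthcheckReset` are all hypothetical names, and the `recover` only stands in for the process dying so the example can keep running.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the job state column.
type State string

const (
	Ready      State = "ready"
	Processing State = "processing"
	Complete   State = "complete"
	Error      State = "error"
)

// Attachment and Job are hypothetical, trimmed-down records.
type Attachment struct{ MinPieceSize int64 }

type Job struct {
	State      State
	Attachment *Attachment // nil once the attachment row has been deleted
}

// handle dereferences the attachment without checking it, like the
// unguarded call sites described above.
func handle(job *Job) error {
	if job.Attachment.MinPieceSize < 0 { // panics when Attachment is nil
		return fmt.Errorf("invalid piece size")
	}
	return nil
}

// runOnce models one run-loop iteration. The recover only stands in for the
// process dying, so the example can keep going; it does not touch the job.
func runOnce(job *Job) (crashed bool) {
	defer func() {
		if recover() != nil {
			crashed = true
		}
	}()
	job.State = Processing
	if err := handle(job); err != nil {
		job.State = Error // normal error path: the job is parked
		return false
	}
	job.State = Complete
	return false
}

// healthcheckReset models the restart path: a job still claimed by a dead
// worker is handed back to the queue.
func healthcheckReset(job *Job) {
	if job.State == Processing {
		job.State = Ready
	}
}

func main() {
	orphan := &Job{State: Ready} // Attachment is nil
	for attempt := 1; attempt <= 3; attempt++ {
		crashed := runOnce(orphan)
		healthcheckReset(orphan)
		fmt.Printf("attempt %d: crashed=%v state=%s\n", attempt, crashed, orphan.State)
	}
}
```

Every attempt crashes and leaves the job back in the ready state, because the only line that would park it in `Error` sits on the returned-error path that a panic never reaches.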
## Root cause
Job/file/car associations use `OnDelete: SET NULL` for fast prep
deletion with async reaping. When a preparation (or attachment) is
deleted, `jobs.attachment_id` is nulled and the healthcheck reaper
deletes those rows later, in batches of 100 every 5 minutes.
`findJob` claims jobs with:
```go
Where("type = ? AND (state = ? OR (state = ? AND worker_id IS NULL))",
	...)
```
No `attachment_id IS NOT NULL` guard, so an orphaned job whose
attachment was nulled but not yet reaped gets claimed. The preload then
leaves `job.Attachment` nil and the dispatch -- `case model.Scan:
w.scan(workCtx, *job.Attachment)` and the equivalents for
`Pack`/`DagGen` -- nil-derefs.
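
The effect of the missing condition can be shown by evaluating the same predicate over in-memory rows. This is only a sketch: `Job`, `claimable`, and `claimableGuarded` are hypothetical, and because the bound state arguments are elided in the quoted query, they are passed in as parameters with placeholder values rather than guessed.

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down model of a jobs row.
type Job struct {
	ID           uint64
	Type         string
	State        string
	WorkerID     *string // NULL when unclaimed
	AttachmentID *uint64 // NULL after the SET NULL cascade
}

// claimable mirrors the WHERE clause quoted above. The two state arguments
// are elided in the source, so callers supply them.
func claimable(j Job, jobType, stateA, stateB string) bool {
	return j.Type == jobType &&
		(j.State == stateA || (j.State == stateB && j.WorkerID == nil))
}

// claimableGuarded adds the attachment_id IS NOT NULL condition.
func claimableGuarded(j Job, jobType, stateA, stateB string) bool {
	return j.AttachmentID != nil && claimable(j, jobType, stateA, stateB)
}

func main() {
	att := uint64(7)
	jobs := []Job{
		{ID: 1, Type: "pack", State: "ready", AttachmentID: &att},
		{ID: 2, Type: "pack", State: "ready"}, // orphan: attachment nulled
	}
	// "ready" and "processing" are placeholder state names.
	for _, j := range jobs {
		fmt.Printf("job %d: unguarded=%v guarded=%v\n", j.ID,
			claimable(j, "pack", "ready", "processing"),
			claimableGuarded(j, "pack", "ready", "processing"))
	}
}
```

The orphan (job 2) matches the unguarded predicate and is rejected only once the attachment check is part of the claim.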
Data-state dependent, not engine dependent: reproduces on sqlite and
postgres alike. A fresh DB just hides it until the next preparation
deletion.
## Fix
- `find.go`: add `attachment_id IS NOT NULL` so orphaned jobs are
skipped and left for the reaper. This is the primary fix.
- `datasetworker.go`: defensive nil guard at the run-loop dispatch,
covering Scan/Pack/DagGen uniformly. Belt-and-suspenders for any race
between the SET NULL cascade and the claim+preload window.
- `find_test.go`: `TestFindWorkSkipsOrphanedJob` asserts orphaned jobs
of all three types are not claimed.
Supersedes #679 -- same primary fix, with the defensive guard moved one
frame up so Scan and DagGen are covered by the same code path, not just
Pack.
Co-authored-by: JAG-UK <jon.geater@gmail.com>
## Problem

A pack worker reliably panics with a nil-pointer dereference at `pack/pack.go:91` (`job.Attachment.Preparation.GetMinPieceSize()`), taking down the whole `dataset-worker` service. With `--exit-on-error` it becomes a crash loop: the panic unwinds past the normal error handler so the job's state is never set to `Error`, and on restart the healthcheck resets the dead worker's claimed job back to `Ready`, so the same job is picked and panics again.

## Root cause

All job/file/car associations use `OnDelete:SET NULL` "for fast prep deletion, async cleanup". When a preparation (or attachment) is deleted, the DB sets `jobs.attachment_id` to `NULL` and leaves the healthcheck reaper to delete those orphaned rows later, in batches of 100 every 5 minutes.

`findJob` claims pack jobs with a query that has no `attachment_id IS NOT NULL` guard, so an orphaned job whose attachment was nulled, but not yet reaped, gets claimed. The preload then leaves `job.Attachment` nil and `pack.Pack` dereferences it.

This is data-state dependent, not DB-engine dependent (it reproduces on both sqlite and postgres); a fresh DB just hides it until the next preparation deletion.

## Fix

- `findJob`: add `attachment_id IS NOT NULL` so orphaned jobs are skipped and left for the reaper.
- `pack.Pack`: guard defensively against a nil `Attachment`/`Preparation` (returns an error instead of panicking) to cover the SET-NULL-vs-reaper race.
- `TestFindPackWorkSkipsOrphanedJob` asserts an orphaned pack job is not claimed.

🤖 Generated with Claude Code