datasetworker: skip pack jobs orphaned by deleted preparation #679

Open
JAG-UK wants to merge 1 commit into data-preservation-programs:main from JAG-UK:claude/admiring-wilson-c92a17
JAG-UK (Contributor) commented Jun 29, 2026

Problem

A pack worker reliably panics with a nil-pointer dereference at pack/pack.go:91 (job.Attachment.Preparation.GetMinPieceSize()), taking down the whole dataset-worker service. With --exit-on-error it becomes a crash loop: the panic unwinds past the normal error handler so the job's state is never set to Error, and on restart the healthcheck resets the dead worker's claimed job back to Ready, so the same job is picked and panics again.

Root cause

All job/file/car associations use OnDelete:SET NULL "for fast prep deletion, async cleanup". When a preparation (or attachment) is deleted, the DB sets jobs.attachment_id to NULL and leaves the healthcheck reaper to delete those orphaned rows later, in batches of 100 every 5 minutes.

findJob claims pack jobs with:

Where("type = ? AND (state = ? OR (state = ? AND worker_id IS NULL))", ...)

There is no attachment_id IS NOT NULL guard, so an orphaned job whose attachment was nulled — but not yet reaped — gets claimed. The preload then leaves job.Attachment nil and pack.Pack dereferences it.

This is data-state dependent, not DB-engine dependent (it reproduces on both sqlite and postgres); a fresh DB just hides it until the next preparation deletion.
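The claim condition can be modeled in memory to see why the missing guard matters. The sketch below is not the project's `findJob`: the `job` struct, the `claimable` helper, and the state names `"ready"` and `"processing"` are hypothetical stand-ins, and a nil `attachmentID` plays the role of a NULL `attachment_id`.

```go
package main

import "fmt"

// job is a hypothetical, simplified stand-in for a row in the jobs table.
// Pointer fields model nullable columns.
type job struct {
	id           int
	typ          string
	state        string
	workerID     *string
	attachmentID *int // nil models attachment_id IS NULL (an orphaned job)
}

// claimable mirrors the shape of the quoted WHERE clause:
//   type = ? AND (state = ? OR (state = ? AND worker_id IS NULL))
// With requireAttachment set, it additionally applies attachment_id IS NOT NULL.
// stateA and stateB are parameters because the real values are not shown here.
func claimable(j job, typ, stateA, stateB string, requireAttachment bool) bool {
	if j.typ != typ {
		return false
	}
	stateOK := j.state == stateA || (j.state == stateB && j.workerID == nil)
	if !stateOK {
		return false
	}
	if requireAttachment && j.attachmentID == nil {
		return false
	}
	return true
}

func main() {
	att := 7
	jobs := []job{
		{id: 1, typ: "pack", state: "ready", attachmentID: &att},
		{id: 2, typ: "pack", state: "ready", attachmentID: nil}, // orphan awaiting the reaper
	}
	for _, j := range jobs {
		fmt.Printf("job %d: without guard=%v, with guard=%v\n",
			j.id,
			claimable(j, "pack", "ready", "processing", false),
			claimable(j, "pack", "ready", "processing", true))
	}
}
```

In this model job 2 is claimable only while the guard is off, which is the window in which the nil dereference can happen.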

Fix

  • findJob: add attachment_id IS NOT NULL so orphaned jobs are skipped and left for the reaper.
  • pack.Pack: guard defensively against a nil Attachment/Preparation (returns an error instead of panicking) to cover the SET-NULL-vs-reaper race.
  • Test: TestFindPackWorkSkipsOrphanedJob asserts an orphaned pack job is not claimed.

🤖 Generated with Claude Code

A deleted preparation sets jobs.attachment_id to NULL (SET NULL cascade)
and leaves the reaper to delete the rows later. findJob did not exclude
those, so an orphaned pack job got claimed and pack.Pack dereferenced the
nil Attachment, panicking the whole worker in a crash loop.

Skip orphans when claiming, and guard pack.Pack defensively.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parkan added a commit that referenced this pull request Jun 29, 2026
## Problem

A dataset worker can nil-deref `*job.Attachment` and panic, taking down
the service. Reported as a pack-side panic at `pack/pack.go:91`
(`job.Attachment.Preparation.GetMinPieceSize()`), but the same shape
exists at the run-loop dispatch in `datasetworker.go` for all three job
types -- a `unified.db` from a user's `prep start-scan` run shows the
scan path panicking with the same root cause.

With `--exit-on-error` it becomes a crash loop: the panic unwinds past
the normal error handler so the job's state is never written to `Error`,
and on restart the healthcheck resets the dead worker's claimed job back
to `Ready`. Same job, same panic, again.

## Root cause

Job/file/car associations use `OnDelete: SET NULL` for fast prep
deletion with async reaping. When a preparation (or attachment) is
deleted, `jobs.attachment_id` is nulled and the healthcheck reaper
deletes those rows later, in batches of 100 every 5 minutes.

`findJob` claims jobs with:

```go
Where("type = ? AND (state = ? OR (state = ? AND worker_id IS NULL))",
...)
```

No `attachment_id IS NOT NULL` guard, so an orphaned job whose
attachment was nulled but not yet reaped gets claimed. The preload then
leaves `job.Attachment` nil and the dispatch -- `case model.Scan:
w.scan(workCtx, *job.Attachment)` and the equivalents for
`Pack`/`DagGen` -- nil-derefs.

Data-state dependent, not engine dependent: reproduces on sqlite and
postgres alike. A fresh DB just hides it until the next preparation
deletion.

## Fix

- `find.go`: add `attachment_id IS NOT NULL` so orphaned jobs are
skipped and left for the reaper. This is the primary fix.
- `datasetworker.go`: defensive nil guard at the run-loop dispatch,
covering Scan/Pack/DagGen uniformly. Belt-and-suspenders for any race
between the SET NULL cascade and the claim+preload window.
- `find_test.go`: `TestFindWorkSkipsOrphanedJob` asserts orphaned jobs
of all three types are not claimed.

Supersedes #679 -- same primary fix, with the defensive guard moved one
frame up so Scan and DagGen are covered by the same code path, not just
Pack.

Co-authored-by: JAG-UK <jon.geater@gmail.com>