Skip to content

fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery (companion to client-runtime#94)#231

Merged
saadqbal merged 3 commits into
developfrom
fix/ingestor-floating-tag
Jun 9, 2026
Merged

fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery (companion to client-runtime#94)#231
saadqbal merged 3 commits into
developfrom
fix/ingestor-floating-tag

Conversation

@LukasWodka

Copy link
Copy Markdown
Contributor

Summary

Companion to tracebloc/client-runtime#94. That PR makes jobs-manager spawn the ingestor Job by its floating tag (ghcr.io/tracebloc/ingestor:0.3) with imagePullPolicy: Always by default, with digest-pinning kept as an opt-in. This PR wires the chart to that model and removes the now-dead machinery the old pinned-digest design required.

Why

The ingestor was the only spawned image pinned by a chart digest. A helm upgrade that reset values reverted images.ingestor.digest to the chart baseline, silently downgrading the ingestor — the failure mode behind the recurring "empty columns rejected as non-numeric" customer reports (a stale ingestor lacked the NaN-tolerance fixes). Training pods never had this problem because they spawn by floating tag + Always. This makes the ingestor match — the cluster always runs the current published ingestor, and a helm upgrade can no longer revert it because there is no pinned digest to reset.

What changed

jobs-manager-deployment.yaml — wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the now-optional INGESTOR_IMAGE_DIGEST. Empty digest (the default) → floating tag.

values.yaml / values.schema.jsonimages.ingestor: add repository, default digest: "" (opt-in pin), keep tag: "0.3"; drop the dead autoRefresh flag. The v0.3.6 digest is retained as a comment for easy pinning. Repository + tag are overridable for air-gapped mirrors. imageRefresh: drop ingestorResolveFailureThreshold (class-2 only).

image-refresh-cronjob.yaml + _helpers.tpl — retire the class-2 (ingestor) pass entirely. The kubelet now resolves the ingestor digest at each spawn, so there is nothing for the CronJob to reconcile. It keeps only the class-1 pass (jobs-manager + pods-monitor rollout restart). imageRefreshEnabled no longer factors in the ingestor (the CronJob is skipped when both class-1 images are pinned).

helm-ci.yamlingestor-multiarch now validates the floating tag (the default spawn target) and only checks a pinned digest when one is set, instead of failing on the now-empty digest.

ingestor/ subchart (README + values) — update docs that described the old auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.

Backward compatibility

  • Pinning still works: set images.ingestor.digest (or pass image_digest per-request via the subchart) → repo@digest + IfNotPresent. The INGESTOR_IMAGE_DIGEST env is retained as the opt-in pin.
  • --reuse-values from older releases: all new env reads are nil-guarded.

Validation (local)

  • helm lint --strict on all four platform values (aks / eks / bm / oc) — clean.
  • helm template — renders the new env wiring; CronJob is class-1-only.
  • helm unittest ./client167/167 pass.
  • sh -n on the rendered image-refresh script — valid; if/fi + for/done balanced.

Note

Supersedes the stopgap baseline-digest bump (v0.3.6) as the default currency mechanism — the pinned digest is now opt-in. The v0.3.6 value is preserved in values.yaml as the documented pin baseline.

⚠️ High-blast-radius prod-infra change across the chart + CronJob that can't be fully exercised without a live cluster. The CronJob class-2 excision and the deployment env wiring deserve a careful look before promotion.

🤖 Generated with Claude Code

…hinery

Companion to client-runtime#94. That PR makes jobs-manager spawn the
ingestor Job by its floating tag with imagePullPolicy=Always by default
(digest pinning becomes opt-in). This wires the chart to that model and
removes the now-dead machinery the old pinned-digest design needed.

Root problem: the ingestor was the only spawned image pinned by a chart
digest. A `helm upgrade` that reset values reverted images.ingestor.digest
to the chart baseline, silently downgrading the ingestor — the failure mode
behind the recurring "empty columns rejected as non-numeric" customer
reports. Training pods never had this because they spawn by floating tag +
Always; this makes the ingestor match.

jobs-manager-deployment.yaml:
  * Wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the
    now-optional INGESTOR_IMAGE_DIGEST. Empty digest (default) -> floating tag.

values.yaml / values.schema.json:
  * images.ingestor: add `repository`, default `digest: ""` (opt-in pin),
    keep `tag: "0.3"`; drop the dead `autoRefresh` flag. v0.3.6 digest kept
    as a comment for easy pinning. Repository overridable for air-gap mirror.
  * imageRefresh: drop `ingestorResolveFailureThreshold` (class-2 only).

image-refresh-cronjob.yaml + _helpers.tpl:
  * Retire the class-2 (ingestor) pass entirely — the kubelet now resolves
    the ingestor digest at each spawn, so there is nothing to reconcile. The
    CronJob keeps only the class-1 pass (jobs-manager + pods-monitor
    rollout-restart). imageRefreshEnabled no longer factors in the ingestor.

helm-ci.yaml:
  * ingestor-multiarch now validates the floating TAG (the default spawn
    target) and only checks a pinned digest when one is set, instead of
    failing on the now-empty digest.

ingestor subchart (README + values): update docs that described the old
auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.

Validated locally: helm lint --strict (aks/eks/bm/oc), helm template,
helm unittest (167/167), sh -n on the rendered refresh script.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…ring

Minor version bump for the floating-tag ingestor feature. Adds three
jobs-manager deployment unittests covering the new ingestor image wiring:
floating-tag default (INGESTOR_IMAGE_REPOSITORY/TAG set, DIGEST empty),
opt-in digest pin, and repository+tag override for air-gapped mirrors.

helm unittest ./client: 170/170 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ing-tag

# Conflicts:
#	client/tests/jobs_manager_test.yaml
@saadqbal saadqbal merged commit 408d7d9 into develop Jun 9, 2026
14 checks passed
saadqbal added a commit that referenced this pull request Jun 9, 2026
…on lockstep

The 1.6.0 chart bump (#231) moved Chart.yaml `version` 1.5.1 → 1.6.0 but left
`appVersion` at "1.5.1" — every prior release (1.1.0 … 1.5.1) bumped the two
in lockstep. tracebloc.labels emits `app.kubernetes.io/version` from
.Chart.AppVersion, so a 1.6.0 chart labelled every resource (and reported to
cluster info, which reads that standard label) as app version 1.5.1 while
`helm.sh/chart` correctly said client-1.6.0 — two disagreeing version signals.

Restore lockstep so the published 1.6.0 carries the right app version from the
start (no stale-label 1.6.1 follow-up). No functional/workload change; helm
unittest 196 pass, lint clean, render shows both labels at 1.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants