fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery (companion to client-runtime#94)#231
Merged
Merged
Conversation
…hinery
Companion to client-runtime#94. That PR makes jobs-manager spawn the
ingestor Job by its floating tag with imagePullPolicy=Always by default
(digest pinning becomes opt-in). This wires the chart to that model and
removes the now-dead machinery the old pinned-digest design needed.
Root problem: the ingestor was the only spawned image pinned by a chart
digest. A `helm upgrade` that reset values reverted images.ingestor.digest
to the chart baseline, silently downgrading the ingestor — the failure mode
behind the recurring "empty columns rejected as non-numeric" customer
reports. Training pods never had this because they spawn by floating tag +
Always; this makes the ingestor match.
jobs-manager-deployment.yaml:
* Wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the
now-optional INGESTOR_IMAGE_DIGEST. Empty digest (default) -> floating tag.
values.yaml / values.schema.json:
* images.ingestor: add `repository`, default `digest: ""` (opt-in pin),
keep `tag: "0.3"`; drop the dead `autoRefresh` flag. v0.3.6 digest kept
as a comment for easy pinning. Repository overridable for air-gap mirror.
* imageRefresh: drop `ingestorResolveFailureThreshold` (class-2 only).
image-refresh-cronjob.yaml + _helpers.tpl:
* Retire the class-2 (ingestor) pass entirely — the kubelet now resolves
the ingestor digest at each spawn, so there is nothing to reconcile. The
CronJob keeps only the class-1 pass (jobs-manager + pods-monitor
rollout-restart). imageRefreshEnabled no longer factors in the ingestor.
helm-ci.yaml:
* ingestor-multiarch now validates the floating TAG (the default spawn
target) and only checks a pinned digest when one is set, instead of
failing on the now-empty digest.
ingestor subchart (README + values): update docs that described the old
auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.
Validated locally: helm lint --strict (aks/eks/bm/oc), helm template,
helm unittest (167/167), sh -n on the rendered refresh script.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
👋 Heads-up — Code review queue is at 12 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
…ring Minor version bump for the floating-tag ingestor feature. Adds three jobs-manager deployment unittests covering the new ingestor image wiring: floating-tag default (INGESTOR_IMAGE_REPOSITORY/TAG set, DIGEST empty), opt-in digest pin, and repository+tag override for air-gapped mirrors. helm unittest ./client: 170/170 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ing-tag # Conflicts: # client/tests/jobs_manager_test.yaml
saadqbal
approved these changes
Jun 9, 2026
This was referenced Jun 9, 2026
saadqbal
added a commit
that referenced
this pull request
Jun 9, 2026
…on lockstep The 1.6.0 chart bump (#231) moved Chart.yaml `version` 1.5.1 → 1.6.0 but left `appVersion` at "1.5.1" — every prior release (1.1.0 … 1.5.1) bumped the two in lockstep. tracebloc.labels emits `app.kubernetes.io/version` from .Chart.AppVersion, so a 1.6.0 chart labelled every resource (and reported to cluster info, which reads that standard label) as app version 1.5.1 while `helm.sh/chart` correctly said client-1.6.0 — two disagreeing version signals. Restore lockstep so the published 1.6.0 carries the right app version from the start (no stale-label 1.6.1 follow-up). No functional/workload change; helm unittest 196 pass, lint clean, render shows both labels at 1.6.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Companion to tracebloc/client-runtime#94. That PR makes
jobs-managerspawn the ingestor Job by its floating tag (ghcr.io/tracebloc/ingestor:0.3) withimagePullPolicy: Alwaysby default, with digest-pinning kept as an opt-in. This PR wires the chart to that model and removes the now-dead machinery the old pinned-digest design required.Why
The ingestor was the only spawned image pinned by a chart digest. A
helm upgradethat reset values revertedimages.ingestor.digestto the chart baseline, silently downgrading the ingestor — the failure mode behind the recurring "empty columns rejected as non-numeric" customer reports (a stale ingestor lacked the NaN-tolerance fixes). Training pods never had this problem because they spawn by floating tag +Always. This makes the ingestor match — the cluster always runs the current published ingestor, and ahelm upgradecan no longer revert it because there is no pinned digest to reset.What changed
jobs-manager-deployment.yaml— wireINGESTOR_IMAGE_REPOSITORY+INGESTOR_IMAGE_TAG(new) alongside the now-optionalINGESTOR_IMAGE_DIGEST. Empty digest (the default) → floating tag.values.yaml/values.schema.json—images.ingestor: addrepository, defaultdigest: ""(opt-in pin), keeptag: "0.3"; drop the deadautoRefreshflag. The v0.3.6 digest is retained as a comment for easy pinning. Repository + tag are overridable for air-gapped mirrors.imageRefresh: dropingestorResolveFailureThreshold(class-2 only).image-refresh-cronjob.yaml+_helpers.tpl— retire the class-2 (ingestor) pass entirely. The kubelet now resolves the ingestor digest at each spawn, so there is nothing for the CronJob to reconcile. It keeps only the class-1 pass (jobs-manager + pods-monitorrollout restart).imageRefreshEnabledno longer factors in the ingestor (the CronJob is skipped when both class-1 images are pinned).helm-ci.yaml—ingestor-multiarchnow validates the floating tag (the default spawn target) and only checks a pinned digest when one is set, instead of failing on the now-empty digest.ingestor/subchart (README + values) — update docs that described the old auto-upgrade /INGESTOR_IMAGE_DIGESTcurrency mechanism.Backward compatibility
images.ingestor.digest(or passimage_digestper-request via the subchart) →repo@digest+IfNotPresent. TheINGESTOR_IMAGE_DIGESTenv is retained as the opt-in pin.--reuse-valuesfrom older releases: all new env reads are nil-guarded.Validation (local)
helm lint --stricton all four platform values (aks / eks / bm / oc) — clean.helm template— renders the new env wiring; CronJob is class-1-only.helm unittest ./client— 167/167 pass.sh -non the rendered image-refresh script — valid;if/fi+for/donebalanced.Note
Supersedes the stopgap baseline-digest bump (v0.3.6) as the default currency mechanism — the pinned digest is now opt-in. The v0.3.6 value is preserved in
values.yamlas the documented pin baseline.🤖 Generated with Claude Code