Sync develop → main for v1.6.0 chart release #235
Merged
…ervice-suite test(charts): add helm-unittest suite for requests-proxy-service template
fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer
…225)
* fix(mysql): probe over TCP so a moved socket can't kill a healthy DB
The startup/liveness/readiness probes ran `mysqladmin ping -h localhost`,
which connects via the client-default unix socket
/var/run/mysqld/mysqld.sock. The mysql-client image actually writes its
socket to /var/lib/mysql/mysql.sock, so the probe can never reach a healthy
mysqld: the startup probe exhausts (~130s) and the kubelet kill-loops the
database, cascading jobs-manager and requests-proxy into CrashLoopBackOff.
Switch all three probes to TCP (-h 127.0.0.1, port 3306), which is immune to
where the image places the socket, and document why so it is not reverted.
Reproduced on a fresh install; cached environments were masked by the
mutable :prod tag + imagePullPolicy: IfNotPresent.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(mysql): lock probes to TCP + keep writable scratch mounts (kill-loop regression guard, backend#767)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…_PROXY_* values) (#229)
* fix(proxy): make all egress workloads proxy-aware (wire the dead HTTP_PROXY_* values)
The env.HTTP_PROXY_HOST/PORT/USERNAME/PASSWORD values were declared in
values.yaml + values.schema.json but consumed by no template, and no
workload pod received HTTP(S)_PROXY/NO_PROXY env. Behind a corporate proxy
the installer's node-level proxy (scripts/lib/cluster.sh) handles image
pulls, but the application pods make direct external calls
(jobs-manager -> api.tracebloc.io) that the network refuses, which ends in
CrashLoopBackOff.
Add a tracebloc.proxyEnv helper that derives HTTP(S)_PROXY + an
auto-augmented NO_PROXY from the env.HTTP_PROXY_* values. Cluster-internal
ranges are always included, mirroring cluster.sh, so in-cluster + MySQL
traffic never traverses the proxy. Reference the helper on every
external-egress workload:
- jobs-manager (api + pods-monitor)
- requests-proxy
- image-refresh CronJob
- auto-upgrade CronJob
Renders nothing when no proxy is set, so non-proxy installs are unchanged.
Excludes mysql-client and resource-monitor (no external egress) and the
ingestor sub-chart (talks only to jobs-manager.<ns>.svc, in-cluster).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(proxy): assert every egress workload gets proxy env when configured, none when not (backend#768)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
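The NO_PROXY augmentation described in the commit above can be sketched in POSIX shell. This is an illustration of the merge idea only, not the chart's Helm helper: the function name `merge_no_proxy` and the contents of `CLUSTER_NO_PROXY` are assumptions (the PR names `.svc` and `10.0.0.0/8` and elides the rest).

```shell
#!/bin/sh
# Assumed cluster-internal defaults; the real list lives in the chart helper
# and scripts/lib/cluster.sh and is not shown in this PR.
CLUSTER_NO_PROXY="localhost,127.0.0.1,.svc,.cluster.local,10.0.0.0/8"

# merge_no_proxy USER_LIST
# Prints USER_LIST followed by every cluster-internal entry it does not
# already contain, comma-separated, keeping order and dropping duplicates.
merge_no_proxy() (
  set -f      # no globbing: entries such as *.corp.example stay literal
  IFS=','     # split both lists on commas (scoped to this subshell)
  merged=""
  for entry in $1 $CLUSTER_NO_PROXY; do
    [ -n "$entry" ] || continue               # skip empty fields ("a,,b")
    case ",$merged," in
      *",$entry,"*) ;;                        # already present, skip
      *) merged="${merged:+$merged,}$entry" ;;
    esac
  done
  printf '%s\n' "$merged"
)
```

For example, `merge_no_proxy "corp.example,.svc"` keeps the user's entries first and appends only the cluster-internal entries that are missing, so a user-supplied list can never remove the in-cluster exclusions.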
…hinery (companion to client-runtime#94) (#231)
* fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery
Companion to client-runtime#94. That PR makes jobs-manager spawn the
ingestor Job by its floating tag with imagePullPolicy=Always by default
(digest pinning becomes opt-in). This wires the chart to that model and
removes the now-dead machinery the old pinned-digest design needed.
Root problem: the ingestor was the only spawned image pinned by a chart
digest. A `helm upgrade` that reset values reverted images.ingestor.digest
to the chart baseline, silently downgrading the ingestor. That is the
failure mode behind the recurring "empty columns rejected as non-numeric"
customer reports. Training pods never had this because they spawn by
floating tag + Always; this makes the ingestor match.
jobs-manager-deployment.yaml:
  * Wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the
    now-optional INGESTOR_IMAGE_DIGEST. Empty digest (default) -> floating tag.
values.yaml / values.schema.json:
  * images.ingestor: add `repository`, default `digest: ""` (opt-in pin),
    keep `tag: "0.3"`; drop the dead `autoRefresh` flag. v0.3.6 digest kept
    as a comment for easy pinning. Repository overridable for air-gap mirror.
  * imageRefresh: drop `ingestorResolveFailureThreshold` (class-2 only).
image-refresh-cronjob.yaml + _helpers.tpl:
  * Retire the class-2 (ingestor) pass entirely: the kubelet now resolves
    the ingestor digest at each spawn, so there is nothing to reconcile. The
    CronJob keeps only the class-1 pass (jobs-manager + pods-monitor
    rollout-restart). imageRefreshEnabled no longer factors in the ingestor.
helm-ci.yaml:
  * ingestor-multiarch now validates the floating TAG (the default spawn
    target) and only checks a pinned digest when one is set, instead of
    failing on the now-empty digest.
ingestor subchart (README + values): update docs that described the old
auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.
Validated locally: helm lint --strict (aks/eks/bm/oc), helm template,
helm unittest (167/167), sh -n on the rendered refresh script.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(chart): bump 1.5.1 -> 1.6.0 + helm-unittest for ingestor env wiring
Minor version bump for the floating-tag ingestor feature. Adds three
jobs-manager deployment unittests covering the new ingestor image wiring:
floating-tag default (INGESTOR_IMAGE_REPOSITORY/TAG set, DIGEST empty),
opt-in digest pin, and repository+tag override for air-gapped mirrors.
helm unittest ./client: 170/170 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
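The "empty digest means floating tag" rule from the commit above can be sketched as a small shell function. This is an illustration under stated assumptions, not jobs-manager's actual code: the function name `ingestor_image_ref` and the repository `registry.example/ingestor` are hypothetical; only the three `INGESTOR_IMAGE_*` inputs and the `0.3` tag come from the PR.

```shell
#!/bin/sh
# ingestor_image_ref REPOSITORY TAG [DIGEST]
# Prints the image reference a spawner could use: REPOSITORY@DIGEST when a
# digest is set (opt-in pin), otherwise REPOSITORY:TAG (floating tag, default).
ingestor_image_ref() (
  repo=$1
  tag=$2
  digest=$3
  if [ -n "$digest" ]; then
    printf '%s@%s\n' "$repo" "$digest"
  else
    printf '%s:%s\n' "$repo" "$tag"
  fi
)
```

With an empty digest, `ingestor_image_ref registry.example/ingestor 0.3 ""` yields a tag reference, so a values reset during `helm upgrade` falls back to the floating tag rather than to an old pinned digest.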
* feat(installer): self-verify CLI usability post-install with a shell-correct PATH fix (#738)
Step 5 installed the tracebloc CLI and then told the user "open a new
terminal so it's on your PATH" — without ever proving a fresh terminal
would actually find it. That is exactly the cli#61 failure mode (binary
lands in ~/.local/bin, which a brand-new shell doesn't have on PATH),
left undetected until the customer hits it. The installer is the last
place to catch it.
After the install attempt, self-verify and report precisely:
- Probe `command -v tracebloc` in BOTH a fresh login shell ("$SHELL" -lic)
and a non-login shell ("$SHELL" -ic) — they read different startup files
(~/.profile vs ~/.bashrc), and cli#61 was "works in my login shell,
missing in a plain `bash` subshell".
- If found: confirm via `tracebloc version` and print a VERIFIED verdict.
The canonical `tracebloc dataset push ./data` next step stays in the
summary's "What to do next" — not duplicated here.
- If a fresh shell would NOT find it: print the EXACT shell-correct fix
for the user's actual $SHELL (zsh→~/.zshrc, bash+linux→~/.bashrc,
bash+darwin→~/.bash_profile, fish→fish_add_path + ~/.config/fish/config.fish,
else ~/.profile), not a generic "open a new terminal".
Stays NON-FATAL by design: the client is already connected by Step 5, so
the verification always returns 0 and is hardened against the orchestrator's
`set -e`. Mirrored in install-k8s.ps1 (RefreshPath is the faithful
"fresh terminal" probe on Windows, since the CLI installer edits the
user-scope registry PATH).
Tests: extend install-cli.bats (verified-command success, actionable
shell-correct PATH hint on miss, fish-specific fix, non-fatal under
`set -e`) and mirror in install-k8s.Tests.ps1 (Test-TraceblocCli:
verified verdict, actionable hint, non-fatal when RefreshPath throws).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
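The shell-to-rc-file mapping listed in the commit above can be sketched as follows. The function name `rc_file_for` is hypothetical, and treating every non-Darwin OS as the bash+linux case is an assumption of this sketch; the installer's real implementation is not shown in this PR.

```shell
#!/bin/sh
# rc_file_for SHELL_PATH OS_NAME
# Prints the startup file that a PATH fix should target for the given shell.
# OS_NAME is expected to look like the output of `uname -s`.
rc_file_for() (
  shell_name=${1##*/}     # strip the directory: /bin/zsh -> zsh
  os_name=$2
  case "$shell_name" in
    zsh)  echo "$HOME/.zshrc" ;;
    bash)
      case "$os_name" in
        Darwin) echo "$HOME/.bash_profile" ;;
        *)      echo "$HOME/.bashrc" ;;   # assumption: non-Darwin acts like Linux
      esac ;;
    fish) echo "$HOME/.config/fish/config.fish" ;;
    *)    echo "$HOME/.profile" ;;
  esac
)
```

A caller would typically pass the user's real environment, for example `rc_file_for "$SHELL" "$(uname -s)"`.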
* fix(installer): make the #738 Windows CLI-verify Pester-safe on Linux CI
Two follow-ups so the Pester jobs go green (they were the only red checks
on #215; Pester is green on develop, so this PR introduced both):
- install-k8s.ps1: the new $TRACEBLOC_CLI_INSTALL_DIR ran Join-Path on
$env:LOCALAPPDATA at top level. The Pester suite dot-sources this script,
and on the Linux runner $env:LOCALAPPDATA is null — Join-Path throws on a
null -Path, aborting BeforeAll and failing the whole container (0/65).
Guard it; "" placeholder off Windows since the value is only used there.
- install-k8s.Tests.ps1: add a `function tracebloc { }` stub so
`Mock tracebloc` can bind. Pester v5 only mocks commands that already
exist (cf. the existing kubectl/docker/helm/k3d stubs); without it the
"fresh-shell success" test threw CommandNotFoundException — the lone
windows-latest failure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(installer): make the #738 PATH-fix guidance actually persist
Review follow-up on #215 (comment #2). The miss-branch hint printed a bare
`export PATH=…` (fixes only the current shell) followed by `source ${rc}` on
an rc that did not yet contain the line — so neither command persisted the
fix, while the closing note implied ${rc} should hold it. The user is never
told to write the line into the rc. Rewrite the guidance per-shell:
- POSIX shells (zsh / bash / sh / dash): `echo '<export>' >> ${rc}` then
`source ${rc}` — one copy-pasteable step that fixes THIS terminal and every
new one.
- fish: `fish_add_path "…"` already persists (a universal var) AND applies to
the running shell, so drop the misleading `source ~/.config/fish/config.fish`.
Tests (install-cli.bats): the zsh miss-path now asserts the `echo … >> ~/.zshrc`
form; the fish case asserts no POSIX `export` and no `source`. bats 8/8 pass,
shellcheck --severity=error gate clean, bash -n clean.
NOTE: review comment #1 (fish fresh-shell probe using `command -v`, which fish's
`command` builtin lacks) is NOT addressed here — it needs verification on a real
fish and is tracked separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
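The persistent-fix guidance from the commit above can be sketched as a function that prints the commands for the user to copy. The name `path_fix_hint` and the exact wording of the printed lines are assumptions; the PR only states the shape (`echo '<export>' >> ${rc}` then `source ${rc}` for POSIX shells, `fish_add_path` alone for fish).

```shell
#!/bin/sh
# path_fix_hint SHELL_NAME RC_FILE BIN_DIR
# Prints copy-pasteable commands that persist BIN_DIR on PATH.
path_fix_hint() (
  shell_name=$1
  rc=$2
  bin_dir=$3
  case "$shell_name" in
    fish)
      # fish_add_path persists via a universal variable and also updates the
      # running shell, so no `source` line is printed.
      printf 'fish_add_path "%s"\n' "$bin_dir" ;;
    *)
      # Append the export to the rc file first, then reload it, so the fix
      # applies to this terminal and to every new one.
      printf "echo 'export PATH=\"%s:\$PATH\"' >> %s\n" "$bin_dir" "$rc"
      printf 'source %s\n' "$rc" ;;
  esac
)
```

The key design point is ordering: the `source` line only helps once the rc file actually contains the export, which is the gap the review comment identified.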
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
…232) (#233)
The Amazon Linux and AlmaLinux/Rocky branches of install_docker_engine
gated on `[[ -f /etc/os-release ]] && grep ... /etc/os-release`. The bats
suite mocks `grep`, but a bash builtin file-test can't be intercepted by a
function mock. On macOS (no /etc/os-release) both branches therefore
short-circuited, and the two install_docker_engine distro tests fell
through to the get.docker.com path and failed. CI was green because
unit-bash runs on ubuntu.
Read the os-release path from `${TB_OS_RELEASE_FILE:-/etc/os-release}` for
both the file-test and the grep. The bats setup() now writes a real
os-release fixture per $TEST_DISTRO to a temp file and exports
TB_OS_RELEASE_FILE, so distro detection runs for real (real `[[ -f ]]` +
real `grep`) on every dev host. Behaviour is unchanged when the env var is
unset (production / Linux CI). All 20 tests in
scripts/tests/setup-linux.bats now pass on macOS.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
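The overridable os-release path described above can be sketched like this. `TB_OS_RELEASE_FILE` and its `/etc/os-release` default come from the commit message; the helper names `os_release_file` and `is_distro` and the grep patterns in the usage note are assumptions, since the original `grep ...` arguments are elided.

```shell
#!/bin/sh
# os_release_file: path to read; tests override it, production leaves
# TB_OS_RELEASE_FILE unset and gets /etc/os-release.
os_release_file() {
  printf '%s\n' "${TB_OS_RELEASE_FILE:-/etc/os-release}"
}

# is_distro PATTERN
# Succeeds when the os-release file exists and a line matches PATTERN
# (extended regex, case-insensitive). Both the file test and the grep use
# the same overridable path, so neither needs to be mocked.
is_distro() (
  file=$(os_release_file)
  [ -f "$file" ] && grep -qiE "$1" "$file"
)
```

A test can then point `TB_OS_RELEASE_FILE` at a temp fixture containing, say, `ID="amzn"` and call `is_distro '^ID="?amzn'`, exercising the real file test and the real `grep` on any host.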
aptracebloc
previously approved these changes
Jun 9, 2026
…passthrough
jobs-manager's api and pods-monitor containers include tracebloc.proxyEnv
(merged, cluster-safe NO_PROXY) AND a generic .Values.env passthrough. The
passthrough re-emitted a user-set NO_PROXY UNMERGED after proxyEnv; k8s
keeps the last duplicate, so the unmerged copy won. That dropped the
cluster-internal entries (.svc, 10.0.0.0/8, ...) and routed in-cluster
traffic through the proxy.
Exclude proxy-owned keys (via a shared $proxyKeys list) from both
passthrough loops so proxyEnv is the sole source. Adds a proxy_env_test
regression (custom NO_PROXY + proxy -> single merged NO_PROXY, no unmerged
copy) for both jobs-manager containers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
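The exclusion of proxy-owned keys from the generic passthrough can be sketched in shell. This mirrors the idea only: the real fix lives in Helm templates, and both the function name `filter_passthrough` and the contents of `PROXY_KEYS` are assumptions (the PR mentions a shared `$proxyKeys` list without showing it).

```shell
#!/bin/sh
# Assumed set of keys owned by the proxy helper.
PROXY_KEYS="HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy"

# filter_passthrough
# Reads KEY=VALUE lines on stdin and prints only those whose key is not
# proxy-owned, so the proxy helper stays the single source of those vars.
filter_passthrough() {
  while IFS='=' read -r key value; do
    case " $PROXY_KEYS " in
      *" $key "*) ;;                           # proxy-owned: drop it
      *) printf '%s=%s\n' "$key" "$value" ;;   # everything else passes through
    esac
  done
}
```

Piping `LOG_LEVEL=info`, `NO_PROXY=corp.example` and `TZ=UTC` through the filter leaves only the first and last entries, so no second, unmerged `NO_PROXY` can follow the merged one.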
fix(#229): dedupe NO_PROXY — exclude proxy keys from generic env passthrough
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit dce8692. Configure here.
LukasWodka
approved these changes
Jun 9, 2026
divyasinghds
approved these changes
Jun 9, 2026
This was referenced Jun 9, 2026
Sync develop -> main to cut the v1.6.0 client chart release. Tracks #234.
`main` is 6 commits behind `develop`: exactly the changes validated this cycle.
Validated on dev (full e2e) and on a prod canary (`hasan-prod`, healthy on 1.6.0).
After merge, publishing GitHub Release v1.6.0 triggers `release-helm-chart.yaml` (packages client + ingestor, pushes to gh-pages). NOTE: this flips fleet-wide auto-upgrade. Client-chart releases with auto-upgrade ON converge to 1.6.0 at their next :23 cronjob.
🤖 Generated with Claude Code
Note
Medium Risk
Fleet auto-upgrade to 1.6.0 changes ingestion image resolution and GPU pending behavior on managed clusters; proxy NO_PROXY merging affects egress if HTTP_PROXY_HOST is set—validated on dev and a prod canary per release notes.
Overview
v1.6.0 release sync: bumps the client chart to 1.6.0 and bundles several operational fixes.
Ingestor image model shifts to floating-tag spawn by default (`images.ingestor.digest` empty): jobs-manager gets `INGESTOR_IMAGE_REPOSITORY` / `INGESTOR_IMAGE_TAG` / optional digest so each ingestion Job pulls the current tag (`Always`), avoiding `helm upgrade` reverting a stale pinned digest. The image-refresh CronJob drops the ingestor (class-2) pass and only rolls jobs-manager + pods-monitor; CI now asserts the floating tag (and optional pin) is multi-arch.
GPU scheduling: new `env.SINGLE_NODE` (defaults from `hostPath.enabled`) gates jobs-manager's GPU→pending→CPU fallback so EKS/AKS/OpenShift leave Pending GPU pods for autoscaling; installer k3d writes `SINGLE_NODE: "true"`.
Corporate proxy: new `tracebloc.proxyEnv` injects HTTP(S)_PROXY and merged `NO_PROXY` into jobs-manager, requests-proxy, and upgrade/refresh CronJobs, with proxy keys excluded from the generic `.Values.env` loop so in-cluster traffic isn't proxied.
MySQL startup/liveness/readiness probes use TCP (`mysqladmin ping -h 127.0.0.1`) instead of a unix socket path mismatch.
Installers add post-install CLI PATH verification (bash/PowerShell) with shell-specific fix hints. Docs/tests/schema updated for the above.