Sync develop → main for v1.6.0 chart release by saadqbal · Pull Request #235 · tracebloc/client

saadqbal · 2026-06-09T08:08:34Z

Sync develop -> main to cut the v1.6.0 client chart release. Tracks #234.

main is 6 commits behind develop — exactly the changes validated this cycle:

fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer #224 (chore: revert default CODEOWNERS (back to author-picks-reviewer) #92) single-node GPU fallback
fix(mysql): probe over TCP so a moved socket can't kill a healthy DB #225 mysql probe over TCP
fix(proxy): make all egress workloads proxy-aware (wire the dead HTTP_PROXY_* values) #229 proxy-aware egress
fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery (companion to client-runtime#94) #231 ingestor floating-tag spawn
#738/Post-install self-verification of CLI usability (#738) #215 CLI self-verify
test(charts): add helm-unittest suite for requests-proxy-service template #223 requests-proxy service test coverage

Validated on dev (full e2e) and on a prod canary (hasan-prod, healthy on 1.6.0).

After merge, publishing GitHub Release v1.6.0 triggers release-helm-chart.yaml (packages client + ingestor, pushes to gh-pages). NOTE: this flips fleet-wide auto-upgrade — client-chart releases with auto-upgrade ON converge to 1.6.0 at their next :23 cronjob.

🤖 Generated with Claude Code

Note

Medium Risk
Fleet auto-upgrade to 1.6.0 changes ingestion image resolution and GPU pending behavior on managed clusters; proxy NO_PROXY merging affects egress if HTTP_PROXY_HOST is set—validated on dev and a prod canary per release notes.

Overview
v1.6.0 release sync: bumps the client chart to 1.6.0 and bundles several operational fixes.

Ingestor image model shifts to floating-tag spawn by default (images.ingestor.digest empty): jobs-manager gets INGESTOR_IMAGE_REPOSITORY / INGESTOR_IMAGE_TAG / optional digest so each ingestion Job pulls the current tag (Always), avoiding helm upgrade reverting a stale pinned digest. The image-refresh CronJob drops the ingestor (class-2) pass and only rolls jobs-manager + pods-monitor; CI now asserts the floating tag (and optional pin) is multi-arch.
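The floating-tag versus opt-in-pin choice described above can be sketched as a small function. This is an illustration only, not the jobs-manager code: the function name and the registry host are hypothetical, and only the `0.3` tag and the "empty digest means floating tag" rule come from this PR.

```python
def ingestor_image_ref(repository: str, tag: str, digest: str = "") -> str:
    """Build the image reference for a spawned ingestion Job (illustrative).

    An empty digest (the new default) yields a floating-tag reference, which
    the PR pairs with imagePullPolicy=Always. A non-empty digest is the
    opt-in pin and yields an immutable digest reference.
    """
    if digest:
        return f"{repository}@{digest}"
    return f"{repository}:{tag}"


# Hypothetical repository; "0.3" is the tag kept in values.yaml per this PR.
print(ingestor_image_ref("registry.example/ingestor", "0.3"))
print(ingestor_image_ref("registry.example/ingestor", "0.3", "sha256:abc123"))
```

With the default empty digest, a `helm upgrade` that resets values has no pinned digest left to revert to, which is the failure mode this change is meant to remove.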

GPU scheduling: new env.SINGLE_NODE (defaults from hostPath.enabled) gates jobs-manager’s GPU→pending→CPU fallback so EKS/AKS/OpenShift leave Pending GPU pods for autoscaling; installer k3d writes SINGLE_NODE: "true".
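A rough sketch of the gate, assuming SINGLE_NODE is "true" exactly when hostPath.enabled is true and that the fallback only matters once a GPU pod is Pending. The function names and the action strings are hypothetical and exist only for illustration; the real logic lives in jobs-manager.

```python
def single_node_default(host_path_enabled: bool) -> str:
    """Chart-side default (assumed mapping): SINGLE_NODE follows hostPath.enabled."""
    return "true" if host_path_enabled else "false"


def next_action(single_node: bool, gpu_pod_pending: bool) -> str:
    """Decide what to do with a GPU pod (illustrative action names)."""
    if not gpu_pod_pending:
        return "run-on-gpu"
    if single_node:
        # No autoscaler will add a GPU node, so fall back to CPU.
        return "fallback-to-cpu"
    # Multi-node clusters (EKS/AKS/OpenShift): leave the pod Pending
    # so the autoscaler can provision a GPU node.
    return "leave-pending"


print(single_node_default(True), next_action(True, True))
print(single_node_default(False), next_action(False, True))
```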

Corporate proxy: new tracebloc.proxyEnv injects HTTP(S)_PROXY and merged NO_PROXY into jobs-manager, requests-proxy, and upgrade/refresh CronJobs, with proxy keys excluded from the generic .Values.env loop so in-cluster traffic isn’t proxied.

MySQL startup/liveness/readiness probes use TCP (mysqladmin ping -h 127.0.0.1) instead of the unix socket, whose default path did not match where the image writes its socket.

Installers add post-install CLI PATH verification (bash/PowerShell) with shell-specific fix hints. Docs/tests/schema updated for the above.

Reviewed by Cursor Bugbot for commit dce8692. Bugbot is set up for automated code reviews on this repo.

…ervice-suite test(charts): add helm-unittest suite for requests-proxy-service template

fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer

…225)

* fix(mysql): probe over TCP so a moved socket can't kill a healthy DB

The startup/liveness/readiness probes ran `mysqladmin ping -h localhost`, which connects via the client-default unix socket /var/run/mysqld/mysqld.sock. The mysql-client image actually writes its socket to /var/lib/mysql/mysql.sock, so the probe can never reach a healthy mysqld: the startup probe exhausts (~130s) and the kubelet kill-loops the database, cascading jobs-manager and requests-proxy into CrashLoopBackOff.

Switch all three probes to TCP (-h 127.0.0.1, port 3306), which is immune to where the image places the socket, and document why so it is not reverted.

Reproduced on a fresh install; cached environments were masked by the mutable :prod tag + imagePullPolicy: IfNotPresent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(mysql): lock probes to TCP + keep writable scratch mounts (kill-loop regression guard, backend#767)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
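To illustrate why a TCP check does not care where the socket file lives, here is a minimal reachability probe. It is a plain TCP connect, a simpler technique than the `mysqladmin ping` the chart actually runs, and the function name is hypothetical.

```python
import socket


def tcp_probe(host: str = "127.0.0.1", port: int = 3306, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Unlike a unix-socket probe, the result depends only on the listening
    port, not on a filesystem path that an image may relocate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("mysqld reachable over TCP:", tcp_probe())
```

Note that a bare connect only shows the port is open; `mysqladmin ping` additionally confirms the server answers the MySQL protocol.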

…_PROXY_* values) (#229)

* fix(proxy): make all egress workloads proxy-aware (wire the dead HTTP_PROXY_* values)

The env.HTTP_PROXY_HOST/PORT/USERNAME/PASSWORD values were declared in values.yaml + values.schema.json but consumed by no template, and no workload pod received HTTP(S)_PROXY/NO_PROXY env. Behind a corporate proxy the installer's node-level proxy (scripts/lib/cluster.sh) handles image pulls, but the application pods make direct external calls (jobs-manager -> api.tracebloc.io) the network refuses -> CrashLoopBackOff.

Add a tracebloc.proxyEnv helper that derives HTTP(S)_PROXY + an auto-augmented NO_PROXY (cluster-internal ranges always included, mirroring cluster.sh, so in-cluster + MySQL traffic never traverses the proxy) from the env.HTTP_PROXY_* values, and reference it on every external-egress workload: jobs-manager (api + pods-monitor), requests-proxy, image-refresh CronJob, auto-upgrade CronJob. Renders nothing when no proxy is set, so non-proxy installs are unchanged.

Excludes mysql-client and resource-monitor (no external egress) and the ingestor sub-chart (talks only to jobs-manager.<ns>.svc, in-cluster).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(proxy): assert every egress workload gets proxy env when configured, none when not (backend#768)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
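The derivation the helper performs can be approximated as follows. This is a sketch under assumptions, not the Helm template: the `http://` scheme, the URL-encoding of credentials, and the function name are mine, and the cluster-internal list holds only the two entries this PR names (`.svc`, `10.0.0.0/8`); the chart's full list is not reproduced here.

```python
from urllib.parse import quote

# Only these two entries are named in the PR; the real list is longer.
CLUSTER_NO_PROXY = [".svc", "10.0.0.0/8"]


def proxy_env(host="", port="", username="", password="", user_no_proxy=""):
    """Derive HTTP(S)_PROXY and a merged NO_PROXY from HTTP_PROXY_* style values."""
    if not host:
        # Mirrors "renders nothing when no proxy is set".
        return {}
    auth = ""
    if username:
        auth = quote(username, safe="")
        if password:
            auth += ":" + quote(password, safe="")
        auth += "@"
    url = f"http://{auth}{host}:{port}" if port else f"http://{auth}{host}"
    merged = []
    for entry in user_no_proxy.split(",") + CLUSTER_NO_PROXY:
        entry = entry.strip()
        if entry and entry not in merged:
            merged.append(entry)  # keep order, drop duplicates
    return {"HTTP_PROXY": url, "HTTPS_PROXY": url, "NO_PROXY": ",".join(merged)}


print(proxy_env("proxy.corp.example", "3128", user_no_proxy="internal.example, .svc"))
```

The point of the merge is that a user-supplied NO_PROXY can add entries but can never remove the cluster-internal ones.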

…hinery (companion to client-runtime#94) (#231)

* fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery

Companion to client-runtime#94. That PR makes jobs-manager spawn the ingestor Job by its floating tag with imagePullPolicy=Always by default (digest pinning becomes opt-in). This wires the chart to that model and removes the now-dead machinery the old pinned-digest design needed.

Root problem: the ingestor was the only spawned image pinned by a chart digest. A `helm upgrade` that reset values reverted images.ingestor.digest to the chart baseline, silently downgrading the ingestor, the failure mode behind the recurring "empty columns rejected as non-numeric" customer reports. Training pods never had this because they spawn by floating tag + Always; this makes the ingestor match.

jobs-manager-deployment.yaml:
  * Wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the now-optional INGESTOR_IMAGE_DIGEST. Empty digest (default) -> floating tag.

values.yaml / values.schema.json:
  * images.ingestor: add `repository`, default `digest: ""` (opt-in pin), keep `tag: "0.3"`; drop the dead `autoRefresh` flag. v0.3.6 digest kept as a comment for easy pinning. Repository overridable for air-gap mirror.
  * imageRefresh: drop `ingestorResolveFailureThreshold` (class-2 only).

image-refresh-cronjob.yaml + _helpers.tpl:
  * Retire the class-2 (ingestor) pass entirely: the kubelet now resolves the ingestor digest at each spawn, so there is nothing to reconcile. The CronJob keeps only the class-1 pass (jobs-manager + pods-monitor rollout-restart). imageRefreshEnabled no longer factors in the ingestor.

helm-ci.yaml:
  * ingestor-multiarch now validates the floating TAG (the default spawn target) and only checks a pinned digest when one is set, instead of failing on the now-empty digest.

ingestor subchart (README + values): update docs that described the old auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.

Validated locally: helm lint --strict (aks/eks/bm/oc), helm template, helm unittest (167/167), sh -n on the rendered refresh script.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(chart): bump 1.5.1 -> 1.6.0 + helm-unittest for ingestor env wiring

Minor version bump for the floating-tag ingestor feature. Adds three jobs-manager deployment unittests covering the new ingestor image wiring: floating-tag default (INGESTOR_IMAGE_REPOSITORY/TAG set, DIGEST empty), opt-in digest pin, and repository+tag override for air-gapped mirrors. helm unittest ./client: 170/170 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): self-verify CLI usability post-install with a shell-correct PATH fix (#738)

Step 5 installed the tracebloc CLI and then told the user "open a new terminal so it's on your PATH", without ever proving a fresh terminal would actually find it. That is exactly the cli#61 failure mode (binary lands in ~/.local/bin, which a brand-new shell doesn't have on PATH), left undetected until the customer hits it. The installer is the last place to catch it.

After the install attempt, self-verify and report precisely:

- Probe `command -v tracebloc` in BOTH a fresh login shell ("$SHELL" -lic) and a non-login shell ("$SHELL" -ic); they read different startup files (~/.profile vs ~/.bashrc), and cli#61 was "works in my login shell, missing in a plain `bash` subshell".
- If found: confirm via `tracebloc version` and print a VERIFIED verdict. The canonical `tracebloc dataset push ./data` next step stays in the summary's "What to do next", not duplicated here.
- If a fresh shell would NOT find it: print the EXACT shell-correct fix for the user's actual $SHELL (zsh→~/.zshrc, bash+linux→~/.bashrc, bash+darwin→~/.bash_profile, fish→fish_add_path + ~/.config/fish/config.fish, else ~/.profile), not a generic "open a new terminal".

Stays NON-FATAL by design: the client is already connected by Step 5, so the verification always returns 0 and is hardened against the orchestrator's `set -e`.

Mirrored in install-k8s.ps1 (RefreshPath is the faithful "fresh terminal" probe on Windows, since the CLI installer edits the user-scope registry PATH).

Tests: extend install-cli.bats (verified-command success, actionable shell-correct PATH hint on miss, fish-specific fix, non-fatal under `set -e`) and mirror in install-k8s.Tests.ps1 (Test-TraceblocCli: verified verdict, actionable hint, non-fatal when RefreshPath throws).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): make the #738 Windows CLI-verify Pester-safe on Linux CI

Two follow-ups so the Pester jobs go green (they were the only red checks on #215; Pester is green on develop, so this PR introduced both):

- install-k8s.ps1: the new $TRACEBLOC_CLI_INSTALL_DIR ran Join-Path on $env:LOCALAPPDATA at top level. The Pester suite dot-sources this script, and on the Linux runner $env:LOCALAPPDATA is null: Join-Path throws on a null -Path, aborting BeforeAll and failing the whole container (0/65). Guard it; "" placeholder off Windows since the value is only used there.
- install-k8s.Tests.ps1: add a `function tracebloc { }` stub so `Mock tracebloc` can bind. Pester v5 only mocks commands that already exist (cf. the existing kubectl/docker/helm/k3d stubs); without it the "fresh-shell success" test threw CommandNotFoundException, the lone windows-latest failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): make the #738 PATH-fix guidance actually persist

Review follow-up on #215 (comment #2). The miss-branch hint printed a bare `export PATH=…` (fixes only the current shell) followed by `source ${rc}` on an rc that did not yet contain the line, so neither command persisted the fix, while the closing note implied ${rc} should hold it. The user is never told to write the line into the rc.

Rewrite the guidance per-shell:

- POSIX shells (zsh / bash / sh / dash): `echo '<export>' >> ${rc}` then `source ${rc}`: one copy-pasteable step that fixes THIS terminal and every new one.
- fish: `fish_add_path "…"` already persists (a universal var) AND applies to the running shell, so drop the misleading `source ~/.config/fish/config.fish`.

Tests (install-cli.bats): the zsh miss-path now asserts the `echo … >> ~/.zshrc` form; the fish case asserts no POSIX `export` and no `source`. bats 8/8 pass, shellcheck --severity=error gate clean, bash -n clean.

NOTE: review comment #1 (fish fresh-shell probe using `command -v`, which fish's `command` builtin lacks) is NOT addressed here; it needs verification on a real fish and is tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
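The per-shell mapping and the persisted fix described in these commits can be summarised in a short sketch. The installer itself is bash and PowerShell; this Python version only models the decision table. The function names are hypothetical, and the default install directory is taken from the ~/.local/bin location mentioned for cli#61.

```python
def rc_file(shell: str, os_name: str) -> str:
    """Map the user's shell (and OS, for bash) to its startup file."""
    if shell == "zsh":
        return "~/.zshrc"
    if shell == "bash" and os_name == "linux":
        return "~/.bashrc"
    if shell == "bash" and os_name == "darwin":
        return "~/.bash_profile"
    if shell == "fish":
        return "~/.config/fish/config.fish"
    return "~/.profile"


def path_fix_hint(shell: str, os_name: str, install_dir: str = "$HOME/.local/bin"):
    """Return the copy-pasteable commands that persist the PATH fix."""
    if shell == "fish":
        # fish_add_path persists and applies to the running shell,
        # so no export and no source line is needed.
        return [f'fish_add_path "{install_dir}"']
    rc = rc_file(shell, os_name)
    export_line = f'export PATH="{install_dir}:$PATH"'
    # Append to the rc first, then source it, so the fix persists.
    return [f"echo '{export_line}' >> {rc}", f"source {rc}"]


for line in path_fix_hint("zsh", "darwin"):
    print(line)
```

Appending before sourcing is the ordering the third commit restores: sourcing an rc that does not yet contain the line changes nothing.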

…232) (#233)

The Amazon Linux and AlmaLinux/Rocky branches of install_docker_engine gated on `[[ -f /etc/os-release ]] && grep ... /etc/os-release`. The bats suite mocks `grep`, but a bash builtin file-test can't be intercepted by a function mock, so on macOS (no /etc/os-release) both branches short-circuited and the two install_docker_engine distro tests fell through to the get.docker.com path and failed. CI was green because unit-bash runs on ubuntu.

Read the os-release path from `${TB_OS_RELEASE_FILE:-/etc/os-release}` for both the file-test and the grep. The bats setup() now writes a real os-release fixture per $TEST_DISTRO to a temp file and exports TB_OS_RELEASE_FILE, so distro detection runs for real (real `[[ -f ]]` + real `grep`) on every dev host. Behaviour is unchanged when the env var is unset (production / Linux CI).

All 20 tests in scripts/tests/setup-linux.bats now pass on macOS.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
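The same test seam, an overridable file path with a production default, looks like this outside bash. This is an analogy rather than the installer's code: the real script greps for a pattern that is elided in the commit message, so parsing the `ID=` field here is an assumption, as are the function names.

```python
import os


def os_release_path() -> str:
    """Equivalent of ${TB_OS_RELEASE_FILE:-/etc/os-release}: unset or empty falls back."""
    return os.environ.get("TB_OS_RELEASE_FILE") or "/etc/os-release"


def detect_distro_id() -> str:
    """Return the ID field of the os-release file, or "" if the file is missing."""
    path = os_release_path()
    if not os.path.isfile(path):
        return ""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("ID="):
                return line.strip().split("=", 1)[1].strip('"')
    return ""


print(detect_distro_id() or "(no os-release file found)")
```

Because both the existence check and the read go through one path, a test can point the variable at a fixture file and exercise the real detection logic on any host.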

…passthrough

jobs-manager's api and pods-monitor containers include tracebloc.proxyEnv (merged, cluster-safe NO_PROXY) AND a generic .Values.env passthrough. The passthrough re-emitted a user-set NO_PROXY UNMERGED after proxyEnv; k8s keeps the last duplicate, so the unmerged copy won, dropping the cluster-internal entries (.svc, 10.0.0.0/8, ...) and routing in-cluster traffic through the proxy.

Exclude proxy-owned keys (via a shared $proxyKeys list) from both passthrough loops so proxyEnv is the sole source.

Adds a proxy_env_test regression (custom NO_PROXY + proxy -> single merged NO_PROXY, no unmerged copy) for both jobs-manager containers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
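The bug and the fix can be reproduced in a few lines by modelling the rendered env list. This is a simplified model, not the template: the contents of `PROXY_KEYS` are my guess at what `$proxyKeys` holds, the function names are hypothetical, and "last duplicate wins" is taken from the commit message above.

```python
# Assumed contents of the chart's $proxyKeys list.
PROXY_KEYS = {"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}


def render_env(proxy_env: dict, values_env: dict, exclude_proxy_keys: bool) -> list:
    """Emit proxyEnv first, then the generic .Values.env passthrough."""
    rendered = [{"name": k, "value": v} for k, v in proxy_env.items()]
    for key, value in values_env.items():
        if exclude_proxy_keys and key in PROXY_KEYS:
            continue  # proxyEnv is the sole source for these keys
        rendered.append({"name": key, "value": value})
    return rendered


def effective_env(env_list: list) -> dict:
    """Collapse duplicates the way the commit describes: the last entry wins."""
    result = {}
    for item in env_list:
        result[item["name"]] = item["value"]
    return result


merged = {"NO_PROXY": "internal.example,.svc,10.0.0.0/8"}
user = {"NO_PROXY": "internal.example", "LOG_LEVEL": "info"}

print(effective_env(render_env(merged, user, exclude_proxy_keys=False))["NO_PROXY"])
print(effective_env(render_env(merged, user, exclude_proxy_keys=True))["NO_PROXY"])
```

Without the exclusion the user's unmerged value overrides the merged one; with it, only the merged NO_PROXY is rendered and unrelated keys still pass through.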

fix(#229): dedupe NO_PROXY — exclude proxy keys from generic env passthrough

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


saadqbal and others added 7 commits June 9, 2026 11:21

Merge pull request #223 from tracebloc/auto-coverage/requests-proxy-s…

6a8ce92

…ervice-suite test(charts): add helm-unittest suite for requests-proxy-service template

Merge pull request #224 from tracebloc/feat/92-single-node-gpu-fallback

81f4f80

fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer

saadqbal self-assigned this Jun 9, 2026

cursor Bot reviewed Jun 9, 2026

Comment thread client/templates/jobs-manager-deployment.yaml Outdated

aptracebloc previously approved these changes Jun 9, 2026

saadqbal added the skip-fr-gate label (Bypass FR gate for this PR; use only for bootstrap or emergencies, visible in audit) Jun 9, 2026

saadqbal mentioned this pull request Jun 9, 2026

fix(#229): jobs-manager re-emits NO_PROXY unmerged for proxy customers #236

Open

saadqbal mentioned this pull request Jun 9, 2026

fix(#229): dedupe NO_PROXY — exclude proxy keys from generic env passthrough #237

Merged

Merge pull request #237 from tracebloc/fix/229-no-proxy-dedup

dce8692

fix(#229): dedupe NO_PROXY — exclude proxy keys from generic env passthrough

saadqbal dismissed aptracebloc’s stale review via dce8692 June 9, 2026 08:36

cursor Bot reviewed Jun 9, 2026

Comment thread client/templates/jobs-manager-deployment.yaml

saadqbal mentioned this pull request Jun 9, 2026

Refine #236: gate proxy-key exclusion on HTTP_PROXY_HOST (preserve direct HTTP_PROXY) #238

Closed

LukasWodka approved these changes Jun 9, 2026

divyasinghds approved these changes Jun 9, 2026

saadqbal merged commit e1e7a39 into main Jun 9, 2026
48 checks passed

This was referenced Jun 9, 2026

release: client chart 1.6.1 — proxy-key exclusion gate (#238) + appVersion lockstep #239

Merged

Sync develop → main for v1.6.1 chart release #241

Merged

Sync develop → main for v1.7.0 chart release (§8.2 egress gateway, inert) #252

Merged

saadqbal merged 9 commits into main from sync/develop-to-main-v1.6.0

4 participants