Sync develop → main for v1.6.0 chart release #235

Merged
saadqbal merged 9 commits into main from sync/develop-to-main-v1.6.0 on Jun 9, 2026
@saadqbal commented Jun 9, 2026

Contributor

Sync develop -> main to cut the v1.6.0 client chart release. Tracks #234.

main is 6 commits behind develop, and those commits are exactly the changes validated this cycle.

Validated on dev (full e2e) and on a prod canary (hasan-prod, healthy on 1.6.0).

After merge, publishing GitHub Release v1.6.0 triggers release-helm-chart.yaml (packages client + ingestor, pushes to gh-pages). NOTE: this flips fleet-wide auto-upgrade — client-chart releases with auto-upgrade ON converge to 1.6.0 at their next :23 cronjob.

🤖 Generated with Claude Code


Note

Medium Risk
Fleet auto-upgrade to 1.6.0 changes ingestion image resolution and GPU pending behavior on managed clusters; proxy NO_PROXY merging affects egress if HTTP_PROXY_HOST is set—validated on dev and a prod canary per release notes.

Overview
v1.6.0 release sync: bumps the client chart to 1.6.0 and bundles several operational fixes.

Ingestor image model shifts to floating-tag spawn by default (images.ingestor.digest empty): jobs-manager gets INGESTOR_IMAGE_REPOSITORY / INGESTOR_IMAGE_TAG / optional digest so each ingestion Job pulls the current tag (Always), avoiding helm upgrade reverting a stale pinned digest. The image-refresh CronJob drops the ingestor (class-2) pass and only rolls jobs-manager + pods-monitor; CI now asserts the floating tag (and optional pin) is multi-arch.
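The spawn-time choice described above can be sketched as follows. This is a minimal illustration in Python, not the jobs-manager implementation: the function name is hypothetical, and the `IfNotPresent` policy for a pinned digest is an assumption. Only the three `INGESTOR_IMAGE_*` variable names and the "floating tag + Always" default come from this PR.

```python
from typing import Mapping, Tuple


def resolve_ingestor_image(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (image_reference, pull_policy) for a spawned ingestion Job.

    Hypothetical helper: only the INGESTOR_IMAGE_* names come from the PR.
    """
    repository = env["INGESTOR_IMAGE_REPOSITORY"]
    tag = env.get("INGESTOR_IMAGE_TAG", "")
    digest = env.get("INGESTOR_IMAGE_DIGEST", "").strip()

    if digest:
        # Opt-in pin: the reference is immutable, so a cached copy is fine.
        # (Assumed policy; the PR does not state what a pinned spawn uses.)
        return f"{repository}@{digest}", "IfNotPresent"

    # Default: floating tag, re-pulled at every spawn so a helm upgrade
    # cannot silently revert the ingestor to a stale digest.
    return f"{repository}:{tag}", "Always"
```

With an empty digest the result is `repository:tag` with `Always`; setting a digest switches to `repository@digest`, which is the opt-in pin.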

GPU scheduling: new env.SINGLE_NODE (defaults from hostPath.enabled) gates jobs-manager’s GPU→pending→CPU fallback, so EKS/AKS/OpenShift leave Pending GPU pods for the autoscaler; the k3d installer writes SINGLE_NODE: "true".
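A rough sketch of that gate, assuming the flag reaches jobs-manager as a string environment variable. The function names and the "unset means false" default are hypothetical; in the chart the default is derived from hostPath.enabled.

```python
from typing import Mapping


def parse_bool(value: str) -> bool:
    """Interpret a string flag such as "true" or "false"."""
    return value.strip().lower() in {"1", "true", "yes"}


def should_fall_back_to_cpu(env: Mapping[str, str], gpu_pod_pending: bool) -> bool:
    """Fall back to CPU only on single-node installs.

    On autoscaled clusters a Pending GPU pod is the signal that adds a GPU
    node, so the pod is left alone there.
    """
    single_node = parse_bool(env.get("SINGLE_NODE", "false"))  # assumed default
    return single_node and gpu_pod_pending
```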

Corporate proxy: new tracebloc.proxyEnv injects HTTP(S)_PROXY and merged NO_PROXY into jobs-manager, requests-proxy, and upgrade/refresh CronJobs, with proxy keys excluded from the generic .Values.env loop so in-cluster traffic isn’t proxied.

MySQL startup/liveness/readiness probes now connect over TCP (mysqladmin ping -h 127.0.0.1) instead of the client-default unix socket, whose path did not match where the image writes its socket.

Installers add post-install CLI PATH verification (bash/PowerShell) with shell-specific fix hints. Docs/tests/schema updated for the above.

Reviewed by Cursor Bugbot for commit dce8692. Bugbot is set up for automated code reviews on this repo.

saadqbal and others added 7 commits June 9, 2026 11:21
…ervice-suite

test(charts): add helm-unittest suite for requests-proxy-service template
fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer
…225)

* fix(mysql): probe over TCP so a moved socket can't kill a healthy DB

The startup/liveness/readiness probes ran `mysqladmin ping -h localhost`,
which connects via the client-default unix socket /var/run/mysqld/mysqld.sock.
The mysql-client image actually writes its socket to /var/lib/mysql/mysql.sock,
so the probe can never reach a healthy mysqld: the startup probe exhausts
(~130s) and the kubelet kill-loops the database, cascading jobs-manager and
requests-proxy into CrashLoopBackOff. Switch all three probes to TCP
(-h 127.0.0.1, port 3306) — immune to where the image places the socket —
and document why so it is not reverted.

Reproduced on a fresh install; cached environments were masked by the
mutable :prod tag + imagePullPolicy: IfNotPresent.
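Why the host value matters can be shown with a small Python sketch. On Unix the MySQL client programs treat the host name `localhost` specially and connect through a unix socket file, while an IP address such as `127.0.0.1` goes over TCP. The helper names are hypothetical, and the explicit `-P` flag is an illustrative addition rather than a quote from the chart.

```python
from typing import List


def mysql_client_transport(host: str) -> str:
    """Transport a MySQL client picks on Unix when no --protocol is given."""
    # An empty host defaults to localhost, which means the unix socket.
    return "unix-socket" if host in ("", "localhost") else "tcp"


def probe_command(host: str = "127.0.0.1", port: int = 3306) -> List[str]:
    """Build a mysqladmin ping invocation that does not depend on a socket path."""
    return ["mysqladmin", "ping", "-h", host, "-P", str(port)]
```

`mysql_client_transport("localhost")` yields `"unix-socket"`, which is the path that broke once the socket file was not where the client expected it.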

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(mysql): lock probes to TCP + keep writable scratch mounts (kill-loop regression guard, backend#767)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…_PROXY_* values) (#229)

* fix(proxy): make all egress workloads proxy-aware (wire the dead HTTP_PROXY_* values)

The env.HTTP_PROXY_HOST/PORT/USERNAME/PASSWORD values were declared in
values.yaml + values.schema.json but consumed by no template, and no workload
pod received HTTP(S)_PROXY/NO_PROXY env. Behind a corporate proxy the installer's
node-level proxy (scripts/lib/cluster.sh) handles image pulls, but the application
pods make direct external calls (jobs-manager -> api.tracebloc.io) that the network
refuses -> CrashLoopBackOff.

Add a tracebloc.proxyEnv helper that derives HTTP(S)_PROXY + an auto-augmented
NO_PROXY (cluster-internal ranges always included, mirroring cluster.sh, so
in-cluster + MySQL traffic never traverses the proxy) from the env.HTTP_PROXY_*
values, and reference it on every external-egress workload: jobs-manager
(api + pods-monitor), requests-proxy, image-refresh CronJob, auto-upgrade CronJob.
Renders nothing when no proxy is set, so non-proxy installs are unchanged.
Excludes mysql-client and resource-monitor (no external egress) and the ingestor
sub-chart (talks only to jobs-manager.<ns>.svc, in-cluster).
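The derivation can be approximated in Python as below. This is a sketch of the idea, not the Helm helper itself: the function name, the URL scheme, the credential encoding, and the exact cluster-internal entries are assumptions (the PR only names `.svc` and `10.0.0.0/8` explicitly).

```python
from typing import Dict
from urllib.parse import quote

# Assumed cluster-internal entries; the real list mirrors cluster.sh.
CLUSTER_NO_PROXY = ("localhost", "127.0.0.1", ".svc", ".cluster.local", "10.0.0.0/8")


def proxy_env(host: str, port: str = "", username: str = "",
              password: str = "", user_no_proxy: str = "") -> Dict[str, str]:
    """Derive HTTP(S)_PROXY and a merged NO_PROXY; empty dict when no proxy is set."""
    if not host:
        return {}  # non-proxy installs render nothing

    credentials = ""
    if username:
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    url = f"http://{credentials}{host}" + (f":{port}" if port else "")

    # Cluster-internal entries always come first; user entries are appended
    # without duplicates so in-cluster traffic never traverses the proxy.
    merged = []
    for entry in list(CLUSTER_NO_PROXY) + user_no_proxy.split(","):
        entry = entry.strip()
        if entry and entry not in merged:
            merged.append(entry)

    return {"HTTP_PROXY": url, "HTTPS_PROXY": url, "NO_PROXY": ",".join(merged)}
```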

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(proxy): assert every egress workload gets proxy env when configured, none when not (backend#768)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…hinery (companion to client-runtime#94) (#231)

* fix(ingestor): spawn by floating tag + retire chart-pinned digest machinery

Companion to client-runtime#94. That PR makes jobs-manager spawn the
ingestor Job by its floating tag with imagePullPolicy=Always by default
(digest pinning becomes opt-in). This wires the chart to that model and
removes the now-dead machinery the old pinned-digest design needed.

Root problem: the ingestor was the only spawned image pinned by a chart
digest. A `helm upgrade` that reset values reverted images.ingestor.digest
to the chart baseline, silently downgrading the ingestor — the failure mode
behind the recurring "empty columns rejected as non-numeric" customer
reports. Training pods never had this because they spawn by floating tag +
Always; this makes the ingestor match.

jobs-manager-deployment.yaml:
  * Wire INGESTOR_IMAGE_REPOSITORY + INGESTOR_IMAGE_TAG (new) alongside the
    now-optional INGESTOR_IMAGE_DIGEST. Empty digest (default) -> floating tag.

values.yaml / values.schema.json:
  * images.ingestor: add `repository`, default `digest: ""` (opt-in pin),
    keep `tag: "0.3"`; drop the dead `autoRefresh` flag. v0.3.6 digest kept
    as a comment for easy pinning. Repository overridable for air-gap mirror.
  * imageRefresh: drop `ingestorResolveFailureThreshold` (class-2 only).

image-refresh-cronjob.yaml + _helpers.tpl:
  * Retire the class-2 (ingestor) pass entirely — the kubelet now resolves
    the ingestor digest at each spawn, so there is nothing to reconcile. The
    CronJob keeps only the class-1 pass (jobs-manager + pods-monitor
    rollout-restart). imageRefreshEnabled no longer factors in the ingestor.

helm-ci.yaml:
  * ingestor-multiarch now validates the floating TAG (the default spawn
    target) and only checks a pinned digest when one is set, instead of
    failing on the now-empty digest.

ingestor subchart (README + values): update docs that described the old
auto-upgrade / INGESTOR_IMAGE_DIGEST currency mechanism.

Validated locally: helm lint --strict (aks/eks/bm/oc), helm template,
helm unittest (167/167), sh -n on the rendered refresh script.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(chart): bump 1.5.1 -> 1.6.0 + helm-unittest for ingestor env wiring

Minor version bump for the floating-tag ingestor feature. Adds three
jobs-manager deployment unittests covering the new ingestor image wiring:
floating-tag default (INGESTOR_IMAGE_REPOSITORY/TAG set, DIGEST empty),
opt-in digest pin, and repository+tag override for air-gapped mirrors.

helm unittest ./client: 170/170 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* feat(installer): self-verify CLI usability post-install with a shell-correct PATH fix (#738)

Step 5 installed the tracebloc CLI and then told the user "open a new
terminal so it's on your PATH" — without ever proving a fresh terminal
would actually find it. That is exactly the cli#61 failure mode (binary
lands in ~/.local/bin, which a brand-new shell doesn't have on PATH),
left undetected until the customer hits it. The installer is the last
place to catch it.

After the install attempt, self-verify and report precisely:

- Probe `command -v tracebloc` in BOTH a fresh login shell ("$SHELL" -lic)
  and a non-login shell ("$SHELL" -ic) — they read different startup files
  (~/.profile vs ~/.bashrc), and cli#61 was "works in my login shell,
  missing in a plain `bash` subshell".
- If found: confirm via `tracebloc version` and print a VERIFIED verdict.
  The canonical `tracebloc dataset push ./data` next step stays in the
  summary's "What to do next" — not duplicated here.
- If a fresh shell would NOT find it: print the EXACT shell-correct fix
  for the user's actual $SHELL (zsh→~/.zshrc, bash+linux→~/.bashrc,
  bash+darwin→~/.bash_profile, fish→fish_add_path + ~/.config/fish/config.fish,
  else ~/.profile), not a generic "open a new terminal".
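The shell-to-rc-file mapping listed above amounts to a small lookup. A Python sketch follows; the installer itself is bash, and the function name is hypothetical.

```python
import os


def rc_file_for(shell_path: str, os_name: str) -> str:
    """Pick the startup file a fresh shell of this kind would read."""
    shell = os.path.basename(shell_path)  # "/bin/zsh" -> "zsh"
    if shell == "zsh":
        return "~/.zshrc"
    if shell == "bash" and os_name == "linux":
        return "~/.bashrc"
    if shell == "bash" and os_name == "darwin":
        return "~/.bash_profile"
    if shell == "fish":
        return "~/.config/fish/config.fish"
    return "~/.profile"  # fallback for any other shell
```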

Stays NON-FATAL by design: the client is already connected by Step 5, so
the verification always returns 0 and is hardened against the orchestrator's
`set -e`. Mirrored in install-k8s.ps1 (RefreshPath is the faithful
"fresh terminal" probe on Windows, since the CLI installer edits the
user-scope registry PATH).

Tests: extend install-cli.bats (verified-command success, actionable
shell-correct PATH hint on miss, fish-specific fix, non-fatal under
`set -e`) and mirror in install-k8s.Tests.ps1 (Test-TraceblocCli:
verified verdict, actionable hint, non-fatal when RefreshPath throws).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): make the #738 Windows CLI-verify Pester-safe on Linux CI

Two follow-ups so the Pester jobs go green (they were the only red checks
on #215; Pester is green on develop, so this PR introduced both):

- install-k8s.ps1: the new $TRACEBLOC_CLI_INSTALL_DIR ran Join-Path on
  $env:LOCALAPPDATA at top level. The Pester suite dot-sources this script,
  and on the Linux runner $env:LOCALAPPDATA is null — Join-Path throws on a
  null -Path, aborting BeforeAll and failing the whole container (0/65).
  Guard it; "" placeholder off Windows since the value is only used there.

- install-k8s.Tests.ps1: add a `function tracebloc { }` stub so
  `Mock tracebloc` can bind. Pester v5 only mocks commands that already
  exist (cf. the existing kubectl/docker/helm/k3d stubs); without it the
  "fresh-shell success" test threw CommandNotFoundException — the lone
  windows-latest failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): make the #738 PATH-fix guidance actually persist

Review follow-up on #215 (comment #2). The miss-branch hint printed a bare
`export PATH=…` (fixes only the current shell) followed by `source ${rc}` on
an rc that did not yet contain the line — so neither command persisted the
fix, while the closing note implied ${rc} should hold it. The user was never
told to write the line into the rc. Rewrite the guidance per-shell:

- POSIX shells (zsh / bash / sh / dash): `echo '<export>' >> ${rc}` then
  `source ${rc}` — one copy-pasteable step that fixes THIS terminal and every
  new one.
- fish: `fish_add_path "…"` already persists (a universal var) AND applies to
  the running shell, so drop the misleading `source ~/.config/fish/config.fish`.
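The per-shell guidance can be expressed as a function that returns the commands to print. This is a Python sketch only: the function name is hypothetical, and `$HOME/.local/bin` is assumed as the install directory based on the cli#61 description.

```python
from typing import List


def path_fix_commands(shell: str, rc: str,
                      bin_dir: str = "$HOME/.local/bin") -> List[str]:
    """Return copy-pasteable commands that persist the PATH fix."""
    if shell == "fish":
        # fish_add_path persists and applies to the running shell,
        # so no export line and no `source` step is needed.
        return [f'fish_add_path "{bin_dir}"']

    export_line = f'export PATH="{bin_dir}:$PATH"'
    return [
        f"echo '{export_line}' >> {rc}",  # persist for every new terminal
        f"source {rc}",                   # apply to the current terminal
    ]
```

For zsh this produces the `echo … >> ~/.zshrc` form followed by `source ~/.zshrc`; for fish it produces a single `fish_add_path` command.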

Tests (install-cli.bats): the zsh miss-path now asserts the `echo … >> ~/.zshrc`
form; the fish case asserts no POSIX `export` and no `source`. bats 8/8 pass,
shellcheck --severity=error gate clean, bash -n clean.

NOTE: review comment #1 (fish fresh-shell probe using `command -v`, which fish's
`command` builtin lacks) is NOT addressed here — it needs verification on a real
fish and is tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
…232) (#233)

The Amazon Linux and AlmaLinux/Rocky branches of install_docker_engine gated
on `[[ -f /etc/os-release ]] && grep ... /etc/os-release`. The bats suite
mocks `grep`, but a bash builtin file-test can't be intercepted by a function
mock — so on macOS (no /etc/os-release) both branches short-circuited and the
two install_docker_engine distro tests fell through to the get.docker.com
path and failed. CI was green because unit-bash runs on ubuntu.

Read the os-release path from `${TB_OS_RELEASE_FILE:-/etc/os-release}` for
both the file-test and the grep. The bats setup() now writes a real
os-release fixture per $TEST_DISTRO to a temp file and exports
TB_OS_RELEASE_FILE, so distro detection runs for real (real `[[ -f ]]` + real
`grep`) on every dev host. Behaviour is unchanged when the env var is unset
(production / Linux CI).
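The override pattern is language-independent. A Python sketch of the same idea is below; the function names and the substring match are hypothetical, since the actual script uses bash `[[ -f ]]` and `grep` with its own pattern.

```python
import os


def os_release_path() -> str:
    """Mirror ${TB_OS_RELEASE_FILE:-/etc/os-release}: unset or empty means the default."""
    return os.environ.get("TB_OS_RELEASE_FILE") or "/etc/os-release"


def os_release_mentions(needle: str) -> bool:
    """Return True if the os-release file exists and contains `needle`."""
    path = os_release_path()
    if not os.path.isfile(path):  # same path for the file test and the read
        return False
    with open(path, encoding="utf-8") as handle:
        return any(needle in line for line in handle)
```

Because both the existence check and the read go through `os_release_path()`, a test can point them at a fixture file without mocking either step.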

All 20 tests in scripts/tests/setup-linux.bats now pass on macOS.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal self-assigned this Jun 9, 2026
Comment thread client/templates/jobs-manager-deployment.yaml Outdated
aptracebloc previously approved these changes Jun 9, 2026
@saadqbal added the skip-fr-gate label (Bypass FR gate for this PR; use only for bootstrap or emergencies; visible in audit) Jun 9, 2026
…passthrough

jobs-manager's api and pods-monitor containers include tracebloc.proxyEnv
(merged, cluster-safe NO_PROXY) AND a generic .Values.env passthrough. The
passthrough re-emitted a user-set NO_PROXY UNMERGED after proxyEnv; k8s keeps
the last duplicate, so the unmerged copy won — dropping the cluster-internal
entries (.svc, 10.0.0.0/8, ...) and routing in-cluster traffic through the
proxy. Exclude proxy-owned keys (via a shared $proxyKeys list) from both
passthrough loops so proxyEnv is the sole source.

Adds a proxy_env_test regression (custom NO_PROXY + proxy -> single merged
NO_PROXY, no unmerged copy) for both jobs-manager containers.
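The failure and the fix can be reproduced with a short Python model of the env list. It is a sketch under assumptions: the function names are hypothetical, and the contents of the key set are assumed, since the PR does not list what `$proxyKeys` holds.

```python
from typing import Dict, List, Mapping, Tuple

# Assumed contents of the shared proxy key list.
PROXY_KEYS = {"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"}


def render_container_env(proxy_env: Mapping[str, str], values_env: Mapping[str, str],
                         exclude_proxy_keys: bool = True) -> List[Tuple[str, str]]:
    """Emit proxyEnv entries first, then the generic passthrough."""
    entries = list(proxy_env.items())
    for name, value in values_env.items():
        if exclude_proxy_keys and name in PROXY_KEYS:
            continue  # proxyEnv stays the sole source for these keys
        entries.append((name, value))
    return entries


def effective_env(entries: List[Tuple[str, str]]) -> Dict[str, str]:
    """Model the behaviour described in the commit: the last duplicate wins."""
    result: Dict[str, str] = {}
    for name, value in entries:
        result[name] = value
    return result
```

With `exclude_proxy_keys=False` a user-set `NO_PROXY` overrides the merged value, which is the regression; with the default it is skipped and the merged value survives.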

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fix(#229): dedupe NO_PROXY — exclude proxy keys from generic env passthrough

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread client/templates/jobs-manager-deployment.yaml
4 participants