fix(#222): wire SINGLE_NODE — chart default from hostPath.enabled + installer#224

Merged: saadqbal merged 1 commit into develop from feat/92-single-node-gpu-fallback on Jun 9, 2026

@saadqbal saadqbal commented Jun 8, 2026

Companion to tracebloc/client-runtime#92 (PR #93), which gates jobs-manager's GPU→CPU pending-job fallback behind a SINGLE_NODE env var. This PR wires the flag through the chart + installer.

Why

On multi-node/elastic clusters (EKS/AKS) a Pending GPU pod usually means a GPU node is still autoscaling in; the runtime's premature GPU→CPU downgrade silently moves the experiment to CPU and drives the client-runtime#80 token race. On single-node k3d the fallback is correct. SINGLE_NODE lets the runtime tell them apart.
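
As a rough illustration of the gate described above, the runtime-side check could look something like the sketch below. This is not the client-runtime code: the function names, the parsing rule (only the string "true" enables the fallback), the default when the variable is unset, and the use of the ~180s threshold mentioned in the Bugbot overview are all assumptions made for the example.

```python
import os

# Assumption: illustrative grace period before a Pending GPU job is downgraded.
PENDING_FALLBACK_SECONDS = 180


def single_node_enabled(environ=None) -> bool:
    """Return True only when SINGLE_NODE is the string "true" (case-insensitive).

    Assumption: an unset variable is treated as multi-node ("false").
    """
    env = os.environ if environ is None else environ
    return env.get("SINGLE_NODE", "false").strip().lower() == "true"


def should_fall_back_to_cpu(pending_seconds: float, environ=None) -> bool:
    """Decide whether a Pending GPU job should be downgraded to CPU."""
    if not single_node_enabled(environ):
        # Elastic cluster: a GPU node may still be scaling in, so keep waiting.
        return False
    # Fixed single-node cluster: no new GPU capacity will appear.
    return pending_seconds >= PENDING_FALLBACK_SECONDS
```

With this shape, a job that has been Pending for 200 seconds is downgraded only when SINGLE_NODE is "true"; on an elastic cluster it is left for the autoscaler.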

Auto-update safe

The client self-upgrades via the auto-upgrade CronJob (helm upgrade --reset-then-reuse-values). Existing clusters never re-run the installer, so their stored values have no SINGLE_NODE. The chart default is therefore derived from a signal already in stored values — hostPath.enabled:

| Deployment | hostPath.enabled | SINGLE_NODE default | Behavior |
| --- | --- | --- | --- |
| Installer k3d / bare-metal single-host | true | "true" | GPU→CPU fallback on (unchanged) |
| EKS / AKS / OpenShift (dynamic PVC) | false | "false" | Pending GPU pods left for the autoscaler |

So existing EKS/AKS clusters auto-stop the premature downgrade on next upgrade with no value change, and existing single-node installs keep today's behavior. An explicit env.SINGLE_NODE always wins (e.g. a fixed multi-node bare-metal/NFS cluster that wants the fallback sets "true").
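
The precedence rule can be modelled in a few lines of Python. This is only a sketch of the logic, not the chart template itself: `resolve_single_node` is a hypothetical helper, `values` stands in for the merged Helm values, and treating a missing or null `hostPath` block as disabled is my reading of the nil-guard.

```python
def resolve_single_node(values: dict) -> str:
    """Return the SINGLE_NODE string the Deployment would receive.

    An explicit env.SINGLE_NODE wins; otherwise the value is derived from
    hostPath.enabled and emitted as a quoted string ("true" or "false").
    """
    env = values.get("env") or {}
    explicit = env.get("SINGLE_NODE")
    if explicit is not None:
        # Assumption: the schema has already ensured this is a string.
        return explicit

    # Nil-guard: stored values may carry no hostPath block at all.
    host_path = values.get("hostPath") or {}
    return "true" if host_path.get("enabled") else "false"
```

Under these assumptions, stored values with `hostPath.enabled: true` resolve to "true", dynamic-PVC clusters resolve to "false", and a fixed multi-node cluster that sets `env.SINGLE_NODE: "true"` keeps the fallback regardless of `hostPath`.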

Changes

  • jobs-manager-deployment.yaml: inject SINGLE_NODE into both containers, set to env.SINGLE_NODE when present and to hostPath.enabled otherwise (quoted; hostPath nil-guarded for --reset-then-reuse-values); also added to the generic-range exclusion so it is never emitted twice.
  • install-client-helm.sh: installer writes env.SINGLE_NODE: "true" for new k3d installs.
  • values.yaml + values.schema.json: document the flag (quoted string; env.additionalProperties already allows it — schema correctly rejects a bare boolean).
  • MIGRATION.md: 1.5.1 upgrade note.
  • Chart.yaml: 1.5.0 → 1.5.1 so auto-upgrade actually ships the template change.
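
To show why the range exclusion matters, here is a small Python model of the env rendering. It is a sketch only: `render_env`, the `EXCLUDED_FROM_RANGE` set, and the `LOG_LEVEL` key in the usage note are hypothetical names, and the real template may order or format entries differently.

```python
# Assumption: keys with a dedicated entry are skipped by the generic env range.
EXCLUDED_FROM_RANGE = {"SINGLE_NODE"}


def render_env(values: dict) -> list[dict]:
    """Model the container env list: one dedicated SINGLE_NODE entry,
    followed by every other env key from the generic range."""
    env = values.get("env") or {}
    host_path = values.get("hostPath") or {}

    single_node = env.get("SINGLE_NODE")
    if single_node is None:
        single_node = "true" if host_path.get("enabled") else "false"

    rendered = [{"name": "SINGLE_NODE", "value": single_node}]
    for name, value in sorted(env.items()):
        if name in EXCLUDED_FROM_RANGE:
            # Without this skip, an explicit override would be emitted twice.
            continue
        rendered.append({"name": name, "value": str(value)})
    return rendered
```

For example, calling `render_env` with `env` containing both `SINGLE_NODE: "true"` and a hypothetical `LOG_LEVEL: "info"` yields a single SINGLE_NODE entry plus the LOG_LEVEL entry, which is the "emitted exactly once" property the new helm-unittest case checks.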

Tests

+3 helm-unittest (default "false"; "true" when hostPath.enabled; explicit override wins and is emitted exactly once) and +1 installer bats assertion. 178 chart tests green, helm lint clean. Rendered all scenarios manually (EKS→false, bare-metal→true, --set-string override).

Closes #222.

🤖 Generated with Claude Code


Note

Medium Risk
Changes GPU job scheduling behavior on upgrade for EKS/AKS (intended fix) and depends on a matching jobs-manager image; mis-set SINGLE_NODE could leave GPU jobs pending or trigger unwanted CPU downgrades.

Overview
Chart 1.5.1 wires env.SINGLE_NODE into jobs-manager so the runtime can gate the GPU→CPU pending-pod fallback (client-runtime#92): on elastic clusters, long-pending GPU jobs stay on GPU instead of being downgraded after ~180s.

The jobs-manager Deployment now sets SINGLE_NODE on both containers: explicit env.SINGLE_NODE wins; otherwise it defaults to hostPath.enabled (quoted, nil-safe for --reset-then-reuse-values). The generic env range excludes SINGLE_NODE so it is not duplicated. The installer writes SINGLE_NODE: "true" for new k3d installs; values, schema, and MIGRATION.md document the flag and auto-upgrade behavior.

Tests: three helm-unittest cases for default/override behavior and one installer bats check for generated values.

Reviewed by Cursor Bugbot for commit eeaf810.

…nstaller

Companion to tracebloc/client-runtime#92, which gates jobs-manager's GPU->CPU
pending fallback behind a SINGLE_NODE env var. This wires the flag through the
chart + installer so existing clusters get correct behavior across the
hands-off auto-upgrade (helm upgrade --reset-then-reuse-values).

- jobs-manager-deployment.yaml: inject SINGLE_NODE (both containers) =
  env.SINGLE_NODE if set, else hostPath.enabled (quoted string, hostPath
  nil-guarded). Added to the range-exclusion so it's never double-emitted.
- install-client-helm.sh: installer writes env.SINGLE_NODE: "true" for new k3d
  installs (fixed single-host, cannot autoscale).
- values.yaml + values.schema.json: document the flag (quoted string).
- MIGRATION.md: 1.5.1 upgrade note (behavior table + override guidance).
- Chart.yaml: 1.5.0 -> 1.5.1 so auto-upgrade ships it.

Default = hostPath.enabled discriminates existing clusters that have no
SINGLE_NODE in stored values: installer/bare-metal single-host (hostPath true)
keep the GPU->CPU fallback; managed dynamic-PVC clusters (EKS/AKS/OpenShift,
hostPath false) auto-stop the premature downgrade. A fixed multi-node
bare-metal cluster that wants the fallback sets env.SINGLE_NODE: "true".

Tests: +3 helm-unittest (default false; true when hostPath enabled; explicit
override wins + emitted once) and +1 installer bats assertion. 178 chart tests
green, helm lint clean.

Closes #222.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

👋 Heads-up — Code review queue is at 10 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

This was referenced Jun 8, 2026
@LukasWodka LukasWodka requested a review from aptracebloc June 8, 2026 18:34
@saadqbal saadqbal self-assigned this Jun 9, 2026
@saadqbal saadqbal merged commit 81f4f80 into develop Jun 9, 2026
35 checks passed