fix(#222): wire SINGLE_NODE - chart default from hostPath.enabled + installer #224
Merged
…nstaller

Companion to tracebloc/client-runtime#92, which gates jobs-manager's GPU->CPU pending fallback behind a SINGLE_NODE env var. This wires the flag through the chart + installer so existing clusters get correct behavior across the hands-off auto-upgrade (helm upgrade --reset-then-reuse-values).

- jobs-manager-deployment.yaml: inject SINGLE_NODE (both containers) = env.SINGLE_NODE if set, else hostPath.enabled (quoted string, hostPath nil-guarded). Added to the range-exclusion so it's never double-emitted.
- install-client-helm.sh: installer writes env.SINGLE_NODE: "true" for new k3d installs (fixed single-host, cannot autoscale).
- values.yaml + values.schema.json: document the flag (quoted string).
- MIGRATION.md: 1.5.1 upgrade note (behavior table + override guidance).
- Chart.yaml: 1.5.0 -> 1.5.1 so auto-upgrade ships it.

Default = hostPath.enabled discriminates existing clusters that have no SINGLE_NODE in stored values: installer/bare-metal single-host (hostPath true) keep the GPU->CPU fallback; managed dynamic-PVC clusters (EKS/AKS/OpenShift, hostPath false) auto-stop the premature downgrade. A fixed multi-node bare-metal cluster that wants the fallback sets env.SINGLE_NODE: "true".

Tests: +3 helm-unittest (default false; true when hostPath enabled; explicit override wins + emitted once) and +1 installer bats assertion. 178 chart tests green, helm lint clean.

Closes #222.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
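The "emitted once" property named in the tests could also be spot-checked by hand. The helper below is only a sketch: `count_single_node_entries` is a hypothetical name, and it assumes env entries render in the usual `- name: SINGLE_NODE` list form. It counts matching entries in any rendered manifest read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: count env entries named SINGLE_NODE in a rendered
# manifest read from stdin. Assumes the "- name: SINGLE_NODE" list form.
count_single_node_entries() {
  # grep -c prints the number of matching lines; "|| true" keeps the exit
  # status at 0 when the count is zero.
  grep -c -E '^[[:space:]]*-[[:space:]]*name:[[:space:]]*"?SINGLE_NODE"?[[:space:]]*$' || true
}

# Illustrative input only: two containers, each with a single SINGLE_NODE entry.
printf '%s\n' \
  'containers:' \
  '  - name: first-container' \
  '    env:' \
  '      - name: SINGLE_NODE' \
  '        value: "true"' \
  '  - name: second-container' \
  '    env:' \
  '      - name: SINGLE_NODE' \
  '        value: "true"' \
  | count_single_node_entries
```

Against a real render this would be fed from `helm template` (for example with `--set-string env.SINGLE_NODE=true`); the release name and chart path are deployment-specific and not shown here. With `SINGLE_NODE` set on both containers, a count of 2 for the jobs-manager Deployment would be consistent with "emitted once" per container.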
divyasinghds approved these changes on Jun 9, 2026
Companion to tracebloc/client-runtime#92 (PR #93), which gates jobs-manager's GPU→CPU pending-job fallback behind a `SINGLE_NODE` env var. This PR wires the flag through the chart + installer.

Why

On multi-node/elastic clusters (EKS/AKS) a `Pending` GPU pod usually means a GPU node is still autoscaling in; the runtime's premature GPU→CPU downgrade silently moves the experiment to CPU and drives the client-runtime#80 token race. On single-node k3d the fallback is correct. `SINGLE_NODE` lets the runtime tell them apart.

Auto-update safe

The client self-upgrades via the `auto-upgrade` CronJob (`helm upgrade --reset-then-reuse-values`). Existing clusters never re-run the installer, so their stored values have no `SINGLE_NODE`. The chart default is therefore derived from a signal already in stored values, `hostPath.enabled`:

| `hostPath.enabled` | `SINGLE_NODE` default |
| --- | --- |
| `true` | `"true"` |
| `false` | `"false"` |

So existing EKS/AKS clusters auto-stop the premature downgrade on next upgrade with no value change, and existing single-node installs keep today's behavior. An explicit `env.SINGLE_NODE` always wins (e.g. a fixed multi-node bare-metal/NFS cluster that wants the fallback sets `"true"`).

Changes

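- `jobs-manager-deployment.yaml`: inject `SINGLE_NODE` into both containers = `env.SINGLE_NODE` if set, else `hostPath.enabled` (quoted; `hostPath` nil-guarded for `--reset-then-reuse-values`); added to the generic-range exclusion so it's never double-emitted.
- `install-client-helm.sh`: installer writes `env.SINGLE_NODE: "true"` for new k3d installs.
- `values.yaml` + `values.schema.json`: document the flag (quoted string; `env.additionalProperties` already allows it, and the schema correctly rejects a bare boolean).
- `MIGRATION.md`: 1.5.1 upgrade note (behavior table + override guidance).
- `Chart.yaml`: `1.5.0 → 1.5.1` so auto-upgrade actually ships the template change.

Tests
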
+3 helm-unittest (default `"false"`; `"true"` when `hostPath.enabled`; explicit override wins and is emitted exactly once) and +1 installer bats assertion. 178 chart tests green, `helm lint` clean. Rendered all scenarios manually (EKS→false, bare-metal→true, `--set-string` override).

Closes #222.
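The default resolution described above can be modeled outside Helm to make the precedence explicit. This is a sketch in shell, not the chart's template: `resolve_single_node` is a hypothetical name, and treating a missing `hostPath` block as `"false"` is an assumption about what the nil-guard yields.

```shell
#!/bin/sh
# Hypothetical model of the chart default; the real logic lives in the
# jobs-manager Deployment template, not in shell.
#   $1: env.SINGLE_NODE from stored values ("" when unset)
#   $2: hostPath.enabled ("" when the hostPath block is absent)
resolve_single_node() {
  override="$1"
  host_path_enabled="$2"
  if [ -n "$override" ]; then
    # An explicit env.SINGLE_NODE always wins.
    printf '%s\n' "$override"
  elif [ "$host_path_enabled" = "true" ]; then
    # Single-host installs (hostPath storage) keep the fallback.
    printf 'true\n'
  else
    # hostPath false, or absent (assumed nil-guard result), resolves to false.
    printf 'false\n'
  fi
}

resolve_single_node ""     "true"   # existing single-host install -> true
resolve_single_node ""     "false"  # existing EKS/AKS cluster     -> false
resolve_single_node ""     ""       # hostPath block missing       -> false
resolve_single_node "true" "false"  # multi-node bare-metal opt-in -> true
```

The output is always a string, matching the quoted-string convention the schema enforces for `env` values.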
🤖 Generated with Claude Code
Note
Medium Risk
Changes GPU job scheduling behavior on upgrade for EKS/AKS (intended fix) and depends on a matching jobs-manager image; mis-set SINGLE_NODE could leave GPU jobs pending or trigger unwanted CPU downgrades.
Overview
Chart 1.5.1 wires `env.SINGLE_NODE` into jobs-manager so runtime can gate the GPU→CPU pending-pod fallback (client-runtime#92): on elastic clusters, long-pending GPU jobs stay on GPU instead of being downgraded after ~180s.

The jobs-manager Deployment now sets `SINGLE_NODE` on both containers: explicit `env.SINGLE_NODE` wins; otherwise it defaults to `hostPath.enabled` (quoted, nil-safe for `--reset-then-reuse-values`). The generic `env` range excludes `SINGLE_NODE` so it is not duplicated. The installer writes `SINGLE_NODE: "true"` for new k3d installs; values, schema, and MIGRATION.md document the flag and auto-upgrade behavior.

Tests: three helm-unittest cases for default/override behavior and one installer bats check for generated values.
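For orientation, the runtime-side gate this flag feeds can be pictured roughly as follows. The real check lives in client-runtime and is not shown in this PR, so everything here is an assumption: the function name `should_fallback_to_cpu`, the inclusive comparison, and the 180-second default (taken from the "~180s" figure above).

```shell
#!/bin/sh
# Hypothetical sketch of the gate; not the client-runtime implementation.
#   $1: SINGLE_NODE as seen by the runtime ("true" or "false")
#   $2: seconds the GPU pod has been Pending
#   $3: threshold in seconds (assumed default: 180)
# Returns 0 when a GPU->CPU downgrade would be allowed, 1 otherwise.
should_fallback_to_cpu() {
  single_node="$1"
  pending_seconds="$2"
  threshold="${3:-180}"
  if [ "$single_node" = "true" ] && [ "$pending_seconds" -ge "$threshold" ]; then
    return 0
  fi
  # On elastic clusters (SINGLE_NODE=false) the job stays on GPU and keeps waiting.
  return 1
}

if should_fallback_to_cpu "false" 600; then
  echo "downgrade to CPU"
else
  echo "keep waiting for a GPU node"
fi
```

With `SINGLE_NODE` set to `"false"`, the sketch never downgrades regardless of how long the pod has been pending, which is the behavior change the note above flags as the main upgrade risk.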
Reviewed by Cursor Bugbot for commit eeaf810. Bugbot is set up for automated code reviews on this repo.