53 changes: 37 additions & 16 deletions .github/workflows/helm-ci.yaml
@@ -117,28 +117,49 @@ jobs:
echo "Schema validation passed for ${{ matrix.platform }}"

ingestor-multiarch:
# Guard: the chart's PINNED ingestor digest must be a multi-arch index
# (linux/amd64 + linux/arm64). Greenfield installs spawn the ingestor from this
# pinned digest (before image-refresh ticks), so an amd64-only pin breaks data
# ingestion on arm64 hosts (Apple Silicon, Graviton) with ImagePullBackOff. This
# would have caught #160 (which pinned the amd64-only v0.3.1). See client#186.
name: Pinned ingestor digest is multi-arch
# Guard: the ingestor image the cluster spawns must be a multi-arch index
# (linux/amd64 + linux/arm64), or arm64 hosts (Apple Silicon, Graviton)
# fail data ingestion with "no match for platform" / ImagePullBackOff.
# jobs-manager spawns ingestion Jobs by the floating tag
# `images.ingestor.tag` (imagePullPolicy=Always) by DEFAULT, so that tag
# must be multi-arch. If an operator opts into pinning via
# `images.ingestor.digest` (empty by default), the pinned digest must be
# multi-arch too. This would have caught #160 (the amd64-only v0.3.1 pin).
# See client#186.
name: Spawned ingestor image is multi-arch
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Assert images.ingestor.digest supports linux/amd64 + linux/arm64
- name: Assert the ingestor tag (and pinned digest, if any) is multi-arch
run: |
repo=$(yq '.images.ingestor.repository' client/values.yaml)
if [ -z "$repo" ] || [ "$repo" = "null" ]; then repo="ghcr.io/tracebloc/ingestor"; fi
tag=$(yq '.images.ingestor.tag' client/values.yaml)
digest=$(yq '.images.ingestor.digest' client/values.yaml)
echo "Pinned ingestor digest: $digest"
if [ -z "$digest" ] || [ "$digest" = "null" ]; then
echo "::error::images.ingestor.digest is empty — it must be a pinned multi-arch digest."; exit 1

assert_multiarch() {
ref="$1"; label="$2"
echo "Inspecting $label: $ref"
plats=$(docker buildx imagetools inspect "$ref" 2>&1 \
| awk '/Platform:/{print $2}' | grep -v '^unknown' | sort -u)
echo "Platforms: $(echo "$plats" | paste -sd' ' -)"
echo "$plats" | grep -qx 'linux/arm64' || { echo "::error::$label ($ref) is NOT multi-arch (no linux/arm64). arm64 installs (Apple Silicon, Graviton) would fail ingestion with 'no match for platform' / ImagePullBackOff. See client#186 / #160."; exit 1; }
echo "$plats" | grep -qx 'linux/amd64' || { echo "::error::$label ($ref) is missing linux/amd64."; exit 1; }
echo "OK — $label is multi-arch (amd64 + arm64)."
}

# The floating tag is the default spawn target — always validate it.
if [ -z "$tag" ] || [ "$tag" = "null" ]; then
echo "::error::images.ingestor.tag is empty — the chart must define a floating tag to spawn by."; exit 1
fi
assert_multiarch "${repo}:${tag}" "floating tag"

# A pinned digest is opt-in (empty by default). When set, it must be multi-arch too.
if [ -n "$digest" ] && [ "$digest" != "null" ]; then
assert_multiarch "${repo}@${digest}" "pinned digest"
else
echo "images.ingestor.digest empty (default) — spawning by floating tag; no pinned digest to check."
fi
plats=$(docker buildx imagetools inspect "ghcr.io/tracebloc/ingestor@$digest" 2>&1 \
| awk '/Platform:/{print $2}' | grep -v '^unknown' | sort -u)
echo "Platforms: $(echo "$plats" | paste -sd' ' -)"
echo "$plats" | grep -qx 'linux/arm64' || { echo "::error::Pinned ingestor digest is NOT multi-arch (no linux/arm64). arm64 installs (Apple Silicon, Graviton) would fail ingestion with 'no match for platform' / ImagePullBackOff. Pin a multi-arch :0.x index. See client#186 / #160."; exit 1; }
echo "$plats" | grep -qx 'linux/amd64' || { echo "::error::Pinned ingestor digest is missing linux/amd64."; exit 1; }
echo "OK — pinned ingestor digest is multi-arch (amd64 + arm64)."

# Installer script tests (bats + Pester) + the cross-distro prerequisite matrix
# live in their own workflow: .github/workflows/installer-tests.yaml
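
Review note: the parsing half of `assert_multiarch` can be exercised without a registry by feeding it canned text. The sketch below assumes the inspect output lists one `Platform:` line per manifest, as the `awk` filter in the workflow implies. The sample text is hand-written, the `example.invalid` references are placeholders, and `list_platforms` / `is_multiarch` are hypothetical names that do not exist in the workflow.

```shell
# Print the unique, known platforms found in inspect-style text on stdin.
# Same filter chain as the workflow; `|| true` keeps the pipeline quiet when
# every line is filtered out.
list_platforms() {
  awk '/Platform:/{print $2}' | { grep -v '^unknown' || true; } | sort -u
}

# Succeed only when both linux/amd64 and linux/arm64 are present on stdin.
is_multiarch() {
  plats=$(list_platforms)
  echo "$plats" | grep -qx 'linux/amd64' && echo "$plats" | grep -qx 'linux/arm64'
}

# Hand-written samples (assumption: not captured from a real image).
sample_multi='Name: example.invalid/ingestor:prod
Manifests:
  Name:      example.invalid/ingestor:prod@sha256:aaa
  Platform:  linux/amd64
  Name:      example.invalid/ingestor:prod@sha256:bbb
  Platform:  linux/arm64
  Name:      example.invalid/ingestor:prod@sha256:ccc
  Platform:  unknown/unknown'

sample_amd64_only='Name: example.invalid/ingestor:old
Manifests:
  Name:      example.invalid/ingestor:old@sha256:ddd
  Platform:  linux/amd64'

for name in sample_multi sample_amd64_only; do
  eval "text=\$$name"
  if printf '%s\n' "$text" | is_multiarch; then
    echo "$name: multi-arch"
  else
    echo "$name: NOT multi-arch"
  fi
done
```

One thing this makes visible: a failed `inspect` (network error, missing tag) yields an empty platform list, so the job reports "NOT multi-arch" rather than a pull error. That still fails the check, but the message may mislead.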
2 changes: 1 addition & 1 deletion client/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: client
description: A unified Helm chart for tracebloc on AKS, EKS, bare-metal, and OpenShift
type: application
version: 1.5.1
version: 1.6.0
appVersion: "1.5.1"
keywords:
- tracebloc
38 changes: 38 additions & 0 deletions client/MIGRATION.md
@@ -2,6 +2,44 @@

This guide explains how to migrate from the legacy per-platform charts (`aks/`, `bm/`, `eks/`, `oc/`) to the unified `client/` chart.

## Upgrading to 1.5.1 — single-node gating of the GPU→CPU pending fallback

[client-runtime#92](https://github.com/tracebloc/client-runtime/issues/92) /
[#222](https://github.com/tracebloc/client/issues/222): jobs-manager's
GPU→CPU fallback (a GPU pod stuck `Pending` past the scheduling-overdue
interval is stopped and respun as a CPU job) is now gated on a new
`env.SINGLE_NODE` flag.

**Why:** on a multi-node / elastic cluster (EKS cluster-autoscaler / Karpenter,
AKS) a `Pending` GPU pod usually just means a GPU node is still autoscaling in
(3–10 min). Downgrading to CPU after ~180s is premature — it silently moves a
GPU experiment onto CPU and drives the stop→respin token churn behind the
[client-runtime#80](https://github.com/tracebloc/client-runtime/issues/80) 401
race. On a fixed single-node cluster (installer-provisioned k3d) GPU presence is
known at install time and no node will autoscale in, so the fallback is correct.

**What you need to do: nothing for most clusters.** `SINGLE_NODE` defaults to
`hostPath.enabled`, so the behavior tracks your existing topology across the
hands-off auto-upgrade:

| Deployment | `hostPath.enabled` | `SINGLE_NODE` default | Behavior |
|---|---|---|---|
| Installer k3d / bare-metal single-host | `true` | `"true"` | GPU→CPU fallback **on** (unchanged) |
| EKS / AKS / OpenShift (dynamic PVC) | `false` | `"false"` | Pending GPU pods left for the autoscaler |

- **EKS/AKS/OpenShift** automatically stop the premature downgrade on the next
auto-upgrade — no value change needed.
- **The installer** now writes `env.SINGLE_NODE: "true"` explicitly for new k3d
installs, so they don't depend on the heuristic.
- **A fixed multi-node bare-metal cluster** (e.g. an NFS-backed cluster with
`hostPath.enabled: false`) that *wants* the GPU→CPU fallback must set
`env.SINGLE_NODE: "true"` explicitly. The value must be a quoted string.

`SINGLE_NODE` requires the matching jobs-manager image (it ships in the same
release train). During the brief window where image-refresh rolls the new image
before this chart upgrade injects the var, an absent `SINGLE_NODE` is treated as
single-node (fallback on), so a single-node cluster is never regressed mid-rollout.
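
The defaulting rules above can be summarised as a small decision sketch. This is an illustration only, not chart or jobs-manager code: `effective_single_node` and `gpu_fallback_enabled` are hypothetical names, and the exact-match comparison against `"true"` is an assumption about how the runtime parses the variable.

```shell
# effective_single_node HOSTPATH_ENABLED [EXPLICIT]
# Models the chart side: an explicit env.SINGLE_NODE wins, otherwise the
# value follows hostPath.enabled.
effective_single_node() {
  hostpath_enabled="$1"
  explicit="${2:-}"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf '%s\n' "$hostpath_enabled"
  fi
}

# gpu_fallback_enabled [SINGLE_NODE]
# Models the runtime side: an absent variable is treated as single-node, so
# the fallback stays on. Assumption: the runtime compares against "true".
gpu_fallback_enabled() {
  [ "${1:-true}" = "true" ]
}

for hp in true false; do
  sn=$(effective_single_node "$hp")
  if gpu_fallback_enabled "$sn"; then
    echo "hostPath.enabled=$hp: GPU->CPU fallback on"
  else
    echo "hostPath.enabled=$hp: pending GPU pods left for the autoscaler"
  fi
done
```

When overriding from the command line, Helm's `--set-string env.SINGLE_NODE=true` keeps the value a string, which satisfies the quoted-string requirement.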

## Upgrading to 1.3.4 — parent chart owns the shared ingestor ServiceAccount

[#129](https://github.com/tracebloc/client/issues/129): the ingestor
63 changes: 47 additions & 16 deletions client/templates/_helpers.tpl
@@ -145,27 +145,20 @@ mysql-pvc
{{- $imgs := default dict .Values.images -}}
{{- $jm := default dict $imgs.jobsManager -}}
{{- $pm := default dict $imgs.podsMonitor -}}
{{- $in := default dict $imgs.ingestor -}}
{{/*
Per-image pin signals (each one means "skip auto-refresh for this image"):
* jobs-manager / pods-monitor: digest set (non-empty) — same signal as
the deployment uses to switch imagePullPolicy to IfNotPresent.
* ingestor: explicit `autoRefresh: false` flag — asymmetric because
ingestor.digest must be non-empty for jobs-manager to work, so we
can't use digest-presence as the signal there.
Per-image pin signal (means "skip auto-refresh for this image"):
jobs-manager / pods-monitor are pinned when `digest` is set (non-empty) —
the same signal the deployment uses to switch imagePullPolicy to
IfNotPresent. The ingestor is no longer refreshed by this CronJob (it is
spawned by jobs-manager from a floating tag — see the
image-refresh-cronjob.yaml header and submit_ingestion_run in
client-runtime), so the CronJob exists only to refresh the two class-1
images: when BOTH are pinned there is nothing left for it to do.
*/}}
{{- $jmPinned := $jm.digest -}}
{{- $pmPinned := $pm.digest -}}
{{/*
Can't use `default true $in.autoRefresh` here — Go templates treat
the bool `false` as falsy, so `default true false` returns `true`
and flips the pin state on the explicit-disable case. Instead test
for the literal `false` directly; absence (nil) and explicit `true`
both fall through to "not pinned".
*/}}
{{- $inPinned := eq $in.autoRefresh false -}}
{{- if not $ir.enabled -}}
{{- else if and (and $jmPinned $pmPinned) $inPinned -}}
{{- else if and $jmPinned $pmPinned -}}
{{- else -}}
true
{{- end -}}
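
Review note: after this change the gate reduces to "enabled, and at least one of the two class-1 images is unpinned". A minimal truth-table sketch of that reading follows. `image_refresh_enabled` is a hypothetical name, the digests are placeholders, and treating `enabled` as the literal string `true` is a simplification of the template's truthiness test.

```shell
# image_refresh_enabled ENABLED [JM_DIGEST] [PM_DIGEST]
# Returns 0 when the CronJob should render: refresh is enabled and
# jobs-manager and pods-monitor are not both pinned by digest.
image_refresh_enabled() {
  enabled="$1"
  jm_digest="${2:-}"
  pm_digest="${3:-}"
  [ "$enabled" = "true" ] || return 1
  if [ -n "$jm_digest" ] && [ -n "$pm_digest" ]; then
    return 1   # both pinned: nothing left to refresh
  fi
  return 0
}

# Walk the interesting combinations (digests are placeholders).
for args in "true  " "true sha256:aaa " "true sha256:aaa sha256:bbb" "false  "; do
  set -- $args
  if image_refresh_enabled "$@"; then state="rendered"; else state="skipped"; fi
  echo "enabled=${1:-} jm=${2:-} pm=${3:-}: $state"
done
```

The ingestor's `autoRefresh` flag no longer appears in the condition, which matches the removal of `$inPinned` in this hunk.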
@@ -207,3 +200,41 @@ Usage: {{ include "tracebloc.image" (dict "repository" "tracebloc/jobs-manager"
{{ $registry }}/{{ .repository }}:{{ .tag | default "prod" }}
{{- end -}}
{{- end }}

{{/*
tracebloc.proxyEnv — corporate-proxy env for egress-needing workloads.
Derives HTTP(S)_PROXY + an auto-augmented NO_PROXY from .Values.env.HTTP_PROXY_*
so workload pods can reach the backend / registries through a corporate proxy.
Renders nothing when HTTP_PROXY_HOST is unset (non-proxy installs unchanged).
NO_PROXY always carries the cluster-internal ranges so in-cluster + MySQL
traffic never traverses the proxy (mirrors scripts/lib/cluster.sh defaults).
Usage inside a container's env: list:
{{- include "tracebloc.proxyEnv" . | nindent 8 }}
*/}}
{{- define "tracebloc.proxyEnv" -}}
{{- if .Values.env.HTTP_PROXY_HOST }}
{{- $host := .Values.env.HTTP_PROXY_HOST -}}
{{- $port := .Values.env.HTTP_PROXY_PORT | default "" -}}
{{- $user := .Values.env.HTTP_PROXY_USERNAME | default "" -}}
{{- $pass := .Values.env.HTTP_PROXY_PASSWORD | default "" -}}
{{- $hostport := $host -}}
{{- if $port }}{{- $hostport = printf "%s:%v" $host $port -}}{{- end -}}
{{- $cred := "" -}}
{{- if $user }}{{- $cred = printf "%s:%s@" $user $pass -}}{{- end -}}
{{- $url := printf "http://%s%s" $cred $hostport -}}
{{- $noProxy := "localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.svc.cluster.local,.cluster.local,host.k3d.internal" -}}
{{- with .Values.env.NO_PROXY }}{{- $noProxy = printf "%s,%s" . $noProxy -}}{{- end }}
- name: HTTP_PROXY
value: {{ $url | quote }}
- name: HTTPS_PROXY
value: {{ $url | quote }}
- name: http_proxy
value: {{ $url | quote }}
- name: https_proxy
value: {{ $url | quote }}
- name: NO_PROXY
value: {{ $noProxy | quote }}
- name: no_proxy
value: {{ $noProxy | quote }}
{{- end }}
{{- end -}}
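
Review note: the string assembly in `tracebloc.proxyEnv` can be mirrored in plain shell to sanity-check what a given set of values renders to. This is a sketch under assumptions: `build_proxy_url` and `build_no_proxy` are hypothetical names, the host and credentials are placeholders, and the default list is copied from the helper above.

```shell
DEFAULT_NO_PROXY='localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.svc.cluster.local,.cluster.local,host.k3d.internal'

# build_proxy_url HOST [PORT] [USER] [PASS]
# Prints nothing when HOST is empty, matching the helper rendering nothing.
build_proxy_url() {
  host="$1"; port="${2:-}"; user="${3:-}"; pass="${4:-}"
  if [ -z "$host" ]; then return 0; fi
  hostport="$host"
  if [ -n "$port" ]; then hostport="$host:$port"; fi
  cred=""
  if [ -n "$user" ]; then cred="$user:$pass@"; fi
  printf 'http://%s%s\n' "$cred" "$hostport"
}

# build_no_proxy [EXTRA]
# Operator-supplied entries are prepended to the cluster-internal defaults.
build_no_proxy() {
  extra="${1:-}"
  if [ -n "$extra" ]; then
    printf '%s,%s\n' "$extra" "$DEFAULT_NO_PROXY"
  else
    printf '%s\n' "$DEFAULT_NO_PROXY"
  fi
}

build_proxy_url proxy.example.invalid 3128 alice s3cret
build_no_proxy corp.example.invalid
```

Like the template, this interpolates the credentials verbatim. A username or password containing `@`, `:`, or `/` would need percent-encoding before it forms a valid URL, so it may be worth documenting that expectation next to `HTTP_PROXY_PASSWORD`.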
1 change: 1 addition & 0 deletions client/templates/auto-upgrade-cronjob.yaml
@@ -134,6 +134,7 @@ spec:
# our script instead.
command: ["/bin/sh", "/scripts/auto-upgrade.sh"]
env:
{{- include "tracebloc.proxyEnv" . | nindent 16 }}
- name: RELEASE_NAME
value: {{ .Release.Name | quote }}
- name: RELEASE_NAMESPACE