Sync develop → main for v1.7.0 chart release (§8.2 egress gateway, inert)#252

Merged
saadqbal merged 1 commit into main from sync/develop-to-main-v1.7.0 on Jun 12, 2026
@saadqbal commented Jun 12, 2026

Promotes chart 1.7.0 (develop → main) — the §8.2 training-pod egress gateway (tracebloc/client-runtime#102). Snapshot of develop at 2fb5b8f.

What ships — and it ships INERT

A new in-cluster squid egress gateway (egressProxy.*) plus the lockdown flags. The gateway deploys by default but carries no traffic, and the lockdown stays off by default:

  • egressProxy.enabled: true → the gateway Deployment/Service/ConfigMap deploy (idle ~50m/64Mi pod; nothing routes to it).
  • egressProxy.routeWorkloads: false → jobs-manager does not inject HTTPS_PROXY; training pods unchanged.
  • networkPolicy.training.allowExternalHttps: true → the 0.0.0.0/0:443 egress rule is kept; no lockdown.

Net effect of this release on every fleet: a new idle gateway pod and zero change to training-pod egress. The actual lockdown is a separate, per-fleet flag flip (below), not enabled here.
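
For reference, the three defaults above can be written as a values fragment. This is a minimal sketch built only from the keys and values named in this description; the chart's values.yaml remains the authoritative source and contains other egressProxy.* keys not shown here.

```yaml
# Chart 1.7.0 defaults as described in this PR (sketch, not the full values.yaml).
egressProxy:
  enabled: true           # gateway Deployment/Service/ConfigMap deploy, but sit idle
  routeWorkloads: false   # jobs-manager does not inject HTTPS_PROXY into training pods
networkPolicy:
  training:
    allowExternalHttps: true   # the 0.0.0.0/0:443 egress rule is kept (no lockdown)
```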

Validation

  • helm unittest 212/212, pytest 201/201, and the new upgrade-e2e CI gate green (installs the last published chart, upgrades through --reuse-values + --reset-then-reuse-values, confirms nil-guards hold + the flip persists).
  • Proven end-to-end on both enforcement engines: k3s/kube-router (the installer's k3d path = bare-metal/VM installs) and EKS/VPC-CNI eBPF — a real training pod reached the backend + App Insights only via the gateway and was blocked from arbitrary hosts; a real experiment completed.
  • Auto-upgrade safety verified: --reset-then-reuse-values brings in the new defaults in their inert state; the new keys are nil-guarded against --reuse-values; and the training netpol's podSelector is pinned so the lockdown can never catch the auto-upgrade/image-refresh cronjobs.

Rollout posture (do NOT flip on merge)

This release is Phase 1 — ship the mechanism inert. Enabling the lockdown is a staged, per-fleet action gated on:

Cross-repo dependency for the flip: before setting routeWorkloads=true on a fleet, that fleet's jobs-manager image must include client-runtime#103 (the EGRESS_PROXY_URL → HTTPS_PROXY injection). Inert-on-upgrade is unaffected; this only matters at flip time.
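
A per-fleet flip might look like the override below. This is a hedged sketch: the file name is hypothetical, and it assumes the lockdown consists of turning on routeWorkloads and turning off allowExternalHttps together. This description names both flags but does not spell out the exact override set or the order in which to apply them, so follow the rollout steps in docs/SECURITY.md rather than this fragment.

```yaml
# fleet-lockdown-values.yaml (hypothetical file name, for illustration only)
# Precondition from this PR: the fleet's jobs-manager image includes client-runtime#103.
egressProxy:
  enabled: true
  routeWorkloads: true        # jobs-manager points training pods at the gateway
networkPolicy:
  training:
    allowExternalHttps: false # assumption: drops the broad 0.0.0.0/0:443 egress rule
```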

Post-merge (per release runbook)

Tag + publish the GitHub Release on main → release workflow packages from the tag → gh-pages → fleets auto-upgrade at :23 (gateway lands inert).

Refs tracebloc/client-runtime#102. skip-fr-gate per release convention (matches #241/#235).


Note

Medium Risk
Changes training NetworkPolicy semantics and fleet upgrade paths; defaults keep direct HTTPS and routing off, but misconfigured per-fleet lockdown or non-enforcing CNI could break training egress.

Overview
Chart 1.7.0 adds the §8.2 in-cluster squid egress gateway (egressProxy) and the knobs to tighten training-pod egress without changing behavior on upgrade by default.

New Helm resources deploy squid (ConfigMap/Deployment/Service) with an FQDN allowlist, PSA-hardened pod spec, and optional upstream cache_peer when env.HTTP_PROXY_HOST is set. egressProxy.routeWorkloads defaults to false, so jobs-manager does not set EGRESS_PROXY_URL and training traffic is unchanged; when enabled, jobs-manager points training pods at egress-proxy-service. The training NetworkPolicy gates the broad 0.0.0.0/0:443 rule on networkPolicy.training.allowExternalHttps (default true, nil-safe for --reuse-values) and adds an explicit egress rule to the gateway pod when the proxy is enabled.
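
The NetworkPolicy gating described above could render roughly as follows. This is an illustrative sketch, not the chart's actual template output: the policy name, both sets of pod labels, and the gateway port (3128 is squid's conventional default) are assumptions, and rules the real policy would also need, such as DNS, are omitted.

```yaml
# Sketch of a rendered training NetworkPolicy; names, labels, and port are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: training-egress          # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: training              # hypothetical label; pinned so cronjobs are never selected
  policyTypes:
    - Egress
  egress:
    # Rendered when the proxy is enabled: explicit egress to the gateway pod.
    - to:
        - podSelector:
            matchLabels:
              app: egress-proxy  # hypothetical label
      ports:
        - protocol: TCP
          port: 3128             # assumption: squid's default port
    # Rendered only while networkPolicy.training.allowExternalHttps is true.
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 443
```

With allowExternalHttps set to false, the second rule would disappear and the gateway would become the only permitted HTTPS path, which is the lockdown state this PR leaves disabled.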

values.yaml, values.schema.json, and egress_proxy_test.yaml document and lock in nil-guards, fail-closed allowlisting, and netpol/jobs-manager wiring. docs/SECURITY.md describes the gated rollout and updated interim risk for Service Bus.

CI adds scripts/tests/e2e-auto-upgrade.sh and an upgrade-e2e Helm job: install last published chart on k3d, exercise --reuse-values and --reset-then-reuse-values, assert lockdown stays off until flipped, then that operator lockdown settings survive the next auto-upgrade path. Shellcheck workflows include the new script.

Reviewed by Cursor Bugbot for commit 2fb5b8f.

feat(egress-proxy): training-pod egress lockdown — squid gateway, gated rollout (client-runtime#102)
@saadqbal added the skip-fr-gate label (Bypass FR gate for this PR; use only for bootstrap or emergencies; visible in audit) Jun 12, 2026
@saadqbal self-assigned this Jun 12, 2026
@saadqbal merged commit 849a49a into main Jun 12, 2026
71 checks passed