feat(egress-proxy): deploy-time egress-enforcement pre-flight (non-blocking) [client-runtime#104]#253

Merged
saadqbal merged 2 commits into develop from feat/104-egress-enforcement-check
Jun 12, 2026

@saadqbal saadqbal commented Jun 12, 2026

Closes part of tracebloc/client-runtime#104. Chart-only.

Why

The §8.2 lockdown (networkPolicy.training.allowExternalHttps=false) only blocks egress on a CNI that enforces egress NetworkPolicy. On a non-enforcing CNI (EKS VPC-CNI without enableNetworkPolicy — hit live during the #102 dev validation) the flip is a silent no-op: the policy renders, training pods still reach the internet, the operator believes they're protected.

What

A helm test hook (templates/egress-enforcement-check.yaml, annotated helm.sh/hook: test) that renders only when the lockdown is enabled (networkPolicy.training.enabled && !allowExternalHttps && enforcementProbeHost):

  • A tracebloc.io/workload: training-labelled probe pod (so the real training-egress NetworkPolicy governs it) curls a canary host directly (--noproxy '*', 5s timeout).
  • Blocked → the CNI is enforcing → OK egress lockdown verified, test passes (exit 0). Reachable → not enforcing → loud WARNING EGRESS LOCKDOWN NOT ENFORCED … and the test fails (exit 1).
  • Never runs during install/upgrade. As a test hook it can't fail or block them — including the hourly fleet auto-upgrade. Operators run helm test <release> after flipping the lockdown for a clear pass/fail.
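The pass/fail logic in the bullets above can be sketched as a small shell script. This is a minimal sketch, not the chart's actual probe script: the names `PROBE_HOST`, `verdict`, and `run_probe` are hypothetical, the message text is paraphrased from this description, and it assumes the 5s timeout maps to curl's `--max-time` and that any non-zero curl exit counts as "blocked".

```shell
#!/bin/sh
# Hypothetical probe sketch; PROBE_HOST stands in for enforcementProbeHost.
PROBE_HOST="${PROBE_HOST:-1.1.1.1}"

# Map curl's exit status to the test verdict.
# Assumption: every non-zero curl exit is treated as "egress blocked".
verdict() {
  if [ "$1" -eq 0 ]; then
    # The canary was reachable, so the CNI is not enforcing the policy.
    echo "WARNING EGRESS LOCKDOWN NOT ENFORCED (reached ${PROBE_HOST})"
    return 1
  fi
  echo "OK egress lockdown verified (${PROBE_HOST} unreachable)"
  return 0
}

# Curl the canary directly, bypassing any configured proxy, then report.
run_probe() {
  curl --silent --output /dev/null --noproxy '*' --max-time 5 \
    "https://${PROBE_HOST}"
  verdict "$?"
}
```

Calling `run_probe` as the container's last command would make its return status the pod's exit code, which is what `helm test` reports as pass or fail.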

Scope / design

  • Chart-only. No backend, no client-runtime, no jobs-manager image dependency. Enforcement is treated as a per-fleet prerequisite, so no central audit is built — this just gives operators a deliberate pass/fail check when they enable the lockdown.
  • Ships dormant: gated on the lockdown being on (default off), so it renders nothing on today's fleet; it renders only on clusters that have flipped allowExternalHttps=false.
  • networkPolicy.training.enforcementProbeHost (default 1.1.1.1; "" disables — e.g. air-gapped) + schema. Gate dig-defaults the host to 1.1.1.1, so a --reuse-values upgrade from a release predating the key still renders the check.
  • curl probe image digest-pinned (multi-arch amd64+arm64), PSA-restricted, runAsUser: 100 (curlimages/curl's non-numeric user needs an explicit uid, as learned during the #102 EKS probe).
  • Chart 1.7.0 → 1.7.1 (version + appVersion lockstep).
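The render gate described above (lockdown on, host defaulted when the key is absent, explicit "" disabling) can be illustrated in shell. This is only an analogy for the template logic, not the chart's code: `should_render` and `ENFORCEMENT_PROBE_HOST` are hypothetical names, and the `${VAR-default}` expansion (no colon) stands in for the `dig` default because it substitutes only when the variable is unset, not when it is empty.

```shell
#!/bin/sh
# should_render TRAINING_ENABLED ALLOW_EXTERNAL_HTTPS
# Reads ENFORCEMENT_PROBE_HOST from the environment (hypothetical name):
#   unset -> key absent from values (e.g. --reuse-values from an older release)
#   ""    -> operator explicitly disabled the check (e.g. air-gapped)
should_render() {
  enabled="$1"
  allow_https="$2"
  # No-colon form: default applies only when the variable is unset,
  # so an explicit empty string is preserved and disables the check.
  host="${ENFORCEMENT_PROBE_HOST-1.1.1.1}"
  [ "$enabled" = "true" ] && [ "$allow_https" = "false" ] && [ -n "$host" ]
}
```

With the variable unset, `should_render true false` succeeds because the host falls back to 1.1.1.1; with `ENFORCEMENT_PROBE_HOST=""` the same call fails, matching the "explicit empty string disables" behaviour.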

Tests

helm lint + helm unittest 248/248 — new egress_enforcement_check_test.yaml asserts the hook renders only when the lockdown is on (+ host set), is absent otherwise (default / training disabled / empty host), carries helm.sh/hook: test, and that the probe pod is training-labelled, PSA-restricted, digest-pinned, exits 1 on non-enforcement, and curls the configured host with --noproxy.

e2e (manual, per-fleet)

After enabling allowExternalHttps=false, run helm test <release>: on a non-enforcing CNI the probe reaches the host and the test FAILS with the WARNING block; on an enforcing CNI egress is blocked and the test PASSES (OK). Inspect the probe output with kubectl logs job/<release>-egress-enforcement-check.
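The manual check can be wrapped in a small helper for a runbook. This is a sketch under assumptions: `verify_lockdown` is a hypothetical function name, the namespace argument and the summary messages are illustrative, and it assumes the hook Job is still present when the logs are fetched.

```shell
#!/bin/sh
# verify_lockdown RELEASE [NAMESPACE]
# Runs the chart's test hook and, on failure, prints the probe's output.
verify_lockdown() {
  release="$1"
  namespace="${2:-default}"
  if helm test "$release" --namespace "$namespace"; then
    echo "lockdown enforced for ${release}"
    return 0
  fi
  # Show the WARNING block from the Job named in this PR; ignore a
  # missing Job so the summary line below is always printed.
  kubectl logs "job/${release}-egress-enforcement-check" \
    --namespace "$namespace" || true
  echo "lockdown NOT verified for ${release}"
  return 1
}
```

For example, `verify_lockdown my-release tracebloc` (both names are placeholders) returns non-zero when the CNI is not enforcing, so it can gate a later runbook step.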

Note: drain/sequencing on flip/rollback (the in-flight-pod disruption seen during dev validation) is a separate operational/runbook item, not addressed here.

🤖 Generated with Claude Code


Note

Low Risk
Chart-only, dormant by default (allowExternalHttps stays true); the probe runs only on explicit helm test after lockdown is enabled.

Overview
Adds an optional helm test Job that verifies the §8.2 training egress lockdown is actually enforced by the CNI when networkPolicy.training.allowExternalHttps is false.

The new template renders only when training NetworkPolicy is enabled, lockdown is on, and enforcementProbeHost is non-empty (default 1.1.1.1; "" disables for air-gapped). A tracebloc.io/workload: training probe curls the host with --noproxy; blocked egress passes the test, reachable HTTPS fails with loud warnings. The Job is annotated helm.sh/hook: test, so it does not run on install/upgrade—operators run helm test after flipping the lockdown.

Also documents networkPolicy.training.enforcementProbeHost in values.yaml / values.schema.json, adds egress_enforcement_check_test.yaml (render guards, PSA-hardened probe, digest-pinned curl), and bumps the chart to 1.7.1.

Reviewed by Cursor Bugbot for commit 00104c2.

…ocking) [#104]

When the §8.2 lockdown is enabled (networkPolicy.training.allowExternalHttps=false)
but the cluster's CNI doesn't actually enforce egress NetworkPolicy, the lockdown is
a silent no-op (false sense of security — hit on EKS VPC-CNI during the #102 dev
validation). This adds a post-install/post-upgrade Helm hook that, in that case, runs
a tracebloc.io/workload=training-labelled probe which curls a canary host DIRECTLY:
reachable => the CNI isn't enforcing => logs a loud WARNING. Always exits 0
(non-blocking), so it never fails an upgrade or the hourly auto-upgrade.

- gated on enabled && !allowExternalHttps && enforcementProbeHost -> ships DORMANT
  (does nothing until a fleet flips the lockdown).
- new values networkPolicy.training.enforcementProbeHost (default 1.1.1.1; "" disables,
  e.g. air-gapped); registered in values.schema.json.
- curl probe image digest-pinned multi-arch, PSA-restricted, runAsUser 100.
- helm-unittest egress_enforcement_check_test.yaml (gating both ways + shape); 248/248.
- Chart 1.7.0 -> 1.7.1 (version + appVersion lockstep).

Chart-only; no backend / client-runtime changes (enforcement treated as a per-fleet
prerequisite, so no central audit needed). Refs tracebloc/client-runtime#104, #102.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this Jun 12, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 824adac.

Comment thread client/templates/egress-enforcement-check.yaml Outdated
Comment thread client/templates/egress-enforcement-check.yaml
…blocking helm test [#104]

- Gate default for enforcementProbeHost "" -> "1.1.1.1" (matches values.yaml + $host),
  so `helm upgrade --reuse-values` from a release predating the key still runs the
  check instead of silently skipping it. Explicit "" still disables it.
- Convert the post-install/post-upgrade hook to a `helm test` hook. A post-upgrade
  hook fails the release if the probe pod can't run (image pull / PSA / OOM) regardless
  of the script's exit 0 — which would block the hourly auto-upgrade (not truly
  non-blocking). As a test hook it never runs during install/upgrade, so it can't
  block them, and it FAILS (exit 1) on non-enforcement — run `helm test <release>`
  after flipping the lockdown for a clear pass/fail.

helm-unittest 248/248; lint clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4 participants