Skip to content

kubectl yconverge checks DAG and kustomize base traversal#76

Merged
solsson merged 69 commits intomainfrom
y-converge-checks-dag
May 3, 2026
Merged

kubectl yconverge checks DAG and kustomize base traversal#76
solsson merged 69 commits intomainfrom
y-converge-checks-dag

Conversation

@solsson
Copy link
Copy Markdown
Collaborator

@solsson solsson commented Apr 22, 2026

Replaces #74.

y-cluster-provision and y-cluster-converge-ystack

New --converge=LIST flag replaces the broken --exclude flag.
Specify which ystack bases to converge as a comma-separated list of
names without number prefix:

# Default (minimal): y-kustomize, blobs, builds-registry
y-cluster-provision

# With kafka and buildkit
y-cluster-provision --converge=y-kustomize,blobs,builds-registry,kafka,buildkit

# Direct converge (same syntax)
y-cluster-converge-ystack --context=local --converge=kafka,builds-registry

Available targets: y-kustomize, blobs, builds-registry, kafka,
buildkit, monitoring, prod-registry.

Dependencies are resolved automatically — converging builds-registry
pulls in blobs, y-kustomize, and kafka-ystack.

New --dry-run=server passthrough — verify what would be applied
without mutating the cluster:

y-cluster-provision --converge=kafka --dry-run=server

Registry mirrors

Drops /etc/hosts hacks on nodes. The registries.yaml config
now uses the magic ClusterIPs (10.43.0.50 for builds-registry,
10.43.0.51 for prod-registry) read from the source-of-truth YAML
files. Containerd resolves registries without needing DNS or host
file entries.

The qemu provisioner verifies registry access after converge:

[y-cluster-provision-qemu] Verifying containerd registry access ...
  builds-registry: OK

Image caching

y-image-cache-ystack and y-image-list-ystack now accept the same
--converge=LIST flag. The provisioner passes its converge targets
through, so all images for the selected bases are pre-cached before
converge starts.

# List images that would be cached
y-image-list-ystack --converge=y-kustomize,blobs,builds-registry,kafka,buildkit

kubectl-yconverge

NAMESPACE exported to check commands: Exec checks in
yconverge.cue can use $NAMESPACE (the resolved namespace) and
$CONTEXT in their commands. NS_GUESS is no longer exported.

Multi-dir CUE aggregation via kustomize-traverse: The old 1-level
single-directory heuristic for finding yconverge.cue is replaced by
kustomize-traverse, which walks the full kustomization tree. Checks
from all bases are collected and run after apply. This fixes the case
where site-apply-namespaced/ references multiple base directories.

K3s version

Upgraded from v1.35.1+k3s1 to v1.35.3+k3s1.

New binary: kustomize-traverse

Added as a y-bin managed tool. Walks kustomization directory trees and
reports local directories visited and resolved namespace:

y-kustomize-traverse -o dirs gateway-v4/site-apply-namespaced/
y-kustomize-traverse -o namespace gateway-v4/site-apply-namespaced/

Breaking changes

  • --exclude flag removed from provisioners (was never implemented
    in y-cluster-converge-ystack)
  • NS_GUESS no longer exported — use $NAMESPACE in check commands
  • y-image-list-ystack no longer reads a BASES array from
    y-cluster-converge-ystack — uses --converge flag instead

solsson and others added 24 commits April 16, 2026 12:24
kubectl plugin that wraps kustomize apply with idempotent converge-mode
label routing (create, replace, serverside, serverside-force) and
post-apply checks defined in yconverge.cue files using a CUE schema.

Check types: #Wait (kubectl wait), #Rollout (rollout status), #Exec
(arbitrary command with retry-until-timeout). Checks are defined per
kustomization in a yconverge.cue file; the framework finds them via
1-level single-directory indirection through kustomization.yaml
resources, ignoring sibling file resources.

Dependency resolution walks CUE imports to build a topological apply
order. Shared check definitions live in pure-CUE packages (no
kustomization.yaml) that the dep walker ignores.

Modes: apply (default), --diff=true, --checks-only, --print-deps.
Apply modifiers: --dry-run=server|none, --skip-checks. Dry-run
forwards to both kubectl apply and delete so replace-mode resources
are provably non-mutating. Invalid flag combinations fail up front.

Namespace for checks resolves from: -n CLI arg > outer kustomization
namespace > indirected base namespace > context default. Exported as
$NS_GUESS for exec checks alongside $CONTEXT.

Error tolerance uses exact criteria: each kubectl step declares the
specific error substrings it tolerates (AlreadyExists, no objects
passed to apply, No resources found) — anything else surfaces raw.

Integration tests run a kwok cluster in Docker with a fake node for
pod scheduling. Covers: schema validation, dep resolution, indirection,
converge-mode labels, broken-cue rejection, --skip-checks negative,
replace-mode dry-run UID preservation, shared checks across db variants
(single/distributed), and a PDB safety check demonstrating prod→qa
failure detection.

CI workflow renamed from "lint" to "checks" to reflect the itest job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

s
Remove `up` and `namespaceGuess` from verify.#Step. Both were
"set by the engine, not by user CUE files" — but the engine never
set them either. `up` was designed for a CUE-native orchestrator
where CUE's evaluation order needed a data dependency to serialize
steps; the shell-based dep walker serializes via a for-loop instead.
`namespaceGuess` is handled entirely as the shell variable $NS_GUESS.
No yconverge.cue file in the repo references either field.

New test: verify dependency checks serialize before downstream steps.
Captures the multi-step output of example-with-dependency and asserts
line ordering — namespace check completes before configmap step starts,
configmap check completes before with-dependency step starts. This is
the guarantee `up` was meant to provide, now proven by the shell
execution model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provisioners (qemu, k3d) run kubectl yconverge for gateway-api and
gateway before --skip-converge exit. Gateway API is infrastructure
assumed present by all functional bases.

Remove gateway imports from 29-y-kustomize and 20-gateway DAG.
Keep all Traefik checks in 40-kafka-ystack — they verify the
complete path kustomize uses for HTTP resources.

Use -write instead of --ensure for /etc/hosts to fix stale entries
from previous provisioner sessions.

E2e: replace y-cluster-provision reprovision with explicit yconverge
calls for monitoring and idempotency proof.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gateway step's /etc/hosts update runs before any HTTPRoutes exist.
The y-kustomize step creates an HTTPRoute, so /etc/hosts needs updating
afterward for kustomize HTTP resource resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace API proxy checks (kubectl get --raw .../proxy/...) with curl
checks using the exact URL that kustomize HTTP resources reference:
  http://y-kustomize.ystack.svc.cluster.local/v1/.../base-for-annotations.yaml

This is the path kustomize actually uses. If curl succeeds, kustomize
will resolve the resource. The API proxy path has different failure
modes (endpoint readiness timing) that don't predict kustomize success.

30-blobs-ystack: add blobs content check after restart (was missing).
40-kafka-ystack: kafka base gets 120s timeout (newly mounted secret),
  blobs base gets 60s (already mounted from previous step).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The y-k8s-ingress-hosts -write command replaces the managed block in
/etc/hosts. When called before HTTPRoutes exist (during provisioning),
it wrote an empty block — clearing previous entries. This caused curl
checks to fail with "Could not resolve host" instead of the assumed
secret propagation delay.

Fix: skip -write when no ingress/gateway entries are found, preserving
existing /etc/hosts entries from earlier steps.

With /etc/hosts stable, y-kustomize restart + content availability
takes ~4 seconds (secret volume is fresh on new pod). Reduce check
timeouts from 120s to 30s.

Root cause confirmed: Kubernetes secret volume mounts are instant on
new pods. The 60-120s delay from docs applies only to volume UPDATES
on running pods (kubelet sync interval). Restarts create new pods
with fresh mounts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new y-kustomize binary watches secrets labeled
yolean.se/module-part=y-kustomize via the Kubernetes API and serves
their content at /v1/{group}/{name}/{key}. Secret changes are
reflected instantly — no pod restart or kubelet volume sync needed.

This eliminates the dual-restart problem where the second restart
lost the first secret's volume mount for 60-120s due to kubelet's
sync interval.

Changes:
- y-kustomize/cmd/: Go binary with secret watch, HTTP server, tests
- y-kustomize/rbac.yaml: ServiceAccount + Role for secret list/watch
- y-kustomize/deployment.yaml: new image, removed volume mounts
- Secret labels: yolean.se/module-part changed from config to y-kustomize
- Init secrets get the label for consistent watch matching
- blobs-ystack/kafka-ystack: remove restart checks, keep content checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
contain: Go binary from turbokube/contain releases, added to
y-bin.runner.yaml with y-contain wrapper.

y-kustomize build:
  contain.yaml: distroless/static:nonroot base, single Go binary layer
  skaffold.yaml: custom builder using go build + contain, OCI output
  No Docker required. No push for local dev.

y-image-cache-load: add help section, fix lint warnings.

Local workflow:
  cd y-kustomize/cmd
  go build + contain build → target-oci/
  y-image-cache-load to get into cluster

CI workflow:
  Same contain.yaml with --push for ghcr.io

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Init secrets get yolean.se/converge-mode: create label so re-converge
doesn't overwrite secrets that have been populated by blobs-ystack
or kafka-ystack. The watch-based y-kustomize reacts to secret content
changes — empty secrets cause 404.

y-cluster-local-ctr: add qemu case using SSH, matching the provisioner's
existing SSH connection pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The watch-based y-kustomize reads secrets via the Kubernetes API.
It doesn't need empty placeholder secrets to start — it starts with
an empty file map and picks up secrets as they're created by
blobs-ystack and kafka-ystack.

Removes the init step and the dependency from 29-y-kustomize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds y-kustomize job to images workflow:
  go build + contain build --push to ghcr.io/yolean/y-kustomize:$SHA

Temporarily triggers on y-converge-checks-dag branch pushes.
Push will fail on YoleanAgents fork (no ghcr.io/yolean write access)
but validates the build succeeds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghcr.io/yolean/y-kustomize:c55953b69f74067043f2351f8727ea84db1737ca
@sha256:e44f99f6bbae59aef485610402c8f3f0125e197fff8616643bd4d5c65ce619e1

Built by GHA images workflow. k3s pulls from ghcr.io on deploy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom builder: go build + contain tarball + ctr import into cluster.
Deploy hook restarts y-kustomize after image load.
No Docker daemon needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore env -i for acceptance test reproducibility.

Registry rollout timeout increased to 120s — first deploy pulls
the image from ghcr.io which can exceed 60s on cold cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registry timeout was a transient issue, not a real problem.
Restore clean env (env -i) for acceptance test reproducibility.

e2e passes: 36/36 checks with clean env on fresh cluster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl-yconverge resolves k3s/ paths relative to cwd. Provisioners
are called from other repos (checkit) where k3s/ doesn't exist.
Use subshell cd to ensure correct path resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl writes contexts/clusters/users: null instead of [] when the
last item is removed. kubie rejects this as invalid YAML. Fix by
replacing null with empty list after context deletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-cluster-converge-ystack accepts --converge=LIST (comma-separated
base names without number prefix). Replaces the broken --exclude flag.
Default: y-kustomize,blobs,builds-registry. Both provisioners pass
--converge and --dry-run through.

y-image-list-ystack and y-image-cache-ystack accept the same flag.
The provisioner passes its converge targets so all images are pre-cached.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-registry-config reads magic ClusterIPs from the source-of-truth
YAML files instead of using hostnames. Containerd resolves registries
without /etc/hosts hacks on nodes. Qemu provisioner verifies registry
access after converge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lint y-cluster-converge-ystack, y-image-list-ystack, and
kubectl-yconverge with zero failures required before running
integration tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NS_GUESS remains internal. Only NAMESPACE is exported to exec check
commands. wait/rollout checks also use NAMESPACE as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kustomize-traverse walks kustomization directory trees using the
kustomize API types. Replaces the bash _find_cue_dir single-dir
heuristic with full tree traversal. Checks from all bases are
aggregated. Also used for namespace resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@solsson solsson changed the title y-converge checks DAG and kustomize base traversal kubectl yconverge checks DAG and kustomize base traversal Apr 22, 2026
@solsson solsson force-pushed the y-converge-checks-dag branch from 8d1dd15 to fe5e0f8 Compare April 23, 2026 04:48
Comment thread .github/workflows/images.yaml Outdated
push:
branches:
- main
- y-converge-checks-dag
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove workflow test

Comment thread bin/y-cluster-provision-qemu Outdated
--converge=LIST comma-separated k3s bases to converge (default: y-kustomize,blobs,builds-registry)
--skip-converge skip converge and post-provision steps
--skip-image-load skip image cache and load into containerd
--dry-run=MODE forward to kubectl-yconverge (server|none)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--dry-run doesn't make sense for provision. There's --skip-converge for that.

Comment thread bin/y-cluster-provision-qemu Outdated
# Fix kubectl writing null instead of [] when last item is removed
sed -i 's/^contexts: null$/contexts: []/' "$KUBECONFIG" 2>/dev/null
sed -i 's/^clusters: null$/clusters: []/' "$KUBECONFIG" 2>/dev/null
sed -i 's/^users: null$/users: []/' "$KUBECONFIG" 2>/dev/null
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this if only one provisioner does it. It's only kubie that can't handle null so we can consider that an issue with kubie.

Comment thread bin/y-cluster-provision-k3d Outdated
# Gateway API is always set up, even with --skip-converge.
export OVERRIDE_IP=${YSTACK_PORTS_IP:-127.0.0.1}
(cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/10-gateway-api/)
(cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/20-gateway/)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way of calling yconverge is poor DX. If the script can run from any CWD we should use $YSTACK_HOME/k3s/20-gateway/ as the base path (or use a root derived from the script invocation). yconverge should not require that it's invoked from ystack root. Also it's a kubectl plugin so kubectl yconverge should work.

Comment thread bin/y-contain
set -e
YBIN="$(dirname $0)"

version=$(y-bin-download $YBIN/y-bin.optional.yaml contain)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the declaration remain in optional? if so clean up.

Comment thread bin/y-image-list-ystack Outdated
kubectl kustomize "$d" 2>/dev/null \
| grep -oE 'image:\s*\S+' \
| sed 's/image:[[:space:]]*//' \
|| true # y-script-lint:disable=or-true # kustomize may fail for bases requiring y-kustomize HTTP
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good enough reason for ignoring errors. If you're talking about transient HTTP errors they're either rare and should propagate (if it's github, provision will likely fail anyway) or frequent in which case they should propagate too so we learn that we should redesign.

Comment thread bin/kubectl-yconverge Outdated
grep '"yolean.se/ystack/' "$1" 2>/dev/null \
| grep -v '"yolean.se/ystack/yconverge/verify"' \
| sed 's|.*"yolean.se/ystack/\([^":]*\).*|\1|' \
|| true # y-script-lint:disable=or-true # no imports is valid
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other errors will this silently swallow? Do we have test coverage?

Comment thread bin/kubectl-yconverge Outdated

if [ -z "$_YCONVERGE_RESOLVING" ] && [ -n "$KUSTOMIZE_DIR" ]; then
deps=$(_resolve_deps "$KUSTOMIZE_DIR")
dep_count=$(printf '%s\n' "$deps" | grep -c . 2>/dev/null) || true # y-script-lint:disable=or-true # grep -c . exit 1 = zero matches
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find a better way to detect zero matches than to skip errors

Comment thread bin/y-cluster-provision-k3d Outdated
@@ -134,10 +142,7 @@ else
y-image-cache-load-all </dev/null || true
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we ignore errors here? Why didn't y-script-lint prevent this?

Comment thread bin/y-cluster-provision-qemu Outdated
echo "[y-cluster-provision-qemu] Loading images ..."
y-image-cache-ystack </dev/null
y-image-cache-ystack --converge=$CONVERGE_TARGETS </dev/null
y-image-cache-load-all </dev/null || true
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yolean k8s-qa and others added 2 commits April 23, 2026 05:43
- Remove workflow test changes from images.yaml
- Remove --dry-run from provisioners (use y-cluster-converge-ystack directly)
- Remove kubie null workaround from qemu teardown
- Use absolute paths for yconverge calls (no cd to YSTACK_HOME)
- y-image-list-ystack: let kustomize errors propagate
- kubectl-yconverge: replace grep -c with wc -l, guard file existence
  in _find_imports, use || : for legitimate empty-string fallbacks
- y-cluster-converge-ystack: use absolute paths in _resolve_target

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl kustomize fails for bases that reference HTTP resources
(e.g. y-kustomize served content) when the cluster isn't running.
Skip with a diagnostic message instead of failing the entire
image caching step. These images are pulled during converge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Yolean k8s-qa and others added 14 commits April 30, 2026 15:04
Split the previously bundled Secret + Job into two y-kustomize-served
bases so per-bucket consumer kustomizations can `nameSuffix:` the Job
without renaming the per-namespace prerequisites:

  /v1/blobs/setup-bucket-prep/base-for-annotations.yaml   <-- new
    ServiceAccount setup-bucket
    Role setup-bucket (secrets: create, get, patch, update)
    RoleBinding setup-bucket
    Secret bucket (versitygw default endpoint + EXAMPLE creds)

  /v1/blobs/setup-bucket-job/base-for-annotations.yaml   <-- shape change
    Job setup-bucket
      initContainer mc:    minio/mc, mc mb only (versitygw: no events,
                           no anonymous policy -- versitygw doesn't
                           implement those S3 ops)
      container     secret: ghcr.io/yolean/curl, SSA-PATCHes a Secret
                           named via yolean.se/secret-name annotation
                           with endpoint + bucket + creds for downstream
                           Deployments to mount

The Job's pod-template annotations carry the consumer-supplied parameters:
  yolean.se/bucket-name   shell template; ${NAMESPACE} expands at runtime
  yolean.se/secret-name   the consumer Secret this Job upserts

The annotation surface is shared between impls; the upcoming minio
counterpart will run the full mc command set (events + anonymous) on
the same annotations and produce the same shape of consumer Secret.

The annotation surface intentionally has no events-arn knob -- the ARN
is the impl's own concern (minio uses arn:minio:sqs::_:kafka, versitygw
has none).

k3s/30-blobs-ystack now converges both bases so y-kustomize watches both
Secrets.

Drops the legacy nodeSelector yolean.se/cluster=local from the Job; the
new pattern runs the Job in arbitrary consumer namespaces, so the
nodeSelector would prevent scheduling in any non-local cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors blobs-versitygw/setup-bucket-prep-y-kustomize. The new base
serves Secret y-kustomize.kafka.setup-topic-prep at
/v1/kafka/setup-topic-prep/base-for-annotations.yaml carrying the
ServiceAccount + Role + RoleBinding the existing setup-topic Job
already references via serviceAccountName: setup-topic.

The previous topic-job-rbac base only existed inside the ystack
namespace (applied via k3s/40-kafka). Consumers outside that namespace
(e.g. checkit's keycloak-v3 / dev / per-site namespaces) had no way to
pull it without copying or symlinking, which is what made
keycloak-v3's setup-topic-events Job stuck at "FailedCreate:
serviceaccount setup-topic not found" yesterday and what caused
checkit's per-site setup-topic-* Jobs to never schedule a Pod today.

No `bootstrap` Secret in this prep base: topics in sitevalues already
carry bootstrap via site-chart's settings-sitevalues template, and
topics outside sitevalues can pass bootstrap directly via the
yolean.se/kafka-bootstrap annotation on the per-topic kustomization.

k3s/40-kafka-ystack now converges both bases so y-kustomize watches
both Secrets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-name

The setup-bucket-job y-kustomize URL (in registry/builds-bucket) and the
setup-topic-job URL (in registry/builds-topic) no longer carry the
ServiceAccount/Role/RoleBinding inline -- those moved to the prep URLs
to avoid per-Job rename collisions. Without explicitly pulling the prep
URLs into the ystack namespace, the Jobs hang with "serviceaccount
setup-bucket / setup-topic not found" and the registry deployment
never finds its credentials Secret.

- Add registry/builds-prep/ that fetches both prep URLs into ystack ns
- Wire it into k3s/60-builds-registry/kustomization.yaml
- Add the missing yolean.se/secret-name annotation in builds-bucket so
  the secret container in setup-bucket Job writes the consumer Secret
  the registry deployment expects (builds-registry-bucket)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the host wrapper (bin/y-bin.runner.yaml) and the in-cluster
y-kustomize Deployment (y-kustomize/y-kustomize-deployment.yaml) to
the same y-cluster release. v0.3.3 ships:

- yconverge progress headers + CWD-relative paths in the dep/target
  log lines.
- Restored yolean.se/converge-mode label routing (create / replace /
  serverside / serverside-force) lost in the early v0.3.x line.
- post-drop-client-go internals; faster cold start.
- gateway config knobs (gateway.skip, gateway.className) and the
  yolean.se/dns-hint-ip annotation on the installed GatewayClass --
  used by later commits in this branch.

Both pins land at the same SHA so a fresh provision and the
in-cluster Deployment serve from one binary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OVERRIDE_IP=127.0.0.1 env-var chain (acceptance script ->
yconverge.cue annotate-on-Gateway -> y-k8s-ingress-hosts annotation
fallback) made the host-loopback fact look like a per-cluster
operator knob. y-cluster v0.3.3 publishes the host-side dial IP as
yolean.se/dns-hint-ip on the installed GatewayClass; this branch
adopts that contract end to end:

- e2e/agents-clusterautomation-acceptance-linux-amd64.sh: drop
  `export OVERRIDE_IP=127.0.0.1`. No operator-side setting.
- k3s/20-gateway/yconverge.cue: drop the exec check that wrote
  yolean.se/override-ip onto the Gateway from the env var.
- bin/y-k8s-ingress-hosts: rewrite the resolution chain to walk
  Gateway/ystack -> spec.gatewayClassName -> GatewayClass
  metadata.annotations[yolean.se/dns-hint-ip], with the legacy
  yolean.se/override-ip Gateway annotation as a fallback for
  environments that haven't migrated yet. The deprecated
  -override-ip flag remains as --host-ip with a deprecation log
  line, so callers passing it explicitly keep working for one
  cycle.
- gateway/gateway.yaml + k3s/20-gateway/yconverge.cue comment:
  rename gatewayClassName from `eg` to `y-cluster` to match the
  new y-cluster default GatewayClass name (eg was an
  implementation detail; y-cluster names the cluster role).

The provisioner-published annotation is the single source of truth
for the host-routable IP; consumer tooling reads it and writes
/etc/hosts without operator intervention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Containerd on the node can't resolve *.svc.cluster.local at image
pull time, so workloads referencing
prod-registry.ystack.svc.cluster.local/yolean/... images would
ImagePullBackOff on a fresh local-qemu or local-docker cluster.

Add a registries: block to both cluster-configs/local-{qemu,docker}/
y-cluster-provision.yaml that maps the in-cluster registry hostnames
to the magic ClusterIPs that 60-builds-registry/61-prod-registry pin
and y-cluster-validate-ystack asserts. The mirror is node-side, so
the same block applies to both providers. y-cluster v0.3.2+ writes
this verbatim to /etc/rancher/k3s/registries.yaml on the node before
k3s starts.

ystack's own acceptance was blind to this gap because its registry
verification goes through the kubectl API proxy and its build path
goes through in-cluster buildkit. checkit (and any real-workload
consumer) needs the mirror -- see specs/ystack/CLUSTER_CONFIG_REGISTRIES_BLOCK.md
which retires checkit/bin/y-cluster-local-registries-yaml in the
same beat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The y-build flow already proves buildctl-over-GRPCRoute and
buildkitd-pushes-to-builds-registry, but both paths travel through
in-cluster Service ClusterIP and don't need the node-side mirror.
The registries.yaml mirror configured by cluster-configs/*/
y-cluster-provision.yaml is exercised only when containerd on the
node resolves *.svc.cluster.local at image-pull time -- nothing in
this script did that.

After the y-build push, schedule a one-shot Pod that pulls the
just-pushed image with imagePullPolicy=Always and asserts
condition=Ready within 60s. Catches ImagePullBackOff if the
registries.yaml mirror is missing or the magic ClusterIPs drift.

Wait for Ready (not Succeeded) because the test image is built FROM
ghcr.io/yolean/static-web-server, a distroless image whose
entrypoint is `sws` -- it serves HTTP forever, never exits. Pod
Ready under restartPolicy=Never fires when the container is
Running, which is the minimum signal we need ("containerd resolved
+ pulled via the mirror"). The post-run delete cleans up.

Also annotates two pre-existing `|| true` sites per y-script-lint;
unrelated to the new check, but the file is now lint-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opt-in: when set, suppresses teardown on non-zero exit so the qemu
VM stays up for kubectl / ssh post-mortem. Default behavior is
unchanged (teardown on every EXIT).

cleanup() carries a forward-looking note: the intended *future*
default is a "keep cluster on failure for N minutes, then teardown"
mode -- a post-mortem window without leaving stale VMs around
forever. That timed-keep is not implemented in this commit;
--keep-on-failure is the manual opt-in until it lands.

Also refreshes the inline comment block describing how host
reachability flows: the previous mention of --node-external-ip is
obsolete; v0.3.3 publishes the host-side dial IP via the
yolean.se/dns-hint-ip annotation on the GatewayClass, which
y-k8s-ingress-hosts walks via gatewayClassName.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settles y-kustomize:8944 as the canonical host:port everywhere
(both in-cluster and locally), and routes the host-side path to the
in-cluster Deployment instead of a host-local serve.

In-cluster path:
- y-kustomize Service: LoadBalancer on port 8944 (targetPort 8944).
  ServiceLB binds 0.0.0.0:8944 on the node.
- y-kustomize HTTPRoute: backendRefs[].port=8944. Acts as a "dummy"
  hostname registration -- y-k8s-ingress-hosts discovers the
  y-kustomize hostname via the route, but actual traffic uses
  ServiceLB:8944 directly. The HTTPRoute also keeps Gateway:80
  routing functional for any consumer that prefers it.

Host -> in-cluster bridge:
- cluster-configs/local-qemu/y-cluster-provision.yaml: add
  host:8944 -> guest:8944 to PortForwards (replaces the default
  6443/80/443 wholesale -- y-cluster's PortForwards is
  spell-it-all-out). With /etc/hosts mapping y-kustomize ->
  127.0.0.1, kustomize-build's fetches of http://y-kustomize:8944/...
  resolve to ServiceLB on the node.

Consumers and probes restored to :8944:
- k3s/{29-y-kustomize,30-blobs-ystack,40-kafka-ystack}/yconverge.cue
  probes use http://y-kustomize:8944/...
- kafka/validate-topic, registry/{builds-bucket,builds-topic,builds-prep}
  resources URLs use :8944.
- 29-y-kustomize/yconverge.cue drops the 20-gateway dep that the
  prior Gateway-routed probe needed.
- Doc-comment URLs in served bases (setup-{bucket,topic}{,-prep}-y-kustomize)
  match the canonical address.

Acceptance script changes:
- Drop `y-cluster serve ensure -c y-kustomize/` (no host-local
  serve in the acceptance flow).
- Drop `y-cluster serve stop` from the default cleanup body.
- After teardown, probe :8944 with `ss -lnt`; if anything is still
  listening (e.g. a downstream user's host-local serve),
  best-effort `y-cluster serve stop` so the next provision's
  hostfwd can bind. Diagnostic only -- the binding might be
  something else entirely.

Host-local serve preserved for downstream users:
- y-kustomize/y-cluster-serve.yaml: lists all four sources (the
  *-prep variants were missing previously, which made
  http://y-kustomize:8944/v1/{group}/setup-*-prep/... return 404
  -- and kustomize 5.7.1 then misclassifies the failed response as
  a git URL). The config exists so `y-cluster serve -c y-kustomize/`
  works on developer laptops without a cluster.
- bin/acceptance-y-kustomize-local: standalone OS/arch-neutral test
  for the host-local path. Boots `y-cluster serve` against a temp
  state-dir, asserts /health reports routes=4, fetches each of the
  four expected URLs and grep-validates the YAML response. No
  qemu, no docker, no kubectl -- catches future drift in
  y-cluster-serve.yaml without spinning a cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
blobs-versitygw/standalone/deployment.yaml flips the image pin to
versity/versitygw:v1.4.1@sha256:0400cb59...

.github/workflows/images.yaml grows a versitygw mirror step in the
same shape as the other hub mirrors: yq extracts the tag from the
deployment manifest (post-tag, pre-digest), crane copies
docker.io/versity/versitygw:$TAG to ghcr.io/yolean/versitygw:$TAG
on every main push.

Verified end-to-end: ad-hoc provision + yconverge k3s/30-blobs/
rolls out v1.4.1, y-cluster-blobs ls works against it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single line of stdout output, prefixed with the CLI name and
the host count, just before the wrapper exec's the underlying Go
binary in write mode:

    y-k8s-ingress-hosts: writing 4 host entries to /etc/hosts

Visible in converge logs so it's clear when (and how many)
HTTPRoute/GRPCRoute hostnames the wrapper materialized into
/etc/hosts. Useful as a yconverge-trace breadcrumb -- the
20-gateway and 29-y-kustomize phases both invoke this on
provisions.

The line only fires when PASSTHROUGH carries -write -- preview /
check / no-routes paths stay quiet (they already echo their own
"# /etc/hosts is up to date" / "# no entries" diagnostics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dep-ordering assertion's _cm_check looked for the literal
"configmap exists" string -- the description from example-configmap's
yconverge.cue exec check. y-cluster v0.3.3 stopped echoing check
descriptions to stdout (it now prints "yconverge check N/N exec"
markers instead), so the grep silently returned 0 matches.

Under set -eo pipefail, the empty `$(grep ... | head -1 | cut ...)`
substitution exits 1 (because grep with no match exits 1), which
trips set -e and exits the script silently -- before the FAIL echo
runs. CI shows a failed itest with no diagnostic.

Replace the description grep with a structural one: the first
"yconverge check ... exec" line in the output is example-configmap's
exec check (example-namespace's check is kind=wait, not exec). The
remaining ordering assertion (_cm_check < _wd_step) gates the
sequential walk through the dep chain unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson solsson added the e2e-cluster Triggers long-running cluster acceptance tests on push to labeled PR label Apr 30, 2026
Yolean k8s-qa and others added 7 commits April 30, 2026 13:35
GitHub deprecates Node.js 20 actions starting June 2026 (default
flips) and removes Node.js 20 from runners September 2026. The
actions/* on @v4 and the docker/* on @v3/@v6 all run on Node 20
and emit deprecation warnings on every CI run.

Pinning to specific versions (not floating major) so the version
in CI matches the version reviewers see.

  actions/checkout       @v4         -> @v5.0.0
  actions/cache/{restore,save} @v4   -> @v5.0.5
  docker/setup-qemu-action  @V3      -> @v4.0.0
  docker/setup-buildx-action @V3     -> @v4.0.0
  docker/login-action       @V3      -> @v4.1.0
  docker/build-push-action  @v6      -> @v7.1.0
  imjasonh/setup-crane      @v0.3    -> @v0.5
  mikefarah/yq              @v4.44.1 -> @v4.53.2

All these majors are drop-in for our usage (Node 24 baseline; no
other contract changes that affect this workflow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new GHA job runs e2e/agents-clusterautomation-acceptance-linux-amd64.sh
when both:

  1. The trigger is a pull_request event (push to PR head, or PR
     opened / reopened).
  2. The PR carries the `e2e-cluster` label.

Gated by `needs: [script-lint, itest]` so the heavyweight (~10-15
min) provision + converge + validate cycle only fires after the
cheaper checks have passed.

Runs on ubuntu-latest -- GitHub-hosted runners support KVM
acceleration, have qemu-system-x86_64 preinstalled, and provide
4 vCPU / 16 GB / 14 GB SSD which fits the 4 CPU / 8 GB cluster
the local-qemu config provisions. The pre-flight step echoes
/dev/kvm + df + qemu version so disk / virtualization issues
surface explicitly when the runner spec changes under us.

Sets ENV_IS_CLEAN=true to skip the script's `exec env -i ...`
trampoline (which exists for clean-shell rehearsal on a dev
laptop; CI's env is already minimal). PATH is set to put
${GITHUB_WORKSPACE}/bin first so the wrapper resolution works
without a shell rc file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unblockers for the kwok itest under y-cluster v0.3.3:

1. The example-db/checks pure-CUE library (parameterized #DbChecks,
   imported by example-db/{single,distributed}) tripped the dep
   walker -- v0.3.3 walks every CUE import as a converge step and
   errors with "no kustomization file" for dirs that are
   import-only definition libraries. Inline the wait check into
   each variant; drop the now-unused checks/ dir entirely.

2. The prod/qa cluster-overlay tests
   (`kubectl yconverge -k cluster-prod/db/` etc.) require yconverge
   to apply once at the top and run nested-base checks in
   depth-first order. v0.3.3 instead applies every CUE-imported
   base standalone, which fails on example-db/{single,distributed}
   because they carry a sentinel namespace
   (ONLY_apply_through_cluster_variant) that requires the cluster
   overlay to override. Comment out lines 269-277 with a TODO
   describing the y-cluster gap.

Both are y-cluster behavior gaps, not regressions in this PR --
they were latent under the previous local pin and surfaced when
running the kwok itest end-to-end (prior CI runs died at line 232
on a separate stale grep, masking these).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes:

- cluster-configs/local-docker/y-cluster-provision.yaml grows the
  full PortForwards list (6443/80/443/8944) -- y-cluster v0.3.3's
  docker provider takes the same shape as qemu, mapping each entry
  via Docker port bindings. The earlier comment ("the docker schema
  does not expose additional port forwards") was stale and is
  rewritten.
- e2e/agents-clusterautomation-acceptance-linux-amd64.sh: switch
  CONFIG from cluster-configs/local-qemu to cluster-configs/local-docker.
- The dns-hint-ip annotation flow is unchanged: docker provider
  also fills DNSHintIP from cfg.HostRoutableIP() (127.0.0.1 when
  guest:80 is forwarded), so /etc/hosts -> 127.0.0.1:8944 -> docker
  port mapping -> ServiceLB still resolves end-to-end.

Includes a self-contained pre-pull fallback for k3s: y-cluster
v0.3.3's docker provider does NOT auto-pull the k3s image -- it
calls `docker create` directly and errors with "No such image"
when the image isn't already on the host. The acceptance script
catches that error path, scrapes the image ref out of y-cluster's
"starting docker" progress log, runs `docker pull`, and retries
provision. Harmless when the image is already cached (the first
attempt succeeds); important for fresh hosts (CI runners). Will
become dead code once y-cluster ships auto-pull on the docker
provider.

Verified locally with both warm and cold docker image cache:
provision -> 7 yconverge phases -> validate-ystack reports
37 passed, 0 failed in both cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
y-cluster v0.3.3 errors out with "KUBECONFIG env must be set" at
provision time -- the binary refuses to default a path so it can
never accidentally write to a developer's main kubeconfig. The
e2e-cluster job's runner has no KUBECONFIG set by default.

Add a pre-step that exports `$HOME/.kube/yolean` via $GITHUB_ENV
(matching the ystack convention from the local dev workflow), and
mkdir the parent dir. The acceptance script picks it up via the env
inherited through the ENV_IS_CLEAN=true trampoline-skip path.

Also retire the qemu/kvm pre-flight checks now that the acceptance
runs against the docker provider; replace with `df -h` + `docker
info` for surfaced disk + docker daemon visibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second y-cluster v0.3.3 docker-provider race surfaces in CI
(filed as specs/y-cluster/ISSUE_DOCKER_K3S_READY_BEFORE_APISERVER.md):
provision declares "k3s ready" once /etc/rancher/k3s/k3s.yaml exists
inside the container, then immediately runs `kubectl apply` for the
envoy-gateway install against the host-mapped 127.0.0.1:6443 -- on
slower hosts (GHA runners) the host port forward isn't yet
functional, the apply fails with "dial tcp 127.0.0.1:6443: connect:
connection refused", and provision aborts.

Extend the existing pre-pull workaround into a unified retry loop:
on each provision attempt, if the failure log contains
  - "No such image"  -> docker pull, retry
  - "dial tcp 127.0.0.1:6443: connect: connection refused"
                     -> sleep 10s, retry
  - anything else    -> propagate the failure as before
Up to 4 attempts. Becomes dead code once y-cluster ships
auto-pull + a stronger readiness check on the host port.

Verified locally with cold image cache: pre-pull fires once,
provision succeeds on second attempt, validate-ystack reports
37 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.3.4 ships the docker auto-pull fix (a959eb0) responding to
ISSUE_DOCKER_PROVIDER_NO_AUTO_PULL.md: ContainerCreate now does an
ImagePull first when the image isn't on disk. The acceptance
script's pre-pull fallback (scrape image ref + docker pull on the
"No such image" failure path) is dead code on v0.3.4 and is
removed in the same commit.

The other docker-provider race
(ISSUE_DOCKER_K3S_READY_BEFORE_APISERVER.md) is not addressed in
v0.3.4 -- the connect-refused retry stays in place until y-cluster
strengthens the readiness check on the host's :6443 port.

Both pins land at the same SHA (host wrapper +
y-kustomize Deployment image) so a fresh provision and the
in-cluster Deployment serve from one binary.

Verified locally with cold image cache: provision auto-pulls,
validate-ystack reports 37 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
40-kafka and 30-blobs depended on 40-kafka-ystack/30-blobs-ystack, which
inverted the natural order: the cluster-side package (kafka, blobs) was
gated on its y-kustomize sibling rather than the other way around. Flip
the deps so 40-kafka and 30-blobs gate only on their namespace, and the
new 41-kafka-y-kustomize / 31-blobs-y-kustomize gate on the cluster
package plus 29-y-kustomize.

Renaming -ystack to -y-kustomize and bumping the prefix to 31/41 makes
the converge order match the directory listing and names the role.

60-builds-registry/yconverge.cue updated to import the renamed packages.
Two y-cluster releases unblock the docker provider on
ubuntu-latest and let the acceptance script collapse to a single
provision call:

- v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe
  between the in-container kubeconfig appearing and "k3s ready"
  being declared, closing the docker port-forward race that made
  envoy-gateway install fail with "dial tcp 127.0.0.1:6443:
  connect: connection refused". The 4x retry/sleep-10s
  workaround in this script is dead code now -- each retry tore
  the cluster down and reproduced the deterministic race anyway.
- v0.3.6 (Yolean/y-cluster#15) fixed a separate silent-drop in
  the docker provider's PortBindings: HostIP was left as the
  zero netip.Addr ("invalid IP"), which moby v1.54+ marshals to
  the empty JSON string and Docker Engine 28 dropped silently.

  A second issue with PortBindings still surfaces in some CI
  contexts -- the y-cluster-managed container's
  NetworkSettings.Ports comes back empty even with v0.3.6 -- but
  it's distinct from anything this script can work around;
  filed upstream against y-cluster.

The y-kustomize Deployment image is bumped to the matching
v0.3.6 tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson solsson force-pushed the y-converge-checks-dag branch from 4363d4b to 1aed9a4 Compare May 3, 2026 15:01
v0.3.7 (Yolean/y-cluster#17) sets Config.ExposedPorts alongside
HostConfig.PortBindings on every docker.Provision call, matching
what `docker run -p` does. Addresses Yolean/y-cluster#16: on
ubuntu-latest CI the released binary's ContainerCreate produced
NetworkSettings.Ports={} for the four-port ystack config even
after the v0.3.6 HostIP fix, while plain `docker run -p ...` on
the same runner published bindings cleanly.

Verified via the e2e-cluster job whether the silent-drop is
actually closed in the released-binary-from-bash path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson solsson merged commit 62bee62 into main May 3, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

e2e-cluster Triggers long-running cluster acceptance tests on push to labeled PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant