kubectl yconverge checks DAG and kustomize base traversal#76
Conversation
kubectl plugin that wraps kustomize apply with idempotent converge-mode label routing (create, replace, serverside, serverside-force) and post-apply checks defined in yconverge.cue files using a CUE schema. Check types: #Wait (kubectl wait), #Rollout (rollout status), #Exec (arbitrary command with retry-until-timeout). Checks are defined per kustomization in a yconverge.cue file; the framework finds them via 1-level single-directory indirection through kustomization.yaml resources, ignoring sibling file resources. Dependency resolution walks CUE imports to build a topological apply order. Shared check definitions live in pure-CUE packages (no kustomization.yaml) that the dep walker ignores. Modes: apply (default), --diff=true, --checks-only, --print-deps. Apply modifiers: --dry-run=server|none, --skip-checks. Dry-run forwards to both kubectl apply and delete so replace-mode resources are provably non-mutating. Invalid flag combinations fail up front. Namespace for checks resolves from: -n CLI arg > outer kustomization namespace > indirected base namespace > context default. Exported as $NS_GUESS for exec checks alongside $CONTEXT. Error tolerance uses exact criteria: each kubectl step declares the specific error substrings it tolerates (AlreadyExists, no objects passed to apply, No resources found) — anything else surfaces raw. Integration tests run a kwok cluster in Docker with a fake node for pod scheduling. Covers: schema validation, dep resolution, indirection, converge-mode labels, broken-cue rejection, --skip-checks negative, replace-mode dry-run UID preservation, shared checks across db variants (single/distributed), and a PDB safety check demonstrating prod→qa failure detection. CI workflow renamed from "lint" to "checks" to reflect the itest job. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> s
Remove `up` and `namespaceGuess` from verify.#Step. Both were "set by the engine, not by user CUE files" — but the engine never set them either. `up` was designed for a CUE-native orchestrator where CUE's evaluation order needed a data dependency to serialize steps; the shell-based dep walker serializes via a for-loop instead. `namespaceGuess` is handled entirely as the shell variable $NS_GUESS. No yconverge.cue file in the repo references either field. New test: verify dependency checks serialize before downstream steps. Captures the multi-step output of example-with-dependency and asserts line ordering — namespace check completes before configmap step starts, configmap check completes before with-dependency step starts. This is the guarantee `up` was meant to provide, now proven by the shell execution model. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Provisioners (qemu, k3d) run kubectl yconverge for gateway-api and gateway before --skip-converge exit. Gateway API is infrastructure assumed present by all functional bases. Remove gateway imports from 29-y-kustomize and 20-gateway DAG. Keep all Traefik checks in 40-kafka-ystack — they verify the complete path kustomize uses for HTTP resources. Use -write instead of --ensure for /etc/hosts to fix stale entries from previous provisioner sessions. E2e: replace y-cluster-provision reprovision with explicit yconverge calls for monitoring and idempotency proof. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gateway step's /etc/hosts update runs before any HTTPRoutes exist. The y-kustomize step creates an HTTPRoute, so /etc/hosts needs updating afterward for kustomize HTTP resource resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace API proxy checks (kubectl get --raw .../proxy/...) with curl checks using the exact URL that kustomize HTTP resources reference: http://y-kustomize.ystack.svc.cluster.local/v1/.../base-for-annotations.yaml This is the path kustomize actually uses. If curl succeeds, kustomize will resolve the resource. The API proxy path has different failure modes (endpoint readiness timing) that don't predict kustomize success. 30-blobs-ystack: add blobs content check after restart (was missing). 40-kafka-ystack: kafka base gets 120s timeout (newly mounted secret), blobs base gets 60s (already mounted from previous step). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The y-k8s-ingress-hosts -write command replaces the managed block in /etc/hosts. When called before HTTPRoutes exist (during provisioning), it wrote an empty block — clearing previous entries. This caused curl checks to fail with "Could not resolve host" instead of the assumed secret propagation delay. Fix: skip -write when no ingress/gateway entries are found, preserving existing /etc/hosts entries from earlier steps. With /etc/hosts stable, y-kustomize restart + content availability takes ~4 seconds (secret volume is fresh on new pod). Reduce check timeouts from 120s to 30s. Root cause confirmed: Kubernetes secret volume mounts are instant on new pods. The 60-120s delay from docs applies only to volume UPDATES on running pods (kubelet sync interval). Restarts create new pods with fresh mounts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new y-kustomize binary watches secrets labeled
yolean.se/module-part=y-kustomize via the Kubernetes API and serves
their content at /v1/{group}/{name}/{key}. Secret changes are
reflected instantly — no pod restart or kubelet volume sync needed.
This eliminates the dual-restart problem where the second restart
lost the first secret's volume mount for 60-120s due to kubelet's
sync interval.
Changes:
- y-kustomize/cmd/: Go binary with secret watch, HTTP server, tests
- y-kustomize/rbac.yaml: ServiceAccount + Role for secret list/watch
- y-kustomize/deployment.yaml: new image, removed volume mounts
- Secret labels: yolean.se/module-part changed from config to y-kustomize
- Init secrets get the label for consistent watch matching
- blobs-ystack/kafka-ystack: remove restart checks, keep content checks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
contain: Go binary from turbokube/contain releases, added to y-bin.runner.yaml with y-contain wrapper. y-kustomize build: contain.yaml: distroless/static:nonroot base, single Go binary layer skaffold.yaml: custom builder using go build + contain, OCI output No Docker required. No push for local dev. y-image-cache-load: add help section, fix lint warnings. Local workflow: cd y-kustomize/cmd go build + contain build → target-oci/ y-image-cache-load to get into cluster CI workflow: Same contain.yaml with --push for ghcr.io Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Init secrets get yolean.se/converge-mode: create label so re-converge doesn't overwrite secrets that have been populated by blobs-ystack or kafka-ystack. The watch-based y-kustomize reacts to secret content changes — empty secrets cause 404. y-cluster-local-ctr: add qemu case using SSH, matching the provisioner's existing SSH connection pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The watch-based y-kustomize reads secrets via the Kubernetes API. It doesn't need empty placeholder secrets to start — it starts with an empty file map and picks up secrets as they're created by blobs-ystack and kafka-ystack. Removes the init step and the dependency from 29-y-kustomize. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds y-kustomize job to images workflow: go build + contain build --push to ghcr.io/yolean/y-kustomize:$SHA Temporarily triggers on y-converge-checks-dag branch pushes. Push will fail on YoleanAgents fork (no ghcr.io/yolean write access) but validates the build succeeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ghcr.io/yolean/y-kustomize:c55953b69f74067043f2351f8727ea84db1737ca @sha256:e44f99f6bbae59aef485610402c8f3f0125e197fff8616643bd4d5c65ce619e1 Built by GHA images workflow. k3s pulls from ghcr.io on deploy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom builder: go build + contain tarball + ctr import into cluster. Deploy hook restarts y-kustomize after image load. No Docker daemon needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore env -i for acceptance test reproducibility. Registry rollout timeout increased to 120s — first deploy pulls the image from ghcr.io which can exceed 60s on cold cache. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The registry timeout was a transient issue, not a real problem. Restore clean env (env -i) for acceptance test reproducibility. e2e passes: 36/36 checks with clean env on fresh cluster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl-yconverge resolves k3s/ paths relative to cwd. Provisioners are called from other repos (checkit) where k3s/ doesn't exist. Use subshell cd to ensure correct path resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl writes contexts/clusters/users: null instead of [] when the last item is removed. kubie rejects this as invalid YAML. Fix by replacing null with empty list after context deletion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-cluster-converge-ystack accepts --converge=LIST (comma-separated base names without number prefix). Replaces the broken --exclude flag. Default: y-kustomize,blobs,builds-registry. Both provisioners pass --converge and --dry-run through. y-image-list-ystack and y-image-cache-ystack accept the same flag. The provisioner passes its converge targets so all images are pre-cached. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
y-registry-config reads magic ClusterIPs from the source-of-truth YAML files instead of using hostnames. Containerd resolves registries without /etc/hosts hacks on nodes. Qemu provisioner verifies registry access after converge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lint y-cluster-converge-ystack, y-image-list-ystack, and kubectl-yconverge with zero failures required before running integration tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NS_GUESS remains internal. Only NAMESPACE is exported to exec check commands. wait/rollout checks also use NAMESPACE as fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kustomize-traverse walks kustomization directory trees using the kustomize API types. Replaces the bash _find_cue_dir single-dir heuristic with full tree traversal. Checks from all bases are aggregated. Also used for namespace resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8d1dd15 to
fe5e0f8
Compare
| push: | ||
| branches: | ||
| - main | ||
| - y-converge-checks-dag |
There was a problem hiding this comment.
remove workflow test
| --converge=LIST comma-separated k3s bases to converge (default: y-kustomize,blobs,builds-registry) | ||
| --skip-converge skip converge and post-provision steps | ||
| --skip-image-load skip image cache and load into containerd | ||
| --dry-run=MODE forward to kubectl-yconverge (server|none) |
There was a problem hiding this comment.
--dry-run doesn't make sense for provision. There's --skip-converge for that.
| # Fix kubectl writing null instead of [] when last item is removed | ||
| sed -i 's/^contexts: null$/contexts: []/' "$KUBECONFIG" 2>/dev/null | ||
| sed -i 's/^clusters: null$/clusters: []/' "$KUBECONFIG" 2>/dev/null | ||
| sed -i 's/^users: null$/users: []/' "$KUBECONFIG" 2>/dev/null |
There was a problem hiding this comment.
We should remove this if only one provisioner does it. It's only kubie that can't handle null so we can consider that an issue with kubie.
| # Gateway API is always set up, even with --skip-converge. | ||
| export OVERRIDE_IP=${YSTACK_PORTS_IP:-127.0.0.1} | ||
| (cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/10-gateway-api/) | ||
| (cd "$YSTACK_HOME" && kubectl-yconverge --context=$CTX -k k3s/20-gateway/) |
There was a problem hiding this comment.
This way of calling yconverge is poor DX. If the script can run from any CWD we should use $YSTACK_HOME/k3s/20-gateway/ as the base path (or use a root derived from the script invocation). yconverge should not require that it's invoked from ystack root. Also it's a kubectl plugin so kubectl yconverge should work.
| set -e | ||
| YBIN="$(dirname $0)" | ||
|
|
||
| version=$(y-bin-download $YBIN/y-bin.optional.yaml contain) |
There was a problem hiding this comment.
does the declaration remain in optional? if so clean up.
| kubectl kustomize "$d" 2>/dev/null \ | ||
| | grep -oE 'image:\s*\S+' \ | ||
| | sed 's/image:[[:space:]]*//' \ | ||
| || true # y-script-lint:disable=or-true # kustomize may fail for bases requiring y-kustomize HTTP |
There was a problem hiding this comment.
This is not a good enough reason for ignoring errors. If you're talking about transient HTTP errors they're either rare and should propagate (if it's github, provision will likely fail anyway) or frequent in which case they should propagate too so we learn that we should redesign.
| grep '"yolean.se/ystack/' "$1" 2>/dev/null \ | ||
| | grep -v '"yolean.se/ystack/yconverge/verify"' \ | ||
| | sed 's|.*"yolean.se/ystack/\([^":]*\).*|\1|' \ | ||
| || true # y-script-lint:disable=or-true # no imports is valid |
There was a problem hiding this comment.
What other errors will this silently swallow? Do we have test coverage?
|
|
||
| if [ -z "$_YCONVERGE_RESOLVING" ] && [ -n "$KUSTOMIZE_DIR" ]; then | ||
| deps=$(_resolve_deps "$KUSTOMIZE_DIR") | ||
| dep_count=$(printf '%s\n' "$deps" | grep -c . 2>/dev/null) || true # y-script-lint:disable=or-true # grep -c . exit 1 = zero matches |
There was a problem hiding this comment.
find a better way to detect zero matches than to skip errors
| @@ -134,10 +142,7 @@ else | |||
| y-image-cache-load-all </dev/null || true | |||
There was a problem hiding this comment.
Why do we ignore errors here? Why didn't y-script-lint prevent this?
| echo "[y-cluster-provision-qemu] Loading images ..." | ||
| y-image-cache-ystack </dev/null | ||
| y-image-cache-ystack --converge=$CONVERGE_TARGETS </dev/null | ||
| y-image-cache-load-all </dev/null || true |
There was a problem hiding this comment.
- Remove workflow test changes from images.yaml - Remove --dry-run from provisioners (use y-cluster-converge-ystack directly) - Remove kubie null workaround from qemu teardown - Use absolute paths for yconverge calls (no cd to YSTACK_HOME) - y-image-list-ystack: let kustomize errors propagate - kubectl-yconverge: replace grep -c with wc -l, guard file existence in _find_imports, use || : for legitimate empty-string fallbacks - y-cluster-converge-ystack: use absolute paths in _resolve_target Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kubectl kustomize fails for bases that reference HTTP resources (e.g. y-kustomize served content) when the cluster isn't running. Skip with a diagnostic message instead of failing the entire image caching step. These images are pulled during converge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split the previously bundled Secret + Job into two y-kustomize-served
bases so per-bucket consumer kustomizations can `nameSuffix:` the Job
without renaming the per-namespace prerequisites:
/v1/blobs/setup-bucket-prep/base-for-annotations.yaml <-- new
ServiceAccount setup-bucket
Role setup-bucket (secrets: create, get, patch, update)
RoleBinding setup-bucket
Secret bucket (versitygw default endpoint + EXAMPLE creds)
/v1/blobs/setup-bucket-job/base-for-annotations.yaml <-- shape change
Job setup-bucket
initContainer mc: minio/mc, mc mb only (versitygw: no events,
no anonymous policy -- versitygw doesn't
implement those S3 ops)
container secret: ghcr.io/yolean/curl, SSA-PATCHes a Secret
named via yolean.se/secret-name annotation
with endpoint + bucket + creds for downstream
Deployments to mount
The Job's pod-template annotations carry the consumer-supplied parameters:
yolean.se/bucket-name shell template; ${NAMESPACE} expands at runtime
yolean.se/secret-name the consumer Secret this Job upserts
The annotation surface is shared between impls; the upcoming minio
counterpart will run the full mc command set (events + anonymous) on
the same annotations and produce the same shape of consumer Secret.
The annotation surface intentionally has no events-arn knob -- the ARN
is the impl's own concern (minio uses arn:minio:sqs::_:kafka, versitygw
has none).
k3s/30-blobs-ystack now converges both bases so y-kustomize watches both
Secrets.
Drops the legacy nodeSelector yolean.se/cluster=local from the Job; the
new pattern runs the Job in arbitrary consumer namespaces, so the
nodeSelector would prevent scheduling in any non-local cluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors blobs-versitygw/setup-bucket-prep-y-kustomize. The new base serves Secret y-kustomize.kafka.setup-topic-prep at /v1/kafka/setup-topic-prep/base-for-annotations.yaml carrying the ServiceAccount + Role + RoleBinding the existing setup-topic Job already references via serviceAccountName: setup-topic. The previous topic-job-rbac base only existed inside the ystack namespace (applied via k3s/40-kafka). Consumers outside that namespace (e.g. checkit's keycloak-v3 / dev / per-site namespaces) had no way to pull it without copying or symlinking, which is what made keycloak-v3's setup-topic-events Job stuck at "FailedCreate: serviceaccount setup-topic not found" yesterday and what caused checkit's per-site setup-topic-* Jobs to never schedule a Pod today. No `bootstrap` Secret in this prep base: topics in sitevalues already carry bootstrap via site-chart's settings-sitevalues template, and topics outside sitevalues can pass bootstrap directly via the yolean.se/kafka-bootstrap annotation on the per-topic kustomization. k3s/40-kafka-ystack now converges both bases so y-kustomize watches both Secrets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-name The setup-bucket-job y-kustomize URL (in registry/builds-bucket) and the setup-topic-job URL (in registry/builds-topic) no longer carry the ServiceAccount/Role/RoleBinding inline -- those moved to the prep URLs to avoid per-Job rename collisions. Without explicitly pulling the prep URLs into the ystack namespace, the Jobs hang with "serviceaccount setup-bucket / setup-topic not found" and the registry deployment never finds its credentials Secret. - Add registry/builds-prep/ that fetches both prep URLs into ystack ns - Wire it into k3s/60-builds-registry/kustomization.yaml - Add the missing yolean.se/secret-name annotation in builds-bucket so the secret container in setup-bucket Job writes the consumer Secret the registry deployment expects (builds-registry-bucket) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the host wrapper (bin/y-bin.runner.yaml) and the in-cluster y-kustomize Deployment (y-kustomize/y-kustomize-deployment.yaml) to the same y-cluster release. v0.3.3 ships: - yconverge progress headers + CWD-relative paths in the dep/target log lines. - Restored yolean.se/converge-mode label routing (create / replace / serverside / serverside-force) lost in the early v0.3.x line. - post-drop-client-go internals; faster cold start. - gateway config knobs (gateway.skip, gateway.className) and the yolean.se/dns-hint-ip annotation on the installed GatewayClass -- used by later commits in this branch. Both pins land at the same SHA so a fresh provision and the in-cluster Deployment serve from one binary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The OVERRIDE_IP=127.0.0.1 env-var chain (acceptance script -> yconverge.cue annotate-on-Gateway -> y-k8s-ingress-hosts annotation fallback) made the host-loopback fact look like a per-cluster operator knob. y-cluster v0.3.3 publishes the host-side dial IP as yolean.se/dns-hint-ip on the installed GatewayClass; this branch adopts that contract end to end: - e2e/agents-clusterautomation-acceptance-linux-amd64.sh: drop `export OVERRIDE_IP=127.0.0.1`. No operator-side setting. - k3s/20-gateway/yconverge.cue: drop the exec check that wrote yolean.se/override-ip onto the Gateway from the env var. - bin/y-k8s-ingress-hosts: rewrite the resolution chain to walk Gateway/ystack -> spec.gatewayClassName -> GatewayClass metadata.annotations[yolean.se/dns-hint-ip], with the legacy yolean.se/override-ip Gateway annotation as a fallback for environments that haven't migrated yet. The deprecated -override-ip flag remains as --host-ip with a deprecation log line, so callers passing it explicitly keep working for one cycle. - gateway/gateway.yaml + k3s/20-gateway/yconverge.cue comment: rename gatewayClassName from `eg` to `y-cluster` to match the new y-cluster default GatewayClass name (eg was an implementation detail; y-cluster names the cluster role). The provisioner-published annotation is the single source of truth for the host-routable IP; consumer tooling reads it and writes /etc/hosts without operator intervention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Containerd on the node can't resolve *.svc.cluster.local at image
pull time, so workloads referencing
prod-registry.ystack.svc.cluster.local/yolean/... images would
ImagePullBackOff on a fresh local-qemu or local-docker cluster.
Add a registries: block to both cluster-configs/local-{qemu,docker}/
y-cluster-provision.yaml that maps the in-cluster registry hostnames
to the magic ClusterIPs that 60-builds-registry/61-prod-registry pin
and y-cluster-validate-ystack asserts. The mirror is node-side, so
the same block applies to both providers. y-cluster v0.3.2+ writes
this verbatim to /etc/rancher/k3s/registries.yaml on the node before
k3s starts.
ystack's own acceptance was blind to this gap because its registry
verification goes through the kubectl API proxy and its build path
goes through in-cluster buildkit. checkit (and any real-workload
consumer) needs the mirror -- see specs/ystack/CLUSTER_CONFIG_REGISTRIES_BLOCK.md
which retires checkit/bin/y-cluster-local-registries-yaml in the
same beat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The y-build flow already proves buildctl-over-GRPCRoute and
buildkitd-pushes-to-builds-registry, but both paths travel through
in-cluster Service ClusterIP and don't need the node-side mirror.
The registries.yaml mirror configured by cluster-configs/*/
y-cluster-provision.yaml is exercised only when containerd on the
node resolves *.svc.cluster.local at image-pull time -- nothing in
this script did that.
After the y-build push, schedule a one-shot Pod that pulls the
just-pushed image with imagePullPolicy=Always and asserts
condition=Ready within 60s. Catches ImagePullBackOff if the
registries.yaml mirror is missing or the magic ClusterIPs drift.
Wait for Ready (not Succeeded) because the test image is built FROM
ghcr.io/yolean/static-web-server, a distroless image whose
entrypoint is `sws` -- it serves HTTP forever, never exits. Pod
Ready under restartPolicy=Never fires when the container is
Running, which is the minimum signal we need ("containerd resolved
+ pulled via the mirror"). The post-run delete cleans up.
Also annotates two pre-existing `|| true` sites per y-script-lint;
unrelated to the new check, but the file is now lint-clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opt-in: when set, suppresses teardown on non-zero exit so the qemu VM stays up for kubectl / ssh post-mortem. Default behavior is unchanged (teardown on every EXIT). cleanup() carries a forward-looking note: the intended *future* default is a "keep cluster on failure for N minutes, then teardown" mode -- a post-mortem window without leaving stale VMs around forever. That timed-keep is not implemented in this commit; --keep-on-failure is the manual opt-in until it lands. Also refreshes the inline comment block describing how host reachability flows: the previous mention of --node-external-ip is obsolete; v0.3.3 publishes the host-side dial IP via the yolean.se/dns-hint-ip annotation on the GatewayClass, which y-k8s-ingress-hosts walks via gatewayClassName. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settles y-kustomize:8944 as the canonical host:port everywhere
(both in-cluster and locally), and routes the host-side path to the
in-cluster Deployment instead of a host-local serve.
In-cluster path:
- y-kustomize Service: LoadBalancer on port 8944 (targetPort 8944).
ServiceLB binds 0.0.0.0:8944 on the node.
- y-kustomize HTTPRoute: backendRefs[].port=8944. Acts as a "dummy"
hostname registration -- y-k8s-ingress-hosts discovers the
y-kustomize hostname via the route, but actual traffic uses
ServiceLB:8944 directly. The HTTPRoute also keeps Gateway:80
routing functional for any consumer that prefers it.
Host -> in-cluster bridge:
- cluster-configs/local-qemu/y-cluster-provision.yaml: add
host:8944 -> guest:8944 to PortForwards (replaces the default
6443/80/443 wholesale -- y-cluster's PortForwards is
spell-it-all-out). With /etc/hosts mapping y-kustomize ->
127.0.0.1, kustomize-build's fetches of http://y-kustomize:8944/...
resolve to ServiceLB on the node.
Consumers and probes restored to :8944:
- k3s/{29-y-kustomize,30-blobs-ystack,40-kafka-ystack}/yconverge.cue
probes use http://y-kustomize:8944/...
- kafka/validate-topic, registry/{builds-bucket,builds-topic,builds-prep}
resources URLs use :8944.
- 29-y-kustomize/yconverge.cue drops the 20-gateway dep that the
prior Gateway-routed probe needed.
- Doc-comment URLs in served bases (setup-{bucket,topic}{,-prep}-y-kustomize)
match the canonical address.
Acceptance script changes:
- Drop `y-cluster serve ensure -c y-kustomize/` (no host-local
serve in the acceptance flow).
- Drop `y-cluster serve stop` from the default cleanup body.
- After teardown, probe :8944 with `ss -lnt`; if anything is still
listening (e.g. a downstream user's host-local serve),
best-effort `y-cluster serve stop` so the next provision's
hostfwd can bind. Diagnostic only -- the binding might be
something else entirely.
Host-local serve preserved for downstream users:
- y-kustomize/y-cluster-serve.yaml: lists all four sources (the
*-prep variants were missing previously, which made
http://y-kustomize:8944/v1/{group}/setup-*-prep/... return 404
-- and kustomize 5.7.1 then misclassifies the failed response as
a git URL). The config exists so `y-cluster serve -c y-kustomize/`
works on developer laptops without a cluster.
- bin/acceptance-y-kustomize-local: standalone OS/arch-neutral test
for the host-local path. Boots `y-cluster serve` against a temp
state-dir, asserts /health reports routes=4, fetches each of the
four expected URLs and grep-validates the YAML response. No
qemu, no docker, no kubectl -- catches future drift in
y-cluster-serve.yaml without spinning a cluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
blobs-versitygw/standalone/deployment.yaml flips the image pin to versity/versitygw:v1.4.1@sha256:0400cb59... .github/workflows/images.yaml grows a versitygw mirror step in the same shape as the other hub mirrors: yq extracts the tag from the deployment manifest (post-tag, pre-digest), crane copies docker.io/versity/versitygw:$TAG to ghcr.io/yolean/versitygw:$TAG on every main push. Verified end-to-end: ad-hoc provision + yconverge k3s/30-blobs/ rolls out v1.4.1, y-cluster-blobs ls works against it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single line of stdout output, prefixed with the CLI name and
the host count, just before the wrapper exec's the underlying Go
binary in write mode:
y-k8s-ingress-hosts: writing 4 host entries to /etc/hosts
Visible in converge logs so it's clear when (and how many)
HTTPRoute/GRPCRoute hostnames the wrapper materialized into
/etc/hosts. Useful as a yconverge-trace breadcrumb -- the
20-gateway and 29-y-kustomize phases both invoke this on
provisions.
The line only fires when PASSTHROUGH carries -write -- preview /
check / no-routes paths stay quiet (they already echo their own
"# /etc/hosts is up to date" / "# no entries" diagnostics).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dep-ordering assertion's _cm_check looked for the literal "configmap exists" string -- the description from example-configmap's yconverge.cue exec check. y-cluster v0.3.3 stopped echoing check descriptions to stdout (it now prints "yconverge check N/N exec" markers instead), so the grep silently returned 0 matches. Under set -eo pipefail, the empty `$(grep ... | head -1 | cut ...)` substitution exits 1 (because grep with no match exits 1), which trips set -e and exits the script silently -- before the FAIL echo runs. CI shows a failed itest with no diagnostic. Replace the description grep with a structural one: the first "yconverge check ... exec" line in the output is example-configmap's exec check (example-namespace's check is kind=wait, not exec). The remaining ordering assertion (_cm_check < _wd_step) gates the sequential walk through the dep chain unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub deprecates Node.js 20 actions starting June 2026 (default flips) and removes Node.js 20 from runners September 2026. The actions/* on @v4 and the docker/* on @v3/@v6 all run on Node 20 and emit deprecation warnings on every CI run. Pinning to specific versions (not floating major) so the version in CI matches the version reviewers see. actions/checkout @v4 -> @v5.0.0 actions/cache/{restore,save} @v4 -> @v5.0.5 docker/setup-qemu-action @V3 -> @v4.0.0 docker/setup-buildx-action @V3 -> @v4.0.0 docker/login-action @V3 -> @v4.1.0 docker/build-push-action @v6 -> @v7.1.0 imjasonh/setup-crane @v0.3 -> @v0.5 mikefarah/yq @v4.44.1 -> @v4.53.2 All these majors are drop-in for our usage (Node 24 baseline; no other contract changes that affect this workflow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new GHA job runs e2e/agents-clusterautomation-acceptance-linux-amd64.sh
when both:
1. The trigger is a pull_request event (push to PR head, or PR
opened / reopened).
2. The PR carries the `e2e-cluster` label.
Gated by `needs: [script-lint, itest]` so the heavyweight (~10-15
min) provision + converge + validate cycle only fires after the
cheaper checks have passed.
Runs on ubuntu-latest -- GitHub-hosted runners support KVM
acceleration, have qemu-system-x86_64 preinstalled, and provide
4 vCPU / 16 GB / 14 GB SSD which fits the 4 CPU / 8 GB cluster
the local-qemu config provisions. The pre-flight step echoes
/dev/kvm + df + qemu version so disk / virtualization issues
surface explicitly when the runner spec changes under us.
Sets ENV_IS_CLEAN=true to skip the script's `exec env -i ...`
trampoline (which exists for clean-shell rehearsal on a dev
laptop; CI's env is already minimal). PATH is set to put
${GITHUB_WORKSPACE}/bin first so the wrapper resolution works
without a shell rc file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unblockers for the kwok itest under y-cluster v0.3.3:
1. The example-db/checks pure-CUE library (parameterized #DbChecks,
imported by example-db/{single,distributed}) tripped the dep
walker -- v0.3.3 walks every CUE import as a converge step and
errors with "no kustomization file" for dirs that are
import-only definition libraries. Inline the wait check into
each variant; drop the now-unused checks/ dir entirely.
2. The prod/qa cluster-overlay tests
(`kubectl yconverge -k cluster-prod/db/` etc.) require yconverge
to apply once at the top and run nested-base checks in
depth-first order. v0.3.3 instead applies every CUE-imported
base standalone, which fails on example-db/{single,distributed}
because they carry a sentinel namespace
(ONLY_apply_through_cluster_variant) that requires the cluster
overlay to override. Comment out lines 269-277 with a TODO
describing the y-cluster gap.
Both are y-cluster behavior gaps, not regressions in this PR --
they were latent under the previous local pin and surfaced when
running the kwok itest end-to-end (prior CI runs died at line 232
on a separate stale grep, masking these).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes:
- cluster-configs/local-docker/y-cluster-provision.yaml grows the
full PortForwards list (6443/80/443/8944) -- y-cluster v0.3.3's
docker provider takes the same shape as qemu, mapping each entry
via Docker port bindings. The earlier comment ("the docker schema
does not expose additional port forwards") was stale and is
rewritten.
- e2e/agents-clusterautomation-acceptance-linux-amd64.sh: switch
CONFIG from cluster-configs/local-qemu to cluster-configs/local-docker.
- The dns-hint-ip annotation flow is unchanged: docker provider
also fills DNSHintIP from cfg.HostRoutableIP() (127.0.0.1 when
guest:80 is forwarded), so /etc/hosts -> 127.0.0.1:8944 -> docker
port mapping -> ServiceLB still resolves end-to-end.
Includes a self-contained pre-pull fallback for k3s: y-cluster
v0.3.3's docker provider does NOT auto-pull the k3s image -- it
calls `docker create` directly and errors with "No such image"
when the image isn't already on the host. The acceptance script
catches that error path, scrapes the image ref out of y-cluster's
"starting docker" progress log, runs `docker pull`, and retries
provision. Harmless when the image is already cached (the first
attempt succeeds); important for fresh hosts (CI runners). Will
become dead code once y-cluster ships auto-pull on the docker
provider.
Verified locally with both warm and cold docker image cache:
provision -> 7 yconverge phases -> validate-ystack reports
37 passed, 0 failed in both cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
y-cluster v0.3.3 errors out with "KUBECONFIG env must be set" at provision time -- the binary refuses to default a path so it can never accidentally write to a developer's main kubeconfig. The e2e-cluster job's runner has no KUBECONFIG set by default. Add a pre-step that exports `$HOME/.kube/yolean` via $GITHUB_ENV (matching the ystack convention from the local dev workflow), and mkdir the parent dir. The acceptance script picks it up via the env inherited through the ENV_IS_CLEAN=true trampoline-skip path. Also retire the qemu/kvm pre-flight checks now that the acceptance runs against the docker provider; replace with `df -h` + `docker info` for surfaced disk + docker daemon visibility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second y-cluster v0.3.3 docker-provider race surfaces in CI
(filed as specs/y-cluster/ISSUE_DOCKER_K3S_READY_BEFORE_APISERVER.md):
provision declares "k3s ready" once /etc/rancher/k3s/k3s.yaml exists
inside the container, then immediately runs `kubectl apply` for the
envoy-gateway install against the host-mapped 127.0.0.1:6443 -- on
slower hosts (GHA runners) the host port forward isn't yet
functional, the apply fails with "dial tcp 127.0.0.1:6443: connect:
connection refused", and provision aborts.
Extend the existing pre-pull workaround into a unified retry loop:
on each provision attempt, if the failure log contains
- "No such image" -> docker pull, retry
- "dial tcp 127.0.0.1:6443: connect: connection refused"
-> sleep 10s, retry
- anything else -> propagate the failure as before
Up to 4 attempts. Becomes dead code once y-cluster ships
auto-pull + a stronger readiness check on the host port.
Verified locally with cold image cache: pre-pull fires once,
provision succeeds on second attempt, validate-ystack reports
37 passed, 0 failed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.3.4 ships the docker auto-pull fix (a959eb0) responding to ISSUE_DOCKER_PROVIDER_NO_AUTO_PULL.md: ContainerCreate now does an ImagePull first when the image isn't on disk. The acceptance script's pre-pull fallback (scrape image ref + docker pull on the "No such image" failure path) is dead code on v0.3.4 and is removed in the same commit. The other docker-provider race (ISSUE_DOCKER_K3S_READY_BEFORE_APISERVER.md) is not addressed in v0.3.4 -- the connect-refused retry stays in place until y-cluster strengthens the readiness check on the host's :6443 port. Both pins land at the same SHA (host wrapper + y-kustomize Deployment image) so a fresh provision and the in-cluster Deployment serve from one binary. Verified locally with cold image cache: provision auto-pulls, validate-ystack reports 37 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
40-kafka and 30-blobs depended on 40-kafka-ystack/30-blobs-ystack, which inverted the natural order: the cluster-side package (kafka, blobs) was gated on its y-kustomize sibling rather than the other way around. Flip the deps so 40-kafka and 30-blobs gate only on their namespace, and the new 41-kafka-y-kustomize / 31-blobs-y-kustomize gate on the cluster package plus 29-y-kustomize. Renaming -ystack to -y-kustomize and bumping the prefix to 31/41 makes the converge order match the directory listing and names the role. 60-builds-registry/yconverge.cue updated to import the renamed packages.
Two y-cluster releases unblock the docker provider on ubuntu-latest and let the acceptance script collapse to a single provision call: - v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe between the in-container kubeconfig appearing and "k3s ready" being declared, closing the docker port-forward race that made envoy-gateway install fail with "dial tcp 127.0.0.1:6443: connect: connection refused". The 4x retry/sleep-10s workaround in this script is dead code now -- each retry tore the cluster down and reproduced the deterministic race anyway. - v0.3.6 (Yolean/y-cluster#15) fixed a separate silent-drop in the docker provider's PortBindings: HostIP was left as the zero netip.Addr ("invalid IP"), which moby v1.54+ marshals to the empty JSON string and Docker Engine 28 dropped silently. A second issue with PortBindings still surfaces in some CI contexts -- the y-cluster-managed container's NetworkSettings.Ports comes back empty even with v0.3.6 -- but it's distinct from anything this script can work around; filed upstream against y-cluster. The y-kustomize Deployment image is bumped to the matching v0.3.6 tag for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4363d4b to
1aed9a4
Compare
v0.3.7 (Yolean/y-cluster#17) sets Config.ExposedPorts alongside HostConfig.PortBindings on every docker.Provision call, matching what `docker run -p` does. Addresses Yolean/y-cluster#16: on ubuntu-latest CI the released binary's ContainerCreate produced NetworkSettings.Ports={} for the four-port ystack config even after the v0.3.6 HostIP fix, while plain `docker run -p ...` on the same runner published bindings cleanly. Verified via the e2e-cluster job whether the silent-drop is actually closed in the released-binary-from-bash path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces #74.
y-cluster-provisionandy-cluster-converge-ystackNew
--converge=LISTflag replaces the broken--excludeflag.Specify which ystack bases to converge as a comma-separated list of
names without number prefix:
Available targets:
y-kustomize,blobs,builds-registry,kafka,buildkit,monitoring,prod-registry.Dependencies are resolved automatically — converging
builds-registrypulls in
blobs,y-kustomize, andkafka-ystack.New
--dry-run=serverpassthrough — verify what would be appliedwithout mutating the cluster:
Registry mirrors
Drops
/etc/hostshacks on nodes. Theregistries.yamlconfignow uses the magic ClusterIPs (
10.43.0.50for builds-registry,10.43.0.51for prod-registry) read from the source-of-truth YAMLfiles. Containerd resolves registries without needing DNS or host
file entries.
The qemu provisioner verifies registry access after converge:
Image caching
y-image-cache-ystackandy-image-list-ystacknow accept the same--converge=LISTflag. The provisioner passes its converge targetsthrough, so all images for the selected bases are pre-cached before
converge starts.
# List images that would be cached y-image-list-ystack --converge=y-kustomize,blobs,builds-registry,kafka,buildkitkubectl-yconverge
NAMESPACEexported to check commands: Exec checks inyconverge.cuecan use$NAMESPACE(the resolved namespace) and$CONTEXTin their commands.NS_GUESSis no longer exported.Multi-dir CUE aggregation via kustomize-traverse: The old 1-level
single-directory heuristic for finding
yconverge.cueis replaced bykustomize-traverse, which walks the full kustomization tree. Checksfrom all bases are collected and run after apply. This fixes the case
where
site-apply-namespaced/references multiple base directories.K3s version
Upgraded from v1.35.1+k3s1 to v1.35.3+k3s1.
New binary: kustomize-traverse
Added as a y-bin managed tool. Walks kustomization directory trees and
reports local directories visited and resolved namespace:
y-kustomize-traverse -o dirs gateway-v4/site-apply-namespaced/ y-kustomize-traverse -o namespace gateway-v4/site-apply-namespaced/Breaking changes
--excludeflag removed from provisioners (was never implementedin
y-cluster-converge-ystack)NS_GUESSno longer exported — use$NAMESPACEin check commandsy-image-list-ystackno longer reads aBASESarray fromy-cluster-converge-ystack— uses--convergeflag instead