
Allow large image pulls to succeed by decoupling workflow deadline from lock TTL #233

@jjamroga

Description

Expected Behavior

User Facing: I should be able to pull a large image (such as one containing a large number of executables) and schedule it to a worker.

Technical: ResumeActor and SuspendActor should succeed for any actor whose image pull + extract + gVisor restore can reasonably complete, including cold pulls on slow networks, large container images, and first-time pulls on freshly scheduled worker pods.
The Redis lock TTL that serializes ResumeActor/SuspendActor on a single actor (so a crashed ateapi replica doesn't pin the actor on a stale lock) should be tunable independently of how long an individual workflow is allowed to run.

Actual Behavior

User Facing: Prior to #231, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet, i.e. the worker successfully had the actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Now the control plane is no longer at risk of OOM, but larger images still do not pull successfully.

Technical:
ActorWorkflow.ResumeActor and SuspendActor derive the workflow context from the Redis lock TTL via:

ctx, releaseLock, err := w.acquireActorLock(ctx, id, 30*time.Second, 2*time.Second)

at cmd/ateapi/internal/controlapi/workflow.go:145 (and :174). acquireActorLock makes the workflow ctx expire at lockTTL − padding = 28s, conflating two unrelated concerns into a single knob:

  1. How long a single Resume/Suspend workflow is allowed to run. Bounded by what the slowest legitimate path (image pull + decompress + extract + gVisor restore) actually needs.
  2. How long a peer must wait to retry an actor after a crashed ateapi mid-workflow. Bounded by tolerable failover latency.

Raising the knob to fix (1) makes (2) unacceptably slow — a crashed ateapi would pin an individual actor for minutes. Lowering it to keep (2) fast breaks (1) — image pulls that can't complete inside 28s death-loop forever. There is no operator-facing knob to raise either.

After #231 (the Bug B / RSS amplification fix on parent #230), the loop is no longer an OOM risk to atelet, but the user-visible symptom is unchanged:

  • kubectl ate get actor <golden-actor-id> stays STATUS_RESUMING.
  • kubectl get actortemplate ... -o jsonpath='{.status.phase}' stays PhaseResumeGoldenActor.
  • ate-controller logs spray DeadlineExceeded every ~30s.
  • The workflow makes no progress, and after #231 (Propagate ctx into memorypullcache image pulls) it is less likely to escape via the previously-observed "incidental success" mechanism (bandwidth variance + warmed HTTP keep-alives), because cancelled requests now drop their underlying TCP connections cleanly instead of leaving them warm for the next retry.

Most commonly hit when:

  • Resuming an actor for the first time on a freshly scheduled worker pod.
  • Resuming on a fresh cluster (no upstream registry caching, atelet in-process pull cache empty).
  • ActorTemplate uses a large image (north of ~200MB compressed on non-LAN bandwidth).
  • Self-hosted or rate-limited image registries.
  • After a worker reboot or atelet restart (in-process pull cache lost).

Steps to Reproduce the Problem

Identical underlying reproduction to #230; the death loop happens whether or not #231 is in place — what changes is whether atelet RSS amplifies during it.

  1. Set up kind + install substrate:

    ./hack/create-kind-cluster.sh
    ./hack/install-ate-kind.sh --deploy-ate-system
  2. Build the ateom-gvisor image into the kind-local registry and capture the resolved reference:

    export KO_DOCKER_REPO=localhost:5001
    export KO_DEFAULTPLATFORMS=linux/$(go env GOARCH)
    ATEOM_IMAGE=$(./hack/run-tool.sh ko build -B ./cmd/ateom-gvisor)
    echo "ateom image: $ATEOM_IMAGE"   # localhost:5001/ateom-gvisor@sha256:...
  3. Install metrics-server (needed for kubectl top on a single-node kind cluster):

    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
    kubectl scale -n kube-system deploy/metrics-server --replicas=1
    kubectl patch -n kube-system deploy metrics-server --type=json \
      -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
    kubectl rollout status -n kube-system deploy/metrics-server
  4. Apply the WorkerPool + ActorTemplate. The container image is a ~360MB compressed / ~1.2GB rootfs sandbox image observed in the field; the container's command is irrelevant because the death loop happens during pull/extract, before the container ever starts:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: ate-repro
    ---
    apiVersion: ate.dev/v1alpha1
    kind: WorkerPool
    metadata:
      name: repro-pool
      namespace: ate-repro
    spec:
      replicas: 1
      ateomImage: ${ATEOM_IMAGE}
    ---
    apiVersion: ate.dev/v1alpha1
    kind: ActorTemplate
    metadata:
      name: repro-fat-image
      namespace: ate-repro
    spec:
      workerPoolRef:
        name: repro-pool
        namespace: ate-repro
      runsc:
        amd64:
          url: "gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"
          sha256Hash: "a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63"
        arm64:
          url: "gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"
          sha256Hash: "1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9"
      pauseImage: "registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"
      containers:
        - name: fat
          image: ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4
          command: ["/bin/sh", "-c", "sleep 3600"]
      snapshotsConfig:
        location: gs://ate-snapshots/repro/
    EOF

    No manual kubectl ate resume actor is needed — ate-controller calls ResumeActor against the golden-actor ID as part of the ActorTemplate reconcile, which is enough to exercise the bug.

  5. Confirm the failure mode:

    # ate-controller retries every ~30s with DeadlineExceeded indefinitely
    kubectl logs -n ate-system deploy/ate-controller --tail=0 -f | grep -i "deadline\|golden"
    
    # AT never leaves ResumeGoldenActor
    kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.phase}{"\n"}'
    
    # Golden actor never leaves STATUS_RESUMING
    GOLDEN_ID=$(kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.goldenActorID}')
    kubectl ate get actor "$GOLDEN_ID"

If the bug doesn't trigger on your bandwidth: the cold-pull path completes inside 28s on fast corp LAN / warmed registry mirrors. Force it by throttling the kind node (docker exec kind-control-plane tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms; run apt-get install -y iproute2 first if tc is missing) or by swapping the container image for something much larger (e.g. nvcr.io/nvidia/pytorch:24.05-py3 at ~12GB compressed). Watch for at least 5 minutes before concluding it didn't reproduce, because the loop occasionally escapes via bandwidth variance.

Specifications

  • Version: HEAD of main, with or without #231 (Propagate ctx into memorypullcache image pulls) merged; Bug A behaves identically either way.
  • Platform: any cluster where the chosen image's cold-pull path exceeds ~28s. Reproduced on kind (single-node, control-plane only) on darwin/arm64.

Proposed Fix

Decouple the Redis lock TTL from the workflow deadline via a heartbeat. Sketch:

  1. Add RefreshLock(ctx, key, value, ttl) (bool, error) to store.Interface. Redis implementation is a CAS Lua script — GET key == value ? PEXPIRE key ttl : 0 — mirroring the existing ReleaseLock script's shape at cmd/ateapi/internal/store/ateredis/ateredis.go:577. Returns false if we no longer own the lock so the caller can abort.

  2. Keep the Redis lock TTL short — internal constant (~30s). This is the only thing that bounds how long a peer waits to retry an actor after a crashed ateapi.

  3. Make the workflow deadline a separate, operator-configurable knob. New --actor-workflow-deadline pflag on ateapi with a 5-minute default — long enough for cold cluster pulls on typical bandwidth, short enough that hung workflows stay bounded for ops. This is what bounds a single Resume/Suspend.

  4. Spawn a heartbeat goroutine on lock acquire. Refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), cancel the workflow ctx with a distinguishable cause (errLostActorLock) so in-flight workflow steps see ctx.Err() and tear down cleanly. Preserves the original mutual-exclusion invariant — only one workflow runs per actor at any instant.

  5. releaseLock() stops the heartbeat first (waits for goroutine exit), then best-effort ReleaseLock.

Customer-visible impact after fix

  • Resume on a cold node completes in one call (taking however long the pull legitimately takes), instead of looping forever.
  • Crash-recovery latency unchanged — peer failover stays bounded by the ~30s Redis lock TTL.
  • A workflow torn down because its peer stole the lock returns a specific errLostActorLock cause, useful for triaging flapping ateapi replicas.

Why this shape (not alternatives)

  • Just raise the 28s knob. Rejected — operators do not want multi-minute failover delays after a crashed ateapi.
  • Heartbeat only, no server-side workflow deadline (workflow ctx inherits from caller). Considered: ate-controller and other gRPC callers already carry their own ctx, so in the happy path the heartbeat alone is enough for large pulls to succeed. Rejected on defense-in-depth grounds — substrate does not configure gRPC keepalive, so a network partition or a hung-but-not-killed caller would leave the workflow running indefinitely on ateapi; and a workflow step that forgets to propagate ctx would have no upstream timeout to fall back on. The 5-minute server-side cap covers both cases at the cost of one config knob.
  • Move image pulls off the workflow path entirely (background pull worker + status-polling RPC). The "right" long-term fix, but a much bigger redesign; should follow the on-disk/shared layer cache (internal/memorypullcache/memorypullcache.go:47-58 TODO and ategcs.ObjectStorage). The heartbeat fix is small enough to ship now without prejudicing that work.
