Allow large image pulls to succeed by decoupling workflow deadline from lock TTL

## Expected Behavior

User Facing: I should be able to pull a large image (such as one containing a large number of executables) and schedule it to a worker.

Technical: `ResumeActor` and `SuspendActor` should succeed for any actor whose image pull + extract + gVisor restore can reasonably complete, including cold pulls on slow networks, large container images, and first-time pulls on freshly scheduled worker pods.
The Redis lock TTL that serializes `ResumeActor`/`SuspendActor` on a single actor (so a crashed `ateapi` replica doesn't pin the actor on a stale lock) should be tunable independently of how long an individual workflow is allowed to run.

## Actual Behavior

User Facing: Prior to #231, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of increasing RSS in `atelet`; i.e. the worker successfully had the actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Now the control plane is no longer at risk of OOM, but it still needs to successfully pull larger images.

Technical:
`ActorWorkflow.ResumeActor` and `SuspendActor` derive the workflow context from the Redis lock TTL via:

```go
ctx, releaseLock, err := w.acquireActorLock(ctx, id, 30*time.Second, 2*time.Second)
```

at `cmd/ateapi/internal/controlapi/workflow.go:145` (and `:174`). `acquireActorLock` makes the workflow ctx expire at `lockTTL − padding = 28s`, conflating two unrelated concerns into a single knob:

1. **How long a single Resume/Suspend workflow is allowed to run.** Bounded by what the slowest legitimate path (image pull + decompress + extract + gVisor restore) actually needs.
2. **How long a peer must wait to retry an actor after a crashed `ateapi` mid-workflow.** Bounded by tolerable failover latency.

Raising the knob to fix (1) makes (2) unacceptably slow — a crashed `ateapi` would pin an individual actor for minutes. Lowering it to keep (2) fast breaks (1) — image pulls that can't complete inside 28s death-loop forever. There is no operator-facing knob to raise either.

After #231 (the Bug B / RSS amplification fix on parent #230), the loop is no longer an OOM risk to `atelet`, but the user-visible symptom is unchanged:

- `kubectl ate get actor <golden-actor-id>` stays `STATUS_RESUMING`.
- `kubectl get actortemplate ... -o jsonpath='{.status.phase}'` stays `PhaseResumeGoldenActor`.
- `ate-controller` logs spray `DeadlineExceeded` every ~30s.
- The workflow makes no progress — and after #231 it is *less* likely to escape via the previously-observed "incidental success" mechanism (bandwidth variance + warmed HTTP keep-alives), because cancelled requests now drop their underlying TCP connections cleanly instead of leaving them warm for the next retry.

Most commonly hit when:

- Resuming an actor for the first time on a freshly scheduled worker pod.
- Resuming on a fresh cluster (no upstream registry caching, `atelet` in-process pull cache empty).
- ActorTemplate uses a large image (north of ~200MB compressed on non-LAN bandwidth).
- Self-hosted or rate-limited image registries.
- After a worker reboot or `atelet` restart (in-process pull cache lost).

## Steps to Reproduce the Problem

Identical underlying reproduction to #230; the death loop happens whether or not #231 is in place — what changes is whether `atelet` RSS amplifies during it.

1. Set up kind + install substrate:

   ```bash
   ./hack/create-kind-cluster.sh
   ./hack/install-ate-kind.sh --deploy-ate-system
   ```

1. Build the `ateom-gvisor` image into the kind-local registry and capture the resolved reference:

   ```bash
   export KO_DOCKER_REPO=localhost:5001
   export KO_DEFAULTPLATFORMS=linux/$(go env GOARCH)
   ATEOM_IMAGE=$(./hack/run-tool.sh ko build -B ./cmd/ateom-gvisor)
   echo "ateom image: $ATEOM_IMAGE"   # localhost:5001/ateom-gvisor@sha256:...
   ```

1. Install metrics-server (needed for `kubectl top` on a single-node kind cluster):

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
   kubectl scale -n kube-system deploy/metrics-server --replicas=1
   kubectl patch -n kube-system deploy metrics-server --type=json \
     -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
   kubectl rollout status -n kube-system deploy/metrics-server
   ```

1. Apply the WorkerPool + ActorTemplate. The container image is a ~360MB compressed / ~1.2GB rootfs sandbox image observed in the field; the container's `command` is irrelevant because the death loop happens during pull/extract, before the container ever starts:

   ```bash
   kubectl apply -f - <<EOF
   apiVersion: v1
   kind: Namespace
   metadata:
     name: ate-repro
   ---
   apiVersion: ate.dev/v1alpha1
   kind: WorkerPool
   metadata:
     name: repro-pool
     namespace: ate-repro
   spec:
     replicas: 1
     ateomImage: ${ATEOM_IMAGE}
   ---
   apiVersion: ate.dev/v1alpha1
   kind: ActorTemplate
   metadata:
     name: repro-fat-image
     namespace: ate-repro
   spec:
     workerPoolRef:
       name: repro-pool
       namespace: ate-repro
     runsc:
       amd64:
         url: "gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc"
         sha256Hash: "a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63"
       arm64:
         url: "gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc"
         sha256Hash: "1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9"
     pauseImage: "registry.k8s.io/pause:3.10.2@sha256:f548e0e8e3dc1896ca956272154dde3314e8cc4fde0a57577ee9fa1c63f5baf4"
     containers:
       - name: fat
         image: ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4
         command: ["/bin/sh", "-c", "sleep 3600"]
     snapshotsConfig:
       location: gs://ate-snapshots/repro/
   EOF
   ```

   No manual `kubectl ate resume actor` is needed — `ate-controller` calls `ResumeActor` against the golden-actor ID as part of the ActorTemplate reconcile, which is enough to exercise the bug.

1. Confirm the failure mode:

   ```bash
   # ate-controller retries every ~30s with DeadlineExceeded indefinitely
   kubectl logs -n ate-system deploy/ate-controller --tail=0 -f | grep -i "deadline\|golden"

   # AT never leaves ResumeGoldenActor
   kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.phase}{"\n"}'

   # Golden actor never leaves STATUS_RESUMING
   GOLDEN_ID=$(kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.goldenActorID}')
   kubectl ate get actor "$GOLDEN_ID"
   ```

> **If the bug doesn't trigger on your bandwidth:** the cold-pull path completes inside 28s on fast corp LAN / warmed registry mirrors. Force it by throttling the kind node (`docker exec kind-control-plane tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms` — `apt-get install -y iproute2` first if `tc` is missing) or by swapping the container `image:` for something much larger (e.g. `nvcr.io/nvidia/pytorch:24.05-py3` at ~12GB compressed). Watch for at least 5 minutes before concluding it didn't reproduce — the loop occasionally escapes via bandwidth variance.

## Specifications

- Version: HEAD of `main` (#231 merged or unmerged; Bug A, the death loop described here, behaves identically either way).
- Platform: any cluster where the chosen image's cold-pull path exceeds ~28s. Reproduced on kind (single-node, control-plane only) on darwin/arm64.

## Proposed Fix

Decouple the Redis lock TTL from the workflow deadline via a heartbeat. Sketch:

1. **Add `RefreshLock(ctx, key, value, ttl) (bool, error)` to `store.Interface`.** Redis implementation is a CAS Lua script — `GET key == value ? PEXPIRE key ttl : 0` — mirroring the existing `ReleaseLock` script's shape at `cmd/ateapi/internal/store/ateredis/ateredis.go:577`. Returns `false` if we no longer own the lock so the caller can abort.

2. **Keep the Redis lock TTL short** — internal constant (~30s). This is the only thing that bounds how long a peer waits to retry an actor after a crashed `ateapi`.

3. **Make the workflow deadline a separate, operator-configurable knob.** New `--actor-workflow-deadline` pflag on `ateapi` with a 5-minute default — long enough for cold cluster pulls on typical bandwidth, short enough that hung workflows stay bounded for ops. This is what bounds a single Resume/Suspend.

4. **Spawn a heartbeat goroutine on lock acquire.** Refreshes the lock every `lockTTL/3` (~10s) for the full workflow duration. On `RefreshLock=false` or any Redis error (peer stole the lock, Redis blip), cancel the workflow ctx with a distinguishable cause (`errLostActorLock`) so in-flight workflow steps see `ctx.Err()` and tear down cleanly. Preserves the original mutual-exclusion invariant — only one workflow runs per actor at any instant.

5. **`releaseLock()` stops the heartbeat first** (waits for goroutine exit), then best-effort `ReleaseLock`.

### Customer-visible impact after fix

- Resume on a cold node completes in one call (taking however long the pull legitimately takes), instead of looping forever.
- Crash-recovery latency unchanged — peer failover stays bounded by the ~30s Redis lock TTL.
- A workflow torn down because its peer stole the lock returns a specific `errLostActorLock` cause, useful for triaging flapping `ateapi` replicas.

### Why this shape (not alternatives)

- *Just raise the 28s knob.* Rejected — operators do not want multi-minute failover delays after a crashed `ateapi`.
- *Heartbeat only, no server-side workflow deadline (workflow ctx inherits from caller).* Considered: `ate-controller` and other gRPC callers already carry their own ctx, so in the happy path the heartbeat alone is enough for large pulls to succeed. Rejected on defense-in-depth grounds — substrate does not configure gRPC keepalive, so a network partition or a hung-but-not-killed caller would leave the workflow running indefinitely on `ateapi`; and a workflow step that forgets to propagate ctx would have no upstream timeout to fall back on. The 5-minute server-side cap covers both cases at the cost of one config knob.
- *Move image pulls off the workflow path entirely* (background pull worker + status-polling RPC). The "right" long-term fix, but a much bigger redesign; should follow the on-disk/shared layer cache (`internal/memorypullcache/memorypullcache.go:47-58` TODO and `ategcs.ObjectStorage`). The heartbeat fix is small enough to ship now without prejudicing that work.

