Expected Behavior
User Facing: I should be able to pull a large image (such as one containing a large number of executables) and schedule it to a worker.
Technical: ResumeActor and SuspendActor should succeed for any actor whose image pull + extract + gVisor restore can reasonably complete, including cold pulls on slow networks, large container images, and first-time pulls on freshly scheduled worker pods.
The Redis lock TTL that serializes ResumeActor/SuspendActor on a single actor (so a crashed ateapi replica doesn't pin the actor on a stale lock) should be tunable independently of how long an individual workflow is allowed to run.
Actual Behavior
User Facing: Prior to #231, large images (specifically our test image) would eventually finish pulling in ~20 min, at the cost of RSS increasing in atelet, i.e. the worker successfully had the actor scheduled to it at the cost of overall system stability. This wasn't an acceptable tradeoff. Now the control plane is no longer at risk of OOM, but it still needs to pull larger images successfully.
Technical: ActorWorkflow.ResumeActor and SuspendActor derive the workflow context from the Redis lock TTL at cmd/ateapi/internal/controlapi/workflow.go:145 (and :174). acquireActorLock makes the workflow ctx expire at lockTTL − padding = 28s, conflating two unrelated concerns into a single knob:
(1) How long a single Resume/Suspend workflow is allowed to run. Bounded by what the slowest legitimate path (image pull + decompress + extract + gVisor restore) actually needs.
(2) How long a peer must wait to retry an actor after an ateapi replica crashes mid-workflow. Bounded by tolerable failover latency.
Raising the knob to fix (1) makes (2) unacceptably slow — a crashed ateapi would pin an individual actor for minutes. Lowering it to keep (2) fast breaks (1) — image pulls that can't complete inside 28s death-loop forever. There is no operator-facing knob to raise either.
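To make the coupling concrete, here is a minimal Go sketch of how one lock TTL ends up bounding both concerns. The names lockTTL, padding, workflowBudget, coupledContext, and fitsInBudget are hypothetical, and the 30s/2s values are assumed only because they reproduce the 28s figure above; the real acquireActorLock may be structured differently.

```go
package main

import (
	"context"
	"time"
)

// Assumed values: the issue only states that lockTTL - padding = 28s.
const (
	lockTTL = 30 * time.Second
	padding = 2 * time.Second
)

// workflowBudget returns how long a workflow may run when its deadline is
// derived directly from the lock TTL.
func workflowBudget(ttl, pad time.Duration) time.Duration {
	return ttl - pad
}

// coupledContext mimics the coupled design: the workflow context expires
// shortly before the lock would, so one constant controls both the workflow
// deadline and the failover latency.
func coupledContext(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(parent, workflowBudget(lockTTL, padding))
}

// fitsInBudget reports whether a step of the given duration can finish
// before the coupled context expires.
func fitsInBudget(step time.Duration) bool {
	return step <= workflowBudget(lockTTL, padding)
}
```

Under these assumptions, any pull that needs longer than 28s cannot succeed no matter how often it is retried, and the only way to give it more time is to raise lockTTL, which lengthens failover by the same amount.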
After #231 (the Bug B / RSS amplification fix on parent #230), the loop is no longer an OOM risk to atelet, but the user-visible symptom is unchanged:
kubectl ate get actor <golden-actor-id> stays STATUS_RESUMING.
kubectl get actortemplate ... -o jsonpath='{.status.phase}' stays PhaseResumeGoldenActor.
ate-controller logs spray DeadlineExceeded every ~30s.
The workflow makes no progress, and after #231 (Propagate ctx into memorypullcache image pulls) it is less likely to escape via the previously observed "incidental success" mechanism (bandwidth variance + warmed HTTP keep-alives), because cancelled requests now drop their underlying TCP connections cleanly instead of leaving them warm for the next retry.
Most commonly hit when:
Resuming an actor for the first time on a freshly scheduled worker pod.
Resuming on a fresh cluster (no upstream registry caching, atelet in-process pull cache empty).
ActorTemplate uses a large image (north of ~200MB compressed on non-LAN bandwidth).
Self-hosted or rate-limited image registries.
After a worker reboot or atelet restart (in-process pull cache lost).
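The ~200MB threshold can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate only: it assumes decimal megabytes, a perfectly utilized link, and zero time for decompress, extract, and gVisor restore, so real pulls will be slower. The function names and the workflowBudget constant are hypothetical.

```go
package main

import "time"

// workflowBudget is the 28s window quoted in this issue.
const workflowBudget = 28 * time.Second

// transferTime estimates how long it takes to move sizeMB megabytes
// (10^6 bytes) over a link of linkMbit megabits per second.
// It ignores latency, decompression, extraction, and restore.
func transferTime(sizeMB, linkMbit float64) time.Duration {
	seconds := sizeMB * 8 / linkMbit
	return time.Duration(seconds * float64(time.Second))
}

// minLinkMbit returns the sustained bandwidth (in Mbit/s) needed to move
// sizeMB megabytes within the given budget.
func minLinkMbit(sizeMB float64, budget time.Duration) float64 {
	return sizeMB * 8 / budget.Seconds()
}
```

With these assumptions, minLinkMbit(200, workflowBudget) is about 57 Mbit/s sustained just for the transfer, and transferTime(360, 10) is 288s, roughly ten times the available window.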
Steps to Reproduce the Problem
Identical underlying reproduction to #230; the death loop happens whether or not #231 is in place — what changes is whether atelet RSS amplifies during it.
Apply the WorkerPool + ActorTemplate. The container image is a ~360MB compressed / ~1.2GB rootfs sandbox image observed in the field; the container's command is irrelevant because the death loop happens during pull/extract, before the container ever starts:
No manual kubectl ate resume actor is needed — ate-controller calls ResumeActor against the golden-actor ID as part of the ActorTemplate reconcile, which is enough to exercise the bug.
Confirm the failure mode:
# ate-controller retries every ~30s with DeadlineExceeded indefinitely
kubectl logs -n ate-system deploy/ate-controller --tail=0 -f | grep -i "deadline\|golden"
# AT never leaves ResumeGoldenActor
kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.phase}{"\n"}'
# Golden actor never leaves STATUS_RESUMING
GOLDEN_ID=$(kubectl get actortemplate -n ate-repro repro-fat-image -o jsonpath='{.status.goldenActorID}')
kubectl ate get actor "$GOLDEN_ID"
If the bug doesn't trigger on your bandwidth: the cold-pull path completes inside 28s on fast corp LAN / warmed registry mirrors. Force it by throttling the kind node (docker exec kind-control-plane tc qdisc add dev eth0 root tbf rate 10mbit burst 32kbit latency 400ms; run apt-get install -y iproute2 first if tc is missing) or by swapping the container's image: value for something much larger (e.g. nvcr.io/nvidia/pytorch:24.05-py3 at ~12GB compressed). Watch for at least 5 minutes before concluding it didn't reproduce, since the loop occasionally escapes via bandwidth variance.
Platform: any cluster where the chosen image's cold-pull path exceeds ~28s. Reproduced on kind (single-node, control-plane only) on darwin/arm64.
Proposed Fix
Decouple the Redis lock TTL from the workflow deadline via a heartbeat. Sketch:
Add RefreshLock(ctx, key, value, ttl) (bool, error) to store.Interface. Redis implementation is a CAS Lua script — GET key == value ? PEXPIRE key ttl : 0 — mirroring the existing ReleaseLock script's shape at cmd/ateapi/internal/store/ateredis/ateredis.go:577. Returns false if we no longer own the lock so the caller can abort.
Keep the Redis lock TTL short — internal constant (~30s). This is the only thing that bounds how long a peer waits to retry an actor after a crashed ateapi.
Make the workflow deadline a separate, operator-configurable knob. New --actor-workflow-deadline pflag on ateapi with a 5-minute default — long enough for cold cluster pulls on typical bandwidth, short enough that hung workflows stay bounded for ops. This is what bounds a single Resume/Suspend.
Spawn a heartbeat goroutine on lock acquire. Refreshes the lock every lockTTL/3 (~10s) for the full workflow duration. On RefreshLock=false or any Redis error (peer stole the lock, Redis blip), cancel the workflow ctx with a distinguishable cause (errLostActorLock) so in-flight workflow steps see ctx.Err() and tear down cleanly. Preserves the original mutual-exclusion invariant — only one workflow runs per actor at any instant.
releaseLock() stops the heartbeat first (waits for goroutine exit), then best-effort ReleaseLock.
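The steps above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the repository's code: lockRefresher, memStore, startHeartbeat, and demoLostLock are hypothetical names, memStore replaces Redis with an in-memory map, the Lua text is only one plausible rendering of the compare-and-extend script, and the final best-effort ReleaseLock call is omitted. Only RefreshLock's signature and errLostActorLock come from the proposal.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// errLostActorLock is the cancellation cause named in the proposal.
var errLostActorLock = errors.New("lost actor lock")

// refreshLockScript sketches the compare-and-extend Lua: extend the TTL only
// if the key still holds our value. Illustrative, not taken from the repo.
const refreshLockScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("PEXPIRE", KEYS[1], ARGV[2])
end
return 0`

// lockRefresher is the single method the heartbeat needs from the store.
type lockRefresher interface {
	RefreshLock(ctx context.Context, key, value string, ttl time.Duration) (bool, error)
}

// memStore is a hypothetical in-memory stand-in for the Redis store.
type memStore struct {
	mu    sync.Mutex
	owner map[string]string
}

// RefreshLock reports whether value still owns key (the TTL is ignored here).
func (s *memStore) RefreshLock(_ context.Context, key, value string, _ time.Duration) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.owner[key] == value, nil
}

// steal simulates a peer replica taking over the lock.
func (s *memStore) steal(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.owner[key] = value
}

// startHeartbeat returns a workflow context bounded by deadline and a stop
// function. A goroutine refreshes the lock every ttl/3; if the refresh fails
// or ownership is lost, the context is cancelled with errLostActorLock.
func startHeartbeat(parent context.Context, st lockRefresher, key, value string, ttl, deadline time.Duration) (context.Context, func()) {
	timeoutCtx, cancelTimeout := context.WithTimeout(parent, deadline)
	ctx, cancelCause := context.WithCancelCause(timeoutCtx)
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(ttl / 3)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				ok, err := st.RefreshLock(ctx, key, value, ttl)
				if err != nil || !ok {
					cancelCause(errLostActorLock)
					return
				}
			}
		}
	}()
	// stop cancels the context and waits for the goroutine to exit, so the
	// caller can release the lock without racing a late refresh.
	stop := func() {
		cancelCause(context.Canceled)
		cancelTimeout()
		<-done
	}
	return ctx, stop
}

// demoLostLock shows a peer stealing the lock mid-workflow and returns the
// cause that in-flight steps would observe.
func demoLostLock() error {
	st := &memStore{owner: map[string]string{"actor:42": "replica-a"}}
	ctx, stop := startHeartbeat(context.Background(), st, "actor:42", "replica-a", 30*time.Millisecond, time.Second)
	defer stop()
	st.steal("actor:42", "replica-b")
	<-ctx.Done()
	return context.Cause(ctx)
}
```

Layering context.WithCancelCause on top of context.WithTimeout keeps the two bounds separate: the deadline fires with context.DeadlineExceeded, while a lost lock surfaces as errLostActorLock through context.Cause. The first cause set wins, so a later stop() does not overwrite it.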
Customer-visible impact after fix
Resume on a cold node completes in one call (taking however long the pull legitimately takes), instead of looping forever.
Crash-recovery latency unchanged — peer failover stays bounded by the ~30s Redis lock TTL.
A workflow torn down because its peer stole the lock returns a specific errLostActorLock cause, useful for triaging flapping ateapi replicas.
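As a sketch of how that distinguishable cause might be used for triage, a handler or logging hook could map the cancellation cause to a label. The label strings and the classifyCause/classify helpers are hypothetical; only errLostActorLock comes from the proposal, and the sketch assumes the workflow context is cancelled via Go's cause-carrying context API.

```go
package main

import (
	"context"
	"errors"
)

// errLostActorLock is the cancellation cause named in the proposal.
var errLostActorLock = errors.New("lost actor lock")

// classifyCause maps a workflow cancellation cause to a triage label.
func classifyCause(cause error) string {
	switch {
	case cause == nil:
		return "running"
	case errors.Is(cause, errLostActorLock):
		return "lost-lock"
	case errors.Is(cause, context.DeadlineExceeded):
		return "deadline-exceeded"
	default:
		return "canceled"
	}
}

// classify inspects a workflow context the way a handler or log hook might.
// context.Cause returns nil while the context is still live.
func classify(ctx context.Context) string {
	return classifyCause(context.Cause(ctx))
}
```

A sustained rate of "lost-lock" labels would point at flapping ateapi replicas or Redis instability, whereas "deadline-exceeded" would point at pulls that genuinely need more than the configured workflow deadline.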
Why this shape (not alternatives)
Just raise the 28s knob. Rejected — operators do not want multi-minute failover delays after a crashed ateapi.
Heartbeat only, no server-side workflow deadline (workflow ctx inherits from caller). Considered: ate-controller and other gRPC callers already carry their own ctx, so in the happy path the heartbeat alone is enough for large pulls to succeed. Rejected on defense-in-depth grounds — substrate does not configure gRPC keepalive, so a network partition or a hung-but-not-killed caller would leave the workflow running indefinitely on ateapi; and a workflow step that forgets to propagate ctx would have no upstream timeout to fall back on. The 5-minute server-side cap covers both cases at the cost of one config knob.
Move image pulls off the workflow path entirely (background pull worker + status-polling RPC). The "right" long-term fix, but a much bigger redesign; should follow the on-disk/shared layer cache (internal/memorypullcache/memorypullcache.go:47-58 TODO and ategcs.ObjectStorage). The heartbeat fix is small enough to ship now without prejudicing that work.
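For the server-side cap discussed in the second alternative, a minimal Go sketch could look like the following. It uses the standard library flag package as a stand-in for pflag, and parseWorkflowDeadline, boundedWorkflowContext, and defaultWorkflowDeadline are hypothetical names; only the --actor-workflow-deadline flag and its 5-minute default come from the proposal.

```go
package main

import (
	"context"
	"flag"
	"time"
)

// defaultWorkflowDeadline is the 5-minute default from the proposal.
const defaultWorkflowDeadline = 5 * time.Minute

// parseWorkflowDeadline parses --actor-workflow-deadline from args using the
// standard library flag package (the proposal uses pflag).
func parseWorkflowDeadline(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("ateapi", flag.ContinueOnError)
	d := fs.Duration("actor-workflow-deadline", defaultWorkflowDeadline,
		"maximum duration of a single ResumeActor/SuspendActor workflow")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *d, nil
}

// boundedWorkflowContext applies the server-side cap on top of the caller's
// context. context.WithTimeout keeps whichever deadline is earlier, so a
// caller with a shorter deadline is still honoured, and a caller with no
// deadline at all (or a hung caller) is still bounded by limit.
func boundedWorkflowContext(caller context.Context, limit time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(caller, limit)
}
```

Because the cap is applied to a child of the caller's context, it never extends a caller's own deadline; it only adds an upper bound where none exists.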
Specifications
main, with #231 (Propagate ctx into memorypullcache image pulls) merged or unmerged; Bug A behaves identically either way.
Cluster setup for the reproduction above: set up kind and install substrate, build the ateom-gvisor image into the kind-local registry and capture the resolved reference, then install metrics-server (needed for kubectl top on a single-node kind cluster):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
kubectl scale -n kube-system deploy/metrics-server --replicas=1
kubectl patch -n kube-system deploy metrics-server --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
kubectl rollout status -n kube-system deploy/metrics-server