Actor Pause/Resume flow #227

Open
Dmitry Berkovich (dberkov) wants to merge 7 commits into agent-substrate:main from dberkov:actor_pause

@dberkov Dmitry Berkovich (dberkov) commented Jun 12, 2026

Collaborator

Implements the PAUSED state for issue #119.

Snapshot files are kept locally on the node VM in a separate folder. At resume time, the scheduler uses a node VM hint to pick a worker on the same node where the files were stored at suspend time.

This local file management is temporary and will be replaced once Michelle Au (@msau42) introduces a new component to manage files on the node VM.

  • Tests pass
  • Manual tests with counter demo
>kubectl ate create actor my-counter-1 --template ate-demo-counter/counter
>curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" http://localhost:8000
hello from: 169.254.17.2 | preserved memory count: 1
>curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" http://localhost:8000
hello from: 169.254.17.2 | preserved memory count: 2
>kubectl ate pause actor my-counter-1 -o json
>curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" http://localhost:8000
hello from: 169.254.17.2 | preserved memory count: 3

Breaking change

This PR introduces a breaking change in the Actor proto. All existing actors need to be recreated before testing the PAUSE functionality.

Collaborator

Mostly nits inline.

  • Can we add coverage for this to the e2e tests? I think this should work on gvisor+kind and it would be nice to cover in CI. We could extend the demo counter e2e.
  • Worker.node_name is only populated when the syncer re-mirrors a pod, so workers created before this rollout won't satisfy the node restriction until re-synced.
  • I know the storage situation is temporary, but we have no GC for snapshots and these may fill local disk quickly. Maybe a TODO somewhere or a doc warning?

Comment thread internal/ateompath/ateompath.go Outdated
Comment thread cmd/ateapi/internal/controlapi/workflow_pause.go
Comment thread cmd/ateapi/internal/controlapi/workflow_pause.go
Comment thread cmd/atelet/main.go Outdated
Comment thread cmd/atelet/main.go Outdated
Comment thread cmd/ateapi/internal/controlapi/functional_test.go Outdated
Comment thread cmd/ateapi/internal/controlapi/functional_test.go Outdated
@dberkov

Collaborator Author

> Can we add coverage for this to the e2e tests? I think this should work on gvisor+kind and it would be nice to cover in CI. We could extend the demo counter e2e.

I have extended the existing e2e demo test to cover the PAUSE API.

> Worker.node_name is only populated when the syncer re-mirrors a pod, so workers created before this rollout won't satisfy the node restriction until re-synced.

Added a disclaimer in the PR description.

> I know the storage situation is temporary, but we have no GC for snapshots and these may fill local disk quickly. Maybe a TODO somewhere or a doc warning?

Added a TODO in the code.
