WIP: introduce ateom-cloud-hypervisor by BenTheElder · Pull Request #239 · agent-substrate/substrate

Benjamin Elder (BenTheElder) · 2026-06-12T23:47:35Z

NOTE: This is currently a rough draft for visibility. It is not quite where I'd want it yet, even for an MVP.

High Level:

- New peer binary to ateom-gvisor.
- Cloud Hypervisor + Kata + virtiofsd.
  - LF projects, richer functionality than Firecracker, same underlying VMM crate from crosvm.
- Support for development in kind IF the host has /dev/kvm.
  - Devs on Linux just need KVM enabled + Docker; on macOS with M2+ hardware you can run limactl start --nested-virt (see https://lima-vm.io/).
- ActorTemplate supports "runtime" (i.e. ateom / sandbox selection) and a more generic asset specification (more on that below).

There are some pretty obvious TODOs (non-exhaustive):

- Documentation [will update before merge].
- Guest VM memory sizing (major perf implications ... but perhaps punted to iterate).
- The gVisor config is unnecessarily disjoint: a generic config is introduced, but I'm not totally sold on it, and gVisor + demos haven't been moved to it yet.
- gVisor and uVM assets should probably be pulled out of the ActorTemplate into a shared resource that admins can manage with separate RBAC? These feel awkward. The ActorTemplate needs to be coupled to a version of the sandbox, but maybe not with all the finer details inline (punted to iterate).
- It depends on a recent virtiofsd build, and upstream builds lack arm64, so we have to build it ourselves, which is annoying. It does cache well though. Eventually we can use kata's builds.
- Update cmd/setup-gcp to optionally provision a nested-virt pool (though I also think we should move this to another repo or something ... but as long as it exists ...).
- e2e tests and CI: this is problematic, since free GHA runners don't support nested virt AFAICT. We can probably at least get the "integration" tests running ...

Fixes #123

Tests pass
Appropriate changes to documentation are included in the PR

This PR is AI assisted. Everything will be completely vetted before converting from draft mode. (It is tested and I have taken a review pass, but sharing a bit early)

ateom-cloud-hypervisor speaks the containerd shim v2 task API directly over ttrpc (no containerd daemon), so vendor the task/v2 protos, ttrpc, and their transitive deps.

- ateompb: runtime_asset_paths on Run/Checkpoint/Restore requests; CheckpointWorkloadResponse.snapshot_files so ateom reports exactly the files it wrote (CH snapshots differ from gVisor's fixed set).
- ateletpb: RuntimeAsset/RuntimeAssetsConfig (runtime type, content-addressed asset map, authentication) on Run/Checkpoint/Restore requests.

WorkerPool.spec.runtime (gvisor|microvm, default gvisor) drives worker pod shape; ActorTemplate.spec.runtime carries the runtime type plus the content-addressed asset set (URL + sha256) atelet fetches at runtime, with optional authentication. CRDs + deepcopy regenerated.
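
The "content-addressed" part of the asset set can be illustrated with a minimal sketch: the fetched bytes are only accepted if they hash to the sha256 recorded in the template, and the cache location is derived from the digest alone. The RuntimeAsset type, the cachePath layout, and the example URL are assumptions for illustration, not the PR's actual types or paths.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// RuntimeAsset is a hypothetical stand-in for one entry in the asset map:
// where to fetch the bytes and the digest they must hash to.
type RuntimeAsset struct {
	URL    string
	SHA256 string // lowercase hex digest
}

// verifyDigest returns an error unless data hashes to wantHex.
func verifyDigest(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if got != wantHex {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

// cachePath derives the on-disk location from the digest only, so two
// templates that name the same bytes share one cached copy.
// The directory layout is an assumption of this sketch.
func cachePath(cacheDir string, a RuntimeAsset) string {
	return filepath.Join(cacheDir, "sha256", a.SHA256)
}

func main() {
	a := RuntimeAsset{
		URL:    "https://example.invalid/assets/demo", // placeholder URL
		SHA256: "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
	}
	fmt.Println("cache path:", cachePath("/var/cache/assets", a))
	fmt.Println("good bytes:", verifyDigest([]byte("abc"), a.SHA256))
	fmt.Println("bad bytes: ", verifyDigest([]byte("abd"), a.SHA256))
}
```

Keying the cache by digest rather than by URL means a changed URL with unchanged content is still a cache hit, and a changed digest can never be served stale bytes.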

runtime=microvm worker pods get the /dev/kvm host device and are pinned to nested-virt nodes via nodeSelector + toleration on ate.dev/runtime=microvm.
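
As a rough sketch of the scheduling half of that pod shape, the selector and toleration can both be derived from the single ate.dev/runtime label. The Toleration and Placement types below are local stand-ins that mirror the relevant Kubernetes fields so the example has no dependencies; they are not the controller's real types, and leaving other runtimes unconstrained is an assumption.

```go
package main

import "fmt"

// Toleration mirrors the Kubernetes toleration fields that matter here.
// Local stand-in type, not the k8s.io/api one.
type Toleration struct {
	Key, Operator, Value, Effect string
}

// Placement is the scheduling part of a worker pod spec (stand-in type).
type Placement struct {
	NodeSelector map[string]string
	Tolerations  []Toleration
}

// runtimeLabel is the label/taint key named in the PR.
const runtimeLabel = "ate.dev/runtime"

// placementFor pins microvm pods to labelled nodes and lets them tolerate
// a matching taint. Other runtimes get no constraints (an assumption).
func placementFor(runtime string) Placement {
	if runtime != "microvm" {
		return Placement{}
	}
	return Placement{
		NodeSelector: map[string]string{runtimeLabel: runtime},
		// Effect is left empty, which in Kubernetes matches every taint effect.
		Tolerations: []Toleration{{Key: runtimeLabel, Operator: "Equal", Value: runtime}},
	}
}

func main() {
	fmt.Printf("microvm: %+v\n", placementFor("microvm"))
	fmt.Printf("gvisor:  %+v\n", placementFor("gvisor"))
}
```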

- Generalize the runsc fetch into content-addressed fetchRuntimeAssets (keyed asset map from the ActorTemplate, cached under the static-files dir), with per-template authentication selecting the storage client.
- Checkpoint uploads exactly the files ateom reported (plus manifest.json); Restore downloads the manifest first, then the listed files. gVisor's fixed three-file set is preserved via the same path.
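
The manifest-first restore flow can be sketched as a small round trip: checkpoint records the reported file list, and restore reads that list back before fetching anything else. The Manifest shape and the file names below are hypothetical; the PR does not show the actual manifest.json format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Manifest is a hypothetical shape for manifest.json: just the list of
// snapshot files the runtime reported writing.
type Manifest struct {
	Files []string `json:"files"`
}

// encodeManifest is the checkpoint side: serialize exactly the reported files.
func encodeManifest(snapshotFiles []string) ([]byte, error) {
	return json.Marshal(Manifest{Files: snapshotFiles})
}

// decodeManifest is the restore side: the manifest is read first, and its
// contents decide which objects are downloaded next.
func decodeManifest(raw []byte) ([]string, error) {
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("parse manifest: %w", err)
	}
	return m.Files, nil
}

func main() {
	// Placeholder names; a real runtime reports its own set.
	raw, err := encodeManifest([]string{"file-a", "file-b", "file-c"})
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	files, err := decodeManifest(raw)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	for _, f := range files {
		fmt.Println("would download:", f)
	}
}
```

Because the list travels with the snapshot, the restore path needs no per-runtime knowledge of file names, which is what lets a fixed file set and a variable one share the same code path.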

A peer to cmd/ateom-gvisor implementing ateompb.Ateom with full suspend/resume:

- Run: drive containerd-shim-kata-v2 directly over ttrpc (no containerd daemon), foreground shim server, rendered configuration.toml pointing at runtime-fetched assets (fetch-not-bake). /run/kata-containers is made rshared at boot so kata's virtio-fs mounts/ -> shared/ propagation works inside a pod (runc roots are rprivate).
- Networking mirrors ateom-gvisor: pod eth0 moves into a per-pod interior netns; kata builds its tap + TC mirror there. Stale taps/qdiscs and leftover per-sandbox state/processes are cleaned before each run (the sandbox id is the actor id, so retries collide otherwise).
- Checkpoint: pause (idempotent via vm.info), capture the virtio-fs shared dir (nsenter into the shim mountns, or locally for restored actors), CH native snapshot, teardown; reports the snapshot file list.
- Restore: ateom-owned bare-CH relaunch. The snapshot's virtio-net is fd-backed, so restore recreates the tap + mirror and passes fresh tap FDs via vm.restore net_fds (SCM_RIGHTS over the api-socket). Snapshot socket paths are rewritten to the restoring actor's VMDir; a patched virtiofsd (find-paths migration mode) serves the reconstructed shared dir.
- Guest IP mobility: the snapshot freezes the source pod's IP, so restore re-addresses guest eth0 to the new pod's IP via the kata-agent UpdateInterface/UpdateRoutes ttrpc RPCs over hybrid vsock (agentpb is a minimal wire-compatible mirror of the kata 3.31 protos). The address is /32 + gateway routes so forwarding works on CNIs without host-veth proxy-ARP (GKE) as well as those with it (kind ptp).

Long-lived processes (shim, CH, virtiofsd) are deliberately started without exec.CommandContext: gRPC cancels the request ctx when the handler returns, which would SIGKILL them under a healthy actor.

Verified end-to-end on kind (arm64, cross-pod) and GKE (amd64 nested-virt node pool, cross-node across per-node pod subnets): an in-RAM counter continues counting across suspend -> object storage -> restore.
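
The exec.CommandContext point is easy to miss, so here is a minimal sketch of the difference using only os/exec. The helper names are made up for illustration, and sleep stands in for the shim, CH, or virtiofsd; it assumes a sleep binary on PATH.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// requestScoped ties the child to ctx: once ctx is cancelled, os/exec
// kills the process (by default via Process.Kill).
func requestScoped(ctx context.Context, name string, args ...string) *exec.Cmd {
	return exec.CommandContext(ctx, name, args...)
}

// longLived has no context attached, so os/exec never kills it on its own;
// the caller owns shutdown explicitly.
func longLived(name string, args ...string) *exec.Cmd {
	return exec.Command(name, args...)
}

func main() {
	// ctx plays the role of the per-request context of an RPC handler.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	scoped := requestScoped(ctx, "sleep", "1")
	owned := longLived("sleep", "1")
	if err := scoped.Start(); err != nil {
		fmt.Println("start scoped:", err)
		return
	}
	if err := owned.Start(); err != nil {
		fmt.Println("start owned:", err)
		return
	}

	cancel() // the "handler" returns and the request context is cancelled

	// The request-scoped child is killed early; the long-lived one runs
	// to completion.
	fmt.Println("request-scoped child:", scoped.Wait())
	fmt.Println("long-lived child:    ", owned.Wait())
}
```

The trade-off is that nothing reaps a long-lived child automatically, so whoever starts it this way also has to own its teardown.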

- create-kind-cluster.sh mounts /dev/kvm into the node (TEMP: always; make conditional on host support later) and labels it ate.dev/runtime=microvm.
- hack/microvm-assets: assemble.sh builds the per-arch asset set (kata static release pieces, cloud-hypervisor v52, virtiofsd built from source with the vhost 0.16 bump that fixes CH snapshot/restore) and prints the sha256s; stage-to-rustfs.sh uploads to the in-cluster bucket for kind.

WorkerPool (runtime=microvm) + ActorTemplate with the content-addressed kata/cloud-hypervisor asset set; the in-RAM counter proves guest memory survives suspend/resume.

Attach pause/capture/snapshot/teardown durations to the 'Actor checkpointed' log line (and time the CH API shutdown in teardown). Measuring on GKE showed the CH snapshot write dominates suspend latency and scales with guest memory size: with kata's stock default_memory=2048 the snapshot phase was ~6.8s on a pd-balanced boot disk; a 512MiB guest (set purely via the fetched configuration.toml asset) drops it to ~0.1-0.3s, taking end-to-end suspend from ~18s to ~3.4s and resume from ~4.4s to ~1.6s.
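
One way to attach per-phase durations to a single log line is sketched below with log/slog. The phases type and its methods are hypothetical helpers, and the no-op phase bodies are placeholders; only the phase names and the log message come from the description above.

```go
package main

import (
	"log/slog"
	"time"
)

// phases records named durations in the order the phases ran.
type phases struct {
	names []string
	took  []time.Duration
}

// run executes fn and records how long it took, even if fn fails.
func (p *phases) run(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	p.names = append(p.names, name)
	p.took = append(p.took, time.Since(start))
	return err
}

// attrs turns the recorded durations into slog attributes.
func (p *phases) attrs() []any {
	out := make([]any, 0, len(p.names))
	for i, n := range p.names {
		out = append(out, slog.Duration(n, p.took[i]))
	}
	return out
}

func main() {
	var p phases
	for _, name := range []string{"pause", "capture", "snapshot", "teardown"} {
		// Placeholder work; a real checkpoint would do the phase here.
		if err := p.run(name, func() error { return nil }); err != nil {
			slog.Error("checkpoint phase failed", "phase", name, "err", err)
			return
		}
	}
	slog.Info("Actor checkpointed", p.attrs()...)
}
```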

Guest memory currently comes from the fetched configuration.toml asset's default_memory; it should become an ActorTemplate field rewritten during config rendering, since suspend/resume latency scales directly with it.

Probe the Docker environment (the provider VM on macOS, the host on Linux) with a --device run; only then emit the node /dev/kvm extraMount, chmod it, and label nodes ate.dev/runtime=microvm. Clusters without KVM now create cleanly with gVisor-only support instead of failing.
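
The probe itself lives in a shell script, but the idea carries over to a short Go sketch: ask Docker to start a throwaway container with /dev/kvm attached and treat success as "KVM is usable". The image name and helper names are assumptions; the PR does not say which image the probe runs.

```go
package main

import (
	"fmt"
	"os/exec"
)

// kvmProbeArgs builds the docker arguments for a throwaway container that
// only starts if /dev/kvm can be passed through as a device.
func kvmProbeArgs(image string) []string {
	return []string{"run", "--rm", "--device", "/dev/kvm", image, "true"}
}

// dockerHasKVM reports whether the probe container ran successfully.
// Any failure (no docker, no /dev/kvm, image pull error) counts as "no".
func dockerHasKVM(image string) bool {
	return exec.Command("docker", kvmProbeArgs(image)...).Run() == nil
}

func main() {
	// "busybox" is an illustrative image choice, not taken from the PR.
	if dockerHasKVM("busybox") {
		fmt.Println("KVM available: mount /dev/kvm and label nodes ate.dev/runtime=microvm")
	} else {
		fmt.Println("no KVM: create a gVisor-only cluster")
	}
}
```

Probing through Docker rather than checking the local filesystem matters on macOS, where the device has to exist inside the provider VM and not on the host.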

GKE attaches no label or taint to nested-virt node pools (verified on a live pool — unlike GKE Sandbox's automatic sandbox.gke.io/runtime=gvisor), so there is no upstream label to mirror: ate.dev/runtime=microvm stays our cross-platform convention until WorkerPool node selection is configurable. The toleration covers operator-tainted dedicated KVM pools.

Port the kata runtime to main's veth model (mirrors the updated cmd/ateom-gvisor): a fresh per-activation veth pair with the worker side (ateom0) staying in the pod netns and the peer renamed eth0 in the interior netns at the stable actor address 169.254.17.2/30, plus nftables masquerade + pod-IP:80 DNAT. kata's tcfilter consumes the interior netns exactly like a CNI-provisioned container netns. Because the actor address is now constant, a restored guest's frozen network config is valid on every pod, which retires the whole post-restore re-IP mechanism (kata-agent UpdateInterface client + wire-mirror protos) and the eth0 move/scrape/restore machinery. One kata-specific delta from gVisor: the host-side veth gets a FIXED MAC, because a CH snapshot freezes the guest's ARP cache entry for the gateway (gVisor rebuilds its netstack on restore; a full-VM snapshot does not). Also adapts to upstream's TracingOptions (NoExporter removed).
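
A quick worked example of the addressing, using net/netip: a /30 holds four addresses, of which two are usable hosts, so 169.254.17.2/30 leaves exactly one peer address. That the worker side (ateom0) takes that peer address is a guess here; the description only states the actor side.

```go
package main

import (
	"fmt"
	"net/netip"
)

// usableHosts lists the addresses of a small IPv4 prefix, excluding the
// network and broadcast addresses. It refuses anything wider than /24 so
// the sketch never enumerates a huge range.
func usableHosts(cidr string) ([]netip.Addr, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	if !p.Addr().Is4() || p.Bits() < 24 {
		return nil, fmt.Errorf("expected an IPv4 prefix of /24 or narrower, got %s", cidr)
	}
	p = p.Masked() // 169.254.17.2/30 -> 169.254.17.0/30
	var all []netip.Addr
	for a := p.Addr(); p.Contains(a); a = a.Next() {
		all = append(all, a)
	}
	if len(all) < 3 {
		return nil, fmt.Errorf("%s has no usable host range", cidr)
	}
	return all[1 : len(all)-1], nil
}

func main() {
	hosts, err := usableHosts("169.254.17.2/30")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range hosts {
		fmt.Println(h, "link-local:", h.IsLinkLocalUnicast())
	}
}
```

Since both ends sit in the link-local range and never change, nothing in the guest's frozen network state refers to a pod-specific address.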

Benjamin Elder (BenTheElder) added 14 commits June 12, 2026 12:54

vendor: add containerd task v2 API + ttrpc for driving the kata shim

f4c91ad

controller: micro-VM worker pod shape

808a9a1

ateapi: thread runtime assets into suspend/resume workflows

3e3977b

demos: micro-VM counter template

a33b031

ateom-cloud-hypervisor: TODO for first-class guest memory sizing

6845d45

Benjamin Elder (BenTheElder) mentioned this pull request Jun 13, 2026

[Feature] Implementation of Actor State Machine: Lifecycle and Transitions #119

Open

Benjamin Elder (BenTheElder) wants to merge 14 commits into agent-substrate:main from BenTheElder:ateom-cloud-hypervisor
