WIP: introduce ateom-cloud-hypervisor #239
Draft
Benjamin Elder (BenTheElder) wants to merge 14 commits into
ateom-cloud-hypervisor speaks the containerd shim v2 task API directly over ttrpc (no containerd daemon), so vendor the task/v2 protos, ttrpc, and their transitive deps.
- ateompb: runtime_asset_paths on Run/Checkpoint/Restore requests; CheckpointWorkloadResponse.snapshot_files so ateom reports exactly the files it wrote (CH snapshots differ from gVisor's fixed set).
- ateletpb: RuntimeAsset/RuntimeAssetsConfig (runtime type, content-addressed asset map, authentication) on Run/Checkpoint/Restore requests.
WorkerPool.spec.runtime (gvisor|microvm, default gvisor) drives worker pod shape; ActorTemplate.spec.runtime carries the runtime type plus the content-addressed asset set (URL + sha256) atelet fetches at runtime, with optional authentication. CRDs + deepcopy regenerated.
runtime=microvm worker pods get the /dev/kvm host device and are pinned to nested-virt nodes via nodeSelector + toleration on ate.dev/runtime=microvm.
- Generalize the runsc fetch into content-addressed fetchRuntimeAssets (keyed asset map from the ActorTemplate, cached under the static-files dir), with per-template authentication selecting the storage client.
- Checkpoint uploads exactly the files ateom reported (plus manifest.json); Restore downloads the manifest first, then the listed files. gVisor's fixed three-file set is preserved via the same path.
A peer to cmd/ateom-gvisor implementing ateompb.Ateom with full suspend/resume:
- Run: drive containerd-shim-kata-v2 directly over ttrpc (no containerd daemon), foreground shim server, rendered configuration.toml pointing at runtime-fetched assets (fetch-not-bake). /run/kata-containers is made rshared at boot so kata's virtio-fs mounts/->shared/ propagation works inside a pod (runc roots are rprivate).
- Networking mirrors ateom-gvisor: pod eth0 moves into a per-pod interior netns; kata builds its tap + TC mirror there. Stale taps/qdiscs and leftover per-sandbox state/processes are cleaned before each run (the sandbox id is the actor id, so retries collide otherwise).
- Checkpoint: pause (idempotent via vm.info), capture the virtio-fs shared dir (nsenter into the shim mountns, or locally for restored actors), CH native snapshot, teardown; reports the snapshot file list.
- Restore: ateom-owned bare-CH relaunch. The snapshot's virtio-net is fd-backed, so restore recreates the tap + mirror and passes fresh tap FDs via vm.restore net_fds (SCM_RIGHTS over the api-socket). Snapshot socket paths are rewritten to the restoring actor's VMDir; a patched virtiofsd (find-paths migration mode) serves the reconstructed shared dir.
- Guest IP mobility: the snapshot freezes the source pod's IP, so restore re-addresses guest eth0 to the new pod's IP via the kata-agent UpdateInterface/UpdateRoutes ttrpc RPCs over hybrid vsock (agentpb is a minimal wire-compatible mirror of the kata 3.31 protos). The address is /32 + gateway routes so forwarding works on CNIs without host-veth proxy-ARP (GKE) as well as those with it (kind ptp).

Long-lived processes (shim, CH, virtiofsd) are deliberately started without exec.CommandContext: gRPC cancels the request ctx when the handler returns, which would SIGKILL them under a healthy actor.
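The exec.CommandContext point can be demonstrated with a small, self-contained sketch. It is not the PR's code: `sleep 30` stands in for a long-lived child such as the shim, and the cancelled context stands in for the request ctx that gRPC cancels when the handler returns. The helper name `aliveAfterCancel` is made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// aliveAfterCancel starts a stand-in long-lived child, cancels the
// "request" context, and reports whether the child is still running
// shortly afterwards. With bound=true the child is tied to the context
// via exec.CommandContext; with bound=false it is started with plain
// exec.Command and is unaffected by cancellation.
func aliveAfterCancel(bound bool) (bool, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var cmd *exec.Cmd
	if bound {
		cmd = exec.CommandContext(ctx, "sleep", "30")
	} else {
		cmd = exec.Command("sleep", "30")
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}

	cancel() // simulates the handler returning

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case <-done:
		return false, nil // the child exited after cancellation
	case <-time.After(500 * time.Millisecond):
		_ = cmd.Process.Kill() // clean up the demo child
		<-done
		return true, nil
	}
}

func main() {
	for _, bound := range []bool{true, false} {
		alive, err := aliveAfterCancel(bound)
		fmt.Printf("bound to ctx=%v alive after cancel=%v err=%v\n", bound, alive, err)
	}
}
```

A process started this way has to be stopped explicitly, so the teardown path owns its lifetime rather than the RPC that launched it.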
Verified end-to-end on kind (arm64, cross-pod) and GKE (amd64 nested-virt node pool, cross-node across per-node pod subnets): an in-RAM counter continues counting across suspend -> object storage -> restore.
- create-kind-cluster.sh mounts /dev/kvm into the node (TEMP: always; make conditional on host support later) and labels it ate.dev/runtime=microvm.
- hack/microvm-assets: assemble.sh builds the per-arch asset set (kata static release pieces, cloud-hypervisor v52, virtiofsd built from source with the vhost 0.16 bump that fixes CH snapshot/restore) and prints the sha256s; stage-to-rustfs.sh uploads to the in-cluster bucket for kind.
WorkerPool (runtime=microvm) + ActorTemplate with the content-addressed kata/cloud-hypervisor asset set; the in-RAM counter proves guest memory survives suspend/resume.
Attach pause/capture/snapshot/teardown durations to the 'Actor checkpointed' log line (and time the CH API shutdown in teardown). Measuring on GKE showed the CH snapshot write dominates suspend latency and scales with guest memory size: with kata's stock default_memory=2048 the snapshot phase was ~6.8s on a pd-balanced boot disk; a 512MiB guest (set purely via the fetched configuration.toml asset) drops it to ~0.1-0.3s, taking end-to-end suspend from ~18s to ~3.4s and resume from ~4.4s to ~1.6s.
Guest memory currently comes from the fetched configuration.toml asset's default_memory; it should become an ActorTemplate field rewritten during config rendering, since suspend/resume latency scales directly with it.
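A rough sketch of what rewriting default_memory during config rendering could look like is below. It assumes the rendered configuration.toml is handled as text and contains an uncommented `default_memory = N` line; the function name and the idea of passing the value in MiB from an ActorTemplate field are illustrative, not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// defaultMemoryRE matches an uncommented default_memory assignment and
// captures everything up to the numeric value.
var defaultMemoryRE = regexp.MustCompile(`(?m)^([ \t]*default_memory[ \t]*=[ \t]*)\d+`)

// setDefaultMemory returns config with default_memory set to mib.
// It fails rather than silently leaving the stock value in place.
func setDefaultMemory(config string, mib int) (string, error) {
	if mib <= 0 {
		return "", errors.New("guest memory must be positive")
	}
	if !defaultMemoryRE.MatchString(config) {
		return "", errors.New("no default_memory setting found")
	}
	return defaultMemoryRE.ReplaceAllString(config, "${1}"+strconv.Itoa(mib)), nil
}

func main() {
	in := "[hypervisor.clh]\ndefault_memory = 2048\n"
	out, err := setDefaultMemory(in, 512)
	fmt.Printf("%q %v\n", out, err)
}
```

A text-level rewrite keeps the rest of the fetched asset byte-for-byte intact; parsing and re-serializing the TOML would be the alternative if more fields need to be templated.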
Probe the Docker environment (the provider VM on macOS, the host on Linux) with a --device run; only then emit the node /dev/kvm extraMount, chmod it, and label nodes ate.dev/runtime=microvm. Clusters without KVM now create cleanly with gVisor-only support instead of failing.
GKE attaches no label or taint to nested-virt node pools (verified on a live pool — unlike GKE Sandbox's automatic sandbox.gke.io/runtime=gvisor), so there is no upstream label to mirror: ate.dev/runtime=microvm stays our cross-platform convention until WorkerPool node selection is configurable. The toleration covers operator-tainted dedicated KVM pools.
Port the kata runtime to main's veth model (mirrors the updated cmd/ateom-gvisor): a fresh per-activation veth pair with the worker side (ateom0) staying in the pod netns and the peer renamed eth0 in the interior netns at the stable actor address 169.254.17.2/30, plus nftables masquerade + pod-IP:80 DNAT. kata's tcfilter consumes the interior netns exactly like a CNI-provisioned container netns. Because the actor address is now constant, a restored guest's frozen network config is valid on every pod, which retires the whole post-restore re-IP mechanism (kata-agent UpdateInterface client + wire-mirror protos) and the eth0 move/scrape/restore machinery. One kata-specific delta from gVisor: the host-side veth gets a FIXED MAC, because a CH snapshot freezes the guest's ARP cache entry for the gateway (gVisor rebuilds its netstack on restore; a full-VM snapshot does not). Also adapts to upstream's TracingOptions (NoExporter removed).
NOTE: This is currently a rough draft for visibility. It does not quite have everything how I'd want, even for MVP.
High Level:
- kind IF the host has /dev/kvm.
- limactl start --nested-virt (https://lima-vm.io/)

There are some pretty obvious TODOs (non-exhaustive):
Fixes #123
This PR is AI assisted. Everything will be completely vetted before converting from draft mode. (It is tested and I have taken a review pass, but sharing a bit early)