WIP: introduce ateom-cloud-hypervisor #239

Draft
Benjamin Elder (BenTheElder) wants to merge 14 commits into agent-substrate:main from BenTheElder:ateom-cloud-hypervisor

@BenTheElder

Collaborator

NOTE: This is currently a rough draft for visibility. It does not quite have everything how I'd want, even for MVP.

High Level:

  • New peer binary to ateom-gvisor
  • Cloud Hypervisor + Kata + virtiofsd
    • LF projects; richer functionality than Firecracker, built on the same underlying VMM crate from crosvm.
  • Support for development in kind IF the host has /dev/kvm.
    • Devs on Linux just need KVM enabled + docker. On macOS M2+ hardware you can run limactl start --nested-virt (see https://lima-vm.io/).
  • Actor Template supports "runtime" (i.e. ateom / sandbox selection) and a more generic asset specification (more on that below)

There are some pretty obvious TODOs (non-exhaustive):

  • Documentation [Will update before merge].
  • guest VM memory sizing (major perf implications ... but perhaps punted to iterate)
  • gvisor config is unnecessarily disjoint: a generic config is introduced, but I'm not totally sold on it, and gvisor + demos haven't been moved to it yet
  • gvisor and uVM assets should probably be pulled out of actor template to a shared resource admins can manage with separate RBAC? These feel awkward. The ActorTemplate needs to be coupled to a version of the sandbox, but maybe not all the finer details inline. (punted to iterate)
  • It depends on a recent virtiofsd build, and upstream builds lack arm64, so we have to build it ourselves, which is annoying. It does cache well though. Eventually we can use kata's builds.
  • update cmd/setup-gcp to optionally provision a nested virt pool (though also I think we should move this to another repo or something ... but as long as it exists ...)
  • e2e tests and CI: ... this is problematic, free GHA runners don't support nested virt AFAICT. We can probably at least get the "integration" tests running ...

Fixes #123

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

This PR is AI assisted. Everything will be completely vetted before converting from draft mode. (It is tested and I have taken a review pass, but sharing a bit early)

ateom-cloud-hypervisor speaks the containerd shim v2 task API directly over
ttrpc (no containerd daemon), so vendor the task/v2 protos, ttrpc, and their
transitive deps.
- ateompb: runtime_asset_paths on Run/Checkpoint/Restore requests;
  CheckpointWorkloadResponse.snapshot_files so ateom reports exactly the files
  it wrote (CH snapshots differ from gVisor's fixed set).
- ateletpb: RuntimeAsset/RuntimeAssetsConfig (runtime type, content-addressed
  asset map, authentication) on Run/Checkpoint/Restore requests.
WorkerPool.spec.runtime (gvisor|microvm, default gvisor) drives worker pod
shape; ActorTemplate.spec.runtime carries the runtime type plus the
content-addressed asset set (URL + sha256) atelet fetches at runtime, with
optional authentication. CRDs + deepcopy regenerated.
runtime=microvm worker pods get the /dev/kvm host device and are pinned to
nested-virt nodes via nodeSelector + toleration on ate.dev/runtime=microvm.
- Generalize the runsc fetch into content-addressed fetchRuntimeAssets (keyed
  asset map from the ActorTemplate, cached under the static-files dir), with
  per-template authentication selecting the storage client.
- Checkpoint uploads exactly the files ateom reported (plus manifest.json);
  Restore downloads the manifest first, then the listed files. gVisor's fixed
  three-file set is preserved via the same path.
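The manifest-driven checkpoint/restore flow described above can be sketched as follows. The manifest schema (a single "files" array) and the helper names are assumptions made for illustration; the PR only states that manifest.json is uploaded alongside the reported files and downloaded first on restore:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestName is the object uploaded alongside the snapshot files.
const manifestName = "manifest.json"

// manifest is a hypothetical schema: just the list of snapshot file names.
type manifest struct {
	Files []string `json:"files"`
}

// buildManifest serializes the file list that ateom reported for a checkpoint.
func buildManifest(snapshotFiles []string) ([]byte, error) {
	return json.Marshal(manifest{Files: snapshotFiles})
}

// uploadSet returns exactly the reported files plus the manifest itself.
func uploadSet(snapshotFiles []string) []string {
	out := append([]string{}, snapshotFiles...)
	return append(out, manifestName)
}

// filesToRestore parses a downloaded manifest and returns the files that
// must be fetched next.
func filesToRestore(manifestJSON []byte) ([]string, error) {
	var m manifest
	if err := json.Unmarshal(manifestJSON, &m); err != nil {
		return nil, fmt.Errorf("parse %s: %w", manifestName, err)
	}
	return m.Files, nil
}
```

Because the restore side learns the file list from the manifest, a runtime with a fixed set (gVisor) and one with a variable set (Cloud Hypervisor) can share the same upload and download path.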
A peer to cmd/ateom-gvisor implementing ateompb.Ateom with full suspend/resume:

- Run: drive containerd-shim-kata-v2 directly over ttrpc (no containerd
  daemon), foreground shim server, rendered configuration.toml pointing at
  runtime-fetched assets (fetch-not-bake). /run/kata-containers is made
  rshared at boot so kata's virtio-fs mounts/ -> shared/ propagation works
  inside a pod (runc roots are rprivate).
- Networking mirrors ateom-gvisor: pod eth0 moves into a per-pod interior
  netns; kata builds its tap + TC mirror there. Stale taps/qdiscs and
  leftover per-sandbox state/processes are cleaned before each run (the
  sandbox id is the actor id, so retries collide otherwise).
- Checkpoint: pause (idempotent via vm.info), capture the virtio-fs shared
  dir (nsenter into the shim mountns, or locally for restored actors), CH
  native snapshot, teardown; reports the snapshot file list.
- Restore: ateom-owned bare-CH relaunch. The snapshot's virtio-net is
  fd-backed, so restore recreates the tap + mirror and passes fresh tap FDs
  via vm.restore net_fds (SCM_RIGHTS over the api-socket). Snapshot socket
  paths are rewritten to the restoring actor's VMDir; a patched virtiofsd
  (find-paths migration mode) serves the reconstructed shared dir.
- Guest IP mobility: the snapshot freezes the source pod's IP, so restore
  re-addresses guest eth0 to the new pod's IP via the kata-agent
  UpdateInterface/UpdateRoutes ttrpc RPCs over hybrid vsock (agentpb is a
  minimal wire-compatible mirror of the kata 3.31 protos). The address is
  /32 + gateway routes so forwarding works on CNIs without host-veth
  proxy-ARP (GKE) as well as those with it (kind ptp).

Long-lived processes (shim, CH, virtiofsd) are deliberately started without
exec.CommandContext: gRPC cancels the request ctx when the handler returns,
which would SIGKILL them under a healthy actor.

Verified end-to-end on kind (arm64, cross-pod) and GKE (amd64 nested-virt
node pool, cross-node across per-node pod subnets): an in-RAM counter
continues counting across suspend -> object storage -> restore.
- create-kind-cluster.sh mounts /dev/kvm into the node (TEMP: always; make
  conditional on host support later) and labels it ate.dev/runtime=microvm.
- hack/microvm-assets: assemble.sh builds the per-arch asset set (kata
  static release pieces, cloud-hypervisor v52, virtiofsd built from source
  with the vhost 0.16 bump that fixes CH snapshot/restore) and prints the
  sha256s; stage-to-rustfs.sh uploads to the in-cluster bucket for kind.
WorkerPool (runtime=microvm) + ActorTemplate with the content-addressed kata/
cloud-hypervisor asset set; the in-RAM counter proves guest memory survives
suspend/resume.
Attach pause/capture/snapshot/teardown durations to the 'Actor checkpointed'
log line (and time the CH API shutdown in teardown). Measuring on GKE showed
the CH snapshot write dominates suspend latency and scales with guest memory
size: with kata's stock default_memory=2048 (MiB) the snapshot phase was ~6.8s on a
pd-balanced boot disk; a 512MiB guest (set purely via the fetched
configuration.toml asset) drops it to ~0.1-0.3s, taking end-to-end suspend
from ~18s to ~3.4s and resume from ~4.4s to ~1.6s.
Guest memory currently comes from the fetched configuration.toml asset's
default_memory; it should become an ActorTemplate field rewritten during
config rendering, since suspend/resume latency scales directly with it.
Probe the Docker environment (the provider VM on macOS, the host on Linux)
with a --device run; only then emit the node /dev/kvm extraMount, chmod it,
and label nodes ate.dev/runtime=microvm. Clusters without KVM now create
cleanly with gVisor-only support instead of failing.
GKE attaches no label or taint to nested-virt node pools (verified on a live
pool; GKE Sandbox, by contrast, automatically applies
sandbox.gke.io/runtime=gvisor), so there is no upstream label to mirror:
ate.dev/runtime=microvm stays our cross-platform convention until WorkerPool
node selection is configurable. The toleration covers operator-tainted
dedicated KVM pools.
Port the kata runtime to main's veth model (mirrors the updated
cmd/ateom-gvisor): a fresh per-activation veth pair with the worker side
(ateom0) staying in the pod netns and the peer renamed eth0 in the interior
netns at the stable actor address 169.254.17.2/30, plus nftables masquerade +
pod-IP:80 DNAT. kata's tcfilter consumes the interior netns exactly like a
CNI-provisioned container netns.

Because the actor address is now constant, a restored guest's frozen network
config is valid on every pod, which retires the whole post-restore re-IP
mechanism (kata-agent UpdateInterface client + wire-mirror protos) and the
eth0 move/scrape/restore machinery. One kata-specific delta from gVisor: the
host-side veth gets a FIXED MAC, because a CH snapshot freezes the guest's
ARP cache entry for the gateway (gVisor rebuilds its netstack on restore;
a full-VM snapshot does not).

Also adapts to upstream's TracingOptions (NoExporter removed).

Development

Successfully merging this pull request may close these issues.

[Feature] MicroVM support