Summary
During actor resume, nearly all wall-clock time is spent rebuilding the OCI rootfs from scratch (prepareOCIDirectory → untar), not restoring the checkpoint. This proposal builds on #166 and recommends overlayfs as the concrete first implementation: cache one extracted, read-only rootfs per immutable image digest on each node, and materialize each actor's bundle as a thin overlay mount instead of re-untarring the whole image.
Background / Problem
Today, every Restore call in atelet does the following for each container (pause + application), in cmd/atelet/oci.go (prepareOCIDirectory):
os.RemoveAll(rootfs) — wipe the actor rootfs
os.MkdirAll(rootfs)
pullCache.Fetch(ref) — fetch the flattened image tar (the memorypullcache avoids re-pulling, but not re-extracting)
untar(...) — extract the entire rootfs into the actor bundle
This repeats the full rootfs materialization on every resume, even when the same image digest has already been extracted on the same node many times before.
Observed cost (coding-agent-style image, consistent with #166):
prepareOCIDirectory (untar rootfs) ~15–20 s
Actual runsc restore (checkpoint restore) ~268 ms
So rootfs extraction accounts for roughly 98–99% of resume latency (15–20 s out of about 15.3–20.3 s total); the checkpoint restore itself is negligible by comparison. This is especially painful for larger agent images (e.g. Node-based images) that bundle a lot of files.
There is currently no overlayfs, copy-on-write, reflink, or layer-dedup mechanism: the rootfs is a flat directory that is torn down and rebuilt on every resume.
Proposed Approach: node-local digest-keyed overlay
Change the scaling behavior from "every restore pays extraction cost" to "first restore of a digest on a node pays extraction; later restores pay only an overlay mount."
On first use of an image digest on a node
Pull + extract the flattened rootfs once into a node-local, read-only cache directory keyed by the image digest (not the tag), e.g. BasePath/rootfs-cache/<digest>/lower.
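The extract-once step could be sketched roughly as follows. This is a minimal illustration, not the atelet implementation: lowerDir, ensureLower, and the extract callback are hypothetical names, the callback stands in for the existing untar step, and replacing ':' with '-' in the directory name is an assumption made here because ':' is the lowerdir separator in overlay mount options.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// lowerDir maps an image digest to its cache directory, following the
// BasePath/rootfs-cache/<digest>/lower layout. Rewriting ':' to '-' is an
// assumption of this sketch.
func lowerDir(basePath, digest string) string {
	name := strings.ReplaceAll(digest, ":", "-")
	return filepath.Join(basePath, "rootfs-cache", name, "lower")
}

// ensureLower returns the cached rootfs for digest, extracting it on a miss.
// The bool result reports whether the call was a cache hit.
func ensureLower(basePath, digest string, extract func(dst string) error) (string, bool, error) {
	lower := lowerDir(basePath, digest)
	if _, err := os.Stat(lower); err == nil {
		return lower, true, nil // already extracted on this node
	}
	parent := filepath.Dir(lower)
	if err := os.MkdirAll(parent, 0o755); err != nil {
		return "", false, err
	}
	// Extract into a temporary sibling and rename it into place, so that
	// other restores never observe a partially extracted tree.
	tmp, err := os.MkdirTemp(parent, "lower.tmp-")
	if err != nil {
		return "", false, err
	}
	if err := extract(tmp); err != nil {
		os.RemoveAll(tmp)
		return "", false, err
	}
	if err := os.Rename(tmp, lower); err != nil {
		os.RemoveAll(tmp)
		// A concurrent caller may have populated the cache first.
		if _, statErr := os.Stat(lower); statErr == nil {
			return lower, true, nil
		}
		return "", false, err
	}
	return lower, false, nil
}

// demoCache resolves the same digest three times against a throwaway cache
// and counts how often the (fake) extraction actually runs.
func demoCache() (int, int, error) {
	base, err := os.MkdirTemp("", "rootfs-cache-demo-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(base)
	extractions, hits := 0, 0
	extract := func(dst string) error {
		extractions++
		return os.WriteFile(filepath.Join(dst, "hello"), []byte("hi"), 0o644)
	}
	for i := 0; i < 3; i++ {
		_, hit, err := ensureLower(base, "sha256:abc", extract)
		if err != nil {
			return extractions, hits, err
		}
		if hit {
			hits++
		}
	}
	return extractions, hits, nil
}

func main() {
	extractions, hits, err := demoCache()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("extractions=%d hits=%d\n", extractions, hits) // extractions=1 hits=2
}
```

Only the first call pays the extraction cost; later calls for the same digest return the existing directory. A real version would also need to set the final directory permissions and mark the tree read-only.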
On every restore using the same digest
Instead of RemoveAll + untar, set up an overlay for the actor bundle's rootfs:
lowerdir = the cached, read-only extracted rootfs (shared, never mutated)
Then proceed directly to runsc create + runsc restore.
This keeps the cached lower layer immutable and gives each actor an isolated writable layer, so container writes never pollute the shared cache.
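As a rough sketch of the per-actor overlay setup, the snippet below builds the overlay mount options for one actor. The upper and work directory locations under the bundle, and the names overlayDirs, actorOverlay, and overlayOptions, are assumptions made for illustration. It only builds and prints the options string; performing the mount requires privileges and is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// overlayDirs names the directories of one actor's overlay. Lower is the
// shared digest-keyed cache; Upper and Work are per-actor and writable.
type overlayDirs struct {
	Lower, Upper, Work, Merged string
}

// actorOverlay lays out the per-actor directories under the bundle directory.
// The overlay/upper and overlay/work names are hypothetical. overlayfs
// requires workdir to be an empty directory on the same filesystem as upperdir.
func actorOverlay(lower, bundleDir string) overlayDirs {
	return overlayDirs{
		Lower:  lower,
		Upper:  filepath.Join(bundleDir, "overlay", "upper"),
		Work:   filepath.Join(bundleDir, "overlay", "work"),
		Merged: filepath.Join(bundleDir, "rootfs"),
	}
}

// overlayOptions builds the data string for an overlay mount. Paths that
// contain ':' or ',' are rejected instead of escaped, to keep the sketch simple.
func overlayOptions(d overlayDirs) (string, error) {
	for _, p := range []string{d.Lower, d.Upper, d.Work} {
		if strings.ContainsAny(p, ":,") {
			return "", fmt.Errorf("path %q contains ':' or ',', which act as separators in overlay options", p)
		}
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", d.Lower, d.Upper, d.Work), nil
}

func main() {
	d := actorOverlay("/var/lib/ate/rootfs-cache/sha256-abc/lower", "/run/ate/actors/a1")
	opts, err := overlayOptions(d)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Shell equivalent of the mount that would materialize the actor rootfs.
	fmt.Printf("mount -t overlay overlay -o %s %s\n", opts, d.Merged)
}
```

Discarding an actor's state then amounts to unmounting the merged directory and removing its upper and work directories; the lower directory is never touched.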
Why overlayfs over the alternatives
reflink (FICLONE): cleanest semantically, but requires the underlying filesystem to support reflinks (XFS/Btrfs/overlay-on-supported-fs). Good fallback where available, but availability is environment-dependent.
hardlink: simplest to implement, but unsafe here: the actor rootfs is writable ([…]nly: false in oci.go), so a container writing a file would truncate the shared inode and corrupt the cache. It would only be safe combined with c[…] --overlay2), which we don't currently pass in cmd/ateom-gvisor/runsc.go.
overlayfs: standard containerd-snapshotter approach, isolates per-actor writes, and is natively understood by gVisor as an OCI overlay mount. Best balance of safety and portability for a first version.
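The hardlink hazard is easy to reproduce outside the runtime. The sketch below is a standalone demonstration with hypothetical file names: it hard-links a "cached" file into an "actor" location, overwrites the actor's copy in place, and reads the cached file back.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hardlinkLeak returns the content of the cached file after an in-place write
// through a hard link that was meant to be the actor's private copy.
func hardlinkLeak() (string, error) {
	dir, err := os.MkdirTemp("", "hardlink-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	cached := filepath.Join(dir, "cache-file")
	actor := filepath.Join(dir, "actor-file")
	if err := os.WriteFile(cached, []byte("pristine"), 0o644); err != nil {
		return "", err
	}
	// Both names now refer to the same inode.
	if err := os.Link(cached, actor); err != nil {
		return "", err
	}
	// os.WriteFile opens the existing file with O_TRUNC, so it rewrites the
	// shared inode rather than creating a new file for the actor.
	if err := os.WriteFile(actor, []byte("modified by actor"), 0o644); err != nil {
		return "", err
	}
	b, err := os.ReadFile(cached)
	return string(b), err
}

func main() {
	got, err := hardlinkLeak()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cached file now contains: %q\n", got) // "modified by actor"
}
```

With an overlay mount the same write would land in the actor's upper directory and leave the cached file unchanged.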
Compatibility with gVisor restore
This should be orthogonal to checkpoint restore. runsc restore -direct -[…] checkpoint memory image, which is independent of the rootfs directory. runsc restore -bundle only needs the rootfs path to contain correct content; whether that content was produced by untar, reflink, or an overlay mount is transparent to runsc. (We should still validate this end-to-end with gVisor as a first step.)
Open questions
Mount lifecycle: where to mount/unmount the per-actor overlay (during i[…]r reset / delete), and how to ensure clean teardown on crash or worker restart.
Eviction policy: cache should be bounded by total disk usage (LRU by size), rather than the current memorypullcache's count-only lru.New(256). What's an adequate first-version policy?
Large images / missing digest: memorypullcache.Fetch currently bypasses […]iB and only caches when a digest is present in the ref. The rootfs cache needs a reliable digest even on the bypass path.
Observability: expose cache hit/miss and rootfs-materialization time on […] (currently only the image attribute).
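For the eviction question, one possible first-version policy is a byte-bounded LRU over digests. The sketch below is only an illustration of that idea under stated assumptions: sizeLRU and its methods are hypothetical names, sizes are supplied by the caller, and it does not track whether a cached rootfs is still in use as the lower layer of a mounted overlay, which a real implementation would have to check before deleting anything.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	digest string
	size   int64
}

// sizeLRU bounds the cache by total bytes instead of entry count.
type sizeLRU struct {
	maxBytes, used int64
	order          *list.List // front = most recently used
	items          map[string]*list.Element
}

func newSizeLRU(maxBytes int64) *sizeLRU {
	return &sizeLRU{maxBytes: maxBytes, order: list.New(), items: map[string]*list.Element{}}
}

// Touch marks a digest as recently used and reports whether it is cached.
func (c *sizeLRU) Touch(digest string) bool {
	el, ok := c.items[digest]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Add records a digest with its on-disk size and returns the digests that
// should be deleted to get back under the byte budget. The entry just added
// is never evicted, even if it alone exceeds the budget.
func (c *sizeLRU) Add(digest string, size int64) []string {
	if el, ok := c.items[digest]; ok {
		e := el.Value.(*entry)
		c.used += size - e.size
		e.size = size
		c.order.MoveToFront(el)
	} else {
		c.items[digest] = c.order.PushFront(&entry{digest: digest, size: size})
		c.used += size
	}
	var evicted []string
	for c.used > c.maxBytes && c.order.Len() > 1 {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.digest)
		c.used -= e.size
		evicted = append(evicted, e.digest)
	}
	return evicted
}

func main() {
	c := newSizeLRU(100)
	c.Add("sha256:a", 40)
	c.Add("sha256:b", 40)
	c.Touch("sha256:a")                // a becomes most recently used
	fmt.Println(c.Add("sha256:c", 40)) // [sha256:b]
}
```

The caller would remove the returned digests' cache directories from disk after confirming that no actor still mounts them.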
Expected impact
For repeated resumes of the same image on a node, rootfs materialization […] overlay mount (sub-second), making the checkpoint restore (~268 ms) the dominant cost, which is the intended behavior.
Related #166