Use a node-local overlayfs rootfs cache to eliminate per-restore untar during actor resume #228

@chenggui53

Description

Summary

During actor resume, nearly all wall-clock time is spent rebuilding the OCI rootfs from scratch (prepareOCIDirectory → untar), not restoring the checkpoint. This proposal builds on #166 and recommends overlayfs as the concrete first implementation: cache one extracted, read-only rootfs per immutable image digest on each node, and materialize each actor's bundle as a thin overlay mount instead of re-untarring the whole image.

Background / Problem

Today, every Restore call in atelet does the following for each container (pause + application), in cmd/atelet/oci.go (prepareOCIDirectory):

  1. os.RemoveAll(rootfs) — wipe the actor rootfs
  2. os.MkdirAll(rootfs)
  3. pullCache.Fetch(ref) — fetch the flattened image tar (the memorypullcache avoids re-pulling, but not re-extracting)
  4. untar(...) — extract the entire rootfs into the actor bundle

This repeats the full rootfs materialization on every resume, even when the same image digest has already been extracted on the same node many times before.

Observed cost (coding-agent-style image, consistent with #166):

  Step                                        Time
  prepareOCIDirectory (untar rootfs)          ~15–20 s
  Actual runsc restore (checkpoint restore)   ~268 ms

So rootfs extraction accounts for roughly 98–99% of resume latency; the checkpoint restore itself is a small fraction. This is especially painful for larger agent images (e.g. Node-based images) that bundle a lot of files.

There is currently no overlayfs, copy-on-write, reflink, or layer-dedup mechanism: the rootfs is a flat directory that is torn down and rebuilt on every resume.

Proposed Approach: node-local digest-keyed overlay

Change the scaling behavior from "every restore pays extraction cost" to "first restore of a digest on a node pays extraction; later restores pay only an overlay mount."

On first use of an image digest on a node

  • Pull + extract the flattened rootfs once into a node-local, read-only cache directory keyed by the image digest (not tag), e.g. BasePath/rootfs-cache//lower.
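A minimal sketch of the extract-once step, assuming a hypothetical rootfsCache type and an extract callback standing in for the existing fetch + untar path. The type, method names, and the "sha256-…" directory naming are illustrative, not existing atelet code. Extraction goes into a temporary sibling directory and is renamed into place, so a crash mid-extract never leaves a half-populated lower directory that later restores would treat as a hit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// rootfsCache is a hypothetical node-local cache of extracted rootfs trees,
// keyed by image digest. Names and layout are assumptions for illustration.
type rootfsCache struct {
	base string     // cache root, e.g. a rootfs-cache directory under BasePath
	mu   sync.Mutex // one global lock; a per-digest lock would allow more concurrency
}

// lowerDir maps a digest such as "sha256:abcd" to a single path component.
// A real implementation should validate the digest before using it in a path.
func (c *rootfsCache) lowerDir(digest string) string {
	return filepath.Join(c.base, strings.ReplaceAll(digest, ":", "-"), "lower")
}

// ensure returns the extracted rootfs for digest, calling extract only on a miss.
func (c *rootfsCache) ensure(digest string, extract func(dst string) error) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	lower := c.lowerDir(digest)
	if _, err := os.Stat(lower); err == nil {
		return lower, nil // cache hit: nothing to extract
	} else if !os.IsNotExist(err) {
		return "", err
	}

	parent := filepath.Dir(lower)
	if err := os.MkdirAll(parent, 0o755); err != nil {
		return "", err
	}
	// Extract into a temp dir on the same filesystem, then rename atomically.
	// Note: os.MkdirTemp creates the directory with mode 0700.
	tmp, err := os.MkdirTemp(parent, "extract-")
	if err != nil {
		return "", err
	}
	if err := extract(tmp); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}
	if err := os.Rename(tmp, lower); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}
	return lower, nil
}

// demo resolves the same digest twice and reports how often extraction ran.
func demo() (extractions int, err error) {
	base, err := os.MkdirTemp("", "rootfs-cache-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(base)

	c := &rootfsCache{base: base}
	extract := func(dst string) error {
		extractions++ // stands in for the real untar
		return os.WriteFile(filepath.Join(dst, "hello"), []byte("hi"), 0o644)
	}
	for i := 0; i < 2; i++ {
		if _, err := c.ensure("sha256:0123abcd", extract); err != nil {
			return extractions, err
		}
	}
	return extractions, nil
}

func main() {
	n, err := demo()
	fmt.Println("extractions:", n, "err:", err)
}
```

The second call to ensure finds the renamed directory and skips extraction, which is the "first restore pays, later restores don't" behavior described above.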

On every restore using the same digest

  • Instead of RemoveAll + untar, set up an overlay for the actor bundle's rootfs:
    • lowerdir = the cached, read-only extracted rootfs (shared, never mutated)
    • upperdir + workdir = per-actor, actor-private writable layers
  • Then proceed directly to runsc create + runsc restore.

This keeps the cached lower layer immutable and gives each actor an isolated writable layer, so container writes never pollute the shared cache.
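The per-restore step could look roughly like the sketch below. The overlayPaths type, the upper/work/rootfs directory names, and the example paths are assumptions made for illustration. The sketch shells out to mount(8) to stay standard-library only; a real implementation would more likely call the mount syscall directly. It needs root on Linux, and overlayfs requires upperdir and workdir to live on the same filesystem.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// overlayPaths holds the directories of one actor's overlay (hypothetical layout).
type overlayPaths struct {
	Lower  string // shared, read-only extracted rootfs from the cache
	Upper  string // actor-private writable layer
	Work   string // overlayfs scratch dir, same filesystem as Upper
	Merged string // mount point handed to the runtime as the bundle rootfs
}

// actorOverlay derives the per-actor directories under the actor bundle.
func actorOverlay(lower, bundle string) overlayPaths {
	return overlayPaths{
		Lower:  lower,
		Upper:  filepath.Join(bundle, "upper"),
		Work:   filepath.Join(bundle, "work"),
		Merged: filepath.Join(bundle, "rootfs"),
	}
}

// options renders the overlayfs mount options. Paths containing ',' or ':'
// would need escaping, which this sketch does not handle.
func (p overlayPaths) options() string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", p.Lower, p.Upper, p.Work)
}

// mount creates the per-actor directories and mounts the overlay at Merged.
func (p overlayPaths) mount() error {
	for _, d := range []string{p.Upper, p.Work, p.Merged} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	out, err := exec.Command("mount", "-t", "overlay", "overlay",
		"-o", p.options(), p.Merged).CombinedOutput()
	if err != nil {
		return fmt.Errorf("mount overlay: %v: %s", err, out)
	}
	return nil
}

// unmount detaches the overlay; the caller decides whether to keep Upper.
func (p overlayPaths) unmount() error {
	out, err := exec.Command("umount", p.Merged).CombinedOutput()
	if err != nil {
		return fmt.Errorf("umount overlay: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Example paths only; printing the options avoids needing root here.
	p := actorOverlay("/var/lib/example/rootfs-cache/sha256-0123/lower",
		"/var/lib/example/actors/actor-1")
	fmt.Println(p.options())
}
```

Only Upper and Work are per-actor state, so resetting an actor amounts to unmounting and removing those two directories; the cached Lower is never touched.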

Why overlayfs over the alternatives

  • reflink (FICLONE): cleanest semantically, but requires the underlying fs to support reflinks (XFS/Btrfs/overlay-on-supported-fs). Good fallback where available, but availability is environment-dependent.
  • hardlink: simplest to implement, but unsafe here: the actor rootfs is writable ([…]nly: false in oci.go), so a container writing a file would truncate the shared inode and corrupt the cache. It would only be safe combined with c[…] --overlay2), which we don't currently pass in cmd/ateom-gvisor/runsc.go.
  • overlayfs: standard containerd-snapshotter approach, isolates per-actor […] is natively understood by gVisor as an OCI overlay mount. Best balance of safety and portability for a first version.
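The hardlink hazard can be reproduced in a few lines. This is a standalone illustration with made-up file names, not code from the repository: two names linked to one inode, followed by an in-place truncating write through the "actor" name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hardlinkHazard links an "actor" file to a "cached" file, overwrites the
// actor file in place, and returns what the cached file contains afterwards.
func hardlinkHazard() (cachedAfter string, err error) {
	dir, err := os.MkdirTemp("", "hardlink-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	cached := filepath.Join(dir, "cached-file") // stands in for a file in the shared cache
	actor := filepath.Join(dir, "actor-file")   // stands in for the same file in an actor rootfs

	if err := os.WriteFile(cached, []byte("pristine"), 0o644); err != nil {
		return "", err
	}
	if err := os.Link(cached, actor); err != nil {
		return "", err
	}
	// os.WriteFile opens with O_TRUNC and writes in place, so it modifies
	// the shared inode rather than creating a new file.
	if err := os.WriteFile(actor, []byte("modified by actor"), 0o644); err != nil {
		return "", err
	}
	b, err := os.ReadFile(cached)
	return string(b), err
}

func main() {
	got, err := hardlinkHazard()
	fmt.Printf("cached file now contains %q (err: %v)\n", got, err)
}
```

The cached copy ends up with the actor's bytes, which is why hardlinks are only viable when writes are redirected to a separate layer.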

Compatibility with gVisor restore

This should be orthogonal to checkpoint restore. runsc restore -direct -[…]checkpoint memory image, which is independent of the rootfs directory. runsc restore -bundle only needs the rootfs path to contain correct content; whether that content was produced by untar, reflink, or an overlay mount is transparent to runsc. (We should still validate this end-to-end with gVisor as a first step.)

Open questions

  • Mount lifecycle: where to mount/unmount the per-actor overlay (during i[…]r reset / delete), and how to ensure clean teardown on crash or worker restart.
  • Eviction policy: cache should be bounded by total disk usage (LRU by size), rather than the current memorypullcache's count-only lru.New(256). What's an adequate first-version policy?
  • Large images / missing digest: memorypullcache.Fetch currently bypasses […]iB and only caches when a digest is present in the ref. The rootfs cache needs a reliable digest even on the bypass path.
  • Scope: implement entirely in atelet, or factor into a broader host-local […] (see Avoid re-untarring actor image rootfs on every restore #166)?
  • Observability: expose cache hit/miss and rootfs-materialization time on […] (currently only the image attribute).
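As one possible starting point for the eviction question, here is a sketch of an LRU bounded by total bytes instead of entry count. The sizeLRU type, its fields, and the onEvict callback are hypothetical. It deliberately ignores whether an evicted lower directory is still mounted by a live actor, which a real policy would have to track (for example with reference counts).

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached rootfs: its digest and on-disk size in bytes.
type entry struct {
	digest string
	size   int64
}

// sizeLRU evicts least-recently-used entries once total size exceeds maxBytes.
type sizeLRU struct {
	maxBytes int64
	used     int64
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // digest -> element in order
	onEvict  func(digest string)      // e.g. remove the cached directory
}

func newSizeLRU(maxBytes int64, onEvict func(string)) *sizeLRU {
	return &sizeLRU{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    make(map[string]*list.Element),
		onEvict:  onEvict,
	}
}

// touch records a use of digest with the given size, then evicts from the
// cold end until the cache fits. The entry just touched is never evicted,
// even if it alone exceeds the budget.
func (c *sizeLRU) touch(digest string, size int64) {
	if el, ok := c.items[digest]; ok {
		e := el.Value.(*entry)
		c.used += size - e.size
		e.size = size
		c.order.MoveToFront(el)
	} else {
		c.items[digest] = c.order.PushFront(&entry{digest: digest, size: size})
		c.used += size
	}
	for c.used > c.maxBytes && c.order.Len() > 1 {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.digest)
		c.used -= e.size
		if c.onEvict != nil {
			c.onEvict(e.digest)
		}
	}
}

func main() {
	c := newSizeLRU(100, func(d string) { fmt.Println("evict", d) })
	c.touch("sha256:aaa", 60)
	c.touch("sha256:bbb", 30)
	c.touch("sha256:aaa", 60) // refresh aaa so bbb becomes the coldest entry
	c.touch("sha256:ccc", 40) // pushes usage over budget; bbb is evicted
	fmt.Println("bytes in use:", c.used)
}
```

Sizes would come from measuring the extracted tree once at insertion time; recency is updated on every restore that hits the cache.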

Expected impact

For repeated resumes of the same image on a node, rootfs materialization […] overlay mount (sub-second), making the checkpoint restore (~268 ms) the dominant cost, which is the intended behavior.

Related: #166
