Cf gvisor deferred start #504

Draft
vlast3k wants to merge 14 commits into cloudfoundry:main from vlast3k:cf-gvisor-deferred-start

@vlast3k vlast3k commented May 27, 2026

Summary

Backward Compatibility

Breaking Change? Yes/No

vlast3k added 13 commits May 27, 2026 10:18
gVisor network=sandbox requires the netns to be configured BEFORE
task.Start() because setupNetwork() reads the netns at start time
and mirrors it into netstack. Previously, Start was called before
silk-cni configured networking, leaving gVisor with an empty netstack.

Split the lifecycle: Create only boots the sandbox (runsc create),
then Start is called after Networker.Network() configures the netns.
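
The ordering described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Runtime` and `Networker` interfaces, the `recorder` type, and `createContainer` are hypothetical stand-ins whose only purpose is to show that networking is configured between Create and Start.

```go
package main

import "fmt"

// Runtime and Networker are hypothetical, simplified stand-ins for the
// real guardian interfaces; only the call ordering matters here.
type Runtime interface {
	Create(handle string) error
	Start(handle string) error
}

type Networker interface {
	Network(handle string) error
}

// recorder implements both interfaces and remembers the call order.
type recorder struct{ calls []string }

func (r *recorder) Create(string) error  { r.calls = append(r.calls, "create"); return nil }
func (r *recorder) Network(string) error { r.calls = append(r.calls, "network"); return nil }
func (r *recorder) Start(string) error   { r.calls = append(r.calls, "start"); return nil }

// createContainer boots the sandbox, configures the netns, and only then
// starts the task, so the runtime sees a populated netns at start time.
func createContainer(rt Runtime, nw Networker, handle string) error {
	if err := rt.Create(handle); err != nil {
		return fmt.Errorf("create: %w", err)
	}
	if err := nw.Network(handle); err != nil {
		return fmt.Errorf("network: %w", err)
	}
	if err := rt.Start(handle); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	return nil
}

func main() {
	rec := &recorder{}
	if err := createContainer(rec, rec, "demo"); err != nil {
		panic(err)
	}
	fmt.Println(rec.calls)
}
```

For a runtime such as runc, where Start has nothing left to do, the same sequence still works as long as Start is a no-op.
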

Wire a --containerd-runtime-type flag through to the containerd client
and Nerd layer. When set to a non-runc runtime (e.g. io.containerd.runsc.v1),
Create skips runc-specific task options (WithNoNewKeyring, WithUIDAndGID)
that runsc rejects, and nils spec.Windows which gVisor does not support.

Also adds a no-op Start() to RunRunc for OCIRuntime interface compliance
after the deferred-start split.
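
A rough sketch of the runtime-type gate, under stated assumptions: the option names are plain strings standing in for the real containerd task options, and treating an empty runtime type as runc is an assumption of this sketch rather than something the commit message states.

```go
package main

import "fmt"

// defaultRuncRuntime is the runc shim identifier named in this PR.
const defaultRuncRuntime = "io.containerd.runc.v2"

// isNonRuncRuntime reports whether a runtime other than runc is configured.
// Assumption: an unset (empty) runtime type means the runc default.
func isNonRuncRuntime(runtimeType string) bool {
	return runtimeType != "" && runtimeType != defaultRuncRuntime
}

// taskOptionNames returns the names of the runc-specific task options to
// apply. The strings are placeholders for the real option values.
func taskOptionNames(runtimeType string) []string {
	if isNonRuncRuntime(runtimeType) {
		return nil // runsc rejects these options, so skip them
	}
	return []string{"WithNoNewKeyring", "WithUIDAndGID"}
}

func main() {
	for _, rt := range []string{"", defaultRuncRuntime, "io.containerd.runsc.v1"} {
		fmt.Printf("%q -> %v\n", rt, taskOptionNames(rt))
	}
}
```
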

For gVisor containers, the sentry process PID on the host does not expose
the container filesystem via /proc/<pid>/root (the sentry runs in its own
mount namespace). Look up users against bundle.Spec.Root.Path (the host-side
rootfs) which is always accessible regardless of runtime type.
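
To illustrate the idea of resolving users from the host-side rootfs, here is a minimal sketch that parses `etc/passwd` under a given root path. `lookupUser` and `writeDemoRootfs` are hypothetical helpers written for this example; the PR's actual lookup code may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// lookupUser resolves a username to uid/gid by reading etc/passwd under
// rootfsPath (for example the value of bundle.Spec.Root.Path).
func lookupUser(rootfsPath, username string) (int, int, error) {
	f, err := os.Open(filepath.Join(rootfsPath, "etc", "passwd"))
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// passwd format: name:password:uid:gid:gecos:home:shell
		fields := strings.Split(scanner.Text(), ":")
		if len(fields) < 4 || fields[0] != username {
			continue
		}
		uid, err := strconv.Atoi(fields[2])
		if err != nil {
			return 0, 0, fmt.Errorf("bad uid for %q: %w", username, err)
		}
		gid, err := strconv.Atoi(fields[3])
		if err != nil {
			return 0, 0, fmt.Errorf("bad gid for %q: %w", username, err)
		}
		return uid, gid, nil
	}
	if err := scanner.Err(); err != nil {
		return 0, 0, err
	}
	return 0, 0, fmt.Errorf("user %q not found under %s", username, rootfsPath)
}

// writeDemoRootfs creates a throwaway rootfs containing a small etc/passwd.
func writeDemoRootfs() (string, error) {
	rootfs, err := os.MkdirTemp("", "rootfs-")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Join(rootfs, "etc"), 0o755); err != nil {
		return "", err
	}
	passwd := "root:x:0:0:root:/root:/bin/sh\nvcap:x:2000:2000::/home/vcap:/bin/sh\n"
	err = os.WriteFile(filepath.Join(rootfs, "etc", "passwd"), []byte(passwd), 0o644)
	return rootfs, err
}

func main() {
	rootfs, err := writeDemoRootfs()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(rootfs)

	uid, gid, err := lookupUser(rootfs, "vcap")
	fmt.Println(uid, gid, err)
}
```
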

Non-runc runtimes (e.g. runsc/gVisor) do not write state.json to the runc
root directory. On cgroups v2 the memory.use_hierarchy knob does not exist
either. Treat a missing state.json as a no-op instead of returning an error
that would block container creation.
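
The "missing file is a no-op" behavior can be sketched like this. The function name `readStateIfPresent` and the `<root>/<handle>/state.json` layout are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readStateIfPresent returns the contents of state.json, or (nil, nil) when
// the file does not exist, so callers can skip the step instead of failing.
// Assumption: the file lives at <runcRoot>/<handle>/state.json.
func readStateIfPresent(runcRoot, handle string) ([]byte, error) {
	data, err := os.ReadFile(filepath.Join(runcRoot, handle, "state.json"))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // non-runc runtime: nothing to adjust
	}
	if err != nil {
		return nil, fmt.Errorf("reading state.json: %w", err)
	}
	return data, nil
}

func main() {
	root, err := os.MkdirTemp("", "runc-root-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	data, err := readStateIfPresent(root, "no-such-container")
	fmt.Println(data == nil, err)
}
```

Only a not-exist error is swallowed here; other failures, such as a permission error, are still reported.
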

nstar uses nsenter to enter the container mount namespace, which does not
work with gVisor because the sentry PID on the host has a different mount
namespace that does not expose the container filesystem.

For non-runc runtimes, use runtime.Exec to run tar inside the container
through the containerd -> shim -> runsc exec path. This goes through the
sentry and keeps its dentry cache consistent with the filesystem state.

StreamIn: exec /bin/tar -xf - -C <path> with TarStream on stdin.
StreamOut: exec /bin/tar -cf - -C <source> <path> with stdout piped back.

Detection is based on the configured runtimeType (from the
--containerd-runtime-type flag) rather than runtime probing.
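
As an illustration of the two tar invocations, the sketch below builds the argument vectors. How the source path is split into `<source>` and `<path>` is not stated in the commit message; splitting into parent directory and base name, with a trailing slash meaning "the directory's contents", is an assumption of this example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// streamInArgs builds the argv for extracting a tar stream from stdin.
func streamInArgs(destPath string) []string {
	return []string{"/bin/tar", "-xf", "-", "-C", destPath}
}

// streamOutArgs builds the argv for archiving sourcePath to stdout.
// Assumption: archive the base name relative to its parent directory,
// or "." when the path ends with a slash.
func streamOutArgs(sourcePath string) []string {
	dir, entry := path.Dir(sourcePath), path.Base(sourcePath)
	if strings.HasSuffix(sourcePath, "/") {
		dir, entry = sourcePath, "."
	}
	return []string{"/bin/tar", "-cf", "-", "-C", dir, entry}
}

func main() {
	fmt.Println(streamInArgs("/tmp/app"))
	fmt.Println(streamOutArgs("/home/vcap/app"))
	fmt.Println(streamOutArgs("/home/vcap/app/"))
}
```
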

The exec-based StreamIn runs tar -xf inside the container, but the target
directory (e.g. /tmp/app) may not exist in a freshly created container.
The nstar-based approach creates it implicitly. Use mkdir -p before tar
to ensure the path exists.

… errors, add constants

- Replace shell-based streamInViaExec (vulnerable to command injection via
  spec.Path) with two direct exec calls: /bin/mkdir then /bin/tar
- Capture stderr in streamOutViaExec and log tar failures instead of
  silently discarding errors
- Extract hardcoded "io.containerd.runc.v2" into defaultRuncRuntime constant
- Add debug log to RunRunc.Start no-op for observability
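
The reason two direct exec calls avoid the injection problem can be shown with a small sketch: the destination path travels as a single argv element and is never interpreted by a shell. `streamInCommands` is a hypothetical helper, and the `-p` flag is assumed from the earlier commit message.

```go
package main

import "fmt"

// streamInCommands returns the two commands to exec in order: create the
// destination directory, then extract the tar stream into it. No shell is
// involved, so metacharacters in destPath are passed through literally.
func streamInCommands(destPath string) [][]string {
	return [][]string{
		{"/bin/mkdir", "-p", destPath},
		{"/bin/tar", "-xf", "-", "-C", destPath},
	}
}

func main() {
	// A hostile path that would run a second command under "sh -c".
	hostile := "/tmp/app; rm -rf /"
	for _, argv := range streamInCommands(hostile) {
		fmt.Printf("%q\n", argv)
	}
}
```
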

…ting

- Factor exec→wait→check-exit into execAndWait helper (eliminates repetition)
- Set user once at top of each streaming function (was duplicated 3x)
- Add stderr capture to streamInViaExec (symmetric with streamOut)
- Replace magic constant with isNonRuncRuntime() method
- Remove noisy debug log from RunRunc.Start no-op
- Add one-line comments for spec.Windows and taskOpts skipping in nerd.go
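
A helper along the lines of execAndWait might look like the sketch below. The `process` and `execFunc` types and the fake implementation are hypothetical; the real helper presumably goes through runtime.Exec, whose signature is not shown in this PR description.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// process is a hypothetical minimal view of a running exec'd process.
type process interface {
	Wait() (int, error)
}

// execFunc is a hypothetical stand-in for the runtime's exec entry point.
type execFunc func(argv []string, stdin io.Reader, stdout, stderr io.Writer) (process, error)

// execAndWait runs argv, waits for it, and turns a non-zero exit code into
// an error that includes whatever the command wrote to stderr.
func execAndWait(run execFunc, argv []string, stdin io.Reader, stdout io.Writer) error {
	var stderr bytes.Buffer
	proc, err := run(argv, stdin, stdout, &stderr)
	if err != nil {
		return fmt.Errorf("exec %s: %w", argv[0], err)
	}
	code, err := proc.Wait()
	if err != nil {
		return fmt.Errorf("wait %s: %w", argv[0], err)
	}
	if code != 0 {
		return fmt.Errorf("%s exited %d: %s", argv[0], code, strings.TrimSpace(stderr.String()))
	}
	return nil
}

// fakeProcess and fakeExec simulate a command for demonstration purposes.
type fakeProcess struct{ code int }

func (p fakeProcess) Wait() (int, error) { return p.code, nil }

func fakeExec(code int, stderrMsg string) execFunc {
	return func(argv []string, _ io.Reader, _ io.Writer, stderr io.Writer) (process, error) {
		io.WriteString(stderr, stderrMsg)
		return fakeProcess{code: code}, nil
	}
}

func main() {
	fmt.Println(execAndWait(fakeExec(0, ""), []string{"/bin/mkdir"}, nil, nil))
	fmt.Println(execAndWait(fakeExec(2, "tar: boom\n"), []string{"/bin/tar"}, nil, nil))
}
```
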

Move exec-based and nstar-based streaming into separate implementations
of a new Streamer interface. Containerizer delegates to the injected
streamer without knowing which runtime is in use.

- NstarStreamer: existing nstar/nsenter approach for runc containers
- ExecStreamer: exec-based tar for runtimes where /proc/pid/ns/mnt is
  not accessible (e.g. gVisor)

Selection happens once at construction time based on --containerd-runtime-type.
Containerizer no longer carries runtimeType or any streaming logic.
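
The construction-time selection could be sketched as below. The method signatures on `Streamer` and the `newStreamer` constructor are guesses made for illustration, and the method bodies are deliberately left unimplemented; only the type names NstarStreamer and ExecStreamer come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// Streamer is a hypothetical shape for the interface; the real method set
// may differ.
type Streamer interface {
	StreamIn(handle, destPath string, tarStream io.Reader) error
	StreamOut(handle, sourcePath string) (io.ReadCloser, error)
}

var errSketch = errors.New("not implemented in this sketch")

// NstarStreamer would use the nstar/nsenter approach (runc containers).
type NstarStreamer struct{}

func (NstarStreamer) StreamIn(string, string, io.Reader) error        { return errSketch }
func (NstarStreamer) StreamOut(string, string) (io.ReadCloser, error) { return nil, errSketch }

// ExecStreamer would exec tar inside the container (e.g. gVisor).
type ExecStreamer struct{}

func (ExecStreamer) StreamIn(string, string, io.Reader) error        { return errSketch }
func (ExecStreamer) StreamOut(string, string) (io.ReadCloser, error) { return nil, errSketch }

// newStreamer picks an implementation once, from the configured runtime
// type. Assumption: an empty value means the runc default.
func newStreamer(runtimeType string) Streamer {
	if runtimeType != "" && runtimeType != "io.containerd.runc.v2" {
		return ExecStreamer{}
	}
	return NstarStreamer{}
}

func main() {
	for _, rt := range []string{"", "io.containerd.runc.v2", "io.containerd.runsc.v1"} {
		fmt.Printf("%q -> %T\n", rt, newStreamer(rt))
	}
}
```

Because the choice is made once and injected, the caller only ever sees the interface and needs no runtime-type branching of its own.
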

Three patches for full Docker container support on gVisor:

1. execPeas (containerizer.go): When gVisor runtime detected, convert pea
   requests to exec inside existing container. Peas create separate sandboxes
   that can't share gVisor's per-sandbox netstack. The healthcheck/envoy
   binaries are already bind-mounted, so exec works directly.

2. SkipUserNamespace (pea_creator.go): Defense-in-depth — skip userns join
   for gVisor since setns(CLONE_NEWUSER) is rejected with EINVAL.

3. setupLoopback (external_networker.go): After silk CNI 'up', bring up the
   loopback interface in the container's network namespace. Silk only creates
   a veth pair; gVisor's netstack scrapes the netns and needs lo present to
   support 127.0.0.1 binding (required by envoy).

Validated end-to-end on lod-aws-0515:
- Docker python:3.12-slim with port healthcheck: PASS
- Envoy container proxy (mTLS): PASS
- HTTP routing through gorouter: PASS