Cf gvisor deferred start #504
Draft
vlast3k wants to merge 14 commits into
gVisor network=sandbox requires the netns to be configured BEFORE task.Start(), because setupNetwork() reads the netns at start time and mirrors it into netstack. Previously Start was called before silk-cni configured networking, leaving gVisor with an empty netstack. Split the lifecycle: Create only boots the sandbox (runsc create), and Start is called after Networker.Network() has configured the netns.
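The ordering described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the Runtime and Networker interfaces, the recorder type, and createContainer are hypothetical simplifications, and the real signatures in the repository differ.

```go
package main

import "fmt"

// Runtime and Networker are hypothetical, simplified stand-ins for the
// interfaces touched by this change; the real signatures differ.
type Runtime interface {
	Create(handle string) error // boots the sandbox only (runsc create)
	Start(handle string) error  // starts the task; gVisor reads the netns here
}

type Networker interface {
	Network(handle string) error // configures the netns (e.g. via silk-cni)
}

// recorder implements both interfaces and records the call order.
type recorder struct{ calls []string }

func (r *recorder) Create(string) error  { r.calls = append(r.calls, "create"); return nil }
func (r *recorder) Network(string) error { r.calls = append(r.calls, "network"); return nil }
func (r *recorder) Start(string) error   { r.calls = append(r.calls, "start"); return nil }

// createContainer runs Create, then Network, then Start, so that the netns
// is already populated when the runtime reads it at start time.
func createContainer(rt Runtime, nw Networker, handle string) error {
	if err := rt.Create(handle); err != nil {
		return fmt.Errorf("create: %w", err)
	}
	if err := nw.Network(handle); err != nil {
		return fmt.Errorf("network: %w", err)
	}
	if err := rt.Start(handle); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	return nil
}

func main() {
	r := &recorder{}
	if err := createContainer(r, r, "example-handle"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.calls)
}
```

For runc the Start step can be a no-op, which keeps a single code path for both runtimes.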
Wire a --containerd-runtime-type flag through to the containerd client and Nerd layer. When set to a non-runc runtime (e.g. io.containerd.runsc.v1), Create skips runc-specific task options (WithNoNewKeyring, WithUIDAndGID) that runsc rejects, and nils spec.Windows which gVisor does not support. Also adds a no-op Start() to RunRunc for OCIRuntime interface compliance after the deferred-start split.
For gVisor containers, the sentry process PID on the host does not expose the container filesystem via /proc/<pid>/root (the sentry runs in its own mount namespace). Look up users against bundle.Spec.Root.Path (the host-side rootfs) which is always accessible regardless of runtime type.
Non-runc runtimes (e.g. runsc/gVisor) do not write state.json to the runc root directory. On cgroups v2 the memory.use_hierarchy knob does not exist either. Treat a missing state.json as a no-op instead of returning an error that would block container creation.
nstar uses nsenter to enter the container mount namespace, which does not work with gVisor because the sentry PID on the host has a different mount namespace that does not expose the container filesystem. For non-runc runtimes, use runtime.Exec to run tar inside the container through the containerd -> shim -> runsc exec path. This goes through the sentry and keeps its dentry cache consistent with the filesystem state.
- StreamIn: exec /bin/tar -xf - -C <path> with TarStream on stdin.
- StreamOut: exec /bin/tar -cf - -C <source> <path> with stdout piped back.
Detection is based on the configured runtimeType (from the --containerd-runtime-type flag) rather than runtime probing.
The exec-based StreamIn runs tar -xf inside the container, but the target directory (e.g. /tmp/app) may not exist in a freshly created container. The nstar-based approach creates it implicitly. Use mkdir -p before tar to ensure the path exists.
… errors, add constants
- Replace shell-based streamInViaExec (vulnerable to command injection via spec.Path) with two direct exec calls: /bin/mkdir then /bin/tar
- Capture stderr in streamOutViaExec and log tar failures instead of silently discarding errors
- Extract hardcoded "io.containerd.runc.v2" into defaultRuncRuntime constant
- Add debug log to RunRunc.Start no-op for observability
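To illustrate why two direct exec calls avoid the injection problem: when each command is passed as an argument vector, the path is a single argument and is never parsed by a shell. The streamInCommands function below is a hypothetical sketch, not the function from this PR.

```go
package main

import "fmt"

// streamInCommands is a hypothetical helper that returns the argument vectors
// for the two direct exec calls. The path is always one argv element, so shell
// metacharacters in it are not interpreted.
func streamInCommands(path string) [][]string {
	return [][]string{
		{"/bin/mkdir", "-p", path},           // ensure the target directory exists
		{"/bin/tar", "-xf", "-", "-C", path}, // extract the tar stream from stdin
	}
}

func main() {
	// A hostile path stays a single, inert argument.
	for _, argv := range streamInCommands("/tmp/app; rm -rf /") {
		fmt.Printf("%q\n", argv)
	}
}
```

A shell-based variant such as sh -c "mkdir -p <path> && tar ..." would instead hand the same string to a shell, which would parse the semicolon as a command separator.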
…ting
- Factor exec→wait→check-exit into execAndWait helper (eliminates repetition)
- Set user once at top of each streaming function (was duplicated 3x)
- Add stderr capture to streamInViaExec (symmetric with streamOut)
- Replace magic constant with isNonRuncRuntime() method
- Remove noisy debug log from RunRunc.Start no-op
- Add one-line comments for spec.Windows and taskOpts skipping in nerd.go
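The exec, wait, check-exit sequence might look roughly like the sketch below. The Process and ExecFunc types and the fake implementations are assumptions made for illustration; the helper in this PR works against the real runtime types and may differ in shape.

```go
package main

import (
	"bytes"
	"fmt"
)

// Process and ExecFunc are hypothetical stand-ins for the runtime's exec API.
type Process interface {
	Wait() (exitCode int, err error)
}

type ExecFunc func(argv []string, stderr *bytes.Buffer) (Process, error)

// execAndWait starts the command, waits for it, and turns a non-zero exit
// status into an error that includes the captured stderr.
func execAndWait(run ExecFunc, argv []string) error {
	var stderr bytes.Buffer
	proc, err := run(argv, &stderr)
	if err != nil {
		return fmt.Errorf("exec %q: %w", argv[0], err)
	}
	code, err := proc.Wait()
	if err != nil {
		return fmt.Errorf("wait %q: %w", argv[0], err)
	}
	if code != 0 {
		return fmt.Errorf("%q exited with status %d: %s", argv[0], code, stderr.String())
	}
	return nil
}

// fakeProcess and fakeRun simulate a process with a fixed exit code.
type fakeProcess struct{ code int }

func (p fakeProcess) Wait() (int, error) { return p.code, nil }

func fakeRun(code int, msg string) ExecFunc {
	return func(argv []string, stderr *bytes.Buffer) (Process, error) {
		stderr.WriteString(msg)
		return fakeProcess{code: code}, nil
	}
}

func main() {
	fmt.Println(execAndWait(fakeRun(0, ""), []string{"/bin/mkdir", "-p", "/tmp/app"}))
	fmt.Println(execAndWait(fakeRun(2, "tar: unexpected EOF"), []string{"/bin/tar", "-xf", "-"}))
}
```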
Move exec-based and nstar-based streaming into separate implementations of a new Streamer interface. Containerizer delegates to the injected streamer without knowing which runtime is in use.
- NstarStreamer: existing nstar/nsenter approach for runc containers
- ExecStreamer: exec-based tar for runtimes where /proc/pid/ns/mnt is not accessible (e.g. gVisor)
Selection happens once at construction time based on --containerd-runtime-type. Containerizer no longer carries runtimeType or any streaming logic.
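A rough sketch of selecting the implementation once at construction time. The single-method Streamer interface, the newStreamer function, and the treatment of an empty runtime type as runc are assumptions for illustration; only the type names and the defaultRuncRuntime constant come from the commit messages.

```go
package main

import "fmt"

const defaultRuncRuntime = "io.containerd.runc.v2"

// Streamer is reduced to one hypothetical method here; the real interface
// covers streaming in and out and has different signatures.
type Streamer interface {
	StreamIn(handle, path string) error
}

// NstarStreamer stands for the nstar/nsenter approach used with runc.
type NstarStreamer struct{}

func (NstarStreamer) StreamIn(handle, path string) error { return nil }

// ExecStreamer stands for the exec-based tar approach (e.g. for gVisor).
type ExecStreamer struct{}

func (ExecStreamer) StreamIn(handle, path string) error { return nil }

// newStreamer picks the implementation from the configured runtime type.
// Treating an empty value as runc is an assumption of this sketch.
func newStreamer(runtimeType string) Streamer {
	if runtimeType == "" || runtimeType == defaultRuncRuntime {
		return NstarStreamer{}
	}
	return ExecStreamer{}
}

func main() {
	fmt.Printf("%T\n", newStreamer(defaultRuncRuntime))
	fmt.Printf("%T\n", newStreamer("io.containerd.runsc.v1"))
}
```

The caller holds only the Streamer value, so it needs no runtime-type checks of its own.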
Three patches for full Docker container support on gVisor:
1. execPeas (containerizer.go): When the gVisor runtime is detected, convert pea requests to exec inside the existing container. Peas create separate sandboxes that can't share gVisor's per-sandbox netstack. The healthcheck/envoy binaries are already bind-mounted, so exec works directly.
2. SkipUserNamespace (pea_creator.go): Defense-in-depth: skip the userns join for gVisor, since setns(CLONE_NEWUSER) is rejected with EINVAL.
3. setupLoopback (external_networker.go): After silk CNI 'up', bring up the loopback interface in the container's network namespace. Silk only creates a veth pair; gVisor's netstack scrapes the netns and needs lo present to support 127.0.0.1 binding (required by envoy).
Validated end-to-end on lod-aws-0515:
- Docker python:3.12-slim with port healthcheck: PASS
- Envoy container proxy (mTLS): PASS
- HTTP routing through gorouter: PASS
Summary
Backward Compatibility
Breaking Change? Yes/No