| 1 | +# NRI Pod Sandbox Lifecycle Hooks |
| 2 | + |
| 3 | +## Relationship to CRI API |
| 4 | + |
| 5 | +This specification defines how NRI plugins interact with pod sandbox lifecycle events. The underlying pod sandbox operations are defined by the [Kubernetes CRI API](https://github.com/kubernetes/cri-api): |
| 6 | + |
| 7 | +- **RunPodSandbox (CRI)**: Creates and starts a pod-level sandbox. Runtimes must ensure the sandbox is in the ready state on success. |
| 8 | +- **StopPodSandbox (CRI)**: Stops any running process that is part of the sandbox and reclaims network resources. |
| 9 | +- **RemovePodSandbox (CRI)**: Removes the sandbox. If there are any running containers, they must be forcibly terminated and removed. |
| 10 | + |
| 11 | +This NRI specification details when and under what conditions NRI plugins receive notifications for these events, ensuring plugins can reliably depend on consistent sandbox state across different runtime implementations. |
| 12 | + |
| 13 | +## Overview |
| 14 | + |
| 15 | +The pod sandbox lifecycle consists of three distinct phases, each with a corresponding NRI event that plugins can subscribe to: |
| 16 | + |
| 17 | +1. **RunPodSandbox**: Fired after the runtime successfully executes CRI RunPodSandbox |
| 18 | +2. **StopPodSandbox**: Fired when the runtime initiates CRI StopPodSandbox |
| 19 | +3. **RemovePodSandbox**: Fired when the runtime performs CRI RemovePodSandbox |
| 20 | + |
| 21 | +For each event, this specification defines: |
| 22 | + |
| 23 | +- **Sandbox State Contract**: What sandbox infrastructure conditions runtimes MUST satisfy when firing the NRI event |
| 24 | +- **Plugin Responsibilities and Capabilities**: What plugins can safely do in response to the event |
| 25 | + |
| 26 | +## RunPodSandbox |
| 27 | + |
| 28 | +**CRI Operation**: RunPodSandbox - Creates and starts a pod-level sandbox. |
| 29 | + |
| 30 | +**NRI Event Timing**: The RunPodSandbox NRI event is fired after the runtime has successfully executed the CRI RunPodSandbox operation and the sandbox has reached a "Ready" state, but before any workload containers are started. |
| 31 | + |
| 32 | +### Sandbox State Contract |
| 33 | + |
| 34 | +When the runtime fires the RunPodSandbox NRI event, it guarantees: |
| 35 | + |
| 36 | +- The Pod-level cgroup hierarchy has been established |
| 37 | +- The Sandbox namespaces (IPC, Network, UTS) are created and active |
| 38 | +- Network setup has been fully configured (network interfaces are up and assigned addressing) |
| 39 | +- The pod IP address (if applicable) is assigned and available |
| 40 | +- The "pause" container (if the runtime uses one) is running |
| 41 | +- All prerequisite operations for workload container startup are complete |
| 42 | + |
| 43 | +### Plugin Responsibilities and Capabilities |
| 44 | + |
| 45 | +Upon receiving the RunPodSandbox event, plugins can safely: |
| 46 | + |
| 47 | +- Access the network namespace and inspect network configuration |
| 48 | +- Perform network-level operations or monitoring |
| 49 | +- Inject sandbox-level hardware configurations (e.g., RDMA, RoCEv2) |
| 50 | +- Establish plugin-specific tracking or monitoring for the pod |
| 51 | +- Store initial state or baseline metrics for later reference |
| 52 | + |
| 53 | +Plugins should treat this as an initialization phase. The sandbox infrastructure remains accessible throughout the pod's lifetime, until the StopPodSandbox hook returns. |
| 54 | + |
| 55 | +## StopPodSandbox |
| 56 | + |
| 57 | +**CRI Operation**: StopPodSandbox - Stops any running process that is part of the sandbox and reclaims network resources. |
| 58 | + |
| 59 | +**NRI Event Timing**: The StopPodSandbox NRI event is fired when the runtime initiates the CRI StopPodSandbox operation. |
| 60 | + |
| 61 | +### Sandbox State Contract |
| 62 | + |
| 63 | +When the runtime fires the StopPodSandbox NRI event, it guarantees: |
| 64 | + |
| 65 | +- Workload containers within the sandbox are stopped or are stopping |
| 66 | +- **CRITICAL**: The sandbox infrastructure still exists and remains fully accessible during this hook |
| 67 | +- The network namespace is not unmounted or deleted until this hook completes |
| 68 | +- The pod's cgroups remain accessible |
| 69 | +- All pod-level resources remain stable until this hook returns |
| 70 | + |
| 71 | +### Plugin Responsibilities and Capabilities |
| 72 | + |
| 73 | +StopPodSandbox is the designated cleanup and observation phase for plugins. Upon receiving this event, plugins can: |
| 74 | + |
| 75 | +- Access the pod's network namespace to read final telemetry or metrics |
| 76 | +- Collect final state for observability or troubleshooting |
| 77 | +- Detach hardware interfaces or reconfigure resources |
| 78 | +- Clean up custom firewall configurations, routing rules, or other network-level state |
| 79 | +- Perform graceful cleanup or resource release before sandbox teardown |
| 80 | + |
| 81 | +**Important**: Plugin processing must complete within the configured request timeout. Do not assume sandbox access persists after this hook returns or times out. |
| 82 | + |
| 83 | +## RemovePodSandbox |
| 84 | + |
| 85 | +**CRI Operation**: RemovePodSandbox - Removes the sandbox and forcibly terminates any remaining containers. |
| 86 | + |
| 87 | +**NRI Event Timing**: The RemovePodSandbox NRI event is fired when the runtime initiates the CRI RemovePodSandbox operation, during final garbage collection. |
| 88 | + |
| 89 | +### Sandbox State Contract |
| 90 | + |
| 91 | +When the runtime fires the RemovePodSandbox NRI event: |
| 92 | + |
| 93 | +- All workload containers have been removed |
| 94 | +- The StopPodSandbox operation has completed |
| 95 | +- Network setup teardown may be underway or complete |
| 96 | +- The pod's namespaces (Network, IPC, UTS) may have already been deleted |
| 97 | +- Pod-level cgroups may be destroyed |
| 98 | +- Sandbox infrastructure access is **not guaranteed** |
| 99 | + |
| 100 | +### Plugin Responsibilities and Capabilities |
| 101 | + |
| 102 | +RemovePodSandbox is strictly for plugin-internal cleanup. Plugins MUST NOT attempt to access pod infrastructure (namespaces, cgroups, network configuration) during this hook, as their existence is not guaranteed. |
| 103 | + |
| 104 | +Plugins receiving this event should only: |
| 105 | + |
| 106 | +- Clean up plugin-internal memory caches or object tracking associated with the podSandboxID |
| 107 | +- Remove host-level tracking files, database entries, or other locally stored pod references |
| 108 | +- Release any plugin resources held for this specific pod |
| 109 | +- Perform final accounting or bookkeeping |
| 110 | + |
| 111 | +**Important**: This hook is informational only. Plugins should not assume any pod infrastructure exists. Only clean up information the plugin created or stored internally. |
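The plugin-internal cleanup described above can be sketched as a small in-memory cache keyed by pod sandbox ID. This is a minimal illustration, not part of the NRI API: `PodState`, `PodCache`, `Track`, and `Forget` are hypothetical names, and the stored field is a placeholder for whatever a plugin recorded earlier.

```go
package main

import "sync"

// PodState is hypothetical plugin-internal bookkeeping for one sandbox.
type PodState struct {
	BaselineRxBytes uint64 // placeholder for a metric stored at RunPodSandbox
}

// PodCache maps pod sandbox IDs to plugin-internal state.
type PodCache struct {
	mu   sync.Mutex
	pods map[string]*PodState
}

func NewPodCache() *PodCache {
	return &PodCache{pods: make(map[string]*PodState)}
}

// Track records state for a sandbox, for example on RunPodSandbox.
func (c *PodCache) Track(id string, s *PodState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[id] = s
}

// Forget drops the entry for a sandbox and reports whether one existed.
// It touches only plugin memory: no namespaces, cgroups, or network state.
func (c *PodCache) Forget(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.pods[id]
	delete(c.pods, id)
	return ok
}

// Len returns the number of sandboxes currently tracked.
func (c *PodCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.pods)
}
```

A RemovePodSandbox handler would call `Forget` with the sandbox ID and treat a `false` result as harmless, since the entry may already have been released.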
| 112 | + |
| 113 | +## Event Ordering and Guarantees |
| 114 | + |
| 115 | +Runtimes MUST guarantee the following ordering: |
| 116 | + |
| 117 | +1. **RunPodSandbox** NRI event fires after successful CRI RunPodSandbox execution |
| 118 | +2. **StopPodSandbox** NRI event fires during CRI StopPodSandbox execution |
| 119 | +3. **RemovePodSandbox** NRI event fires during CRI RemovePodSandbox execution |
| 120 | +4. These events MUST fire in strict order: RunPodSandbox → StopPodSandbox → RemovePodSandbox |
| 121 | +5. No workload containers will be started until after the RunPodSandbox hook completes |
| 122 | +6. All workload containers will be stopped or stopping when the StopPodSandbox hook is called (see the StopPodSandbox Sandbox State Contract) |
| 123 | +7. No network resource reclamation occurs until the StopPodSandbox hook returns or times out |
| 124 | + |
| 125 | +See the [CRI API specification](https://github.com/kubernetes/cri-api) for details on each CRI operation. |
| 126 | + |
| 127 | +## Plugin Implementation Guidance |
| 128 | + |
| 129 | +### Subscribing to Events |
| 130 | + |
| 131 | +Plugins subscribe to these events during the Configure phase by returning the appropriate event flags in the ConfigureResponse: |
| 132 | + |
| 133 | +- `Event_RUN_POD_SANDBOX` (1 << 0) for RunPodSandbox |
| 134 | +- `Event_STOP_POD_SANDBOX` (1 << 1) for StopPodSandbox |
| 135 | +- `Event_REMOVE_POD_SANDBOX` (1 << 2) for RemovePodSandbox |
| 136 | + |
| 137 | +Plugins are notified of these events via the StateChange RPC. |
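The flag values listed above combine with a bitwise OR. The following is a minimal sketch of that arithmetic only: `EventMask`, `Subscribe`, and `IsSet` are hypothetical helpers defined here for illustration, and the constants simply mirror the bit positions given in the list rather than importing any real NRI package.

```go
package main

// EventMask is a hypothetical bitmask type for this sketch.
type EventMask int32

// Bit positions follow the list in this section.
const (
	RunPodSandbox    EventMask = 1 << 0
	StopPodSandbox   EventMask = 1 << 1
	RemovePodSandbox EventMask = 1 << 2
)

// Subscribe ORs the given events into a single mask.
func Subscribe(events ...EventMask) EventMask {
	var m EventMask
	for _, e := range events {
		m |= e
	}
	return m
}

// IsSet reports whether every bit of event is present in the mask.
func (m EventMask) IsSet(event EventMask) bool {
	return m&event == event
}
```

For example, `Subscribe(RunPodSandbox, RemovePodSandbox)` yields a mask with bits 0 and 2 set, which is the kind of value a plugin would return to receive only those two events.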
| 138 | + |
| 139 | +### Timeout Handling |
| 140 | + |
| 141 | +All plugin processing must complete within the configured request timeout. Plugins should plan accordingly: |
| 142 | + |
| 143 | +- **RunPodSandbox**: Failure may result in pod creation failure |
| 144 | +- **StopPodSandbox**: Non-blocking for subsequent operations; teardown proceeds once the hook returns or times out, and the plugin should not depend on that teardown having completed |
| 145 | +- **RemovePodSandbox**: Non-blocking; removal will proceed regardless of plugin timeout |
| 146 | + |
| 147 | +### Error Handling |
| 148 | + |
| 149 | +Plugins should handle errors gracefully and avoid leaving the pod or system in an inconsistent state. The effect of a plugin error depends on the event: |
| 150 | + |
| 151 | +- **RunPodSandbox errors**: May block pod creation, depending on failure severity and runtime policy |
| 152 | +- **StopPodSandbox errors**: May not prevent sandbox termination, depending on runtime policy |
| 153 | +- **RemovePodSandbox errors**: Should not prevent sandbox removal |
| 154 | + |
| 155 | +### Multi-Plugin Coordination |
| 156 | + |
| 157 | +When multiple plugins are active: |
| 158 | + |
| 159 | +- All RunPodSandbox hooks complete before the first workload container starts |
| 160 | +- Hooks execute in plugin index order; later plugins should not assume earlier plugins' modifications will persist |
| 161 | +- RemovePodSandbox hooks are independent; plugins should not rely on side effects from other plugins |