Fix: image vulnerabilities #124
Good vulnerability cleanup overall, especially pinning versions and removing stale metadata that causes false-positive My blocker is that this PR still depends on the report-only Trivy behavior from #122. The action still does not fail Because this PR’s goal is “fix image vulnerabilities”, I’d like to see one of these before merge:
Superseded — see the Changes Requested review below.
Correction: my earlier note here claimed nvidia-pytorch "now ships Jupyter." That was wrong — the image always had Jupyter; dropping RP_SKIP_JUPYTER just lets RunPod manage/patch it. Please disregard this review.
TimPietruskyRunPod left a comment
Requesting changes — a few quality fixes (details inline). The stale-metadata approach, the efa_metrics strip, and the version pins all look good, and this PR also resolves the missing if: guard I flagged on #122.
- scrab_stale_metadata typo (bake files + COPY --from=).
- scrub-stale-metadata.py uses 3.10+ annotation syntax + an unimported Iterator.
- Trailing space / missing newline in the new requirements.txt files.
| proxy = "container-template/proxy" | ||
| logo = "container-template" | ||
| requirements = "official-templates/base" | ||
| scrab_stale_metadata = "scripts" |
Typo: scrab_stale_metadata → scrub_stale_metadata. It's consistent across the bake files and the COPY --from= in the Dockerfile, so it builds, but it reads as a mistake. Same typo in official-templates/nvidia-pytorch/docker-bake.hcl and official-templates/rocm/docker-bake.hcl.
| COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ | ||
| COPY --from=requirements requirements.txt /requirements.txt | ||
| COPY --from=scrab_stale_metadata scrub-stale-metadata.py /tmp/scrub-stale-metadata.py |
Same scrab_stale_metadata typo here in the COPY --from= (should be scrub_stale_metadata).
| return pinned | ||
| def read_meta(meta_dir: pathlib.Path) -> tuple[str, str] | None: |
This uses PEP 604 syntax (tuple[str, str] | None) in an annotation evaluated at import time, so it raises on Python 3.9. It runs under the 3.12 base symlink / NGC python today so it's fine in practice — but please add from __future__ import annotations (or a comment pinning the assumption) since the script is generic. Also Iterator is referenced in the annotation on line 54 but never imported (harmless only because it's a string annotation).
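A minimal sketch of the suggested fix, assuming the script can take a module-level future import. Only the read_meta signature comes from the diff above; iter_meta_dirs and both function bodies are hypothetical stand-ins, not the script's actual code.

```python
from __future__ import annotations  # must precede every other statement in the module

import pathlib
import typing
from typing import Iterator  # imported so the name resolves if hints are ever evaluated


def read_meta(meta_dir: pathlib.Path) -> tuple[str, str] | None:
    """Signature taken from the diff; the body here is only a placeholder."""
    return None


def iter_meta_dirs(root: pathlib.Path) -> Iterator[pathlib.Path]:
    """Hypothetical helper: yield metadata directories directly under root."""
    for pattern in ("*.dist-info", "*.egg-info"):
        yield from sorted(root.glob(pattern))


# With the future import, annotations are stored as plain strings and are not
# evaluated when the function is defined, so the module also loads on 3.9.
print(type(read_meta.__annotations__["return"]).__name__)

# Evaluating hints on demand only works if every referenced name is imported.
print(typing.get_type_hints(iter_meta_dirs)["return"])
```

Importing Iterator costs nothing at runtime and keeps tools that resolve annotations (for example typing.get_type_hints) from failing with a NameError.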
| @@ -0,0 +1,5 @@ | |||
| hf_transfer | |||
Nit: trailing space after hf_transfer, and the file has no trailing newline (line 5). Same in the nvidia-pytorch / rocm requirements files.
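Both nits are easy to catch mechanically. The following is a rough sketch of such a check, not something that exists in this PR: the function name find_whitespace_issues is hypothetical, and the sample text is invented for illustration (only hf_transfer appears in the diff).

```python
from typing import List


def find_whitespace_issues(text: str) -> List[str]:
    """Report trailing whitespace per line and a missing final newline."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            issues.append(f"line {number}: trailing whitespace")
    if text and not text.endswith("\n"):
        issues.append("missing newline at end of file")
    return issues


# Invented sample: a trailing space on the first line and no final newline.
sample = "hf_transfer \nnumpy"
for issue in find_whitespace_issues(sample):
    print(issue)
```

A check like this could run over each requirements.txt in CI, though an editor setting or an existing whitespace linter would serve equally well.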
Summary
Drives all `runpod/*` images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of `official-templates/` and `helper-templates/`.
What's fixed
Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)
- `jupyterlab`, `notebook`, OpenSSH-related deps; stripped the `efa_metrics` directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (`nic_sampler`) that NVIDIA builds with an old Go toolchain and was triggering recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe, and the `find ... || true` guard keeps it a no-op on ROCm / CPU images.
- `max-parallelism` to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
- `scrub-stale-metadata.py` (see below) to remove orphan `.dist-info` / `.egg-info` trees that kept Trivy reporting fixed wheels as still-vulnerable.
Hadolint
- `DL3008` / `DL3009` / `DL3015` findings fixed across the touched Dockerfiles (`--no-install-recommends`, `apt-get clean && rm -rf /var/lib/apt/lists/*`, version pins where reasonable).
CI / tooling
- `nvidia.yml`, `rocm.yml`, `hadolint-pr.yml`, `hadolint-push.yml`.
- `.github/actions/trivy`: exposes a `skip_files` input so `nvidia-pytorch` can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
- `RUNPODCTL_VERSION=v2.3.0` in `base/Dockerfile` to stop tracking `latest`.
- `docker/setup-qemu-action` invocation that started failing after the action's input rename.
New: `scripts/scrub-stale-metadata.py`
Small helper invoked by Dockerfiles after `pip install`. NGC base images bundle several Python packages as in-tree source builds whose `.egg-info` lives next to the source. `pip install --upgrade` upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned `requirements.txt` and deletes any `.dist-info` / `.egg-info` whose `Version:` disagrees with the pin.
What's NOT fixed (deliberate)
Three images still have findings we can't act on in this PR:
- `runpod/base:...-rocm644-...-pytorch251`
- `runpod/autoresearch:...-cuda1281-ubuntu2204`
- `runpod/autoresearch:...-cuda1281-ubuntu2404`
These are tracked separately; everything else is now clean.
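For reviewers who want a concrete picture of the scrub step described under "New: scripts/scrub-stale-metadata.py", here is a rough sketch of the idea. It is not the script in this PR: the helper names normalize, read_pins and scrub, the plain string comparison of versions, and the jupyterlab version numbers in the demo are all assumptions made for illustration.

```python
import pathlib
import re
import shutil
import tempfile
from typing import Dict, List, Optional, Tuple


def normalize(name: str) -> str:
    """Normalize a project name so 'Foo_Bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def read_pins(requirements: pathlib.Path) -> Dict[str, str]:
    """Collect 'name==version' pins; comments and unpinned lines are skipped."""
    pinned: Dict[str, str] = {}
    for raw in requirements.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0]        # drop extras such as pkg[extra]
        version = version.split(";", 1)[0]  # drop environment markers
        pinned[normalize(name.strip())] = version.strip()
    return pinned


def read_meta(meta_dir: pathlib.Path) -> Optional[Tuple[str, str]]:
    """Return (normalized name, version) from METADATA or PKG-INFO, if present."""
    for filename in ("METADATA", "PKG-INFO"):
        path = meta_dir / filename
        if not path.is_file():
            continue
        name = version = None
        for line in path.read_text(errors="replace").splitlines():
            if not line.strip():
                break  # header fields end at the first blank line
            if line.startswith("Name:"):
                name = line.split(":", 1)[1].strip()
            elif line.startswith("Version:"):
                version = line.split(":", 1)[1].strip()
        if name and version:
            return normalize(name), version
    return None


def scrub(site_packages: pathlib.Path, pinned: Dict[str, str]) -> List[pathlib.Path]:
    """Delete metadata directories whose version disagrees with the pin."""
    removed = []
    for pattern in ("*.dist-info", "*.egg-info"):
        for meta_dir in sorted(site_packages.glob(pattern)):
            if not meta_dir.is_dir():
                continue
            meta = read_meta(meta_dir)
            if meta is None:
                continue
            name, version = meta
            # Simplification: exact string match, no version normalization.
            if name in pinned and pinned[name] != version:
                shutil.rmtree(meta_dir)
                removed.append(meta_dir)
    return removed


# Demo in a throwaway directory with invented version numbers.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "requirements.txt").write_text("jupyterlab==4.2.5\n")
    for demo_version in ("4.0.0", "4.2.5"):
        demo_dir = root / f"jupyterlab-{demo_version}.dist-info"
        demo_dir.mkdir()
        (demo_dir / "METADATA").write_text(f"Name: jupyterlab\nVersion: {demo_version}\n")
    removed = [p.name for p in scrub(root, read_pins(root / "requirements.txt"))]
    kept = sorted(p.name for p in root.glob("*.dist-info"))

print("removed:", removed)
print("kept:", kept)
```

Only packages that appear in the pin file are touched, so metadata for anything unpinned is left alone. A stricter variant could compare parsed versions rather than raw strings.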
Validation
Follow-ups (separate PRs)