Fix: image vulnerabilities #124

Open
Chmokachka wants to merge 32 commits into feat/image-security-scanner from fix/image-vulnerabilities

Collaborator

@Chmokachka Chmokachka commented May 18, 2026

Summary

Drives all runpod/* images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of official-templates/ and helper-templates/.

What's fixed

Image vulnerabilities (Trivy --severity HIGH,CRITICAL)

  • base: bumped jupyterlab, notebook, and OpenSSH-related deps; stripped the efa_metrics directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (nic_sampler) that NVIDIA builds with an old Go toolchain, which kept triggering Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe; the find ... || true guard keeps the step a no-op on ROCm / CPU images.
  • autoresearch: fixed Hadolint findings; aligned with the new base.
  • pytorch: Hadolint fixes; bumped max-parallelism to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
  • rocm: addressed all fixable CVEs; pinned the relevant deps.
  • nvidia-pytorch: patched OS-package CVEs; added scrub-stale-metadata.py (see below) to remove orphan .dist-info / .egg-info trees that kept Trivy reporting fixed wheels as still vulnerable.

Hadolint

  • All DL3008 / DL3009 / DL3015 findings fixed across the touched Dockerfiles (--no-install-recommends, apt-get clean && rm -rf /var/lib/apt/lists/*, version pins where reasonable).
  • Hadolint-on-push workflow now ignores the rules we already chose to accept project-wide (matches the PR check behaviour).

CI / tooling

  • Upgraded GitHub Actions versions across nvidia.yml, rocm.yml, hadolint-pr.yml, hadolint-push.yml.
  • Replaced the brittle Trivy action call with our internal .github/actions/trivy — exposes a skip_files input so nvidia-pytorch can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
  • Pinned RUNPODCTL_VERSION=v2.3.0 in base/Dockerfile to stop tracking latest.
  • Fixed docker/setup-qemu-action invocation that started failing after the action's input rename.

New: scripts/scrub-stale-metadata.py

Small helper invoked by Dockerfiles after pip install. NGC base images bundle several Python packages as in-tree source builds whose .egg-info lives next to the source. pip install --upgrade upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned requirements.txt and deletes any .dist-info / .egg-info whose Version: disagrees with the pin.
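
As a rough illustration of the approach described above (not the script in this PR), the sketch below assumes every pin in requirements.txt has the simple form name==version and that each metadata directory carries a METADATA (.dist-info) or PKG-INFO (.egg-info) file with Name: and Version: headers. The function names normalize, parse_pins, read_name_version and scrub are hypothetical.

```python
import pathlib
import re
import shutil


def normalize(name):
    # PEP 503 style normalization so "Foo_Bar" and "foo-bar" compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirements_text):
    """Return {normalized name: version} for lines of the form name==version."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # unpinned entries are ignored in this sketch
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0]  # drop extras such as pkg[extra]
        pins[normalize(name.strip())] = version.split(";", 1)[0].strip()
    return pins


def read_name_version(meta_dir):
    """Read Name/Version from METADATA (.dist-info) or PKG-INFO (.egg-info)."""
    for filename in ("METADATA", "PKG-INFO"):
        meta_file = meta_dir / filename
        if not meta_file.is_file():
            continue
        fields = {}
        for line in meta_file.read_text(errors="replace").splitlines():
            if not line.strip():
                break  # headers end at the first blank line
            key, sep, value = line.partition(":")
            if sep and key in ("Name", "Version") and key not in fields:
                fields[key] = value.strip()
        if "Name" in fields and "Version" in fields:
            return normalize(fields["Name"]), fields["Version"]
    return None


def scrub(site_dir, pins):
    """Delete metadata dirs whose Version disagrees with the pin; return them."""
    removed = []
    for pattern in ("*.dist-info", "*.egg-info"):
        # sorted() materializes the matches before anything is deleted.
        for meta_dir in sorted(site_dir.rglob(pattern)):
            if not meta_dir.is_dir():
                continue
            meta = read_name_version(meta_dir)
            if meta is None:
                continue
            name, version = meta
            if name in pins and pins[name] != version:
                shutil.rmtree(meta_dir)
                removed.append(meta_dir)
    return removed
```

Under these assumptions, a Dockerfile step would call something like scrub(pathlib.Path(site_packages_dir), parse_pins(requirements_text)) right after pip install. Packages that are not pinned are left untouched.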

What's NOT fixed (deliberate)

Three images still have findings we can't act on in this PR:

| Image | Reason |
| --- | --- |
| runpod/base:...-rocm644-...-pytorch251 | All remaining CVEs are in PyTorch 2.5.1 itself, fixed only in 2.6.0+. Two options: drop the 2.5.1 variant, or wait for an upstream backport. Left for a separate decision. |
| runpod/autoresearch:...-cuda1281-ubuntu2204 | Findings are in transitive deps that need an autoresearch app-level dependency upgrade, which is out of scope for this PR. |
| runpod/autoresearch:...-cuda1281-ubuntu2404 | Same as above. |

These are tracked separately; everything else is now clean.

Validation

  • Trivy table-mode scans of each rebuilt tag — clean HIGH/CRITICAL on every targeted image.
  • Hadolint runs against the touched Dockerfiles — clean.

Follow-ups (separate PRs)

  • Open autoresearch-side PR to upgrade transitive deps.

@Chmokachka Chmokachka changed the base branch from main to feat/image-security-scanner May 18, 2026 13:19
@Chmokachka Chmokachka marked this pull request as ready for review May 21, 2026 21:01

Contributor

kodxana commented Jun 3, 2026

Good vulnerability cleanup overall, especially pinning versions and removing stale metadata that causes false-positive
Trivy reports.

My blocker is that this PR still depends on the report-only Trivy behavior from #122. The action still does not fail
when HIGH/CRITICAL findings are found, and the workflows still scan after push: true, so the CI does not prove that
vulnerable images are blocked from publication.

Because this PR’s goal is “fix image vulnerabilities”, I’d like to see one of these before merge:

  • Trivy exits non-zero for HIGH/CRITICAL vulnerabilities that have a fix available, and runs before publish, or
  • the PR clearly states that CI is not enforcing this yet and includes links/logs showing the claimed clean scans for
    each targeted image.

The skip-files addition seems reasonable for the known civetweb cert false positives, but it makes the enforcement
story even more important so real findings do not get hidden behind a passing workflow.
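
The first option could look roughly like the sketch below: scan the locally built image and push only if Trivy exits with 0. The --severity, --ignore-unfixed, --exit-code and --skip-files flags are standard trivy image options. The helper names build_trivy_cmd and scan_then_publish, and the injected run and publish callables, are assumptions for illustration and are not part of .github/actions/trivy.

```python
import subprocess


def build_trivy_cmd(image, skip_files=()):
    """Build a `trivy image` command that fails on fixable HIGH/CRITICAL findings."""
    cmd = [
        "trivy", "image",
        "--severity", "HIGH,CRITICAL",
        "--ignore-unfixed",   # only count vulnerabilities that have a fixed version
        "--exit-code", "1",   # non-zero exit when any matching finding remains
    ]
    for path in skip_files:
        cmd += ["--skip-files", path]  # known false positives only
    cmd.append(image)
    return cmd


def scan_then_publish(image, publish, run=subprocess.run, skip_files=()):
    """Run the scan first and call publish(image) only when it passes."""
    result = run(build_trivy_cmd(image, skip_files))
    if result.returncode != 0:
        return False  # findings (or a scanner error): do not publish
    publish(image)
    return True
```

Keeping the skip list an explicit argument makes every suppressed path visible in the call site, which helps with the concern about real findings hiding behind a passing workflow.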

Member

@TimPietruskyRunPod TimPietruskyRunPod left a comment

Superseded — see the Changes Requested review below.

Correction: my earlier note here claimed nvidia-pytorch "now ships Jupyter." That was wrong — the image always had Jupyter; dropping RP_SKIP_JUPYTER just lets RunPod manage/patch it. Please disregard this review.

Member

@TimPietruskyRunPod TimPietruskyRunPod left a comment

Requesting changes — a few quality fixes (details inline). The stale-metadata approach, the efa_metrics strip, and the version pins all look good, and this PR also resolves the missing if: guard I flagged on #122.

  • scrab_stale_metadata typo (bake files + COPY --from=).
  • scrub-stale-metadata.py uses 3.10+ annotation syntax + an unimported Iterator.
  • Trailing space / missing newline in the new requirements.txt files.

proxy = "container-template/proxy"
logo = "container-template"
requirements = "official-templates/base"
scrab_stale_metadata = "scripts"

Member

Typo: scrab_stale_metadata → scrub_stale_metadata. It's consistent across the bake files and the COPY --from= in the Dockerfile, so it builds, but it reads as a mistake. Same typo in official-templates/nvidia-pytorch/docker-bake.hcl and official-templates/rocm/docker-bake.hcl.

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

COPY --from=requirements requirements.txt /requirements.txt
COPY --from=scrab_stale_metadata scrub-stale-metadata.py /tmp/scrub-stale-metadata.py

Member

Same scrab_stale_metadata typo here in the COPY --from= (should be scrub_stale_metadata).

return pinned


def read_meta(meta_dir: pathlib.Path) -> tuple[str, str] | None:

Member

This uses PEP 604 syntax (tuple[str, str] | None) in an annotation that is evaluated when the function is defined, so importing the script raises a TypeError on Python 3.9 and earlier. It runs under the 3.12 base symlink / NGC python today, so it's fine in practice, but please add from __future__ import annotations (or a comment pinning the assumption) since the script is generic. Also, Iterator is referenced in the annotation on line 54 but never imported (harmless only because it's a string annotation).
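
A minimal sketch of the suggested header, assuming the script keeps the read_meta signature shown in the diff. The iter_meta_dirs helper and both function bodies are placeholders made up for this example, not the code in the PR.

```python
from __future__ import annotations  # PEP 563: annotations are stored as strings

import pathlib
from typing import Iterator  # imported explicitly so the name resolves if inspected


def iter_meta_dirs(root: pathlib.Path) -> Iterator[pathlib.Path]:
    """Yield every .dist-info / .egg-info directory below root (hypothetical helper)."""
    for pattern in ("*.dist-info", "*.egg-info"):
        yield from (p for p in root.rglob(pattern) if p.is_dir())


def read_meta(meta_dir: pathlib.Path) -> tuple[str, str] | None:
    """Placeholder body: split "name-version.dist-info" into (name, version)."""
    stem = meta_dir.name.rsplit(".", 1)[0]
    name, sep, version = stem.partition("-")
    return (name, version) if sep else None
```

Because the annotations are kept as strings, the | expression is never evaluated at definition time, so the module no longer depends on the interpreter supporting PEP 604 at import.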

@@ -0,0 +1,5 @@
hf_transfer

Member

Nit: trailing space after hf_transfer, and the file has no trailing newline (line 5). Same in the nvidia-pytorch / rocm requirements files.
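
A check along these lines could catch both nits before review. It is only a sketch; the function name normalize_requirements is hypothetical and not part of the PR.

```python
def normalize_requirements(text):
    """Strip trailing whitespace from each line and end the file with one newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and not lines[-1]:
        lines.pop()  # drop trailing blank lines
    return "\n".join(lines) + "\n" if lines else ""
```

Comparing a file's contents with normalize_requirements(contents) in a lint step would flag any requirements file that still has a trailing space or lacks the final newline.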
