Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script#1374
Refactor: Improve Proxy Handling and Secure Boot in GPU Install Script#1374cjac wants to merge 6 commits into
Conversation
Summary of ChangesHello @cjac, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the GPU driver installation script's reliability and adaptability, particularly in complex network environments requiring HTTP/HTTPS proxies and for systems utilizing Secure Boot. The changes focus on making the installation process more robust, configurable, and resilient to common issues like network restrictions and module signing requirements, while also refining the Conda environment setup and updating documentation. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly enhances the GPU driver installation script by introducing robust proxy handling, improving Secure Boot integration, and refining Conda environment setup. Key improvements include flexible proxy configuration with support for HTTPS proxies and custom CA certificates, a new import_gpg_keys function for reliable GPG key fetching, and more thorough verification steps for signed kernel modules under Secure Boot. The documentation has also been updated to reflect these new features and provide better troubleshooting guidance. Overall, these changes make the script more resilient and configurable for diverse network environments and security requirements.
| if [[ -v METADATA_HTTP_PROXY_PEM_URI ]] && [[ -n "${METADATA_HTTP_PROXY_PEM_URI}" ]]; then | ||
| if [[ -z "${trusted_pem_path:-}" ]]; then | ||
| echo "WARNING: METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined." >&2 | ||
| else | ||
| curl_retry_args+=(--cacert "${trusted_pem_path}") | ||
| fi |
There was a problem hiding this comment.
The warning METADATA_HTTP_PROXY_PEM_URI is set, but trusted_pem_path is not defined indicates a potential issue. trusted_pem_path is only set within set_proxy if both a proxy (http-proxy/https-proxy) and a PEM URI are provided. If http-proxy-pem-uri is provided but no http-proxy or https-proxy is set, set_proxy returns early, leaving trusted_pem_path undefined. This could lead to GPG key imports failing to use the custom CA, even if the PEM URI is present.
| pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy" | ||
| cat > "${pkg_proxy_conf_file}" <<EOF | ||
| Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}"; | ||
| Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}"; | ||
| EOF | ||
| echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}" | ||
| echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}" | ||
| echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}" | ||
| elif is_rocky ; then | ||
| pkg_proxy_conf_file="/etc/dnf/dnf.conf" | ||
|
|
||
| touch "${pkg_proxy_conf_file}" | ||
|
|
||
| if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then | ||
| sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}" | ||
| elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then | ||
| sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}" | ||
| sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}" | ||
| if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then | ||
| sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}" | ||
| else | ||
| local TMP_FILE=$(mktemp) | ||
| printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}" | ||
|
|
||
| cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new | ||
| mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}" | ||
| echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}" | ||
| fi | ||
| echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}" | ||
| fi |
There was a problem hiding this comment.
The apt and dnf proxy configurations (Acquire::http::Proxy "http://${effective_proxy}"; and proxy=http://${effective_proxy}) use an http:// prefix. If effective_proxy is derived solely from https_proxy_val (meaning only an HTTPS proxy was specified), this could lead to apt/dnf attempting to connect to an HTTPS proxy using an HTTP scheme. While a later sed command attempts to correct this if http-proxy-pem-uri is set, it might be incorrect if http-proxy-pem-uri is not provided.
| pkg_proxy_conf_file="/etc/apt/apt.conf.d/99proxy" | |
| cat > "${pkg_proxy_conf_file}" <<EOF | |
| Acquire::http::Proxy "http://${METADATA_HTTP_PROXY}"; | |
| Acquire::https::Proxy "http://${METADATA_HTTP_PROXY}"; | |
| EOF | |
| echo "Acquire::http::Proxy \"http://${effective_proxy}\";" > "${pkg_proxy_conf_file}" | |
| echo "Acquire::https::Proxy \"http://${effective_proxy}\";" >> "${pkg_proxy_conf_file}" | |
| echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}" | |
| elif is_rocky ; then | |
| pkg_proxy_conf_file="/etc/dnf/dnf.conf" | |
| touch "${pkg_proxy_conf_file}" | |
| if grep -q "^proxy=" "${pkg_proxy_conf_file}"; then | |
| sed -i.bak "s@^proxy=.*@proxy=${HTTP_PROXY}@" "${pkg_proxy_conf_file}" | |
| elif grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then | |
| sed -i.bak "/^\[main\]/a proxy=${HTTP_PROXY}" "${pkg_proxy_conf_file}" | |
| sed -i.bak '/^proxy=/d' "${pkg_proxy_conf_file}" | |
| if grep -q "^\[main\]" "${pkg_proxy_conf_file}"; then | |
| sed -i.bak "/^\\\[main\\\\]/a proxy=http://${effective_proxy}" "${pkg_proxy_conf_file}" | |
| else | |
| local TMP_FILE=$(mktemp) | |
| printf "[main]\nproxy=%s\n" "${HTTP_PROXY}" > "${TMP_FILE}" | |
| cat "${TMP_FILE}" "${pkg_proxy_conf_file}" > "${pkg_proxy_conf_file}".new | |
| mv "${pkg_proxy_conf_file}".new "${pkg_proxy_conf_file}" | |
| echo -e "[main]\nproxy=http://${effective_proxy}" >> "${pkg_proxy_conf_file}" | |
| fi | |
| echo "DEBUG: set_proxy: Configured dnf proxy: ${pkg_proxy_conf_file}" | |
| fi | |
| echo "Acquire::http::Proxy \"${HTTP_PROXY}\";" > "${pkg_proxy_conf_file}" | |
| echo "Acquire::https::Proxy \"${HTTPS_PROXY}\";" >> "${pkg_proxy_conf_file}" | |
| echo "DEBUG: set_proxy: Configured apt proxy: ${pkg_proxy_conf_file}" |
There was a problem hiding this comment.
This was resolved by using a variable scheme
|
/gcbrun |
|
/gcbrun |
|
I've made substantial changes. Please perform another review, discarding previous review comments. Let's start from the top once again. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request significantly refactors the GPU driver installation script, enhancing proxy handling, improving Secure Boot integration, and refining Conda environment setup. The documentation has been updated to reflect these changes, providing clearer guidance on proxy configuration and custom image building with Secure Boot. The introduction of the import_gpg_keys function centralizes GPG key management, making the script more robust in various network environments. Overall, these changes contribute to a more reliable and configurable GPU initialization process.
| building_file="" | ||
| rm "${local_tarball}" | ||
| make clean | ||
| popd |
There was a problem hiding this comment.
The execute_github_driver_build function includes a popd command. However, this function is called from build_driver_from_github, which also has a popd at its end (line 1331). This results in an extra popd call, which could lead to a directory stack underflow or unexpected behavior if the script's current directory is not what's expected. Ensure that pushd and popd calls are balanced.
25df48a to
3f5811e
Compare
|
I've made substantial changes in response to your previous review. Please perform another review. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant enhancements to the GPU installation script, primarily focusing on robust proxy handling and support for Secure Boot. The documentation in gpu/README.md has been extensively updated with detailed instructions for these complex scenarios, which is a great improvement. The script gpu/install_gpu_driver.sh has been refactored to centralize GPG key imports, improve caching logic for driver and source builds, and add comprehensive proxy configuration, including custom CA certificate handling.
My review focuses on ensuring the new logic is robust and consistent. I've identified a minor documentation inconsistency and a couple of potential issues in the script related to command execution order and error handling. Overall, this is a very strong refactoring that greatly improves the reliability and maintainability of the script.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This is a significant and valuable pull request that greatly enhances the GPU installation script's reliability, especially in complex environments with proxies and Secure Boot. The refactoring for proxy handling is comprehensive, and the new GPG key import mechanism is a major improvement for robustness. The documentation updates in the README are thorough and will be very helpful for users. The changes are well-executed, but I've identified one high-severity issue related to directory stack management (pushd/popd) in the new execute_github_driver_build function that should be addressed to prevent potential issues with script execution flow.
|
/gcbrun |
1 similar comment
|
/gcbrun |
This change improves the robustness of the GPU driver installation script by:
1. Standardizing URL existence checks to use `curl --head` with retry arguments (`${curl_retry_args[@]}`) instead of `curl -sSLfI` for better consistency and error handling.
2. Implementing GCS caching for the CUDA runfile in `set_cuda_runfile_url`. The script now checks a pre-defined GCS bucket (`${pkg_bucket}`) for an existing copy of the required CUDA `.run` file. If found, it downloads from the cache. Otherwise, it downloads from the official NVIDIA URL and uploads a copy to the GCS bucket for future use. This speeds up subsequent runs and reduces reliance on external network availability.
3. The driver runfile caching logic in `set_driver_version` was already present but this change ensures the URL check uses the standard `${curl_retry_args[@]}`.
These changes make the script more resilient to transient network issues and more efficient in environments where the same files might be needed multiple times across different cluster builds.
|
/gcbrun |
6 similar comments
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
|
/gcbrun |
1 similar comment
|
/gcbrun |
|
/gcbrun |
1 similar comment
|
/gcbrun |
…ing and lock retry This commit bundles a series of critical fixes targeting the stability, testability, and reliability of the GPU driver installation and its integration suite on Dataproc. * **Conda Execution:** Refactored Python integration test assertions (`test_gpu.py`, `verify_external_cluster.py`) and external Spark test scripts to locate Conda Python binaries dynamically using `find /opt/conda -maxdepth 6` rather than relying on `conda activate`. This resolves SSH parsing, quoting, and unbound variable issues (`$PS1`) that caused tests to fail when verifying PyTorch and TensorFlow installations. * **CUDA 13 gcc-12 Dependency:** Fixed a fatal kernel module compilation crash on Dataproc 2.2 (`2.2-ubuntu22`). The `install_build_dependencies` logic was previously gating the required `gcc-12` package installation on an exact match of `is_cuda12`. With the recent update of Dataproc 2.2 to CUDA 13.1.1, this check failed, defaulting to `gcc-11` and crashing the `open-gpu-kernel-modules` compilation (`make -j32 modules`). Changed the check to use `ge_cuda12` (greater than or equal to 12). * **APT Lock Retry Regex & Global Timeout:** Fixed a bug where the `execute_with_retries` bash wrapper silently failed to mitigate dpkg lock contention during `unattended-upgrades`. The bash regex (`[[ "$cmd" =~ "^apt-get install" ]]`) was incorrectly quoted, causing a literal string match failure. Fixed the regex to `^apt-get` and moved the `DPkg::Lock::Timeout="60";` configuration to a global file (`/etc/apt/apt.conf.d/99-dpkg-lock-timeout`) at the start of the script, ensuring *all* system `apt` invocations safely wait for locks during boot. * **Apt Source Quoting:** Removed invalid inner double-quotes around the `deb-src` URL for the `nvidia-container-toolkit` repository inside `install_gpu_driver.sh`, preventing syntax errors during `apt-get update`. * **Testing Infrastructure & Documentation:** Added `gpu/TESTING.md` to document the manual, fast-iterative testing loop. This covers provisioning bare clusters (`--no-init-action`), staging via an optimized `scp-m`, manual execution over SSH, and external validation via `spark-gpu-test.sh`. Added a comprehensive "Development and Testing" section to `gpu/README.md`. * **Bazel Authentication:** Updated `run-bazel-tests.sh` to explicitly map ADC credentials (`GOOGLE_APPLICATION_CREDENTIALS`) through `--test_env` arguments. This is required for the Podman sandbox to bypass local GCE metadata service lookups and successfully authenticate `gsutil` for bucket creation during `setUpClass`. * **Test Matrix Restoration:** Unskipped the core `NvidiaGpuDriverTestCase.test_install_gpu_cuda_nvidia` parameterized suite matrix (SINGLE, STANDARD, KERBEROS), removing the temporary `self.skipTest` used during debug isolation. * **Recreate Script Sync:** Synced the fallback driver defaults in `recreate-dpgce` to align with the init script's upgrade to CUDA 13.1.1 for images 2.2 and 2.3.
|
/gcbrun |
|
Tests are now passing. I will now enable tests that I disabled to get the suite passing. |
GPU Initialization Action Enhancements for Secure Boot, Proxy, and Reliability
This large update significantly improves the
install_gpu_driver.shscript and its accompanying documentation, focusing on robust support for complex environments involving Secure Boot and HTTP/S proxies, and increasing overall reliability and maintainability.1.
gpu/README.md:GoogleCloudDataproc/custom-imagesrepository to create Dataproc images with NVIDIA drivers signed for Secure Boot. It covers environment setup, key management in GCP Secret Manager, Docker builder image creation, and running the image generation process.--shielded-secure-boot. It includes instructions for private network setups using Google Cloud Secure Web Proxy, leveraging scripts from theGoogleCloudDataproc/cloud-dataprocrepository for VPC, subnet, and proxy configuration.http-proxy,https-proxy,proxy-uri,no-proxy, andhttp-proxy-pem-uri.2.
gpu/install_gpu_driver.sh:set_proxy):http-proxy,https-proxy, andproxy-urimetadata, determining the correct proxy values for HTTP and HTTPS.HTTP_PROXY,HTTPS_PROXY, andNO_PROXYenvironment variables./etc/environmentwith the current proxy settings.gcloudproxy settings only if the gcloud SDK version is 547.0.0 or greater.aptanddnfto use the proxy.dirmngrorgnupg2-smimeis installed and configuresdirmngr.confto use the HTTP proxy.http-proxy-pem-uriinto system, Java, and Conda trust stores. Switches to HTTPS for proxy communications when a CA cert is provided.import_gpg_keys):import_gpg_keysto handle GPG key fetching and importing in a proxy-aware manner usingcurlover HTTPS, replacing directgpg --recv-keyscalls to keyservers.install_pytorch):numba,pytorch,tensorflow[and-cuda],rapids,pyspark, andcuda-version<=${CUDA_VERSION}. Explicit CUDA runtime (e.g.,cudart_spec) is no longer added, allowing the solver more flexibility.install_gpu_driver-mainandpytorchsentinels to allow forced refreshes.set_driver_version: Usescurl -Ifor a more lightweight HEAD request to check URL validity.build_driver_from_github: Caches the open kernel module source tarball from GitHub to GCS. Checks for existing signed and loadable modules to avoid unnecessary rebuilds.execute_github_driver_build: Refactored to accept tarball paths.popdremoved to balancepushdin caller. Removed a debug echo of thesign-fileexit code.make -j$(nproc)tomodules_installfor parallelization.modinfoforsigner:to confirm modules are signed.prepare_to_install: Movedcurl_retry_argsdefinition earlier.install_nvidia_gpu_driver: Checks ifnvidiamodule loads at the start and marks incomplete if not.main: Addedmark_complete install_gpu_driver-mainat the end.configure_dkms_certs: Always fetches keys from secret manager ifPSNis set to ensuremodulus_md5sumis available.install_gpu_agent: Checks ifMETADATA_HTTP_PROXY_PEM_URIis non-empty before using it.