[build] Add A100 support: patch set, offline-friendly conda build, and examples#1832

Open
jason9693 wants to merge 6 commits into THUDM:main from jason9693:main

@jason9693

Summary

Add a patch set, build scripts, a Docker image, and examples for building and running slime on A100 (SM80) environments.

Offline Build Flow

The offline build installs slime on air-gapped nodes: all dependencies are pre-downloaded on a networked machine, copied to the target node, and installed there from the local cache.

┌─────────────────────────────────────────────────────────────┐
│                  Networked Machine                          │
│                                                             │
│  custom_proxy_net.sh                                        │
│  ├── Download conda pkgs (python, cuda, nccl, cudnn)        │
│  │   └── .downloads/conda-env.tar.gz                        │
│  └── Download pip wheels (torch, torchvision, torchaudio)   │
│      └── .downloads/pip-pkgs/                               │
└──────────────────────┬──────────────────────────────────────┘
                       │  scp / rsync
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                Air-gapped Target Node                       │
│                                                             │
│  Step 1: custom_proxy.sh                                    │
│  ├── Install micromamba from .downloads/mamba_install.sh    │
│  ├── Restore conda env from conda-env.tar.gz                │
│  └── Fix shebangs for target paths                          │
│                         │                                   │
│                         ▼                                   │
│  Step 2: build_conda.a100.sh  (PATCH_VERSION=v0.5.9.a100)   │
│  ├── Detect existing slime env → skip conda setup           │
│  ├── Install torch from local .downloads/pip-pkgs/          │
│  ├── Build from source (sglang, DeepEP, flash-attn, TE, …)  │
│  ├── Apply A100 patches:                                    │
│  │   ├── deep_ep.patch          (libcuda RTLD_GLOBAL fix)   │
│  │   ├── sglang.patch           (SM80 compatibility)        │
│  │   ├── megatron.patch         (SM80 compatibility)        │
│  │   ├── slime.patch            (A100 runtime adjustments)  │
│  │   └── transformer_engine.patch (site-packages patch)     │
│  └── Final verification (torch, TE, DeepEP import check)    │
└─────────────────────────────────────────────────────────────┘
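
The pip half of this flow can be sketched as follows. This is a minimal illustration, assuming the helper scripts rely on pip's standard `download` and offline `install` options; the function names are hypothetical and the actual `custom_proxy_net.sh` and `build_conda.a100.sh` may work differently. The functions only print the commands, so the sketch can be run on any machine without touching the network.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: not the actual contents of the PR's scripts.

DOWNLOAD_DIR="${DOWNLOAD_DIR:-.downloads}"   # cache directory shown in the diagram
PIP_PKG_DIR="${DOWNLOAD_DIR}/pip-pkgs"

# Networked machine: command that fetches the wheels into the local cache.
fetch_wheels_cmd() {
  echo "pip download --dest ${PIP_PKG_DIR} torch torchvision torchaudio"
}

# Air-gapped node: command that installs from the cache only, never from an index.
install_wheels_cmd() {
  echo "pip install --no-index --find-links ${PIP_PKG_DIR} torch torchvision torchaudio"
}

fetch_wheels_cmd
install_wheels_cmd
```

`--no-index` together with `--find-links` is what keeps pip from contacting PyPI on the target node; the cache directory itself is moved with scp or rsync as shown in the diagram.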

Changes

Build System

  • build_conda.a100.sh: Add A100-specific conda build script with offline pip/mamba cache support
  • build_conda.sh: Introduce PATCH_VERSION selector (v0.5.9, v0.5.9.a100), pin DeepEP commit (1d3963d), apply patches with --reject to land non-conflicting hunks
  • docker/Dockerfile.a100: Add A100 Docker image build file (DeepEP built without SM90/NVSHMEM, includes import-check layer)
  • .gitignore: Add build artifact paths
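
A version selector of this kind might look like the sketch below. It assumes that each supported `PATCH_VERSION` maps to a directory under `docker/patch/` (only `docker/patch/v0.5.9.a100/` is named in this PR); the function names are hypothetical and this is not the actual `build_conda.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a PATCH_VERSION selector.

PATCH_VERSION="${PATCH_VERSION:-v0.5.9}"   # v0.5.9 is described as the default
DEEPEP_COMMIT="1d3963d"                    # DeepEP commit pinned by this PR

# Map a patch version to its patch directory (assumed layout), rejecting unknown values.
resolve_patch_dir() {
  case "$1" in
    v0.5.9|v0.5.9.a100) echo "docker/patch/$1" ;;
    *) echo "unsupported PATCH_VERSION: $1" >&2; return 1 ;;
  esac
}

# Only the A100 variant builds DeepEP without SM90/NVSHMEM features.
is_a100_build() {
  [ "$1" = "v0.5.9.a100" ]
}

resolve_patch_dir "$PATCH_VERSION"
```

Failing fast on an unknown version avoids a half-configured build that silently applies no patches.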

Patches (docker/patch/v0.5.9.a100/)

  • deep_ep.patch: DeepEP A100 compatibility patch (libcuda RTLD_GLOBAL fix)
  • megatron.patch: Megatron A100 compatibility patch (SM80 compatibility)
  • sglang.patch: SGLang A100 compatibility patch (SM80 compatibility, ~4,700 lines)
  • slime.patch: Slime A100 compatibility patch (A100 runtime adjustments)
  • transformer_engine.patch: Transformer Engine A100 compatibility patch (applied to the installed site-packages copy)
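
Applying a patch while tolerating rejected hunks could be sketched as below. This assumes the `--reject` mentioned in this PR refers to `git apply --reject`, which applies the hunks that fit and writes the failed ones to `*.rej` files; the helper name is hypothetical and the real scripts may invoke patching differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply one patch and keep going even if some hunks are rejected.
# Pass an absolute patch path, because `git -C` changes the working directory first.
apply_patch_tolerant() {
  local target_dir="$1" patch_file="$2"
  if git -C "$target_dir" apply --reject "$patch_file"; then
    echo "applied cleanly: $patch_file"
  else
    # Non-conflicting hunks have still landed; failed hunks are left in *.rej files.
    echo "applied with rejects (see *.rej under $target_dir): $patch_file" >&2
  fi
}
```

The trade-off is that a build can succeed with a partially applied patch, so the `*.rej` files are worth reviewing after each run.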

Examples

  • examples/dapo_math/a100.sh: DAPO math run script for A100
  • examples/dapo_math/kanana2-30b-a3b.sh: Kanana2-30B-A3B model config (same architecture as DeepSeek V3)
  • examples/dapo_math/run_kanana2-30b-a3b.a100.sh: Kanana2-30B-A3B A100 run script (same architecture as DeepSeek V3)

Infra / Helpers

  • custom_proxy_net.sh: Pre-download conda/pip packages on a networked machine
  • custom_proxy.sh: Restore pre-downloaded packages on air-gapped target node
  • .downloads/mamba_install.sh: Offline micromamba installer
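
The "Fix shebangs for target paths" step of the restore can be illustrated with the sketch below. It assumes the restored environment's scripts still carry the interpreter path of the machine where the environment was packed; the function name and arguments are hypothetical, GNU `sed -i` is assumed, and the prefixes are treated as plain paths without `|` or regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rewrite shebangs that still point at the build-time prefix.
fix_shebangs() {
  local bin_dir="$1" old_prefix="$2" new_prefix="$3" f
  for f in "$bin_dir"/*; do
    [ -f "$f" ] || continue
    # Only touch files whose first line is a shebang under old_prefix.
    if head -n 1 "$f" | grep -q "^#!${old_prefix}/"; then
      sed -i "1s|^#!${old_prefix}/|#!${new_prefix}/|" "$f"
    fi
  done
}
```

A call such as `fix_shebangs "$ENV_PREFIX/bin" /old/prefix "$ENV_PREFIX"` (with `ENV_PREFIX` standing in for wherever the environment was restored) would leave unrelated scripts and binaries untouched.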

Docs

  • docs/en/get_started/quick_start.md, docs/zh/get_started/quick_start.md: Document the new A100 build flow

Commit History (6 commits)

Commit    Description
5a4f297e  [build] Add A100 patch set and offline-friendly conda build
6eb70d16  [build] Pin DeepEP commit and tolerate patch rejects
8ab0c1be  Fix conda build script
055d530d  Fix patch files
e4bb7fa0  [examples] Add example scripts
4f75ee60  Modify build and run logics

Stats

  • 16 files changed, 7,496 insertions(+), 2 deletions(-)

kakao-kevin-us and others added 6 commits April 11, 2026 11:32
Introduce PATCH_VERSION selector in build_conda.sh supporting v0.5.9
(default) and v0.5.9.a100, which additionally builds DeepEP without
SM90/NVSHMEM features and applies deep_ep, transformer_engine, and
slime patches. Make the script offline-friendly via optional pip/mamba
caches, add Dockerfile.a100, SLURM build script, custom proxy helpers,
and document the new flow in the quick start guides.

Co-Authored-By: Kevin-Yang <ykcha9@gmail.com>

Pin DEEPEP_COMMIT to 1d3963d in build_conda.sh and Dockerfile.a100,
hardcode mamba prefix to /root to match the prior Docker layout, apply
sglang/megatron/slime/deep_ep/transformer_engine patches with --reject
so non-conflicting hunks still land, bump the SLURM build job memory
to 512G, and add a final import-check layer to Dockerfile.a100.

Co-Authored-By: Kevin-Yang <ykcha9@gmail.com>
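
A final import check like the one mentioned for the build script and `Dockerfile.a100` might be sketched as follows. The helper name is hypothetical, and the module names `torch`, `transformer_engine`, and `deep_ep` are assumed to be the import names of the packages listed in this PR; the actual check in the PR may be written differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a post-build import check.
# Usage: check_imports <python-binary> <module> [<module> ...]
check_imports() {
  local python_bin="$1" failed=0 mod
  shift
  for mod in "$@"; do
    if "$python_bin" -c "import ${mod}" >/dev/null 2>&1; then
      echo "ok: ${mod}"
    else
      echo "FAILED: ${mod}" >&2
      failed=1
    fi
  done
  # Non-zero exit status if any module failed to import.
  return "$failed"
}

# On a finished A100 build one might run:
#   check_imports python torch transformer_engine deep_ep
```

Checking every module before returning, rather than stopping at the first failure, reports all broken packages in a single pass.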