
Commit 2db167f

Merge branch 'master' into doc-autocast-nesting
2 parents ac3800b + ece52bc commit 2db167f

15 files changed

Lines changed: 821 additions & 17 deletions

File tree

training/tensor_parallel/README.md

Lines changed: 12 additions & 6 deletions
@@ -1,9 +1,15 @@
-# tensor parallel example
-This project is adapted from https://github.com/tatsu-lab/stanford_alpaca.
-We only modified the ds_config to enable tensor parallelism and more detailed logging, as an example use case.
+# AutoTP Training Examples

-**Script**
-
-``` bash run.sh ``` or ```bash run.sh MODE```
+This folder groups AutoTP training examples at different complexity levels.

+## Contents
+- [Basic example](basic_example): minimal AutoTP + ZeRO-2 example with synthetic tokens. It also shows that AutoTP recognizes typical parameter patterns and automatically applies proper partitioning.
+- [HuggingFace integration](hf_integration): Hugging Face Trainer example (adapted from Stanford Alpaca).
+- [Custom partitioning patterns](custom_patterns): AutoTP example with custom layer patterns and a simple text dataset that uses a DP-rank random sampler. It shows how to define parameter partitioning easily for custom models with non-standard parameter definitions.

+## Related references
+- [AutoTP training docs](https://deepspeed.readthedocs.io/en/latest/training.html)
+- [AutoTP training tutorial](https://github.com/deepspeedai/DeepSpeed/blob/master/docs/_tutorials/autotp-training.md)
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# AutoTP training (Tensor Parallel)

This directory documents the AutoTP training API for tensor-parallel sharding
during training. AutoTP recognizes typical parameter patterns and
automatically applies proper partitioning.

## Overview

This example provides a compact AutoTP + ZeRO-2 training script,
`autotp_example.py`. It focuses on the AutoTP + ZeRO-2 flow and keeps only the
pieces required to launch AutoTP:

- create TP/DP process groups
- configure AutoTP with `tensor_parallel.autotp_size`
- initialize DeepSpeed with the AutoTP config

The example feeds synthetic token batches (broadcast within each TP group) so
you can validate the AutoTP setup without extra dataset plumbing.
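
To swap the synthetic batches for real data, the straightforward approach is a
dataloader that shards examples across DP replicas only, so every rank inside a
TP group reads the same batch (this is the idea behind the DP-rank sampler in
the custom_patterns example). A minimal sketch; `train_dataset` is an assumed
map-style dataset of pre-tokenized examples, not part of this example:

```python
# Hypothetical dataloader for real data: partition over DP replicas only, so all
# TP ranks of a replica (same dp_rank) draw identical batches and the intra-TP
# broadcast becomes unnecessary. `train_dataset` is an assumption.
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(train_dataset, num_replicas=args.dp_size, rank=dp_rank, shuffle=True)
loader = DataLoader(train_dataset, batch_size=args.batch_size, sampler=sampler)
```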

AutoTP recognizes supported model architectures (for example, Llama) and
automatically partitions parameters, so you do not need to specify any manual
partitioning rules for those models. If your model is not supported by AutoTP,
refer to the [custom layer pattern guide](../custom_patterns/) for custom
layer pattern configuration.

## Key code (AutoTP path)

The core setup mirrors the verification script but is trimmed down:

```python
model = AutoModelForCausalLM.from_pretrained(args.model_name)

ds_config = {
    "train_batch_size": args.batch_size * args.dp_size,
    "train_micro_batch_size_per_gpu": args.batch_size,
    "zero_optimization": {"stage": args.zero_stage},
    "tensor_parallel": {"autotp_size": args.tp_size},
    "data_parallel_size": args.dp_size,
}

mpu = ModelParallelUnit(tp_group, dp_group, args.tp_size, args.dp_size, tp_rank, dp_rank)
engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config, mpu=mpu)
```
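
The `tp_group`/`dp_group` handles and ranks passed to `ModelParallelUnit` come
from a small helper (`build_tp_dp_groups` in `autotp_example.py`). A condensed
sketch of that layout, assuming `torch.distributed` is already initialized and
`rank`, `tp_size`, and `dp_size` are in scope:

```python
import torch.distributed as dist

# Contiguous TP ranks inside each DP replica: rank -> (tp_rank, dp_rank).
tp_rank, dp_rank = rank % tp_size, rank // tp_size

# dist.new_group must be called by every rank for every group; each rank keeps only its own.
for dp_idx in range(dp_size):
    group = dist.new_group(list(range(dp_idx * tp_size, (dp_idx + 1) * tp_size)))
    if dp_idx == dp_rank:
        tp_group = group
for tp_idx in range(tp_size):
    group = dist.new_group([tp_idx + d * tp_size for d in range(dp_size)])
    if tp_idx == tp_rank:
        dp_group = group
```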

## How to run

Pick a world size where `tp_size * dp_size = world_size`.

```bash
# 8 GPUs: TP=4, DP=2 (AutoTP + ZeRO-2)
deepspeed --num_gpus 8 autotp_example.py \
    --model_name meta-llama/Llama-3.1-8B \
    --tp_size 4 \
    --dp_size 2 \
    --zero_stage 2 \
    --batch_size 1 \
    --seq_length 1024 \
    --num_steps 10
```

`torchrun` works as well if you prefer the PyTorch launcher.
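For example, a single-node `torchrun` launch equivalent to the 8-GPU command above might look like this (a sketch, not a tested command; the script picks up its rank from the launcher's environment):

```bash
torchrun --nproc_per_node=8 autotp_example.py \
    --model_name meta-llama/Llama-3.1-8B \
    --tp_size 4 \
    --dp_size 2 \
    --zero_stage 2 \
    --batch_size 1 \
    --seq_length 1024 \
    --num_steps 10
```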

For a smaller test, reduce the world size and TP/DP sizes together:

```bash
deepspeed --num_gpus 2 autotp_example.py \
    --model_name meta-llama/Llama-3.1-8B \
    --tp_size 2 \
    --dp_size 1 \
    --num_steps 5
```

## Backward Compatibility

Historically, AutoTP training required calling `set_autotp_mode(training=True)`
and `deepspeed.tp_model_init(...)` before initialization. The traditional path
is preserved for reference in
[`autotp_memory_compare.py`](autotp_memory_compare.py) (see the `--mode traditional`
branch), alongside the config-driven path in the same script.
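
For orientation, the legacy call order looks roughly like the sketch below. The
exact import location of `set_autotp_mode` and the keyword names of
`tp_model_init` are assumptions here; treat `autotp_memory_compare.py` as the
authoritative reference.

```python
# Rough sketch of the legacy (pre-config) enablement order; compare with the
# --mode traditional branch of autotp_memory_compare.py.
import torch
import deepspeed
from deepspeed.utils import set_autotp_mode  # assumed import path

set_autotp_mode(training=True)    # switch AutoTP into training mode before init
model = deepspeed.tp_model_init(  # shard the model over the TP group up front
    model, tp_size=args.tp_size, dtype=torch.bfloat16
)
engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config, mpu=mpu)
```
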
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
import argparse
from dataclasses import dataclass

import torch
import torch.distributed as dist
import deepspeed
from transformers import AutoModelForCausalLM


@dataclass
class ModelParallelUnit:
    """Minimal MPU for DeepSpeed TP+DP."""

    tp_group: dist.ProcessGroup
    dp_group: dist.ProcessGroup
    tp_size: int
    dp_size: int
    tp_rank: int
    dp_rank: int

    def get_data_parallel_group(self):
        return self.dp_group

    def get_model_parallel_group(self):
        return self.tp_group

    def get_data_parallel_world_size(self):
        return self.dp_size

    def get_model_parallel_world_size(self):
        return self.tp_size

    def get_data_parallel_rank(self):
        return self.dp_rank

    def get_model_parallel_rank(self):
        return self.tp_rank


def parse_args():
    parser = argparse.ArgumentParser(description="AutoTP training example (distilled from verify_autotp).")
    parser.add_argument("--local_rank", type=int, default=-1, help="Passed by deepspeed/torchrun.")
    parser.add_argument("--model_name", type=str, default="meta-llama/Llama-3.1-8B")
    parser.add_argument("--tp_size", type=int, default=4)
    parser.add_argument("--dp_size", type=int, default=2)
    parser.add_argument("--zero_stage", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--seq_length", type=int, default=1024)
    parser.add_argument("--num_steps", type=int, default=10)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--precision", type=str, default="bf16", choices=["bf16", "fp16", "fp32"])
    return parser.parse_args()


def build_tp_dp_groups(rank, world_size, tp_size, dp_size):
    if tp_size * dp_size != world_size:
        raise ValueError(f"tp_size ({tp_size}) * dp_size ({dp_size}) must equal world_size ({world_size})")

    tp_rank = rank % tp_size
    dp_rank = rank // tp_size

    tp_group = None
    dp_group = None

    # TP groups: contiguous ranks within each DP replica. Every rank must call
    # dist.new_group for every group; each rank keeps only the group it belongs to.
    for dp_idx in range(dp_size):
        tp_ranks = list(range(dp_idx * tp_size, (dp_idx + 1) * tp_size))
        group = dist.new_group(tp_ranks)
        if rank in tp_ranks:
            tp_group = group

    # DP groups: the ranks that hold the same TP shard across replicas.
    for tp_idx in range(tp_size):
        dp_ranks = [tp_idx + dp_idx * tp_size for dp_idx in range(dp_size)]
        group = dist.new_group(dp_ranks)
        if rank in dp_ranks:
            dp_group = group

    return tp_group, dp_group, tp_rank, dp_rank


def broadcast_inputs(input_ids, labels, tp_group, tp_src_rank):
    dist.broadcast(input_ids, src=tp_src_rank, group=tp_group)
    dist.broadcast(labels, src=tp_src_rank, group=tp_group)


def main():
    args = parse_args()
    deepspeed.init_distributed()

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    tp_group, dp_group, tp_rank, dp_rank = build_tp_dp_groups(
        rank, world_size, args.tp_size, args.dp_size
    )

    model = AutoModelForCausalLM.from_pretrained(args.model_name)
    model = model.to(device)

    # AutoTP is enabled via the DeepSpeed config.
    ds_config = {
        "train_batch_size": args.batch_size * args.dp_size,
        "train_micro_batch_size_per_gpu": args.batch_size,
        "zero_optimization": {"stage": args.zero_stage},
        "tensor_parallel": {"autotp_size": args.tp_size},
        "data_parallel_size": args.dp_size,
    }
    if args.precision == "bf16":
        ds_config["bf16"] = {"enabled": True}
    elif args.precision == "fp16":
        ds_config["fp16"] = {"enabled": True}

    optimizer = torch.optim.AdamW(model.parameters(), lr=args.learning_rate)
    mpu = ModelParallelUnit(tp_group, dp_group, args.tp_size, args.dp_size, tp_rank, dp_rank)
    engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config, mpu=mpu)

    vocab_size = model.config.vocab_size
    for _ in range(args.num_steps):
        # Rank 0 of each TP group draws a synthetic batch; the other TP ranks
        # receive the same tokens via broadcast so every shard sees identical inputs.
        if tp_rank == 0:
            input_ids = torch.randint(0, vocab_size, (args.batch_size, args.seq_length), device=device)
            labels = input_ids.clone()
        else:
            input_ids = torch.empty((args.batch_size, args.seq_length), dtype=torch.long, device=device)
            labels = torch.empty((args.batch_size, args.seq_length), dtype=torch.long, device=device)

        tp_src_rank = dp_rank * args.tp_size
        broadcast_inputs(input_ids, labels, tp_group, tp_src_rank)
        outputs = engine(input_ids=input_ids, labels=labels)
        engine.backward(outputs.loss)
        engine.step()

    if rank == 0:
        print("AutoTP example completed.")


if __name__ == "__main__":
    main()
