---
layout: post
title: Communication Backends, Raw performance benchmarking
author: e_hoelzl
published: true
tags: [performance, results]
excerpt_separator: <!--more-->
---

Distributed learning requires workers to collaborate by sharing learned information with their "colleagues". As MLBench supports both one and multiple processes per node, in addition to multi-node training, communication between workers is crucial and heavily affects performance, notably for compute-bound training algorithms.

This blog post analyzes the raw performance of the different communication backends used to transmit tensors and other information between workers.

<!--more-->

## Communication Backends

Currently, MLBench supports 3 communication backends out of the box:

* MPI, or Message Passing Interface (using [OpenMPI](https://www.open-mpi.org/)'s implementation)
* GLOO, available with PyTorch
* [NCCL](https://developer.nvidia.com/nccl), which provides high-speed connectivity between GPUs when used with the appropriate hardware

Each backend has its own benefits and drawbacks and is designed for specific use cases, which are reflected in the results below.

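As a rough illustration (this is not MLBench's actual setup code), the backend is simply selected by name when the `torch.distributed` process group is initialized; the rendezvous environment variables are assumed to be provided by whatever launcher starts the workers:

```python
import torch.distributed as dist

# Minimal sketch: pick the backend by name when initializing the process group.
# "gloo" and "nccl" ship with standard PyTorch builds, while "mpi" requires a
# PyTorch build compiled against an MPI installation.
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to be set by the
# launcher (e.g. torch.distributed.launch).
dist.init_process_group(backend="gloo", init_method="env://")  # or "mpi" / "nccl"
```
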
###### Differences

The following table illustrates the main differences between the 3 backends.

| Backend | Comm. Functions | Optimized for | Float32 | Float16 |
|---------|-----------------|---------------|---------|---------|
| MPI | All | CPU, GPU | Yes | No |
| GLOO | All (on CPU), broadcast & all-reduce (on GPU) | CPU | Yes | Yes |
| NCCL | broadcast, all-reduce, reduce and all-gather (on GPU) | GPU only | Yes | Yes |

As we can see, each backend has at least one main advantage and is suited to specific use cases.

It is also important to note that PyTorch ships with NCCL and GLOO support out of the box, so it is usually more convenient to use one of those two. To use MPI, it must first be compiled and installed on the machine, and PyTorch built from source against it.

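A quick way to see which backends a given PyTorch build supports is to query `torch.distributed`'s availability helpers (a small sketch):

```python
import torch.distributed as dist

# These helpers report whether the current PyTorch build was compiled
# with support for each backend.
print("GLOO available:", dist.is_gloo_available())
print("NCCL available:", dist.is_nccl_available())
print("MPI available: ", dist.is_mpi_available())
```
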
---

## Experiments

In order to evaluate the performance of communication backends, we created a dummy task called [Task 0](https://mlbench.readthedocs.io/en/latest/benchmark-tasks.html#task-0-communication-backend-raw-performance), which repeatedly sends random tensors of increasing sizes using an `all reduce` operation. This means that each worker shares its tensor with all other workers and sums all the received tensors.

The time taken for this operation is measured 100 times for each tensor size on each worker, and averaged to obtain a statistically meaningful estimate of communication time.

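The core of the measurement is conceptually simple; a simplified sketch (not the exact Task 0 implementation) could look like this, assuming the process group has already been initialized as above:

```python
import time
import torch
import torch.distributed as dist

def time_all_reduce(num_elements, repetitions=100, dtype=torch.float32):
    """Average time of an all-reduce over `repetitions` random tensors."""
    timings = []
    for _ in range(repetitions):
        tensor = torch.rand(num_elements).to(dtype)
        start = time.perf_counter()
        # Every worker contributes its tensor and receives the element-wise sum.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Tensors of increasing size (element counts are illustrative).
for num_elements in [2 ** p for p in range(10, 25, 2)]:
    print(num_elements, time_all_reduce(num_elements))
```

For GPU tensors, the tensor would additionally be moved to the device and `torch.cuda.synchronize()` called before each clock read, since CUDA operations are asynchronous.
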
To obtain these results, we used the following hardware and software:

- `PyTorch` deep learning framework
- `n1-standard-4` machines (4 cores, 15 GB RAM) on Google Cloud
- `NVIDIA® Tesla® T4` GPUs (16 GB GDDR6, Turing architecture)

## Results

As stated above, we compare communication times for different tensor types and backends.

There are 4 tensor types: `Float16` and `Float32` tensors, each on either CPU or GPU.

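Concretely, the four variants correspond to the following tensor constructions (illustrative only):

```python
import torch

cpu_fp32 = torch.rand(1024)                # Float32 on CPU
cpu_fp16 = torch.rand(1024).half()         # Float16 on CPU
gpu_fp32 = torch.rand(1024).cuda()         # Float32 on GPU
gpu_fp16 = torch.rand(1024).half().cuda()  # Float16 on GPU
```
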
### CPU vs GPU tensors

Only MPI and GLOO support communication of CPU tensors, while NCCL requires GPU acceleration. Supporting CPU tensors is a significant advantage, as CPU training is less costly and can still be sped up using distributed training.

#### CPU

In the graph below, we compare the time taken to perform an `all reduce` operation on `Float16` and `Float32` CPU tensors between 2, 4 and 8 workers.

<a href="{{ site.baseurl }}public/images/backends_comparison_by_workers.png" data-lightbox="backends_comparison_by_workers" data-title="Backend performance comparison (CPU tensors)">
<img src="{{ site.baseurl }}public/images/backends_comparison_by_workers.png" alt="Backend performance comparison (CPU tensors)" style="max-width:100%;"/>
</a>

##### Key differences

- GLOO supports `Float16` communication, while Open MPI's implementation doesn't.
- MPI performs much better for small tensors, for all numbers of workers.
- GLOO appears more sensitive to larger clusters: its communication times increase faster than MPI's.
- For very large tensors, both perform similarly, except as more workers are added (the 8-worker case).

#### GPU

We now compare the speeds for GPU tensors. Here, NCCL joins the comparison.

<a href="{{ site.baseurl }}public/images/backends_comparison_by_workers_CUDA.png" data-lightbox="backends_comparison_by_workers" data-title="Backend performance comparison (GPU tensors)">
<img src="{{ site.baseurl }}public/images/backends_comparison_by_workers_CUDA.png" alt="Backend performance comparison (GPU tensors)" style="max-width:100%;"/>
</a>

##### Key differences

- NCCL supports `Float16`, and always performs better than GLOO.
- MPI performs better than NCCL for small tensors (especially as the cluster gets bigger).
- NCCL outperforms MPI and GLOO for very large tensors, regardless of cluster size.

#### Comparison

The results clearly depict the different use cases for each backend, and how each could be used to fulfill one's needs:

- GLOO's main advantage is its support for `Float16` communication of CPU tensors, and it should therefore be used in that case.
- MPI can be used with or without a GPU, and performs better than its counterparts for small tensors. It should therefore be used for small `Float32` tensor communication, and for `Float32` CPU communication in general.
- NCCL can only be used with GPU acceleration, and is the best option if one wants to use `Float16`. For large `Float32` tensors, it performs better than MPI.

## How to run

The code for this benchmark is available [here](https://github.com/mlbench/mlbench-benchmarks/tree/develop/pytorch/backend_benchmark), and the Docker image can be pulled with `docker pull mlbench/pytorch-backend-benchmark:latest`. The image can also be used to benchmark any other backend.

To benchmark a custom backend, it must first be installed in the image. To do so, simply modify the [Dockerfile](https://github.com/mlbench/mlbench-benchmarks/blob/develop/pytorch/backend_benchmark/Dockerfile) and rebuild the image.