---
layout: post
title: Communication Backends, Raw performance benchmarking
author: e_hoelzl
published: true
tags: [performance, results]
excerpt_separator: <!--more-->
---

Distributed learning requires workers to collaborate by sharing learned information with their "colleagues". As MLBench
supports both one and multiple processes per node, in addition to multi-node training, communication between workers is crucial
and heavily affects performance, notably for communication-bound training algorithms.

This blog post analyzes the raw performance of the different communication backends used to transmit tensors and other
information between workers.

<!--more-->


## Communication Backends

Currently, MLBench supports 3 communication backends out of the box:

* MPI, or Message Passing Interface (using [Open MPI](https://www.open-mpi.org/)'s implementation)
* GLOO, which ships with PyTorch
* [NCCL](https://developer.nvidia.com/nccl), which provides high-speed connectivity between GPUs when used with the right hardware

Each backend has its own benefits and drawbacks and is designed for specific
use cases, which will be reflected in the results.

###### Differences

The following table illustrates the main differences between the 3 backends.

| Backend | Comm. Functions | Optimized for | Float32 | Float16 |
|---------|-----------------|---------------|---------|---------|
| MPI | All | CPU, GPU | Yes | No |
| GLOO | All (on CPU), broadcast & all-reduce (on GPU) | CPU | Yes | Yes |
| NCCL | broadcast, all-reduce, reduce and all-gather (on GPU) | GPU only | Yes | Yes |

As we can see, each backend has at least one distinct advantage and is best suited to specific use cases.

It is also important to note that PyTorch (built from source) comes with NCCL and GLOO pre-installed, so it can be
more convenient to use one of those two. Otherwise, MPI needs to be compiled and installed on the machine.
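
Switching between the three backends in PyTorch is essentially a one-line change. Here is a minimal sketch, assuming the standard `torch.distributed` environment variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) are set by the launcher; the helper name is ours, not part of the MLBench API:

```python
import torch.distributed as dist

def init_backend(backend: str = "gloo"):
    """Join the process group using the given backend: "mpi", "gloo" or "nccl".

    Note: "mpi" is only available if PyTorch was built from source
    against an MPI implementation such as Open MPI.
    """
    dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size()
```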

---

## Experiments

In order to evaluate the performance of communication backends, we have created a dummy task called [Task 0](https://mlbench.readthedocs.io/en/latest/benchmark-tasks.html#task-0-communication-backend-raw-performance),
which repeatedly sends random tensors of increasing sizes using an `all-reduce` operation. This means that each worker shares its tensor with all other workers and sums the tensors it receives.

The time taken by this operation is measured 100 times for each tensor size on each worker, and averaged to obtain a statistically significant estimate of communication time.
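
As a rough sketch of what this measurement looks like (assuming the process group has been initialized as above; names and sizes are illustrative, not the exact Task 0 code):

```python
import time
import torch
import torch.distributed as dist

def benchmark_all_reduce(sizes=tuple(2 ** n for n in range(10, 25)), repeats=100):
    """Average the wall-clock time of an all-reduce for each tensor size."""
    results = {}
    for size in sizes:
        tensor = torch.rand(size)  # random Float32 CPU tensor
        elapsed = 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
            # For GPU tensors, torch.cuda.synchronize() would be needed
            # here before reading the clock.
            elapsed += time.perf_counter() - start
        results[size] = elapsed / repeats  # average seconds per all-reduce
    return results
```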

To obtain these results, we used the following hardware and software:

- `PyTorch` deep learning framework
- `n1-standard-4` (4 cores, 15 GB RAM) machines on Google Cloud
- `NVIDIA® Tesla® T4` GPUs (16 GB GDDR6, Turing architecture)

## Results

As stated above, we compare communication times for different tensor types and backends.

There are 4 tensor types: `Float16` and `Float32` tensors, on either CPU or GPU.
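
For reference, the four combinations correspond to tensors like these (the size is illustrative):

```python
import torch

cpu_f32 = torch.rand(1_000_000)  # Float32 CPU
cpu_f16 = cpu_f32.half()         # Float16 CPU
gpu_f32 = cpu_f32.cuda()         # Float32 GPU
gpu_f16 = cpu_f16.cuda()         # Float16 GPU
```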

### CPU vs GPU tensors

Only MPI and GLOO support communication of CPU tensors, while NCCL requires GPU acceleration. CPU support is a significant advantage, as CPU training is less costly
and can be sped up using distributed training.

#### CPU
In the graph below, we compare the time taken to perform an `all-reduce` operation between 2, 4 and 8 workers, on `Float16` and `Float32` CPU tensors.

<a href="{{ site.baseurl }}public/images/backends_comparison_by_workers.png" data-lightbox="backends_comparison_by_workers" data-title="Backend performance comparison (CPU tensors)">
  <img src="{{ site.baseurl }}public/images/backends_comparison_by_workers.png" alt="Backend performance comparison (CPU tensors)" style="max-width:100%;"/>
</a>

##### Key differences

- GLOO supports `Float16` communication, while Open MPI's implementation doesn't.
- MPI performs much better for small tensors, for all numbers of workers.
- GLOO seems to be more sensitive to larger clusters: its communication times increase faster than MPI's.
- For very large tensors, both perform similarly, except as we add more workers (the 8-worker case).

#### GPU

We now compare the timings for GPU tensors. Here, NCCL joins the comparison.

<a href="{{ site.baseurl }}public/images/backends_comparison_by_workers_CUDA.png" data-lightbox="backends_comparison_by_workers" data-title="Backend performance comparison (GPU tensors)">
  <img src="{{ site.baseurl }}public/images/backends_comparison_by_workers_CUDA.png" alt="Backend performance comparison (GPU tensors)" style="max-width:100%;"/>
</a>

##### Key differences
- NCCL supports `Float16`, and always performs better than GLOO.
- MPI performs better than NCCL for small tensors (especially as the cluster gets bigger).
- NCCL outperforms MPI and GLOO for very large tensors, regardless of cluster size.

#### Comparison

The results clearly depict the different use cases for each backend, and how each can be used to fit one's needs:
- GLOO's main advantage is its support for `Float16` communication of CPU tensors, and it should be used in that case.
- MPI can be used with or without a GPU, and performs better than its counterparts for small tensors. It should therefore be used for small-tensor
communication in `Float32`, and for all `Float32` CPU communication.
- NCCL can only be used with GPU acceleration, and is the best option for `Float16`. For large `Float32` tensors, it performs better than MPI.


## How to run

The code for this benchmark is available [here](https://github.com/mlbench/mlbench-benchmarks/tree/develop/pytorch/backend_benchmark), and the Docker image can be pulled with
`docker pull mlbench/pytorch-backend-benchmark:latest`. The image can be used to benchmark any other backend.

To benchmark a custom backend, it must first be installed in the image. For that, simply modify the [Dockerfile](https://github.com/mlbench/mlbench-benchmarks/blob/develop/pytorch/backend_benchmark/Dockerfile)
and rebuild the image.