
Commit d94577f

Author: Edoardo Holzl
Commit message: PR changes

1 parent 656b9ad, commit d94577f

1 file changed

Lines changed: 8 additions & 7 deletions

_posts/2020-09-08-communication-backend-comparison.md

@@ -7,12 +7,13 @@ tags: [performance, results]
 excerpt_separator: <!--more-->
 ---
 
-Distributed learning requires workers to collaborate by sharing learned information with their "colleagues". As MLBench
-supports both one and multiple processes per node, in addition to multi-node training, communication between workers is crucial
-and will heavily affect performance, notably for compute bound training algorithms.
+Distributed learning requires workers to collaborate by swiftly sharing learned information with their "colleagues".
+With the accelerating growth of model sizes in modern deep learning, this aspect gains even more importance.
 
-This Blog post addresses and analyzes the raw performance of different communication backends, used to transmit tensors and other
-information between the workers.
+MLBench supports both one and multiple processes per node, in addition to multi-node training. Communication between workers is crucial
+and will heavily affect performance, notably for communication bound training algorithms.
+
+This blog post addresses and analyzes the raw performance of different communication backends on commodity communication hardware, used to transmit large arrays or tensors.
 
 <!--more-->
 
@@ -66,8 +67,8 @@ There are 4 tensor type: `Float16` & `Float32` CPU or GPU tensors.
 
 ### CPU vs GPU tensors?
 
-Only MPI and GLOO support communication of CPU tensors, while NCCL requires the use of GPU acceleration. This is a great advantage, as CPU training is less costly
-and can be sped-up using distributed training.
+MPI and GLOO support both CPU and GPU tensor communication, while NCCL only supports communication of GPU tensors. This is a great advantage, as CPU training is less costly
+and can be sped up using distributed training.
 
 #### CPU
 In the graph below, we compare the speeds taken to perform an `all reduce` operation between 2, 4 and 8 workers, of `Float16` and `Float32` CPU tensors.
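
For context on the `all reduce` comparison the diff refers to, a minimal sketch of that kind of timing is shown below. It assumes PyTorch's `torch.distributed` package, which exposes the GLOO, MPI, and NCCL backends discussed in the post; the function name, tensor size, repeat count, and environment-variable convention are illustrative assumptions, not the post's actual benchmark code.

```python
# Illustrative sketch (not the post's benchmark code): timing an all_reduce
# of a Float32 tensor with torch.distributed. The backend name ("gloo",
# "mpi", or "nccl") and the tensor size are example choices; NCCL only
# communicates GPU tensors, so the tensor is moved to a GPU in that case.
import os
import time

import torch
import torch.distributed as dist


def benchmark_all_reduce(backend="gloo", numel=2**20, repeats=10):
    # Rank, world size, and master address are expected to be supplied by
    # the launcher (e.g. `python -m torch.distributed.launch`).
    dist.init_process_group(backend=backend)

    tensor = torch.rand(numel, dtype=torch.float32)
    if backend == "nccl":
        # NCCL requires GPU tensors.
        tensor = tensor.cuda()

    # Warm-up round so connection setup is not included in the measurement.
    dist.all_reduce(tensor)

    start = time.perf_counter()
    for _ in range(repeats):
        dist.all_reduce(tensor)
    elapsed = (time.perf_counter() - start) / repeats

    if dist.get_rank() == 0:
        print(f"{backend}: avg all_reduce of {numel} floats took {elapsed:.4f}s")

    dist.destroy_process_group()


if __name__ == "__main__":
    # BACKEND is an assumed convention for this sketch, not an MLBench flag.
    benchmark_all_reduce(backend=os.environ.get("BACKEND", "gloo"))
```

Such a script would be launched with one process per worker (2, 4, or 8, as in the post's comparison), with each process joining the same process group before the timed collective runs.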
