
Commit 4f1e55b

Author: Edoardo Holzl (committed)

PR changes
1 parent 538f15b commit 4f1e55b

1 file changed: 42 additions & 21 deletions


_posts/2020-10-02-nlp-translation.md

@@ -7,27 +7,30 @@ tags: [performance, results]
excerpt_separator: <!--more-->
---

The popularity and relevance of Natural Language Processing (NLP) may come from the fascination of teaching machines to understand and assimilate human language, and of using them as tools to complement and facilitate our everyday lives.

Machine translation is one branch of NLP, and consists of having an automated model capable of translating text from one language to another almost instantaneously.

In this blog post, we analyze how distributed learning improves the training time of two different machine translation models: an LSTM variant (GNMT) and an attention-based model (Transformer).

<!--more-->

Those models present two main limitations that make training very time consuming:
- They need millions of data points to reach acceptable performance.
- Models are quite large (hundreds of millions of parameters), and computations take significant time compared to simpler models.

Each of those problems can be addressed using distribution:
- Distribute the data on multiple machines (data-parallel).
- Distribute the computations for one data point on multiple cores (compute-parallel); this requires the model to be parallelizable.

Based on these limitations, we can divide the processing of data points over multiple workers, or even subdivide the computations required to process a single data point. In our experiments, we focus on dividing the data (data-parallel), as sketched below. We plan to extend these results to model-parallel training in the future.
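
To make the data-parallel setup concrete, here is a minimal sketch of what such a training loop can look like in PyTorch, using `DistributedDataParallel` so that gradients are averaged across workers at every step. It is an illustration only: the model, dataset and hyperparameters are placeholders, not the benchmark implementation.

```python
# Minimal data-parallel training sketch (illustrative, not the MLBench code).
# Each worker holds a full copy of the model and sees a different shard of the
# data; gradients are averaged across workers before every weight update.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1, lr=1e-3):
    dist.init_process_group("nccl")      # assumes a torchrun-style launch, one process per GPU
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    model = DDP(model.to(device), device_ids=[device.index])

    sampler = DistributedSampler(dataset)        # each worker gets a disjoint shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                 # reshuffle shards every epoch
        for src, tgt in loader:                  # exact batch format depends on the model
            optimizer.zero_grad()
            loss = loss_fn(model(src.to(device)), tgt.to(device))
            loss.backward()                      # DDP all-reduces gradients here
            optimizer.step()                     # identical update on every worker
```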

## Models
@@ -59,7 +62,6 @@ This gives a model with a total of 160,671,297 trainable parameters.

This model was first published in {% cite attention %}, and aims at completely disregarding recurrence and relying entirely on self-attention mechanisms to solve sequence modelling and translation problems.

Transformer uses Multi-Head attention mechanisms: instead of computing the attention once, it runs through the scaled dot-product attention multiple times in parallel.
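
For reference, each head computes the scaled dot-product attention $$ softmax(QK^{T} / \sqrt{d_k})V $$. A compact, illustrative sketch (shapes and names are our own, not the benchmark code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k); each head attends independently.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Multi-head attention runs h of these in parallel by splitting the model
# dimension into h chunks of size d_k = d_model / h, then concatenating the results.
```
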
@@ -93,18 +95,30 @@ $$ Loss = confidence * NLLLoss + smoothing * SmoothLoss $$
Where $$confidence = 1 - smoothing$$. The smoothing is set to a value of 0.1 for both tasks.
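
A possible implementation of this label-smoothed loss, assuming `SmoothLoss` is the uniform-prior term (the mean negative log-probability over the vocabulary); this is a sketch, and the actual task code may differ:

```python
import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, target, smoothing=0.1):
    # logits: (batch, vocab_size), target: (batch,) of token indices.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)      # uniform prior over the vocabulary
    confidence = 1.0 - smoothing
    return (confidence * nll + smoothing * smooth).mean()
```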

### Optimization
As we have seen above, both models have a very high number of parameters to train. This can be an issue when using GPUs, as the model needs to fit in memory, and back-propagation requires at least twice the model's size in memory to work.

For example:
- Transformer has 200 million trainable parameters. In full precision (`Float32`), this results in 800 MB just for storing the weights.
- The forward pass needs to compute and store intermediate outputs, so we add roughly another 800 MB.
- Each tensor sent to or received from the other workers will also be around 800 MB; for 4 workers this results in 3.2 GB needed.
- Those tensors also take longer to transmit as they grow larger.
- Back-propagation requires at least 3 to 4 times the amount of memory needed by the model, so another 3.2 GB of memory.

Considering those numbers, we already reach a memory usage of roughly 8 GB, which in reality is even greater as CUDA and cuDNN also need their share of memory. From our experiments, 16 GB of memory is far from enough to train those models in full precision.
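
As a rough sanity check of these numbers, here is the same back-of-the-envelope arithmetic in a few lines of Python (an illustration of the estimate above, not a measurement):

```python
# Rough full-precision (Float32) memory estimate; constants mirror the figures above.
params = 200_000_000                          # trainable parameters (Transformer, approx.)
bytes_per_value = 4                           # Float32

weights_gb = params * bytes_per_value / 1e9   # ~0.8 GB for the weights
activations_gb = weights_gb                   # forward-pass outputs, ~0.8 GB
comm_gb = 4 * weights_gb                      # gradient tensors for 4 workers, ~3.2 GB
backprop_gb = 4 * weights_gb                  # 3-4x the model size, ~3.2 GB

total_gb = weights_gb + activations_gb + comm_gb + backprop_gb
print(f"~{total_gb:.1f} GB before CUDA/cuDNN overhead")   # ~8.0 GB
```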

For that reason, instead of using regular precision, we used mixed-precision training, where most computations are done in `Float16`. We use a synchronous data-parallel version of `Adam`, where gradients are aggregated amongst all workers before updating weights.
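
A rough sketch of this combination using PyTorch's generic mixed-precision and collective APIs; it is illustrative only, and the explicit all-reduce below simply stands in for the gradient aggregation step:

```python
import torch
import torch.distributed as dist

# scaler = torch.cuda.amp.GradScaler() is created once, outside the training loop.
def train_step(model, optimizer, scaler, src, tgt, loss_fn):
    optimizer.zero_grad()

    # Mixed precision: most of the forward/backward pass runs in Float16, while
    # master weights and optimizer state stay in Float32; the loss is scaled to
    # avoid Float16 gradient underflow.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(src), tgt)
    scaler.scale(loss).backward()

    # Synchronous data-parallel Adam: average gradients over all workers before
    # the update, so every worker applies the same step. Averaging the still-scaled
    # gradients is fine, since the scale factor is identical on all workers.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    scaler.step(optimizer)                 # optimizer is torch.optim.Adam
    scaler.update()
    return loss.item()
```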

### Datasets
Both tasks use the same test set, but are trained on slightly different data sets:
- The LSTM is trained on the English-to-German WMT16 (Conference on Machine Translation 2016) dataset, comprising 3,975,116 translated sentences.
- The Transformer is trained on the English-to-German WMT17 (Conference on Machine Translation 2017) dataset, comprising 4,590,101 translated sentences.

More details on both tasks can be found in our [documentation](https://mlbench.readthedocs.io/en/latest/benchmark-tasks.html#task-4-machine-translation).

## Results

@@ -129,23 +143,26 @@ Speedups are computed with respect to the 1 worker case, and are intended to ill
The graphs below show the time speedups for the LSTM model and the Transformer model (respectively).

<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png" data-lightbox="task4a_speedups" data-title="Speedups for GNMT">
*GNMT Speedups*
![GNMT speedups]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png)
</a>

<br />

<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_speedup.png" data-lightbox="task4b_speedups" data-title="Speedups for Transformer">
*Transformer Speedups*
![Transformer speedups]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_speedup.png)
</a>

The left graph shows the absolute speedups with respect to one worker, and the right one omits communication times from the speedup. This allows us to better see the effect of communication.
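
To make the definition explicit, the speedup for n workers is the 1-worker time divided by the n-worker time, optionally subtracting communication time from both. A small illustrative helper (the numbers below are made up, not our measurements):

```python
def speedups(total_time, comm_time=None):
    """Speedup vs. the 1-worker run, optionally excluding communication.

    total_time: dict mapping number of workers -> total time per epoch (s)
    comm_time:  dict mapping number of workers -> communication time (s)
    """
    comm = comm_time or {n: 0.0 for n in total_time}
    base = total_time[1] - comm[1]
    return {n: base / (total_time[n] - comm[n]) for n in total_time}

# Made-up example: compute scales almost linearly, overall speedup is sub-linear.
total = {1: 1000.0, 2: 560.0, 4: 330.0, 8: 230.0}
comm = {1: 0.0, 2: 60.0, 4: 80.0, 8: 105.0}
print(speedups(total))        # overall speedups
print(speedups(total, comm))  # speedups with communication excluded
```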

A few interesting points:
- Overall speedups follow a sub-linear pattern, while compute-only speedups are roughly linear.
- Scaling the number of compute nodes gives nearly perfect scaling for both tasks (right plot).
- Using more powerful communication hardware (e.g. Tesla V100) will positively affect speedups. We currently have around 10 Gbps connection speed between the workers, and such hardware could increase it by a factor of at least 10.

As the distribution level increases, we can see that communication becomes heavier and heavier, and attenuates speedups quite significantly.

@@ -163,8 +180,12 @@ The next figures show the total time spent in each step of training.
*Step times for Transformer*
</a>

- The top left graph in each figure shows the total training time `total = compute + communication`.
- Computation times are `compute = fwd + bwd + opt`.
- Communication times are measured precisely to take into account only the communication of tensors between workers.

As expected, we can see that compute steps take less time as we increase the number of nodes, while communication takes more and more time, following a sub-linear path. Interestingly, the Transformer model's communication times quickly reach a plateau after 4 workers, while GNMT's communication times keep increasing. This effect is probably due to larger values in the shared tensors.

Time spent optimizing doesn’t seem to follow the same path, but increases are insignificant (~10 seconds),
