_posts/2020-10-02-nlp-translation.md
tags: [performance, results]
excerpt_separator: <!--more-->
---

The popularity and relevance of Natural Language Processing (NLP) may come from the
fascination of teaching machines to understand and assimilate human language,
and use them as tools to complement and facilitate our everyday lives.

Machine translation is one branch of NLP, and consists of having an automated model capable of
translating text from one language to another almost instantaneously.

In this blog post, we analyze how distributed learning improves the training time of two different machine translation models:
an LSTM variant (GNMT) and an attention-based model (Transformer).

<!--more-->

Those models present two main limitations that make training very time-consuming:
- They need millions of data points to reach acceptable performance.
- Models are quite large (hundreds of millions of parameters), and computations take significantly more time than for simpler models.

Each of those problems can be solved using distribution:
- Distribute the data on multiple machines (data-parallel).
- Distribute the computations for one data point over multiple cores (compute-parallel); this requires the model to be parallelizable.

Based on these limitations, we can divide the processing of data points over multiple workers,
or even subdivide the computations required to process a single data point.
In our experiments, we focus on dividing the data (data-parallel).
We plan to extend these results to model-parallel training in the future.

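As a minimal illustration of the data-parallel idea (a toy NumPy sketch, not the actual MLBench training code), each worker computes a gradient on its own shard of the batch, and averaging the per-worker gradients recovers the full-batch gradient:

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / n w.r.t. w
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

# Data-parallel: each of 4 workers holds an equal shard of the batch
shards = [(X[i::4], y[i::4]) for i in range(4)]
local_grads = [grad_mse(w, Xs, ys) for Xs, ys in shards]

# "All-reduce" step: averaging the worker gradients gives the same
# result as computing the gradient over the whole batch at once
avg_grad = sum(local_grads) / len(local_grads)
assert np.allclose(avg_grad, grad_mse(w, X, y))
```

This equality (with equal-sized shards) is what lets the workers process disjoint data and still take the same optimization step a single machine would.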
## Models
This gives a model with a total of 160,671,297 trainable parameters.

This model was first published in {% cite attention %}, and aims at completely disregarding recurrence and relying entirely on self-attention
mechanisms to solve sequence modelling and translation problems.

Transformer uses Multi-Head attention mechanisms: instead of computing the attention once, it runs through the scaled dot-product attention multiple times
Where $$confidence = 1 - smoothing$$. The smoothing is set to a value of 0.1 for both tasks.
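As a toy sketch of this target distribution (an assumption for illustration, not the exact implementation used here; some variants spread the smoothing mass over all classes, including the true one), label smoothing can be written as:

```python
import numpy as np

def smooth_labels(target, num_classes, smoothing=0.1):
    """One-hot target with label smoothing.

    The true class gets confidence = 1 - smoothing; the remaining
    probability mass is spread uniformly over the other classes.
    """
    confidence = 1.0 - smoothing
    dist = np.full(num_classes, smoothing / (num_classes - 1))
    dist[target] = confidence
    return dist

dist = smooth_labels(target=2, num_classes=5, smoothing=0.1)
# True class keeps 0.9; the other four classes share the remaining 0.1
assert abs(dist[2] - 0.9) < 1e-12 and abs(dist.sum() - 1.0) < 1e-12
```

Training against this softened distribution penalizes over-confident predictions, which tends to improve BLEU scores in translation models.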

### Optimization
As we have seen above, both models have a very high number of parameters to train.
This can be an issue when using GPUs, as the model needs to fit in memory, and back-propagation requires at least twice the model's size in memory to work.

For example:
- Transformer has 200 million trainable parameters. In full precision (`Float32`), this results in 800 MB just for storing the weights.
- The forward pass needs to compute and store each intermediate output, so we add another 800 MB.
- Each tensor sent to or received from the other workers will be 800 MB. For 4 workers this results in 3.2 GB needed.
- Those tensors also take longer to transmit as they grow larger.
- Backpropagation requires at least 3 to 4 times the amount of memory used by the model, so another 3.2 GB.

Considering those numbers, this already results in a memory usage of ~8 GB, which in reality is much greater, as CUDA and cuDNN also need their share of memory.
From our experiments, a memory of 16 GB is far from enough to train those models in full precision.

To address this, instead of using full precision, we used
mixed-precision training, where most computations are done in `Float16`. We use a synchronous data-parallel version of `Adam`,
where gradients are aggregated amongst all workers before updating weights.
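The arithmetic above can be reproduced with a back-of-the-envelope script (the sizes and factors are the rough assumptions stated in this post, not measured values; real usage adds CUDA/cuDNN overhead on top):

```python
# Back-of-the-envelope memory estimate for a 200M-parameter model
params = 200_000_000
bytes_fp32, bytes_fp16 = 4, 2  # bytes per parameter

def estimate_gb(bytes_per_param, workers=4, backprop_factor=4):
    weights = params * bytes_per_param       # storing the weights
    activations = weights                    # forward-pass outputs (rough)
    comm = weights * workers                 # tensors exchanged with workers
    backprop = weights * backprop_factor     # backward-pass workspace
    return (weights + activations + comm + backprop) / 1e9

print(f"fp32: ~{estimate_gb(bytes_fp32):.1f} GB")  # ~8.0 GB
print(f"fp16: ~{estimate_gb(bytes_fp16):.1f} GB")  # ~4.0 GB
```

Halving the bytes per parameter halves every term at once, which is why mixed precision makes these models fit on 16 GB GPUs.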

### Datasets
Both tasks use the same test set, but are trained on slightly different data sets:
- The LSTM is trained on the English to German Workshop on Machine Translation 16 (WMT16) dataset, comprising 3,975,116 translated sentence pairs.
- The Transformer is trained on the English to German Workshop on Machine Translation 17 (WMT17) dataset, comprising 4,590,101 translated sentence pairs.

More details on both tasks can be found in our [documentation](https://mlbench.readthedocs.io/en/latest/benchmark-tasks.html#task-4-machine-translation).

## Results
Speedups are computed with respect to the 1 worker case.
The graphs below show the time speedups for the LSTM model and Transformer model (respectively).

<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png" data-lightbox="task4a_speedups" data-title="Speedups for GNMT">
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_speedup.png" data-lightbox="task4b_speedups" data-title="Speedups for Transformer">
The left graph shows the absolute speedups with respect to one worker, and the right one omits
communication times from the speedup. This allows us to better see the effect of communication.

A few interesting points:
- Overall speedups follow a sub-linear pattern, while compute speedups are roughly linear.
- Scaling the number of compute nodes gives nearly perfect scaling for both tasks (right plot).
- Using more powerful communication hardware (e.g. Tesla V100) would positively affect speedups. We currently have around 10 Gbps
connection speed between the workers, and such hardware could increase it by a factor of at least 10.

As the distribution level increases, we can see that communication becomes heavier and heavier, and attenuates speedups quite significantly.

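To make the two speedup definitions concrete, here is a small sketch with made-up per-run times (the numbers are purely illustrative, not our measurements):

```python
# Hypothetical total compute and communication times (seconds) per worker
# count, used only to show how the two speedup curves are computed.
times = {1: {"compute": 100.0, "comm": 0.0},
         2: {"compute": 51.0,  "comm": 9.0},
         4: {"compute": 26.0,  "comm": 14.0},
         8: {"compute": 13.5,  "comm": 17.0}}

base_total = times[1]["compute"] + times[1]["comm"]
for n, t in times.items():
    total = t["compute"] + t["comm"]
    overall = base_total / total                        # left plot: includes communication
    compute_only = times[1]["compute"] / t["compute"]   # right plot: communication omitted
    print(f"{n} workers: overall {overall:.2f}x, compute-only {compute_only:.2f}x")
```

With numbers like these, compute-only speedup stays near-linear while the overall speedup flattens as communication grows, matching the pattern in the plots.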
The next figures show the total time spent in each step of training.
*Step times for Transformer*
</a>

- The top left graph in each figure shows the total training time: `total = compute + communication`.
- Computation times are `compute = fwd + bwd + opt`.
- Communication times are precisely measured to take into account only the communication of tensors between workers.

As expected, we can see that compute steps take less time as we increase the number of nodes,
while communication takes more and more time, following a sub-linear path. Interestingly, the Transformer model's communication times quickly reach a plateau
after 4 workers, while GNMT's communication times keep increasing. This effect is probably due to larger values in the shared tensors.

Time spent optimizing doesn’t seem to follow the same path, but increases are insignificant (~10 seconds),