
Commit aeed8d3

author
Edoardo Holzl
committed
Add NLP blogpost
1 parent 2fa66b7 commit aeed8d3

9 files changed

Lines changed: 175 additions & 0 deletions

@@ -0,0 +1,175 @@
1+
---
2+
layout: post
3+
title: NLP Translation tasks, results discussion
4+
author: e_hoelzl
5+
published: true
6+
tags: [performance, results]
7+
excerpt_separator: <!--more-->
8+
---
9+
10+
Over the past few years, Natural Language Processing (NLP) has attracted a lot of interest in the Machine Learning community.
11+
This growing attention may come from the fascination of teaching machines to understand and assimilate human language,
12+
and of using them as tools to complement and facilitate our everyday lives.
13+
14+
Machine translation is one branch of NLP, and consists of having an automated model translate text from one language to another in a few seconds.
15+
16+
In this blog post, we analyze the performance increase brought by distribution for two different Machine Translation models: an LSTM variant (GNMT) and an
17+
attention-based model (Transformer).
18+
19+
<!--more-->
20+
21+
These models present two main limitations that make training very time-consuming:
22+
- They need millions of data points to reach acceptable performance.
23+
- The models are quite large, so computations take significantly more time.
24+
25+
Each of these problems can be addressed using distribution:
26+
- Distribute the data across multiple machines (data-parallel)
27+
- Distribute computations across multiple cores (compute-parallel)
28+
29+
However, the second approach requires the models to be parallelizable. We have only used data-parallel distribution for these tasks, but hope to combine it
30+
with parallel computation in the future.
31+
32+
33+
## Models
34+
35+
First, let's take a quick look at the models' architectures to understand their scale.
36+
37+
### LSTM
38+
The LSTM variant we implemented was designed by Google {% cite gnmt %} and is called Google Neural Machine Translation (GNMT).
39+
The architecture is shown in the figure below.
40+
41+
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/gnmt.png" data-lightbox="gnmt_architecture" data-title="GNMT Architecture">
42+
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/gnmt.png)
43+
*GNMT Architecture*
44+
</a>
45+
46+
The left side is the encoder network and the right side is the decoder; the two are connected via the attention module.
47+
The first encoder LSTM layer is bi-directional, and the others are uni-directional.
48+
Residual connections start from the third layer from the bottom in both the encoder and the decoder.
49+
50+
This model follows the sequence-to-sequence learning framework, and uses stacked residual LSTM connections in the encoder and decoder modules.
51+
The residual connections allow for deeper stacked LSTM layers: without residuals, the stack typically suffers from vanishing/exploding
52+
gradients when too many layers are used.
53+
The attention module is based on the one described in {% cite bahdanau2014neural %}.
54+
55+
In our implementation, the encoder and decoder each have 4 stacked LSTM layers with residual connections, and hidden sizes of 1024.
56+
This gives a model with a total of 160,671,297 trainable parameters.
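
To give a feel for where such counts come from, here is a minimal sketch that counts the parameters of a single uni-directional LSTM layer. It assumes the PyTorch `nn.LSTM` layout (four gates, with separate input-hidden and hidden-hidden weights and biases) and an input size equal to the hidden size. The function name is hypothetical, and the sketch does not try to reproduce the full model total, which also depends on components not detailed here.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameter count of one uni-directional LSTM layer (PyTorch layout)."""
    gates = 4  # input, forget, cell and output gates
    # weight_ih is (4h x input_size), weight_hh is (4h x h)
    weights = gates * hidden_size * (input_size + hidden_size)
    # bias_ih and bias_hh, each of size 4h
    biases = 2 * gates * hidden_size
    return weights + biases


# Assumption: a layer whose input size equals the hidden size of 1024.
print(lstm_layer_params(1024, 1024))  # 8396800
```

A single such layer already holds about 8.4 million parameters, which illustrates how quickly stacked layers add up.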
57+
58+
### Transformer
59+
60+
This model was first published in {% cite attention %}. It dispenses with recurrence entirely and relies on self-attention
61+
mechanisms to perform sequence modelling and translation.
62+
Such a structure allows for better parallelization of training on multiple GPUs, and can reach significantly better performance than comparable models.
63+
64+
Transformer uses Multi-Head attention mechanisms: instead of computing the attention once, it runs through the scaled dot-product attention multiple times
65+
in parallel.
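
The idea can be sketched in plain Python as follows. This is a simplified illustration, not our implementation: it omits the learned query, key, value and output projections, and simply splits the feature dimension into `num_heads` slices that each run scaled dot-product self-attention. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """q: [n_q][d], k: [n_k][d], v: [n_k][d_v] -> output [n_q][d_v]."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def multi_head_attention(x, num_heads):
    """Self-attention per feature slice, then concatenation (no projections)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sl = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(scaled_dot_product_attention(sl, sl, sl))
    # Concatenate the head outputs for each position
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(tokens, num_heads=2))
```

Because the heads are independent of each other, they can be computed in parallel.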
66+
67+
The figure below shows an overview of the architecture.
68+
69+
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/transformer.png" data-lightbox="transformer_architecture" data-title="Transformer Architecture">
70+
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/transformer.png)
71+
*Transformer Architecture*
72+
</a>
73+
74+
Our implementation follows the original one described in the paper: encoder and decoder each have 6 identical layers.
75+
The layers are composed as follows:
76+
- Encoder layers: Multi-head attention, followed by position-wise feed-forward layer (with residual connections)
77+
- Decoder layers: Similar to encoder layers, but with an additional multi-head attention layer that performs attention on the encoder output.
78+
79+
All multi-head attention modules have 16 heads for both encoder and decoder layers. This results in a model that has a total of 210,808,832 parameters.
80+
81+
82+
## Training
83+
84+
### Loss Function
85+
For both of the models, we use Negative Log-Likelihood loss with label smoothing.
86+
The models output a probability for each word of the vocabulary for the translated sentence.
87+
From this, we can compute $$NLLLoss(\mathbf{\hat{y}}, \mathbf{y})$$ where $$\mathbf{\hat{y}}$$ is the model output and $$\mathbf{y}$$ is the target.
88+
89+
90+
$$ \text{SmoothLoss} = -\text{mean}(\log \text{softmax}(\mathbf{\hat{y}})) $$
91+
$$ \text{Loss} = \text{confidence} \cdot \text{NLLLoss} + \text{smoothing} \cdot \text{SmoothLoss} $$
92+
93+
Here, $$\text{confidence} = 1 - \text{smoothing}$$. The smoothing is set to a value of 0.1 for both tasks.
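
As a minimal sketch of the formulas above, the following plain-Python function computes the smoothed loss for a single target word. It assumes the model output is a vector of unnormalized scores (logits) over the vocabulary; the function names are hypothetical and this is not our actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def label_smoothed_nll(logits, target, smoothing=0.1):
    """Loss = confidence * NLLLoss + smoothing * SmoothLoss."""
    log_probs = log_softmax(logits)
    nll_loss = -log_probs[target]                    # standard NLL term
    smooth_loss = -sum(log_probs) / len(log_probs)   # -mean(log softmax)
    confidence = 1.0 - smoothing
    return confidence * nll_loss + smoothing * smooth_loss


# Toy vocabulary of 4 words, target word at index 2
print(label_smoothed_nll([2.0, 0.5, 3.0, -1.0], target=2))
```

With `smoothing=0.0` the function reduces to the plain negative log-likelihood.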
94+
95+
### Optimization
96+
As we have seen above, both models have a very high number of parameters to train. This can be an issue when using GPUs, as the model needs to fit in memory.
97+
Also, back-propagation requires at least twice the size of the model in memory. To reduce this footprint, instead of using regular precision, we used
98+
mixed-precision training, where most computations are done in `Float16`. We use a centralized version of `Adam`, where gradients are aggregated across all workers before
99+
updating the weights.
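
The aggregate-then-update pattern can be illustrated with a small simulation in plain Python. This is only a sketch under assumptions: the workers are simulated as lists of gradients, the aggregation is assumed to be an average, the hyperparameter defaults are the commonly used Adam defaults rather than the values used in our runs, and mixed precision is left out. All names are hypothetical.

```python
import math


def all_reduce_mean(worker_grads):
    """Average the gradients of all workers, element by element."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists; returns new params and moment estimates."""
    new_params, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(params, grads, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first moment
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_params, new_m, new_v


# Two simulated workers, each with gradients from its own data shard
local_grads = [[1.0, 2.0], [3.0, 6.0]]
grads = all_reduce_mean(local_grads)  # every worker sees [2.0, 4.0]
params, m, v = adam_step([0.0, 0.0], grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(grads, params)
```

Since every worker applies the same update to the same averaged gradients, the model replicas stay identical after each step.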
100+
101+
### Datasets
102+
Both tasks use the same test set, but are trained on slightly different datasets:
103+
- The LSTM is trained on the English-to-German WMT16 dataset (from the 2016 Conference on Machine Translation), comprising 3,975,116 translated sentences.
104+
- The Transformer is trained on the English-to-German WMT17 dataset (from the 2017 Conference on Machine Translation), comprising 4,590,101 translated sentences.
105+
106+
107+
More details on both tasks can be found in our [documentation](https://mlbench.readthedocs.io/en/latest/benchmark-tasks.html#task-4-machine-translation).
108+
109+
## Results
110+
111+
Let us now get to the fun part: the results. As previously discussed, these models take a long time to train, and the aim of MLBench is to study the benefit of distribution.
112+
For reproducibility purposes, here are the hardware and software we used:
113+
- Cloud service: Google Cloud
114+
- Machine Type: `n1-standard-4`
115+
- PyTorch 1.5.1
116+
- NVIDIA Tesla T4 GPU (1 per node)
117+
- 4 cores and 15 GB of RAM
118+
- NCCL communication backend
119+
120+
The goal for both models is determined by the Bilingual Evaluation Understudy Score (BLEU):
121+
- The LSTM task stops when reaching a BLEU Score of 24.0
122+
- The Transformer task stops when reaching a BLEU score of 25.0
123+
124+
The models are trained on 1, 2, 4, 8 and 16 workers, and all step times are precisely measured to quantify speedups accurately.
125+
Speedups are computed with respect to the 1 worker case, and are intended to illustrate the distributive capabilities of the task.
126+
127+
### LSTM (GNMT implementation)
128+
129+
The figure below shows the time speedups for the first model. The left graph shows the absolute speedups with respect to one worker, and the right one omits
130+
communication times from the speedup. This lets us better see the effect of communication.
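
As a sketch of how both kinds of speedup can be derived from measured step times, consider the snippet below. The timings are hypothetical placeholders chosen for illustration, not our measurements, and the function name is an assumption.

```python
def speedup(times, baseline_workers=1):
    """Speedup of each configuration relative to the baseline worker count."""
    base = times[baseline_workers]
    return {n: base / t for n, t in times.items()}


# Hypothetical total times in seconds, keyed by number of workers
compute = {1: 1000.0, 2: 500.0, 4: 250.0}
communication = {1: 0.0, 2: 100.0, 4: 150.0}

total = {n: compute[n] + communication[n] for n in compute}
absolute = speedup(total)        # includes communication
compute_only = speedup(compute)  # communication omitted

print(absolute)
print(compute_only)
```

The gap between the two curves is what communication costs at each scale.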
131+
132+
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png" data-lightbox="task4a_speedups" data-title="Speedups for GNMT">
133+
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png)
134+
*GNMT Speedups*
135+
</a>
136+
137+
A few interesting points:
138+
- Overall speedups follow a $$ \log_{2}(n) $$ trend, with `n = num_workers`, while compute-only speedups are roughly linear.
139+
- Scaling the number of compute nodes gives nearly perfect scaling of compute time for this task.
140+
- Using hardware with faster communication (e.g. Tesla V100) should positively affect speedups.
141+
142+
The next figure shows the total time spent in each step of training. As expected, we can see that compute steps take less time as we increase the number of nodes,
143+
while communication takes increasingly more time, growing roughly logarithmically.
144+
145+
Time spent optimizing doesn’t seem to follow the same path, but increases are insignificant (~10 seconds),
146+
and are due to additional compute steps (averaging tensors, computations related to mixed precision) when using distribution.
147+
148+
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_times.png" data-lightbox="task4a_times" data-title="Step times for GNMT">
149+
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_times.png)
150+
*Step times for GNMT*
151+
</a>
152+
153+
Finally, the following figure shows the loss evolution (left), the ratio of communication to total time (center), and a price index (right),
154+
computed as follows: $$ index = \frac{price\_increase}{performance\_increase} $$
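
A worked example may help read this index. The post does not spell out how `price_increase` is measured, so the sketch below assumes that total cost is proportional to the number of workers times the wall-clock time, and that `performance_increase` is the speedup over one worker. The function name and the timings are hypothetical.

```python
def price_index(workers, time_s, base_time_s, base_workers=1):
    """index = price_increase / performance_increase (under stated assumptions)."""
    # Assumption: cost is proportional to workers * wall-clock time
    price_increase = (workers * time_s) / (base_workers * base_time_s)
    # Assumption: performance increase is the speedup over the baseline
    performance_increase = base_time_s / time_s
    return price_increase / performance_increase


# Hypothetical run: 1 worker takes 1000 s, 4 workers take 400 s
print(price_index(4, 400.0, 1000.0))
```

Under these assumptions, an index below 1 means that the relative gain in speed outweighs the relative increase in cost.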
155+
156+
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_loss_ratio_prices.png" data-lightbox="task4a_loss_ratio_prices" data-title="Loss, communication ratio and price index for GNMT">
157+
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_loss_ratio_prices.png)
158+
*Loss, communication ratio and price index for GNMT*
159+
</a>
160+
161+
Communication takes up a huge part of training as we increase distribution: over 70% of the time is spent sending tensors for 16 workers.
162+
This could be made faster by using better connectivity between the workers (currently 10 GB/s), which could reduce communication times by a factor of 10 or more.
163+
An interesting thing to observe is that the curve of cost index first decreases and has a valley before increasing again, which depicts the limits of distribution for this task.
164+
The price-to-performance trade-off seems to be best for 4 workers, but all indices are lower than 1, meaning the cost compromise is worth it for this task.
165+
166+
### Transformer
167+
168+
169+
170+
-----
171+
172+
## References
173+
174+
175+
{% bibliography --cited %}