
Commit a08e6aa

Author: Edoardo Holzl
Commit message: Finalize post
1 parent aeed8d3 commit a08e6aa

1 file changed: _posts/2020-10-02-nlp-translation.md
Lines changed: 41 additions & 11 deletions

@@ -7,7 +7,7 @@ tags: [performance, results]
excerpt_separator: <!--more-->
---

-During the past years, Natural Language Processing (NLP) has gained a lot of interest in the Machine Learning community.
+Natural Language Processing (NLP) has gained a lot of interest in the Machine Learning community.
This increasing attention towards the subject may come from the fascination of teaching machines to understand and assimilate human language,
and use them as tools to complement and facilitate our everyday lives.

@@ -124,35 +124,58 @@ The goal for both models is determined by the Bilingual Evaluation Understudy Sc
The models are trained on 1, 2, 4, 8, and 16 workers, and all step times are precisely measured to obtain an accurate speedup quantification.
Speedups are computed with respect to the 1-worker case, and are intended to illustrate how well the task can be distributed.
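
To make the measurement concrete, here is a minimal sketch of how such speedups could be computed from per-phase step times. The phase names and the `step_times` numbers are placeholders for illustration, not the measured results:

```python
# Minimal sketch: speedups relative to the 1-worker baseline, optionally
# excluding a phase (e.g. communication). Phase names and numbers below
# are placeholders, not the measured results.
def speedup(step_times, exclude=()):
    """step_times: {num_workers: {phase_name: seconds}}"""
    def total(n):
        return sum(t for phase, t in step_times[n].items() if phase not in exclude)
    baseline = total(1)
    return {n: baseline / total(n) for n in sorted(step_times)}

step_times = {
    1:  {"compute": 1000.0, "communication": 0.0,   "optimize": 10.0},
    16: {"compute": 65.0,   "communication": 180.0, "optimize": 20.0},
}
print(speedup(step_times))                              # overall speedup
print(speedup(step_times, exclude=("communication",)))  # speedup without communication
```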

-### LSTM (GNMT implementation)
+### Overall Speedups

-The graph below shows the time speedups for the first model. The left graph shows the absolute speed ups with respect to one worker, and the right one omits
-communication times from the speed up. This allows us to better see the effect of communication.
+The graphs below show the time speedups for the LSTM model and Transformer model (respectively).

<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png" data-lightbox="task4a_speedups" data-title="Speedups for GNMT">
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_speedup.png)
*GNMT Speedups*
</a>

+<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_speedup.png" data-lightbox="task4b_speedups" data-title="Speedups for Transformer">
+![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_speedup.png)
+*Transformer Speedups*
+</a>
+
+The left graph shows the absolute speedups with respect to one worker, and the right one omits
+communication times from the speedup. This allows us to better see the effect of communication.
+
+
A few interesting points:
- Overall speedups follow a $$ \log_{2}(n) $$ curve, with `n = num_workers`, while compute speedups are roughly linear (a rough comparison is sketched after this list).
-- Scaling the number of compute nodes gives nearly perfect scaling for this task
+- Scaling the number of compute nodes gives nearly perfect scaling for both tasks (right plot)
- Using more powerful communication hardware (e.g. Tesla V100) will positively affect speedups.
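
As a rough illustration of the first point, these are the two reference curves the measured speedups can be compared against (illustrative only, using the worker counts listed above):

```python
import math

# Reference curves only: overall speedup roughly follows log2(n), while
# compute-only speedup scales roughly linearly with n (illustrative sketch,
# not the measured data).
for n in (1, 2, 4, 8, 16):
    overall_trend = max(1.0, math.log2(n))  # ~log2(n) overall speedup
    compute_trend = float(n)                # ~linear compute-only speedup
    print(f"{n:>2} workers: overall ~ {overall_trend:.1f}x, compute ~ {compute_trend:.1f}x")
```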

-The next figure shows the total time spent in each step of training. As expected, we can see that compute steps take less time as we increase the number of nodes,
-while communication increasingly takes more and more time, following a logarithmic path.
+As the distribution level increases, communication becomes increasingly heavy and attenuates the speedups quite significantly.

-Time spent optimizing doesn’t seem to follow the same path, but increases are insignificant (~10 seconds),
-and are due to additional compute steps (averaging tensors, computations related to Mixed precision) when using distribution.
+### Step times
+
+The next figures show the total time spent in each step of training.

<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_times.png" data-lightbox="task4a_times" data-title="Step times for GNMT">
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_times.png)
*Step times for GNMT*
</a>

-Finally, the following figure shows the loss evolution (left), Ratio of communication to total time (center), and a price index (right),
+<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_times.png" data-lightbox="task4b_times" data-title="Step times for Transformer">
+![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_times.png)
+*Step times for Transformer*
+</a>
+
+As expected, we can see that compute steps take less time as we increase the number of nodes,
+while communication takes more and more time, following a logarithmic path. Interestingly, the Transformer model's communication times quickly reach a plateau
+after 4 workers, while GNMT's communication times keep increasing. This effect is probably due to larger values in the shared tensors.
+
+Time spent optimizing doesn’t seem to follow the same path, but the increases are insignificant (~10 seconds),
+and are due to additional compute steps (averaging tensors, computations related to mixed precision) when using distribution.
+
+### Performance comparison
+
+Finally, the following figures show the loss evolution (left), the ratio of communication to total time (center), and a price index (right),
computed as follows: $$ index = \frac{price\_increase}{performance\_increase} $$
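
A minimal sketch of how such an index could be computed, assuming the price scales linearly with the number of workers and with wall-clock time (the `hourly_rate` and timing values are hypothetical, not the actual cluster prices):

```python
# Sketch of the price index: price_increase / performance_increase, assuming
# cost = num_workers * hourly_rate * wall_clock_time. All numbers below are
# hypothetical, not the actual cluster prices or timings.
def price_index(n_workers, time_n, time_1, hourly_rate=1.0):
    price_increase = (n_workers * hourly_rate * time_n) / (1 * hourly_rate * time_1)
    performance_increase = time_1 / time_n  # speedup vs. the 1-worker case
    return price_increase / performance_increase

# An index below 1 means performance grows faster than cost.
print(price_index(n_workers=4, time_n=3.0, time_1=10.0))  # ~0.36
```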

+#### LSTM
<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_loss_ratio_prices.png" data-lightbox="task4a_loss_ratio_prices" data-title="Step times for GNMT">
![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4a_loss_ratio_prices.png)
*Loss, communication ratio and price index for GNMT*
@@ -163,9 +186,16 @@ This could be made faster by using a more appropriate connectivity between the w
An interesting thing to observe is that the cost index curve first decreases and has a valley before increasing again, which depicts the limits of distribution for this task.
The ratio of price increase to performance increase seems to be best for 4 workers, but all indices are lower than 1, meaning the cost compromise is worth it for this task.

-### Transformer

+#### Transformer
+<a href="{{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_loss_ratio_prices.png" data-lightbox="task4b_loss_ratio_prices" data-title="Step times for Transformer">
+![test]({{ site.baseurl }}public/images/blog/2020-10-02-nlp-translation/task4b_loss_ratio_prices.png)
+*Loss, communication ratio and price index for Transformer*
+</a>

+Compared to the LSTM model, the communication time ratio is slightly lower, but follows a similar path. For 16 workers, it reaches above 60%. However, the price index has a different shape:
+the observed valley is missing, and the indices decrease as we add workers. This suggests a very good performance increase with a lower price increase. The best configuration
+according to this index is 8 workers, but the 16-worker case still has very impressive advantages.

-----
