This repo ports part of DeepSpeed's work to MXNet. Because of the difference between symbolic and imperative execution, we divide the whole process into two phases:
Phase 1: add a reduce operation to the graph. The reduce operation does nothing in the forward pass, but in the backward pass it reduces each gradient to the GPU that owns it (according to POS_Trainer); see the conceptual sketch below.
Phase 2: in the backward graph, remove the outputs from the arrays so the memory planner can reuse their memory.
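For intuition, here is a minimal sketch of what such an operation could look like, written as an MXNet CustomOp. This is illustrative only: the actual pass inserts the operation at the graph level in C++ (`add_reduce_op.cc`, introduced below), and since Horovod's MXNet API exposes allreduce rather than a reduce-to-owner, `hvd.allreduce` stands in for the NCCL reduce here.

```python
import mxnet as mx
import horovod.mxnet as hvd

class GradReduce(mx.operator.CustomOp):
    """Identity in the forward pass; reduces the gradient in the backward pass."""

    def forward(self, is_train, req, in_data, out_data, aux):
        # Forward does nothing: copy the input straight to the output.
        self.assign(out_data[0], req[0], in_data[0])

    def backward(self, req, out_grad, in_grad, aux):
        # Backward combines the gradient across workers before it is stored.
        # hvd.allreduce is a stand-in for the NCCL reduce-to-owner.
        reduced = hvd.allreduce(out_grad[0], average=False)
        self.assign(in_grad[0], req[0], reduced)

@mx.operator.register("grad_reduce")
class GradReduceProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(GradReduceProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # The output has the same shape as the input; no auxiliary states.
        return in_shape, [in_shape[0]], []

    def create_operator(self, ctx, in_shapes, in_dtypes):
        return GradReduce()
```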
We use Horovod for communication, so please install Horovod first. We also rely on NCCL for the reduce, so please install NCCL as well.
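A quick sanity check (assuming a standard Horovod installation) that Horovod is importable and was built with NCCL support:

```python
# sanity_check.py -- run with: horovodrun -np 2 python sanity_check.py
import horovod.mxnet as hvd

hvd.init()
# nccl_built() reports whether Horovod was compiled against NCCL.
assert hvd.nccl_built(), "Horovod was built without NCCL support"
print("Horovod rank %d of %d is up" % (hvd.rank(), hvd.size()))
```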
First compile the pass the same way as MXNet's lib pass example: run `make`, which generates the dynamic library `add_reduce_op_lib.so` from `add_reduce_op.cc`. Then load the library in your Python code:
```python
import mxnet as mx
mx.library.load('add_reduce_op_lib.so')
```

Next, we need to know which GPU each parameter and its gradient belong to.
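Purely for illustration, such a partition could be a round-robin assignment of parameters to GPU ranks; `partition_params` below is a hypothetical helper, not part of this repo, and the real assignment is computed inside POS_Trainer.

```python
# Hypothetical sketch of a parameter partition: each parameter (and its
# gradient) is owned by exactly one GPU, assigned round-robin.
def partition_params(param_names, num_gpus):
    return {name: i % num_gpus for i, name in enumerate(param_names)}

print(partition_params(['fc1_weight', 'fc1_bias', 'fc2_weight', 'fc2_bias'], 2))
# -> {'fc1_weight': 0, 'fc1_bias': 1, 'fc2_weight': 0, 'fc2_bias': 1}
```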
To obtain the actual partition, use `POS_Trainer` from `pos_trainer.py` just like a normal MXNet trainer:

```python
from pos_trainer import POS_Trainer
trainer = POS_Trainer(params_dict, "adam", optimizer_params)
```

The trainer can then generate the corresponding options:
```python
options = trainer.generate_graph_pass_options()
backward_options = trainer.generate_backward_options()
```

Before the forward pass, we call
```python
model.optimize_for(x, backend="add_reduce_op", **options)
```

to insert the reduce operations into the graph.

Then we run the backward pass with the backward options:

```python
loss.backward(backward_option=backward_options)
```

Please see `test_reduce.py`; a minimal end-to-end sketch is also given below.
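Putting the steps together, a sketch of the full flow might look as follows. The Dense model, input shape, optimizer settings, and the `trainer.step` call are placeholder assumptions; `test_reduce.py` is the authoritative example.

```python
import mxnet as mx
from mxnet.gluon import nn
import horovod.mxnet as hvd
from pos_trainer import POS_Trainer

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# Load the compiled graph pass before using it as a backend.
mx.library.load('add_reduce_op_lib.so')

net = nn.Dense(10)
net.initialize(ctx=ctx)
net.hybridize()  # optimize_for operates on the hybridized (symbolic) graph

trainer = POS_Trainer(net.collect_params(), "adam", {"learning_rate": 1e-3})
options = trainer.generate_graph_pass_options()
backward_options = trainer.generate_backward_options()

x = mx.nd.random.uniform(shape=(8, 16), ctx=ctx)
net.optimize_for(x, backend="add_reduce_op", **options)  # phase 1: insert reduce ops

with mx.autograd.record():
    loss = net(x).sum()
loss.backward(backward_option=backward_options)  # phase 2: drop outputs in backward
trainer.step(x.shape[0])
```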
Known issues:

- The reduce operation can cause a deadlock (this does not happen with NaiveEngine). It also runs into an invalid-address error in complex models such as BERT-Base.
- We do remove outputs from the backward graph via the backward options, but we still need to verify that this actually decreases memory consumption; one way to check is sketched below.
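A rough way to verify would be to run one training iteration with and without `backward_options` and compare the reported GPU memory usage (`gpu_memory_info` is available in MXNet 1.4+):

```python
import mxnet as mx

mx.nd.waitall()  # make sure all pending operations have finished
free, total = mx.context.gpu_memory_info(0)  # values are in bytes
print("GPU 0 memory in use: %.1f MiB" % ((total - free) / 2**20))
```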