Commit 970cfa3

author: Ralf Grubenmann (committed)
Updates tutorial to new mlbench core version
1 parent 18539bf commit 970cfa3

1 file changed: _posts/2018-11-20-pytorch-adaptation-tutorial.md (115 additions, 21 deletions)
@@ -84,53 +84,147 @@ At this point, the script could technically already run in MLBench. But so far i
 The PyTorch script reports loss to ``stdout``, but we can easily report the loss to MLBench as well. First we need to import the relevant MLBench functionality by adding the following line to the imports at the top of the file:
 
 {% highlight python %}
-from mlbench_core.api import ApiClient
+from mlbench_core.utils import Tracker
+from mlbench_core.evaluation.goals import task1_time_to_accuracy_goal
+from mlbench_core.evaluation.pytorch.metrics import TopKAccuracy
+from mlbench_core.controlflow.pytorch import validation_round
 {% endhighlight %}
 
-Then we can simply create an api client object and use it to report the loss. We instantiate the client as shown on lines 10 - 13 in this snippet and post the loss as shown on lines 32 - 35:
+Then we can simply create a ``Tracker`` object and use it to report the loss and add metrics (``TopKAccuracy``) to track. We add code to record the timing of different steps with ``tracker.record_batch_step()``.
+We have to tell the tracker that we're in the training loop by calling ``tracker.train()`` and that the epoch is done by calling ``tracker.epoch_end()``. The loss is recorded with ``tracker.record_loss()``.
 
 {% highlight python linenos %}
 def run(rank, size, run_id):
     """ Distributed Synchronous SGD Example """
     torch.manual_seed(1234)
     train_set, bsz = partition_dataset()
     model = Net()
-    model = model
-    # model = model.cuda(rank)
     optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
+    metrics = [
+        TopKAccuracy(topk=1),
+        TopKAccuracy(topk=5)
+    ]
+    loss_func = nn.NLLLoss()
 
-    api_client = ApiClient(
-        in_cluster=True,
-        k8s_namespace='default',
-        label_selector='component=master,app=mlbench')
+    tracker = Tracker(metrics, run_id, rank)
 
     num_batches = ceil(len(train_set.dataset) / float(bsz))
+
+    tracker.start()
+
     for epoch in range(10):
+        tracker.train()
+
         epoch_loss = 0.0
         for data, target in train_set:
-            data, target = Variable(data), Variable(target)
-            # data, target = Variable(data.cuda(rank)), Variable(target.cuda(rank))
+            tracker.batch_start()
+
             optimizer.zero_grad()
             output = model(data)
-            loss = F.nll_loss(output, target)
-            epoch_loss += loss.data[0]
+
+            tracker.record_batch_step('forward')
+
+            loss = loss_func(output, target)
+            epoch_loss += loss.data.item()
+
+            tracker.record_batch_step('loss')
+
             loss.backward()
+
+            tracker.record_batch_step('backward')
+
             average_gradients(model)
             optimizer.step()
-        print('Rank ',
-              dist.get_rank(), ', epoch ', epoch, ': ',
-              epoch_loss / num_batches)
-
-        api_client.post_metric(
-            run_id,
-            "Rank {} loss".format(rank),
-            epoch_loss / num_batches)
+
+            tracker.batch_end()
+
+        tracker.record_loss(epoch_loss, num_batches, log_to_api=True)
+
+        logging.debug('Rank %s, epoch %s: %s',
+                      dist.get_rank(), epoch,
+                      epoch_loss / num_batches)
+
+        tracker.epoch_end()
+
+        if tracker.goal_reached:
+            logging.debug("Goal Reached!")
+            return
 {% endhighlight %}
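The tracker lifecycle used in the diff above (start, then per epoch: train, batch_start / record_batch_step / batch_end per batch, record_loss, epoch_end) can be sketched without installing ``mlbench_core``. ``MinimalTracker`` below is a hypothetical stand-in written for illustration, not the real ``Tracker`` API.

```python
import time

class MinimalTracker:
    """Toy stand-in for mlbench_core's Tracker, illustrating the call
    order used in the tutorial. Not the real API."""

    def __init__(self, run_id, rank):
        self.run_id = run_id
        self.rank = rank
        self.step_timings = []  # (step name, elapsed seconds) per recorded step
        self.losses = []        # average loss per epoch

    def start(self):
        self._t0 = time.perf_counter()

    def train(self):
        self.mode = "train"

    def batch_start(self):
        self._batch_t = time.perf_counter()

    def record_batch_step(self, name):
        # Record time since the previous step (or batch start), then reset.
        now = time.perf_counter()
        self.step_timings.append((name, now - self._batch_t))
        self._batch_t = now

    def batch_end(self):
        pass

    def record_loss(self, loss_sum, num_batches):
        self.losses.append(loss_sum / num_batches)

    def epoch_end(self):
        pass

tracker = MinimalTracker(run_id=1, rank=0)
tracker.start()
tracker.train()
for _ in range(3):  # pretend we process 3 batches
    tracker.batch_start()
    tracker.record_batch_step('forward')
    tracker.record_batch_step('backward')
    tracker.batch_end()
tracker.record_loss(loss_sum=6.0, num_batches=3)
tracker.epoch_end()
print(tracker.losses)  # [2.0]
```

In the real benchmark the same calls are made around the actual forward/backward passes, so the Dashboard can break a batch down into timed phases.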
 
-Make sure to change ``default`` on line 12 to the namespace MLBench is running under in Kubernetes.
 
 That's it. Now the training will report the loss of each worker back to the Dashboard and show it in a nice Graph.
 
+For the official tasks, we also need to report validation stats to the tracker and use the official validation code. Rename the current ``partition_dataset()`` method to ``partition_dataset_train``
+and add a new partition method to load the validation set:
+
+{% highlight python linenos %}
+def partition_dataset_val():
+    """ Partitioning MNIST validation set"""
+    dataset = datasets.MNIST(
+        './data',
+        train=False,
+        download=True,
+        transform=transforms.Compose([
+            transforms.ToTensor(),
+            transforms.Normalize((0.1307, ), (0.3081, ))
+        ]))
+    size = dist.get_world_size()
+    bsz = int(128 / float(size))
+    partition_sizes = [1.0 / size for _ in range(size)]
+    partition = DataPartitioner(dataset, partition_sizes)
+    partition = partition.use(dist.get_rank())
+    val_set = torch.utils.data.DataLoader(
+        partition, batch_size=bsz, shuffle=True)
+    return val_set, bsz
+{% endhighlight %}
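The partitioning arithmetic in ``partition_dataset_val()`` is worth spelling out: each worker gets an equal fraction of the dataset, and the global batch size of 128 is divided by the number of workers. A small self-contained sketch (the ``partition_plan`` helper is hypothetical, introduced only for illustration):

```python
def partition_plan(total_batch_size, world_size):
    """Per-worker batch size and dataset shares, mirroring the
    arithmetic in partition_dataset_val()."""
    bsz = int(total_batch_size / float(world_size))
    partition_sizes = [1.0 / world_size for _ in range(world_size)]
    return bsz, partition_sizes

bsz, sizes = partition_plan(128, 4)
print(bsz)    # 32
print(sizes)  # [0.25, 0.25, 0.25, 0.25]
```

So with 4 workers, each processes a quarter of the validation set in batches of 32, keeping the effective global batch size at 128.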
+
+Then load the validation set and add the goal for the official task (the Task 1a goal is used for illustration purposes in this example):
+
+{% highlight python linenos %}
+def run(rank, size, run_id):
+    """ Distributed Synchronous SGD Example """
+    torch.manual_seed(1234)
+    train_set, bsz = partition_dataset_train()
+    val_set, bsz_val = partition_dataset_val()
+    model = Net()
+    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
+    metrics = [
+        TopKAccuracy(topk=1),
+        TopKAccuracy(topk=5)
+    ]
+    loss_func = nn.NLLLoss()
+
+    goal = task1_time_to_accuracy_goal
+
+    tracker = Tracker(metrics, run_id, rank, goal=goal)
+
+    num_batches = ceil(len(train_set.dataset) / float(bsz))
+    num_batches_val = ceil(len(val_set.dataset) / float(bsz_val))
+
+    tracker.start()
+{% endhighlight %}
+
+Now all that is needed is to add the validation loop code (``validation_round()``) to run validation in the ``run()`` function. We also check if the goal is reached and stop training if it is.
+``validation_round()`` evaluates the metrics on the validation set and reports the results to the Dashboard.
+
+{% highlight python linenos %}
+        tracker.record_loss(epoch_loss, num_batches, log_to_api=True)
+
+        logging.debug('Rank %s, epoch %s: %s',
+                      dist.get_rank(), epoch,
+                      epoch_loss / num_batches)
+
+        validation_round(val_set, model, loss_func, metrics, run_id, rank,
+                         'fp32', transform_target_type=None, use_cuda=False,
+                         max_batch_per_epoch=num_batches_val, tracker=tracker)
+
+        tracker.epoch_end()
+
+        if tracker.goal_reached:
+            logging.debug("Goal Reached!")
+            return
+{% endhighlight %}
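The control flow this adds is a time-to-accuracy loop: validate after every epoch and stop as soon as the goal is met, rather than always running a fixed number of epochs. A minimal sketch of that pattern (``run_until_goal`` and its inputs are hypothetical, standing in for the tracker's goal check):

```python
def run_until_goal(epoch_accuracies, target):
    """Validate after each epoch and stop once the accuracy goal is
    reached; returns the 1-based epoch that hit the target, or None."""
    for epoch, acc in enumerate(epoch_accuracies, start=1):
        if acc >= target:  # mirrors the `if tracker.goal_reached` check
            return epoch
    return None

# Accuracies improve each epoch; the goal of 0.80 is reached at epoch 3.
print(run_until_goal([0.61, 0.74, 0.82, 0.91], target=0.80))  # 3
```

Because the benchmark measures time to a target accuracy, stopping at the goal (instead of completing all 10 epochs) is what makes runs comparable across implementations.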
227+
134228
The full code (with some additional improvements) is in our [Github Repo](https://github.com/mlbench/mlbench-benchmarks/blob/master/examples/mlbench-pytorch-tutorial/)
135229

136230
## Creating a Docker Image for Kubernetes
