
Commit 9bb9a71 ("merge")

Merge commit with 2 parents: f1774fd + 7d9436b

5 files changed: 8 additions & 25 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ The MLCommons™ **AlgoPerf: Training Algorithms benchmark** is designed to find
 When training neural nets, practitioners face many critical yet often opaque decisions: What optimizer to choose? How should its learning rate be tuned? What learning rate schedule should be used? These choices can make or break training, yet the community has lacked a clear, standardized way to identify the state of the art.
 Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates the **training algorithm** itself, which includes the optimizer, regularization, data selection, and hyperparameters like the learning rate schedule. By standardizing the benchmark process, AlgoPerf offers a meaningful apples-to-apples comparison of training algorithms and follows the following **key principles**:
-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (8x NVIDIA V100 GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (4x A100 (40GB) GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
 - ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
 - 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](/docs/DOCUMENTATION.md#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance, using [**performance profiles**](/docs/DOCUMENTATION.md#benchmark-score-using-performance-profiles), across all workloads to ensure general-purpose algorithms.
 - 📦 **Fully-Specified Algorithms:** Submissions must be complete procedures and thus hyperparameter tuning is treated as part of the algorithm. Submissions can either provide a search space for automated tuning ([**External tuning ruleset**](/docs/DOCUMENTATION.md#external-tuning-ruleset)) or be hyperparameter-free ([**Self-tuning ruleset**](/docs/DOCUMENTATION.md#self-tuning-ruleset)) with any tuning done automatically and "on the clock". This measures an algorithm's _total_ practical cost and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
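The performance-profile scoring idea mentioned above can be sketched in a few lines. This is a toy illustration only, not the official AlgoPerf scoring code: the per-workload times, the tau range, and the normalization are invented for the example.

```python
import numpy as np

# Hypothetical per-workload training times (hours) for two submissions across
# four workloads; np.inf marks a workload where the target was never reached.
times = {
    'submission_a': np.array([1.0, 2.0, 4.0, np.inf]),
    'submission_b': np.array([1.5, 1.8, 3.0, 6.0]),
}

# Per-workload best time across all submissions.
best = np.min(np.vstack(list(times.values())), axis=0)

def performance_profile(t, best, taus):
    """Fraction of workloads solved within a factor tau of the best time."""
    ratios = t / best
    return np.array([(ratios <= tau).mean() for tau in taus])

taus = np.linspace(1.0, 4.0, 301)
for name, t in times.items():
    profile = performance_profile(t, best, taus)
    # The benchmark score is (up to normalization) the area under the profile.
    score = np.trapz(profile, taus) / (taus[-1] - taus[0])
    print(name, round(score, 3))
```

A submission that misses a target on some workload (the `np.inf` above) can never reach a profile value of 1, which is how the aggregation rewards general-purpose algorithms.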

algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py

Lines changed: 2 additions & 3 deletions
@@ -50,7 +50,7 @@ def __init__(
     rebuild_cache: bool = False,
     cache_build_timeout_minutes: int = 30,
   ):
-    self.root = os.path.expanduser(root)
+    self.root = os.path.abspath(root)
     self.transform = transform
     self.target_transform = target_transform
     self.loader = loader
@@ -223,7 +223,7 @@ def _build_dataset(
   dataset = ImageFolder(
     os.path.join(data_dir, folder),
     transform=transform_config,
-    # cache_file='.imagenet_cache_index.json',
+    cache_file='.imagenet_{}_cache_index.json'.format(split),
   )

   if split == 'eval_train':
@@ -248,7 +248,6 @@ def _build_dataset(
     sampler = data_utils.DistributedEvalSampler(
       dataset, num_replicas=N_GPUS, rank=RANK, shuffle=False
     )
-
   dataloader = torch.utils.data.DataLoader(
     dataset,
     batch_size=ds_iter_batch_size,
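Two details of this diff are easy to miss: `os.path.abspath` anchors a relative path to the current working directory but, unlike `os.path.expanduser`, does not expand a leading `~`; and the new `cache_file` argument gives each dataset split its own cache index instead of one shared file. A small standalone sketch (the split names other than `eval_train` are assumptions, not taken from the repo):

```python
import os

# expanduser only rewrites a leading '~'; abspath anchors relative paths to
# the current working directory. Note abspath does NOT expand '~'.
print(os.path.expanduser('~/data'))  # home-relative path expanded
print(os.path.abspath('data'))       # cwd-relative path made absolute

# The per-split cache filename pattern from the diff, for plausible splits:
for split in ('train', 'eval_train', 'validation'):
    print('.imagenet_{}_cache_index.json'.format(split))
```

Because `abspath` leaves `~` untouched, a caller passing `~/imagenet` would now get a path containing a literal `~` component unless they expand it themselves first.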

docs/DOCUMENTATION.md

Lines changed: 3 additions & 3 deletions
@@ -55,7 +55,7 @@ The **AlgoPerf: Training Algorithms benchmark** challenges participants to submi

 The benchmarking process follows these **key principles**:

-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](#benchmarking-hardware) (currently `8x NVIDIA V100 GPUs`). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](#benchmarking-hardware) (currently `4x NVIDIA A100 GPUs`). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
 - ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
 - 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance across all workloads, using [**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles), to ensure general-purpose algorithms.
 - 📦 **Fully-Specified Algorithms:** Submissions must be [**complete procedures**](#submission-api) and thus hyperparameter tuning is treated as part of the algorithm. Depending on the [**ruleset**](#tuning-rulesets), submissions may use parallel tuning resources. This ensures that the benchmark measures the _total_ practical cost of a training algorithm and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
@@ -542,7 +542,7 @@ All officially scored runs will be performed on the same benchmarking hardware t
 This benchmarking hardware is chosen to be easily accessible via common cloud computing providers and will likely change with each iteration of the benchmark.
 The specs of the benchmarking hardware for this iteration of the benchmark are:

-- 8× NVIDIA V100 (16 GB) GPUs
+- 4× NVIDIA A100 (40 GB) GPUs
 - 240 GB in RAM
 - 2 TB in storage (for datasets).
@@ -595,7 +595,7 @@ Furthermore, all submitters must sign the following agreements:
 <details>
 <summary><strong>My machine only has one GPU. How can I use this repo?</strong></summary>

-> You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes of our algorithms collection (e.g. `algorithms/`) are tuned for a machine with 8× NVIDIA V100 (16 GB) GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit' on the [**benchmarking hardware**](#benchmarking-hardware), so if you are using fewer GPUs with higher per-GPU memory, please monitor your memory usage to make sure it will fit on 8× NVIDIA V100 GPUs with 16 GB of VRAM per card.
+> You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes of our algorithms collection (e.g. `algorithms/`) are tuned for a machine with 4× NVIDIA A100 (40 GB) GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit' on the [**benchmarking hardware**](#benchmarking-hardware), so if you are using fewer GPUs with higher per-GPU memory, please monitor your memory usage to make sure it will fit on 4× NVIDIA A100 GPUs with 40 GB of VRAM per card.

 </details>
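The FAQ's advice to "reduce the batch sizes" on smaller machines can be turned into a rough rule of thumb. The helper below is hypothetical (not part of the AlgoPerf repo) and assumes memory use scales roughly linearly with batch size, which is only approximately true; treat the result as a first guess before monitoring actual GPU memory.

```python
import math

# Total GPU memory of the benchmarking hardware: 4x NVIDIA A100 with 40 GB each.
REFERENCE_TOTAL_GPU_MEM_GB = 4 * 40

def scaled_batch_size(reference_batch_size, n_gpus, mem_per_gpu_gb):
    """Shrink a batch size tuned for the benchmarking hardware in proportion
    to the local machine's total GPU memory, rounding down to a power of two
    (a common convention for batch sizes)."""
    local_total_gb = n_gpus * mem_per_gpu_gb
    scale = min(1.0, local_total_gb / REFERENCE_TOTAL_GPU_MEM_GB)
    return 2 ** int(math.log2(reference_batch_size * scale))

# A single A100 (40 GB) has a quarter of the reference memory:
print(scaled_batch_size(1024, 1, 40))  # -> 256
```

Activation memory, optimizer state, and fixed model weights all scale differently, so in practice you may need to go a step or two lower than this estimate.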

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -111,8 +111,8 @@ jax_gpu = [
 ]

 pytorch_cpu = [
-  "torch==2.5.1",
-  "torchvision==0.20.1"
+  "torch==2.9.0",
+  "torchvision==0.24.0"
 ]
 pytorch_gpu = [
   "torch==2.9.0",

submission_runner.py

Lines changed: 0 additions & 16 deletions
@@ -353,7 +353,6 @@ def train_once(
       log_dir, flags.FLAGS, hyperparameters
     )
     workload.attach_metrics_logger(metrics_logger)
-  step_10_end_time = None
   global_start_time = get_time()
   train_state['last_step_end_time'] = global_start_time

@@ -410,21 +409,6 @@ def train_once(
       train_state['training_complete'] = True

     train_step_end_time = get_time()
-    if global_step == 11:
-      step_10_end_time = train_step_end_time
-
-    # Log step time every 100 steps
-    if (global_step - 1) % 100 == 0 and workload.metrics_logger is not None:
-      if step_10_end_time is not None and global_step > 11:
-        elapsed_time_ms = (train_step_end_time - step_10_end_time) * 1000.0
-        elapsed_steps = global_step - 11
-        avg_step_time_ms = elapsed_time_ms / elapsed_steps
-      else:
-        avg_step_time_ms = 0.0
-      workload.metrics_logger.append_scalar_metrics(
-        {'step_time_ms': avg_step_time_ms},
-        global_step - 1,
-      )

    train_state['accumulated_submission_time'] += (
      train_step_end_time - train_state['last_step_end_time']
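The block removed here averaged step time starting from step 11, so the first ten steps (which typically absorb compilation and data-pipeline warm-up) did not skew the average. A standalone sketch of that logic as a pure function, with hypothetical names; the original tracked timestamps inline in the training loop rather than in a list:

```python
def average_step_time_ms(step_end_times, warmup_steps=10):
    """Average per-step wall-clock time in milliseconds, ignoring the first
    `warmup_steps` steps. `step_end_times` holds the end timestamp (seconds)
    of each step, starting with step 1."""
    if len(step_end_times) <= warmup_steps + 1:
        return 0.0  # not enough post-warmup steps to average over
    # Measure from the end of step `warmup_steps + 1` to the latest step,
    # mirroring the removed code's baseline at global_step == 11.
    elapsed_s = step_end_times[-1] - step_end_times[warmup_steps]
    elapsed_steps = len(step_end_times) - 1 - warmup_steps
    return elapsed_s * 1000.0 / elapsed_steps
```

With perfectly uniform 10 ms steps this returns exactly 10.0; with a slow compiled first step it returns the steady-state rate, which is what makes the warm-up skip worthwhile.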
