
Commit 9bb9a71 ("merge")

Merge commit with 2 parents: f1774fd + 7d9436b

5 files changed: 8 additions & 25 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ The MLCommons™ **AlgoPerf: Training Algorithms benchmark** is designed to find
 When training neural nets, practitioners face many critical yet often opaque decisions: What optimizer to choose? How should its learning rate be tuned? What learning rate schedule should be used? These choices can make or break training, yet the community has lacked a clear, standardized way to identify the state of the art.
 Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates the **training algorithm** itself, which includes the optimizer, regularization, data selection, and hyperparameters like the learning rate schedule. By standardizing the benchmark process, AlgoPerf offers a meaningful apples-to-apples comparison of training algorithms and follows the following **key principles**:
-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (8x NVIDIA V100 GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (4x A100 (40GB) GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
 - ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
 - 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](/docs/DOCUMENTATION.md#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance, using [**performance profiles**](/docs/DOCUMENTATION.md#benchmark-score-using-performance-profiles), across all workloads to ensure general-purpose algorithms.
 - 📦 **Fully-Specified Algorithms:** Submissions must be complete procedures and thus hyperparameter tuning is treated as part of the algorithm. Submissions can either provide a search space for automated tuning ([**External tuning ruleset**](/docs/DOCUMENTATION.md#external-tuning-ruleset)) or be hyperparameter-free ([**Self-tuning ruleset**](/docs/DOCUMENTATION.md#self-tuning-ruleset)) with any tuning done automatically and "on the clock". This measures an algorithm's _total_ practical cost and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
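The performance-profile scoring idea mentioned above can be sketched in a few lines. This is a toy illustration only, not the official AlgoPerf scoring code: the per-workload times, the tau range, and the normalization are invented for the example.

```python
import numpy as np

# Hypothetical per-workload training times (hours) for two submissions across
# four workloads; np.inf marks a workload where the target was never reached.
times = {
    'submission_a': np.array([1.0, 2.0, 4.0, np.inf]),
    'submission_b': np.array([1.5, 1.8, 3.0, 6.0]),
}

# Per-workload best time across all submissions.
best = np.min(np.vstack(list(times.values())), axis=0)

def performance_profile(t, best, taus):
    """Fraction of workloads solved within a factor tau of the best time."""
    ratios = t / best
    return np.array([(ratios <= tau).mean() for tau in taus])

taus = np.linspace(1.0, 4.0, 301)
for name, t in times.items():
    profile = performance_profile(t, best, taus)
    # The benchmark score is (up to normalization) the area under the profile.
    score = np.trapz(profile, taus) / (taus[-1] - taus[0])
    print(name, round(score, 3))
```

A submission that misses a target on some workload (the `np.inf` above) can never reach a profile value of 1, which is how the aggregation rewards general-purpose algorithms.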

algoperf/workloads/imagenet_resnet/imagenet_pytorch/workload.py

Lines changed: 2 additions & 3 deletions
@@ -50,7 +50,7 @@ def __init__(
     rebuild_cache: bool = False,
     cache_build_timeout_minutes: int = 30,
   ):
-    self.root = os.path.expanduser(root)
+    self.root = os.path.abspath(root)
     self.transform = transform
     self.target_transform = target_transform
     self.loader = loader
@@ -223,7 +223,7 @@ def _build_dataset(
   dataset = ImageFolder(
     os.path.join(data_dir, folder),
     transform=transform_config,
-    # cache_file='.imagenet_cache_index.json',
+    cache_file='.imagenet_{}_cache_index.json'.format(split),
   )

   if split == 'eval_train':
@@ -248,7 +248,6 @@ def _build_dataset(
     sampler = data_utils.DistributedEvalSampler(
       dataset, num_replicas=N_GPUS, rank=RANK, shuffle=False
     )
-
   dataloader = torch.utils.data.DataLoader(
     dataset,
     batch_size=ds_iter_batch_size,
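Two details of this diff are easy to miss: `os.path.abspath` anchors a relative path to the current working directory but, unlike `os.path.expanduser`, does not expand a leading `~`; and the new `cache_file` argument gives each dataset split its own cache index instead of one shared file. A small standalone sketch (the split names other than `eval_train` are assumptions, not taken from the repo):

```python
import os

# expanduser only rewrites a leading '~'; abspath anchors relative paths to
# the current working directory. Note abspath does NOT expand '~'.
print(os.path.expanduser('~/data'))  # home-relative path expanded
print(os.path.abspath('data'))       # cwd-relative path made absolute

# The per-split cache filename pattern from the diff, for plausible splits:
for split in ('train', 'eval_train', 'validation'):
    print('.imagenet_{}_cache_index.json'.format(split))
```

Because `abspath` leaves `~` untouched, a caller passing `~/imagenet` would now get a path containing a literal `~` component unless they expand it themselves first.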

docs/DOCUMENTATION.md

Lines changed: 3 additions & 3 deletions
@@ -55,7 +55,7 @@ The **AlgoPerf: Training Algorithms benchmark** challenges participants to submi

 The benchmarking process follows these **key principles**:

-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](#benchmarking-hardware) (currently `8x NVIDIA V100 GPUs`). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](#benchmarking-hardware) (currently `4x NVIDIA A100 GPUs`). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
 - ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
 - 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance across all workloads, using [**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles), to ensure general-purpose algorithms.
 - 📦 **Fully-Specified Algorithms:** Submissions must be [**complete procedures**](#submission-api) and thus hyperparameter tuning is treated as part of the algorithm. Depending on the [**ruleset**](#tuning-rulesets), submissions may use parallel tuning resources. This ensures that the benchmark measures the _total_ practical cost of a training algorithm and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
@@ -542,7 +542,7 @@ All officially scored runs will be performed on the same benchmarking hardware t
 This benchmarking hardware is chosen to be easily accessible via common cloud computing providers and will likely change with each iteration of the benchmark.
 The specs of the benchmarking hardware for this iteration of the benchmark are:

-- 8× NVIDIA V100 (16 GB) GPUs
+- 4× NVIDIA A100 (40 GB) GPUs
 - 240 GB in RAM
 - 2 TB in storage (for datasets).
@@ -595,7 +595,7 @@ Furthermore, all submitters must sign the following agreements:
 <details>
 <summary><strong>My machine only has one GPU. How can I use this repo?</strong></summary>

-> You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes of our algorithms collection (e.g. `algorithms/`) are tuned for a machine with 8× NVIDIA V100 (16 GB) GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit' on the [**benchmarking hardware**](#benchmarking-hardware), so if you are using fewer GPUs with higher per-GPU memory, please monitor your memory usage to make sure it will fit on 8× NVIDIA V100 GPUs with 16 GB of VRAM per card.
+> You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes of our algorithms collection (e.g. `algorithms/`) are tuned for a machine with 4× NVIDIA A100 (40 GB) GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit' on the [**benchmarking hardware**](#benchmarking-hardware), so if you are using fewer GPUs with higher per-GPU memory, please monitor your memory usage to make sure it will fit on 4× NVIDIA A100 GPUs with 40 GB of VRAM per card.

 </details>
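The FAQ's advice to "reduce the batch sizes" on smaller machines can be turned into a rough rule of thumb. The helper below is hypothetical (not part of the AlgoPerf repo) and assumes memory use scales roughly linearly with batch size, which is only approximately true; treat the result as a first guess before monitoring actual GPU memory.

```python
import math

# Total GPU memory of the benchmarking hardware: 4x NVIDIA A100 with 40 GB each.
REFERENCE_TOTAL_GPU_MEM_GB = 4 * 40

def scaled_batch_size(reference_batch_size, n_gpus, mem_per_gpu_gb):
    """Shrink a batch size tuned for the benchmarking hardware in proportion
    to the local machine's total GPU memory, rounding down to a power of two
    (a common convention for batch sizes)."""
    local_total_gb = n_gpus * mem_per_gpu_gb
    scale = min(1.0, local_total_gb / REFERENCE_TOTAL_GPU_MEM_GB)
    return 2 ** int(math.log2(reference_batch_size * scale))

# A single A100 (40 GB) has a quarter of the reference memory:
print(scaled_batch_size(1024, 1, 40))  # -> 256
```

Activation memory, optimizer state, and fixed model weights all scale differently, so in practice you may need to go a step or two lower than this estimate.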

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -111,8 +111,8 @@ jax_gpu = [
 ]

 pytorch_cpu = [
-  "torch==2.5.1",
-  "torchvision==0.20.1"
+  "torch==2.9.0",
+  "torchvision==0.24.0"
 ]
 pytorch_gpu = [
   "torch==2.9.0",

submission_runner.py

Lines changed: 0 additions & 16 deletions
@@ -353,7 +353,6 @@ def train_once(
       log_dir, flags.FLAGS, hyperparameters
     )
     workload.attach_metrics_logger(metrics_logger)
-  step_10_end_time = None
   global_start_time = get_time()
   train_state['last_step_end_time'] = global_start_time

@@ -410,21 +409,6 @@ def train_once(
       train_state['training_complete'] = True

     train_step_end_time = get_time()
-    if global_step == 11:
-      step_10_end_time = train_step_end_time
-
-    # Log step time every 100 steps
-    if (global_step - 1) % 100 == 0 and workload.metrics_logger is not None:
-      if step_10_end_time is not None and global_step > 11:
-        elapsed_time_ms = (train_step_end_time - step_10_end_time) * 1000.0
-        elapsed_steps = global_step - 11
-        avg_step_time_ms = elapsed_time_ms / elapsed_steps
-      else:
-        avg_step_time_ms = 0.0
-      workload.metrics_logger.append_scalar_metrics(
-        {'step_time_ms': avg_step_time_ms},
-        global_step - 1,
-      )

    train_state['accumulated_submission_time'] += (
      train_step_end_time - train_state['last_step_end_time']
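The block removed here averaged step time starting from step 11, so the first ten steps (which typically absorb compilation and data-pipeline warm-up) did not skew the average. A standalone sketch of that logic as a pure function, with hypothetical names; the original tracked timestamps inline in the training loop rather than in a list:

```python
def average_step_time_ms(step_end_times, warmup_steps=10):
    """Average per-step wall-clock time in milliseconds, ignoring the first
    `warmup_steps` steps. `step_end_times` holds the end timestamp (seconds)
    of each step, starting with step 1."""
    if len(step_end_times) <= warmup_steps + 1:
        return 0.0  # not enough post-warmup steps to average over
    # Measure from the end of step `warmup_steps + 1` to the latest step,
    # mirroring the removed code's baseline at global_step == 11.
    elapsed_s = step_end_times[-1] - step_end_times[warmup_steps]
    elapsed_steps = len(step_end_times) - 1 - warmup_steps
    return elapsed_s * 1000.0 / elapsed_steps
```

With perfectly uniform 10 ms steps this returns exactly 10.0; with a slow compiled first step it returns the steady-state rate, which is what makes the warm-up skip worthwhile.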
