
Commit 75920c0

docs: adjust number of workloads, datasets, rename datasets to dataset in readme
1 parent 7792041 commit 75920c0

4 files changed: 9 additions & 7 deletions

.dockerignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,3 +1,3 @@
 *
-!datasets/
+!dataset/
 !docker/
```
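For context, the rename above relies on `.dockerignore`'s allowlist pattern: a bare `*` excludes everything from the build context, and each `!` line re-includes a path, with later lines overriding earlier ones. A minimal sketch of the file after this commit (other entries in the real repository may differ):

```
*
!dataset/
!docker/
```

Order matters here: the `!` exceptions must come after the `*` line, or they would be overridden by it.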

dataset/README.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -31,14 +31,15 @@ python3 dataset/dataset_setup.py \
 --<optional_flags>
 ```
 
-The complete benchmark uses 6 different datasets:
+The complete benchmark uses 7 different datasets:
 
 - [OGBG](#ogbg)
 - [WMT](#wmt)
 - [FastMRI](#fastmri)
 - [Imagenet](#imagenet)
 - [Criteo 1TB](#criteo1tb)
 - [Librispeech](#librispeech)
+- [Fineweb-edu 10B](#fineweb-edu-10b)
 
 Some dataset setups will require you to sign a third-party agreement with the dataset owners in order to get the download URLs.
 
@@ -456,7 +457,8 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
 ```
 
 ### Fineweb-EDU 10B
-From `algorithmic-efficiency` run:
+
+The preprocessing script will download and tokenize a 10 billion token sample of FinewebEdu from Huggingface. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/validation split in `data_dir/fineweb_edu_10B`.
 
 ```bash
 python3 dataset/dataset_setup.py \
````
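The train/validation split described in the added README paragraph can be illustrated with a stdlib-only Python sketch. The function name and split ratio here are hypothetical, for illustration only; the real `dataset_setup.py` flags and split logic are not shown in this diff:

```python
def split_tokens(tokens, valid_fraction=0.01):
    """Split a flat token sequence into train/validation parts.

    valid_fraction is a hypothetical ratio; the actual script's
    split is not visible in this commit.
    """
    n_valid = max(1, int(len(tokens) * valid_fraction))
    # Hold out the tail of the sequence for validation.
    return tokens[:-n_valid], tokens[-n_valid:]


# Example: 1,000 tokens -> 990 train, 10 validation.
train, valid = split_tokens(list(range(1000)))
print(len(train), len(valid))  # 990 10
```

A real setup would write the two partitions to `data_dir/fineweb_edu_10B` as described in the README change.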

docs/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
 We also have regression tests available in
 [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
 that can be run semi-automatically. The regression tests are shorter end-to-end
-submissions run in a containerized environment across all 8 workloads, in both
+submissions run in a containerized environment across all 9 workloads, in both
 the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
 and are triggered for pull requests that target the main branch. Typically these
 PRs will be from the `dev` branch so the tests will run containers based on
```

docs/GETTING_STARTED.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the
 
 ## Download the Data
 
-The workloads in this benchmark use 6 different datasets across 8 workloads. You
+The workloads in this benchmark use 6 different datasets across 9 workloads. You
 may choose to download some or all of the datasets as you are developing your
-submission, but your submission will be scored across all 8 workloads. For
+submission, but your submission will be scored across all 9 workloads. For
 instructions on obtaining and setting up the datasets see
-[datasets/README](/datasets/README.md#dataset-setup).
+[dataset/README](/dataset/README.md#dataset-setup).
 
 ## Develop your Submission
 
```
