
Commit 75920c0

docs: adjust number of workloads, datasets, rename datasets to dataset in readme
1 parent 7792041 commit 75920c0

4 files changed: 9 additions & 7 deletions

.dockerignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,3 +1,3 @@
 *
-!datasets/
+!dataset/
 !docker/
```
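For context, the rename above relies on `.dockerignore`'s allowlist pattern: a bare `*` excludes everything from the build context, and each `!` line re-includes a path, with later lines overriding earlier ones. A minimal sketch of the file after this commit (other entries in the real repository may differ):

```
*
!dataset/
!docker/
```

Order matters here: the `!` exceptions must come after the `*` line, or they would be overridden by it.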

dataset/README.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -31,14 +31,15 @@ python3 dataset/dataset_setup.py \
 --<optional_flags>
 ```
 
-The complete benchmark uses 6 different datasets:
+The complete benchmark uses 7 different datasets:
 
 - [OGBG](#ogbg)
 - [WMT](#wmt)
 - [FastMRI](#fastmri)
 - [Imagenet](#imagenet)
 - [Criteo 1TB](#criteo1tb)
 - [Librispeech](#librispeech)
+- [Fineweb-edu 10B](#fineweb-edu-10b)
 
 Some dataset setups will require you to sign a third-party agreement with the dataset owners in order to get the download URLs.
 
@@ -456,7 +457,8 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
 ```
 
 ### Fineweb-EDU 10B
-From `algorithmic-efficiency` run:
+
+The preprocessing script will download and tokenize a 10 billion token sample of FinewebEdu from Huggingface. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/validation split in `data_dir/fineweb_edu_10B`.
 
 ```bash
 python3 dataset/dataset_setup.py \
````
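The train/validation split described in the added README paragraph can be illustrated with a stdlib-only Python sketch. The function name and split ratio here are hypothetical, for illustration only; the real `dataset_setup.py` flags and split logic are not shown in this diff:

```python
def split_tokens(tokens, valid_fraction=0.01):
    """Split a flat token sequence into train/validation parts.

    valid_fraction is a hypothetical ratio; the actual script's
    split is not visible in this commit.
    """
    n_valid = max(1, int(len(tokens) * valid_fraction))
    # Hold out the tail of the sequence for validation.
    return tokens[:-n_valid], tokens[-n_valid:]


# Example: 1,000 tokens -> 990 train, 10 validation.
train, valid = split_tokens(list(range(1000)))
print(len(train), len(valid))  # 990 10
```

A real setup would write the two partitions to `data_dir/fineweb_edu_10B` as described in the README change.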

docs/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
 We also have regression tests available in
 [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
 that can be run semi-automatically. The regression tests are shorter end-to-end
-submissions run in a containerized environment across all 8 workloads, in both
+submissions run in a containerized environment across all 9 workloads, in both
 the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
 and are triggered for pull requests that target the main branch. Typically these
 PRs will be from the `dev` branch so the tests will run containers based on
```

docs/GETTING_STARTED.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the
 
 ## Download the Data
 
-The workloads in this benchmark use 6 different datasets across 8 workloads. You
+The workloads in this benchmark use 6 different datasets across 9 workloads. You
 may choose to download some or all of the datasets as you are developing your
-submission, but your submission will be scored across all 8 workloads. For
+submission, but your submission will be scored across all 9 workloads. For
 instructions on obtaining and setting up the datasets see
-[datasets/README](/datasets/README.md#dataset-setup).
+[dataset/README](/dataset/README.md#dataset-setup).
 
 ## Develop your Submission
 
```
