Commit ecbf5cc

docs: add details about fwedu files
1 parent 75920c0 commit ecbf5cc

1 file changed

Lines changed: 39 additions & 1 deletion

dataset/README.md

@@ -465,4 +465,42 @@ python3 dataset/dataset_setup.py \
 --data_dir $DATA_DIR \
 --temp_dir $DATA_DIR/tmp \
 --fineweb_edu
-```
+```
+
+<details>
+<summary>The final directory structure should look like this:</summary>
+
+```bash
+$DATA_DIR
+├── fineweb_edu_10B
+│   ├── fwedu_10B_tokenized
+│   │   ├── data-00000-of-00080.arrow
+│   │   ├── data-00001-of-00080.arrow
+│   │   ├── data-00002-of-00080.arrow
+│   │   ├── [...]
+│   │   ├── data-00078-of-00080.arrow
+│   │   ├── data-00079-of-00080.arrow
+│   │   ├── dataset_info.json
+│   │   └── state.json
+│   ├── train
+│   │   ├── 11814516993635243069
+│   │   │   └── 00000000.shard
+│   │   │       └── 00000000.snapshot
+│   │   ├── 1309159339089188891
+│   │   ├── 13196585434617636667
+│   │   ├── 13328239765396585889
+│   │   ├── 13443989554399185472
+│   │   ├── 17062004665044410656
+│   │   ├── 832373293846386485
+│   │   ├── 9244072261762587327
+│   │   ├── dataset_spec.pb
+│   │   └── snapshot.metadata
+│   └── val
+│       ├── 8122001362029945413
+│       │   └── 00000000.shard
+│       │       └── 00000000.snapshot
+│       ├── dataset_spec.pb
+│       └── snapshot.metadata
+```
+In total, it should contain 88 files (via `find -type f | wc -l`) for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
+</details>
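
Review note: the `find` / `du` check described in the added README text could also be scripted. The following is only a minimal sketch, not part of this commit. The helper name `summarize_tree` is hypothetical, and it sums regular-file sizes only, so its byte total may differ slightly from `du --apparent-size`, which also counts directory entries.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for regular files under root.

    Hypothetical helper: roughly mirrors `find -type f | wc -l` and the
    apparent (byte-length) size, skipping symlinks like `find -type f`.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # `find -type f` does not count symlinks
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, total_bytes


# Example usage (assumes DATA_DIR is set as in the README):
#   count, size = summarize_tree(
#       os.path.join(os.environ["DATA_DIR"], "fineweb_edu_10B"))
#   print(count, "files,", round(size / 1e9, 1), "GB")
```

Comparing the returned count against the number stated in the README gives a quick sanity check that the dataset setup finished without missing shards.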
