@@ -465,4 +465,42 @@ python3 dataset/dataset_setup.py \
465 465 --data_dir $DATA_DIR \
466 466 --temp_dir $DATA_DIR/tmp \
467 467 --fineweb_edu
468- ```
468+ ```
469+
470+ <details>
471+ <summary>The final directory structure should look like this:</summary>
472+
473+ ```bash
474+ $DATA_DIR
475+ ├── fineweb_edu_10B
476+ │   ├── fwedu_10B_tokenized
477+ │   │   ├── data-00000-of-00080.arrow
478+ │   │   ├── data-00001-of-00080.arrow
479+ │   │   ├── data-00002-of-00080.arrow
480+ │   │   ├── [...]
481+ │   │   ├── data-00078-of-00080.arrow
482+ │   │   ├── data-00079-of-00080.arrow
483+ │   │   ├── dataset_info.json
484+ │   │   └── state.json
485+ │   ├── train
486+ │   │   ├── 11814516993635243069
487+ │   │   │   └── 00000000.shard
488+ │   │   │       └── 00000000.snapshot
489+ │   │   ├── 1309159339089188891
490+ │   │   ├── 13196585434617636667
491+ │   │   ├── 13328239765396585889
492+ │   │   ├── 13443989554399185472
493+ │   │   ├── 17062004665044410656
494+ │   │   ├── 832373293846386485
495+ │   │   ├── 9244072261762587327
496+ │   │   ├── dataset_spec.pb
497+ │   │   └── snapshot.metadata
498+ │   └── val
499+ │       ├── 8122001362029945413
500+ │       │   └── 00000000.shard
501+ │       │       └── 00000000.snapshot
502+ │       ├── dataset_spec.pb
503+ │       └── snapshot.metadata
504+ ```
505+ In total, it should contain 88 files (via `find -type f | wc -l`) for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
506+ </details>
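
The file-count and size check described in the added README text could also be scripted. The following is a minimal sketch, not part of this change: `summarize_tree` is a hypothetical helper name, and it assumes `DATA_DIR` is exported in the environment as in the setup command. It counts regular files (skipping symlinks, like `find -type f`) and sums their byte sizes. The sum will be slightly lower than what `du --apparent-size` reports, since `du` also counts directory entries, and `du -h` prints sizes in powers of 1024.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for regular files under root.

    Hypothetical helper for illustration; symlinks are skipped so the
    count matches what `find -type f` would report.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            file_count += 1
            total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # Assumption: DATA_DIR is set in the environment, as in the setup command.
    data_dir = os.environ.get("DATA_DIR")
    if data_dir:
        count, size = summarize_tree(os.path.join(data_dir, "fineweb_edu_10B"))
        # The README text expects 88 files and roughly 112G as reported by du.
        print(f"{count} files, {size / 1024**3:.1f} GiB")
```

If the printed file count differs from 88, the download or tokenization step likely did not finish and may need to be rerun.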