Croissant Baker

Requires Python 3.10+. License: MIT.

Automatically generate Croissant JSON-LD metadata for ML datasets — e.g. for PhysioNet, NeurIPS Datasets & Benchmarks submissions, or any platform that benefits from standardized dataset metadata.

📖 Documentation


Installation

pip install croissant-baker

or with uv:

uv add croissant-baker

Quick start

croissant-baker \
  --input /path/to/dataset \
  --creator "Your Name,you@example.com" \
  --description "My ML dataset" \
  --license "CC-BY-4.0" \
  --output my-dataset-croissant.jsonld

Or try it with the bundled MIMIC-IV Demo test data, then validate the generated file:

git clone https://github.com/MIT-LCP/croissant-baker.git && cd croissant-baker
uv sync --group dev
croissant-baker \
  --input tests/data/input/mimiciv_demo/physionet.org/files/mimic-iv-demo/ \
  --creator "Alistair Johnson,aewj@mit.edu,https://physionet.org/" \
  --creator "Tom Pollard,tpollard@mit.edu,https://physionet.org/" \
  --name "MIMIC-IV Clinical Database Demo" \
  --description "Demo subset of MIMIC-IV containing 100 de-identified patients from Beth Israel Deaconess Medical Center" \
  --url "https://physionet.org/content/mimic-iv-demo/2.2/" \
  --license "https://opendatacommons.org/licenses/odbl/1-0/" \
  --rai-data-biases "Single-site cohort from a US academic medical centre" \
  --rai-data-limitations "Demo subset limited to 100 patients" \
  --output mimic-iv-demo-croissant.jsonld
croissant-baker validate mimic-iv-demo-croissant.jsonld
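
Once a file has been generated, it can be handy to take a quick look at what it contains. The following is a minimal sketch using only the Python standard library, not part of croissant-baker itself. The helper name `summarize_croissant` is hypothetical, and the sketch assumes the output uses the top-level `name`, `license`, `distribution`, and `recordSet` keys from the Croissant format.

```python
import json
from pathlib import Path


def summarize_croissant(path):
    """Return a small summary of a Croissant JSON-LD file.

    Hypothetical helper for illustration; it only reads top-level keys
    and does not validate the file against the Croissant spec.
    """
    doc = json.loads(Path(path).read_text(encoding="utf-8"))

    def as_list(value):
        # JSON-LD allows a single object where a list is expected.
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    return {
        "name": doc.get("name"),
        "license": doc.get("license"),
        "files": len(as_list(doc.get("distribution"))),
        "record_sets": len(as_list(doc.get("recordSet"))),
    }
```

Calling `summarize_croissant("mimic-iv-demo-croissant.jsonld")` after the commands above would return a dictionary with the dataset name, license, and the number of distribution entries and record sets. For spec-level checks, `croissant-baker validate` remains the right tool.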

Supported formats

| Format | Extensions | Notes |
| --- | --- | --- |
| CSV / TSV | .csv, .tsv + .gz, .bz2, .xz | Streaming with automatic type inference |
| Parquet | .parquet | Partitioned datasets supported |
| FHIR | .ndjson, .ndjson.gz, .json (Bundle) | NDJSON bulk export and JSON Bundle |
| JSON / JSONL | .json, .jsonl + .gz | Arrays, single objects, and JSON Lines |
| WFDB | .hea + .dat / .atr | PhysioNet waveform data |
| Images | .png, .jpg, .tiff, .bmp, .gif, .webp | Dimensions and format via Pillow |
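
To give a feel for what "streaming with automatic type inference" can mean for CSV / TSV files, here is a minimal sketch under stated assumptions. It is not croissant-baker's implementation: the function name `infer_column_types`, the three type labels, and the widening rule (int, then float, then str) are all hypothetical choices made for illustration.

```python
import csv
import io


def _classify(value):
    """Classify one non-empty cell as 'int', 'float', or 'str'."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"


# Widening order: a column ends up with the widest type seen in any row.
_RANK = {"int": 0, "float": 1, "str": 2}


def infer_column_types(lines, delimiter=","):
    """Infer one type per column while reading rows one at a time.

    `lines` is any iterable of text lines, so the whole file never has
    to be held in memory.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    types = {}
    for row in reader:
        for column, value in row.items():
            if column is None or value is None or value == "":
                continue  # ragged rows and empty cells carry no type information
            seen = _classify(value)
            current = types.get(column)
            if current is None or _RANK[seen] > _RANK[current]:
                types[column] = seen
    return types


sample = io.StringIO("id,weight,label\n1,2.5,a\n2,3,b\n3,,c\n")
print(infer_column_types(sample))
```

Because the function accepts any iterable of lines, a compressed file could be streamed the same way by passing a handle from `gzip.open(path, "rt")`.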

Key features

  • Automatic type inference for all supported formats
  • RAI metadata via --rai-* CLI flags or --rai-config rai.yaml
  • Validation against the Croissant spec via mlcroissant
  • Dry-run mode, include/exclude glob filters, multiple creators
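
As an illustration of how include/exclude glob filters typically combine, the sketch below keeps a path only if it matches at least one include pattern and no exclude pattern. This ordering is an assumption, not a description of croissant-baker's behaviour, and `select_files` is a hypothetical name; the actual flags and matching rules are covered in the CLI reference.

```python
from fnmatch import fnmatch


def select_files(paths, include=("*",), exclude=()):
    """Keep paths that match any include pattern and no exclude pattern.

    Hypothetical helper: uses fnmatch-style globs, where `*` also
    matches path separators.
    """
    selected = []
    for path in paths:
        if not any(fnmatch(path, pattern) for pattern in include):
            continue
        if any(fnmatch(path, pattern) for pattern in exclude):
            continue
        selected.append(path)
    return selected


paths = [
    "hosp/admissions.csv.gz",
    "icu/icustays.csv.gz",
    "README.md",
]
print(select_files(paths, include=["*.csv.gz"], exclude=["icu/*"]))
```

Giving exclude patterns the final say is a common design choice because it lets a broad include such as `*.csv.gz` be narrowed without rewriting it.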

See the documentation for the full CLI reference, examples, and RAI configuration.

Contributing

See CONTRIBUTING.md for guidelines and DEVELOPMENT.md for setup, testing, releases, and how to add new file handlers.

License

MIT License. See the LICENSE file.
