data-generator-tool is a small Rust CLI for generating fake tabular data from a JSON schema.
It can also read existing CSV or Parquet data, then write it back out as a single file or a partitioned dataset.
This project was built to make it easy to create realistic-looking datasets for demos, local development, and data-processing experiments without depending on production data.
- Generates a `polars::DataFrame` from a schema file.
- Writes output as CSV or Parquet.
- Supports single-file output and partitioned output.
- Can read back existing CSV/Parquet input for a simple round trip.
- Uses deterministic per-row seeding for most generators, so repeated runs are stable when the seed is fixed.
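
One way to get this kind of stability is to derive each cell's seed from the schema's `main_seed`, the column, and the row number, instead of drawing from one shared random stream. The sketch below is std-only and purely illustrative: `mix`, `row_seed`, and `gen_bool` are hypothetical names, and the SplitMix64 mixer is a stand-in, not necessarily what the tool uses internally.

```rust
/// SplitMix64 step: a small, well-known 64-bit mixer.
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: derive the seed for one cell from the schema seed,
/// the column index, and the row number.
fn row_seed(main_seed: u64, column: u64, row: u64) -> u64 {
    mix(mix(main_seed ^ mix(column)) ^ row)
}

/// Hypothetical generator: true with probability of roughly `ratio`.
fn gen_bool(seed: u64, ratio: f64) -> bool {
    // Use the top 53 bits to build a float in [0, 1).
    let unit = (mix(seed) >> 11) as f64 / (1u64 << 53) as f64;
    unit < ratio
}
```

Because each value depends only on `(main_seed, column, row)`, rows can be generated in any order, or on several threads, and still come out the same on every run.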
Generate 1,000 rows of CSV:

    cargo run -- -s schema.json -r 1000 -f csv -o out.csv

Read it back:

    cargo run -- -i out.csv -f csv

Generate partitioned Parquet output into a directory:

    mkdir out
    cargo run -- -s schema.json -r 1000 -f parquet -o out

The main flags can also be supplied through environment variables.
| Flag | Env var | Description | Default |
|---|---|---|---|
| `-s, --schema <SCHEMA_FILE>` | `DATAGEN_SCHEMA_FILE` | JSON schema file | schema.json |
| `-r, --rows <NUM_ROWS>` | `DATAGEN_NUM_ROWS` | Number of rows to generate | 10000 |
| `-t, --threads <NO_THREADS>` | `DATAGEN_NUM_THREADS` | Rayon worker threads | 1 |
| `-o, --output <OUTPUT_PATH>` | `DATAGEN_OUTPUT_PATH` | Output file or directory | not set |
| `-i, --input <INPUT_PATH>` | `DATAGEN_INPUT_PATH` | Read an existing file or dataset directory | not set |
| `-f, --format <FORMAT>` | `DATAGEN_OUTPUT_FORMAT` | csv or parquet | csv |
Notes:
- `-f/--format` controls both read mode and write mode.
- The tool does not infer CSV vs Parquet from the file extension.
- `-t/--threads` sets `RAYON_NUM_THREADS` before generation starts.
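
The sketch below shows the lookup order that is common for CLIs accepting both flags and environment variables: an explicit flag first, then the environment variable, then the built-in default. That order is an assumption about this tool, not something stated here, and `resolve` and `resolve_rows` are hypothetical helpers.

```rust
/// Hypothetical helper: pick a setting from the flag, then the environment
/// value, then the default.
fn resolve(flag: Option<&str>, env_value: Option<&str>, default: &str) -> String {
    flag.or(env_value).unwrap_or(default).to_string()
}

/// Example wiring for the row count, using the env var and default
/// listed for `-r, --rows`.
fn resolve_rows(flag: Option<&str>) -> String {
    let env_value = std::env::var("DATAGEN_NUM_ROWS").ok();
    resolve(flag, env_value.as_deref(), "10000")
}
```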
The schema is a JSON file with a main_seed and a columns array.
Each column needs at least a name and type, and may include extra fields depending on the generator.
Minimal example:

    {
      "main_seed": 0,
      "columns": [
        { "name": "id", "type": "RowNumber" },
        { "name": "name", "type": "FullName" },
        { "name": "active", "type": "Bool", "ratio": 0.5 }
      ]
    }

Implemented generators include:
- `Hash`
- `Uuid`
- `RowNumber`
- `Number`
- `Date`
- `Bool`
- `Enum`
- `Words`
- `Numerify`
- `FirstName`
- `LastName`
- `FullName`
- `Address`
- `City`
- `State`
- `Zip`
- `FreeEmail`
- `CompanyName`
- `PhoneNumber`
- `StreetName`
The example schema.json in this repo shows a broader mix of these types working together.
Some types accept additional fields:
- `seed`: per-column override for the default seed
- `format`: used by `Hash`, `Numerify`, and `Date`
- `min` / `max`: for `Number`
- `start` / `end`: for `Date`
- `ratio`: for `Bool`
- `values`: for `Enum`
- `count`: for `Words`
If an unsupported type is used, generation currently panics.
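
A panic on an unknown type can be avoided by converting the type name through a fallible parse. This is a minimal sketch, assuming a hypothetical `ColumnType` enum that covers only the three types from the minimal example; the project's real enum and error type may look different.

```rust
use std::str::FromStr;

/// Hypothetical subset of the supported column types.
#[derive(Debug, PartialEq)]
enum ColumnType {
    RowNumber,
    FullName,
    Bool,
}

impl FromStr for ColumnType {
    type Err = String;

    /// Map the schema's `type` string to a variant, or report it as unsupported.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "RowNumber" => Ok(ColumnType::RowNumber),
            "FullName" => Ok(ColumnType::FullName),
            "Bool" => Ok(ColumnType::Bool),
            other => Err(format!("unsupported column type: {other}")),
        }
    }
}
```

With this shape, `"Bool".parse::<ColumnType>()` yields a variant, while an unknown name yields an `Err` that the caller can report before any rows are generated.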
If -o points to a file path, the tool writes:
- the data file itself, and
- a sibling stats file named `*-stats.csv` or `*-stats.parquet`
The stats file contains basic string-column summaries such as column name, row count, and max string length.
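
The naming rule can be sketched with `std::path`. The helper name `stats_path` is hypothetical, and the sketch assumes the `-stats` suffix is inserted between the file stem and the extension, which is how the `*-stats.csv` pattern reads.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: `out.csv` becomes `out-stats.csv`.
/// Returns `None` when the path has no file stem or no extension.
fn stats_path(data_path: &Path) -> Option<PathBuf> {
    let stem = data_path.file_stem()?.to_str()?;
    let ext = data_path.extension()?.to_str()?;
    Some(data_path.with_file_name(format!("{stem}-stats.{ext}")))
}
```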
If `-o` points to a directory, or ends with `/`, the tool writes a partitioned layout:

    output/
      dataset=0/
        part-00000.csv
        part-00000-stats.csv
        part-00001.csv
        part-00001-stats.csv
For Parquet, the same layout is used with .parquet files.
Existing .csv or .parquet files inside the dataset directory are cleaned up before rewriting.
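
The file-versus-directory decision and the part-file naming can be sketched as follows. Both helpers are hypothetical, and the sketch assumes the part files sit inside the `dataset=N` directory with a five-digit, zero-padded part number, as the layout suggests.

```rust
use std::path::{Path, PathBuf};

/// A trailing `/` or an existing directory selects the partitioned layout.
fn is_partitioned(output: &str) -> bool {
    output.ends_with('/') || Path::new(output).is_dir()
}

/// Hypothetical helper: build the path of one part file,
/// e.g. `output/dataset=0/part-00001.csv`.
fn part_path(root: &Path, dataset: u32, part: u32, ext: &str) -> PathBuf {
    root.join(format!("dataset={dataset}"))
        .join(format!("part-{part:05}.{ext}"))
}
```

Checking the trailing `/` first means a directory that does not exist yet can still be requested explicitly.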
- CSV output uses `;` as the separator.
- This applies to both single-file and partitioned CSV output.
When -i/--input is set, the tool reads the given file or dataset directory instead of generating new data.
- If the input path is a file, it reads a single CSV or Parquet file.
- If the input path is a directory, it recursively reads partitioned files.
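
Recursive discovery of partition files can be sketched with the standard library alone. `collect_files` is a hypothetical helper; whether the tool itself skips the `*-stats` files or reads files in a particular order is not covered here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: recursively collect every file under `dir`
/// whose extension equals `ext` (for example "csv" or "parquet").
fn collect_files(dir: &Path, ext: &str, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into partition directories such as `dataset=0/`.
            collect_files(&path, ext, out)?;
        } else if path.extension().and_then(|e| e.to_str()) == Some(ext) {
            out.push(path);
        }
    }
    Ok(())
}
```

The extension filter comes from the caller because the format is taken from `-f/--format` rather than inferred from the files.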
- The tool does not infer CSV vs Parquet from the file extension.
- Unsupported generator types still fail fast.
- The CLI is intentionally simple and geared toward local data generation, not large-scale orchestration.
- `schema.json` is the quickest smoke test for generator changes.
- The project uses `polars`, `rayon`, `fake`, `clap`, `serde_json`, `sha2`, `chrono`, `uuid`, and `strum`.
- On Windows, `cargo test` may require the MSVC build tools / `link.exe`.
Good next steps for the project would be:
- friendlier schema validation and error messages
- more supported fake-data generators
- clearer reporting around unsupported column types
- optional examples or presets for common dataset shapes
This project is licensed under the MIT License. See LICENSE for details.