Severity: HIGH (data integrity) — a re-run silently doubles the dataset. Found during the #67 stress sweep.
Repro
# push #1 — succeeds, 6 rows
tracebloc dataset push ./tab/clean --no-input -n tracebloc-templates \
--category tabular_classification --table qa_clean --intent train --label-column label
# push #2 — identical command, same table
tracebloc dataset push ./tab/clean --no-input -n tracebloc-templates \
--category tabular_classification --table qa_clean --intent train --label-column label
Observed
Both runs report ✅ Successfully Processed: 6 / 🚀 Sent to API: 6 / 🎉 completed successfully and exit 0. The "Duplicate Validator" passes both times. Querying MySQL afterward:
SELECT COUNT(*) FROM training_test_datasets.qa_clean; -- 12
The table now holds 12 rows — the data was appended/re-sent with no indication. There is no "table already exists with N rows" warning and no flag to declare intent.
Why it matters
Re-running a push is one of the most common things a user does — unsure the first run worked, fixing an unrelated flag, retrying after a network blip. Today that silently corrupts the dataset (duplicated rows skew training/splits) with a cheerful green summary.
Expected (pick one, ideally both)
- Pre-flight: detect that the table already exists (the CLI already does existence checks for dataset list/rm) and, in interactive mode, prompt: table "qa_clean" already exists (6 rows): [a]ppend / [o]verwrite / [c]ancel?
- Add an explicit --if-exists=error|append|overwrite flag (default error) so --no-input/CI runs fail safe instead of silently appending.
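
A minimal sketch of how the two proposals above could combine into one decision step, assuming the CLI can learn the target table's current row count before sending anything. Every name here (`IfExists`, `TableExistsError`, `resolve_push_action`, the `ask` callback) is hypothetical and for illustration only; none of it is taken from the tracebloc codebase.

```python
from enum import Enum
from typing import Callable, Optional


class IfExists(str, Enum):
    """Hypothetical values for the proposed --if-exists flag."""
    ERROR = "error"
    APPEND = "append"
    OVERWRITE = "overwrite"


class TableExistsError(RuntimeError):
    """Raised when the table exists and no append/overwrite intent was given."""


def resolve_push_action(
    existing_rows: Optional[int],
    if_exists: Optional[IfExists],
    interactive: bool,
    ask: Callable[[str], str] = input,
    table: str = "qa_clean",
) -> str:
    """Return 'create', 'append' or 'overwrite', or raise TableExistsError.

    existing_rows is None when the table does not exist yet; 0 means the
    table exists but is empty, which still counts as "exists" here.
    if_exists is None when the flag was not passed on the command line.
    """
    # No table yet: nothing to decide.
    if existing_rows is None:
        return "create"

    # An explicit append/overwrite always wins, in any mode.
    if if_exists in (IfExists.APPEND, IfExists.OVERWRITE):
        return if_exists.value

    # Flag omitted and a human is present: ask instead of guessing.
    if if_exists is None and interactive:
        answer = ask(
            f'table "{table}" already exists ({existing_rows} rows): '
            "[a]ppend / [o]verwrite / [c]ancel? "
        ).strip().lower()
        if answer in ("a", "append"):
            return "append"
        if answer in ("o", "overwrite"):
            return "overwrite"
        raise TableExistsError(f'push to "{table}" cancelled by user')

    # --if-exists=error, or flag omitted under --no-input: fail safe.
    raise TableExistsError(
        f'table "{table}" already exists ({existing_rows} rows); '
        "pass --if-exists=append or --if-exists=overwrite to proceed"
    )
```

With this shape, the second push in the repro (flag omitted, --no-input) would raise and exit non-zero instead of appending, while `resolve_push_action(6, IfExists.APPEND, interactive=False)` returns `"append"` for callers who really do want the extra rows.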
Note: the idempotency story (--idempotency-key) only helps within a single retried invocation; it does not cover a fresh re-invocation, which is the common case here.
Part of #67.