dataset push: re-pushing to an existing table silently duplicates data (no warning, no --overwrite/--append) #70

@saadqbal

Severity: HIGH (data integrity) — a re-run silently doubles the dataset. Found during the #67 stress sweep.

Repro

# push #1 — succeeds, 6 rows
tracebloc dataset push ./tab/clean --no-input -n tracebloc-templates \
  --category tabular_classification --table qa_clean --intent train --label-column label
# push #2 — identical command, same table
tracebloc dataset push ./tab/clean --no-input -n tracebloc-templates \
  --category tabular_classification --table qa_clean --intent train --label-column label

Observed

Both runs report ✅ Successfully Processed: 6 / 🚀 Sent to API: 6 / 🎉 completed successfully and exit 0. The "Duplicate Validator" passes both times. Querying MySQL afterward:

SELECT COUNT(*) FROM training_test_datasets.qa_clean;  -- 12

The table now holds 12 rows: the second push appended the same 6 rows again, and nothing in the output indicated it. There is no "table already exists with N rows" warning and no flag to declare intent.
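The doubling is what any append-only insert path produces when it runs twice against the same table. A self-contained illustration, using Python's built-in sqlite3 in place of MySQL. The `push_rows` helper, the column names, and the six dummy rows are hypothetical stand-ins, not the CLI's actual ingestion code or the real contents of ./tab/clean:

```python
import sqlite3


def push_rows(conn, rows):
    """Hypothetical stand-in for a push: create the table if missing, then insert."""
    conn.execute("CREATE TABLE IF NOT EXISTS qa_clean (feature REAL, label TEXT)")
    # Plain INSERT with no existence check and no unique key: rows always append.
    conn.executemany("INSERT INTO qa_clean (feature, label) VALUES (?, ?)", rows)
    conn.commit()


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM qa_clean").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [(float(i), "a" if i % 2 else "b") for i in range(6)]  # dummy data

push_rows(conn, rows)
first = row_count(conn)   # 6 after push #1

push_rows(conn, rows)
second = row_count(conn)  # 12 after the identical push #2

print(first, second)
```

Both calls succeed without error, which matches the green summary seen in the repro.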

Why it matters

Re-running a push is one of the most common things a user does — unsure the first run worked, fixing an unrelated flag, retrying after a network blip. Today that silently corrupts the dataset (duplicated rows skew training/splits) with a cheerful green summary.

Expected (pick one, ideally both)

  • Pre-flight: detect that the table already exists (the CLI already does existence checks for dataset list/rm) and, in interactive mode, prompt: 'table "qa_clean" already exists (6 rows): [a]ppend / [o]verwrite / [c]ancel?'
  • Add explicit --if-exists=error|append|overwrite (default error) so --no-input/CI runs fail safe instead of silently appending.
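The two proposals above could share a single decision function. A minimal sketch in Python, assuming hypothetical names throughout (`IfExists`, `TableExistsError`, `resolve_action`, `ask`); none of it is taken from the tracebloc CLI source, and `--if-exists` itself is only the flag proposed in this issue:

```python
from enum import Enum
from typing import Callable, Optional


class IfExists(Enum):
    ERROR = "error"
    APPEND = "append"
    OVERWRITE = "overwrite"


class TableExistsError(RuntimeError):
    """Raised when the target table exists and the push must not proceed."""


def resolve_action(
    table: str,
    existing_rows: Optional[int],  # None means the table does not exist yet
    policy: Optional[IfExists],    # None means --if-exists was not passed
    interactive: bool,             # False for --no-input / CI runs
    ask: Callable[[str], str] = input,
) -> str:
    """Return 'create', 'append' or 'overwrite', or raise TableExistsError."""
    if existing_rows is None:
        return "create"

    if policy is None:
        if interactive:
            answer = ask(
                f'table "{table}" already exists ({existing_rows} rows): '
                "[a]ppend / [o]verwrite / [c]ancel? "
            ).strip().lower()
            # Anything other than 'a' or 'o' (including 'c') is treated as cancel.
            policy = {"a": IfExists.APPEND, "o": IfExists.OVERWRITE}.get(
                answer, IfExists.ERROR
            )
        else:
            # Fail safe: non-interactive runs never append silently.
            policy = IfExists.ERROR

    if policy is IfExists.ERROR:
        raise TableExistsError(
            f'table "{table}" already exists ({existing_rows} rows); '
            "pass --if-exists=append or --if-exists=overwrite to proceed"
        )
    return policy.value
```

With this shape, the second push in the repro (`--no-input`, no `--if-exists`) would raise instead of appending, while an explicit `--if-exists=append` would keep today's behavior as an opt-in.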

Note: the idempotency story (--idempotency-key) only helps within a single retried invocation; it does not cover a fresh re-invocation, which is the common case here.

Part of #67.
