dataset push: image pre-flight is too shallow — missing-image refs, undecodable/zero-byte images, and mixed resolutions all pass dry-run #72

@saadqbal

Severity: MEDIUM — three image-layout problems sail through local validation with a green ✔ and only fail (or silently misbehave) at ingest. Found during the #67 stress sweep. Grouped because they share one fix surface (the image walk in internal/push/walk.go).

(a) labels.csv references images that don't exist on disk

# labels.csv lists 001.png AND 002.png; only 001.png exists
tracebloc dataset push ./img/missingref --no-input -n tracebloc-templates \
  --category image_classification --table qa_x --intent train --label-column label --dry-run
# → ✔ "images: 1 files"  (no warning that 002.png is referenced but absent)

There is no referential-integrity check between the rows of labels.csv and the contents of images/.
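A minimal sketch of the missing cross-check, assuming the walk already has the list of filenames referenced by labels.csv and the list of files found under images/. The function name `diffRefs` and its signature are hypothetical; nothing here is taken from internal/push/walk.go.

```go
package main

import "sort"

// diffRefs is a hypothetical helper, not the CLI's actual code.
// It compares the filenames referenced by labels.csv with the filenames
// found under images/ and returns:
//   missing: referenced in labels.csv but absent on disk
//   extra:   present on disk but never referenced
// Both results are sorted so the report is stable between runs.
func diffRefs(referenced, onDisk []string) (missing, extra []string) {
	disk := make(map[string]bool, len(onDisk))
	for _, name := range onDisk {
		disk[name] = true
	}

	refs := make(map[string]bool, len(referenced))
	for _, name := range referenced {
		if refs[name] {
			continue // report a duplicated row only once
		}
		refs[name] = true
		if !disk[name] {
			missing = append(missing, name)
		}
	}

	for name := range disk {
		if !refs[name] {
			extra = append(extra, name)
		}
	}

	sort.Strings(missing)
	sort.Strings(extra)
	return missing, extra
}
```

For the repro above, `diffRefs([]string{"001.png", "002.png"}, []string{"001.png"})` would report `002.png` as missing and nothing as extra, which is the warning the dry-run currently omits.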

(b) undecodable / zero-byte files with image extensions pass

: > img/zerobyte/images/001.jpg              # 0-byte .jpg
printf 'not an image' > img/fakeext/images/001.jpg   # text with .jpg ext
# both →  Note: couldn't auto-detect image size ... using the ingestor default.  ✔ Dry-run complete

A failed image decode is downgraded to a soft "couldn't auto-detect size" note instead of being flagged as "this file isn't a valid image."
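One way the stricter check could look, sketched with Go's standard `image` package. `checkImage` and `errEmptyImage` are hypothetical names, and the choice of registered decoders (PNG and JPEG only) is an assumption; the formats the ingestor actually accepts are not stated in this issue.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"image"
	_ "image/jpeg" // register the JPEG decoder with the image package
	_ "image/png"  // register the PNG decoder with the image package
)

// errEmptyImage is a hypothetical sentinel for zero-byte files.
var errEmptyImage = errors.New("file is empty (0 bytes)")

// checkImage is a hypothetical pre-flight check, not the CLI's actual code.
// It rejects zero-byte files outright and otherwise asks the standard
// library to parse the image header. A non-nil error means the file should
// be listed as invalid instead of triggering the soft
// "couldn't auto-detect image size" note.
func checkImage(data []byte) error {
	if len(data) == 0 {
		return errEmptyImage
	}
	if _, _, err := image.DecodeConfig(bytes.NewReader(data)); err != nil {
		return fmt.Errorf("not a decodable image: %w", err)
	}
	return nil
}
```

`image.DecodeConfig` only reads the header, so it is cheap but would still accept a file that is truncated after a valid header. Swapping in `image.Decode` would catch that case at the cost of decoding every pixel.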

(c) mixed-resolution images not detected

# 001.png is 1x1, 002.png is 2x2
tracebloc dataset push ./img/mixres ... --dry-run   # → ✔  (target_size auto-detected from the FIRST image only)

--help says "All images must share this resolution — the ingestor validates it, it does not resize", but the CLI auto-detects from the first image and doesn't pre-check the rest, so the mismatch only surfaces as an ingest-time Image Resolution Validator failure.
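A rough sketch of the all-images comparison, assuming the walk has already collected each file's dimensions (for example from the header decode). The `imageSize` type and `mismatchedSizes` function are hypothetical. Using the first image as the reference mirrors the current auto-detect behaviour described above; it is not a claim about how the fix should pick target_size.

```go
package main

import "fmt"

// imageSize is a hypothetical record of one image's dimensions.
type imageSize struct {
	Name          string
	Width, Height int
}

// mismatchedSizes is a hypothetical helper, not the CLI's actual code.
// It treats the first image as the reference resolution and returns one
// message per image whose dimensions differ from it. An empty result
// means every image shares the reference resolution.
func mismatchedSizes(sizes []imageSize) []string {
	if len(sizes) == 0 {
		return nil
	}
	ref := sizes[0]
	var out []string
	for _, s := range sizes[1:] {
		if s.Width != ref.Width || s.Height != ref.Height {
			out = append(out, fmt.Sprintf("%s: %dx%d (expected %dx%d from %s)",
				s.Name, s.Width, s.Height, ref.Width, ref.Height, ref.Name))
		}
	}
	return out
}
```

For the mixres repro, passing `001.png` at 1x1 and `002.png` at 2x2 yields a single entry naming `002.png`, so the dry-run could fail locally with a per-file list instead of deferring to the ingest-time Image Resolution Validator.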

Expected

Deepen the image pre-flight walk to (a) cross-check labels.csv filenames against files on disk and report missing/extra, (b) attempt to decode each image (or at least reject zero-byte ones) and fail with a per-file list, (c) when auto-detecting target_size, scan all images and report any that differ before staging.

Part of #67.
