Severity: MEDIUM — three image-layout problems sail through local validation with a green ✔ and only fail (or silently misbehave) at ingest. Found during the #67 stress sweep. Grouped because they share one fix surface (the image walk in internal/push/walk.go).
(a) labels.csv references images that don't exist on disk
# labels.csv lists 001.png AND 002.png; only 001.png exists
tracebloc dataset push ./img/missingref --no-input -n tracebloc-templates \
--category image_classification --table qa_x --intent train --label-column label --dry-run
# → ✔ "images: 1 files" (no warning that 002.png is referenced but absent)
No referential-integrity check between labels.csv rows and images/ contents.
(b) undecodable / zero-byte files with image extensions pass
: > img/zerobyte/images/001.jpg # 0-byte .jpg
printf 'not an image' > img/fakeext/images/001.jpg # text with .jpg ext
# both → Note: couldn't auto-detect image size ... using the ingestor default. ✔ Dry-run complete
A failed image decode is downgraded to a soft "couldn't auto-detect size" note instead of being flagged as "this file isn't a valid image."
(c) mixed-resolution images not detected
# 001.png is 1x1, 002.png is 2x2
tracebloc dataset push ./img/mixres ... --dry-run # → ✔ (target_size auto-detected from the FIRST image only)
--help says "All images must share this resolution — the ingestor validates it, it does not resize", but the CLI auto-detects from the first image and doesn't pre-check the rest, so the mismatch only surfaces as an ingest-time Image Resolution Validator failure.
Expected
Deepen the image pre-flight walk to (a) cross-check labels.csv filenames against files on disk and report missing/extra, (b) attempt to decode each image (or at least reject zero-byte ones) and fail with a per-file list, (c) when auto-detecting target_size, scan all images and report any that differ before staging.
Part of #67.
Severity: MEDIUM — three image-layout problems sail through local validation with a green ✔ and only fail (or silently misbehave) at ingest. Found during the #67 stress sweep. Grouped because they share one fix surface (the image walk in
internal/push/walk.go).(a) labels.csv references images that don't exist on disk
No referential-integrity check between labels.csv rows and
images/contents.(b) undecodable / zero-byte files with image extensions pass
A failed image decode is downgraded to a soft "couldn't auto-detect size" note instead of being flagged as "this file isn't a valid image."
(c) mixed-resolution images not detected
--helpsays "All images must share this resolution — the ingestor validates it, it does not resize", but the CLI auto-detects from the first image and doesn't pre-check the rest, so the mismatch only surfaces as an ingest-time Image Resolution Validator failure.Expected
Deepen the image pre-flight walk to (a) cross-check labels.csv filenames against files on disk and report missing/extra, (b) attempt to decode each image (or at least reject zero-byte ones) and fail with a per-file list, (c) when auto-detecting
target_size, scan all images and report any that differ before staging.Part of #67.