Severity: CRITICAL — breaks the default --category image_classification path end-to-end. Found during the #67 stress sweep (dev cluster tb-client-dev-templates, ns tracebloc-templates, chart 1.6.1, ingestor image tag 0.3).
Repro
Push valid images of any type other than .jpeg:
# 6 properly-encoded 32x32 PNGs + labels.csv
tracebloc dataset push ./img/realpng --no-input -n tracebloc-templates \
--category image_classification --table qa_realpng --intent train --label-column label
# → exit 9
# Also fails identically for a .jpg dataset (qa_realjpg).
Observed
Local dry-run is green. The ingestor's Image Resolution Validator passes (it decodes the images fine), then the File Type Validator fails:
Ingestion failed: File Type Validator Validator failed:
Files with invalid extensions found: ['/data/shared/.tracebloc-staging/qa_realpng/images/img4.png', ... all 6 .png]
Error: ingestion Job exited non-zero — see logs above
A .jpg dataset fails the same way. Only a uniformly-.jpeg dataset ingests.
Root cause (CLI side)
internal/push/spec.go buildImage() sets images, label, target_size, etc. but never emits an extension field. The embedded schema does define one (internal/schema/ingest.v1.json:203, default .jpg for image categories), so the CLI could send it — it just doesn't.
On the ingestor side, for image categories tracebloc_ingestor/utils/validators_mapping.py builds FileTypeValidator(allowed_extension=options["extension"]) (a required key, single value — unlike text which uses .get(..., default)). With the CLI omitting it, the convention default kicks in: tracebloc_ingestor/cli/conventions.py:105 → IMAGE_CLASSIFICATION: {"target_size": [256,256], "extension": ".jpeg"}. The validator (validators/file_validator.py:180) then strict-rejects anything != .jpeg — so .png fails, and even .jpg fails because .jpg != .jpeg.
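The mismatch can be reproduced in isolation. What follows is a minimal sketch, not the ingestor's actual code: `resolve_extension` and `check_extensions` are hypothetical stand-ins, and both the "spec value overrides convention default" merge order and the strict single-value comparison are assumptions drawn from the behavior described above.

```python
import os

# Default quoted above from conventions.py; the helpers below are hypothetical.
CONVENTION_DEFAULTS = {"target_size": [256, 256], "extension": ".jpeg"}


def resolve_extension(spec_options):
    """Assumed merge order: spec values win, defaults fill in omitted keys."""
    merged = {**CONVENTION_DEFAULTS, **spec_options}
    return merged["extension"]


def check_extensions(paths, allowed_extension):
    """Return every path whose extension is not exactly allowed_extension."""
    return [p for p in paths if os.path.splitext(p)[1] != allowed_extension]


# The CLI currently sends no extension, so the ".jpeg" default applies.
allowed = resolve_extension({"target_size": [32, 32]})
rejected = check_extensions(["a.jpeg", "b.jpg", "c.png"], allowed)
print(allowed, rejected)  # .jpeg ['b.jpg', 'c.png']
```

Under these assumptions, passing `{"extension": ".png"}` in the spec options is enough for a uniformly-.png dataset to pass the same check.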
Expected
The CLI should detect the dataset's actual image extension during the walk (it already enumerates image files and decodes the first one for target_size) and emit it in the spec (spec.file_options.extension, alongside the existing target_size). Then .png/.jpg/.webp datasets ingest as --help advertises ("Accepted image extensions: .jpg, .jpeg, .png, .webp").
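One possible shape for the detection step is sketched below. The CLI itself is Go; Python is used here only for illustration. `detect_extension` and `build_file_options` are hypothetical names, the accepted set is copied from the --help text quoted above, and returning the extension exactly as it appears on disk (no case folding) is an assumption made because the validator compares strictly.

```python
import os
from collections import Counter

# Accepted set as advertised by --help (quoted above).
ACCEPTED = {".jpg", ".jpeg", ".png", ".webp"}


def detect_extension(paths):
    """Return the single image extension used by paths, or raise ValueError."""
    found = Counter()
    for path in paths:
        ext = os.path.splitext(path)[1]
        if ext.lower() in ACCEPTED:  # non-image files such as labels.csv are skipped
            found[ext] += 1
    if not found:
        raise ValueError("no image files with an accepted extension")
    if len(found) > 1:
        raise ValueError(f"mixed image extensions: {sorted(found)}")
    return next(iter(found))


def build_file_options(paths, target_size):
    """Hypothetical counterpart of the spec's file_options block."""
    return {"target_size": list(target_size), "extension": detect_extension(paths)}


options = build_file_options(["labels.csv", "img0.png", "img1.png"], (32, 32))
print(options)
```

Failing fast on a mixed set keeps the error local to the CLI instead of surfacing it only after the ingestion Job has run; whether mixed sets should be rejected at all is an open question.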
Notes / follow-ups
- The FileTypeValidator also enforces single-extension uniformity, so a mixed .jpg+.png dataset would still fail even after this fix. It is worth deciding whether mixed sets are supported (the --help text implies they might be). This is likely a companion data-ingestors issue (accept the supported set, or derive the extension from the staged files).
- .webp may not be in the ingestor's FileExtension enum (utils/constants.py); verify before claiming support.
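If the companion fix goes the "accept the supported set" route, the check could look roughly like the sketch below. This is an illustration only: `invalid_files` is a hypothetical helper, not the existing FileTypeValidator API, and the case-insensitive comparison is an assumption. The set passed in should come from whatever the ingestor's FileExtension enum actually contains.

```python
import os


def invalid_files(paths, allowed_extensions):
    """Return paths whose extension is outside allowed_extensions (case-insensitive)."""
    allowed = {ext.lower() for ext in allowed_extensions}
    return [p for p in paths if os.path.splitext(p)[1].lower() not in allowed]


# A mixed .jpg + .png dataset passes when both extensions are allowed.
mixed = ["img0.jpg", "img1.png", "img2.PNG"]
print(invalid_files(mixed, {".jpg", ".jpeg", ".png"}))  # []
print(invalid_files(mixed, {".jpeg"}))  # ['img0.jpg', 'img1.png', 'img2.PNG']
```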
Part of #67.