dataset push: image ingestion only works for .jpeg — .png/.jpg/.webp fail the File Type Validator #68

@saadqbal

Description

Severity: CRITICAL — breaks the default --category image_classification path end-to-end. Found during the #67 stress sweep (dev cluster tb-client-dev-templates, ns tracebloc-templates, chart 1.6.1, ingestor image tag 0.3).

Repro

Push valid images of any type other than .jpeg:

# 6 properly-encoded 32x32 PNGs + labels.csv
tracebloc dataset push ./img/realpng --no-input -n tracebloc-templates \
  --category image_classification --table qa_realpng --intent train --label-column label
# → exit 9
# Also fails identically for a .jpg dataset (qa_realjpg).

Observed

Local dry-run is green. The ingestor's Image Resolution Validator passes (it decodes the images fine), then the File Type Validator fails:

Ingestion failed: File Type Validator Validator failed:
Files with invalid extensions found: ['/data/shared/.tracebloc-staging/qa_realpng/images/img4.png', ... all 6 .png]
Error: ingestion Job exited non-zero — see logs above

A .jpg dataset fails the same way. Only a uniformly-.jpeg dataset ingests.

Root cause (CLI side)

internal/push/spec.go buildImage() sets images, label, target_size, etc. but never emits an extension field. The embedded schema does define one (internal/schema/ingest.v1.json:203, default .jpg for image categories), so the CLI could send it — it just doesn't.

On the ingestor side, for image categories tracebloc_ingestor/utils/validators_mapping.py builds FileTypeValidator(allowed_extension=options["extension"]). That key is required and holds a single value, unlike text, which uses .get(..., default). With the CLI omitting it, the convention default kicks in: tracebloc_ingestor/cli/conventions.py:105 defines IMAGE_CLASSIFICATION: {"target_size": [256,256], "extension": ".jpeg"}. The validator (validators/file_validator.py:180) then strictly rejects anything other than .jpeg, so .png fails, and even .jpg fails because .jpg != .jpeg.
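To make the failure mode concrete, here is a minimal sketch of a strict single-extension check. It is not the ingestor's actual code: the function name and the constant are hypothetical, and it assumes the comparison is an exact, case-sensitive string match on the file suffix, which is what the observed log suggests.

```python
from pathlib import Path

# Hypothetical stand-in for the convention default described above.
CONVENTION_DEFAULT_EXTENSION = ".jpeg"


def find_invalid_extensions(paths, allowed_extension=CONVENTION_DEFAULT_EXTENSION):
    """Return the paths whose suffix is not exactly `allowed_extension`."""
    return [p for p in paths if Path(p).suffix != allowed_extension]


files = ["images/img0.jpeg", "images/img1.jpg", "images/img2.png"]
invalid = find_invalid_extensions(files)
print(invalid)  # ['images/img1.jpg', 'images/img2.png']
```

Under that assumption, .jpg is rejected for the same reason as .png: the check compares strings and does not treat .jpg and .jpeg as the same format.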

Expected

The CLI should detect the dataset's actual image extension during the walk (it already enumerates image files and decodes the first one for target_size) and emit it in the spec (spec.file_options.extension, alongside the existing target_size). Then .png/.jpg/.webp datasets ingest as --help advertises ("Accepted image extensions: .jpg, .jpeg, .png, .webp").
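The detection step could look roughly like the sketch below. It is written in Python only to illustrate the logic; the real change would live in the Go CLI (buildImage() in internal/push/spec.go). The helper names are hypothetical, the accepted set is copied from the --help text quoted above, and the output shape mirrors the file_options fields named in this issue rather than the full schema. Case folding is deliberately left out because the validator's case handling is not known.

```python
from pathlib import Path

# Copied from the --help text quoted in this issue.
ACCEPTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def detect_image_extension(filenames):
    """Return the one extension shared by all image files.

    Raises ValueError if no image files are found or if extensions are mixed.
    Non-image files such as labels.csv are ignored.
    """
    found = {
        Path(name).suffix
        for name in filenames
        if Path(name).suffix in ACCEPTED_EXTENSIONS
    }
    if not found:
        raise ValueError("no image files with an accepted extension")
    if len(found) > 1:
        raise ValueError(f"mixed image extensions: {sorted(found)}")
    return found.pop()


def build_file_options(filenames, target_size):
    """Hypothetical spec fragment: extension emitted alongside target_size."""
    return {
        "target_size": list(target_size),
        "extension": detect_image_extension(filenames),
    }


walked = ["img0.png", "img1.png", "labels.csv"]
file_options = build_file_options(walked, (32, 32))
print(file_options)  # {'target_size': [32, 32], 'extension': '.png'}
```

Failing early on a mixed set in the CLI would surface the uniformity constraint during the local dry-run instead of inside the ingestion Job.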

Notes / follow-ups

  • The FileTypeValidator also enforces single-extension uniformity, so a mixed .jpg+.png dataset would still fail even after this fix — worth deciding whether mixed sets are supported (the --help text implies they might be). Likely a companion data-ingestors issue (accept the supported set, or derive the extension from the staged files).
  • .webp may not be in the ingestor's FileExtension enum (utils/constants.py) — verify before claiming support.
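If mixed sets are to be supported, one option from the first note is a set-based check on the ingestor side. The sketch below is an assumption about how that could look, not the existing FileTypeValidator; the helper name is hypothetical, and .webp is left out of the example set until the FileExtension enum has been checked.

```python
from pathlib import Path


def find_unsupported_files(paths, allowed_extensions):
    """Return paths whose suffix is not in `allowed_extensions` (hypothetical helper)."""
    allowed = frozenset(allowed_extensions)
    return [p for p in paths if Path(p).suffix not in allowed]


# With a set-based check, .jpg and .png can coexist; only the .gif is rejected.
mixed = ["images/a.jpg", "images/b.png", "images/c.gif"]
unsupported = find_unsupported_files(mixed, {".jpg", ".jpeg", ".png"})
print(unsupported)  # ['images/c.gif']
```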

Part of #67.
