Skip to content

dataset push: UTF-8 BOM not stripped from CSV header — Excel-exported CSVs get a corrupt first column name #71

@saadqbal

Description

@saadqbal

Severity: MEDIUM — hits the single most common tabular source (Excel / Windows "Save as CSV UTF-8", which prepend a BOM). Found during the #67 stress sweep.

Repro

printf '\xef\xbb\xbfa,b,label\n1,2,x\n3,4,y\n' > tab/bom/data.csv   # BOM + header
tracebloc dataset push ./tab/bom --no-input -n tracebloc-templates \
  --category tabular_classification --table qa_bom --intent train --label-column label --dry-run
# → ✔ Dry-run complete (no warning)

Observed / root cause

InferSchema() in internal/push/tabular.go reads the header with encoding/csv (which does not strip a leading BOM) and trims each name with strings.TrimSpace. TrimSpace does not remove U+FEFF (verified: unicode.IsSpace('') == false, bytes ef bb bf survive). So the first column's inferred name becomes <name> rather than <name>.

Consequences:

Expected

Strip a leading UTF-8 BOM from the first header cell during inference (e.g. strings.TrimPrefix(col, "") on the first column, or detect+strip at file read). Excel CSVs should "just work".

Part of #67.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions