# CTable Implementation Log

This document records everything implemented across the CTable feature:
the `ctable-schema.md` redesign (schema, validation, serialization, optimizations)
and the `ctable-persistency.md` phase (file-backed storage, `open()`, read-only mode).

---

## Phase 1 — Schema redesign (`ctable-schema.md`)

The goal was to replace the original Pydantic-`BaseModel`-based schema API with a
**dataclass-first schema API** using declarative spec objects (`b2.int64()`,

…

All tests live in `tests/ctable/`.

| `test_extend_delete.py` | Interleaved extend/delete cycles, mask correctness, resize behavior |
| `test_row_logic.py` | Row indexer (int/slice/list), views, chained views |

Total: **135 tests, all passing** (after Phase 1 + optimizations).

---

## Phase 1 design decisions

**Why two validation paths?**
`append()` handles one row at a time — Pydantic is fast enough and also performs

…

Existing code using `class RowModel(BaseModel)` continues to work without
modification. The adapter is not on the critical path for new code.

**Why `schema_to_dict` / `schema_from_dict` now?**
Persistence requires a self-contained schema representation that survives without
the original Python dataclass. Establishing the serialization format before
persistence was built ensured the format was stable before anything depended on it.

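A minimal sketch of what such a self-contained representation looks like. The dict/JSON shape here is an illustrative assumption, not the actual format produced by `schema_to_dict`:

```python
import json

# Hypothetical schema round-trip: the stored form is plain JSON, so it can
# be reconstructed later without the original Python dataclass.
# Key names ("version", "fields", ...) are illustrative assumptions.

def schema_to_dict(fields):
    """fields: list of (name, dtype, constraints) tuples."""
    return {
        "version": 1,
        "fields": [{"name": n, "dtype": d, "constraints": c} for n, d, c in fields],
    }

def schema_from_dict(d):
    return [(f["name"], f["dtype"], f["constraints"]) for f in d["fields"]]

fields = [("id", "int64", {"ge": 0}), ("name", "str", {"max_len": 32})]
blob = json.dumps(schema_to_dict(fields))   # what persistence would store
assert schema_from_dict(json.loads(blob)) == fields
```

Because the stored form is data rather than code, the format can stay stable even as the Python-side spec classes evolve.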
---

## Phase 1 optimizations (post-schema)

Several performance improvements were made after the schema work was complete:

**`_last_pos` cache**
Added `_last_pos: int | None` to `CTable`. Tracks the physical index of the next
write slot so that `append()` and `extend()` no longer need to scan backward through
chunk metadata on every call. Set to `None` after any deletion (triggers one lazy
recalculation on the next write). Set to `_n_rows` after `compact()`. Eliminated a
backward O(n_chunks) scan per insert.
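
The caching pattern, reduced to a runnable sketch. Plain lists stand in for the chunked NDArray storage; the method names follow the log, the rest is illustrative:

```python
# Invalidate-on-delete, resolve-lazily-on-write cache for the next write slot.

class Table:
    def __init__(self):
        self._rows = []
        self._valid = []
        self._last_pos = 0          # next physical write slot, or None

    def _resolve_last_pos(self):
        # Lazy recalculation: runs once, on the first write after a delete.
        self._last_pos = len(self._rows)

    def append(self, row):
        if self._last_pos is None:
            self._resolve_last_pos()
        self._rows.append(row)
        self._valid.append(True)
        self._last_pos += 1

    def delete(self, i):
        self._valid[i] = False      # tombstone, not physical removal
        self._last_pos = None       # invalidate; recalculated lazily

t = Table()
t.append("a"); t.append("b")
t.delete(0)
assert t._last_pos is None          # invalidated, no scan yet
t.append("c")                       # triggers one _resolve_last_pos()
assert t._last_pos == 3
```

The point of the laziness is that a burst of deletes costs one recalculation at most, paid on the next write rather than per delete.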

**`_grow()` helper**
Extracted the capacity-doubling logic into `_grow()`. Removes duplication between
`append()` and `extend()`.
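
The doubling strategy behind `_grow()`, sketched against a NumPy array (the real helper resizes blosc2 NDArrays, in memory or on disk):

```python
import numpy as np

# Capacity doubling: grow to the smallest power-of-two multiple of the
# current capacity that fits `needed`, so repeated inserts are amortized O(1).

def grow(arr, needed):
    cap = len(arr)
    if needed <= cap:
        return arr                      # nothing to do
    new_cap = max(cap, 1)
    while new_cap < needed:
        new_cap *= 2
    out = np.empty(new_cap, dtype=arr.dtype)
    out[:cap] = arr                     # copy existing data into the new buffer
    return out

a = np.arange(5)
b = grow(a, 12)
assert len(b) == 20 and (b[:5] == a).all()
```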

**In-place delete**
`delete()` now writes the updated boolean array back with
`self._valid_rows[:] = valid_rows_np` (in-place slice assignment) instead of
creating a new NDArray. Avoids a full allocation on each delete.
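
The distinction matters because slice assignment mutates the existing buffer rather than rebinding the name to a fresh one; a plain NumPy demonstration:

```python
import numpy as np

# In-place slice assignment writes into the existing buffer. Rebinding
# (`valid = new_array`) would leave other references pointing at stale data
# and allocate a new array each time.

valid = np.array([True, True, True, True])
alias = valid                            # second reference to the same buffer

valid[:] = [True, False, True, False]    # in-place: no new allocation
assert alias is valid
assert alias.tolist() == [True, False, True, False]
```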

**`head()` / `tail()` refactored**
Both methods now reuse `_find_physical_index()` instead of containing their own
chunk-walk loops.

**`_make_view()` classmethod**
Added to construct view CTables without going through `__init__`. Avoids
allocating and immediately discarding NDArrays that were never used.

**`_NumericSpec` mixin + new spec types**
All numeric specs (`int8` through `uint64`, `float32`, `float64`) share a common
`_NumericSpec` mixin for `ge`/`gt`/`le`/`lt` constraint handling, eliminating
boilerplate. New specs added: `int8`, `int16`, `int32`, `uint8`, `uint16`,
`uint32`, `uint64`, `float32`.

**String vectorized validation**
`validate_column_values` uses `np.char.str_len()` (true C-level) for `U`/`S` dtype
arrays instead of `np.vectorize(len)` (a Python loop in disguise). The check was
also extracted into `_validate_string_lengths()` to reduce cyclomatic complexity.
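
The difference in mechanism, shown with plain NumPy:

```python
import numpy as np

# np.char.str_len operates at C level over U/S dtype arrays, unlike
# np.vectorize(len), which calls Python's len() once per element.

names = np.array(["ada", "grace", "alan"], dtype="U16")
lengths = np.char.str_len(names)
assert lengths.tolist() == [3, 5, 4]

# A max-length constraint then becomes one vectorized comparison:
assert (lengths <= 16).all()
```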

**Column name validation**
`compile_schema` now calls `_validate_column_name()` on every field. Rejects names
that are empty, start with `_`, or contain `/` — rules that apply equally to
in-memory and persistent tables.
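
The rules reduce to a few lines; this standalone version mirrors what `_validate_column_name` is described as rejecting (the function body is an illustrative reconstruction):

```python
# Reject names that are empty, _-prefixed, or contain a path separator.

def validate_column_name(name: str) -> None:
    if not name:
        raise ValueError("column name must not be empty")
    if name.startswith("_"):
        raise ValueError(f"column name {name!r} must not start with '_'")
    if "/" in name:
        raise ValueError(f"column name {name!r} must not contain '/'")

validate_column_name("price")              # fine
for bad in ("", "_meta", "a/b"):
    try:
        validate_column_name(bad)
        raise AssertionError("should have raised")
    except ValueError:
        pass
```

The `_` and `/` rules matter on disk too, where column files live under `_cols/` next to `_meta.b2frame` and `_valid_rows.b2nd`.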

---

## Phase 2 — Persistency (`ctable-persistency.md`)

### New file: `src/blosc2/ctable_storage.py`

A storage-backend abstraction that keeps all file I/O out of `ctable.py`.

**`TableStorage`** — interface class defining:
`create_column`, `open_column`, `create_valid_rows`, `open_valid_rows`,
`save_schema`, `load_schema`, `table_exists`, `is_read_only`.
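
The interface, sketched as an abstract base class over the eight methods named above. The signatures are assumptions, and the in-memory implementation below uses plain lists rather than `blosc2.NDArray` to stay self-contained:

```python
from abc import ABC, abstractmethod

# Hypothetical shape of the storage interface; only method names are from
# the log, signatures and bodies are illustrative.

class TableStorage(ABC):
    @abstractmethod
    def create_column(self, name, dtype, size): ...
    @abstractmethod
    def open_column(self, name): ...
    @abstractmethod
    def create_valid_rows(self, size): ...
    @abstractmethod
    def open_valid_rows(self): ...
    @abstractmethod
    def save_schema(self, schema_dict): ...
    @abstractmethod
    def load_schema(self): ...
    @abstractmethod
    def table_exists(self): ...
    @abstractmethod
    def is_read_only(self): ...

class InMemoryTableStorage(TableStorage):
    def __init__(self):
        self._cols, self._valid, self._schema = {}, None, None
    def create_column(self, name, dtype, size):
        self._cols[name] = [None] * size
        return self._cols[name]
    def open_column(self, name):
        return self._cols[name]
    def create_valid_rows(self, size):
        self._valid = [False] * size
        return self._valid
    def open_valid_rows(self):
        return self._valid
    def save_schema(self, schema_dict):
        self._schema = schema_dict      # no file write; kept for symmetry
    def load_schema(self):
        return self._schema
    def table_exists(self):
        return bool(self._cols)
    def is_read_only(self):
        return False
```

Keeping both backends behind one interface is what lets `ctable.py` stay free of file I/O.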

**`InMemoryTableStorage`** — trivial implementation that creates plain in-memory
`blosc2.NDArray` objects and is a no-op for `save_schema`. Used when `urlpath` is
not provided (existing default behaviour, unchanged).

**`FileTableStorage`** — file-backed implementation.

Disk layout:

```
<urlpath>/
  _meta.b2frame      ← blosc2.SChunk; vlmeta holds kind, version, schema JSON
  _valid_rows.b2nd   ← file-backed boolean NDArray (tombstone mask)
  _cols/
    <name>.b2nd      ← one file-backed NDArray per column
```

Key implementation notes:
- `save_schema` always opens `_meta.b2frame` with `mode="w"` (create path only).
- `load_schema` / `check_kind` use `blosc2.open()` (not `blosc2.SChunk(..., mode="a")`),
  which is the correct API for reopening an existing SChunk file.
- File-backed NDArrays (`urlpath=..., mode="w"`) support in-place writes
  (`col[pos] = value`, `col[start:end] = arr`) that persist immediately. This is
  why resize (`_grow()`), append, extend, and delete all work transparently on
  persistent tables.
- `_n_rows` on reopen is reconstructed as `blosc2.count_nonzero(valid_rows)` —
  always correct because unwritten slots are `False`, same as deleted slots.
- `_last_pos` is set to `None` on reopen and resolved lazily by `_resolve_last_pos()`
  on the first write.
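
Why the `count_nonzero` reconstruction is always correct, in a runnable NumPy miniature:

```python
import numpy as np

# Unwritten capacity slots and deleted slots are both False in the mask,
# so the True count is exactly the number of live rows -- no separate
# row counter has to be persisted.

capacity = 8
valid = np.zeros(capacity, dtype=bool)   # fresh table: every slot unwritten
valid[:5] = True                         # five rows appended
valid[2] = False                         # one row deleted

n_rows = int(np.count_nonzero(valid))
assert n_rows == 4
```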

### Changes to `src/blosc2/ctable.py`

**Constructor**

New parameters: `urlpath: str | None = None`, `mode: str = "a"`.

Logic:
- `urlpath=None` → `InMemoryTableStorage` → existing behaviour unchanged.
- `urlpath` + existing table + `mode != "w"` → open existing (load schema from
  disk, open file-backed arrays, reconstruct state).
- `urlpath` + `mode="w"` or no existing table → create new (compile schema,
  save to disk, create file-backed arrays).
- Passing `new_data` when opening an existing table raises `ValueError`.
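
The routing rules above, expressed as a pure decision function (names and return values are illustrative, not the actual implementation):

```python
# Storage routing as described in the constructor logic: in-memory when no
# urlpath, open-existing unless mode="w", create-new otherwise.

def route(urlpath, mode, table_exists, new_data):
    if urlpath is None:
        return "in-memory"
    if table_exists and mode != "w":
        if new_data is not None:
            raise ValueError("cannot pass new_data when opening an existing table")
        return "open-existing"
    return "create-new"              # mode="w", or nothing on disk yet

assert route(None, "a", False, None) == "in-memory"
assert route("t.ctable", "a", True, None) == "open-existing"
assert route("t.ctable", "w", True, None) == "create-new"
assert route("t.ctable", "a", False, None) == "create-new"
```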

**`CTable.open(cls, urlpath, *, mode="r")`**

New classmethod for ergonomic read-only access. It opens the table, verifies
`kind="ctable"` in vlmeta, reconstructs the schema from JSON (no dataclass needed),
and returns a fully usable `CTable`.

**Read-only enforcement**

A `_read_only: bool` flag is set from `storage.is_read_only()`. Guards were added
to the top of `append()`, `extend()`, `delete()`, `compact()` — each raises
`ValueError("Table is read-only (opened with mode='r').")`.
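
The guard pattern in miniature. The log describes inline checks at the top of each mutator; an equivalent decorator formulation keeps this sketch short:

```python
from functools import wraps

# One guard, applied to every mutating method: raise before any state changes.

def writable_only(method):
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        if self._read_only:
            raise ValueError("Table is read-only (opened with mode='r').")
        return method(self, *args, **kwargs)
    return wrapper

class Table:
    def __init__(self, read_only=False):
        self._read_only = read_only
        self.rows = []

    @writable_only
    def append(self, row):
        self.rows.append(row)

t = Table(read_only=True)
try:
    t.append(1)                      # rejected before mutating anything
except ValueError as e:
    assert "read-only" in str(e)
assert t.rows == []
```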

**`_make_view(cls, parent, new_valid_rows)`**

New classmethod that constructs a view `CTable` directly via `cls.__new__` without
calling `__init__`. Replaces the old `CTable(self._row_type, expected_size=...)` +
`retval._cols = self._cols` pattern, which was wasteful (it allocated NDArrays and
then discarded them) and broke when `_row_type` is `None` (tables opened via `open()`).
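
The `__new__` trick in isolation (plain lists stand in for NDArray columns):

```python
# Build a view object without running __init__, sharing the parent's
# column buffers and carrying its own tombstone mask.

class Table:
    def __init__(self, n):
        self._cols = {"x": list(range(n))}   # stand-in for NDArray columns
        self._valid_rows = [True] * n

    @classmethod
    def _make_view(cls, parent, new_valid_rows):
        view = cls.__new__(cls)              # skip __init__: no allocation
        view._cols = parent._cols            # share the parent's columns
        view._valid_rows = new_valid_rows    # view-specific mask
        return view

t = Table(4)
v = Table._make_view(t, [True, False, True, False])
assert v._cols is t._cols                    # shared, not copied
assert sum(v._valid_rows) == 2
```

Because `__init__` never runs, the view also works for tables whose `_row_type` is unavailable, such as those reconstructed by `open()`.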

**`schema_dict()`**

No longer needs a local import of `schema_to_dict` — it is now imported at the
module top.

### New test file: `tests/ctable/test_persistency.py`

23 tests covering:

| Test group | What it checks |
|---|---|
| Layout | `_meta.b2frame`, `_valid_rows.b2nd`, `_cols/<name>.b2nd` all exist after creation |
| Metadata | `kind`, `version`, `schema` in vlmeta; column names and order in schema JSON |
| Round-trips | Data survives reopen via both `CTable(Row, urlpath=..., mode="a")` and `CTable.open()` |
| Column order | Preserved exactly from schema JSON, not from filesystem order |
| Constraints | Validation re-enabled after reopen (schema reconstructed from disk) |
| Append/extend/delete after reopen | Mutations visible in subsequent opens |
| `_valid_rows` on disk | Tombstone mask correctly stored and loaded |
| `mode="w"` | Overwrites existing table; subsequent open sees an empty table |
| Read-only | `append`, `extend`, `delete`, `compact` all raise on `mode="r"` |
| Read-only reads | `row[]`, column access, `head()`, `tail()`, `where()` all work |
| Error cases | `FileNotFoundError` for missing path; `ValueError` for wrong kind |
| Column name validation | Empty, `_`-prefixed, `/`-containing names rejected |
| `new_data` guard | `ValueError` when `new_data` is passed to the open-existing path |
| Capacity growth | `_grow()` (resize) works on file-backed arrays and survives reopen |

Total: **158 tests, all passing**.

### New benchmark: `bench/ctable/bench_persistency.py`

Four sections:

1. **`extend()` bulk insert** — in-memory vs file-backed at 1k–1M rows.
   Overhead converges to ~1x at 1M rows (compression dominates, not I/O).
2. **`open()` / reopen time** — ~4–10 ms regardless of table size. Fixed cost:
   opening 3 files (meta, valid_rows, one column) + parsing the schema JSON.
3. **`append()` single-row** — file-backed is ~6x slower per row (~3 ms vs ~0.5 ms).
   Recommendation: batch inserts via `extend()` for persistent tables.
4. **Column `to_numpy()`** — essentially identical between backends (≤1.06x ratio).
   Decompression dominates; file I/O is negligible once data is loaded.

---

## Phase 2 design decisions

**Why direct files instead of TreeStore?**
TreeStore stores snapshots of in-memory arrays; in-place writes to a
TreeStore-retrieved NDArray do not persist after reopen. File-backed NDArrays
created with `urlpath=...` support in-place writes natively. Using direct `.b2nd`
files aligns with how the rest of blosc2 handles persistent arrays.

**Why `blosc2.SChunk` vlmeta for metadata, not JSON files?**
`vlmeta` is compressed and is already part of the blosc2 ecosystem.
`blosc2.open()` works on `.b2frame` files the same way it works on `.b2nd` files,
keeping the open path uniform.

**Why not store `_last_pos` in metadata?**
`_resolve_last_pos()` reconstructs it in O(n_chunks) with no full decompression.
Storing it would mean a metadata write on every `append()` just to update a counter
in the SChunk — not worth the extra I/O.

**Why `_make_view()` instead of calling `__init__`?**
`__init__` now has storage-routing logic and would try to create new NDArrays even
for views (which would immediately be thrown away). `_make_view()` via `__new__` is
explicit and zero-waste.

**Why does `CTable.open()` default to `mode="r"`?**
The most common read-back scenario is inspection or analysis, not modification.
Defaulting to read-only prevents accidental mutations on shared or archived tables.