
Commit ee1d0c4

committed
persistency half way done
1 parent 4ce8296 commit ee1d0c4

4 files changed

Lines changed: 493 additions & 65 deletions

File tree

plans/ctable-implementation-log.md

Lines changed: 198 additions & 8 deletions
@@ -1,11 +1,12 @@

# CTable Implementation Log

This document records everything implemented across the CTable feature:
the `ctable-schema.md` redesign (schema, validation, serialization, optimizations)
and the `ctable-persistency.md` phase (file-backed storage, `open()`, read-only mode).

---

## Phase 1 — Schema redesign (`ctable-schema.md`)

The goal was to replace the original Pydantic-`BaseModel`-based schema API with a
**dataclass-first schema API** using declarative spec objects (`b2.int64()`,
@@ -200,11 +201,11 @@ All tests live in `tests/ctable/`.

| `test_extend_delete.py` | Interleaved extend/delete cycles, mask correctness, resize behavior |
| `test_row_logic.py` | Row indexer (int/slice/list), views, chained views |

Total: **135 tests, all passing** (after Phase 1 + optimizations).

---

## Phase 1 design decisions

**Why two validation paths?**
`append()` handles one row at a time — Pydantic is fast enough and also performs
@@ -220,6 +221,195 @@ Existing code using `class RowModel(BaseModel)` continues to work without

modification. The adapter is not on the critical path for new code.

**Why `schema_to_dict` / `schema_from_dict` now?**
Persistence requires a self-contained schema representation that survives without
the original Python dataclass. Establishing the serialization format before
persistence was built ensured the format was stable before anything depended on it.

---

## Phase 1 optimizations (post-schema)

Several performance improvements were made after the schema work was complete:

**`_last_pos` cache**
Added `_last_pos: int | None` to `CTable`. Tracks the physical index of the next
write slot so that `append()` and `extend()` no longer need to scan backward through
chunk metadata on every call. Set to `None` after any deletion (triggers one lazy
recalculation on the next write). Set to `_n_rows` after `compact()`. Eliminated a
backward O(n_chunks) scan per insert.
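
The caching pattern, as a stand-alone toy (illustrative only; the real code walks chunk metadata rather than a plain NumPy mask):

```python
# Illustrative toy, not CTable internals: shows only the lazy-invalidation pattern.
import numpy as np

class LastPosToy:
    def __init__(self, capacity: int):
        self.valid = np.zeros(capacity, dtype=bool)   # tombstone mask
        self._last_pos: int | None = 0                # next physical write slot

    def _resolve_last_pos(self) -> int:
        # Runs at most once after a delete/reopen invalidated the cache.
        if self._last_pos is None:
            used = np.flatnonzero(self.valid)
            self._last_pos = int(used[-1]) + 1 if used.size else 0
        return self._last_pos

    def append_slot(self) -> int:
        pos = self._resolve_last_pos()
        self.valid[pos] = True
        self._last_pos = pos + 1                      # no backward scan on the next write
        return pos

    def delete(self, pos: int) -> None:
        self.valid[pos] = False
        self._last_pos = None                         # invalidate; recomputed lazily
```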

**`_grow()` helper**
Extracted the capacity-doubling logic into `_grow()`. Removes duplication between
`append()` and `extend()`.
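
A minimal sketch of the doubling strategy on a 1-D column NDArray (the helper name below is illustrative, not the actual `_grow()` body):

```python
# Sketch only: capacity doubling for a 1-D blosc2 NDArray column.
def grow_if_needed(col, needed: int) -> None:
    cap = col.shape[0]
    if needed > cap:
        new_cap = max(needed, 2 * cap if cap else 1)
        col.resize((new_cap,))   # NDArray.resize; file-backed arrays grow on disk too
```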

**In-place delete**
`delete()` now writes the updated boolean array back with `self._valid_rows[:] =
valid_rows_np` (in-place slice assignment) instead of creating a new NDArray.
Avoids a full allocation on each delete.

**`head()` / `tail()` refactored**
Both methods now reuse `_find_physical_index()` instead of containing their own
chunk-walk loops.

**`_make_view()` classmethod**
Added to construct view CTables without going through `__init__`. Avoids
allocating and immediately discarding NDArrays that were never used.

**`_NumericSpec` mixin + new spec types**
All numeric specs (`int8` through `uint64`, `float32`, `float64`) share a common
`_NumericSpec` mixin for `ge`/`gt`/`le`/`lt` constraint handling, eliminating
boilerplate. New specs added: `int8`, `int16`, `int32`, `uint8`, `uint16`,
`uint32`, `uint64`, `float32`.
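
A sketch of the shared-constraint idea (the class below is illustrative; the real spec classes live in blosc2 and differ in detail):

```python
# Illustrative only: one mixin holds the four bound checks shared by all numeric specs.
from dataclasses import dataclass

@dataclass
class NumericBoundsSketch:
    ge: float | None = None
    gt: float | None = None
    le: float | None = None
    lt: float | None = None

    def in_bounds(self, value: float) -> bool:
        if self.ge is not None and not value >= self.ge:
            return False
        if self.gt is not None and not value > self.gt:
            return False
        if self.le is not None and not value <= self.le:
            return False
        if self.lt is not None and not value < self.lt:
            return False
        return True
```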

**String vectorized validation**
`validate_column_values` uses `np.char.str_len()` (true C-level) for `U`/`S` dtype
arrays instead of `np.vectorize(len)` (a Python loop in disguise). The length check
was also extracted into `_validate_string_lengths()` to reduce cyclomatic complexity.
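
The NumPy calls in question, shown in isolation (not the library code itself):

```python
import numpy as np

names = np.array(["alice", "bob", "charlotte"], dtype="U16")

lengths = np.char.str_len(names)         # vectorized, C-level: array([5, 3, 9])
lengths_slow = np.vectorize(len)(names)  # same result, but a Python-level loop per element

too_long = lengths > 8                   # boolean mask for a max-length constraint check
```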

**Column name validation**
`compile_schema` now calls `_validate_column_name()` on every field. Rejects names
that are empty, start with `_`, or contain `/` — rules that apply equally to
in-memory and persistent tables.
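
A sketch of the three rules (the function body and the rationales in its messages are illustrative, not the actual helper):

```python
# Sketch of the naming rules described above, not the real helper body.
def validate_column_name_sketch(name: str) -> None:
    if not name:
        raise ValueError("column name must not be empty")
    if name.startswith("_"):
        # assumed rationale: '_'-prefixed names are reserved (e.g. _meta, _valid_rows, _cols)
        raise ValueError(f"column name {name!r} must not start with '_'")
    if "/" in name:
        # assumed rationale: column names become file names under _cols/
        raise ValueError(f"column name {name!r} must not contain '/'")
```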

---

## Phase 2 — Persistency (`ctable-persistency.md`)

### New file: `src/blosc2/ctable_storage.py`

A storage-backend abstraction that keeps all file I/O out of `ctable.py`.

**`TableStorage`** — interface class defining:
`create_column`, `open_column`, `create_valid_rows`, `open_valid_rows`,
`save_schema`, `load_schema`, `table_exists`, `is_read_only`.
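
Sketched as an abstract base class (the method names are from this log; the signatures are assumptions):

```python
# Sketch of the TableStorage interface: names from the log, signatures assumed.
from abc import ABC, abstractmethod

class TableStorageSketch(ABC):
    @abstractmethod
    def create_column(self, name, dtype, capacity): ...

    @abstractmethod
    def open_column(self, name): ...

    @abstractmethod
    def create_valid_rows(self, capacity): ...

    @abstractmethod
    def open_valid_rows(self): ...

    @abstractmethod
    def save_schema(self, schema_dict): ...

    @abstractmethod
    def load_schema(self): ...

    @abstractmethod
    def table_exists(self): ...

    @abstractmethod
    def is_read_only(self): ...
```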

**`InMemoryTableStorage`** — trivial implementation that creates plain in-memory
`blosc2.NDArray` objects and is a no-op for `save_schema`. Used when `urlpath` is
not provided (existing default behaviour, unchanged).

**`FileTableStorage`** — file-backed implementation.

Disk layout:

```
<urlpath>/
    _meta.b2frame        ← blosc2.SChunk; vlmeta holds kind, version, schema JSON
    _valid_rows.b2nd     ← file-backed boolean NDArray (tombstone mask)
    _cols/
        <name>.b2nd      ← one file-backed NDArray per column
```

Key implementation notes:

- `save_schema` always opens `_meta.b2frame` with `mode="w"` (create path only).
- `load_schema` / `check_kind` use `blosc2.open()` (not `blosc2.SChunk(..., mode="a")`),
  which is the correct API for reopening an existing SChunk file.
- File-backed NDArrays (`urlpath=..., mode="w"`) support in-place writes
  (`col[pos] = value`, `col[start:end] = arr`) that persist immediately (see the
  snippet after this list). This is why resize (`_grow()`), append, extend, and
  delete all work transparently on persistent tables.
- `_n_rows` on reopen is reconstructed as `blosc2.count_nonzero(valid_rows)`, which
  is always correct because unwritten slots are `False`, same as deleted slots.
- `_last_pos` is set to `None` on reopen and resolved lazily by `_resolve_last_pos()`
  on the first write.
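
The snippet below demonstrates the in-place-write property with plain blosc2 calls (stand-alone example with an assumed file name; not CTable code):

```python
import numpy as np
import blosc2

# A file-backed NDArray: writes through __setitem__ go straight to the .b2nd file.
col = blosc2.zeros((8,), dtype=np.int64, urlpath="demo_col.b2nd", mode="w")
col[0:3] = np.array([10, 20, 30])

# Reopen from disk: the in-place writes are already persisted, no explicit save step.
col2 = blosc2.open("demo_col.b2nd")
print(col2[0:3])   # [10 20 30]
```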

### Changes to `src/blosc2/ctable.py`

**Constructor**

New parameters: `urlpath: str | None = None`, `mode: str = "a"`.

Logic (sketched below):

- `urlpath=None` → `InMemoryTableStorage` → existing behaviour unchanged.
- `urlpath` + existing table + `mode != "w"` → open existing (load schema from
  disk, open file-backed arrays, reconstruct state).
- `urlpath` + `mode="w"` or no existing table → create new (compile schema,
  save to disk, create file-backed arrays).
- Passing `new_data` when opening an existing table raises `ValueError`.
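
A pseudocode sketch of that routing (the helper and its return values are illustrative, not the actual `__init__` body):

```python
import os

def choose_path(urlpath, mode, new_data):
    # Sketch of the constructor routing, illustrative names only.
    if urlpath is None:
        return "in-memory"                                   # InMemoryTableStorage
    exists = os.path.exists(os.path.join(urlpath, "_meta.b2frame"))
    if exists and mode != "w":
        if new_data is not None:
            raise ValueError("new_data cannot be passed when opening an existing table")
        return "open-existing"                               # load schema, open arrays
    return "create-new"                                      # compile + save schema, create arrays
```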

**`CTable.open(cls, urlpath, *, mode="r")`**

New classmethod for ergonomic read-only access. Opens the table, verifies
`kind="ctable"` in vlmeta, reconstructs schema from JSON (no dataclass needed),
returns a fully usable `CTable`.

**Read-only enforcement**

`_read_only: bool` flag set from `storage.is_read_only()`. Guards added to the top
of `append()`, `extend()`, `delete()`, `compact()` — each raises
`ValueError("Table is read-only (opened with mode='r').")`.
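
The guard pattern, sketched (the helper name is hypothetical; the message is the one quoted above):

```python
def _check_writable(self) -> None:
    # Hypothetical helper name; called at the top of every mutating method.
    if self._read_only:
        raise ValueError("Table is read-only (opened with mode='r').")
```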

**`_make_view(cls, parent, new_valid_rows)`**

New classmethod that constructs a view `CTable` directly via `cls.__new__` without
calling `__init__`. Replaces the old `CTable(self._row_type, expected_size=...)` +
`retval._cols = self._cols` pattern, which was wasteful (allocated NDArrays then
discarded them) and broke when `_row_type` is `None` (tables opened via `open()`).
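
A sketch of the `__new__`-based construction (the attribute set shown is illustrative, not exhaustive):

```python
class CTableSketch:
    # Sketch only; the real attribute list may differ.
    @classmethod
    def _make_view(cls, parent, new_valid_rows):
        view = cls.__new__(cls)              # skip __init__: no storage routing, no allocations
        view._cols = parent._cols            # share the parent's column NDArrays
        view._valid_rows = new_valid_rows    # the filtered tombstone mask defines the view
        view._row_type = parent._row_type    # may be None for tables opened via open()
        view._read_only = True               # where() views are read-only
        return view
```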

**`schema_dict()`**

No longer needs a local import of `schema_to_dict` — now imported at the module top.

### New test file: `tests/ctable/test_persistency.py`

23 tests covering:

| Test group | What it checks |
|---|---|
| Layout | `_meta.b2frame`, `_valid_rows.b2nd`, `_cols/<name>.b2nd` all exist after creation |
| Metadata | `kind`, `version`, `schema` in vlmeta; column names and order in schema JSON |
| Round-trips | Data survives reopen via both `CTable(Row, urlpath=..., mode="a")` and `CTable.open()` |
| Column order | Preserved exactly from schema JSON, not from filesystem order |
| Constraints | Validation re-enabled after reopen (schema reconstructed from disk) |
| Append/extend/delete after reopen | Mutations visible in subsequent opens |
| `_valid_rows` on disk | Tombstone mask correctly stored and loaded |
| `mode="w"` | Overwrites existing table; subsequent open sees empty table |
| Read-only | `append`, `extend`, `delete`, `compact` all raise on `mode="r"` |
| Read-only reads | `row[]`, column access, `head()`, `tail()`, `where()` all work |
| Error cases | `FileNotFoundError` for missing path; `ValueError` for wrong kind |
| Column name validation | Empty, `_`-prefixed, `/`-containing names rejected |
| `new_data` guard | `ValueError` when `new_data` passed to open-existing path |
| Capacity growth | `_grow()` (resize) works on file-backed arrays and survives reopen |

Total: **158 tests, all passing**.

### New benchmark: `bench/ctable/bench_persistency.py`

Four sections:

1. **`extend()` bulk insert** — in-memory vs file-backed at 1k–1M rows.
   Overhead converges to ~1x at 1M rows (compression dominates, not I/O).
2. **`open()` / reopen time** — ~4–10 ms regardless of table size. Fixed cost:
   open 3 files (meta, valid_rows, one column) + parse schema JSON.
3. **`append()` single-row** — file-backed is ~6x slower per row (~3 ms vs ~0.5 ms).
   Recommendation: batch inserts via `extend()` for persistent tables.
4. **Column `to_numpy()`** — essentially identical between backends (≤1.06x ratio).
   Decompression dominates; file I/O is negligible once data is loaded.

---

## Phase 2 design decisions

**Why direct files instead of TreeStore?**
TreeStore stores snapshots of in-memory arrays. In-place writes to a
TreeStore-retrieved NDArray do not persist after reopen. File-backed NDArrays
created with `urlpath=...` support in-place writes natively. Using direct `.b2nd`
files aligns with how the rest of blosc2 handles persistent arrays.

**Why `blosc2.SChunk` vlmeta for metadata, not JSON files?**
`vlmeta` is compressed and is already part of the blosc2 ecosystem.
`blosc2.open()` works on `.b2frame` files the same way it works on `.b2nd` files,
keeping the open path uniform.

**Why not store `_last_pos` in metadata?**
`_resolve_last_pos()` reconstructs it in O(n_chunks) with no full decompression.
Storing it would create a write on every `append()` just to update a counter in the
SChunk — not worth the extra I/O.

**Why `_make_view()` instead of calling `__init__`?**
`__init__` now has storage-routing logic and would try to create new NDArrays even
for views (which immediately get thrown away). `_make_view()` via `__new__` is
explicit and zero-waste.

**Why `CTable.open()` defaults to `mode="r"`?**
The most common read-back scenario is inspection or analysis, not modification.
Defaulting to read-only prevents accidental mutations on shared or archived tables.

plans/ctable-user-guide.md

Lines changed: 110 additions & 2 deletions
@@ -77,7 +77,7 @@ class Row:

```python
import blosc2 as b2

# Empty table (in-memory)
t = b2.CTable(Row)

# Table pre-loaded with data
@@ -96,6 +96,62 @@ t = b2.CTable(Row, compact=True)

t = b2.CTable(Row, cparams={"codec": b2.Codec.ZSTD, "clevel": 5})
```

### Persistent tables

Pass `urlpath` to store the table on disk. The table root is a directory containing
compressed array files — everything is handled automatically.

```python
# Create a new persistent table (overwrites any existing table at that path)
t = b2.CTable(Row, urlpath="people", mode="w", expected_size=1_000_000)
t.extend([(i, float(i % 100), True) for i in range(10_000)])

# Open an existing persistent table for reading and writing
t = b2.CTable(Row, urlpath="people", mode="a")
t.append((99999, 50.0, True))

# Open read-only (default for CTable.open)
t = b2.CTable.open("people")             # mode="r" by default
t = b2.CTable.open("people", mode="r")   # explicit

# Open read/write via the classmethod
t = b2.CTable.open("people", mode="a")
```

`mode` values:

| mode | behaviour |
|---|---|
| `"w"` | create (overwrite if the path already exists) |
| `"a"` | open existing or create new |
| `"r"` | open existing read-only |

In-memory tables (`urlpath=None`, the default) behave exactly as before — no
`mode` or path handling is involved.

### Disk layout

```
people/
    _meta.b2frame       ← schema JSON, kind marker, version (in vlmeta)
    _valid_rows.b2nd    ← tombstone mask
    _cols/
        id.b2nd
        score.b2nd
        active.b2nd
```

You can inspect the raw metadata:

```python
import blosc2, json

meta = blosc2.open("people/_meta.b2frame")
print(meta.vlmeta["kind"])     # "ctable"
print(meta.vlmeta["version"])  # 1
schema = json.loads(meta.vlmeta["schema"])
```

### Per-column storage options

```python
@@ -270,6 +326,25 @@ t = b2.CTable(Row, compact=True)

---

## Read-only mode

When a table is opened with `mode="r"` (or via `CTable.open()` without specifying
mode), all mutating operations raise immediately:

```python
t = b2.CTable.open("people")       # read-only

t.append((1, 50.0, True))          # ValueError: Table is read-only
t.extend([(1, 50.0, True)])        # ValueError: Table is read-only
t.delete(0)                        # ValueError: Table is read-only
t.compact()                        # ValueError: Table is read-only
```

All read operations work normally: `row[]`, column access, `head()`, `tail()`,
`where()`, `len()`, `info()`, `schema_dict()`.

---

## Filtering

`where()` applies a boolean expression and returns a read-only view:
@@ -359,7 +434,7 @@ class Measurement:

    valid: bool = b2.field(b2.bool(), default=True)


# Create and populate (in-memory)
t = b2.CTable(Measurement, expected_size=10_000)
t.extend([(i, float(i % 200 - 100), i % 3 != 0) for i in range(5000)])

@@ -376,3 +451,36 @@ if invalid_indices:

t.info()
print(t.schema_dict())
```

## Persistency example

```python
from dataclasses import dataclass
import blosc2 as b2


@dataclass
class Measurement:
    sensor_id: int = b2.field(b2.int64(ge=0))
    value: float = b2.field(b2.float64(ge=-1000, le=1000), default=0.0)
    valid: bool = b2.field(b2.bool(), default=True)


# --- Session 1: create and populate ---
t = b2.CTable(Measurement, urlpath="sensors", mode="w", expected_size=100_000)
t.extend([(i, float(i % 200 - 100), i % 3 != 0) for i in range(50_000)])
print(f"Saved {len(t)} rows to disk")
# Table is automatically persisted — no explicit save() needed.

# --- Session 2: reopen and query ---
t = b2.CTable.open("sensors")  # read-only by default
hot = t.where(t["value"] > 50)
print(f"Hot readings: {len(hot)}")
arr = t["sensor_id"].to_numpy()
print(f"First 5 sensor IDs: {arr[:5]}")

# --- Session 3: reopen and append more data ---
t = b2.CTable(Measurement, urlpath="sensors", mode="a")
t.extend([(50_000 + i, float(i), True) for i in range(1_000)])
print(f"Total rows: {len(t)}")
```
