Commit 457b0ff

Back CTable persistence with TreeStore and materialize it via blosc2.open()
1 parent 0dc8697 commit 457b0ff

7 files changed

Lines changed: 851 additions & 70 deletions

Lines changed: 344 additions & 0 deletions
@@ -0,0 +1,344 @@
# TreeStore Root-Level `CTable` Extension Plan

## Goal

Allow a `CTable` stored as the sole logical object inside a `TreeStore` to be
opened directly via:

```python
table = blosc2.open(urlpath)
```

That is, if a `TreeStore` at `urlpath` carries a recognized root manifest for
`CTable`, `blosc2.open(urlpath)` should return a `CTable` instance instead of a
raw `TreeStore`.

This plan intentionally covers only the simple first round:

- one `CTable` per `TreeStore`
- object root is the store root
- `/_meta` at store root is the manifest

Subtree object roots and multiple tables per store are deferred.

## Background

`TreeStore` now has persistent low-level container metadata through:

- `storage.meta["b2tree"] = {"version": 1}`

That is enough for `blosc2.open()` to recognize the path as a `TreeStore`, but
not enough to know whether the store should materialize as a richer object.

The generic extension contract in `tree_store_extensions.md` introduces:

- `/_meta` as the logical-object manifest for store-backed objects

This plan applies that contract to `CTable`.

## Storage Layout

The persisted root-level `CTable` layout should be:

- `/_meta`
- `/_valid_rows`
- `/_cols/<name>`

Example:

- `/_meta`
- `/_valid_rows`
- `/_cols/id`
- `/_cols/score`
- `/_cols/active`

Rationale:

- `/_meta` stores logical-object manifest data
- `/_valid_rows` stores the row-visibility mask as table data, not metadata
- `/_cols/<name>` stores one persisted column array per field
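
The layout can be expressed as a small key-building helper. This is a minimal sketch: the constant names and `ctable_layout_keys` are hypothetical and not part of blosc2; only the key strings come from the layout described here.

```python
# Hypothetical constants; only the key strings themselves come from the plan.
META_KEY = "/_meta"
VALID_ROWS_KEY = "/_valid_rows"
COLS_PREFIX = "/_cols"


def ctable_layout_keys(column_names):
    """Return the store keys a root-level CTable with these columns would use."""
    keys = [META_KEY, VALID_ROWS_KEY]
    # One persisted array per field, in the order the names are given.
    keys.extend(f"{COLS_PREFIX}/{name}" for name in column_names)
    return keys


print(ctable_layout_keys(["id", "score", "active"]))
```

Keeping the key construction in one place would make it harder for creation and reopen code to disagree about where a column lives.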

## Root Manifest

`/_meta` should be a small persisted `SChunk` used primarily through `vlmeta`.

Initial required manifest fields:

- `kind`
- `version`
- `schema`

Initial `CTable` manifest:

```python
{
    "kind": "ctable",
    "version": 1,
    "schema": {...},
}
```

Recommended concrete writes:

```python
tstore["/_meta"].vlmeta["kind"] = "ctable"
tstore["/_meta"].vlmeta["version"] = 1
tstore["/_meta"].vlmeta["schema"] = schema_payload
```
89+
90+
## Schema Persistence Format
91+
92+
The schema should be stored in:
93+
94+
- `/_meta.vlmeta["schema"]`
95+
96+
The schema document should be JSON-compatible, explicit, and versioned.
97+
98+
Recommended shape:
99+
100+
```python
101+
{
102+
"version": 1,
103+
"columns": [
104+
{
105+
"name": "id",
106+
"py_type": "int",
107+
"spec": {"kind": "int64", "ge": 0},
108+
"default": None,
109+
},
110+
{
111+
"name": "score",
112+
"py_type": "float",
113+
"spec": {"kind": "float64", "ge": 0, "le": 100},
114+
"default": None,
115+
},
116+
{
117+
"name": "active",
118+
"py_type": "bool",
119+
"spec": {"kind": "bool"},
120+
"default": True,
121+
},
122+
],
123+
}
124+
```
125+
126+
Notes:
127+
128+
- `columns` must be an ordered list, not a dict
129+
- column order comes from the schema list
130+
- `TreeStore` iteration order must not be used as schema authority
131+
132+
For the first version, do not duplicate data that can be inspected from the
133+
stored column arrays:
134+
135+
- per-column `cparams`
136+
- per-column `dparams`
137+
- chunk/block layout
138+
- `expected_size`
139+
- compaction settings
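
One way to check the "JSON-compatible" and "ordered list" requirements is a round trip through the standard `json` module. The sketch below is illustrative only: `is_json_compatible` and `column_order` are hypothetical helper names, and the payload is the recommended shape from this section.

```python
import json

# Example payload copied from the recommended shape in this section.
schema_payload = {
    "version": 1,
    "columns": [
        {"name": "id", "py_type": "int",
         "spec": {"kind": "int64", "ge": 0}, "default": None},
        {"name": "score", "py_type": "float",
         "spec": {"kind": "float64", "ge": 0, "le": 100}, "default": None},
        {"name": "active", "py_type": "bool",
         "spec": {"kind": "bool"}, "default": True},
    ],
}


def is_json_compatible(payload):
    """Return True when the payload survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(payload)) == payload
    except (TypeError, ValueError):
        return False


def column_order(payload):
    """Column order is read from the schema list, never from store iteration."""
    return [column["name"] for column in payload["columns"]]


print(is_json_compatible(schema_payload), column_order(schema_payload))
```

A check like this would reject payloads containing sets, tuples, or other values that do not map cleanly onto JSON types.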

## `_valid_rows` Persistence

`/_valid_rows` should be a normal persisted boolean array.

This is correct because:

- it is table data, not metadata
- it may grow large
- it participates in normal row visibility semantics

It should not be folded into `/_meta`.
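
To make the role of the mask concrete, here is a minimal sketch. It assumes a `True` entry marks a visible row, which is an interpretation and not something this plan states, and it uses plain Python lists as stand-ins for the persisted arrays. `visible_rows` is a hypothetical helper.

```python
def visible_rows(column_values, valid_rows):
    """Keep only the values whose mask entry is True (assumed to mean visible)."""
    if len(column_values) != len(valid_rows):
        raise ValueError("column and mask must have the same length")
    return [value for value, is_valid in zip(column_values, valid_rows) if is_valid]


ids = [10, 11, 12, 13]
valid_rows = [True, False, True, True]  # the second row is hidden
print(visible_rows(ids, valid_rows))
```

Because the mask has one entry per row, it grows with the table, which is why it belongs in its own array and not in the manifest.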

## Column Persistence

Each column should be stored as its own persisted array under:

- `/_cols/<name>`

This keeps the physical layout aligned with the internal columnar design and
lets per-column storage details remain attached to the actual persisted array.

## Constructor Semantics

The intended public constructor remains:

```python
table = blosc2.CTable(
    Row,
    urlpath=None,
    mode="a",
    expected_size=1_048_576,
    compact=False,
    validate=True,
)
```

For the persistent path:

- `urlpath is None`:
  - in-memory `CTable`
- `urlpath is not None`:
  - root-level `CTable` persisted on top of a `TreeStore`

Recommended mode behavior:

- `mode="w"`:
  - create a fresh store-root `CTable`
- `mode="a"`:
  - open existing or create new
- `mode="r"`:
  - open existing read-only
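
The mode rules can be summarized as a small decision function. This is a sketch under stated assumptions: `resolve_open_action` and its return labels are hypothetical, and raising when `mode="r"` targets a missing store is an assumption the plan does not spell out.

```python
def resolve_open_action(mode, store_exists):
    """Map a constructor mode to the action taken on the backing store."""
    if mode == "w":
        # A fresh store-root CTable, whether or not one already exists.
        return "create"
    if mode == "a":
        return "open" if store_exists else "create"
    if mode == "r":
        if not store_exists:
            # Assumption: read-only mode cannot create a missing store.
            raise FileNotFoundError("no persisted CTable to open in mode 'r'")
        return "open_readonly"
    raise ValueError(f"unsupported mode: {mode!r}")


print(resolve_open_action("a", store_exists=False))
```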

## `blosc2.open()` Materialization

The root-level dispatch behavior should be:

1. `blosc2.open(urlpath)` detects a `TreeStore`
2. it opens the `TreeStore`
3. it checks for `/_meta`
4. if `/_meta.vlmeta["kind"] == "ctable"`, it materializes `CTable`
5. otherwise it returns the raw `TreeStore`

This preserves the current open layering:

- first detect the low-level container
- then optionally materialize a richer object
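
The two-layer dispatch can be sketched without blosc2 at all. In the stand-in below, a plain dict plays the store, the value under `/_meta` plays that node's `vlmeta`, and `storage_meta` plays the low-level container metadata. `CTableStandIn` and `open_layered` are hypothetical names, not blosc2 API.

```python
class CTableStandIn:
    """Placeholder for the richer object returned by materialization."""

    def __init__(self, store):
        self.store = store


def open_layered(storage_meta, store):
    """Detect the container first, then optionally materialize a CTable."""
    # Layer 1: the low-level marker says "this is a TreeStore".
    if "b2tree" not in storage_meta:
        raise ValueError("not a TreeStore; other open paths are out of scope here")
    # Layer 2: the root manifest may ask for a richer object.
    manifest = store.get("/_meta")
    if isinstance(manifest, dict) and manifest.get("kind") == "ctable":
        return CTableStandIn(store)
    return store


raw = {"/data": [1, 2, 3]}
print(open_layered({"b2tree": {"version": 1}}, raw) is raw)
```

Keeping the second layer as a pure lookup on the manifest means stores without `/_meta` continue to open exactly as before.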

## Suggested Implementation Shape

### Step 1: Add Root Manifest Helpers

Add private helper(s) for root-manifest probing, e.g.:

- `_open_treestore_root_object(store)`
- `_read_treestore_root_manifest(store)`

Responsibilities:

- check whether `/_meta` exists
- open `/_meta`
- validate that it is an `SChunk`
- read `kind` / `version`
- return a manifest payload suitable for dispatch

### Step 2: Extend `blosc2.open()`

In the special-store open path:

- if opening yields a `TreeStore`
- probe the root manifest
- if recognized as `ctable`, return `CTable.open(...)` or equivalent internal
  constructor
- otherwise return the `TreeStore`

This logic should be localized so the generic `open()` path remains easy to
follow.

### Step 3: Add `CTable` Root-Manifest Read/Write Helpers

In the `CTable` persistence layer, add helpers for:

- creating `/_meta`
- writing `kind`
- writing `version`
- writing `schema`
- reading and validating the root manifest

This should be the only place that knows the `CTable` manifest schema.

### Step 4: Wire Creation

When a persistent `CTable` is created:

- create/open the backing `TreeStore`
- create `/_meta`
- write the root manifest
- create `/_valid_rows`
- create `/_cols/<name>` arrays

### Step 5: Wire Reopen

When a persistent `CTable` is reopened:

- read `/_meta.vlmeta["schema"]`
- rebuild the compiled schema
- reopen `/_valid_rows`
- reopen each persisted column from `/_cols/<name>`
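
The reopen sequence can be sketched with a dict standing in for the store. `reopen_table_parts` is a hypothetical helper, rebuilding the compiled schema is not modeled, and failing fast on a missing column is an assumption. The point it illustrates is that column order follows the schema list, not the order in which keys appear in the store.

```python
def reopen_table_parts(store):
    """Return the row mask and the columns, ordered as the schema lists them."""
    schema = store["/_meta"]["schema"]
    names = [column["name"] for column in schema["columns"]]
    missing = [name for name in names if f"/_cols/{name}" not in store]
    if missing:
        # Assumption: a column named in the schema but absent on disk is an error.
        raise KeyError(f"missing persisted columns: {missing}")
    columns = {name: store[f"/_cols/{name}"] for name in names}
    return store["/_valid_rows"], columns


store = {
    "/_cols/score": [9.5, 7.0],  # stored before "id" on purpose
    "/_cols/id": [1, 2],
    "/_valid_rows": [True, True],
    "/_meta": {
        "kind": "ctable",
        "version": 1,
        "schema": {"version": 1, "columns": [{"name": "id"}, {"name": "score"}]},
    },
}
mask, columns = reopen_table_parts(store)
print(list(columns))
```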

### Step 6: Keep Internal Names Reserved

Validation should reject user column names that collide with internal names:

- `_meta`
- `_valid_rows`
- `_cols`

This already aligns with the existing schema compiler reserved-name logic.
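
A reserved-name check might look like the following sketch. `RESERVED_NAMES` and `check_column_names` are hypothetical names used for illustration; the existing schema compiler logic is not reproduced here.

```python
# The three internal names listed in this step.
RESERVED_NAMES = frozenset({"_meta", "_valid_rows", "_cols"})


def check_column_names(names):
    """Reject user column names that collide with internal store names."""
    clashes = sorted(set(names) & RESERVED_NAMES)
    if clashes:
        raise ValueError(f"reserved column names: {', '.join(clashes)}")
    return list(names)


print(check_column_names(["id", "score", "active"]))
```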

## Validation Rules

For `CTable` root-manifest detection:

- if `/_meta` does not exist:
  - not a persisted `CTable`
- if `/_meta` exists but is malformed:
  - raise a clear error on attempted `CTable` materialization
- if `kind != "ctable"`:
  - return raw `TreeStore`
- if `kind == "ctable"` but required fields are missing:
  - raise a clear error

Recommended required fields for version 1:

- `kind`
- `version`
- `schema`
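
These rules can be collected into one classification function. The sketch below uses a dict as the store and treats "malformed" as "the manifest is not a mapping", which is one possible reading of the rule. `classify_root` and `ManifestError` are hypothetical names.

```python
REQUIRED_FIELDS = ("kind", "version", "schema")


class ManifestError(ValueError):
    """Raised when a root manifest cannot be used to materialize a CTable."""


def classify_root(store):
    """Return "ctable" or "treestore", raising on an unusable manifest."""
    if "/_meta" not in store:
        return "treestore"  # not a persisted CTable
    manifest = store["/_meta"]
    if not isinstance(manifest, dict):
        # Assumption: a non-mapping manifest counts as malformed.
        raise ManifestError("/_meta exists but its manifest is malformed")
    if manifest.get("kind") != "ctable":
        return "treestore"  # unknown kinds fall back to the raw store
    missing = [field for field in REQUIRED_FIELDS if field not in manifest]
    if missing:
        raise ManifestError(f"ctable manifest is missing fields: {missing}")
    return "ctable"


print(classify_root({"/_meta": {"kind": "ctable", "version": 1, "schema": {}}}))
```

Raising a dedicated exception type would let callers distinguish a broken table from an ordinary store.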

## Deferred Scope

This plan intentionally does not cover:

- multiple `CTable` objects in one `TreeStore`
- subtree object roots such as `/users/_meta`
- automatic materialization when indexing a subtree from `TreeStore`
- `Ref` support for store-subtree logical objects
- schema evolution beyond append-only behavior

These should be handled in later phases after the root-level path is stable.

## Tests

Add coverage for:

- create persistent root-level `CTable`
- reopen via `blosc2.open(urlpath)` and get `CTable`
- reopen via `CTable.open(urlpath, mode="r")`
- root manifest present and schema readable from `/_meta.vlmeta`
- store with no `/_meta` still opens as raw `TreeStore`
- store with unknown root-manifest `kind` still opens as raw `TreeStore`
- malformed `CTable` manifest raises clear error
- append rows after reopen
- read-only reopen rejects writes

## Recommended Implementation Order

1. write root-manifest probe helpers for `TreeStore`
2. extend `blosc2.open()` with root-manifest dispatch
3. add `CTable` manifest read/write helpers
4. wire persistent create/open around the manifest
5. add tests for dispatch and round-trip

## Summary

The first `TreeStore` extension should treat root `/_meta` as the logical
manifest for the whole store.

For `CTable`, this yields a simple and coherent open story:

- low-level metadata says "this is a `TreeStore`"
- root `/_meta` says "this store materializes as a `CTable`"
- `blosc2.open(urlpath)` returns the richer object directly

This keeps the first implementation small while staying compatible with a later
generalization to subtree object roots.
