| 1 | +# TreeStore Root-Level `CTable` Extension Plan |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Allow a `CTable` stored as the sole logical object inside a `TreeStore` to be |
| 6 | +opened directly via: |
| 7 | + |
| 8 | +```python |
| 9 | +table = blosc2.open(urlpath) |
| 10 | +``` |
| 11 | + |
| 12 | +That is, if a `TreeStore` at `urlpath` carries a recognized root manifest for |
| 13 | +`CTable`, `blosc2.open(urlpath)` should return a `CTable` instance instead of a |
| 14 | +raw `TreeStore`. |
| 15 | + |
| 16 | +This plan intentionally covers only the simple first round: |
| 17 | + |
| 18 | +- one `CTable` per `TreeStore` |
| 19 | +- object root is the store root |
| 20 | +- `/_meta` at store root is the manifest |
| 21 | + |
| 22 | +Subtree object roots and multiple tables per store are deferred. |
| 23 | + |
| 24 | +## Background |
| 25 | + |
| 26 | +`TreeStore` now has persistent low-level container metadata through: |
| 27 | + |
| 28 | +- `storage.meta["b2tree"] = {"version": 1}` |
| 29 | + |
| 30 | +That is enough for `blosc2.open()` to recognize the path as a `TreeStore`, but |
| 31 | +not enough to know whether the store should materialize as a richer object. |
| 32 | + |
| 33 | +The generic extension contract in [tree_store_extensions.md](/Users/faltet/blosc/python-blosc2/tree_store_extensions.md) |
| 34 | +introduces: |
| 35 | + |
| 36 | +- `/_meta` as the logical-object manifest for store-backed objects |
| 37 | + |
| 38 | +This plan applies that contract to `CTable`. |
| 39 | + |
| 40 | +## Storage Layout |
| 41 | + |
| 42 | +The persisted root-level `CTable` layout should be: |
| 43 | + |
| 44 | +- `/_meta` |
| 45 | +- `/_valid_rows` |
| 46 | +- `/_cols/<name>` |
| 47 | + |
| 48 | +Example: |
| 49 | + |
| 50 | +- `/_meta` |
| 51 | +- `/_valid_rows` |
| 52 | +- `/_cols/id` |
| 53 | +- `/_cols/score` |
| 54 | +- `/_cols/active` |
| 55 | + |
| 56 | +Rationale: |
| 57 | + |
| 58 | +- `/_meta` stores logical-object manifest data |
| 59 | +- `/_valid_rows` stores the boolean row-visibility mask as array data, not metadata
| 60 | +- `/_cols/<name>` stores one persisted column array per field |
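As a rough illustration of the layout above, the store keys can be derived from the ordered column names alone. The helper name `ctable_layout_keys` is hypothetical and not part of blosc2; this is only a sketch of the key naming scheme described in this section.

```python
def ctable_layout_keys(column_names):
    """Return the store keys a root-level CTable would occupy.

    Hypothetical helper for illustration only; it mirrors the layout
    described in this plan and does not touch any real store.
    """
    keys = ["/_meta", "/_valid_rows"]
    # One persisted array per column, in schema order.
    keys.extend(f"/_cols/{name}" for name in column_names)
    return keys


if __name__ == "__main__":
    for key in ctable_layout_keys(["id", "score", "active"]):
        print(key)
```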
| 61 | + |
| 62 | +## Root Manifest |
| 63 | + |
| 64 | +`/_meta` should be a small persisted `SChunk` used primarily through `vlmeta`. |
| 65 | + |
| 66 | +Initial required manifest fields: |
| 67 | + |
| 68 | +- `kind` |
| 69 | +- `version` |
| 70 | +- `schema` |
| 71 | + |
| 72 | +Initial `CTable` manifest: |
| 73 | + |
| 74 | +```python |
| 75 | +{ |
| 76 | + "kind": "ctable", |
| 77 | + "version": 1, |
| 78 | + "schema": {...}, |
| 79 | +} |
| 80 | +``` |
| 81 | + |
| 82 | +Recommended concrete writes: |
| 83 | + |
| 84 | +```python |
| 85 | +tstore["/_meta"].vlmeta["kind"] = "ctable" |
| 86 | +tstore["/_meta"].vlmeta["version"] = 1 |
| 87 | +tstore["/_meta"].vlmeta["schema"] = schema_payload |
| 88 | +``` |
| 89 | + |
| 90 | +## Schema Persistence Format |
| 91 | + |
| 92 | +The schema should be stored in: |
| 93 | + |
| 94 | +- `/_meta.vlmeta["schema"]` |
| 95 | + |
| 96 | +The schema document should be JSON-compatible, explicit, and versioned. |
| 97 | + |
| 98 | +Recommended shape: |
| 99 | + |
| 100 | +```python |
| 101 | +{ |
| 102 | + "version": 1, |
| 103 | + "columns": [ |
| 104 | + { |
| 105 | + "name": "id", |
| 106 | + "py_type": "int", |
| 107 | + "spec": {"kind": "int64", "ge": 0}, |
| 108 | + "default": None, |
| 109 | + }, |
| 110 | + { |
| 111 | + "name": "score", |
| 112 | + "py_type": "float", |
| 113 | + "spec": {"kind": "float64", "ge": 0, "le": 100}, |
| 114 | + "default": None, |
| 115 | + }, |
| 116 | + { |
| 117 | + "name": "active", |
| 118 | + "py_type": "bool", |
| 119 | + "spec": {"kind": "bool"}, |
| 120 | + "default": True, |
| 121 | + }, |
| 122 | + ], |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +Notes: |
| 127 | + |
| 128 | +- `columns` must be an ordered list, not a dict |
| 129 | +- column order comes from the schema list |
| 130 | +- `TreeStore` iteration order must not be used as schema authority |
| 131 | + |
| 132 | +For the first version, do not duplicate data that can be inspected from the |
| 133 | +stored column arrays: |
| 134 | + |
| 135 | +- per-column `cparams` |
| 136 | +- per-column `dparams` |
| 137 | +- chunk/block layout |
| 138 | +- `expected_size` |
| 139 | +- compaction settings |
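One way to check that a schema document meets the "JSON-compatible" and "ordered list" requirements is to round-trip it through the standard `json` module and read the column order back from the list. This is a sketch under that assumption: `json` is used here only as a compatibility check, not as a statement about how `vlmeta` serializes values, and `column_order` is a hypothetical helper.

```python
import json

# Schema document in the shape recommended above.
schema = {
    "version": 1,
    "columns": [
        {"name": "id", "py_type": "int",
         "spec": {"kind": "int64", "ge": 0}, "default": None},
        {"name": "score", "py_type": "float",
         "spec": {"kind": "float64", "ge": 0, "le": 100}, "default": None},
        {"name": "active", "py_type": "bool",
         "spec": {"kind": "bool"}, "default": True},
    ],
}


def column_order(schema_doc):
    """Column order comes from the schema list, never from store iteration."""
    return [col["name"] for col in schema_doc["columns"]]


# A JSON-compatible document survives a dumps/loads round trip unchanged.
restored = json.loads(json.dumps(schema))

if __name__ == "__main__":
    print(restored == schema)
    print(column_order(restored))
```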
| 140 | + |
| 141 | +## `_valid_rows` Persistence |
| 142 | + |
| 143 | +`/_valid_rows` should be a normal persisted boolean array. |
| 144 | + |
| 145 | +This is correct because: |
| 146 | + |
| 147 | +- it is table data, not metadata |
| 148 | +- it may grow large |
| 149 | +- it participates in normal row visibility semantics |
| 150 | + |
| 151 | +It should not be folded into `/_meta`. |
| 152 | + |
| 153 | +## Column Persistence |
| 154 | + |
| 155 | +Each column should be stored as its own persisted array under: |
| 156 | + |
| 157 | +- `/_cols/<name>` |
| 158 | + |
| 159 | +This keeps the physical layout aligned with the internal columnar design and |
| 160 | +lets per-column storage details remain attached to the actual persisted array. |
| 161 | + |
| 162 | +## Constructor Semantics |
| 163 | + |
| 164 | +The intended public constructor remains: |
| 165 | + |
| 166 | +```python |
| 167 | +table = blosc2.CTable( |
| 168 | + Row, |
| 169 | + urlpath=None, |
| 170 | + mode="a", |
| 171 | + expected_size=1_048_576, |
| 172 | + compact=False, |
| 173 | + validate=True, |
| 174 | +) |
| 175 | +``` |
| 176 | + |
| 177 | +The `urlpath` argument selects between in-memory and persistent storage:
| 178 | + |
| 179 | +- `urlpath is None`: |
| 180 | + - in-memory `CTable` |
| 181 | +- `urlpath is not None`: |
| 182 | + - root-level `CTable` persisted on top of a `TreeStore` |
| 183 | + |
| 184 | +Recommended mode behavior: |
| 185 | + |
| 186 | +- `mode="w"`: |
| 187 | + - create a fresh store-root `CTable` |
| 188 | +- `mode="a"`: |
| 189 | + - open existing or create new |
| 190 | +- `mode="r"`: |
| 191 | + - open existing read-only |
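The mode behavior above can be sketched as a small decision function. The name `resolve_open_action` and the returned action strings are hypothetical. Raising `FileNotFoundError` for `mode="r"` on a missing store is an assumption; the plan only says that `"r"` opens an existing table read-only.

```python
def resolve_open_action(mode, store_exists):
    """Map (mode, store_exists) to the action a persistent CTable would take.

    Hypothetical helper; the action names are illustrative only.
    """
    if mode == "w":
        # Always start from a fresh store-root CTable.
        return "create"
    if mode == "a":
        # Open the existing table, or create a new one if none exists.
        return "open" if store_exists else "create"
    if mode == "r":
        if not store_exists:
            # Assumption: read-only mode cannot create anything.
            raise FileNotFoundError("no existing CTable to open read-only")
        return "open_readonly"
    raise ValueError(f"unsupported mode: {mode!r}")


if __name__ == "__main__":
    print(resolve_open_action("a", store_exists=False))
    print(resolve_open_action("a", store_exists=True))
```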
| 192 | + |
| 193 | +## `blosc2.open()` Materialization |
| 194 | + |
| 195 | +The root-level dispatch behavior should be: |
| 196 | + |
| 197 | +1. `blosc2.open(urlpath)` detects a `TreeStore` |
| 198 | +2. it opens the `TreeStore` |
| 199 | +3. it checks for `/_meta` |
| 200 | +4. if `/_meta.vlmeta["kind"] == "ctable"`, it materializes `CTable` |
| 201 | +5. otherwise it returns the raw `TreeStore` |
| 202 | + |
| 203 | +This preserves the current open layering: |
| 204 | + |
| 205 | +- first detect the low-level container |
| 206 | +- then optionally materialize a richer object |
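The dispatch steps above can be sketched with plain Python stand-ins. `FakeTreeStore`, `FakeMeta`, `FakeCTable`, and `materialize_root_object` are all hypothetical names used only to show the control flow; they are not blosc2 classes, and `vlmeta` is modeled as a plain dict.

```python
class FakeMeta:
    """Stand-in for the `/_meta` node; `vlmeta` is modeled as a dict."""

    def __init__(self, vlmeta):
        self.vlmeta = vlmeta


class FakeTreeStore:
    """Dict-backed stand-in for an already-opened TreeStore."""

    def __init__(self, nodes):
        self._nodes = nodes

    def __contains__(self, key):
        return key in self._nodes

    def __getitem__(self, key):
        return self._nodes[key]


class FakeCTable:
    """Stand-in for the richer object built on top of the store."""

    def __init__(self, store):
        self.store = store


def materialize_root_object(store):
    """Return a richer object if the root manifest is recognized."""
    if "/_meta" not in store:
        return store  # no manifest: raw TreeStore
    kind = store["/_meta"].vlmeta.get("kind")
    if kind == "ctable":
        return FakeCTable(store)
    return store  # unknown kind: raw TreeStore


if __name__ == "__main__":
    table_store = FakeTreeStore({"/_meta": FakeMeta({"kind": "ctable"})})
    print(type(materialize_root_object(table_store)).__name__)
    print(type(materialize_root_object(FakeTreeStore({}))).__name__)
```

Keeping the probe in one function like this matches the layering described above: container detection happens first, and materialization is an optional second step.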
| 207 | + |
| 208 | +## Suggested Implementation Shape |
| 209 | + |
| 210 | +### Step 1: Add Root Manifest Helpers |
| 211 | + |
| 212 | +Add private helper(s) for root-manifest probing, e.g.: |
| 213 | + |
| 214 | +- `_open_treestore_root_object(store)` |
| 215 | +- `_read_treestore_root_manifest(store)` |
| 216 | + |
| 217 | +Responsibilities: |
| 218 | + |
| 219 | +- check whether `/_meta` exists |
| 220 | +- open `/_meta` |
| 221 | +- validate that it is an `SChunk` |
| 222 | +- read `kind` / `version` |
| 223 | +- return a manifest payload suitable for dispatch |
| 224 | + |
| 225 | +### Step 2: Extend `blosc2.open()` |
| 226 | + |
| 227 | +In the special-store open path: |
| 228 | + |
| 229 | +- if opening yields a `TreeStore` |
| 230 | +- probe the root manifest |
| 231 | +- if recognized as `ctable`, return `CTable.open(...)` or equivalent internal |
| 232 | + constructor |
| 233 | +- otherwise return the `TreeStore` |
| 234 | + |
| 235 | +This logic should be localized so the generic `open()` path remains easy to |
| 236 | +follow. |
| 237 | + |
| 238 | +### Step 3: Add `CTable` Root-Manifest Read/Write Helpers |
| 239 | + |
| 240 | +In the `CTable` persistence layer, add helpers for: |
| 241 | + |
| 242 | +- creating `/_meta` |
| 243 | +- writing `kind` |
| 244 | +- writing `version` |
| 245 | +- writing `schema` |
| 246 | +- reading and validating the root manifest |
| 247 | + |
| 248 | +This should be the only place that knows the `CTable` manifest schema. |
| 249 | + |
| 250 | +### Step 4: Wire Creation |
| 251 | + |
| 252 | +When a persistent `CTable` is created: |
| 253 | + |
| 254 | +- create/open the backing `TreeStore` |
| 255 | +- create `/_meta` |
| 256 | +- write the root manifest |
| 257 | +- create `/_valid_rows` |
| 258 | +- create `/_cols/<name>` arrays |
| 259 | + |
| 260 | +### Step 5: Wire Reopen |
| 261 | + |
| 262 | +When a persistent `CTable` is reopened: |
| 263 | + |
| 264 | +- read `/_meta.vlmeta["schema"]` |
| 265 | +- rebuild the compiled schema |
| 266 | +- reopen `/_valid_rows` |
| 267 | +- reopen each persisted column from `/_cols/<name>` |
| 268 | + |
| 269 | +### Step 6: Keep Internal Names Reserved |
| 270 | + |
| 271 | +Validation should reject user column names that collide with internal names: |
| 272 | + |
| 273 | +- `_meta` |
| 274 | +- `_valid_rows` |
| 275 | +- `_cols` |
| 276 | + |
| 277 | +This already aligns with the existing schema compiler reserved-name logic. |
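A minimal sketch of the reserved-name check follows. The names `RESERVED_NAMES` and `check_column_names` are hypothetical and do not reflect the existing schema compiler's actual API.

```python
# Internal names listed in this plan; user columns must not reuse them.
RESERVED_NAMES = frozenset({"_meta", "_valid_rows", "_cols"})


def check_column_names(names):
    """Raise ValueError if any user column name collides with an internal name."""
    clashes = sorted(set(names) & RESERVED_NAMES)
    if clashes:
        raise ValueError(f"reserved column name(s): {', '.join(clashes)}")


if __name__ == "__main__":
    check_column_names(["id", "score", "active"])  # passes silently
    try:
        check_column_names(["id", "_cols"])
    except ValueError as exc:
        print(exc)
```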
| 278 | + |
| 279 | +## Validation Rules |
| 280 | + |
| 281 | +For `CTable` root-manifest detection: |
| 282 | + |
| 283 | +- if `/_meta` does not exist: |
| 284 | + - not a persisted `CTable` |
| 285 | +- if `/_meta` exists but is malformed: |
| 286 | + - raise clear error on attempted `CTable` materialization |
| 287 | +- if `kind != "ctable"`: |
| 288 | + - return raw `TreeStore` |
| 289 | +- if `kind == "ctable"` but required fields are missing: |
| 290 | + - raise clear error |
| 291 | + |
| 292 | +Recommended required fields for version 1: |
| 293 | + |
| 294 | +- `kind` |
| 295 | +- `version` |
| 296 | +- `schema` |
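The detection rules above could be expressed as a single classifier over the manifest fields. This is a sketch under stated assumptions: `classify_root_manifest` is a hypothetical name, the manifest is modeled as a plain dict (or `None` when `/_meta` is absent), and `ValueError` is an assumed choice of exception type.

```python
REQUIRED_FIELDS = ("kind", "version", "schema")


def classify_root_manifest(manifest):
    """Return "ctable" or "treestore" for a root manifest.

    `manifest` is None when `/_meta` does not exist. A manifest that
    claims kind "ctable" but lacks required fields raises ValueError.
    """
    if manifest is None:
        return "treestore"  # not a persisted CTable
    if manifest.get("kind") != "ctable":
        return "treestore"  # unknown kind: leave as raw TreeStore
    missing = [name for name in REQUIRED_FIELDS if name not in manifest]
    if missing:
        raise ValueError(
            f"CTable manifest is missing required field(s): {', '.join(missing)}"
        )
    return "ctable"


if __name__ == "__main__":
    print(classify_root_manifest(None))
    print(classify_root_manifest({"kind": "ctable", "version": 1, "schema": {}}))
```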
| 297 | + |
| 298 | +## Deferred Scope |
| 299 | + |
| 300 | +This plan intentionally does not cover: |
| 301 | + |
| 302 | +- multiple `CTable` objects in one `TreeStore` |
| 303 | +- subtree object roots such as `/users/_meta` |
| 304 | +- automatic materialization when indexing a subtree from `TreeStore` |
| 305 | +- `Ref` support for store-subtree logical objects |
| 306 | +- schema evolution beyond append-only behavior |
| 307 | + |
| 308 | +These should be handled in later phases after the root-level path is stable. |
| 309 | + |
| 310 | +## Tests |
| 311 | + |
| 312 | +Add coverage for: |
| 313 | + |
| 314 | +- create persistent root-level `CTable` |
| 315 | +- reopen via `blosc2.open(urlpath)` and get `CTable` |
| 316 | +- reopen via `CTable.open(urlpath, mode="r")` |
| 317 | +- root manifest present and schema readable from `/_meta.vlmeta` |
| 318 | +- store with no `/_meta` still opens as raw `TreeStore` |
| 319 | +- store with unknown root-manifest `kind` still opens as raw `TreeStore` |
| 320 | +- malformed `CTable` manifest raises clear error |
| 321 | +- append rows after reopen |
| 322 | +- read-only reopen rejects writes |
| 323 | + |
| 324 | +## Recommended Implementation Order |
| 325 | + |
| 326 | +1. write root-manifest probe helpers for `TreeStore` |
| 327 | +2. extend `blosc2.open()` with root-manifest dispatch |
| 328 | +3. add `CTable` manifest read/write helpers |
| 329 | +4. wire persistent create/open around the manifest |
| 330 | +5. add tests for dispatch and round-trip |
| 331 | + |
| 332 | +## Summary |
| 333 | + |
| 334 | +The first `TreeStore` extension should treat root `/_meta` as the logical |
| 335 | +manifest for the whole store. |
| 336 | + |
| 337 | +For `CTable`, this yields a simple and coherent open story: |
| 338 | + |
| 339 | +- low-level metadata says "this is a `TreeStore`" |
| 340 | +- root `/_meta` says "this store materializes as a `CTable`" |
| 341 | +- `blosc2.open(urlpath)` returns the richer object directly |
| 342 | + |
| 343 | +This keeps the first implementation small while staying compatible with a later |
| 344 | +generalization to subtree object roots. |