
Commit d326cc3

Refactor indexing API around enums and argsort
- replace legacy indexing names with a cleaner public API
- add strict IndexKind enum and rename tiers to SUMMARY/BUCKET/PARTIAL/FULL
- require enum kinds and build= in create_index()
- fold expression indexing into create_index()
- remove create_csindex() and create_expr_index()
- rename itersorted() to iter_sorted()
- standardize on argsort() for arrays and lazy expressions
- remove public indices()-style indexing entry points
- add tmpdir support for OOC full-index builds
- fix stale sidecar-handle invalidation on Windows
- update docstrings, tutorial, examples, and benchmarks to match
- refresh indexing tests for the new public surface
1 parent c926554 commit d326cc3
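
The commit message says `create_index()` now requires enum kinds, and the benchmark script in this commit converts its CLI strings with `blosc2.IndexKind[kind.upper()]`. A minimal sketch of that name-to-member lookup, using a stand-in enum: the real `blosc2.IndexKind` is not reproduced here, and both the member values and the `parse_kind` helper are assumptions made for illustration.

```python
from enum import Enum


class IndexKind(Enum):
    # Stand-in for blosc2.IndexKind. Member names follow the commit message;
    # the string values are an assumption made for this sketch.
    SUMMARY = "summary"
    BUCKET = "bucket"
    PARTIAL = "partial"
    FULL = "full"


def parse_kind(name: str) -> IndexKind:
    """Map a CLI-style string to an enum member (hypothetical helper)."""
    try:
        # Enum[...] looks a member up by name, so "bucket" -> IndexKind.BUCKET.
        return IndexKind[name.upper()]
    except KeyError:
        valid = ", ".join(member.name.lower() for member in IndexKind)
        raise ValueError(f"unknown index kind {name!r}; expected one of: {valid}") from None
```

With a lookup like this, a legacy name such as `"light"` fails early with a `ValueError` instead of being passed through as a string.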

18 files changed

Lines changed: 706 additions & 791 deletions

bench/indexing/blosc2-vs-duckdb-indexes.md

Lines changed: 36 additions & 36 deletions
@@ -32,10 +32,10 @@ and 24 GB of RAM.

 - Script: `index_query_bench.py`
 - Index kinds:
-  - `ultralight`
-  - `light`
-  - `medium`
-  - `full`
+  - `summary`
+  - `bucket`
+  - `exact`
+  - `sorted`
 - Default geometry in these runs:
   - `chunks=1,250,000`
   - `blocks=10,000`
@@ -104,30 +104,30 @@ Command:
 python index_query_bench.py \
   --size 10M \
   --outdir /tmp/indexes-10M \
-  --kind light \
+  --kind bucket \
   --query-width 50 \
-  --in-mem \
+  --build memory \
   --dist random
 ```

-Observed `light` results:
+Observed `bucket` results:

 - build: `705.193 ms`
 - cold lookup: `6.370 ms`
 - warm lookup: `6.250 ms`
 - base array size: about `31 MB`
-- `light` index sidecars: about `27 MB`
+- `bucket` index sidecars: about `27 MB`
 - total footprint: about `58 MB`

 ### Interpretation

 For this moderately selective random workload:

-- Blosc2 `light` is about `2x` faster than DuckDB `zonemap`
-- Blosc2 `light` has a total footprint similar to DuckDB `zonemap`
+- Blosc2 `bucket` is about `2x` faster than DuckDB `zonemap`
+- Blosc2 `bucket` has a total footprint similar to DuckDB `zonemap`
 - DuckDB `art-index` is only slightly faster than `zonemap` here, but much larger

-This suggests that Blosc2 `light` is more than a simple zonemap. It behaves like an active lossy lookup
+This suggests that Blosc2 `bucket` is more than a simple zonemap. It behaves like an active lossy lookup
 structure rather than only coarse pruning metadata.


@@ -169,23 +169,23 @@ python index_query_bench.py \

 Observed results:

-- `light`
+- `bucket`
   - cold lookup: `0.841 ms`
   - warm lookup: `0.184 ms`
-- `medium`
+- `exact`
   - cold lookup: `0.564 ms`
   - warm lookup: `0.168 ms`
-- `full`
+- `sorted`
   - cold lookup: `0.554 ms`
   - warm lookup: `0.167 ms`

 ### Interpretation

 With the generic width-1 range form, Blosc2 is much faster than DuckDB:

-- Blosc2 `light` is already much faster than DuckDB `zonemap`, and comfortably faster than the
+- Blosc2 `bucket` is already much faster than DuckDB `zonemap`, and comfortably faster than the
   generic-range DuckDB `art-index` behavior
-- Blosc2 `medium` and `full` are in a different regime on warm hits, at about `0.17 ms`
+- Blosc2 `exact` and `sorted` are in a different regime on warm hits, at about `0.17 ms`
 - DuckDB `art-index` does not show its real point-lookup behavior in this predicate form
 - Blosc2 warm reuse changes the picture substantially for repeated lookups

@@ -236,17 +236,17 @@ python index_query_bench.py \

 Observed results:

-- `light`
+- `bucket`
   - build: `960.048 ms`
   - cold lookup: `2.489 ms`
   - warm lookup: `0.172 ms`
   - index sidecars: `27,497,393` bytes
-- `medium`
+- `exact`
   - build: `4745.880 ms`
   - cold lookup: `2.202 ms`
   - warm lookup: `0.147 ms`
   - index sidecars: `37,645,201` bytes
-- `full`
+- `sorted`
   - build: `9539.843 ms`
   - cold lookup: `1.753 ms`
   - warm lookup: `0.144 ms`
@@ -258,21 +258,21 @@ Once DuckDB is allowed to use the more planner-friendly single-value predicate:

 - `art-index` becomes very fast
 - `art-index` is clearly faster than Blosc2 on cold point lookups in this run
-- Blosc2 is clearly faster on warm repeated point lookups across `light`, `medium`, and `full`
+- Blosc2 is clearly faster on warm repeated point lookups across `bucket`, `exact`, and `sorted`

 However, the storage costs are very different:

 - DuckDB `art-index` database size: about `478.4 MB`
 - DuckDB zonemap baseline size: about `56.1 MB`
 - estimated ART overhead over baseline: about `422.3 MB`
-- Blosc2 `full` base + index footprint: about `31 MB + 29.9 MB = 60.9 MB`
+- Blosc2 `sorted` base + index footprint: about `31 MB + 29.9 MB = 60.9 MB`

 So for true point lookups:

 - DuckDB `art-index` wins on cold point-lookup latency in this measurement
-- Blosc2 `full` remains much smaller overall
-- Blosc2 `light`, `medium`, and `full` all become faster than DuckDB `art-index` on warm repeated hits
-- DuckDB `art-index` still has a very large storage premium over both Blosc2 `light` and `full`
+- Blosc2 `sorted` remains much smaller overall
+- Blosc2 `bucket`, `exact`, and `sorted` all become faster than DuckDB `art-index` on warm repeated hits
+- DuckDB `art-index` still has a very large storage premium over both Blosc2 `bucket` and `sorted`


 ## Blosc2 Light vs DuckDB Zonemap
@@ -284,16 +284,16 @@ Main observations:

 - storage footprint is in roughly the same ballpark
   - DuckDB zonemap DB: about `56 MB`
-  - Blosc2 base + `light`: about `58 MB`
-- Blosc2 `light` lookup speed is much better
+  - Blosc2 base + `bucket`: about `58 MB`
+- Blosc2 `bucket` lookup speed is much better
   - width `50`: about `6.25 ms` vs `13.33 ms`
   - width `1` range: about `0.18 ms` warm vs `12.61 ms` generic-range DuckDB
   - width `1` equality: about `0.17 ms` warm vs `2.94 ms` DuckDB zonemap warm

 Conclusion:

-- DuckDB zonemap is closer in spirit to Blosc2 `light` than DuckDB ART is
-- but Blosc2 `light` is a materially stronger lookup structure on these workloads
+- DuckDB zonemap is closer in spirit to Blosc2 `bucket` than DuckDB ART is
+- but Blosc2 `bucket` is a materially stronger lookup structure on these workloads


 ## Blosc2 Full vs DuckDB ART
@@ -304,20 +304,20 @@ Main observations:

 - point-lookup latency
   - DuckDB `art-index`: `0.613 ms` cold, `0.245 ms` warm
-  - Blosc2 `full`: `1.753 ms` cold, `0.144 ms` warm
+  - Blosc2 `sorted`: `1.753 ms` cold, `0.144 ms` warm
 - build time
   - DuckDB `art-index`: `2000.316 ms`
-  - Blosc2 `full`: `9539.843 ms`
+  - Blosc2 `sorted`: `9539.843 ms`
 - footprint
   - DuckDB `art-index` DB: about `478.4 MB`
-  - Blosc2 `full` base + index: about `60.9 MB`
+  - Blosc2 `sorted` base + index: about `60.9 MB`

 Conclusion:

-- Blosc2 `full` wins on storage efficiency
+- Blosc2 `sorted` wins on storage efficiency
 - DuckDB `art-index` wins on cold point-lookup latency
-- Warm repeated point lookups favor Blosc2 `full` more clearly
-- DuckDB `art-index` is much faster to build than Blosc2 `full`
+- Warm repeated point lookups favor Blosc2 `sorted` more clearly
+- DuckDB `art-index` is much faster to build than Blosc2 `sorted`
 - DuckDB ART is much more sensitive to predicate shape


@@ -349,8 +349,8 @@ Practical implication:

 ## Current Takeaways

-1. Blosc2 `light` is very competitive against DuckDB zonemap-like pruning.
-2. Blosc2 `light` offers much faster selective lookups than DuckDB zonemap at a similar total storage cost.
+1. Blosc2 `bucket` is very competitive against DuckDB zonemap-like pruning.
+2. Blosc2 `bucket` offers much faster selective lookups than DuckDB zonemap at a similar total storage cost.
 3. DuckDB `art-index` becomes strong only when queries are written as true equality predicates.
 4. On true point lookups, DuckDB `art-index` wins on cold latency in the current M4 Pro run, but
    Blosc2 exact indexes are markedly better on warm repeated lookups.
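
The storage and latency comparisons in the benchmark notes above come down to a few subtractions and ratios. A small sketch that recomputes them from the quoted figures; the variable names are illustrative and the inputs are copied from the notes, not re-measured.

```python
# Figures copied from the benchmark notes (sizes in MB, timings in ms).
duckdb_art_db_mb = 478.4
duckdb_zonemap_db_mb = 56.1
blosc2_base_mb = 31.0
blosc2_index_mb = 29.9  # index sidecars of the largest Blosc2 tier

# Storage: ART overhead over the zonemap baseline, and the Blosc2 total.
art_overhead_mb = duckdb_art_db_mb - duckdb_zonemap_db_mb
blosc2_total_mb = blosc2_base_mb + blosc2_index_mb
footprint_ratio = duckdb_art_db_mb / blosc2_total_mb

# Latency and build time for the largest Blosc2 tier vs DuckDB art-index.
cold_ratio = 1.753 / 0.613         # Blosc2 cold / DuckDB cold (DuckDB faster)
warm_ratio = 0.245 / 0.144         # DuckDB warm / Blosc2 warm (Blosc2 faster)
build_ratio = 9539.843 / 2000.316  # Blosc2 build / DuckDB build

print(f"ART overhead: {art_overhead_mb:.1f} MB")
print(f"Blosc2 base + index: {blosc2_total_mb:.1f} MB")
print(f"footprint ratio (ART DB / Blosc2): {footprint_ratio:.2f}x")
print(f"cold: {cold_ratio:.2f}x, warm: {warm_ratio:.2f}x, build: {build_ratio:.2f}x")
```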

bench/indexing/index_query_bench.py

Lines changed: 31 additions & 25 deletions
@@ -24,18 +24,19 @@

 SIZES = (1_000_000, 2_000_000, 5_000_000, 10_000_000)
 DEFAULT_REPEATS = 3
-KINDS = ("ultralight", "light", "medium", "full")
-DEFAULT_KIND = "light"
+KINDS = ("summary", "bucket", "partial", "full")
+DEFAULT_KIND = "bucket"
 DISTS = ("sorted", "block-shuffled", "permuted", "random")
 RNG_SEED = 0
 DEFAULT_OPLEVEL = 5
 FULL_QUERY_MODES = ("auto", "selective-ooc", "whole-load")
 DATASET_LAYOUT_VERSION = "payload-ramp-v1"
+BUILD_MODES = ("auto", "memory", "ooc")

 COLD_COLUMNS = [
     ("rows", lambda result: f"{result['size']:,}"),
     ("dist", lambda result: result["dist"]),
-    ("builder", lambda result: "mem" if result["in_mem"] else "ooc"),
+    ("builder", lambda result: "mem" if result["build"] == "memory" else "ooc"),
     ("kind", lambda result: result["kind"]),
     ("create_idx_ms", lambda result: f"{result['create_idx_ms']:.3f}"),
     ("scan_ms", lambda result: f"{result['scan_ms']:.3f}"),
@@ -50,7 +51,7 @@
 WARM_COLUMNS = [
     ("rows", lambda result: f"{result['size']:,}"),
     ("dist", lambda result: result["dist"]),
-    ("builder", lambda result: "mem" if result["in_mem"] else "ooc"),
+    ("builder", lambda result: "mem" if result["build"] == "memory" else "ooc"),
     ("kind", lambda result: result["kind"]),
     ("create_idx_ms", lambda result: f"{result['create_idx_ms']:.3f}"),
     ("scan_ms", lambda result: f"{result['scan_ms']:.3f}"),
@@ -277,7 +278,7 @@ def build_persistent_array(
     for start in range(0, size, chunk_len):
         stop = min(start + chunk_len, size)
         chunk = np.zeros(stop - start, dtype=dtype)
-        if dist == "sorted":
+        if dist == "full":
             chunk["id"] = ordered_id_slice(size, start, stop, id_dtype)
         elif dist == "block-shuffled":
             _fill_block_shuffled_ids(chunk["id"], size, start, stop, block_len, block_order)
@@ -308,14 +309,14 @@ def indexed_array_path(
     kind: str,
     optlevel: int,
     id_dtype: np.dtype,
-    in_mem: bool,
+    build: str,
     chunks: int | None,
     blocks: int | None,
     codec: blosc2.Codec | None,
     clevel: int | None,
     nthreads: int | None,
 ) -> Path:
-    mode = "mem" if in_mem else "ooc"
+    mode = "mem" if build == "memory" else "ooc"
     codec_token = "codec-auto" if codec is None else f"codec-{codec.name}"
     clevel_token = "clevel-auto" if clevel is None else f"clevel-{clevel}"
     thread_token = "threads-auto" if nthreads is None else f"threads-{nthreads}"
@@ -442,11 +443,11 @@ def _condition_expr(lo: object, hi: object, dtype: np.dtype, *, query_single_val
     return f"(id >= {lo_literal}) & (id <= {hi_literal})"


-def _valid_index_descriptor(arr: blosc2.NDArray, kind: str, optlevel: int, in_mem: bool) -> dict | None:
+def _valid_index_descriptor(arr: blosc2.NDArray, kind: str, optlevel: int, build: str) -> dict | None:
     for descriptor in arr.indexes:
         if descriptor.get("version") != blosc2_indexing.INDEX_FORMAT_VERSION:
             continue
-        expected_ooc = descriptor.get("ooc", False) if kind == "ultralight" else (not bool(in_mem))
+        expected_ooc = build != "memory"
         if (
             descriptor.get("field") == "id"
             and descriptor.get("kind") == kind
@@ -474,7 +475,7 @@ def _open_or_build_indexed_array(
     id_dtype: np.dtype,
     kind: str,
     optlevel: int,
-    in_mem: bool,
+    build: str,
     chunks: int | None,
     blocks: int | None,
     codec: blosc2.Codec | None,
@@ -484,15 +485,20 @@ def _open_or_build_indexed_array(
 ) -> tuple[blosc2.NDArray, float]:
     if path.exists():
         arr = blosc2.open(path, mode="a")
-        if _valid_index_descriptor(arr, kind, optlevel, in_mem) is not None:
+        if _valid_index_descriptor(arr, kind, optlevel, build) is not None:
             return arr, 0.0
         if arr.indexes:
             arr.drop_index(field="id")
         blosc2.remove_urlpath(path)

     arr = build_persistent_array(size, dist, id_dtype, path, chunks, blocks)
     build_start = time.perf_counter()
-    kwargs = {"field": "id", "kind": kind, "optlevel": optlevel, "in_mem": in_mem}
+    kwargs = {
+        "field": "id",
+        "kind": blosc2.IndexKind[kind.upper()],
+        "optlevel": optlevel,
+        "build": build,
+    }
     cparams = {}
     if codec is not None:
         cparams["codec"] = codec
@@ -515,7 +521,7 @@ def benchmark_size(
     query_single_value: bool,
     optlevel: int,
     id_dtype: np.dtype,
-    in_mem: bool,
+    build: str,
     full_query_mode: str,
     chunks: int | None,
     blocks: int | None,
@@ -549,7 +555,7 @@ def benchmark_size(
         kind,
         optlevel,
         id_dtype,
-        in_mem,
+        build,
         chunks,
         blocks,
         codec,
@@ -561,7 +567,7 @@ def benchmark_size(
         id_dtype,
         kind,
         optlevel,
-        in_mem,
+        build,
         chunks,
         blocks,
         codec,
@@ -588,7 +594,7 @@ def benchmark_size(
         "dist": dist,
         "kind": kind,
         "optlevel": optlevel,
-        "in_mem": in_mem,
+        "build": build,
         "query_rows": index_len,
         "build_s": build_time,
         "create_idx_ms": build_time * 1_000,
@@ -714,10 +720,10 @@ def parse_args() -> argparse.Namespace:
         help=f"Index kind to benchmark. Use 'all' to benchmark every kind. Default: {DEFAULT_KIND}.",
     )
     parser.add_argument(
-        "--in-mem",
-        action=argparse.BooleanOptionalAction,
-        default=False,
-        help="Use the in-memory index builders. Disabled by default; pass --in-mem to force them.",
+        "--build",
+        choices=BUILD_MODES,
+        default="auto",
+        help="Index builder policy: auto, memory, or ooc. Default: auto.",
     )
     parser.add_argument(
         "--full-query-mode",
@@ -787,7 +793,7 @@ def main() -> None:
             args.repeats,
             args.optlevel,
             id_dtype,
-            args.in_mem,
+            args.build,
             args.full_query_mode,
             args.chunks,
             args.blocks,
@@ -809,7 +815,7 @@ def main() -> None:
             args.repeats,
             args.optlevel,
             id_dtype,
-            args.in_mem,
+            args.build,
             args.full_query_mode,
             args.chunks,
             args.blocks,
@@ -831,7 +837,7 @@ def run_benchmarks(
     repeats: int,
     optlevel: int,
     id_dtype: np.dtype,
-    in_mem: bool,
+    build: str,
     full_query_mode: str,
     chunks: int | None,
     blocks: int | None,
@@ -852,7 +858,7 @@ def run_benchmarks(
     print("Structured range-query benchmark across index kinds")
     print(
         f"{geometry_label}, repeats={repeats}, dist={dist_label}, "
-        f"query_width={query_width:,}, optlevel={optlevel}, dtype={id_dtype.name}, in_mem={in_mem}, "
+        f"query_width={query_width:,}, optlevel={optlevel}, dtype={id_dtype.name}, build={build}, "
         f"query_single_value={query_single_value}, "
         f"full_query_mode={full_query_mode}, index_codec={'auto' if codec is None else codec.name}, "
         f"index_clevel={'auto' if clevel is None else clevel}, "
@@ -878,7 +884,7 @@ def cold_progress_callback(row: dict) -> None:
             query_single_value,
             optlevel,
             id_dtype,
-            in_mem,
+            build,
             full_query_mode,
             chunks,
             blocks,
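
The script changes above replace the boolean `--in-mem` flag with a three-valued `--build` option and derive the `mem`/`ooc` label from it. A self-contained sketch of that pattern; `make_parser` and `builder_label` are hypothetical helper names, while the option name, choices, default, and label expression are taken from the diff.

```python
import argparse

BUILD_MODES = ("auto", "memory", "ooc")


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with only the --build option (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--build",
        choices=BUILD_MODES,  # argparse rejects any other value
        default="auto",
        help="Index builder policy: auto, memory, or ooc. Default: auto.",
    )
    return parser


def builder_label(build: str) -> str:
    """Label used in the result tables; mirrors the expression in the diff."""
    return "mem" if build == "memory" else "ooc"


if __name__ == "__main__":
    args = make_parser().parse_args(["--build", "memory"])
    print(args.build, builder_label(args.build))
```

Note that with this expression anything other than `memory`, including the default `auto`, is labeled `ooc`, regardless of which builder is actually selected.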
