refactor: simplify internal chunk representation #3899
d-v-b wants to merge 9 commits into zarr-developers:main
Conversation
Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently. This change ensures that chunk specifications are always normalized by the same routines in all cases. This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis.
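As a hedged illustration of the `-1` behavior described above (plain Python with invented names, not zarr's actual routine):

```python
# Illustrative sketch (invented names, not zarr's actual routine) of how a
# chunks=(-1, ...) request normalizes to a full-length chunk along that axis.

def normalize_axis_chunk(requested: int, axis_len: int) -> int:
    """Resolve one axis of a chunks= request against the array shape."""
    if requested == -1:
        return axis_len  # -1 means one chunk spanning the whole axis
    if requested < 1:
        raise ValueError(f"invalid chunk size {requested!r}")
    return requested

def normalize_chunks(chunks: tuple[int, ...], shape: tuple[int, ...]) -> tuple[int, ...]:
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same length")
    return tuple(normalize_axis_chunk(c, s) for c, s in zip(chunks, shape))

print(normalize_chunks((-1, 10), (100, 20)))  # (100, 10)
```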
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
- Coverage 92.98% 92.70% -0.28%
==========================================
Files 87 87
Lines 11246 11261 +15
==========================================
- Hits 10457 10440 -17
- Misses 789 821 +32
Codecov Report ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
+ Coverage 92.98% 93.01% +0.02%
==========================================
Files 87 87
Lines 11246 11262 +16
==========================================
+ Hits 10457 10475 +18
+ Misses 789 787 -2
maxrjones
left a comment
I really like the direction of the refactor.
I found the description of the PR somewhat misleading. The bug fix (making chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank around duplicated normalization logic worse.
There are a few cases in the deprecated Array.create() method that possibly regress in this PR:
import warnings

import zarr

def _create_deprecated(**kwargs):
    """Call the deprecated Array.create(), suppressing the deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return zarr.Array.create(**kwargs)

def test_deprecated_underspecified_chunks_padded():
    """Fewer chunk dims than shape dims — missing dims padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30,), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)

def test_deprecated_underspecified_chunks_with_none():
    """Partial chunks with None — padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)

def test_deprecated_none_per_dimension_sentinel():
    """None inside chunks tuple means 'span the full axis'."""
    arr = _create_deprecated(store={}, shape=(100, 10), chunks=(10, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (10, 10)

I'm not sure if these were intentional API design choices or quirks of the old API. It may be a good time to remove the deprecated functions first, as a separate PR, to reduce the surface area for potential regressions when fixing or adding functionality to the new API.
good catch, the change that broke -1 normalization was this one: #2761. We basically forked array creation routines and didn't reach feature / testing parity with the new one 🤦 I don't see value in supporting cases like this, other than backwards compatibility.
Are there any non-deprecated functions that supported this?
💯
I couldn't find any non-deprecated cases that support underspecified chunks (fewer entries than the number of dims) or use None as a sentinel value.
#3903 removes the deprecated methods
The latest changes make the representation of chunks recursive, in order to express nested sharding. This future-proofs the design against the possibility that we give our high-level routines a simple way to declare nested sharding.
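One way such a recursive representation could look, sketched with hypothetical names (the PR's actual definitions may differ):

```python
# Hypothetical sketch of a recursive chunk representation for nested sharding:
# each level may itself carry an inner chunk layout. Names are illustrative.
from __future__ import annotations

from typing import NamedTuple

class ChunkLevel(NamedTuple):
    chunks: tuple[tuple[int, ...], ...]  # per-axis chunk sizes at this level
    inner: ChunkLevel | None             # sub-chunk layout if sharded, else None

# One level of sharding over a 100x20 array: a single 100-element shard on
# axis 0, internally split into ten 10-element chunks.
layout = ChunkLevel(
    chunks=((100,), (20,)),
    inner=ChunkLevel(chunks=((10,) * 10, (20,)), inner=None),
)
```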
The addition of rectilinear chunks left us with some jank in our internal chunk normalization logic. We had a lot of redundant chunk normalization routines, and we also weren't handling user input correctly, e.g. #3898. We need some internal changes to ensure that user input is consistently handled regardless of whether we are generating regular chunks or irregular chunks. That's what this PR does. Also, this PR closes #3898
I will give my summary, then a summary generated by Claude.
My summary
**ChunksTuple**

This PR addresses this by introducing a canonical internal representation of the fully normalized chunk layout for an array: a tuple called `ChunksTuple`. Feel free to suggest better names. `ChunksTuple` is just `tuple[tuple[int, ...], ...]`, i.e. a representation compatible with regular or irregular chunks, but I wrap this type in `NewType`.

I use `NewType` because tuples of tuples of ints can very easily be confused with tuples of ints (regular chunks), or tuples of tuples of tuples of ints (e.g., rectilinear chunking with RLE). So I think it's helpful to be defensive here and reduce ambiguity.

There are 2 functions that produce a `ChunksTuple`:

- `normalize_chunks_nd`, which converts a user-friendly request for specific chunks into an explicit chunk layout
- `guess_chunks`, which converts a user's request for auto chunking into a specific layout. Auto chunking depends on configuration, data type, etc., so this is a separate routine.

**ResolvedChunking**

`ChunksTuple` is used in `ResolvedChunking` (bad name, I would rather use `ChunkSpec`, but that's in use already). `ResolvedChunking` is what you get when you jointly normalize the `chunks` and `shards` keyword arguments to `create_array`.

I introduce some new terminology here for internal purposes. `outer_chunks` denotes the shape of the chunks qua stored objects, and `inner_chunks` denotes the shape of the subchunks inside an outer chunk, if that outer chunk uses sharding. If the outer chunk doesn't use sharding, then `inner_chunks` is `None`.

These two data types are used to consolidate our chunk normalization routines.
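A minimal sketch of these two data types, assuming the shapes described in this summary (the PR's real definitions may differ in detail):

```python
from __future__ import annotations

from typing import NamedTuple, NewType

# One inner tuple per axis, listing the chunk sizes along that axis.
# Regular chunking is the case where every inner tuple is uniform.
ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])

class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple         # chunks as stored objects (shards, when sharding)
    inner_chunks: ChunksTuple | None  # sub-chunks inside a shard, or None if unsharded

# A 100x20 array: four 25-element shards on axis 0, each holding 5-element chunks.
resolved = ResolvedChunking(
    outer_chunks=ChunksTuple(((25, 25, 25, 25), (20,))),
    inner_chunks=ChunksTuple(((5, 5, 5, 5, 5), (20,))),
)
```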
Claude's Summary
Refactors chunk and shard handling during array creation to fix a naming ambiguity where `chunk_shape` meant "outer grid partition" without sharding but "inner sub-chunk" with sharding, silently changing meaning based on context.

Introduces a three-layer architecture for chunk resolution:

1. **Normalization** — `normalize_chunks_nd` and `guess_chunks` convert raw user input into `ChunksTuple`, a `NewType`-branded `tuple[tuple[int, ...], ...]` that represents both regular and rectilinear chunks uniformly. This is the only boundary between untyped user input and the internal representation.
2. **Resolution** — `resolve_outer_and_inner_chunks` takes a `ChunksTuple` (the user's `chunks=`) and raw shard input (`shards=`), and returns a `ResolvedChunking` `NamedTuple` with two unambiguous fields:
   - `outer_chunks: ChunksTuple` — chunk sizes for the chunk grid metadata (shard sizes when sharding, chunk sizes otherwise)
   - `inner_chunks: ChunksTuple | None` — sub-chunk sizes for `ShardingCodec`, or `None` when sharding is not active
3. **Metadata construction** — `create_chunk_grid_metadata` takes a `ChunksTuple` and dispatches to `RegularChunkGridMetadata` or `RectilinearChunkGridMetadata` based on whether the chunks are uniform.

**Key design decisions**

- `ChunksTuple` as a `NewType`: zero runtime cost, but the type checker prevents accidentally passing raw user input where normalized chunks are expected. Both regular and rectilinear chunks use the same representation — regular is just the case where each inner tuple has uniform values.
- `inner_chunks: None` models capability, not configuration: an unsharded chunk is opaque (read the whole thing or nothing). A shard has internal structure (an index that enables sub-chunk addressing). `None` means "this chunk has no internal structure" — it's not a flag you toggle, it's the absence of a capability.
- `normalize_chunks_nd` rejects `None`: top-level `None` means "auto" everywhere else in the codebase. Having `normalize_chunks_nd` silently treat it as "span all" would be a bug waiting to happen. Callers must use `guess_chunks` for auto-chunking.
- Rectilinear shard detection absorbed into `resolve_outer_and_inner_chunks`: the function handles all shard input forms (`None`, `"auto"`, dict, flat tuple, nested sequence) internally, eliminating the `shards_for_partition` / `rectilinear_shard_meta` dance that callers previously had to manage.

**Changes by file**
**`src/zarr/core/chunk_grids.py`**

- New `SHARDED_INNER_CHUNK_MAX_BYTES` constant (1 MiB) — replaces the magic number used as the auto-chunking ceiling when sharding is active
- New `ChunksTuple` `NewType` — branded `tuple[tuple[int, ...], ...]`
- New `ResolvedChunking` `NamedTuple` — `(outer_chunks, inner_chunks)`
- `normalize_chunks_nd` now returns `ChunksTuple`, rejects `None`
- `guess_chunks` now returns `ChunksTuple` (normalizes via `normalize_chunks_nd`)
- Replaced `resolve_shard_shape` (returned a flat `tuple | None`) with `resolve_outer_and_inner_chunks` (returns `ResolvedChunking`, absorbs rectilinear shard detection)
- Removed `resolve_chunk_shape` (was a lossy flattening wrapper)
- Removed `guess_chunks_and_shards` (was dead code)

**`src/zarr/core/metadata/v3.py`**

- `create_chunk_grid_metadata` now accepts `ChunksTuple` (no longer normalizes internally, no `shape` parameter)
- `is_regular_1d` rewritten to short-circuit on the first mismatch instead of building a full `set`

**`src/zarr/core/array.py`**

- `init_array`: chunk/shard resolution reduced from ~50 lines of interleaved conditionals to a clean pipeline: normalize → resolve → build metadata. Variables `chunk_shape_parsed`, `shard_shape_parsed`, `chunks_out`, `shards_for_partition`, and `rectilinear_shard_meta` eliminated in favor of `outer_chunks` and `inner_chunks`.
- `_create` (legacy API): same normalize-then-build pattern, consistent `outer_chunks` naming

**`tests/conftest.py`**

- `create_array_metadata` updated to use `resolve_outer_and_inner_chunks` and `create_chunk_grid_metadata` instead of manually constructing grid metadata dicts

**`tests/test_chunk_grids.py`**

- `normalize_chunks_nd` tests updated: `None` moved to error cases, `typesize` parameter removed

**`tests/test_array.py`**

- Tests for `resolve_outer_and_inner_chunks`
- Uses `SHARDED_INNER_CHUNK_MAX_BYTES` instead of the magic `1048576`