refactor: simplify internal chunk representation #3899
d-v-b wants to merge 9 commits into zarr-developers:main
Conversation
Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently. This change ensures that chunk specifications are always normalized by the same routines in all cases. This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis.
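As a hedged illustration of the `-1` behavior described above (plain Python with invented names, not zarr's actual routine):

```python
# Illustrative sketch (invented names, not zarr's actual routine) of how a
# chunks=(-1, ...) request normalizes to a full-length chunk along that axis.

def normalize_axis_chunk(requested: int, axis_len: int) -> int:
    """Resolve one axis of a chunks= request against the array shape."""
    if requested == -1:
        return axis_len  # -1 means one chunk spanning the whole axis
    if requested < 1:
        raise ValueError(f"invalid chunk size {requested!r}")
    return requested

def normalize_chunks(chunks: tuple[int, ...], shape: tuple[int, ...]) -> tuple[int, ...]:
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same length")
    return tuple(normalize_axis_chunk(c, s) for c, s in zip(chunks, shape))

print(normalize_chunks((-1, 10), (100, 20)))  # (100, 10)
```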
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
- Coverage 92.98% 92.70% -0.28%
==========================================
Files 87 87
Lines 11246 11261 +15
==========================================
- Hits 10457 10440 -17
- Misses 789 821 +32
Codecov Report ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
+ Coverage 92.98% 93.01% +0.02%
==========================================
Files 87 87
Lines 11246 11262 +16
==========================================
+ Hits 10457 10475 +18
+ Misses 789 787 -2
maxrjones
left a comment
I really like the direction of the refactor.
I found the description of the PR somewhat misleading. The bug fix (making chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank around duplicated normalization logic worse.
There are a few cases in the deprecated Array.create() method that possibly regress in this PR:
import warnings

import zarr

def _create_deprecated(**kwargs):
    """Call the deprecated Array.create(), suppressing the deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return zarr.Array.create(**kwargs)

def test_deprecated_underspecified_chunks_padded():
    """Fewer chunk dims than shape dims — missing dims padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30,), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)

def test_deprecated_underspecified_chunks_with_none():
    """Partial chunks with None — padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)

def test_deprecated_none_per_dimension_sentinel():
    """None inside chunks tuple means 'span the full axis'."""
    arr = _create_deprecated(store={}, shape=(100, 10), chunks=(10, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (10, 10)

I'm not sure if these were intentional API design choices or quirks of the old API. It may be a good time to remove the deprecated functions first, as a separate PR, to reduce the surface area for potential regressions when fixing or adding functionality to the new API.
good catch, the change that broke -1 normalization was this one: #2761. We basically forked array creation routines and didn't reach feature / testing parity with the new one 🤦 I don't see value in supporting cases like this, other than backwards compatibility.
Are there any non-deprecated functions that supported this?
💯
I couldn't find any non-deprecated cases that support underspecified chunks (fewer entries than the number of dims) or use None as a sentinel value.
#3903 removes the deprecated methods
The latest changes make the representation of chunks recursive, in order to express nested sharding. This future-proofs the design against the possibility that we give our high-level routines a simple way to declare nested sharding.
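One way such a recursive representation could look, sketched with hypothetical names (the PR's actual definitions may differ):

```python
# Hypothetical sketch of a recursive chunk representation for nested sharding:
# each level may itself carry an inner chunk layout. Names are illustrative.
from __future__ import annotations

from typing import NamedTuple

class ChunkLevel(NamedTuple):
    chunks: tuple[tuple[int, ...], ...]  # per-axis chunk sizes at this level
    inner: ChunkLevel | None             # sub-chunk layout if sharded, else None

# One level of sharding over a 100x20 array: a single 100-element shard on
# axis 0, internally split into ten 10-element chunks.
layout = ChunkLevel(
    chunks=((100,), (20,)),
    inner=ChunkLevel(chunks=((10,) * 10, (20,)), inner=None),
)
```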
The addition of rectilinear chunks left us with some jank in our internal chunk normalization logic. We had a lot of redundant chunk normalization routines, and we also weren't handling user input correctly, e.g. #3898. We need some internal changes to ensure that user input is consistently handled regardless of whether we are generating regular chunks or irregular chunks. That's what this PR does. Also, this PR closes #3898
I will give my summary, then a summary generated by Claude.
My summary
**ChunksTuple**

This PR addresses this by introducing a canonical internal representation of the fully normalized chunk layout for an array: a tuple called `ChunksTuple`. Feel free to suggest better names. `ChunksTuple` is just `tuple[tuple[int, ...], ...]`, i.e. a representation compatible with regular or irregular chunks, but I wrap this type in `NewType`.

I use `NewType` because tuples of tuples of ints can very easily be confused with tuples of ints (regular chunks), or tuples of tuples of tuples of ints (e.g., rectilinear chunking with RLE). So I think it's helpful to be defensive here and reduce ambiguity.

There are 2 functions that produce a `ChunksTuple`:

- `normalize_chunks_nd`, which converts a user-friendly request for specific chunks into an explicit chunk layout
- `guess_chunks`, which converts a user's request for auto chunking into a specific layout. Auto chunking depends on configuration, data type, etc., so this is a separate routine.

**ResolvedChunking**

`ChunksTuple` is used in `ResolvedChunking` (bad name, I would rather use `ChunkSpec`, but that's in use already). `ResolvedChunking` is what you get when you jointly normalize the `chunks` and `shards` keyword arguments to `create_array`.

I introduce some new terminology here for internal purposes. `outer_chunks` denotes the shape of the chunks qua stored objects, and `inner_chunks` denotes the shape of the subchunks inside an outer chunk, if that outer chunk uses sharding. If the outer chunk doesn't use sharding, then `inner_chunks` is `None`.

These two data types are used to consolidate our chunk normalization routines.
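A minimal sketch of these two data types, assuming the shapes described in this summary (the PR's real definitions may differ in detail):

```python
from __future__ import annotations

from typing import NamedTuple, NewType

# One inner tuple per axis, listing the chunk sizes along that axis.
# Regular chunking is the case where every inner tuple is uniform.
ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])

class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple         # chunks as stored objects (shards, when sharding)
    inner_chunks: ChunksTuple | None  # sub-chunks inside a shard, or None if unsharded

# A 100x20 array: four 25-element shards on axis 0, each holding 5-element chunks.
resolved = ResolvedChunking(
    outer_chunks=ChunksTuple(((25, 25, 25, 25), (20,))),
    inner_chunks=ChunksTuple(((5, 5, 5, 5, 5), (20,))),
)
```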
Claude's Summary
Refactors chunk and shard handling during array creation to fix a naming ambiguity where `chunk_shape` meant "outer grid partition" without sharding but "inner sub-chunk" with sharding, silently changing meaning based on context.

Introduces a three-layer architecture for chunk resolution:

1. **Normalization** — `normalize_chunks_nd` and `guess_chunks` convert raw user input into `ChunksTuple`, a `NewType`-branded `tuple[tuple[int, ...], ...]` that represents both regular and rectilinear chunks uniformly. This is the only boundary between untyped user input and the internal representation.
2. **Resolution** — `resolve_outer_and_inner_chunks` takes a `ChunksTuple` (the user's `chunks=`) and raw shard input (`shards=`), and returns a `ResolvedChunking` `NamedTuple` with two unambiguous fields:
   - `outer_chunks: ChunksTuple` — chunk sizes for the chunk grid metadata (shard sizes when sharding, chunk sizes otherwise)
   - `inner_chunks: ChunksTuple | None` — sub-chunk sizes for `ShardingCodec`, or `None` when sharding is not active
3. **Metadata construction** — `create_chunk_grid_metadata` takes a `ChunksTuple` and dispatches to `RegularChunkGridMetadata` or `RectilinearChunkGridMetadata` based on whether the chunks are uniform.

**Key design decisions**

- `ChunksTuple` as a `NewType`: zero runtime cost, but the type checker prevents accidentally passing raw user input where normalized chunks are expected. Both regular and rectilinear chunks use the same representation — regular is just the case where each inner tuple has uniform values.
- `inner_chunks: None` models capability, not configuration: an unsharded chunk is opaque (read the whole thing or nothing). A shard has internal structure (an index that enables sub-chunk addressing). `None` means "this chunk has no internal structure" — it's not a flag you toggle, it's the absence of a capability.
- `normalize_chunks_nd` rejects `None`: top-level `None` means "auto" everywhere else in the codebase. Having `normalize_chunks_nd` silently treat it as "span all" would be a bug waiting to happen. Callers must use `guess_chunks` for auto-chunking.
- Rectilinear shard detection absorbed into `resolve_outer_and_inner_chunks`: the function handles all shard input forms (`None`, `"auto"`, dict, flat tuple, nested sequence) internally, eliminating the `shards_for_partition` / `rectilinear_shard_meta` dance that callers previously had to manage.

**Changes by file**
**`src/zarr/core/chunk_grids.py`**

- New `SHARDED_INNER_CHUNK_MAX_BYTES` constant (1 MiB) — replaces the magic number used as the auto-chunking ceiling when sharding is active
- New `ChunksTuple` `NewType` — branded `tuple[tuple[int, ...], ...]`
- New `ResolvedChunking` `NamedTuple` — `(outer_chunks, inner_chunks)`
- `normalize_chunks_nd` now returns `ChunksTuple`, rejects `None`
- `guess_chunks` now returns `ChunksTuple` (normalizes via `normalize_chunks_nd`)
- Replaced `resolve_shard_shape` (returned a flat `tuple | None`) with `resolve_outer_and_inner_chunks` (returns `ResolvedChunking`, absorbs rectilinear shard detection)
- Removed `resolve_chunk_shape` (was a lossy flattening wrapper)
- Removed `guess_chunks_and_shards` (was dead code)

**`src/zarr/core/metadata/v3.py`**

- `create_chunk_grid_metadata` now accepts `ChunksTuple` (no longer normalizes internally, no `shape` parameter)
- `is_regular_1d` rewritten to short-circuit on the first mismatch instead of building a full `set`

**`src/zarr/core/array.py`**

- `init_array`: chunk/shard resolution reduced from ~50 lines of interleaved conditionals to a clean pipeline: normalize → resolve → build metadata. Variables `chunk_shape_parsed`, `shard_shape_parsed`, `chunks_out`, `shards_for_partition`, and `rectilinear_shard_meta` eliminated in favor of `outer_chunks` and `inner_chunks`.
- `_create` (legacy API): same normalize-then-build pattern, consistent `outer_chunks` naming

**`tests/conftest.py`**

- `create_array_metadata` updated to use `resolve_outer_and_inner_chunks` and `create_chunk_grid_metadata` instead of manually constructing grid metadata dicts

**`tests/test_chunk_grids.py`**

- `normalize_chunks_nd` tests updated: `None` moved to error cases, `typesize` parameter removed

**`tests/test_array.py`**

- Tests for `resolve_outer_and_inner_chunks`
- Uses `SHARDED_INNER_CHUNK_MAX_BYTES` instead of the magic `1048576`