
refactor: simplify internal chunk representation #3899

Open
d-v-b wants to merge 9 commits into zarr-developers:main from
d-v-b:refactor/simplify-internal-chunk-representation


@d-v-b
Contributor

@d-v-b d-v-b commented Apr 12, 2026

The addition of rectilinear chunks left us with some jank in our internal chunk normalization logic. We had a lot of redundant chunk normalization routines, and we also weren't handling user input correctly, e.g. #3898. We need some internal changes to ensure that user input is handled consistently, regardless of whether we are generating regular chunks or irregular chunks. That's what this PR does. This PR also closes #3898.

I will give my summary, then a summary generated by claude.

My summary

ChunksTuple

This PR addresses this by introducing a canonical internal representation of the fully normalized chunk layout for an array, which is a tuple called ChunksTuple. Feel free to suggest better names.

ChunksTuple is just tuple[tuple[int, ...], ...], i.e. a representation compatible with regular or irregular chunks, but I wrap this type in NewType.

I use NewType because tuples of tuples of ints can be very easily confused with tuples of ints (regular chunks), or tuples of tuples of tuples of ints (e.g., rectilinear chunking with RLE). So I think it's helpful to be defensive here and reduce ambiguity.
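As a minimal sketch of the idea above (not the PR's actual code), the snippet below brands a tuple of tuples of ints with NewType. The helper `count_chunks` is a hypothetical name used only for illustration; the point is that a type checker would flag a plain tuple passed where the branded type is expected, while at runtime the value is still an ordinary tuple.

```python
from typing import NewType

# Branded alias: identical to a nested tuple at runtime, distinct for type checkers.
ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])


def count_chunks(chunks: ChunksTuple) -> int:
    """Hypothetical helper: total number of chunks in the layout."""
    total = 1
    for axis_sizes in chunks:
        total *= len(axis_sizes)
    return total


# Two axes: three chunks along the first, two along the second.
layout = ChunksTuple(((10, 10, 10), (5, 5)))
n_chunks = count_chunks(layout)

# A type checker would reject count_chunks((10, 5)), because a flat tuple of
# ints (regular chunk shape) is not a ChunksTuple.
```

Because NewType only wraps the value at the type level, `layout` compares equal to the plain nested tuple it was built from.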

There are 2 functions that produce ChunksTuple:

  1. normalize_chunks_nd, which converts a user-friendly request for specific chunks into an explicit chunk layout
  2. guess_chunks, which converts a user's request for auto chunking into a specific layout. Auto chunking depends on configuration, data type, etc., so this is a separate routine.
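To make the first of these concrete, here is a rough sketch of what normalizing user input into an explicit per-axis layout could look like. `sketch_normalize` is a hypothetical stand-in, not the PR's normalize_chunks_nd. It assumes that an integer chunk size is repeated enough times to cover the axis (so regular chunks show up as uniform inner tuples), that -1 means "one chunk spanning the axis", that axis extents are positive, and that top-level None is rejected.

```python
import math
from typing import NewType

ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])


def sketch_normalize(chunks, shape):
    """Hypothetical normalizer: expand per-axis input into explicit chunk sizes."""
    if chunks is None:
        # Assumption: None means "auto" and belongs to a separate routine.
        raise ValueError("None is not accepted here; use auto-chunking instead")
    if len(chunks) != len(shape):
        raise ValueError("chunks must have one entry per array dimension")
    out = []
    for size, extent in zip(chunks, shape):
        if isinstance(size, int):
            if size == -1:
                size = extent  # -1 spans the full axis
            if size <= 0:
                raise ValueError(f"invalid chunk size: {size}")
            n = max(1, math.ceil(extent / size))
            out.append((size,) * n)  # regular axis: uniform sizes
        else:
            out.append(tuple(int(s) for s in size))  # explicit per-chunk sizes
    return ChunksTuple(tuple(out))


regular = sketch_normalize((4, -1), (10, 6))
rectilinear = sketch_normalize(((3, 3, 4), 5), (10, 10))
```

Whether the real routine clips the final chunk on a regular axis or keeps it at the nominal size is not stated in the description; this sketch keeps the nominal size.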

ResolvedChunking

ChunksTuple is used in ResolvedChunking (bad name, I would rather use ChunkSpec but that's in use already), which is this:

class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple
    inner_chunks: ChunksTuple | None

ResolvedChunking is what you get when you jointly normalize the chunks and shards keyword arguments to create_array.

I introduce some new terminology here for internal purposes. outer_chunks denotes the shape of the chunks qua stored objects, and inner_chunks denotes the shape of the subchunks inside an outer chunk, if that outer chunk uses sharding. If the outer chunk doesn't use sharding, then inner_chunks is None.
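The following self-contained sketch restates the ResolvedChunking definition above with the imports it needs and shows how a caller might branch on inner_chunks. `is_sharded` is a hypothetical helper, and the example values assume that inner_chunks describes the sub-chunk layout within a single outer chunk, which the description does not spell out.

```python
from __future__ import annotations

from typing import NamedTuple, NewType

ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])


class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple
    inner_chunks: ChunksTuple | None


def is_sharded(resolved: ResolvedChunking) -> bool:
    """Hypothetical helper: an outer chunk has sub-chunks only when sharded."""
    return resolved.inner_chunks is not None


# No sharding: each stored object is one opaque chunk of length 10.
unsharded = ResolvedChunking(ChunksTuple(((10, 10),)), None)

# Sharding (assumed layout): one stored object of length 100 with sub-chunks of 10.
sharded = ResolvedChunking(
    outer_chunks=ChunksTuple(((100,),)),
    inner_chunks=ChunksTuple(((10,) * 10,)),
)

outer, inner = sharded  # NamedTuple fields unpack positionally
```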

These two data types are used to consolidate our chunk normalization routines.

Claude's Summary

Refactors chunk and shard handling during array creation to fix a naming ambiguity where chunk_shape meant "outer grid partition" without sharding but "inner sub-chunk" with sharding, silently changing meaning based on context.

Introduces a three-layer architecture for chunk resolution:

  1. Normalization: normalize_chunks_nd and guess_chunks convert raw user input into ChunksTuple, a NewType-branded tuple[tuple[int, ...], ...] that represents both regular and rectilinear chunks uniformly. This is the only boundary between untyped user input and the internal representation.

  2. Resolution: resolve_outer_and_inner_chunks takes a ChunksTuple (the user's chunks=) and raw shard input (shards=), and returns a ResolvedChunking NamedTuple with two unambiguous fields:

    • outer_chunks: ChunksTuple (chunk sizes for the chunk grid metadata: shard sizes when sharding, chunk sizes otherwise)
    • inner_chunks: ChunksTuple | None (sub-chunk sizes for ShardingCodec, or None when sharding is not active)

  3. Metadata construction: create_chunk_grid_metadata takes a ChunksTuple and dispatches to RegularChunkGridMetadata or RectilinearChunkGridMetadata based on whether the chunks are uniform.

Key design decisions

  • ChunksTuple as a NewType: Zero runtime cost, but the type checker prevents accidentally passing raw user input where normalized chunks are expected. Both regular and rectilinear chunks use the same representation — regular is just the case where each inner tuple has uniform values.

  • inner_chunks: None models capability, not configuration: An unsharded chunk is opaque (read the whole thing or nothing). A shard has internal structure (an index that enables sub-chunk addressing). None means "this chunk has no internal structure" — it's not a flag you toggle, it's the absence of a capability.

  • normalize_chunks_nd rejects None: Top-level None means "auto" everywhere else in the codebase. Having normalize_chunks_nd silently treat it as "span all" would be a bug waiting to happen. Callers must use guess_chunks for auto-chunking.

  • Rectilinear shard detection absorbed into resolve_outer_and_inner_chunks: The function handles all shard input forms (None, "auto", dict, flat tuple, nested sequence) internally, eliminating the shards_for_partition / rectilinear_shard_meta dance that callers previously had to manage.

Changes by file

src/zarr/core/chunk_grids.py

  • Added SHARDED_INNER_CHUNK_MAX_BYTES constant (1 MiB) — replaces magic number used as the auto-chunking ceiling when sharding is active
  • Added ChunksTuple NewType — branded tuple[tuple[int, ...], ...]
  • Added ResolvedChunking NamedTuple — (outer_chunks, inner_chunks)
  • normalize_chunks_nd now returns ChunksTuple, rejects None
  • guess_chunks now returns ChunksTuple (normalizes via normalize_chunks_nd)
  • Replaced resolve_shard_shape (returned flat tuple | None) with resolve_outer_and_inner_chunks (returns ResolvedChunking, absorbs rectilinear shard detection)
  • Removed resolve_chunk_shape (was a lossy flattening wrapper)
  • Removed guess_chunks_and_shards (was dead code)

src/zarr/core/metadata/v3.py

  • create_chunk_grid_metadata now accepts ChunksTuple (no longer normalizes internally, no shape parameter)
  • is_regular_1d rewritten to short-circuit on first mismatch instead of building a full set
  • RST-style docstring syntax replaced with markdown
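For readers unfamiliar with the short-circuit change to is_regular_1d mentioned above, the sketch below contrasts the two approaches. Both function names are hypothetical, and the exact semantics of the real is_regular_1d (for example, whether a smaller trailing chunk still counts as regular) are not given in this thread; here "regular" simply means every size along the axis is equal.

```python
from collections.abc import Iterable


def is_uniform_via_set(sizes: Iterable[int]) -> bool:
    """Builds a full set of the sizes before deciding."""
    return len(set(sizes)) <= 1


def is_uniform_short_circuit(sizes: Iterable[int]) -> bool:
    """Stops at the first size that differs from the first element."""
    it = iter(sizes)
    first = next(it, None)
    # all() returns False as soon as one comparison fails.
    return all(size == first for size in it)


uniform = (4, 4, 4)
ragged = (4, 4, 2)
results = [
    (is_uniform_via_set(s), is_uniform_short_circuit(s))
    for s in (uniform, ragged, ())
]
```

The two functions agree on every input; the second avoids hashing every element when a mismatch appears early in a long axis.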

src/zarr/core/array.py

  • init_array: chunk/shard resolution reduced from ~50 lines of interleaved conditionals to a clean pipeline: normalize → resolve → build metadata. Variables chunk_shape_parsed, shard_shape_parsed, chunks_out, shards_for_partition, and rectilinear_shard_meta eliminated in favor of outer_chunks and inner_chunks.
  • _create (legacy API): same normalize-then-build pattern, consistent outer_chunks naming

tests/conftest.py

  • create_array_metadata updated to use resolve_outer_and_inner_chunks and create_chunk_grid_metadata instead of manually constructing grid metadata dicts

tests/test_chunk_grids.py

  • normalize_chunks_nd tests updated: None moved to error cases, typesize parameter removed
  • Tests use the new function signatures

tests/test_array.py

  • Shard auto-partition tests updated to use resolve_outer_and_inner_chunks
  • Auto-chunk-with-sharding test exercises the full pipeline (guess → resolve → verify divisibility)
  • Uses SHARDED_INNER_CHUNK_MAX_BYTES instead of magic 1048576

d-v-b added 3 commits April 10, 2026 18:26
Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently.
This change ensures that chunk specifications are always normalized by the same routines in all cases.

This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis.
@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Apr 12, 2026
@github-actions github-actions bot removed the "needs release notes" label Apr 12, 2026
@d-v-b d-v-b requested a review from maxrjones April 12, 2026 12:29
@codecov

codecov bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 61.53846% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.70%. Comparing base (9681cf9) to head (c40c5ff).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/chunk_grids.py | 55.17% | 26 Missing ⚠️ |
| src/zarr/core/array.py | 61.11% | 7 Missing ⚠️ |
| src/zarr/core/metadata/v3.py | 86.66% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3899      +/-   ##
==========================================
- Coverage   92.98%   92.70%   -0.28%     
==========================================
  Files          87       87              
  Lines       11246    11261      +15     
==========================================
- Hits        10457    10440      -17     
- Misses        789      821      +32     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/metadata/v3.py | 92.39% <86.66%> (-0.38%) ⬇️ |
| src/zarr/core/array.py | 97.15% <61.11%> (-0.49%) ⬇️ |
| src/zarr/core/chunk_grids.py | 89.08% <55.17%> (-7.18%) ⬇️ |

@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.01%. Comparing base (9681cf9) to head (9fc3fea).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3899      +/-   ##
==========================================
+ Coverage   92.98%   93.01%   +0.02%     
==========================================
  Files          87       87              
  Lines       11246    11262      +16     
==========================================
+ Hits        10457    10475      +18     
+ Misses        789      787       -2     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/array.py | 97.79% <100.00%> (+0.15%) ⬆️ |
| src/zarr/core/chunk_grids.py | 96.76% <100.00%> (+0.50%) ⬆️ |
| src/zarr/core/metadata/v3.py | 92.98% <100.00%> (+0.21%) ⬆️ |

... and 1 file with indirect coverage changes


Member

@maxrjones maxrjones left a comment


I really like the direction of the refactor.

I found the description of the PR somewhat misleading. The bug fix (making chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank related to duplicated normalization logic worse.

There are a few cases in the deprecated Array.create() method that possibly regress in this PR:

import warnings

import zarr


def _create_deprecated(**kwargs):
    """Call the deprecated Array.create(), suppressing the deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return zarr.Array.create(**kwargs)


def test_deprecated_underspecified_chunks_padded():
    """Fewer chunk dims than shape dims — missing dims padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30,), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_underspecified_chunks_with_none():
    """Partial chunks with None — padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_none_per_dimension_sentinel():
    """None inside chunks tuple means 'span the full axis'."""
    arr = _create_deprecated(store={}, shape=(100, 10), chunks=(10, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (10, 10)

I'm not sure whether these were intentional API design choices or quirks in the old API. It may be a good time to remove deprecated functions first, as a separate PR, to reduce the surface area for potential regressions when fixing/adding functionality to the new API.

@d-v-b
Contributor Author

d-v-b commented Apr 12, 2026

I found the description of the PR somewhat misleading. The bug fix (making chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank related to duplicated normalization logic worse.

good catch, the change that broke -1 normalization was this one: #2761. We basically forked array creation routines and didn't reach feature / testing parity with the new one 🤦

I don't see value in supporting cases like this, other than backwards compatibility.

shape=(100, 20, 10), chunks=(30,)
shape=(100, 20, 10), chunks=(30, None)

Are there any non-deprecated functions that supported this?

@d-v-b
Contributor Author

d-v-b commented Apr 12, 2026

It may be a good time to remove deprecated functions first, as a separate PR, to reduce the surface area for potential regressions when fixing/adding functionality to the new API.

💯

@maxrjones
Member

I don't see value in supporting cases like this, other than backwards compatibility. Are there any non-deprecated functions that supported this?

I couldn't find any non-deprecated cases of supporting underspecified chunks (fewer than the number of dims) or using None as a sentinel value like -1.

@d-v-b d-v-b changed the title from "refactor/simplify internal chunk representation" to "refactor: simplify internal chunk representation" Apr 13, 2026
@d-v-b
Contributor Author

d-v-b commented Apr 13, 2026

#3903 removes the deprecated methods

@d-v-b
Contributor Author

d-v-b commented Apr 13, 2026

the latest changes make the representation of chunks recursive, in order to express nested sharding. This is future-proofing the design here against the possibility that we give our high-level routines a simple way to declare nested sharding.



Development

Successfully merging this pull request may close these issues.

given shape=(s,) chunks=(-1,) should mean (s,)

2 participants