-
Notifications
You must be signed in to change notification settings - Fork 29
Flexible indexing blog post #795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
1754d44
4e4307a
7bb7cfe
12b1737
a69eda0
7906054
051ea8d
de477f3
caf7663
f561ba4
3d699bf
32a730b
10656d2
cb0a8e6
3582d85
39a8b5f
6793f22
40e1871
b8ce048
f6b0289
54ccde0
3527970
499b811
56a6e7e
29726d9
4ff07d8
82b8138
dcee56c
fce5d49
4b6d366
d10ff38
bdef441
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Large diffs are not rendered by default.
Large diffs are not rendered by default.
Large diffs are not rendered by default.
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,218 @@ | ||||||
| --- | ||||||
| title: 'Xarray Indexes: Exciting new ways to slice and dice your data!' | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| date: '2025-08-11' | ||||||
| authors: | ||||||
| - name: Benoît Bovy | ||||||
| github: benbovy | ||||||
| - name: Scott Henderson | ||||||
| github: scottyhq | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @scottyhq since you authored this blog post I think it would make sense to add you as 1st author here, or at least add something that gives you credit for authoring this post. I'm thinking of authoring a follow-up technical blog post on how to customize Xarray's behavior via indexes for common operations such as assigning, indexing and alignment (with flowcharts illustrating the logic of Xarray internals and how the index API is called).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, either way is fine by me. want to make sure you're recognized as leading the effort overall! |
||||||
| - name: Deepak Cherian | ||||||
| github: dcherian | ||||||
| - name: Justus Magin | ||||||
| github: keewis | ||||||
| summary: 'An introduction to customizable coordinate-based data selection and alignment for more efficient handling of both traditional and more exotic data structures' | ||||||
| --- | ||||||
|
|
||||||
| \_TLDR: Xarray>2025.6 has been through a major refactoring of its internals that makes coordinate-based data selection and alignment customizable, enabling more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower! | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| <figure> | ||||||
| <img src='/posts/flexible-indexing/summary-slide.png' /> | ||||||
| <figcaption> | ||||||
| *Summary schematic from Deepak Cherian's [2025 SciPy | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| Presentation](https://www.youtube.com/watch?v=I-NHCuLhRjY) highlighting new | ||||||
| custom Indexes and usecases. [Link to full slide | ||||||
| deck](https://docs.google.com/presentation/d/1sQU2N0-ThNZM8TUhsZy-kT0bZnu0H5X0FRJz2eKwEpA/edit?slide=id.g37373ba88e6_0_214#slide=id.g37373ba88e6_0_214)* | ||||||
| </figcaption> | ||||||
| </figure> | ||||||
|
|
||||||
| {/* This is a comment that won't be rendered! */} | ||||||
|
keewis marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ## Indexing basics | ||||||
|
|
||||||
| First thing's first, _what is an `index` and why is it helpful?_ | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| > In brief, an _index_ makes repeated selection of data more efficient. Xarray Indexes connect coordinate labels to associated data values and encode important contextual information about the coordinate space. | ||||||
|
|
||||||
| Examples of indexes are all around you and are a fundamental way to organize and simplify access to information. | ||||||
| If you want a book about Natural Sciences, you can go to your local library branch and head straight to section `500`, or if you're in the mood for a good novel go to section `800`. Connecting thematic labels with numbers is a classic indexing system that's been around for hundreds of years [(Dewey Decimal System, 1876)](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification). | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books! | ||||||
|
|
||||||
| The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurements `Y=[10,20,30,40,50,60]` at six coordinate positions `X=[1, 2, 4, 8, 16, 32]`. _What was our measurement at `X=8`?_ | ||||||
| To answer this in code, we have need an index that is simply a key:value mapping between the coordinate values and integer positions `i=[0,1,2,3,4,5]` in the coordinates array. | ||||||
| With only 6 coordinates, we easily see `X[3]=8` so our measurement of interest is `Y[3]=40`. | ||||||
|
|
||||||
| > 💡 **Note:** for large datasets we should loop over _all_ the coordinates to ensure there are no repeated values! This initial pass over all the coordinates to build an _index_ may take significant time and may not always be desirable. | ||||||
|
|
||||||
| ## Pandas.Index | ||||||
|
|
||||||
| Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html#indexing-with-dimension-names) allows a more expressive and simple syntax in which you don't have to think about the index (`da.sel(x=8) = 40`). Up until now, Xarray has relied exclusively on [Pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html), which is still used by default: | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ```python | ||||||
| x = np.array([1, 2, 4, 8, 16, 32]) | ||||||
| y = np.array([10, 20, 30, 40, 50, 60]) | ||||||
| da = xr.DataArray(y, coords={'x': x}) | ||||||
| da | ||||||
| ``` | ||||||
|
|
||||||
| <RawHTML filePath='/posts/flexible-indexing/da-pandas-repr.html' /> | ||||||
|
|
||||||
| ```python | ||||||
| da.sel(x=8) | ||||||
| # 40 | ||||||
| ``` | ||||||
|
|
||||||
| ## Alternatives to Pandas.Index | ||||||
|
|
||||||
| Importantly, a loop over all the coordinate values is not the only way to create an index. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| You might recognize that our coordinates about can in fact be represented by a function `X(i)=2**i` where `i` is the integer position! Given that function we can quickly get measurement values at any position: `Y(X=4)` = `Y[sqrt(4)]` = `Y[2]=30`. Xarray now has a [CoordinateTransformIndex](https://xarray-indexes.readthedocs.io/blocks/transform.html) to handle this type of on-demand lookup of coordinate positions. | ||||||
|
|
||||||
| ### Xarray RangeIndex | ||||||
|
|
||||||
| A simple special case of `CoordinateTransformIndex` is a `RangeIndex` where coordinates can be defined by a start, stop, and uniform step size. `Pandas.RangeIndex` only supports _integers_, whereas Xarray handles floating-point values. Coordinate look-up is performed on-the-fly rather than loading all values into memory up-front when creating a Dataset, which is critical for the example below that has a coordinate array of 7TB! | ||||||
|
|
||||||
| ```python | ||||||
| from xarray.indexes import RangeIndex | ||||||
|
|
||||||
| index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array! | ||||||
| ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index)) | ||||||
| ds | ||||||
| ``` | ||||||
|
|
||||||
| <RawHTML filePath='/posts/flexible-indexing/ds-range-repr.html' /> | ||||||
|
|
||||||
| Selection preserves the RangeIndex and does not require loading all the coordinates into memory. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ``` | ||||||
| sliced = ds.isel(x=slice(1_000, 50_000, 100)) | ||||||
| sliced.x | ||||||
| ``` | ||||||
|
|
||||||
| <RawHTML filePath='/posts/flexible-indexing/ds-range-slice-repr.html' /> | ||||||
|
|
||||||
| ## Third-party custom Indexes | ||||||
|
|
||||||
| In addition to a few new built-in indexes, Xarray can be extended by third party indexes defined outside of the core codebase. This capability is very important to support a multitude of domain-specific data structures. | ||||||
|
|
||||||
| ### Rasterix RasterIndex | ||||||
|
|
||||||
| Earlier we mentioned that coordinates often have a _functional representation_. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| For 2D raster images, this function often takes the form of an [Affine Transform](https://en.wikipedia.org/wiki/Affine_transformation). | ||||||
| The [rasterix](https://github.com/xarray-contrib/rasterix) library extends Xarray with a `RasterIndex` which computes coordinates for geospatial images such as GeoTiffs via Affine Transform. | ||||||
|
|
||||||
| Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordinates into memory, note that a new Affine is defined after the slicing operation: | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ```python | ||||||
| # 811816322401 values! | ||||||
| import rasterix | ||||||
|
|
||||||
| #26475 GeoTiffs represented by a GDAL VRT | ||||||
| da = xr.open_dataarray('https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt', | ||||||
| engine='rasterio', | ||||||
| parse_coordinates=False).squeeze().pipe( | ||||||
| rasterix.assign_index | ||||||
| ) | ||||||
| da | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| ``` | ||||||
|
|
||||||
| <RawHTML filePath='/posts/flexible-indexing/da-rasterix-repr.html' /> | ||||||
|
|
||||||
| ```python | ||||||
| print('Original geotransform:\n', da.xindexes['x'].transform()) | ||||||
| da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1,-49.0)) | ||||||
| print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform()) | ||||||
| ``` | ||||||
|
|
||||||
| ``` | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| Original geotransform: | ||||||
| | 0.00, 0.00,-180.00| | ||||||
| | 0.00,-0.00, 84.00| | ||||||
| | 0.00, 0.00, 1.00| | ||||||
|
|
||||||
| Sliced geotransform: | ||||||
| | 0.00, 0.00,-122.40| | ||||||
| | 0.00,-0.00,-47.10| | ||||||
| | 0.00, 0.00, 1.00| | ||||||
| ``` | ||||||
|
|
||||||
| ### XProj CRSIndex | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
| > real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. [Xarray Docs](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html#what-labels-enable) | ||||||
|
|
||||||
| We often think about metadata providing context for _measurement values_ but metadata is also critical for coordinates! | ||||||
| In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system. | ||||||
| In other words, do they share the same origin and scale? | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| There are currently over 7000 commonly used [Coordinate Reference Systems (CRS)](https://spatialreference.org/ref/epsg/) for geospatial data in the authoritative EPSG database! | ||||||
| And of course an infinite number of custom-defined CRSs. | ||||||
| [xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects an automatic awareness of the coordinate reference system operations like `xr.align()`, which can raise an an informative error when there is a CRS mismatch: | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ```python | ||||||
| from xproj import CRSIndex | ||||||
| lons1 = np.arange(-125, -120, 1) | ||||||
| lons2 = np.arange(-122, -118, 1) | ||||||
| ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267) | ||||||
| ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326) | ||||||
| ds1 + ds2 | ||||||
| ``` | ||||||
|
|
||||||
| ```pytb | ||||||
| MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs' | ||||||
| ``` | ||||||
|
|
||||||
| ### XVec GeometryIndex | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
|
||||||
| A "vector data cube" is an n-D array that has at least one dimension indexed by a 2-D array of vector geometries. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| Large vector cubes can take advantage of an [R-tree spatial index](https://en.wikipedia.org/wiki/R-tree) for efficiently selecting vector geometries within a given bounding box. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
| The `XVec.GeometryIndex` provides this functionality, below is a short code snippet but please refer to the [documentation for more](https://xvec.readthedocs.io/en/stable/indexing.html)! | ||||||
|
|
||||||
| ```python | ||||||
| import xvec | ||||||
| import geopandas as gpd | ||||||
| from geodatasets import get_path | ||||||
|
|
||||||
| # Dataset that contains demographic data indexed by U.S. counties | ||||||
| counties = gpd.read_file(get_path("geoda.natregimes")) | ||||||
|
|
||||||
| cube = xr.Dataset( | ||||||
| data_vars=dict( | ||||||
| population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]), | ||||||
| unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]), | ||||||
| ), | ||||||
| coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]), | ||||||
| ).xvec.set_geom_indexes("county", crs=counties.crs) | ||||||
| cube | ||||||
| ``` | ||||||
|
|
||||||
| <RawHTML filePath='/posts/flexible-indexing/xvec-repr.html' /> | ||||||
|
|
||||||
| ```python | ||||||
| # Efficient selection using shapely.STRtree | ||||||
| from shapely.geometry import box | ||||||
|
|
||||||
| subset = cube.xvec.query( | ||||||
| "county", | ||||||
| box(-125.4, 40, -120.0, 50), | ||||||
| predicate="intersects", | ||||||
| ) | ||||||
|
|
||||||
| subset['population'].xvec.plot(col='year'); | ||||||
| ``` | ||||||
|
|
||||||
| <p align='center'> | ||||||
| <img src='/posts/flexible-indexing/xvecfig.png' /> | ||||||
| </p> | ||||||
|
|
||||||
| ### Even more examples! | ||||||
|
|
||||||
| Be sure to check out the [Gallery of Custom Index Examples](https://xarray-indexes.readthedocs.io) for more detailed examples of all the indexes mentioned in this post and more! | ||||||
|
|
||||||
| ## What's next? | ||||||
|
|
||||||
| While we're extremely excited about what can _already_ be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work. | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| Have an idea for your own custom index? Check out [this section of the Xarray documentation](https://docs.xarray.dev/en/stable/internals/how-to-create-custom-index.html). | ||||||
|
scottyhq marked this conversation as resolved.
Outdated
|
||||||
|
|
||||||
| ## Acknowledgments | ||||||
|
|
||||||
| This work would not have been possible without technical input from the Xarray core team and community! | ||||||
| Several developers received essential funding from a [CZI Essential Open Source Software for Science (EOSS) grant](https://xarray.dev/blog/czi-eoss-grant-conclusion) as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345. | ||||||
Uh oh!
There was an error while loading. Please reload this page.