Fix pandas 3.0 / xarray 2025+ compat across groupby, _xarray, run, netcdf by benmsanderson · Pull Request #321 · openscm/scmdata

benmsanderson · 2026-05-23T18:12:05Z

Summary

Restores scmdata compatibility with pandas 3.0 and xarray 2025+. Four targeted patches, each scoped to a single source file, with regression tests and one shared changelog fragment.

Patch 1 — `groupby.py`: numeric-column detection on `StringDtype`

RunGroupBy.__init__ called numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta columns. pandas 3.0 ships pd.StringDtype as the default for inferred string columns, and numpy.issubdtype rejects StringDtype:

TypeError: Cannot interpret '<StringDtype(storage='python', na_value=nan)>' as a data type

Routed through pandas.api.types.is_numeric_dtype, which returns False for StringDtype and True for numeric dtypes.
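A minimal sketch of the difference, assuming only pandas and NumPy are installed. The `meta` frame and the two helper functions are hypothetical stand-ins, not the actual body of `RunGroupBy.__init__`. The string column is given an explicit `StringDtype` so the behaviour does not depend on which pandas version does the inference.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical meta table: one string column, one integer column.
meta = pd.DataFrame(
    {
        "scenario": pd.Series(["ssp119", "ssp245"], dtype="string"),
        "ensemble_member": [0, 1],
    }
)


def is_numeric_old(col):
    """Old-style check: NumPy has to interpret the dtype itself."""
    return np.issubdtype(col.dtype, np.number)


def is_numeric_new(col):
    """New-style check: pandas understands its own extension dtypes."""
    return is_numeric_dtype(col.dtype)


# The old check cannot interpret StringDtype and raises TypeError.
try:
    is_numeric_old(meta["scenario"])
    old_check_raised = False
except TypeError:
    old_check_raised = True

# The new check classifies both columns without raising.
numeric_columns = [name for name in meta.columns if is_numeric_new(meta[name])]
print(old_check_raised, numeric_columns)
```

Passing the dtype to pandas rather than NumPy keeps the check valid for any extension dtype, not only `StringDtype`.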

Patch 2 — `_xarray.py`: `Series` positional indexing

_many_to_one ended with checker.groupby(col2).count().max()[0]. The chained .max() returns a label-indexed Series, and pandas 3.0 removed positional integer indexing on those:

KeyError: 0

Replaced with .iloc[0] — explicit positional, same semantics.
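As an illustration, the same chain can be reproduced on a toy frame. `max_group_count` is a hypothetical helper written for this sketch; it mirrors the `groupby(...).count().max()` chain quoted above but is not claimed to be the full body of `_many_to_one`.

```python
import pandas as pd


def max_group_count(df, col1, col2):
    """Largest number of distinct col1 values seen for any single col2 value."""
    checker = df[[col1, col2]].drop_duplicates()
    # .count() leaves one column (col1) indexed by the col2 values, and
    # .max() reduces it to a Series labelled by column name. Indexing that
    # Series with [0] is a label lookup, so .iloc[0] is used to ask for
    # the first element by position instead.
    return int(checker.groupby(col2).count().max().iloc[0])


# Each model maps to exactly one scenario.
one_to_one = pd.DataFrame(
    {"scenario": ["a", "a", "b"], "model": ["m1", "m1", "m2"]}
)
# Model "m1" maps to two different scenarios.
many_to_one = pd.DataFrame(
    {"scenario": ["a", "b", "c"], "model": ["m1", "m1", "m2"]}
)

one_to_one_count = max_group_count(one_to_one, "scenario", "model")
many_to_one_count = max_group_count(many_to_one, "scenario", "model")
print(one_to_one_count, many_to_one_count)
```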

Patch 3 — `run.py`: read-only `DataFrame.values`

ScmRun.convert_unit, _binary_op, and _unary_op all wrote results back via self._df.values[:] = .... pandas 3.0 makes DataFrame.values return a read-only ndarray, so the in-place write raises:

ValueError: assignment destination is read-only

Switched every such site to self._df.iloc[:, :] = ..., which goes through pandas' indexer rather than the underlying ndarray. The _binary_op / _unary_op sites additionally wrap the RHS in np.asarray(..., dtype=float): .iloc is dtype-strict where .values[:] was not, so the boolean arrays returned by comparison ops (lt, eq, ne, ...) need the explicit cast to keep landing as float64 and preserve the prior behaviour.
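The write-back pattern can be sketched on a plain DataFrame. `write_back` and the toy frame are hypothetical names for illustration only; the real sites operate on `self._df` inside `ScmRun`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the timeseries frame: rows are runs, columns are years.
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    index=["run_a", "run_b"],
    columns=[2020, 2030],
)


def write_back(frame, result):
    """Overwrite every cell through the pandas indexer, casting to float."""
    frame.iloc[:, :] = np.asarray(result, dtype=float)


# Unit-conversion style write: scale every value.
write_back(df, df.to_numpy() * 1000.0)
scaled = df.copy()

# Comparison style write: the boolean result is stored as 0.0 / 1.0.
write_back(df, scaled.to_numpy() > 2500.0)
print(df)
print(df.dtypes)
```

Because the assignment goes through `.iloc`, the index and columns are untouched and only the cell values change.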

Patch 4 — `netcdf.py`: deprecated `use_cftime` kwarg

_read_nc called xr.load_dataset(fname, use_cftime=True). xarray 2025+ deprecates the bare kwarg in favour of passing an xr.coders.CFDatetimeCoder via decode_times, firing a FutureWarning on every ScmRun.from_nc() call and flooding downstream notebook output.

Routes through the new API when xr.coders.CFDatetimeCoder is available, falling back to the bare use_cftime on xarray older than 2024.09.
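One way to express that feature detection is sketched below. `cftime_decode_kwargs` is a hypothetical helper name, not the code in `netcdf.py`. It takes the xarray module as an argument so that both branches can be exercised here with stand-in objects instead of two real xarray installs.

```python
from types import SimpleNamespace


def cftime_decode_kwargs(xr_module):
    """Build load_dataset kwargs that request cftime decoding."""
    coders = getattr(xr_module, "coders", None)
    coder_cls = getattr(coders, "CFDatetimeCoder", None)
    if coder_cls is not None:
        # Newer xarray: pass a coder instance via decode_times.
        return {"decode_times": coder_cls(use_cftime=True)}
    # Older xarray: the bare keyword is the only option.
    return {"use_cftime": True}


class FakeCoder:
    """Stand-in for xr.coders.CFDatetimeCoder, used only in this demo."""

    def __init__(self, use_cftime=None):
        self.use_cftime = use_cftime


# Stand-ins for an old and a new xarray module.
old_xarray = SimpleNamespace()
new_xarray = SimpleNamespace(coders=SimpleNamespace(CFDatetimeCoder=FakeCoder))

legacy_kwargs = cftime_decode_kwargs(old_xarray)
modern_kwargs = cftime_decode_kwargs(new_xarray)
print(legacy_kwargs, sorted(modern_kwargs))
```

With a real install the call would look like `xr.load_dataset(fname, **cftime_decode_kwargs(xr))`. Checking for the attribute rather than comparing version strings avoids hard-coding the release in which the coder became public.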

Impact

Each of the four underlying issues affects a different category of ScmRun usage on pandas 3.0 or xarray 2025+:

Patch 1: any multi-scenario ScmRun.groupby (and everything built on it).
Patch 2: ScmRun.to_xarray / ScmRun.to_nc with extras.
Patch 3: ScmRun.convert_unit, all arithmetic operators, all comparison operators.
Patch 4: cosmetic (warning flood), but it's currently several lines of stderr per call.

After the four fixes the relevant code paths run cleanly on both pandas 2.x and pandas 3.x stacks; xarray 2025+ no longer emits the use_cftime deprecation.

Tests

Added regression tests at the patch sites:

tests/unit/test_xarray.py: direct unit tests for _many_to_one on one-to-one and many-to-one inputs.
tests/unit/test_netcdf.py: a from_nc round-trip that asserts no use_cftime FutureWarning is emitted.
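A "no such warning" assertion of the kind described for test_netcdf.py can be sketched with the standard library alone. `call_without_warning` and the two reader functions are hypothetical and only illustrate the pattern; the actual test in the PR may be written differently (for example with pytest helpers).

```python
import warnings


def call_without_warning(func, category, fragment):
    """Call func and fail if it emits a matching warning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        result = func()
    offending = [
        w
        for w in caught
        if issubclass(w.category, category) and fragment in str(w.message)
    ]
    if offending:
        raise AssertionError(f"unexpected warning: {offending[0].message}")
    return result


def noisy_reader():
    """Stand-in for a reader that still passes the deprecated kwarg."""
    warnings.warn("use_cftime is deprecated", FutureWarning)
    return "data"


def quiet_reader():
    """Stand-in for a reader that uses the new API."""
    return "data"


quiet_result = call_without_warning(quiet_reader, FutureWarning, "use_cftime")

try:
    call_without_warning(noisy_reader, FutureWarning, "use_cftime")
    noisy_detected = False
except AssertionError:
    noisy_detected = True
print(quiet_result, noisy_detected)
```

Matching on both the category and a message fragment keeps the check from failing on unrelated warnings raised by other dependencies.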

The existing convert_unit, _binary_op, and _unary_op test suites (~80 tests) all pass under pandas 3.0 with the patch. They were also the diagnostic that surfaced the bool-array dtype edge case during porting.

Backward compatibility: pandas.api.types.is_numeric_dtype, Series.iloc[0], and DataFrame.iloc[:, :] = ... have all been available since well before the lower bounds in pyproject.toml. xr.coders.CFDatetimeCoder is newer, which is why Patch 4 keeps the bare use_cftime fallback for older xarray.

Driver

This PR is the dependency-side fix that unblocks the AR7-cycle modernisation work on openscm-runner. Once it's released, the runner deletes its in-tree _scmdata_patches shim and pins scmdata>=<the-release>.

…itional indexing)

pandas 3.0 introduced two changes that scmdata 0.18 trips on for any multi-scenario ScmRun:

1. Default StringDtype inference. String columns now come back as pd.StringDtype rather than object. RunGroupBy.__init__ called numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta columns; on StringDtype this raises 'TypeError: Cannot interpret <StringDtype(...)> as a data type'. Route the check through pd.api.types.is_numeric_dtype instead, which returns False for StringDtype and True for numeric dtypes.

2. Removal of Series positional integer indexing. _xarray._many_to_one ended with checker.groupby(col2).count().max()[0]. max() on a DataFrame returns a label-indexed Series and pandas 3.0 removed positional integer indexing on those, so [0] raises 'KeyError: 0'. Use .iloc[0]: same semantics, explicit positional.

Both calls are exercised by every multi-scenario ScmRun. The second in particular blocks ScmRun.to_nc entirely on pandas 3.0, so any downstream that streams scenarios to disk (e.g. openscm-runner's NetCDFChunkWriter) currently cannot run.

The fixes are backward-compatible: pd.api.types.is_numeric_dtype and Series.iloc[0] have been pandas's canonical APIs since well before pandas 2.0.

Mirror of scripts/run_rcmip_fair2.py for the CICEROSCMPY2 adapter: runs every SSP in the RCMIP fixture (ssp119, ssp126, ssp245, ssp370 and the two lowNTCF variants, ssp434, ssp460, ssp534-over, ssp585) against N posterior members of a CICERO-SCM v2.x distribution and prints a per-scenario 2100 GSAT / CO2 / ERF summary.

Defaults to splice mode (user emissions + bundled ssp245 historical), which is the path the demo uses. Pass --cicero-bundle-dir to switch to bundle mode (Marit RCMIP-aligned setup) where gaspam and conc files are resolved per-scenario from inside the bundle directory.

Smoke-tested end-to-end against draw_samples_500.json with 20 members: ~44 s for 10 scenarios x 20 members on a single thread. 2100 GSAT medians are systematically warmer than the FaIRv2 numbers on the same protocol (e.g. ssp245 3.77 K vs FaIR 2.63 K, ssp585 6.72 K vs FaIR 4.82 K). The CICEROSCM bundle's ECS distribution is wider than FaIR's, and the 20-member subset is small relative to the full 500-member posterior, so the offset is consistent with the expected inter-model spread.

Results are kept in memory: the NetCDFChunkWriter path currently trips a scmdata-pandas-3 incompatibility (fixed in PR #11 / upstream openscm/scmdata#321), so writer support stays out of this script until those land in main.

pandas 3.0 makes DataFrame.values return a read-only ndarray, so the existing self._df.values[:] = ... idiom in convert_unit, _binary_op and _unary_op raises ValueError: assignment destination is read-only. Switch to .iloc[:, :] = ..., which goes through pandas' indexer rather than the underlying ndarray and so isn't affected by the read-only change.

_binary_op and _unary_op additionally wrap the right-hand side in np.asarray(..., dtype=float) to preserve the prior silent bool-to-float cast that comparison ops (lt, eq, ne, etc.) relied on; .iloc is dtype-strict where the old .values write was not, so the explicit cast keeps the historical semantics of the result frame.

…ation

xarray 2025+ deprecates the bare use_cftime kwarg on `xr.load_dataset`, recommending instead that callers pass an `xr.coders.CFDatetimeCoder` via `decode_times`. The deprecation emits a FutureWarning on every `ScmRun.from_nc()` / `nc_to_run` call, which floods downstream notebook output. Prefer the new API when `xr.coders.CFDatetimeCoder` is available and fall back to the bare kwarg on older xarray (< 2024.09) where the new coder does not exist.


benmsanderson · 2026-05-31T22:51:46Z

Hey @znicholls - this is ready when maintainers have a minute.

Four targeted fixes. The openscm-runner PR needs this first if we want to avoid shims, happy to address anything you flag.

-b

benmsanderson added 2 commits May 23, 2026 20:11

Add changelog fragment for PR openscm#321

ebeb601

This was referenced May 23, 2026

Remove _scmdata_patches monkey-patch once upstream scmdata is fixed benmsanderson/openscm-runner#10

Open

scmdata pandas-3 compatibility patches (in-tree shim, tracking #10) benmsanderson/openscm-runner#11

Open

benmsanderson added 3 commits May 31, 2026 01:31

Expand PR openscm#321 changelog to cover convert_unit, binary/unary o…

bb9a9be

…ps, netcdf

benmsanderson changed the title ~~Fix two pandas 3.0 incompatibilities (StringDtype groupby, Series positional indexing)~~ Fix pandas 3.0 / xarray 2025+ compat across groupby, _xarray, run, netcdf May 30, 2026

benmsanderson wants to merge 5 commits into openscm:main from benmsanderson:fix/pandas3-compat
