Fix pandas 3.0 / xarray 2025+ compat across groupby, _xarray, run, netcdf #321

Open
benmsanderson wants to merge 5 commits into openscm:main from benmsanderson:fix/pandas3-compat

benmsanderson commented May 23, 2026

Summary

Restores scmdata compatibility with pandas 3.0 and xarray 2025+. Four targeted patches, each scoped to a single source site, with regression tests and one shared changelog fragment.

Patch 1 — groupby.py: numeric-column detection on StringDtype

RunGroupBy.__init__ called numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta columns. pandas 3.0 ships pd.StringDtype as the default for inferred string columns, and numpy.issubdtype rejects StringDtype:

TypeError: Cannot interpret '<StringDtype(storage='python', na_value=nan)>' as a data type

The check now goes through pandas.api.types.is_numeric_dtype, which returns False for StringDtype and True for numeric dtypes.
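A minimal sketch of the difference between the two checks, not the actual RunGroupBy code. The meta table, its column names, and the helper is_numeric_via_numpy are hypothetical; the string column is given an explicit StringDtype so the example behaves the same on pandas 2.x and 3.x.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical meta table: one string column, one numeric column.
meta = pd.DataFrame(
    {
        # Explicit StringDtype, so the result does not depend on pandas' default inference.
        "scenario": pd.array(["ssp119", "ssp245"], dtype="string"),
        "ensemble_member": [0, 1],
    }
)


def is_numeric_via_numpy(dtype):
    """Old-style check: numpy cannot interpret pandas extension dtypes."""
    return np.issubdtype(dtype, np.number)


try:
    is_numeric_via_numpy(meta["scenario"].dtype)
    numpy_check_failed = False
except TypeError:
    # numpy raises "Cannot interpret ... as a data type" for StringDtype.
    numpy_check_failed = True

# The pandas helper accepts extension dtypes and simply returns False for strings.
numeric_columns = [c for c in meta.columns if is_numeric_dtype(meta[c].dtype)]
```

Here numeric_columns contains only the integer column, and the numpy-based check fails on the string column rather than returning False.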

Patch 2 — _xarray.py: Series positional indexing

_many_to_one ended with checker.groupby(col2).count().max()[0]. The chained .max() returns a label-indexed Series, and pandas 3.0 removed positional integer indexing on those:

KeyError: 0

Replaced with .iloc[0], which is explicitly positional and has the same semantics.
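To illustrate why the bare [0] is ambiguous, the sketch below reproduces the shape of the expression on a small, hypothetical frame. The column names and data are assumptions, and this is not the body of _many_to_one.

```python
import pandas as pd

# Hypothetical checker frame: two scenarios share one model.
checker = pd.DataFrame(
    {
        "scenario": ["ssp119", "ssp126", "ssp245"],
        "model": ["m1", "m1", "m2"],
    }
)

# count() gives one row per model; max() then collapses the rows into a
# Series whose index holds the remaining column labels ("scenario"), not 0, 1, ...
max_counts = checker.groupby("model").count().max()

# .iloc[0] asks for the first element by position, regardless of the labels.
largest_group = max_counts.iloc[0]
```

Because the index of max_counts is the label "scenario", the label-based lookup max_counts["scenario"] returns the same value, whereas an integer key has no matching label.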

Patch 3 — run.py: read-only DataFrame.values

ScmRun.convert_unit, _binary_op, and _unary_op all wrote results back via self._df.values[:] = .... pandas 3.0 makes DataFrame.values return a read-only ndarray, so the in-place write raises:

ValueError: assignment destination is read-only

Switched all four sites to self._df.iloc[:, :] = ..., which goes through pandas' indexer rather than the underlying ndarray. The _binary_op / _unary_op sites additionally wrap the RHS in np.asarray(..., dtype=float), because .iloc is dtype-strict where .values[:] was not, and the boolean arrays returned by comparison ops (lt, eq, ne, ...) need to keep landing as float64 to preserve the prior behaviour.
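The write-back pattern can be sketched as follows. The frame stands in for ScmRun's internal _df and its contents are made up; the point is only that casting the boolean result to float before the .iloc assignment keeps the frame float64.

```python
import numpy as np
import pandas as pd

# Stand-in for the internal timeseries frame (hypothetical values).
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=[2020, 2030])

# A comparison produces a boolean ndarray.
mask = df.to_numpy() < 2.5

# Write through pandas' indexer instead of the (possibly read-only) .values
# array, casting explicitly so the booleans land as 0.0 / 1.0 floats.
df.iloc[:, :] = np.asarray(mask, dtype=float)
```

After the assignment every column is still float64, with 1.0 where the comparison held and 0.0 elsewhere.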

Patch 4 — netcdf.py: deprecated use_cftime kwarg

_read_nc called xr.load_dataset(fname, use_cftime=True). xarray 2025+ deprecates the bare kwarg in favour of passing an xr.coders.CFDatetimeCoder via decode_times, firing a FutureWarning on every ScmRun.from_nc() call and flooding downstream notebook output.

Routes through the new API when xr.coders.CFDatetimeCoder is available, falling back to the bare use_cftime on xarray older than 2024.09.
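One way to express the feature detection is a small helper that builds the keyword arguments for xr.load_dataset. The helper name cftime_decode_kwargs is hypothetical and this is a sketch, not the code in _read_nc. It takes the xarray module as a parameter so the two branches can be exercised with stand-in objects, as done below.

```python
from types import SimpleNamespace


def cftime_decode_kwargs(xr_module):
    """Return load_dataset kwargs that request cftime decoding.

    Prefers xr.coders.CFDatetimeCoder when the installed xarray exposes it,
    otherwise falls back to the bare use_cftime keyword.
    """
    coders = getattr(xr_module, "coders", None)
    coder_cls = getattr(coders, "CFDatetimeCoder", None)
    if coder_cls is not None:
        return {"decode_times": coder_cls(use_cftime=True)}
    return {"use_cftime": True}


class _FakeCoder:
    """Stand-in for CFDatetimeCoder, used only to exercise the new-API branch."""

    def __init__(self, use_cftime=None):
        self.use_cftime = use_cftime


# Stand-in for a recent xarray that exposes coders.CFDatetimeCoder.
new_style = cftime_decode_kwargs(
    SimpleNamespace(coders=SimpleNamespace(CFDatetimeCoder=_FakeCoder))
)
# Stand-in for an older xarray without the coders namespace.
old_style = cftime_decode_kwargs(SimpleNamespace())
```

With a real installation the call would look like xr.load_dataset(fname, **cftime_decode_kwargs(xr)).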

Impact

Each of the four patches addresses a different category of ScmRun usage, three of which are currently broken outright on pandas 3.0:

  • Patch 1: any multi-scenario ScmRun.groupby (and everything built on it).
  • Patch 2: ScmRun.to_xarray / ScmRun.to_nc with extras.
  • Patch 3: ScmRun.convert_unit, all arithmetic operators, all comparison operators.
  • Patch 4: cosmetic (warning flood), but it's currently several lines of stderr per call.

After the four fixes the relevant code paths run cleanly on both pandas 2.x and pandas 3.x stacks; xarray 2025+ no longer emits the use_cftime deprecation.

Tests

Added regression tests at the patch sites:

  • tests/unit/test_xarray.py: direct unit tests for _many_to_one on one-to-one and many-to-one inputs.
  • tests/unit/test_netcdf.py: a from_nc round-trip that asserts no use_cftime FutureWarning is emitted.
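A warning-free assertion of the kind described for test_netcdf.py could be sketched with the standard library alone. The helper name assert_no_future_warning and the two toy functions are hypothetical; the real test presumably calls ScmRun.from_nc instead.

```python
import warnings


def assert_no_future_warning(func, *args, match="use_cftime", **kwargs):
    """Call func and fail if it emits a FutureWarning whose message contains match."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        result = func(*args, **kwargs)
    offending = [
        w
        for w in caught
        if issubclass(w.category, FutureWarning) and match in str(w.message)
    ]
    assert not offending, f"unexpected FutureWarning: {offending[0].message}"
    return result


def noisy_reader():
    # Toy function that mimics the deprecated call path.
    warnings.warn("use_cftime is deprecated", FutureWarning)
    return "data"


def quiet_reader():
    # Toy function that mimics the patched call path.
    return "data"


quiet_result = assert_no_future_warning(quiet_reader)

try:
    assert_no_future_warning(noisy_reader)
    noisy_detected = False
except AssertionError:
    noisy_detected = True
```

Filtering on the message text keeps the check from failing on unrelated FutureWarnings raised by other libraries.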

The existing convert_unit, _binary_op, and _unary_op test suites (~80 tests) all pass under pandas 3.0 with the patch. They were also the diagnostic that surfaced the bool-array dtype edge case during porting.

Backward compatibility: pandas.api.types.is_numeric_dtype, Series.iloc[0], and DataFrame.iloc[:, :] = ... have all been available since well before the lower bounds in pyproject.toml. xr.coders.CFDatetimeCoder is newer, so that path is guarded by the fallback to the bare use_cftime kwarg on older xarray.

Driver

This PR is the dependency-side fix that unblocks the AR7-cycle modernisation work on openscm-runner. Once it's released, the runner deletes its in-tree _scmdata_patches shim and pins scmdata>=<the-release>.

…itional indexing)

pandas 3.0 introduced two changes that scmdata 0.18 trips on for any
multi-scenario ScmRun:

1. Default StringDtype inference. String columns now come back as
   pd.StringDtype rather than object. RunGroupBy.__init__ called
   numpy.issubdtype(col.dtype, numpy.number) to detect numeric meta
   columns; on StringDtype this raises
   'TypeError: Cannot interpret <StringDtype(...)> as a data type'.
   Route the check through pd.api.types.is_numeric_dtype instead,
   which returns False for StringDtype and True for numeric dtypes.

2. Removal of Series positional integer indexing.
   _xarray._many_to_one ended with checker.groupby(col2).count().max()[0].
   max() on a DataFrame returns a label-indexed Series and pandas 3.0
   removed positional integer indexing on those, so [0] raises
   'KeyError: 0'. Use .iloc[0]: same semantics, explicit positional.

Both calls are exercised by every multi-scenario ScmRun. The second
in particular blocks ScmRun.to_nc entirely on pandas 3.0, so any
downstream that streams scenarios to disk (e.g. openscm-runner's
NetCDFChunkWriter) currently cannot run.

The fixes are backward-compatible: pd.api.types.is_numeric_dtype and
Series.iloc[0] have been pandas's canonical APIs since well before
pandas 2.0.
benmsanderson added a commit to benmsanderson/openscm-runner that referenced this pull request May 23, 2026
Mirror of scripts/run_rcmip_fair2.py for the CICEROSCMPY2 adapter:
runs every SSP in the RCMIP fixture (ssp119, ssp126, ssp245, ssp370
and the two lowNTCF variants, ssp434, ssp460, ssp534-over, ssp585)
against N posterior members of a CICERO-SCM v2.x distribution and
prints a per-scenario 2100 GSAT / CO2 / ERF summary.

Defaults to splice mode (user emissions + bundled ssp245 historical),
which is the path the demo uses. Pass --cicero-bundle-dir to switch
to bundle mode (Marit RCMIP-aligned setup) where gaspam and conc
files are resolved per-scenario from inside the bundle directory.

Smoke-tested end-to-end against draw_samples_500.json with 20
members: ~44 s for 10 scenarios x 20 members on a single thread.
2100 GSAT medians are systematically warmer than the FaIRv2 numbers
on the same protocol (e.g. ssp245 3.77 K vs FaIR 2.63 K, ssp585
6.72 K vs FaIR 4.82 K). The CICEROSCM bundle's ECS distribution is
wider than FaIR's, and the 20-member subset is small relative to the
full 500-member posterior, so the offset is consistent with the
expected inter-model spread.

Results are kept in memory: the NetCDFChunkWriter path currently
trips a scmdata-pandas-3 incompatibility (fixed in PR #11 / upstream
openscm/scmdata#321), so writer support stays out of this script
until those land in main.

pandas 3.0 makes DataFrame.values return a read-only ndarray, so the
existing self._df.values[:] = ... idiom in convert_unit, _binary_op
and _unary_op raises ValueError: assignment destination is read-only.

Switch to .iloc[:, :] = ..., which goes through pandas' indexer rather
than the underlying ndarray and so isn't affected by the read-only
change. _binary_op and _unary_op additionally wrap the right-hand side
in np.asarray(..., dtype=float) to preserve the prior silent bool-to-
float cast that comparison ops (lt, eq, ne, etc.) relied on; .iloc is
dtype-strict where the old .values write was not, so the explicit
cast keeps the historical semantics of the result frame.
…ation

xarray 2025+ deprecates the bare use_cftime kwarg on
`xr.load_dataset`, recommending instead that callers pass an
`xr.coders.CFDatetimeCoder` via `decode_times`. The deprecation
emits a FutureWarning on every `ScmRun.from_nc()` /
`nc_to_run` call, which floods downstream notebook output.

Prefer the new API when `xr.coders.CFDatetimeCoder` is available
and fall back to the bare kwarg on older xarray (< 2024.09) where
the new coder does not exist.
@benmsanderson benmsanderson changed the title Fix two pandas 3.0 incompatibilities (StringDtype groupby, Series positional indexing) Fix pandas 3.0 / xarray 2025+ compat across groupby, _xarray, run, netcdf May 30, 2026
benmsanderson (Author) commented

Hey @znicholls - this is ready when maintainers have a minute.

Four targeted fixes. The openscm-runner PR needs this first if we want to avoid shims, happy to address anything you flag.

-b
