
Cloud storage support (Azure, S3, GCS) #1087

Open
SamirMoustafa wants to merge 33 commits into scverse:main from SamirMoustafa:cloud-storage-support

Conversation


@SamirMoustafa commented Mar 2, 2026

Cloud storage support (Azure, S3, GCS)

Summary

Add read/write support for SpatialData on remote object storage via UPath, fixing the issue reported in #999 where sd.read_zarr(UPath("s3://...")) failed because SpatialData.path did not accept UPath. PR #971 ("add remote support") pursued the same goal but remains a draft, blocked on zarr v3/ome-zarr and async fsspec after dask unpinning. This PR delivers working remote support by fixing the path setter, wrapping fsspec in an async filesystem where required for current zarr, and testing Azure, S3, and GCS via Docker emulators. It also addresses #441 (private remote object storage): credentials go via UPath kwargs or via a pre-opened zarr.Group (e.g. read_zarr(zarr.open("s3://...", storage_options={...}))). Fixes #441.

Supported features

  • Path handling: SpatialData.path accepts None, str, Path, or UPath (enables remote-backed objects).
  • Read: SpatialData.read(upath) and read_zarr(upath) for Azure Blob (az://), S3 (s3://), and GCS (gs://) using universal-pathlib (UPath). For private stores, read_zarr(zarr_group) is also supported when the store is opened with zarr.open(..., storage_options=...).
  • Write: sdata.write(upath) and element-level writes to the same backends; parquet (points/shapes) and zarr (raster/tables) written via fsspec with async filesystem support where required.
  • Consolidated metadata: Read/write of consolidated metadata on remote stores (e.g. zmetadata) supported.
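
For private stores, the two credential routes listed above can be sketched as follows. This is a minimal sketch: the bucket name, endpoint, and keys are placeholders, and the remote calls are commented out because they require real credentials.

```python
# Hypothetical credentials for a private S3-compatible store, in the form
# accepted by fsspec/s3fs (bucket, endpoint, and keys are placeholders).
storage_options = {
    "key": "<access-key-id>",
    "secret": "<secret-access-key>",
    "client_kwargs": {"endpoint_url": "https://s3.example.org"},
}

# Route 1: open the zarr group with the options, then hand it to read_zarr:
#   import zarr
#   from spatialdata import read_zarr
#   group = zarr.open("s3://private-bucket/data.zarr", storage_options=storage_options)
#   sdata = read_zarr(group)

# Route 2: attach the same options to the UPath as keyword arguments,
# which universal-pathlib forwards to the underlying fsspec filesystem:
#   from upath import UPath
#   sdata = read_zarr(UPath("s3://private-bucket/data.zarr", **storage_options))
```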

Testing

Remote storage is tested with Docker-based emulators (Azurite for Azure, moto for S3, fake-gcs-server for GCS). In CI we build tests/io/remote_storage/Dockerfile.emulators, start the emulators on Ubuntu, then run the full test suite including tests/io/remote_storage/. These remote-storage tests run only on Ubuntu (Linux) because they depend on Docker; on Windows and macOS we skip tests/io/remote_storage/ and run the rest of the suite. To run the remote tests locally, you need Docker; start the emulators with the same image and ports (5000, 10000, 4443) used in the workflow.
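
The "wait mechanism to ensure emulators are ready" that the workflow uses can be sketched as a small TCP poll loop. This is an illustration, not the workflow's actual code; the host and timeout are assumptions, and the ports are the ones listed above (by default moto serves on 5000, Azurite blob on 10000, fake-gcs-server on 4443).

```python
import socket
import time


def wait_for_ports(ports, host="127.0.0.1", timeout=60.0) -> bool:
    """Poll until every port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    pending = set(ports)
    while pending and time.monotonic() < deadline:
        for port in list(pending):
            try:
                with socket.create_connection((host, port), timeout=1.0):
                    pending.discard(port)  # this emulator is up
            except OSError:
                time.sleep(0.5)  # not listening yet; retry until the deadline
    return not pending


# Emulator ports as used in the workflow:
# ready = wait_for_ports([5000, 10000, 4443])
```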

Example (three providers)

from upath import UPath
from spatialdata import SpatialData

# Azure Blob Storage
az_path = UPath("az://my-container/data.zarr", connection_string="<your-connection-string>")
sdata = SpatialData.read(az_path)

# Amazon S3 (e.g. public bucket or custom endpoint)
s3_path = UPath(
    "s3://bucket/data.zarr",
    endpoint_url="https://s3.embl.de",  # omit for default AWS
    anon=True,
)
sdata = SpatialData.read(s3_path)

# Google Cloud Storage
gs_path = UPath("gs://my-bucket/data.zarr", token="anon", project="my-project")
sdata = SpatialData.read(gs_path)

# Write works the same way (any provider)
# sdata.write(az_path)

Credentials and options are passed through UPath (e.g. connection_string, endpoint_url, anon, token, project) as supported by the underlying fsspec backend.


Release notes

  • Add cloud storage support: read and write SpatialData from/to Azure Blob, S3, and GCS using UPath. SpatialData.path now accepts UPath in addition to str and Path. Fixes initialization from remote stores (e.g. S3) as in #999. Fixes #441 (private remote object storage).

SamirMoustafa and others added 13 commits February 28, 2026 02:13
Patch da.to_zarr so ome_zarr's **kwargs are forwarded as zarr_array_kwargs,
avoiding FutureWarning and keeping behavior correct.
- _FsspecStoreRoot, _get_store_root for path-like store roots (local + fsspec)
- _storage_options_from_fs for parquet writes to Azure/S3/GCS
- _remote_zarr_store_exists, _ensure_async_fs for UPath/FsspecStore
- Extend _resolve_zarr_store for UPath and _FsspecStoreRoot with async fs
- _backed_elements_contained_in_path, _is_element_self_contained accept UPath
- path and _path accept Path | UPath; setter allows UPath
- write() accepts file_path: str | Path | UPath | None (None uses path)
- _validate_can_safely_write_to_path handles UPath and remote store existence
- _write_element accepts Path | UPath; skip local subfolder checks for UPath
- __repr__ and _get_groups_for_element use path without forcing Path()
…table, zarr

- Resolve store via _resolve_zarr_store in read paths (points, shapes, raster, table)
- Use _get_store_root for parquet paths; read/write parquet with storage_options for fsspec
- io_shapes: upload parquet to Azure/S3/GCS via temp file when path is _FsspecStoreRoot
- io_zarr: _get_store_root, UPath in _get_groups_for_element and _write_consolidated_metadata; set sdata.path to UPath when store is remote
- pyproject.toml: adlfs, gcsfs, moto[server], pytest-timeout in test extras
- Dockerfile.emulators: moto, Azurite, fake-gcs-server for tests/io/remote_storage/
… emulator config

- full_sdata fixture: two regions for table categorical (avoids 404 on remote read)
- tests/io/remote_storage/conftest.py: bucket/container creation, resilient async shutdown
- tests/io/remote_storage/test_remote_storage.py: parametrized Azure/S3/GCS roundtrip and write tests
- Added "dimension_separator" to the frozenset of internal keys that should not be passed to zarr.Group.create_array(), ensuring compatibility with various zarr versions.
- Updated test to set region labels for full_sdata table, allowing the test_set_table_annotates_spatialelement to succeed without errors.
- Updated the `test_subset` function to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.
…inux

- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote storage tests on non-Linux platforms.
- Wrapped the fsspec async sync function to prevent RuntimeError "Loop is not running" during process exit when using remote storage (Azure, S3, GCS).
- Ensured compatibility with async session management in the _utils module.

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 83.50877% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.74%. Comparing base (cf91ad5) to head (53c45ee).

Files with missing lines Patch % Lines
src/spatialdata/_io/_utils.py 78.08% 32 Missing ⚠️
src/spatialdata/_core/spatialdata.py 74.28% 9 Missing ⚠️
src/spatialdata/_io/io_zarr.py 80.00% 5 Missing ⚠️
src/spatialdata/_io/io_shapes.py 98.24% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1087      +/-   ##
==========================================
- Coverage   91.93%   91.74%   -0.19%     
==========================================
  Files          51       51              
  Lines        7772     8008     +236     
==========================================
+ Hits         7145     7347     +202     
- Misses        627      661      +34     
Files with missing lines Coverage Δ
src/spatialdata/_io/io_points.py 98.11% <100.00%> (+0.38%) ⬆️
src/spatialdata/_io/io_raster.py 92.09% <100.00%> (ø)
src/spatialdata/_io/io_table.py 91.11% <100.00%> (+0.63%) ⬆️
src/spatialdata/_io/io_shapes.py 96.15% <98.24%> (+1.21%) ⬆️
src/spatialdata/_io/io_zarr.py 90.00% <80.00%> (-2.39%) ⬇️
src/spatialdata/_core/spatialdata.py 91.58% <74.28%> (-0.36%) ⬇️
src/spatialdata/_io/_utils.py 85.21% <78.08%> (-1.54%) ⬇️

Member

selmanozleyen commented Apr 15, 2026

  • Maybe add the docker tests as a different job.
  • Add pytest marks for tests that expect a docker setup.

…g for unsupported protocols in storage options, and add test cases to validate new functionality and ensure compatibility with cloud object store protocols.
Comment thread on tests/conftest.py (outdated)
Comment thread on tests/core/operations/test_spatialdata_operations.py (outdated)
Comment on lines +236 to +247
if isinstance(store, UPath):
    sdata.path = store
elif isinstance(store, str):
    sdata.path = UPath(store) if "://" in store else Path(store)
elif isinstance(store, Path):
    sdata.path = store
elif isinstance(store, zarr.Group):
    if isinstance(resolved_store, LocalStore):
        sdata.path = Path(resolved_store.root)
    elif isinstance(resolved_store, FsspecStore):
        sdata.path = UPath(str(_FsspecStoreRoot(resolved_store)))
    else:
Member
Reading from a remote zarr.Group still loses the original storage options when sdata.path is reconstructed. In the zarr.Group branch you rebuild the path from str(_FsspecStoreRoot(...)), which throws away the original fs config. That means a later write() / write_element() on the returned SpatialData can no longer round-trip to Azure/GCS unless the credentials happen to come from global env.
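
One way to avoid the loss described here is to carry the filesystem configuration next to the URL instead of flattening everything to a string. The following is a hypothetical sketch, not code from this PR; the class name and fields are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class RemoteLocation:
    """Hypothetical carrier: keeps fs config alongside the URL so a later
    write() can reconstruct the same filesystem with the same credentials."""

    url: str
    storage_options: dict[str, Any] = field(default_factory=dict)

    def __str__(self) -> str:
        # What rebuilding the path from str(...) keeps: the URL only.
        return self.url


loc = RemoteLocation("az://container/data.zarr", {"connection_string": "<secret>"})
assert str(loc) == "az://container/data.zarr"   # the URL still round-trips
assert loc.storage_options                       # and the fs config survives
```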

@selmanozleyen
Member

Here I also added suggestions in SamirMoustafa#1 so that we can explicitly use a zarr store; this gets rid of the _FsspecStoreRoot tricks.

…load configurations, and refining test execution conditions for different operating systems.
Comment on lines +167 to +220
def _parse_fsspec_remote_path(path: _FsspecStoreRoot) -> tuple[str, str]:
    """Return (bucket_or_container, blob_key) from an fsspec store path."""
    remote = str(path)
    if "://" in remote:
        remote = remote.split("://", 1)[1]
    parts = remote.split("/", 1)
    bucket_or_container = parts[0]
    blob_key = parts[1] if len(parts) > 1 else ""
    return bucket_or_container, blob_key


def _upload_parquet_to_azure(tmp_path: str, bucket: str, key: str, fs: Any) -> None:
    from azure.storage.blob import BlobServiceClient

    client = BlobServiceClient.from_connection_string(fs.connection_string)
    blob_client = client.get_blob_client(container=bucket, blob=key)
    with open(tmp_path, "rb") as f:
        blob_client.upload_blob(f, overwrite=True)


def _upload_parquet_to_s3(tmp_path: str, bucket: str, key: str, fs: Any) -> None:
    import boto3

    endpoint = getattr(fs, "endpoint_url", None) or os.environ.get("AWS_ENDPOINT_URL")
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=getattr(fs, "key", None) or os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=getattr(fs, "secret", None) or os.environ.get("AWS_SECRET_ACCESS_KEY"),
        region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    )
    s3.upload_file(tmp_path, bucket, key)


def _upload_parquet_to_fsspec(path: _FsspecStoreRoot, tmp_path: str) -> None:
    """Upload local parquet file to remote fsspec store using sync APIs to avoid event-loop issues."""
    fs = path._store.fs
    bucket, key = _parse_fsspec_remote_path(path)
    fs_name = type(fs).__name__
    if fs_name == "AzureBlobFileSystem" and getattr(fs, "connection_string", None):
        _upload_parquet_to_azure(tmp_path, bucket, key, fs)
    elif fs_name in ("S3FileSystem", "MotoS3FS"):
        _upload_parquet_to_s3(tmp_path, bucket, key, fs)
    elif fs_name == "GCSFileSystem":
        import fsspec

        fs_dict = json.loads(fs.to_json())
        fs_dict["asynchronous"] = False
        sync_fs = fsspec.AbstractFileSystem.from_json(json.dumps(fs_dict))
        sync_fs.put_file(tmp_path, path._path)
    else:
        fs.put(tmp_path, str(path))

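The bucket/key split in _parse_fsspec_remote_path can be exercised standalone. A minimal re-implementation of the same logic over plain strings, for illustration only:

```python
# Same split as _parse_fsspec_remote_path above, but taking a plain URL string:
# drop the scheme, then split on the first "/" into (bucket_or_container, key).
def parse_remote_path(remote: str) -> tuple[str, str]:
    if "://" in remote:
        remote = remote.split("://", 1)[1]
    bucket, _, key = remote.partition("/")
    return bucket, key


assert parse_remote_path("s3://bucket/data.zarr/points/points.parquet") == (
    "bucket",
    "data.zarr/points/points.parquet",
)
assert parse_remote_path("az://container") == ("container", "")
```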
Member
We shouldn't need these parquet sidecars for authentication against each cloud system. If we can have a fully zarr-native SpatialData, there is no need to re-extract Azure/S3/GCS credentials for the parquet APIs, and there is less distinction between "Zarr backing" and "non-Zarr sidecar next to the Zarr group".



Development

Successfully merging this pull request may close these issues.

Support for private remote object storage

2 participants