iceberg-bioimage is a format-agnostic Python package for cataloging bioimaging data with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.
Core idea:
- Iceberg is the control plane for cataloging, schemas, joins, and snapshots.
- Cytomining-compatible Parquet warehouses are a first-class export target.
- Zarr and OME-TIFF remain the data plane.
- Adapters normalize each format into a pure-Python `ScanResult`.
- Execution stays in external tools such as DuckDB, xarray, and tifffile.
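To make the "pure-Python `ScanResult`" idea concrete, here is a rough sketch of the kind of record an adapter might hand back. The field names and `to_dict` shape are assumptions for illustration, not the package's actual schema.

```python
from dataclasses import dataclass


# Hypothetical sketch of a ScanResult-style record; the real fields in
# iceberg_bioimage.models may differ.
@dataclass(frozen=True)
class ScanResult:
    uri: str              # location of the scanned store
    storage_variant: str  # e.g. "zarr-v2", "zarr-v3", or "ome-tiff"
    arrays: tuple = ()    # (array_path, shape, dtype) triples

    def to_dict(self) -> dict:
        return {
            "uri": self.uri,
            "storage_variant": self.storage_variant,
            "arrays": list(self.arrays),
        }


result = ScanResult(
    uri="data/experiment.zarr",
    storage_variant="zarr-v2",
    arrays=(("0", (1, 2, 512, 512), "uint16"),),
)
print(result.to_dict()["storage_variant"])  # zarr-v2
```

Because the record is plain Python data, it can be serialized and handed to any execution engine without importing the adapters.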
```
src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/
```
Core dependencies: `pyarrow`, `pyiceberg`, `tifffile`, `zarr`.
Optional integration groups:

- `duckdb` for query helpers and examples
- `ome-arrow` for Arrow-native tabular image payloads and lazy image access
- If you want a catalog-free first run, start with Cytomining export:

  ```shell
  iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
  ```

- If you want Iceberg-backed publishing, configure a PyIceberg catalog first.
- For step-by-step setup, see `docs/src/getting-started.md` and `docs/src/catalog-setup.md`.
iceberg-bioimage keeps the user-facing API simple: use `scan_store(...)` for
both local Zarr v2 stores and local Zarr v3 metadata stores.

- Zarr v2 arrays are scanned through the `zarr` Python package.
- Local Zarr v3 stores are scanned from `zarr.json` metadata without requiring a separate API.
- Summaries report the storage variant as `zarr-v2` or `zarr-v3`.
- The base package allows either Zarr 2 or Zarr 3 runtimes so that optional forward-facing integrations can coexist in the same environment.
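The v2/v3 distinction is visible on disk: Zarr v2 stores carry `.zgroup`/`.zarray` metadata files, while Zarr v3 stores carry a top-level `zarr.json`. A stdlib-only sketch of that detection (illustrative, not the package's actual adapter code):

```python
import json
import tempfile
from pathlib import Path


def detect_zarr_variant(store: Path) -> str:
    """Classify a local directory store as zarr-v2 or zarr-v3 by its metadata files."""
    if (store / "zarr.json").exists():
        return "zarr-v3"
    if (store / ".zgroup").exists() or (store / ".zarray").exists():
        return "zarr-v2"
    raise ValueError(f"no Zarr metadata found under {store}")


root = Path(tempfile.mkdtemp())

v2 = root / "v2.zarr"
v2.mkdir()
(v2 / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

v3 = root / "v3.zarr"
v3.mkdir()
(v3 / "zarr.json").write_text(json.dumps({"zarr_format": 3, "node_type": "group"}))

print(detect_zarr_variant(v2))  # zarr-v2
print(detect_zarr_variant(v3))  # zarr-v3
```

Scanning v3 stores from `zarr.json` alone is what lets the package avoid pinning a specific `zarr` runtime major version.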
```python
from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
# pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())
```

CLI equivalents:

```shell
iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet
```

Example scripts:

- `examples/quickstart.py` for a minimal scan, publish, and validation script
- `examples/catalog_duckdb.py` for a catalog-backed query workflow
- `examples/synthetic_workflow.py` for a self-contained local workflow
Install optional integrations with:

```shell
pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'
```

DuckDB is supported as an optional integration layer, not as a required engine.
The join helpers also accept common pycytominer and coSMicQC-style
`Metadata_*` aliases for `dataset_id`, `image_id`, `plate_id`, `well_id`, and
`site_id`. If a profile table is missing `dataset_id` but all rows belong to
one dataset, pass `profile_dataset_id=...` to the high-level join helpers.
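A stdlib sketch of how such alias handling can work. The alias map and helper below are illustrative assumptions, not the package's internals:

```python
# Hypothetical alias map from pycytominer/coSMicQC-style columns to the
# canonical join-contract names; the package's actual mapping may differ.
ALIASES = {
    "Metadata_DatasetID": "dataset_id",
    "Metadata_ImageID": "image_id",
    "Metadata_Plate": "plate_id",
    "Metadata_Well": "well_id",
    "Metadata_Site": "site_id",
}


def normalize_columns(columns, profile_dataset_id=None):
    """Rename Metadata_* aliases; optionally inject a constant dataset_id column."""
    renamed = [ALIASES.get(name, name) for name in columns]
    if "dataset_id" not in renamed and profile_dataset_id is not None:
        renamed.append("dataset_id")  # would be filled with profile_dataset_id per row
    return renamed


cols = ["Metadata_ImageID", "Metadata_Well", "cell_count"]
print(normalize_columns(cols, profile_dataset_id="experiment-a"))
# ['image_id', 'well_id', 'cell_count', 'dataset_id']
```

The same normalization logic is what makes `profile_dataset_id=...` useful: it fills the one contract column that single-dataset exports commonly omit.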
```python
import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)
profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)
```

Install the optional integration with `uv sync --group duckdb`.
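The `(column, operator, value)` filter tuples can be read as simple SQL WHERE clauses. A hedged stdlib sketch of that translation (the real helper executes through DuckDB and its internals may differ):

```python
def filters_to_where(filters):
    """Render (column, op, value) filter tuples as a SQL WHERE clause string."""
    allowed_ops = {"=", "!=", "<", "<=", ">", ">="}
    clauses = []
    for column, op, value in filters:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        # Quote strings; render numbers bare.
        rendered = repr(value) if isinstance(value, str) else str(value)
        clauses.append(f"{column} {op} {rendered}")
    return "WHERE " + " AND ".join(clauses)


print(filters_to_where([("cell_count", ">", 10), ("dataset_id", "=", "ds-1")]))
# WHERE cell_count > 10 AND dataset_id = 'ds-1'
```

Validating operators up front (rather than interpolating them blindly) is the usual guard against malformed filter tuples reaching the query engine.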
The package supports Cytomining interoperability as a primary workflow.
Besides publishing canonical metadata to Iceberg, it can materialize a
Parquet-backed warehouse root that tools like pycytominer can consume
directly.
```python
from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())
```

This writes one or more of:

- `image_assets/`
- `chunk_index/`
- `joined_profiles/`

It can also append downstream Cytomining tables into the same warehouse root, for example:

- `pycytominer_profiles/`
- `cosmicqc_profiles/`
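Using the directory names above, the exported warehouse root can be pictured with a stdlib-only sketch; the per-table file names here are hypothetical placeholders for the Parquet files the real export writes:

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "warehouse-root"

# Core export targets plus an optional downstream Cytomining table.
for table in ["image_assets", "chunk_index", "joined_profiles", "pycytominer_profiles"]:
    table_dir = root / table
    table_dir.mkdir(parents=True)
    # Placeholder: the real export writes Parquet data files here.
    (table_dir / "part-0.parquet").touch()

print(sorted(p.name for p in root.iterdir()))
# ['chunk_index', 'image_assets', 'joined_profiles', 'pycytominer_profiles']
```

One directory per table, each holding plain Parquet files, is what lets pycytominer-style tools read the warehouse without any Iceberg-aware client.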
OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats.
```python
from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")
```

Install it with `uv sync --group ome-arrow` or
`pip install 'iceberg-bioimage[ome-arrow]'`.
For a catalog-free onboarding path, examples/synthetic_workflow.py creates a
small Zarr store and profile table, validates the join contract, derives
canonical metadata rows, and joins them with the optional DuckDB helpers.
Run it with:

```shell
uv run --group duckdb python examples/synthetic_workflow.py
```

If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:
```python
import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)
```

- Scan Zarr and OME-TIFF stores into canonical `ScanResult` objects
- Summarize scanned datasets into user-facing `DatasetSummary` objects
- Publish `image_assets` and `chunk_index` metadata tables with PyIceberg
- Ingest one or more existing datasets into Cytotable-compatible Iceberg warehouses
- Export new or existing datasets into Cytomining-compatible Parquet warehouses
- Validate profile tables against the microscopy join contract
- Join scanned image metadata to profile tables through a simple top-level API
- Query canonical metadata through optional DuckDB helpers
- Load catalog-backed metadata tables into Arrow for downstream joins
Troubleshooting:

- DuckDB helpers require the optional duckdb dependency group: install with `pip install 'iceberg-bioimage[duckdb]'` or `uv sync --group duckdb`.
- Profiles do not satisfy the microscopy join contract: run `iceberg-bioimage validate-contract ...` and pass `--profile-dataset-id` when `dataset_id` is missing but implied.
- `Missing table: ...` for catalog-backed paths: verify catalog configuration, namespace, and table names.
The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.