iceberg-bioimage

iceberg-bioimage is a format-agnostic Python package for cataloging bioimaging data with Apache Iceberg and exporting Cytomining-compatible warehouse layouts.

Core idea:

Iceberg is the control plane for cataloging, schemas, joins, and snapshots.
Cytomining-compatible Parquet warehouses are a first-class export target.
Zarr and OME-TIFF remain the data plane.
Adapters normalize each format into a pure-Python ScanResult.
Execution stays in external tools such as DuckDB, xarray, and tifffile.

Package layout

src/iceberg_bioimage/
  __init__.py
  api.py
  cli.py
  adapters/
  integrations/
  models/
  publishing/
  validation/

Minimal dependencies

pyarrow
pyiceberg
tifffile
zarr

Optional integration groups:

duckdb for query helpers and examples
ome-arrow for Arrow-native tabular image payloads and lazy image access

Start Here

If you want a catalog-free first run, start with Cytomining export: iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
If you want Iceberg-backed publishing, configure a PyIceberg catalog first.
For step-by-step setup, see:
- docs/src/getting-started.md
- docs/src/catalog-setup.md

Zarr support

iceberg-bioimage keeps the user-facing API simple: use scan_store(...) for both local Zarr v2 stores and local Zarr v3 metadata stores.

Zarr v2 arrays are scanned through the zarr Python package
Local Zarr v3 stores are scanned from zarr.json metadata without requiring a separate API
Summaries report the storage variant as zarr-v2 or zarr-v3
The base package allows either Zarr 2 or Zarr 3 runtimes so that optional forward-facing integrations can coexist in the same environment

Quickstart

from iceberg_bioimage import (
    export_store_to_cytomining_warehouse,
    ingest_stores_to_warehouse,
    join_profiles_with_store,
    register_store,
    summarize_store,
    validate_microscopy_profile_table,
)

registration = register_store(
    "data/experiment.zarr",
    "default",
    "bioimage.cytotable",
)
print(registration.to_dict())

summary = summarize_store("data/experiment.zarr")
print(summary.to_dict())

contract = validate_microscopy_profile_table("data/cells.parquet")
print(contract.is_valid)

# Requires the optional DuckDB integration:
#   pip install 'iceberg-bioimage[duckdb]'
joined = join_profiles_with_store("data/experiment.zarr", "data/cells.parquet")
print(joined.num_rows)

warehouse = ingest_stores_to_warehouse(
    ["data/experiment-a.zarr", "data/experiment-b.zarr"],
    "default",
    "bioimage.cytotable",
)
print(warehouse.to_dict())

cytomining_export = export_store_to_cytomining_warehouse(
    "data/experiment-a.zarr",
    "warehouse-root",
    profiles="data/cells.parquet",
    profile_dataset_id="experiment-a",
)
print(cytomining_export.to_dict())

iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/experiment-a.zarr data/experiment-b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet

examples/quickstart.py for a minimal scan, publish, and validation script
examples/catalog_duckdb.py for a catalog-backed query workflow
examples/synthetic_workflow.py for a self-contained local workflow

Install optional integrations with:

pip install 'iceberg-bioimage[duckdb]'
pip install 'iceberg-bioimage[ome-arrow]'

DuckDB helpers

DuckDB is supported as an optional integration layer, not as a required engine. The join helpers also accept common pycytominer and coSMicQC-style Metadata_* aliases for dataset_id, image_id, plate_id, well_id, and site_id. If a profile table is missing dataset_id but all rows belong to one dataset, pass profile_dataset_id=... to the high-level join helpers.

import pyarrow as pa

from iceberg_bioimage import join_image_assets_with_profiles, query_metadata_table

image_assets = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "array_path": ["0"],
        "uri": ["data/example.zarr"],
    }
)
profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_image_assets_with_profiles(image_assets, profiles)
filtered = query_metadata_table(
    joined,
    filters=[("cell_count", ">", 10)],
)

Install the optional integration with uv sync --group duckdb.

Cytomining warehouse export

The package supports Cytomining interoperability as a primary workflow. Besides publishing canonical metadata to Iceberg, it can materialize a Parquet-backed warehouse root that tools like pycytominer can consume directly.

from iceberg_bioimage import export_store_to_cytomining_warehouse

result = export_store_to_cytomining_warehouse(
    "data/experiment.zarr",
    "warehouse-root",
    profiles="data/profiles.parquet",
    profile_dataset_id="experiment",
)
print(result.to_dict())

This writes one or more of:

image_assets/
chunk_index/
joined_profiles/

It can also append downstream Cytomining tables into the same warehouse root, for example:

pycytominer_profiles/
cosmicqc_profiles/

OME-Arrow helpers

OME-Arrow is available as an optional forward-facing integration for tabular image payloads stored in Arrow-compatible formats.

from iceberg_bioimage import create_ome_arrow, scan_ome_arrow

oa = create_ome_arrow("image.ome.tiff")
lazy_oa = scan_ome_arrow("image.ome.parquet")

Install it with uv sync --group ome-arrow or pip install 'iceberg-bioimage[ome-arrow]'.

Local synthetic workflow

For a catalog-free onboarding path, examples/synthetic_workflow.py creates a small Zarr store and profile table, validates the join contract, derives canonical metadata rows, and joins them with the optional DuckDB helpers.

Run it with:

uv run --group duckdb python examples/synthetic_workflow.py

Catalog-backed query workflow

If you already published canonical metadata tables, you can read them from a catalog and join them to analysis outputs directly:

import pyarrow as pa

from iceberg_bioimage import join_catalog_image_assets_with_profiles

profiles = pa.table(
    {
        "dataset_id": ["ds-1"],
        "image_id": ["img-1"],
        "cell_count": [42],
    }
)

joined = join_catalog_image_assets_with_profiles(
    "default",
    "bioimage.cytotable",
    profiles,
    chunk_index_table="chunk_index",
)

Current scope

Scan Zarr and OME-TIFF stores into canonical ScanResult objects
Summarize scanned datasets into user-facing DatasetSummary objects
Publish image_assets and chunk_index metadata tables with PyIceberg
Ingest one or more existing datasets into Cytotable-compatible Iceberg warehouses
Export new or existing datasets into Cytomining-compatible Parquet warehouses
Validate profile tables against the microscopy join contract
Join scanned image metadata to profile tables through a simple top-level API
Query canonical metadata through optional DuckDB helpers
Load catalog-backed metadata tables into Arrow for downstream joins

Troubleshooting

DuckDB helpers require the optional duckdb dependency group: install with pip install 'iceberg-bioimage[duckdb]' or uv sync --group duckdb.
Profiles do not satisfy the microscopy join contract: run iceberg-bioimage validate-contract ... and pass --profile-dataset-id when dataset_id is missing but implied.
Missing table: ... for catalog-backed paths: verify catalog configuration, namespace, and table names.

Design note

The package focuses on metadata scanning, publishing, Cytomining warehouse export, validation, and joins. OME-Arrow remains the place for Arrow-native image payload handling and lazy image access.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github		.github
docs/src		docs/src
examples		examples
src/iceberg_bioimage		src/iceberg_bioimage
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

iceberg-bioimage

Package layout

Minimal dependencies

Start Here

Zarr support

Quickstart

DuckDB helpers

Cytomining warehouse export

OME-Arrow helpers

Local synthetic workflow

Catalog-backed query workflow

Current scope

Troubleshooting

Design note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

iceberg-bioimage

Package layout

Minimal dependencies

Start Here

Zarr support

Quickstart

DuckDB helpers

Cytomining warehouse export

OME-Arrow helpers

Local synthetic workflow

Catalog-backed query workflow

Current scope

Troubleshooting

Design note

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages