diff --git a/quickstarts/obstore.ipynb b/quickstarts/obstore.ipynb new file mode 100644 index 0000000..9420df2 --- /dev/null +++ b/quickstarts/obstore.ipynb @@ -0,0 +1,412 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b02d33be", + "metadata": {}, + "source": "# Working with Planetary Computer data using obstore\n\nThis notebook walks through reading Planetary Computer data with [obstore](https://developmentseed.org/obstore/). Obstore is a Python library that talks to cloud object stores (Azure Blob, S3, GCS) directly, without going through a filesystem abstraction layer such as fsspec. This has several key benefits:\n\n1. **Reliability**: SAS tokens auto-refresh. No `TokenExpiredError` mid-job, no manual re-signing.\n2. **Cost**: range reads download only the bytes you need.\n3. **Speed**: the async API issues reads in parallel, up to roughly N× faster than serial when reading N files.\n4. **Composability**: any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store reads through your authenticated connection without re-doing auth.\n5. **Portability**: `AzureStore`, `S3Store`, and `GCSStore` are interchangeable, so your code stays cloud-agnostic.\n\nEach cell below calls out which of these it demonstrates. Speed-relevant cells use `%%time` so you can compare wall-clock numbers.\n\nThe companion [obstore tutorial](../overview/obstore.md) has the full narrative and migration reference." + }, + { + "cell_type": "markdown", + "id": "509ad843", + "metadata": {}, + "source": [ + "## Install\n", + "\n", + "obstore is the main library. `pystac-client` lets us query Planetary Computer's STAC API to find a scene to read. `requests` powers the sync credential provider; `aiohttp` + `aiohttp_retry` power the async one (we use both in this notebook)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e916247a", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --quiet obstore pystac-client requests aiohttp aiohttp_retry" + ] + }, + { + "cell_type": "markdown", + "id": "97600151", + "metadata": {}, + "source": [ + "## Authenticate from a STAC asset\n", + "\n", + "`PlanetaryComputerCredentialProvider` handles SAS token acquisition and refresh under the hood, so there are no manual `planetary_computer.sign()` calls anywhere in this notebook. If a token expires mid-job, the provider re-acquires it transparently. The old fsspec pattern required you to handle re-signing and retry logic yourself.\n", + "\n", + "**Expected result:** a working `provider` object, no output printed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f5472d25", + "metadata": {}, + "outputs": [], + "source": [ + "import pystac_client\n", + "from obstore.auth.planetary_computer import PlanetaryComputerCredentialProvider\n", + "\n", + "catalog = pystac_client.Client.open(\n", + "    \"https://planetarycomputer.microsoft.com/api/stac/v1\"\n", + ")\n", + "item = next(catalog.search(collections=[\"naip\"], max_items=1).items())\n", + "asset = item.assets[\"image\"]\n", + "\n", + "provider = PlanetaryComputerCredentialProvider.from_asset(asset)" + ] + }, + { + "cell_type": "markdown", + "id": "b81cc8c9", + "metadata": {}, + "source": [ + "Notice that the asset href is unsigned: no SAS query string is appended. The provider signs it for you at read time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47eb7f91", + "metadata": {}, + "outputs": [], + "source": [ + "asset.href" + ] + }, + { + "cell_type": "markdown", + "id": "707f4707", + "metadata": {}, + "source": [ + "## Build a store\n", + "\n", + "A *store* is obstore's connection to a specific cloud location. 
Once built, you hand it to any obstore read/write function, or to any higher-level library that accepts an obstore-compatible store.\n", + "\n", + "**Expected result:** a working `store` object, no output printed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2330799", + "metadata": {}, + "outputs": [], + "source": [ + "from obstore.store import AzureStore\n", + "\n", + "store = AzureStore(credential_provider=provider)" + ] + }, + { + "cell_type": "markdown", + "id": "ee68c0a7", + "metadata": {}, + "source": [ + "## Read\n", + "\n", + "There are three ways to read data, depending on your needs:\n", + "\n", + "### 1. Read the entire file\n", + "\n", + "This is the slowest path. Use it when you actually want all the bytes. For large files, this can take a long time: these NAIP scenes range from 100–500 MB and can take a minute or more depending on your connection.\n", + "**Expected result:** 100–500 million bytes, 30–90 seconds depending on which NAIP scene came back." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ff4624c", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "import obstore\n", + "\n", + "buf = obstore.get(store, \"\").bytes()\n", + "print(f\"downloaded {len(buf):,} bytes\")" + ] + }, + { + "cell_type": "markdown", + "id": "7415bafa", + "metadata": {}, + "source": [ + "### 2. Read a byte range (16 KB)\n", + "\n", + "A Cloud Optimized GeoTIFF stores its header in the first few KB. Most libraries (async-geotiff, GDAL, rasterio) only need the header to start working. They don't need the pixel data until you ask for a specific window. Range reads make this possible.\n", + "\n", + "**Expected result:** 16,384 bytes, well under a second. Much less data than the full file above."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98eb0e8a", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "header = obstore.get_range(store, \"\", start=0, end=16384)\n", + "print(f\"downloaded {len(header):,} bytes\")\n", + "print(f\"that's {len(buf) / len(header):,.0f}x less data than the full file\")" + ] + }, + { + "cell_type": "markdown", + "id": "78fc671c", + "metadata": {}, + "source": [ + "### 3. Read multiple byte ranges in one call\n", + "\n", + "When you need several slices of the same file, you could issue separate `get_range` calls, but each one is a round-trip to Azure. `get_ranges` takes all the ranges in a single call, coalesces ranges that sit close together into one underlying request, and fetches the rest in parallel, cutting round-trip latency.\n", + "\n", + "**Expected result:** two ranges of 16 KB each, similar wall time to a single `get_range`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c190e591", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ranges = obstore.get_ranges(\n", + "    store, \"\", starts=[0, 65536], ends=[16384, 81920]\n", + ")\n", + "print([len(r) for r in ranges])" + ] + }, + { + "cell_type": "markdown", + "id": "b2bf6cd7", + "metadata": {}, + "source": [ + "## Listing requires a container-scoped store\n", + "\n", + "To enumerate objects under a prefix (\"show me every NAIP scene in Montana in 2023\"), the store needs to be scoped to the container *and* the credential provider needs container-level `List` permission. The asset-derived provider above only signs the single blob, so we build a fresh provider against the container URL.\n", + "\n", + "**Expected result:** three lines printed, each a blob path and its size in bytes."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5700d19", + "metadata": {}, + "outputs": [], + "source": [ + "container_provider = PlanetaryComputerCredentialProvider(\n", + "    \"https://naipeuwest.blob.core.windows.net/naip/\"\n", + ")\n", + "container_store = AzureStore(\n", + "    account_name=\"naipeuwest\",\n", + "    container_name=\"naip\",\n", + "    credential_provider=container_provider,\n", + ")\n", + "\n", + "for batch in obstore.list(container_store, prefix=\"v002/mt/2023/\"):\n", + "    for entry in batch[:3]:\n", + "        print(entry[\"path\"], entry[\"size\"])\n", + "    break" + ] + }, + { + "cell_type": "markdown", + "id": "93160f86", + "metadata": {}, + "source": [ + "## Concurrent reads (async)\n", + "\n", + "For multi-file workloads, issuing reads in parallel is dramatically faster than issuing them serially. Below we read the same 4 KB header eight times, first serially and then concurrently, and compare wall times.\n", + "\n", + "Async needs its own credential provider class (`PlanetaryComputerAsyncCredentialProvider`) backed by `aiohttp` instead of `requests`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f597033b", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from obstore.auth.planetary_computer import PlanetaryComputerAsyncCredentialProvider\n", + "\n", + "async_provider = PlanetaryComputerAsyncCredentialProvider.from_asset(asset)\n", + "async_store = AzureStore(credential_provider=async_provider)" + ] + }, + { + "cell_type": "markdown", + "id": "79c84f80", + "metadata": {}, + "source": [ + "\n", + "**Warm up the async store.** The first call has to acquire a SAS token from Planetary Computer, which is a separate HTTP round-trip. We do one throwaway read so that overhead doesn't pollute the timing below. 
(The sync store was already warmed up by the three read cells above, which is why we only need to warm the async store.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3f3fee63", + "metadata": {}, + "outputs": [], + "source": [ + "_ = await obstore.get_range_async(async_store, \"\", start=0, end=4096)\n", + "print(\"warmed up\")" + ] + }, + { + "cell_type": "markdown", + "id": "4c692834", + "metadata": {}, + "source": [ + "**Serial baseline:** eight reads, one after the other." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82939794", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "n_reads = 8\n", + "start = time.perf_counter()\n", + "for _ in range(n_reads):\n", + "    obstore.get_range(store, \"\", start=0, end=4096)\n", + "serial_elapsed = time.perf_counter() - start\n", + "print(f\"serial ({n_reads} reads): {serial_elapsed:.3f}s\")" + ] + }, + { + "cell_type": "markdown", + "id": "fc0f8a4b", + "metadata": {}, + "source": [ + "**Concurrent:** same eight reads, all firing at once via `asyncio.gather`.\n", + "\n", + "**Expected result:** several times faster than the serial cell above. The exact speedup depends on Azure's per-connection throttling, but you should see a clear win."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5391f24b", + "metadata": {}, + "outputs": [], + "source": [ + "async def fetch_header():\n", + "    return await obstore.get_range_async(async_store, \"\", start=0, end=4096)\n", + "\n", + "start = time.perf_counter()\n", + "headers = await asyncio.gather(*[fetch_header() for _ in range(n_reads)])\n", + "concurrent_elapsed = time.perf_counter() - start\n", + "\n", + "print(f\"concurrent ({n_reads} reads): {concurrent_elapsed:.3f}s\")\n", + "print(f\"speedup: {serial_elapsed / concurrent_elapsed:.1f}x\")\n", + "print(f\"all {len(headers)} reads returned {len(headers[0])} bytes each\")" + ] + }, + { + "cell_type": "markdown", + "id": "6e892d3f", + "metadata": {}, + "source": [ + "The speedup scales with how many parallel reads you're doing. For real workloads (building a mosaic, fetching all bands across all scenes in an AOI), this is the difference between \"minutes\" and \"seconds.\"" + ] + }, + { + "cell_type": "markdown", + "id": "802a25f9", + "metadata": {}, + "source": [ + "## Hand the store to async-geotiff\n", + "\n", + "Once you have a working `AzureStore`, you can hand it to any library that accepts an [obspec](https://github.com/developmentseed/obspec)-compatible store, and that library will read through your authenticated connection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "457b28a7", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --quiet async-geotiff" + ] + }, + { + "cell_type": "markdown", + "id": "1c801295", + "metadata": {}, + "source": [ + "Open the NAIP scene as a COG and read its metadata. `geotiff.transform` tells you the scene's pixel size and corner position on the ground.\n", + "\n", + "**Expected result:** a transform and a CRS name (e.g. NAD83 / UTM zone 11N)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ead5f993", + "metadata": {}, + "outputs": [], + "source": [ + "from async_geotiff import GeoTIFF\n", + "\n", + "# async_store is scoped to the asset, so path is \"\" (same rule as obstore reads above)\n", + "geotiff = await GeoTIFF.open(\"\", store=async_store)\n", + "print(geotiff.transform)\n", + "print(geotiff.crs.name)" + ] + }, + { + "cell_type": "markdown", + "id": "f6d207db", + "metadata": {}, + "source": [ + "If you want the full CRS details (datum, axis order, area of use), just evaluate `geotiff.crs` on its own.\n", + "\n", + "Notice async-geotiff only fetched ~16 KB to get this metadata, not the full file. The range-read win compounds at every level of the stack." + ] + }, + { + "cell_type": "markdown", + "id": "9dd739eb", + "metadata": {}, + "source": [ + "## Portability\n", + "\n", + "The same `obstore.get(store, ...)` call works against S3 or GCS; only the store constructor changes. The example below shows the shape of the request against S3:\n", + "\n", + "```python\n", + "from obstore.store import S3Store\n", + "\n", + "s3_store = S3Store(bucket=\"my-bucket\", region=\"us-west-2\")\n", + "buf = obstore.get(s3_store, \"path/to/object\").bytes() # same call\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "b2d0385b", + "metadata": {}, + "source": "## You're done\n\nIf you got the expected output on every cell above, the obstore stack is wired up end-to-end.\n\nFrom here, any obspec-compatible library plugs in the same way. Check the companion [Lonboard tutorial](../overview/lonboard.md) for interactive visualization or the [async-geotiff tutorial](../overview/async-geotiff.md) for pixel-level analysis." + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file