Skip to content

Support direct GCS dataset sources#1777

Merged
anth-volk merged 2 commits into
mainfrom
codex/uk-gcs-dataset-loading
Jun 11, 2026
Merged

Support direct GCS dataset sources#1777
anth-volk merged 2 commits into
mainfrom
codex/uk-gcs-dataset-loading

Conversation

@anth-volk

Copy link
Copy Markdown
Collaborator

Fixes #1776

Summary

  • Adds direct gs:// dataset source handling for UK simulations by materializing GCS objects to local H5 files, then reusing the existing UK local-file loader.
  • Supports numeric @... suffixes as exact GCS generations and non-numeric suffixes such as @1.55.10 as PolicyEngine data-version metadata stored on GCS objects.
  • Preserves existing hf:// and local file path behavior.

Failed Code Push Requiring This Fix

This fixes the UK-side failure exposed by the policyengine-api-v2 deployment from merge commit fd723516fae6d86b5f3a3d201130598fe51be11f in workflow run 27304642490 (Deploy Simulation API to Modal). The beta integration test tests/simulation/test_calculate.py::test_calculate_uk_model submitted a UK runtime dataset like:

gs://policyengine-uk-data-private/enhanced_frs_2023_24.h5@1.55.10

The Modal worker failed while constructing the UK microsimulation with:

ValueError: Non-HuggingFace URLs are currently not supported.

Validation

  • make format
  • .venv/bin/python -m pytest policyengine_uk/tests/test_dataset_sources.py policyengine_uk/tests/test_simulation_dataset_sources.py -v

I did not run a live private GCS download smoke from this package environment; the unit tests mock GCS access as required, and the package raises explicit dependency/credential errors when gs:// is used without a configured GCS client.

@vahid-ahmadi vahid-ahmadi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Support direct GCS dataset sources

Overall: functionally safe — no correctness blockers. I verified the load-bearing assumptions: parse_gs_url exists in policyengine-core 3.26.0 (the minimum pin in pyproject.toml), so the unconditional import is safe across the allowed dependency range; build_from_file(..., cache_key=...) exists on main; and the new module lives in an existing package. Test coverage is good (numeric generation, metadata-version match on current blob, list fallback picking the latest matching generation, missing-version error, file-cache reuse, and routing/cache_key plumbing).

A few issues worth addressing, roughly in priority order:

1. The in-memory dataset cache is write-only for gs:// sources (performance)

build_from_file(dataset_file, cache_key=dataset_source) stores the built dataset in _url_dataset_cache, but that cache is only read in build_from_url (the hf:// path). So repeated Microsimulation(dataset="gs://...") calls in the same process skip the download (local file cache by generation) but re-pay the HDF5 read + uprating + enum encoding (~2.2s per the comment in build_from_url) every time — and the cache entries accumulate with no benefit. Since the motivating use case is API workers constructing simulations per request, consider checking _url_dataset_cache in the gs:// branch of build_from_dataset_source before materializing, mirroring the hf:// behavior.

2. Temp-file race in _download_blob (concurrency)

The temp name is deterministic (<name>.tmp) in a shared cache dir. Two processes downloading the same dataset concurrently will write to the same temp file (interleaved writes → corrupted file published by os.replace), and the loser's finally: unlink can delete the winner's in-progress temp file. Use a unique temp name in the same directory (e.g. tempfile.mkstemp(dir=local_path.parent)) — os.replace stays atomic and last-writer-wins is then safe.

3. Private data cached in world-readable system temp (security, minor)

The default cache dir is $TMPDIR/policyengine-uk-datasets at a predictable sha256 path, and cached files are trusted without any integrity check. On shared hosts, other users can read the private dataset or pre-plant a file at the expected path. Containers are typically single-user so this is low risk in the Modal deployment, but a user-scoped default (e.g. ~/.cache/policyengine-uk) or restrictive permissions on the cache dir would be safer.

Minor notes

  • google-cloud-storage isn't declared anywhere in pyproject.toml; the guarded import with a clear install message is fine, but a [gcs] extra would make the deployment dependency explicit.
  • In the metadata-version listing path, _blob_metadata_version calls blob.reload() for any listed version lacking metadata, and reload won't populate it if the object genuinely has none — one extra API round-trip per such version. Cheap fix: skip reload for blobs that came from list_blobs (they already carry metadata).
  • Purely numeric @ suffixes are always interpreted as generations, so a hypothetical all-numeric data version is unreachable. Fine for semver-style versions; maybe worth a sentence in the docs.

Also note the PR is still marked draft, so it can't be merged until it's marked ready for review.

@vahid-ahmadi vahid-ahmadi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls see above

@anth-volk anth-volk marked this pull request as ready for review June 11, 2026 12:55
@anth-volk anth-volk merged commit ba4b1a6 into main Jun 11, 2026
10 checks passed
@anth-volk anth-volk deleted the codex/uk-gcs-dataset-loading branch June 11, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support direct GCS dataset URLs in UK simulations

2 participants