Support direct GCS dataset sources#1777
Conversation
vahid-ahmadi
left a comment
There was a problem hiding this comment.
Review: Support direct GCS dataset sources
Overall: functionally safe — no correctness blockers. I verified the load-bearing assumptions: parse_gs_url exists in policyengine-core 3.26.0 (the minimum pin in pyproject.toml), so the unconditional import is safe across the allowed dependency range; build_from_file(..., cache_key=...) exists on main; and the new module lives in an existing package. Test coverage is good (numeric generation, metadata-version match on current blob, list fallback picking the latest matching generation, missing-version error, file-cache reuse, and routing/cache_key plumbing).
A few issues worth addressing, roughly in priority order:
1. The in-memory dataset cache is write-only for gs:// sources (performance)
build_from_file(dataset_file, cache_key=dataset_source) stores the built dataset in _url_dataset_cache, but that cache is only read in build_from_url (the hf:// path). So repeated Microsimulation(dataset="gs://...") calls in the same process skip the download (local file cache by generation) but re-pay the HDF5 read + uprating + enum encoding (~2.2s per the comment in build_from_url) every time — and the cache entries accumulate with no benefit. Since the motivating use case is API workers constructing simulations per request, consider checking _url_dataset_cache in the gs:// branch of build_from_dataset_source before materializing, mirroring the hf:// behavior.
2. Temp-file race in _download_blob (concurrency)
The temp name is deterministic (<name>.tmp) in a shared cache dir. Two processes downloading the same dataset concurrently will write to the same temp file (interleaved writes → corrupted file published by os.replace), and the loser's finally: unlink can delete the winner's in-progress temp file. Use a unique temp name in the same directory (e.g. tempfile.mkstemp(dir=local_path.parent)) — os.replace stays atomic and last-writer-wins is then safe.
3. Private data cached in world-readable system temp (security, minor)
The default cache dir is $TMPDIR/policyengine-uk-datasets at a predictable sha256 path, and cached files are trusted without any integrity check. On shared hosts, other users can read the private dataset or pre-plant a file at the expected path. Containers are typically single-user so this is low risk in the Modal deployment, but a user-scoped default (e.g. ~/.cache/policyengine-uk) or restrictive permissions on the cache dir would be safer.
Minor notes
google-cloud-storageisn't declared anywhere inpyproject.toml; the guarded import with a clear install message is fine, but a[gcs]extra would make the deployment dependency explicit.- In the metadata-version listing path,
_blob_metadata_versioncallsblob.reload()for any listed version lacking metadata, and reload won't populate it if the object genuinely has none — one extra API round-trip per such version. Cheap fix: skip reload for blobs that came fromlist_blobs(they already carry metadata). - Purely numeric
@suffixes are always interpreted as generations, so a hypothetical all-numeric data version is unreachable. Fine for semver-style versions; maybe worth a sentence in the docs.
Also note the PR is still marked draft, so it can't be merged until it's marked ready for review.
Fixes #1776
Summary
gs://dataset source handling for UK simulations by materializing GCS objects to local H5 files, then reusing the existing UK local-file loader.@...suffixes as exact GCS generations and non-numeric suffixes such as@1.55.10as PolicyEngine data-version metadata stored on GCS objects.hf://and local file path behavior.Failed Code Push Requiring This Fix
This fixes the UK-side failure exposed by the
policyengine-api-v2deployment from merge commitfd723516fae6d86b5f3a3d201130598fe51be11fin workflow run27304642490(Deploy Simulation API to Modal). The beta integration testtests/simulation/test_calculate.py::test_calculate_uk_modelsubmitted a UK runtime dataset like:The Modal worker failed while constructing the UK microsimulation with:
Validation
make format.venv/bin/python -m pytest policyengine_uk/tests/test_dataset_sources.py policyengine_uk/tests/test_simulation_dataset_sources.py -vI did not run a live private GCS download smoke from this package environment; the unit tests mock GCS access as required, and the package raises explicit dependency/credential errors when
gs://is used without a configured GCS client.