refactor: migrate karyotype tag loading to GCS and add simulated tests#1278
refactor: migrate karyotype tag loading to GCS and add simulated tests#1278adilraza99 wants to merge 5 commits intomalariagen:masterfrom
Conversation
- Replace importlib.resources loading with GCS-based loading via self._fs - Introduce DEFAULT_KARYOTYPE_ANALYSIS config support - Add inversion validation and improved contig handling - Add simulated test data and coverage for karyotype - Remove bundled CSV from package resources
2da28e8 to
3e7859e
Compare
|
Hi @jonbrenas, I've made the changes. Take a look when you have a moment. |
| "SITE_ANNOTATIONS_ZARR_PATH": "reference/genome/agamp4/Anopheles-gambiae-PEST_SEQANNOTATION_AgamP4.12.zarr", | ||
| "DEFAULT_AIM_ANALYSIS": "20220528", | ||
| "DEFAULT_SITE_FILTERS_ANALYSIS": "dt_20200416", | ||
| "DEFAULT_KARYOTYPE_ANALYSIS": "20231213", |
There was a problem hiding this comment.
Where does this value come from?
There was a problem hiding this comment.
This is a simulated placeholder used in the test config. I’ve added a comment to make that clearer.
|
|
||
| # Generate tag SNP data using positions from simulated SNP sites. | ||
| tags = [] | ||
| for contig, inversion in [("2R", "2Rb"), ("2L", "2La")]: |
There was a problem hiding this comment.
Why are the inversions hard coded here? If we ever implement karyotypes for ever species they won't have the same inversions.
There was a problem hiding this comment.
Good point. I’ve moved this to a config-driven list so it’s not tied to specific inversions.
| "No inversion tags are available for this data resource." | ||
| path = ( | ||
| f"{self._base_path}/{self._major_version_path}" | ||
| f"/karyotype/{self._karyotype_analysis}/karyotype_tag_snps.csv" |
There was a problem hiding this comment.
The actual path would probably be snp_karyotype to be consistent with the others (e.g., snp_haplotypes). Also, the name of the actual file shouldn't be hard-coded. It should probably be added to the config.
There was a problem hiding this comment.
Updated the path to follow the snp_karyotype convention and made the filename configurable via the config.
5d223e4 to
01111ab
Compare
|
Hi @jonbrenas, I’ve updated this to follow the existing patterns and addressed the review points. Let me know if this looks good. |
|
Could you please take a look when you have a moment? |
This PR implements the karyotype refactor outlined in #689.
The main change is aligning karyotype tag loading with existing data access patterns used across the codebase (e.g.,
site_filters). Instead of loading a bundled CSV viaimportlib.resources, tag SNP data is now accessed from GCS using a config-driven analysis version and the shared filesystem interface.Changes:
importlib.resources-based loading with GCS-backed loading viaself._fsDEFAULT_KARYOTYPE_ANALYSISconfig key with optional constructor override{version}/karyotype/{analysis}/karyotype_tag_snps.csvValueErrorfor unknown inputs)inversion[0:2]string slicinginit_karyotype_tags) and test coverage undertests/anoph/karyotype_tag_snps.csvfrom package resourcesload_inversion_tagsinAg3.rstNote:
Simulated unit tests pass independently. Integration tests may require the tag SNP data to be available on GCS along with the corresponding config update.
Fixes #689