refactor: migrate karyotype tag loading to GCS and add simulated tests by adilraza99 · Pull Request #1278 · malariagen/malariagen-data-python

adilraza99 · 2026-04-08T12:29:24Z

This PR implements the karyotype refactor outlined in #689.

The main change is aligning karyotype tag loading with existing data access patterns used across the codebase (e.g., site_filters). Instead of loading a bundled CSV via importlib.resources, tag SNP data is now accessed from GCS using a config-driven analysis version and the shared filesystem interface.

Changes:

- Replace importlib.resources-based loading with GCS-backed loading via self._fs
- Introduce DEFAULT_KARYOTYPE_ANALYSIS config key with optional constructor override
- Move tag data path to {version}/karyotype/{analysis}/karyotype_tag_snps.csv
- Add explicit validation for inversion names (raises ValueError for unknown inputs)
- Use contig from tag data directly instead of inversion[0:2] string slicing
- Add simulated tag SNP generation (init_karyotype_tags) and test coverage under tests/anoph/
- Remove bundled karyotype_tag_snps.csv from package resources
- Document load_inversion_tags in Ag3.rst

Note:
Simulated unit tests pass independently. Integration tests may require the tag SNP data to be available on GCS along with the corresponding config update.
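
For readers following along, the loading and validation changes described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `InMemoryFS` is a hypothetical stand-in for the shared filesystem object (`self._fs`), the free function signature is simplified, and the column names `inversion` and `contig` are assumptions about the tag SNP table.

```python
import csv
import io


class InMemoryFS:
    """Hypothetical stand-in for the shared filesystem; only open() is modelled."""

    def __init__(self, files):
        self._files = dict(files)

    def open(self, path, mode="r"):
        if path not in self._files:
            raise FileNotFoundError(path)
        return io.StringIO(self._files[path])


def load_inversion_tags(fs, path, inversion):
    """Read the tag SNP table at `path` and return the rows for one inversion."""
    with fs.open(path, "r") as f:
        rows = list(csv.DictReader(f))
    # Validate the requested inversion against what the data actually contains.
    known = sorted({row["inversion"] for row in rows})
    if inversion not in known:
        raise ValueError(
            f"Unknown inversion {inversion!r}; expected one of {known}."
        )
    return [row for row in rows if row["inversion"] == inversion]


def inversion_contig(tags):
    """Take the contig from the tag rows instead of slicing the inversion name."""
    contigs = {row["contig"] for row in tags}
    if len(contigs) != 1:
        raise ValueError(f"Expected exactly one contig, found {sorted(contigs)}.")
    return contigs.pop()
```

Reading the contig from the data avoids relying on the assumption that every inversion name starts with a two-character contig name, which is what the `inversion[0:2]` slice did.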

Fixes #689

- Replace importlib.resources loading with GCS-based loading via self._fs
- Introduce DEFAULT_KARYOTYPE_ANALYSIS config support
- Add inversion validation and improved contig handling
- Add simulated test data and coverage for karyotype
- Remove bundled CSV from package resources

adilraza99 · 2026-04-08T18:40:40Z

Hi @jonbrenas, I've made the changes. Take a look when you have a moment.

jonbrenas · 2026-04-09T09:39:15Z

            "SITE_ANNOTATIONS_ZARR_PATH": "reference/genome/agamp4/Anopheles-gambiae-PEST_SEQANNOTATION_AgamP4.12.zarr",
            "DEFAULT_AIM_ANALYSIS": "20220528",
            "DEFAULT_SITE_FILTERS_ANALYSIS": "dt_20200416",
+            "DEFAULT_KARYOTYPE_ANALYSIS": "20231213",


Where does this value come from?

This is a simulated placeholder used in the test config. I’ve added a comment to make that clearer.

jonbrenas · 2026-04-09T09:42:08Z

+
+        # Generate tag SNP data using positions from simulated SNP sites.
+        tags = []
+        for contig, inversion in [("2R", "2Rb"), ("2L", "2La")]:


Why are the inversions hard-coded here? If we ever implement karyotypes for other species, they won't have the same inversions.

Good point. I’ve moved this to a config-driven list so it’s not tied to specific inversions.
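
A rough sketch of what config-driven simulated tag generation could look like is shown here for illustration. It is not the code in this PR: the function name `simulate_tag_snps`, its parameters, and the output column names are hypothetical, and the list of (contig, inversion) pairs is assumed to come from the simulated config rather than from literals in the test helper.

```python
import random


def simulate_tag_snps(inversions, sites_by_contig, n_tags=10, seed=0):
    """Build simulated tag SNP rows for each configured (contig, inversion) pair.

    `inversions` is a list of (contig, inversion) pairs, assumed to be read
    from config. `sites_by_contig` maps a contig name to a list of simulated
    SNP site positions on that contig.
    """
    rng = random.Random(seed)
    rows = []
    for contig, inversion in inversions:
        sites = sites_by_contig[contig]
        # Never request more tags than there are simulated sites.
        k = min(n_tags, len(sites))
        for position in sorted(rng.sample(sites, k)):
            rows.append(
                {
                    "inversion": inversion,
                    "contig": contig,
                    "position": position,
                    "alt_allele": rng.choice("ACGT"),
                }
            )
    return rows
```

Because the pairs are passed in, a simulator for another species would only need a different config entry, not a change to the generator.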

jonbrenas · 2026-04-09T09:44:28Z

-                "No inversion tags are available for this data resource."
+        path = (
+            f"{self._base_path}/{self._major_version_path}"
+            f"/karyotype/{self._karyotype_analysis}/karyotype_tag_snps.csv"


The actual path would probably be snp_karyotype to be consistent with the others (e.g., snp_haplotypes). Also, the name of the actual file shouldn't be hard-coded. It should probably be added to the config.

Updated the path to follow the snp_karyotype convention and made the filename configurable via the config.
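
As an illustration of the resulting path construction, a minimal sketch follows. The config key `KARYOTYPE_TAG_SNPS_FILENAME` and the function name are hypothetical (the thread does not name the new key), and keeping the `{analysis}` directory under `snp_karyotype` is an assumption based on the earlier layout.

```python
def karyotype_tags_path(base_path, major_version_path, config, analysis=None):
    """Build the tag SNP path from config, with an optional analysis override."""
    # A constructor-style override takes precedence over the config default.
    if analysis is None:
        analysis = config.get("DEFAULT_KARYOTYPE_ANALYSIS")
    if analysis is None:
        raise ValueError("No karyotype analysis is configured for this data resource.")
    # Hypothetical key: the file name is read from config, not hard-coded.
    filename = config["KARYOTYPE_TAG_SNPS_FILENAME"]
    return f"{base_path}/{major_version_path}/snp_karyotype/{analysis}/{filename}"
```

With a placeholder base path such as `gs://example-bucket`, a major version path of `v3`, and the test config value `20231213`, this yields `gs://example-bucket/v3/snp_karyotype/20231213/karyotype_tag_snps.csv` when the configured file name is `karyotype_tag_snps.csv`.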

adilraza99 · 2026-04-09T14:43:47Z

Hi @jonbrenas, I’ve updated this to follow the existing patterns and addressed the review points. Let me know if this looks good.

adilraza99 · 2026-04-16T10:11:35Z

@jonbrenas, could you please take a look when you have a moment?

adilraza99 force-pushed the GH689-karyotype-gcs-tests branch from 2da28e8 to 3e7859e on April 8, 2026 12:55

jonbrenas requested changes Apr 9, 2026

fix: align karyotype implementation with config-driven patterns and repo conventions

01111ab

adilraza99 force-pushed the GH689-karyotype-gcs-tests branch from 5d223e4 to 01111ab on April 9, 2026 14:08

adilraza99 added 3 commits April 11, 2026 21:18

Merge branch 'master' into GH689-karyotype-gcs-tests

b08ccea

Merge branch 'master' into GH689-karyotype-gcs-tests

4f57e71

Merge branch 'master' into GH689-karyotype-gcs-tests

7d1690a

adilraza99 wants to merge 5 commits into malariagen:master from adilraza99:GH689-karyotype-gcs-tests
