Skip to content

refactor: migrate karyotype tag loading to GCS and add simulated tests#1278

Open
adilraza99 wants to merge 5 commits intomalariagen:masterfrom
adilraza99:GH689-karyotype-gcs-tests
Open

refactor: migrate karyotype tag loading to GCS and add simulated tests#1278
adilraza99 wants to merge 5 commits intomalariagen:masterfrom
adilraza99:GH689-karyotype-gcs-tests

Conversation

@adilraza99
Copy link
Copy Markdown
Contributor

This PR implements the karyotype refactor outlined in #689.

The main change is aligning karyotype tag loading with existing data access patterns used across the codebase (e.g., site_filters). Instead of loading a bundled CSV via importlib.resources, tag SNP data is now accessed from GCS using a config-driven analysis version and the shared filesystem interface.

Changes:

  • Replace importlib.resources-based loading with GCS-backed loading via self._fs
  • Introduce DEFAULT_KARYOTYPE_ANALYSIS config key with optional constructor override
  • Move tag data path to {version}/karyotype/{analysis}/karyotype_tag_snps.csv
  • Add explicit validation for inversion names (raises ValueError for unknown inputs)
  • Use contig from tag data directly instead of inversion[0:2] string slicing
  • Add simulated tag SNP generation (init_karyotype_tags) and test coverage under tests/anoph/
  • Remove bundled karyotype_tag_snps.csv from package resources
  • Document load_inversion_tags in Ag3.rst

Note:
Simulated unit tests pass independently. Integration tests may require the tag SNP data to be available on GCS along with the corresponding config update.

Fixes #689

- Replace importlib.resources loading with GCS-based loading via self._fs
- Introduce DEFAULT_KARYOTYPE_ANALYSIS config support
- Add inversion validation and improved contig handling
- Add simulated test data and coverage for karyotype
- Remove bundled CSV from package resources
@adilraza99 adilraza99 force-pushed the GH689-karyotype-gcs-tests branch from 2da28e8 to 3e7859e Compare April 8, 2026 12:55
@adilraza99
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas, I've made the changes. Take a look when you have a moment.

Comment thread tests/anoph/conftest.py
"SITE_ANNOTATIONS_ZARR_PATH": "reference/genome/agamp4/Anopheles-gambiae-PEST_SEQANNOTATION_AgamP4.12.zarr",
"DEFAULT_AIM_ANALYSIS": "20220528",
"DEFAULT_SITE_FILTERS_ANALYSIS": "dt_20200416",
"DEFAULT_KARYOTYPE_ANALYSIS": "20231213",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this value come from?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simulated placeholder used in the test config. I’ve added a comment to make that clearer.

Comment thread tests/anoph/conftest.py Outdated

# Generate tag SNP data using positions from simulated SNP sites.
tags = []
for contig, inversion in [("2R", "2Rb"), ("2L", "2La")]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the inversions hard coded here? If we ever implement karyotypes for ever species they won't have the same inversions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I’ve moved this to a config-driven list so it’s not tied to specific inversions.

Comment thread malariagen_data/anoph/karyotype.py Outdated
"No inversion tags are available for this data resource."
path = (
f"{self._base_path}/{self._major_version_path}"
f"/karyotype/{self._karyotype_analysis}/karyotype_tag_snps.csv"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual path would probably be snp_karyotype to be consistent with the others (e.g., snp_haplotypes). Also, the name of the actual file shouldn't be hard-coded. It should probably be added to the config.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the path to follow the snp_karyotype convention and made the filename configurable via the config.

@adilraza99 adilraza99 force-pushed the GH689-karyotype-gcs-tests branch from 5d223e4 to 01111ab Compare April 9, 2026 14:08
@adilraza99
Copy link
Copy Markdown
Contributor Author

Hi @jonbrenas, I’ve updated this to follow the existing patterns and addressed the review points. Let me know if this looks good.

@adilraza99
Copy link
Copy Markdown
Contributor Author

Could you please take a look when you have a moment?
@jonbrenas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactoring karyotype

2 participants