BTE MVP1 template changes #107
Max told the BTE devs that this template was returning too many results - "238k results for asthma (MONDO:0004979)". I investigated and found that the Gene-DoP edge had issues: (1) the gene_associated_with_condition predicate was returning many edges from geneticskp (not the original intent of the template), leading to many intermediate nodes; and (2) the biomarker edge was returning nothing because no MetaEdges matched it. I reviewed the DINGO KGX metadata and found MetaEdges that retrieve "more causal" genes from fairly reliable resources (HPOA, G2P). So I adjusted the Gene-DoP edge to hit those edges specifically and not geneticskp or semmeddb (using qualifier-constraints). For the Chem-Gene hop, I also made a change: removed the "regulates" predicate since "affects" (now its parent) covers its cases.
The Gene-has_phenotype-DoP hop hits MetaEdges from AGR and HPOA that seem fairly reliable. There aren't many currently, <3000. For the Chem-Gene hop, I removed the "regulates" predicate since "affects" (now its parent) covers its cases.
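For reference, the Gene-has_phenotype-DoP template described above might look roughly like the following TRAPI query graph, written here as a Python dict. This is a sketch, not the template file from this PR: the node keys (`chem`, `gene`, `dop`), the edge keys, and the `build_template` helper are hypothetical, and the real templates may carry qualifier constraints that aren't shown.

```python
import json


def build_template(disease_curie: str) -> dict:
    """Build an illustrative two-hop Chem-affects-Gene-has_phenotype-DoP query."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # Unpinned chemical: this is what the query is looking for.
                    "chem": {"categories": ["biolink:ChemicalEntity"]},
                    # Gene intermediate shared by both hops.
                    "gene": {"categories": ["biolink:Gene"]},
                    # Pinned disease or phenotype (DoP).
                    "dop": {
                        "ids": [disease_curie],
                        "categories": ["biolink:DiseaseOrPhenotypicFeature"],
                    },
                },
                "edges": {
                    # Only "affects": it is the parent of "regulates",
                    # so listing "regulates" separately is redundant.
                    "e_chem_gene": {
                        "subject": "chem",
                        "object": "gene",
                        "predicates": ["biolink:affects"],
                    },
                    "e_gene_dop": {
                        "subject": "gene",
                        "object": "dop",
                        "predicates": ["biolink:has_phenotype"],
                    },
                },
            }
        }
    }


query = build_template("MONDO:0001068")
print(json.dumps(query, indent=2))
```

Keeping the Chem-Gene hop on a single parent predicate avoids asking for the same edges twice through a parent and its child.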
A good proportion of G2P's Gene-Disease edges have directional qualifiers: loss or non-loss (different types of gain) of function. For these, more directional Chem-affects-Gene edges can be used to try to find potential drugs that counteract the gene variant effect. But these templates are so specific that they will only find results for a subset of diseases (e.g., no results for asthma MONDO:0004979). They are covered/encompassed by the broader template Chem-affects-Gene-associated_variant_contributes-DoP.json.
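The counteracting idea can be sketched as a small lookup. This is only an illustration under stated assumptions: the direction labels (`loss_of_function`, `gain_of_function`) and the `pick_template` helper are hypothetical, and the pairing of loss of function with an "increases" template (and gain with "decreases") is my reading of "counteract", not something taken from the template files.

```python
from typing import Optional

# Broad template named in the commit message; used when no direction is known.
BROAD_TEMPLATE = "Chem-affects-Gene-associated_variant_contributes-DoP.json"

# Assumed pairing: a chemical should push the gene in the opposite
# direction of the variant's effect.
COUNTERACTING_TEMPLATE = {
    "loss_of_function": "affects_increases",
    "gain_of_function": "affects_decreases",
}


def pick_template(gene_disease_direction: Optional[str]) -> str:
    """Return the directional template label, or the broad template as a fallback."""
    return COUNTERACTING_TEMPLATE.get(gene_disease_direction, BROAD_TEMPLATE)


print(pick_template("loss_of_function"))
print(pick_template(None))
```

The fallback mirrors the point above: edges without a directional qualifier are still reachable through the broader template.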
This is a defunct property. Waiting for TRAPI 2.0 to reintroduce the behavior (COLLATE).
Put all together in 1 folder, so only the templates currently used are on the top level.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
I ran asthma against this and the largest query returned 20k results, which is much more manageable. It, of course, runs faster because we're not handling as many results. I can't speak to whether the results are any "better", but I'd like to merge this in and deploy to CI so that we can see any differences in the automated tests.
Max, please try out these new MVP1 templates (listed in the template_groups.json diff) and let me know what you think (fewer results? better results? faster?). If this is likely an improvement, maybe it'd be good to get this into CI ASAP for testing and the "Test" environ deployment?

Background: ~1 month ago, we discussed how BTE's gene-intermediate template was the culprit blowing up on MVP1 queries and returning way too many results (238k for asthma MONDO:0004979). I was tasked with investigating why and adjusting the template to return fewer results. I discovered that the Gene-DiseaseOrPheno hop had issues (for asthma, 1 pred returned >1000 edges from geneticskp, and the other had no MetaEdges). So I wrote new templates to hit Gene-DiseaseOrPheno MetaEdges that are more strong/causal and from fairly reliable resources (HPOA, G2P, AGR). See the commit messages for more details.

Some info for testing:
- associated_variant_contributes template is still liable to blow up and returns ~39k results for asthma right now (with subclassing turned off in Tier 0?). But...that's still better than before? The Gene-DiseaseOrPheno hop is reasonable (returns 198-199 edges/intermediates from HPOA); it's the 2nd Chem-Gene hop that blows up. I didn't work on that hop 😖 (ran out of time). Based on a quick look, a lot of that comes from CTD (26663 / 54626 edges). But I was told it's not possible to constrain the source (ideally for a template/QEdge) right now...
- affects_increases: OMIM:615190
- affects_decreases: MONDO:0032942
- has_phenotype: MONDO:0001068 (Gene-Disease edges from AGR). For asthma, this template's Gene-Disease hop does return 1 edge from HPOA.
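To see which sources dominate a hop (like the CTD share of the Chem-Gene edges mentioned above), one option is to tally edges by primary knowledge source in the TRAPI response's knowledge graph. The snippet below is a rough sketch, not BTE code: `count_primary_sources` is a hypothetical helper, the three toy edges are made up for illustration, and it assumes each edge carries a TRAPI-style `sources` list with `resource_id` and `resource_role` fields.

```python
from collections import Counter


def count_primary_sources(kg_edges: dict) -> Counter:
    """Count knowledge-graph edges per primary knowledge source."""
    counts = Counter()
    for edge in kg_edges.values():
        for source in edge.get("sources", []):
            # Skip aggregators; only the primary source is tallied.
            if source.get("resource_role") == "primary_knowledge_source":
                counts[source["resource_id"]] += 1
    return counts


# Toy edges (made up); a real response would have thousands.
toy_edges = {
    "e1": {"sources": [
        {"resource_id": "infores:ctd", "resource_role": "primary_knowledge_source"},
        {"resource_id": "infores:biothings-explorer", "resource_role": "aggregator_knowledge_source"},
    ]},
    "e2": {"sources": [
        {"resource_id": "infores:ctd", "resource_role": "primary_knowledge_source"},
    ]},
    "e3": {"sources": [
        {"resource_id": "infores:hpo-annotations", "resource_role": "primary_knowledge_source"},
    ]},
}

counts = count_primary_sources(toy_edges)
print(counts.most_common())

# Share of CTD edges using the numbers quoted above (26663 of 54626).
ctd_share = 26663 / 54626
print(f"CTD share: {ctd_share:.1%}")  # CTD share: 48.8%
```

A tally like this only diagnoses where the edges come from; it doesn't work around the inability to constrain sources per template/QEdge.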