
OpenLineage

A stem-cell differentiation copilot for computational biology teams.


OpenLineage is an AI agent built to serve the computational biology workflow at stem-cell engineering companies. It acts as five tools in one system — single-cell analyst, differentiation predictor, protocol scout, batch-QC monitor, and literature curator — driven from a single chat interface.

Drop in a scRNA-seq dataset and ask "what lineage is this drifting toward, and which factors would push it to cardiomyocyte fate?"

The agent annotates cell types against CELLxGENE Discover, the Human Cell Atlas, and Tabula Sapiens, maps differentiation trajectories, and cross-references transcription-factor targets from Open Targets, regulatory networks from Reactome and KEGG, and protein-level evidence from UniProt and the Human Protein Atlas. For protocol design it pulls published differentiation recipes from PubMed, bioRxiv, and Europe PMC, and surfaces small-molecule modulators from ChEMBL and DrugBank. For translational and regulatory context it queries ClinicalTrials.gov for cell-therapy trials and openFDA for cell-based product filings.


Two modes

1 · Forward (computational analyst)

Given a scRNA-seq dataset, return the most likely lineage trajectory, drift risk, and the transcription factors / small molecules that would bias the population toward a target fate.
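As a rough illustration of the forward mode's first step, the sketch below scores a population against a few marker-gene sets and reports the best-matching lineage. It is a toy stand-in, not the Scanpy / scVI / CellRank pipeline named elsewhere in this README: the `LINEAGE_MARKERS` table, the `score_lineages` function, and the input shape (one dict of gene symbol to normalized expression per cell) are all assumptions made for the example.

```python
from statistics import mean

# Illustrative marker sets only (hypothetical, not the project's curated references).
LINEAGE_MARKERS = {
    "cardiomyocyte": ["TNNT2", "MYH6", "NKX2-5"],
    "definitive_endoderm": ["SOX17", "FOXA2"],
    "neuroectoderm": ["PAX6", "SOX1"],
}


def score_lineages(cells, markers=LINEAGE_MARKERS):
    """Return (lineage, score) pairs sorted by mean marker expression.

    `cells` is a list of dicts mapping gene symbol -> normalized expression.
    A gene missing from a cell counts as 0.0.
    """
    scores = {}
    for lineage, genes in markers.items():
        # Average the marker genes within each cell, then across cells.
        per_cell = [mean(cell.get(g, 0.0) for g in genes) for cell in cells]
        scores[lineage] = mean(per_cell)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


cells = [
    {"TNNT2": 2.0, "MYH6": 1.0, "SOX17": 0.5},
    {"TNNT2": 3.0, "NKX2-5": 1.0, "PAX6": 0.2},
]
ranking = score_lineages(cells)
print(ranking[0][0])  # cardiomyocyte
```

A real trajectory call would work on a latent embedding and a transition matrix rather than raw marker averages; the point here is only the shape of the output, a ranked list of candidate fates.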

2 · Inverse (wet-lab assistant)

Given a target cell type, recommend differentiation strategies from the published literature and predict batch-consistency risk from historical scRNA-seq profiles.


What's inside

Module             Purpose
-----------------  ------------------------------------------------------------
Annotator          Cell-type calls against CELLxGENE / HCA / Tabula Sapiens, built on Scanpy + scVI
Trajectory         Pseudotime and lineage prediction with CellRank
Retrieval          Unified, cached query layer across 9+ public sources
Protocol scout     Mines bioRxiv / PubMed / Europe PMC for differentiation recipes
Batch-QC monitor   Flags runs drifting > 2 SD from a reference embedding
Dashboard          Streamlit chat-driven UI
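The README does not spell out how the "> 2 SD" drift rule is computed. One plausible reading, sketched here under stated assumptions, is to summarize each run by the centroid of its cell embeddings, measure each reference run's distance to the grand reference centroid, and flag a new run whose distance exceeds the reference mean by more than two standard deviations. The function names (`centroid`, `flag_drift`) and the Euclidean metric are choices made for the example, not the project's API.

```python
import math
from statistics import mean, stdev


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(dim) for dim in zip(*points)]


def flag_drift(reference_runs, new_run, n_sd=2.0):
    """Return (flagged, z) for `new_run` relative to the reference runs.

    reference_runs: list of runs, each a list of embedding vectors (one per cell).
    new_run: list of embedding vectors for the run under test.
    Assumes at least two reference runs, since stdev needs two data points.
    """
    ref_centroids = [centroid(run) for run in reference_runs]
    grand = centroid(ref_centroids)
    ref_dists = [math.dist(c, grand) for c in ref_centroids]
    mu, sd = mean(ref_dists), stdev(ref_dists)
    d = math.dist(centroid(new_run), grand)
    if sd == 0:
        # Degenerate reference: any excess distance counts as drift.
        return d > mu, (math.inf if d > mu else 0.0)
    z = (d - mu) / sd
    return z > n_sd, z


reference_runs = [
    [[0.0, 0.0], [0.2, 0.0]],
    [[0.0, 0.3], [0.0, 0.1]],
    [[-0.2, 0.0], [0.0, 0.0]],
    [[0.0, -0.3], [0.0, -0.1]],
]
drifted, z_drifted = flag_drift(reference_runs, [[1.0, 0.0], [1.2, 0.0]])
stable, z_stable = flag_drift(reference_runs, [[0.1, 0.1], [0.1, 0.1]])
print(drifted, stable)
```

Centroid distance ignores changes in spread or composition within a run, so a production monitor would likely add per-cluster or distributional checks on top of this.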

Architecture

                                ┌──────────────────────────┐
        User question ─────────▶│   Agent (LangChain)      │
                                └──────────┬───────────────┘
                                           │
            ┌──────────────┬───────────────┼───────────────┬──────────────┐
            ▼              ▼               ▼               ▼              ▼
       Annotation     Trajectory      Retrieval        Protocol       Batch-QC
       (Scanpy,       (CellRank,      (cached over     scout          monitor
        scVI)          scFates)        9 sources)      (bioRxiv,      (drift vs.
                                                       Europe PMC)     reference)
                                           │
                ┌──────────────────────────┼──────────────────────────┐
                ▼                          ▼                          ▼
       Single-cell refs            Targets & pathways          Chem & literature
       CELLxGENE Discover          Open Targets                ChEMBL  · DrugBank
       Human Cell Atlas            Reactome · KEGG             PubMed  · bioRxiv
       Tabula Sapiens              UniProt · HPA               Europe PMC
                                                               ClinicalTrials.gov
                                                               openFDA

                                           │
                                           ▼
                            Streamlit dashboard + benchmark dataset
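The cached retrieval fan-out in the diagram could be sketched as a thin router that dispatches each query to a per-source fetcher and memoizes the result by source and parameters. This is a minimal in-memory sketch: `CachedRetriever`, its `query` method, and the `fake_chembl` stub are hypothetical names, and the stub stands in for a real API client so that the example makes no network calls. A persistent store would replace the dict in practice.

```python
import hashlib
import json


class CachedRetriever:
    """Route queries to per-source fetchers and memoize by (source, params)."""

    def __init__(self, fetchers):
        self.fetchers = fetchers  # source name -> callable(params) -> result
        self.cache = {}
        self.hits = 0

    @staticmethod
    def _key(source, params):
        # Sort keys so logically equal parameter dicts hash identically.
        payload = json.dumps({"source": source, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def query(self, source, params):
        key = self._key(source, params)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = self.fetchers[source](params)
        self.cache[key] = result
        return result


calls = []


def fake_chembl(params):
    """Stub fetcher; records each call instead of contacting a remote service."""
    calls.append(params)
    return {"source": "chembl", "query": params["target"], "hits": []}


retriever = CachedRetriever({"chembl": fake_chembl})
first = retriever.query("chembl", {"target": "GSK3B"})
second = retriever.query("chembl", {"target": "GSK3B"})
print(len(calls), retriever.hits)  # 1 1
```

Keying on a hash of the canonical JSON keeps the cache independent of argument order and makes the same key usable as a primary key if the cache moves to disk.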

Deliverables

  • 📦 Open benchmark dataset of curated stem-cell differentiation references.
  • 📊 Streamlit dashboard with chat-driven analysis, tracking lineage decisions across experiments.
  • 🧰 Python library (openlineage) that wet-lab and computational teams can call from notebooks or pipelines.

Tech stack

Python · PyTorch · Scanpy · scVI · CellRank · Seurat (R) · Hugging Face Transformers · LangChain · FastAPI · DuckDB · Streamlit · AWS S3 · GitHub Actions


Quickstart

# clone
git clone https://github.com/ds4cabs/OpenLineage.git
cd OpenLineage

# install (uv recommended)
uv sync
# or: pip install -e .

# launch the dashboard
uv run streamlit run src/openlineage/app.py

End-to-end notebook walkthroughs will be added under docs/ and notebooks/ as each module ships.


Status

🚧 Active development — built as a CABS Data Science Summer Intern Program 2026 project (June – August 2026). Milestones tracked in Issues.

Performance targets the team is shooting for:

  • Lineage predictions returned in under 90 s for datasets up to 200K cells.
  • Unified retrieval over 9 public sources, with ~12K cached query results.
  • 400+ published differentiation protocols ranked by lineage-match score.
  • Batch-QC validated against 8 publicly released iPSC differentiation time courses.
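The "lineage-match score" used to rank protocols is not defined in this README. As one hedged possibility, the sketch below ranks protocols by the Jaccard overlap between the marker genes a protocol reports and the markers expected for the target cell type. The protocol records, their `id` and `markers` fields, and the functions `jaccard` and `rank_protocols` are all illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def rank_protocols(protocols, target_markers):
    """Return (protocol id, score) pairs, best match first."""
    scored = [(p["id"], jaccard(p["markers"], target_markers)) for p in protocols]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical protocol records; real entries would come from the literature miner.
protocols = [
    {"id": "protocol_B", "markers": ["TNNT2", "SOX17"]},
    {"id": "protocol_A", "markers": ["TNNT2", "MYH6", "NKX2-5"]},
    {"id": "protocol_C", "markers": ["PAX6", "SOX1"]},
]
ranked = rank_protocols(protocols, ["TNNT2", "MYH6", "NKX2-5", "ACTN2"])
for protocol_id, score in ranked:
    print(protocol_id, round(score, 2))
```

Set overlap ignores expression levels and efficiency figures, so it is best read as a coarse first-pass filter before any quantitative comparison.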

License

MIT — see LICENSE. The curated benchmark dataset is released under the same terms; individual source datasets retain their original licenses.


About CABS

OpenLineage is a 2026 intern project of the Chinese American Biopharmaceutical Society (CABS), under the DS4CABS open-source data-science initiative.
