ruppinlab/EXPRESSO

EXPRESSO: Drug Response Prediction Framework

Overview

EXPRESSO (EXPression-based Response pREdiction via Supervised Signal Optimization) is a drug-specific supervised LASSO logistic regression framework for transcriptome-based treatment response prediction. It evaluates predictive power of drug targets and biomarkers across patient cohorts using a Leave-One-Cohort-Out Cross-Validation (LOCO-CV) strategy.
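As a rough illustration of the LOCO-CV strategy, the sketch below holds out each cohort in turn and uses the remaining cohorts for training. The function name `loco_cv_splits` and the toy cohort labels are assumptions made for this example; it is not the code in expresso_all.R.

```python
def loco_cv_splits(sample_cohorts):
    """Yield (held_out_cohort, train_indices, test_indices), one fold per cohort."""
    for held_out in sorted(set(sample_cohorts)):
        train = [i for i, c in enumerate(sample_cohorts) if c != held_out]
        test = [i for i, c in enumerate(sample_cohorts) if c == held_out]
        yield held_out, train, test


# Toy example: five samples drawn from three hypothetical cohorts.
sample_cohorts = ["cohortA", "cohortA", "cohortB", "cohortC", "cohortC"]
folds = list(loco_cv_splits(sample_cohorts))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Each fold tests on a cohort the model never saw during training, so the per-cohort AUC reflects transfer to an unseen cohort rather than within-cohort fit.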

Two model variants are implemented:

  • EXPRESSO-T: LASSO model with the known drug target gene(s) as penalty-free (regularization-exempt) features, with the rest of the transcriptome included as penalized predictors.
  • EXPRESSO-B: Extends EXPRESSO-T by additionally including context-specific biomarkers — identified via a nested LOCO-CV differential expression procedure — as penalty-free features alongside the target gene(s).
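To make "penalty-free" concrete, the sketch below builds a per-gene penalty multiplier (0 for exempt genes, 1 otherwise) and applies one L1 soft-thresholding step. In glmnet this kind of exemption is usually expressed through the `penalty.factor` argument; whether EXPRESSO does it exactly this way is an assumption, and the gene names other than CD274 and all coefficient values are made up.

```python
def penalty_factors(genes, penalty_free):
    """Return 0.0 for regularization-exempt genes and 1.0 for all other genes."""
    exempt = set(penalty_free)
    return [0.0 if g in exempt else 1.0 for g in genes]


def soft_threshold(value, threshold):
    """Proximal step of the L1 penalty: shrink `value` toward zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


genes = ["CD274", "GENE_A", "GENE_B"]        # GENE_A / GENE_B are hypothetical
factors = penalty_factors(genes, ["CD274"])  # target gene is exempt
lam = 0.5                                    # illustrative penalty strength
coefs = [0.3, 0.3, 0.9]                      # illustrative unpenalized coefficients
shrunk = [soft_threshold(b, lam * f) for b, f in zip(coefs, factors)]
print(dict(zip(genes, shrunk)))
```

With a factor of 0 the target coefficient passes through unchanged, while an equally sized coefficient on a penalized gene is shrunk to zero.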

Two baseline models are included for comparison:

  • Vanilla LASSO (noTarget mode): Supervised LASSO with no biological prior; the full transcriptome is penalized equally.
  • Unsupervised target-only: Ranks patients by the expression of the target gene alone, with no model training.
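The unsupervised target-only baseline can be pictured as computing an AUC directly from the target gene's expression. The sketch below uses the pairwise (Mann-Whitney) definition of AUC on made-up values; the function name and data are assumptions, and the repository itself lists pROC among its R dependencies.

```python
def auc(scores, labels):
    """Probability that a random responder (label 1) outscores a random
    non-responder (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical target-gene expression for five patients in one cohort.
target_expression = [0.9, 0.2, 0.7, 0.4, 0.4]
responder = [1, 0, 1, 0, 1]  # 1 = Responder, 0 = Non-responder
baseline_auc = auc(target_expression, responder)
print(round(baseline_auc, 3))
```

No parameters are fitted here, which is what makes this baseline a useful lower bar for the supervised models.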

Repository Structure

project/
│
├── expresso_all.R                         # Main LOCO-CV script (EXPRESSO-T / -B / vanilla)
├── expresso_biomarker_search.R            # Nested LOCO-CV biomarker discovery (EXPRESSO-B)
├── gene_signature_benchmarking.R          # Gene signature benchmarking (prospective)
├── expressoT_MLmodels_generate.py         # Train/evaluate alternative ML models (EXPRESSO-T context)
├── expressoB_MLmodels_generate.py         # Train/evaluate alternative ML models (EXPRESSO-B context)
├── Methods_benchmarking.ipynb             # Jupyter notebook: full benchmarking walkthrough
├── source_EXPRESSO_func.R                 # All shared R helper functions (single merged source)
│
├── data/
│   ├── <cohort>/                          # One directory per drug cohort group
│   │   ├── metadata.tsv                   # Cohort metadata (cohort, cancer, type, drug, targets)
│   │   ├── response.tsv                   # Sample-level treatment response labels
│   │   └── mrna/
│   │       └── <cohort_id>.rds            # Expression matrix (genes × samples, numeric matrix)
│   └── biomarkers/
│       └── <cohort>_biomarkers.tsv        # Per-fold target/biomarker gene definitions
│
├── comparisons/
│   ├── final_signatures.tsv               # Index of all 16 published gene signatures
│   ├── <signature_name>.tsv               # One file per gene signature (gene lists)
│   └── ...
│
└── results/                               # All output TSV files written here

Input Data Format

metadata.tsv

| Column | Description |
|---|---|
| cohort | Unique cohort identifier (e.g. GSE78220) |
| cancer | Cancer type |
| type | Sample type |
| drug | Drug/intervention name |
| targets | Drug target gene(s) |

response.tsv

| Column | Description |
|---|---|
| cohort | Cohort identifier (must match metadata.tsv) |
| sample | Sample ID (must match column names in expression matrix) |
| response | Responder or Non-responder |

mrna/<cohort_id>.rds

A numeric R matrix with genes as row names and sample IDs as column names. Gene expression values should be in raw or log-scale; normalization is applied internally (see Normalization section).

data/biomarkers/<cohort>_biomarkers.tsv

| Column | Description |
|---|---|
| test_id | Cohort held out as test set for this fold |
| target | Comma-separated target gene(s) for target mode |
| biomarkers | Comma-separated biomarker genes for biomarker mode |
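A minimal sketch of reading this per-fold file, assuming a plain tab-separated layout with the three columns above. The function name, cohort IDs, and biomarker gene names are hypothetical.

```python
import csv
import io


def parse_biomarker_table(handle):
    """Map each held-out cohort (test_id) to its target and biomarker gene lists."""
    folds = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        folds[row["test_id"]] = {
            "target": [g.strip() for g in row["target"].split(",") if g.strip()],
            "biomarkers": [g.strip() for g in row["biomarkers"].split(",") if g.strip()],
        }
    return folds


# Hypothetical file contents; cohort IDs and biomarker genes are made up.
example = io.StringIO(
    "test_id\ttarget\tbiomarkers\n"
    "cohortA\tCD274\tGENE1,GENE2\n"
    "cohortB\tCD274,PDCD1\t\n"
)
folds = parse_biomarker_table(example)
print(folds)
```

An empty biomarkers field parses to an empty list, which corresponds to a fold where only the target gene(s) would be penalty-free.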

Normalization

By default, gene expression values are rank-normalized within each sample (expression values are converted to fractional ranks in [0, 1]) using the rank_normalization() function in source_EXPRESSO_func.R. This within-sample normalization is applied uniformly across all cohorts before model training.

The --normalization argument (default: "ranked") controls this behavior:

| Value | Description |
|---|---|
| ranked | Within-sample rank normalization to [0,1] (default, recommended) |
| NPN | Nonparanormal transformation (rank + probit) |
| raw | No normalization applied |
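The ranked and NPN options can be sketched for a single sample as follows. This is a dependency-free illustration, not the rank_normalization() implementation: the tie handling (average ranks), the rank/n scaling, and the rank/(n+1) input to the probit are all assumptions.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_normalize(sample):
    """Fractional ranks rank/n for one sample (one column of the matrix)."""
    n = len(sample)
    return [r / n for r in average_ranks(sample)]


def npn_normalize(sample):
    """Rank + probit: pass rank/(n+1) through the standard normal quantile function."""
    n = len(sample)
    probit = NormalDist().inv_cdf
    return [probit(r / (n + 1)) for r in average_ranks(sample)]


sample = [5.0, 0.0, 12.5, 5.0]  # hypothetical expression values for four genes
ranked = rank_normalize(sample)
npn = npn_normalize(sample)
print(ranked)
```

Because only the ordering of genes within a sample is used, both transforms are unaffected by monotone differences in scale between cohorts, such as raw versus log-scale input.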

Software Requirements

R (≥ 4.0)

# CRAN packages ("parallel" ships with base R and needs no installation):
install.packages(c(
  "tidyverse", "data.table", "glmnet", "caret",
  "rsample", "pROC", "metap"
))

# Bioconductor packages: limma (differential expression) and
# EmpiricalBrownsMethod (Brown's method p-value combination for biomarker search):
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("limma", "EmpiricalBrownsMethod"))

Python (= 3.11)

pip install numpy pandas scipy matplotlib

Random Seeds

All stochastic procedures use fixed random seeds for full reproducibility:

  • set.seed(31052024) is called immediately before each cv.glmnet() model build in source_EXPRESSO_func.R
  • The bootstrap CI functions in the Python comparison scripts use np.random.default_rng(42)

Scripts and Usage

1. expresso_all.R — Main LOCO-CV (EXPRESSO-T / -B / noTarget)

Runs LOCO-CV for a single drug and produces AUCs, odds ratios, and selected features.

Rscript expresso_all.R <intervention> <cohort> <biomarker_file> <mode> [normalization] [response_weights]

| Argument | Description |
|---|---|
| intervention | Drug name (e.g. ICB, trastuzumab) |
| cohort | Cohort directory under data/ (e.g. pd1mel) |
| biomarker_file | TSV file under data/biomarkers/ with per-fold target/biomarker genes |
| mode | target (EXPRESSO-T), biomarker (EXPRESSO-B), or noTarget |
| normalization | (optional) ranked (default), NPN, or raw |
| response_weights | (optional) TRUE (default) or FALSE |

Drug-cohort reference table:

| Drug | intervention | cohort | biomarker_file |
|---|---|---|---|
| ICB – melanoma | ICB | pd1mel | pd1mel_biomarkers.tsv |
| ICB – non-melanoma | ICB | pd1oth | pd1oth_biomarkers.tsv |
| Trastuzumab | trastuzumab | tras | tras_biomarkers.tsv |
| Bevacizumab | bevacizumab | beva | beva_biomarkers.tsv |
| BRAFi | BRAFi | brafi | brafi_biomarkers.tsv |
| Paclitaxel | paclitaxel | pacli | pacli_biomarkers.tsv |
| Chemo-FAC-FEC | cyclophos | cyclop | cyclop_biomarkers.tsv |

Examples:

# EXPRESSO-T for ICB-melanoma
Rscript expresso_all.R ICB pd1mel pd1mel_biomarkers.tsv target

# EXPRESSO-B for trastuzumab
Rscript expresso_all.R trastuzumab tras tras_biomarkers.tsv biomarker

# Vanilla LASSO (no target prior) for paclitaxel
Rscript expresso_all.R paclitaxel pacli pacli_biomarkers.tsv noTarget

Outputs written to results/:

| File | Contents |
|---|---|
| <intervention>_<cohort>_<mode>_expresso_AUCs.tsv | Per-cohort AUC and odds ratio (mid.OR), plus Mean/Median summary rows |
| <intervention>_<cohort>_<mode>_expresso_features.tsv | Genes selected per fold with LASSO coefficients |
| <intervention>_<cohort>_<mode>_expresso_features_frequency.tsv | Aggregated gene selection frequency across all folds |

Note on filename convention for downstream comparison scripts: The Python scripts (expressoT_MLmodels_generate.py, expressoB_MLmodels_generate.py) expect AUC files named <intervention>_LOCOCV_lasso_AUCs.tsv. Rename or symlink the corresponding output of expresso_all.R (<intervention>_<cohort>_<mode>_expresso_AUCs.tsv) before running them.
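One way to do this rename without touching the original output is to copy the file under the expected name. The helper below is a hypothetical convenience, not part of the repository; it only encodes the two filename patterns stated in this README and is demonstrated on a placeholder file in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def expose_auc_for_comparison(results_dir, intervention, cohort, mode, model="lasso"):
    """Copy the AUC table from expresso_all.R to the name the comparison scripts expect."""
    results_dir = Path(results_dir)
    source = results_dir / f"{intervention}_{cohort}_{mode}_expresso_AUCs.tsv"
    target = results_dir / f"{intervention}_LOCOCV_{model}_AUCs.tsv"
    shutil.copyfile(source, target)
    return target


# Demonstration in a throwaway directory; the file content is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ICB_pd1mel_target_expresso_AUCs.tsv").write_text("cohort\tAUC\n")
    created = expose_auc_for_comparison(tmp, "ICB", "pd1mel", "target")
    created_name = created.name
    created_exists = created.exists()
print(created_name)
```

Copying rather than renaming keeps the original output available for the other modes and cohorts.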


2. expresso_biomarker_search.R — Biomarker Discovery (Nested LOCO-CV)

Runs the nested LOCO-CV biomarker discovery procedure used to produce the per-fold biomarker gene lists for EXPRESSO-B. For each outer fold, identifies differentially expressed genes from the training cohorts using limma, combines fold-level p-values using Brown's method (to account for inter-fold dependence from overlapping training sets), and selects genes that improve LOCO-CV AUC by ≥ 0.02 (ΔAUC criterion).

Rscript expresso_biomarker_search.R <intervention> <cohort> <target> [options]

| Argument | Description |
|---|---|
| intervention | Drug name |
| cohort | Cohort identifier |
| target | Target gene(s), comma-separated (e.g. CD274 or CD274,PDCD1) |

Key options:

| Option | Default | Description |
|---|---|---|
| --de_fdr | 0.05 | FDR cutoff for DE gene selection |
| --de_logfc | 0.4 | Minimum mean absolute log-fold-change |
| --delta_auc | 0.02 | Minimum ΔAUC to accept a biomarker gene |
| --delta_pval | 0.05 | Wilcoxon p-value threshold for ΔAUC test |
| --max_biom | 3 | Maximum biomarker genes to add per fold |
| --ncores | auto | Parallel cores for deltaAUC computation |
| --result_dir | <cohort>_cv_fold_results | Output directory |
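The interaction of --delta_auc, --delta_pval, and --max_biom can be pictured with the sketch below. It assumes that candidates must pass both thresholds and that the largest ΔAUC wins when more than max_biom genes qualify; the actual ordering and tie-breaking in expresso_biomarker_search.R may differ, and all gene names and values are invented.

```python
def select_biomarkers(candidates, delta_auc=0.02, delta_pval=0.05, max_biom=3):
    """Keep genes passing both thresholds, largest delta-AUC first, capped at max_biom."""
    passing = [
        c for c in candidates
        if c["delta_auc"] >= delta_auc and c["pval"] <= delta_pval
    ]
    passing.sort(key=lambda c: c["delta_auc"], reverse=True)
    return [c["gene"] for c in passing[:max_biom]]


# Hypothetical per-gene results for one outer fold.
candidates = [
    {"gene": "GENE1", "delta_auc": 0.050, "pval": 0.01},
    {"gene": "GENE2", "delta_auc": 0.010, "pval": 0.001},  # AUC gain too small
    {"gene": "GENE3", "delta_auc": 0.030, "pval": 0.20},   # gain not significant
    {"gene": "GENE4", "delta_auc": 0.025, "pval": 0.04},
]
selected = select_biomarkers(candidates)
print(selected)
```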

Examples:

Rscript expresso_biomarker_search.R ICB pd1mel CD274
Rscript expresso_biomarker_search.R trastuzumab tras ERBB2 --ncores 8 --max_biom 3

Outputs written to --result_dir:

| File | Contents |
|---|---|
| <intervention>_<cohort>_cv_result_<test_id>.rds | Per-fold RDS with DE summary, deltaAUC results, selected biomarker genes |
| <intervention>_<cohort>_summary_auc_results.tsv | Summary AUC table across all folds |

The per-fold RDS files are used to populate the biomarkers column in <cohort>_biomarkers.tsv for input to expresso_all.R.


3. gene_signature_benchmarking.R — Published Signature Benchmarking

Evaluates a set of published gene signatures against all cohorts in a prospective (non-cross-validated) manner. Each signature is scored by the mean rank-normalized expression of its member genes; AUC is computed per cohort.

Rscript gene_signature_benchmarking.R <intervention> <cohort> <signatures_file> [options]

| Argument | Description |
|---|---|
| intervention | Drug name (used in output filename) |
| cohort | Cohort identifier |
| signatures_file | TSV file with a signatures column listing signature names |

Key options:

| Option | Default | Description |
|---|---|---|
| --sig_dir | comparisons | Directory containing per-signature TSV gene list files |
| --result_dir | comparisons | Directory for output TSV |
| --drug_label | same as intervention | Human-readable label in output table |

Example:

Rscript gene_signature_benchmarking.R trastuzumab tras comparisons/final_signatures.tsv \
  --sig_dir comparisons --result_dir results --drug_label "Trastuzumab"

Output: <intervention>_<cohort>_aucs_comparison_gene_signatures_long_loco_pros.tsv

Contains columns: cohort, AUC, drug, predictor (signature name).

The comparisons/ directory contains gene lists for all 16 published signatures used for benchmarking (e.g., TIDE, IMPRES, CYT, MammaPrint, OncotypeDX), each as a separate TSV file, along with references to the original publications. Signatures were implemented according to their original scoring procedures using the same rank-normalized expression data as EXPRESSO.
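The mean-expression scoring described at the start of this section can be sketched as below. The data layout (a dict mapping each gene to one value per sample), the choice to skip signature genes missing from the matrix, and all names and values are assumptions for illustration.

```python
def signature_scores(expression, signature_genes):
    """Mean expression of the signature's genes per sample, skipping unmeasured genes."""
    present = [g for g in signature_genes if g in expression]
    if not present:
        raise ValueError("none of the signature genes are in the expression matrix")
    n_samples = len(expression[present[0]])
    return [
        sum(expression[g][s] for g in present) / len(present)
        for s in range(n_samples)
    ]


# Hypothetical rank-normalized values: gene -> one value per sample.
expression = {
    "GENE1": [0.75, 0.25, 0.5],
    "GENE2": [0.25, 0.75, 1.0],
    "GENE3": [0.5, 0.5, 0.5],
}
scores = signature_scores(expression, ["GENE1", "GENE2", "GENE_MISSING"])
print(scores)
```

The per-sample scores would then be compared against the response labels within each cohort to obtain one AUC per cohort and signature.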


4. expressoT_MLmodels_generate.py — Alternative ML Models

5. expressoB_MLmodels_generate.py — Alternative ML Models

Generate LOCO-CV AUC results for alternative machine learning models (Random Forest, XGBoost, SVM, KNN, MLP) in the EXPRESSO-T and EXPRESSO-B contexts respectively, for direct comparison with EXPRESSO.

Output files follow the naming convention:

results/<intervention>_LOCOCV_<model>_AUCs.tsv

Both scripts then load the AUC TSV files for all models, run paired Wilcoxon signed-rank tests against the LASSO reference, compute bootstrap 95% confidence intervals, and produce a publication-ready boxplot.

python expressoT_MLmodels_generate.py \
  --results_dir results \
  --intervention ICBmel \
  --reference lasso \
  --output_dir figures

python expressoB_MLmodels_generate.py \
  --results_dir results \
  --intervention ICBmel \
  --reference lasso \
  --output_dir figures

| Option | Default | Description |
|---|---|---|
| --results_dir | (required) | Directory with <intervention>_LOCOCV_<model>_AUCs.tsv files |
| --intervention | (required) | Drug name matching the file prefix |
| --reference | lasso | Reference model for Wilcoxon test |
| --min_cohort_size | 0 | Minimum cohort size filter |
| --output_dir | figures | Directory for output PDF/PNG and stats TSV |
| --n_boot | 2000 | Bootstrap iterations for CI computation |

Outputs:

| File | Contents |
|---|---|
| <intervention>_model_stats.tsv | Mean AUC, 95% CI, Wilcoxon statistic, p-value, significance per model |
| <intervention>_boxplot_AUC.pdf/.png | Boxplot with per-cohort AUCs, jittered points, and significance stars |
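For readers unfamiliar with the bootstrap interval reported in the stats table, here is a minimal percentile-bootstrap sketch for the mean of per-cohort AUCs. It uses the standard-library `random.Random` instead of the NumPy generator the scripts are described as using, so its numbers will not match theirs; the resampling unit (cohorts), the percentile method, and the AUC values are assumptions.

```python
import random
from statistics import mean


def bootstrap_mean_ci(aucs, n_boot=2000, seed=42, level=0.95):
    """Percentile bootstrap confidence interval for the mean of per-cohort AUCs."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    boot_means = sorted(
        mean(rng.choices(aucs, k=len(aucs))) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lower = boot_means[int(alpha * n_boot)]
    upper = boot_means[int((1.0 - alpha) * n_boot) - 1]
    return lower, upper


# Hypothetical per-cohort AUCs for one model.
aucs = [0.62, 0.71, 0.58, 0.80, 0.67]
low, high = bootstrap_mean_ci(aucs)
print(f"mean={mean(aucs):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of cohorts the interval is wide, which is worth keeping in mind when comparing models whose mean AUCs differ by a few points.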

6. Methods_benchmarking.ipynb — Walkthrough Notebook

A Jupyter notebook demonstrating a full end-to-end run of EXPRESSO-T and EXPRESSO-B for a single drug and cohort, including data loading, model training, AUC evaluation, and signature benchmarking. Intended as a reproducibility reference and example for new users.


Typical Full Workflow

Step 1: Biomarker discovery (produces per-fold biomarker gene lists)
  └─ expresso_biomarker_search.R  →  <cohort>_biomarkers.tsv

Step 2: LOCO-CV model evaluation
  ├─ expresso_all.R (mode=target)    →  EXPRESSO-T AUCs + features
  ├─ expresso_all.R (mode=biomarker) →  EXPRESSO-B AUCs + features
  └─ expresso_all.R (mode=noTarget)  →  Vanilla LASSO AUCs

Step 3: Alternative ML model evaluation
  ├─ expressoT_MLmodels_generate.py  →  RF, XGB, SVM, KNN, MLP AUCs (T context)
  └─ expressoB_MLmodels_generate.py  →  RF, XGB, SVM, KNN, MLP AUCs (B context)

Step 4: Signature benchmarking
  ├─ gene_signature_benchmarking.R   →  Published signature AUCs per cohort
  └─ Methods_benchmarking.ipynb      →  Published methods AUCs per cohort

Step 5: Statistical comparison and visualization
  ├─ expressoT_MLmodels_generate.py  →  EXPRESSO-T boxplot vs ML models
  └─ expressoB_MLmodels_generate.py  →  EXPRESSO-B boxplot vs ML models

Notes

  • The helper functions in source_EXPRESSO_func.R must be in the same directory as the calling script, or the --source_file path must be updated.
  • Cohorts containing dbGaP-restricted data require data access approval; results for those cohorts will differ from the manuscript until access is granted.
  • Genes are restricted to those measured in at least 3 cohorts (controlled by --gene_cutoff/gene_cohort_cutoff_) to enable consistent cross-cohort model fitting.
  • The full list of cohorts used per drug is provided in Supplementary Table S1 of the manuscript.
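The gene filter mentioned in the notes above (genes measured in at least 3 cohorts) amounts to counting, for each gene, the cohorts in which it appears. The sketch below shows that counting on invented data; the function name and cohort contents are hypothetical.

```python
from collections import Counter


def genes_in_enough_cohorts(cohort_genes, min_cohorts=3):
    """Return genes measured in at least `min_cohorts` cohorts, sorted by name."""
    counts = Counter(g for genes in cohort_genes.values() for g in set(genes))
    return sorted(g for g, n in counts.items() if n >= min_cohorts)


# Hypothetical cohorts and the genes each one measures.
cohort_genes = {
    "cohortA": ["CD274", "GENE1", "GENE2"],
    "cohortB": ["CD274", "GENE1"],
    "cohortC": ["CD274", "GENE1", "GENE3"],
    "cohortD": ["CD274", "GENE2"],
}
kept = genes_in_enough_cohorts(cohort_genes)
print(kept)
```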
