EXPRESSO: Drug Response Prediction Framework

Overview

EXPRESSO (EXPression-based Response pREdiction via Supervised Signal Optimization) is a drug-specific supervised LASSO logistic regression framework for transcriptome-based treatment response prediction. It evaluates the predictive power of drug targets and biomarkers across patient cohorts using a Leave-One-Cohort-Out Cross-Validation (LOCO-CV) strategy: each cohort is held out in turn as the test set while the model is trained on the remaining cohorts.
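
The LOCO-CV splitting idea can be sketched in a few lines of Python. This is only an illustration of the strategy, not the code in expresso_all.R; the function name `loco_folds` and the cohort and sample IDs are hypothetical.

```python
from collections import defaultdict


def loco_folds(sample_to_cohort):
    """Yield (held_out_cohort, train_samples, test_samples), one fold per cohort."""
    by_cohort = defaultdict(list)
    for sample, cohort in sample_to_cohort.items():
        by_cohort[cohort].append(sample)
    for held_out in sorted(by_cohort):
        test = sorted(by_cohort[held_out])
        # Training set: every sample from every other cohort.
        train = sorted(
            s for cohort, members in by_cohort.items()
            if cohort != held_out
            for s in members
        )
        yield held_out, train, test


# Hypothetical sample-to-cohort mapping (IDs are illustrative only).
samples = {"s1": "cohortA", "s2": "cohortA", "s3": "cohortB", "s4": "cohortC"}
folds = list(loco_folds(samples))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Because whole cohorts are held out, no sample from the test cohort ever contributes to training, which is what makes the per-cohort AUC an estimate of cross-cohort generalization.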

Two model variants are implemented:

EXPRESSO-T: LASSO model with the known drug target gene(s) as penalty-free (regularization-exempt) features, with the rest of the transcriptome included as penalized predictors.
EXPRESSO-B: Extends EXPRESSO-T by additionally including context-specific biomarkers — identified via a nested LOCO-CV differential expression procedure — as penalty-free features alongside the target gene(s).
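
To make "penalty-free" concrete, the sketch below builds a per-gene penalty multiplier (0 for exempt genes, 1 otherwise) and applies one L1 soft-thresholding step with it. In glmnet this kind of exemption is usually expressed through the `penalty.factor` argument; whether EXPRESSO does exactly that is an assumption here, and the function names and placeholder genes `GENE_A`/`GENE_B` are hypothetical.

```python
def penalty_factors(genes, penalty_free):
    """One multiplier per gene: 0.0 = exempt from the L1 penalty, 1.0 = fully penalized."""
    exempt = set(penalty_free)
    return [0.0 if g in exempt else 1.0 for g in genes]


def soft_threshold(coefs, lam, factors):
    """L1 shrinkage step where each coefficient is shrunk by lam * its own factor."""
    shrunk = []
    for beta, factor in zip(coefs, factors):
        t = lam * factor
        if beta > t:
            shrunk.append(beta - t)
        elif beta < -t:
            shrunk.append(beta + t)
        else:
            shrunk.append(0.0)
    return shrunk


genes = ["CD274", "GENE_A", "GENE_B"]
# EXPRESSO-T style: only the target gene is exempt.
factors_t = penalty_factors(genes, ["CD274"])
# EXPRESSO-B style: target plus a (hypothetical) biomarker are exempt.
factors_b = penalty_factors(genes, ["CD274", "GENE_B"])

coefs = [0.05, 0.05, -0.30]
shrunk_t = soft_threshold(coefs, 0.1, factors_t)
shrunk_b = soft_threshold(coefs, 0.1, factors_b)
print(shrunk_t)
print(shrunk_b)
```

With the same small coefficient, the exempt target gene survives the shrinkage step while the penalized gene is set to zero, so the prior-knowledge features always stay in the model.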

Two baseline models are included for comparison:

Vanilla LASSO (noTarget mode): Supervised LASSO with no biological prior; the full transcriptome is penalized equally.
Unsupervised target-only: Ranks patients by the expression of the target gene alone, with no model training.
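
The unsupervised target-only baseline needs no fitting: the target expression value itself is the score, and the AUC follows from pairwise comparisons. A minimal sketch, assuming higher target expression is taken to indicate response (the direction is not stated above) and using made-up values:

```python
def auc_from_scores(scores, labels):
    """AUC = probability that a responder scores above a non-responder (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both responders and non-responders are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative rank-normalized target expression and labels (1 = Responder).
target_expr = [0.9, 0.7, 0.4, 0.2, 0.7]
response = [1, 1, 0, 0, 0]
auc = auc_from_scores(target_expr, response)
print(round(auc, 3))
```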

Repository Structure

project/
│
├── expresso_all.R                         # Main LOCO-CV script (EXPRESSO-T / -B / vanilla)
├── expresso_biomarker_search.R            # Nested LOCO-CV biomarker discovery (EXPRESSO-B)
├── gene_signature_benchmarking.R          # Gene signature benchmarking (prospective)
├── expressoT_MLmodels_generate.py         # Train/evaluate alternative ML models (EXPRESSO-T context)
├── expressoB_MLmodels_generate.py         # Train/evaluate alternative ML models (EXPRESSO-B context)
├── Methods_benchmarking.ipynb             # Jupyter notebook: full benchmarking walkthrough
├── source_EXPRESSO_func.R                 # All shared R helper functions (single merged source)
│
├── data/
│   ├── <cohort>/                          # One directory per drug cohort group
│   │   ├── metadata.tsv                   # Cohort metadata (cohort, cancer, type, drug, targets)
│   │   ├── response.tsv                   # Sample-level treatment response labels
│   │   └── mrna/
│   │       └── <cohort_id>.rds            # Expression matrix (genes × samples, numeric matrix)
│   └── biomarkers/
│       └── <cohort>_biomarkers.tsv        # Per-fold target/biomarker gene definitions
│
├── comparisons/
│   ├── final_signatures.tsv               # Index of all 16 published gene signatures
│   ├── <signature_name>.tsv               # One file per gene signature (gene lists)
│   └── ...
│
└── results/                               # All output TSV files written here

Input Data Format

`metadata.tsv`

Column	Description
`cohort`	Unique cohort identifier (e.g. `GSE78220`)
`cancer`	Cancer type
`type`	Sample type
`drug`	Drug/intervention name
`targets`	Drug target gene(s)

`response.tsv`

Column	Description
`cohort`	Cohort identifier (must match `metadata.tsv`)
`sample`	Sample ID (must match column names in expression matrix)
`response`	`Responder` or `Non-responder`

`mrna/<cohort_id>.rds`

A numeric R matrix with genes as row names and sample IDs as column names. Expression values may be raw or log-scale; normalization is applied internally (see the Normalization section).

`data/biomarkers/<cohort>_biomarkers.tsv`

Column	Description
`test_id`	Cohort held out as test set for this fold
`target`	Comma-separated target gene(s) for `target` mode
`biomarkers`	Comma-separated biomarker genes for `biomarker` mode
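
As an illustration of this layout, the following Python sketch reads such a table and splits the comma-separated gene fields per fold. The reader function and the example rows are hypothetical; only the three column names come from the table above.

```python
import csv
import io


def parse_gene_list(field):
    """Split a comma-separated gene field, dropping empty entries and whitespace."""
    return [g.strip() for g in field.split(",") if g.strip()]


def read_biomarker_table(handle):
    """Map each held-out cohort (test_id) to its target and biomarker gene lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    folds = {}
    for row in reader:
        folds[row["test_id"]] = {
            "target": parse_gene_list(row["target"]),
            "biomarkers": parse_gene_list(row["biomarkers"]),
        }
    return folds


# Made-up example content; GENE_A / GENE_B are placeholders, not real results.
example = (
    "test_id\ttarget\tbiomarkers\n"
    "cohortA\tCD274\tGENE_A,GENE_B\n"
    "cohortB\tCD274,PDCD1\t\n"
)
folds = read_biomarker_table(io.StringIO(example))
print(folds["cohortA"])
```

An empty `biomarkers` field yields an empty list, which corresponds to a fold where only the target gene(s) would be penalty-free.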

Normalization

Raw gene expression values are rank-normalized within each sample (converting expression values to fractional ranks in [0, 1]) using the rank_normalization() function in source_EXPRESSO_func.R. Because each sample is normalized using only its own values, the procedure is applied identically to all cohorts prior to model training and does not leak information between training and test cohorts.

The --normalization argument (default: "ranked") controls this behavior:

Value	Description
`ranked`	Within-sample rank normalization to [0,1] — default, recommended
`NPN`	Nonparanormal transformation (rank + probit)
`raw`	No normalization applied
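
The two non-trivial options can be sketched for a single sample as follows. This is not the rank_normalization() implementation: tie handling (average ranks), the rank/n scaling for `ranked`, and the rank/(n+1) scaling before the probit for `NPN` are assumptions chosen to keep the example well defined.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def ranked(values):
    """Within-sample fractional ranks (assumed scaling: rank / n)."""
    n = len(values)
    return [r / n for r in average_ranks(values)]


def npn(values):
    """Rank followed by probit (assumed scaling: rank / (n + 1) to avoid infinities)."""
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


sample = [5.0, 0.0, 12.5, 5.0]  # one sample's expression values across four genes
ranked_sample = ranked(sample)
npn_sample = npn(sample)
print(ranked_sample)
print([round(z, 3) for z in npn_sample])
```

Both transforms depend only on the ordering of genes within a sample, so they are insensitive to platform-specific scale differences between cohorts.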

Software Requirements

R (≥ 4.0)

# CRAN packages ("parallel" ships with base R and needs no installation):
install.packages(c(
  "tidyverse", "data.table", "glmnet", "caret",
  "rsample", "pROC", "metap"
))

# Bioconductor packages: limma (differential expression) and
# EmpiricalBrownsMethod (Brown's method p-value combination for biomarker search):
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("limma", "EmpiricalBrownsMethod"))

Python (= 3.11)

pip install numpy pandas scipy matplotlib

Random Seeds

All stochastic procedures use fixed random seeds for full reproducibility:

set.seed(31052024) is called immediately before each cv.glmnet() model build in source_EXPRESSO_func.R
The bootstrap CI functions in the Python scripts (expressoT_MLmodels_generate.py, expressoB_MLmodels_generate.py) use np.random.default_rng(42)

Scripts and Usage

1. `expresso_all.R` — Main LOCO-CV (EXPRESSO-T / -B / noTarget)

Runs LOCO-CV for a single drug and produces AUCs, odds ratios, and selected features.

Rscript expresso_all.R <intervention> <cohort> <biomarker_file> <mode> [normalization] [response_weights]

Argument	Description
`intervention`	Drug name (e.g. `ICB`, `trastuzumab`)
`cohort`	Cohort directory under `data/` (e.g. `pd1mel`)
`biomarker_file`	TSV file under `data/biomarkers/` with per-fold target/biomarker genes
`mode`	`target` (EXPRESSO-T), `biomarker` (EXPRESSO-B), or `noTarget`
`normalization`	(optional) `ranked` (default), `NPN`, or `raw`
`response_weights`	(optional) `TRUE` (default) or `FALSE`

Drug-cohort reference table:

Drug	intervention	cohort	biomarker_file
ICB – melanoma	`ICB`	`pd1mel`	`pd1mel_biomarkers.tsv`
ICB – non-melanoma	`ICB`	`pd1oth`	`pd1oth_biomarkers.tsv`
Trastuzumab	`trastuzumab`	`tras`	`tras_biomarkers.tsv`
Bevacizumab	`bevacizumab`	`beva`	`beva_biomarkers.tsv`
BRAFi	`BRAFi`	`brafi`	`brafi_biomarkers.tsv`
Paclitaxel	`paclitaxel`	`pacli`	`pacli_biomarkers.tsv`
Chemo-FAC-FEC	`cyclophos`	`cyclop`	`cyclop_biomarkers.tsv`

Examples:

# EXPRESSO-T for ICB-melanoma
Rscript expresso_all.R ICB pd1mel pd1mel_biomarkers.tsv target

# EXPRESSO-B for trastuzumab
Rscript expresso_all.R trastuzumab tras tras_biomarkers.tsv biomarker

# Vanilla LASSO (no target prior) for paclitaxel
Rscript expresso_all.R paclitaxel pacli pacli_biomarkers.tsv noTarget
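
To run every drug-cohort pair from the reference table in all three modes, the command lines can be generated rather than typed. The small Python driver below is a convenience sketch, not part of the repository; it only builds the argument lists and assumes each biomarker file is named `<cohort>_biomarkers.tsv`, as in the table.

```python
# (intervention, cohort) pairs copied from the drug-cohort reference table.
DRUG_COHORTS = [
    ("ICB", "pd1mel"),
    ("ICB", "pd1oth"),
    ("trastuzumab", "tras"),
    ("bevacizumab", "beva"),
    ("BRAFi", "brafi"),
    ("paclitaxel", "pacli"),
    ("cyclophos", "cyclop"),
]
MODES = ["target", "biomarker", "noTarget"]


def build_commands(pairs=DRUG_COHORTS, modes=MODES):
    """Return one Rscript argument list per (drug-cohort pair, mode)."""
    commands = []
    for intervention, cohort in pairs:
        for mode in modes:
            commands.append([
                "Rscript", "expresso_all.R",
                intervention, cohort, f"{cohort}_biomarkers.tsv", mode,
            ])
    return commands


commands = build_commands()
for cmd in commands[:3]:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the project root to execute the runs sequentially.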

Outputs written to results/:

File	Contents
`<intervention>_<cohort>_<mode>_expresso_AUCs.tsv`	Per-cohort AUC and odds ratio (mid.OR), plus Mean/Median summary rows
`<intervention>_<cohort>_<mode>_expresso_features.tsv`	Genes selected per fold with LASSO coefficients
`<intervention>_<cohort>_<mode>_expresso_features_frequency.tsv`	Aggregated gene selection frequency across all folds
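
The frequency file is an aggregation of the per-fold feature lists. The sketch below shows one way such a summary can be computed; the exact columns written by expresso_all.R are not documented here, so the (gene, count, fraction) layout and the placeholder gene names are assumptions.

```python
from collections import Counter


def selection_frequency(features_per_fold):
    """Count in how many folds each gene was selected; most frequent genes first."""
    n_folds = len(features_per_fold)
    counts = Counter(
        gene
        for genes in features_per_fold.values()
        for gene in set(genes)  # a gene counts at most once per fold
    )
    rows = [(gene, count, count / n_folds) for gene, count in counts.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical genes with non-zero LASSO coefficients in each fold.
per_fold = {
    "cohortA": ["CD274", "GENE_A"],
    "cohortB": ["CD274"],
    "cohortC": ["CD274", "GENE_A", "GENE_B"],
}
freq = selection_frequency(per_fold)
for gene, count, fraction in freq:
    print(gene, count, round(fraction, 2))
```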

Note on filename convention for downstream comparison scripts: The Python scripts (expressoT_MLmodels_generate.py, expressoB_MLmodels_generate.py) expect AUC files named <intervention>_LOCOCV_lasso_AUCs.tsv, whereas expresso_all.R writes <intervention>_<cohort>_<mode>_expresso_AUCs.tsv. Rename or symlink the relevant expresso_all.R output to the expected name before running them.
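
The renaming step can be scripted. The helper below is a hypothetical sketch that copies (rather than renames) the expresso_all.R output so the original file is preserved; the `prefix` parameter is an assumption for cases where the prefix passed to the Python scripts differs from the drug name.

```python
import shutil
import tempfile
from pathlib import Path


def publish_lasso_aucs(results_dir, intervention, cohort, mode, prefix=None):
    """Copy <intervention>_<cohort>_<mode>_expresso_AUCs.tsv to <prefix>_LOCOCV_lasso_AUCs.tsv."""
    results = Path(results_dir)
    src = results / f"{intervention}_{cohort}_{mode}_expresso_AUCs.tsv"
    dst = results / f"{prefix or intervention}_LOCOCV_lasso_AUCs.tsv"
    if not src.exists():
        raise FileNotFoundError(src)
    shutil.copyfile(src, dst)
    return dst


# Demonstration in a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "ICB_pd1mel_target_expresso_AUCs.tsv"
    source.write_text("cohort\tAUC\n")  # placeholder content, not the real column layout
    out = publish_lasso_aucs(tmp, "ICB", "pd1mel", "target")
    copied_name = out.name
    copied_ok = out.read_text() == source.read_text()
print(copied_name)
```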

2. `expresso_biomarker_search.R` — Biomarker Discovery (Nested LOCO-CV)

Runs the nested LOCO-CV biomarker discovery procedure used to produce the per-fold biomarker gene lists for EXPRESSO-B. For each outer fold, identifies differentially expressed genes from the training cohorts using limma, combines fold-level p-values using Brown's method (to account for inter-fold dependence from overlapping training sets), and selects genes that improve LOCO-CV AUC by ≥ 0.02 (ΔAUC criterion).

Rscript expresso_biomarker_search.R <intervention> <cohort> <target> [options]

Argument	Description
`intervention`	Drug name
`cohort`	Cohort identifier
`target`	Target gene(s), comma-separated (e.g. `CD274` or `CD274,PDCD1`)

Key options:

Option	Default	Description
`--de_fdr`	`0.05`	FDR cutoff for DE gene selection
`--de_logfc`	`0.4`	Minimum mean absolute log-fold-change
`--delta_auc`	`0.02`	Minimum ΔAUC to accept a biomarker gene
`--delta_pval`	`0.05`	Wilcoxon p-value threshold for ΔAUC test
`--max_biom`	`3`	Maximum biomarker genes to add per fold
`--ncores`	auto	Parallel cores for deltaAUC computation
`--result_dir`	`<cohort>_cv_fold_results`	Output directory
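
How the last three thresholds interact can be illustrated with a simple filter. This is a sketch only: it assumes candidates are kept when ΔAUC ≥ `delta_auc` and p < `delta_pval`, then ranked by ΔAUC and truncated to `max_biom`. The actual selection order in expresso_biomarker_search.R may differ, and the candidate values are invented.

```python
def select_biomarkers(candidates, delta_auc=0.02, delta_pval=0.05, max_biom=3):
    """Keep candidates passing both thresholds, best delta-AUC first, at most max_biom genes."""
    passing = [
        c for c in candidates
        if c["delta_auc"] >= delta_auc and c["pval"] < delta_pval
    ]
    passing.sort(key=lambda c: (-c["delta_auc"], c["gene"]))
    return [c["gene"] for c in passing[:max_biom]]


# Invented candidates for one outer fold (gene names are placeholders).
candidates = [
    {"gene": "GENE_A", "delta_auc": 0.050, "pval": 0.010},
    {"gene": "GENE_B", "delta_auc": 0.010, "pval": 0.001},  # gain too small
    {"gene": "GENE_C", "delta_auc": 0.030, "pval": 0.200},  # not significant
    {"gene": "GENE_D", "delta_auc": 0.020, "pval": 0.040},
    {"gene": "GENE_E", "delta_auc": 0.040, "pval": 0.030},
    {"gene": "GENE_F", "delta_auc": 0.025, "pval": 0.010},
]
selected = select_biomarkers(candidates)
print(selected)
```

Four candidates pass both thresholds here, but the default `--max_biom` of 3 caps the list, so the weakest passing gene is dropped.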

Examples:

Rscript expresso_biomarker_search.R ICB pd1mel CD274
Rscript expresso_biomarker_search.R trastuzumab tras ERBB2 --ncores 8 --max_biom 3

Outputs written to --result_dir:

File	Contents
`<intervention>_<cohort>_cv_result_<test_id>.rds`	Per-fold RDS with DE summary, deltaAUC results, selected biomarker genes
`<intervention>_<cohort>_summary_auc_results.tsv`	Summary AUC table across all folds

The per-fold RDS files are used to populate the biomarkers column in <cohort>_biomarkers.tsv for input to expresso_all.R.

3. `gene_signature_benchmarking.R` — Published Signature Benchmarking

Evaluates a set of published gene signatures against all cohorts in a prospective (non-cross-validated) manner. Each signature is scored by the mean rank-normalized expression of its member genes; AUC is computed per cohort.

Rscript gene_signature_benchmarking.R <intervention> <cohort> <signatures_file> [options]

Argument	Description
`intervention`	Drug name (used in output filename)
`cohort`	Cohort identifier
`signatures_file`	TSV file with a `signatures` column listing signature names

Key options:

Option	Default	Description
`--sig_dir`	`comparisons`	Directory containing per-signature TSV gene list files
`--result_dir`	`comparisons`	Directory for output TSV
`--drug_label`	same as `intervention`	Human-readable label in output table

Example:

Rscript gene_signature_benchmarking.R trastuzumab tras comparisons/final_signatures.tsv \
  --sig_dir comparisons --result_dir results --drug_label "Trastuzumab"

Output: <intervention>_<cohort>_aucs_comparison_gene_signatures_long_loco_pros.tsv

Contains columns: cohort, AUC, drug, predictor (signature name).
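
The mean-expression scoring described for this script can be sketched as below. Skipping signature genes that are not measured in a sample is an assumption about missing-gene handling, and the gene names and values are placeholders.

```python
def signature_score(sample_expr, signature_genes):
    """Mean rank-normalized expression over the signature genes measured in this sample."""
    present = [sample_expr[g] for g in signature_genes if g in sample_expr]
    if not present:
        raise ValueError("no signature genes are measured in this sample")
    return sum(present) / len(present)


# Rank-normalized expression per sample (illustrative values only).
cohort_expr = {
    "s1": {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.1},
    "s2": {"GENE_A": 0.2, "GENE_B": 0.4, "GENE_C": 0.7},
}
signature = ["GENE_A", "GENE_B", "GENE_X"]  # GENE_X is absent and therefore skipped

scores = {sample: signature_score(expr, signature) for sample, expr in cohort_expr.items()}
for sample, score in scores.items():
    print(sample, round(score, 3))
```

The per-sample scores are then compared against the response labels to obtain one AUC per cohort and signature, which is what the output table records.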

The comparisons/ directory contains gene lists for all 16 published signatures used for benchmarking (e.g., TIDE, IMPRES, CYT, MammaPrint, OncotypeDX), each as a separate TSV file, along with references to the original publications. Signatures were implemented according to their original scoring procedures using the same rank-normalized expression data as EXPRESSO.

4. `expressoT_MLmodels_generate.py` — Alternative ML Models

5. `expressoB_MLmodels_generate.py` — Alternative ML Models

Generate LOCO-CV AUC results for alternative machine learning models (Random Forest, XGBoost, SVM, KNN, MLP) in the EXPRESSO-T and EXPRESSO-B contexts respectively, for direct comparison with EXPRESSO.

Output files follow the naming convention:

results/<intervention>_LOCOCV_<model>_AUCs.tsv

Both scripts also load the AUC TSV files for all models, run paired Wilcoxon signed-rank tests against the LASSO reference, compute bootstrap 95% confidence intervals, and produce a publication-ready boxplot.

python expressoT_MLmodels_generate.py \
  --results_dir results \
  --intervention ICBmel \
  --reference lasso \
  --output_dir figures

python expressoB_MLmodels_generate.py \
  --results_dir results \
  --intervention ICBmel \
  --reference lasso \
  --output_dir figures

Option	Default	Description
`--results_dir`	(required)	Directory with `<intervention>_LOCOCV_<model>_AUCs.tsv` files
`--intervention`	(required)	Drug name matching the file prefix
`--reference`	`lasso`	Reference model for Wilcoxon test
`--min_cohort_size`	`0`	Minimum cohort size filter
`--output_dir`	`figures`	Directory for output PDF/PNG and stats TSV
`--n_boot`	`2000`	Bootstrap iterations for CI computation

Outputs:

File	Contents
`<intervention>_model_stats.tsv`	Mean AUC, 95% CI, Wilcoxon statistic, p-value, significance per model
`<intervention>_boxplot_AUC.pdf/.png`	Boxplot with per-cohort AUCs, jittered points, and significance stars
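
For readers unfamiliar with the confidence intervals in the stats table, the following is a minimal percentile-bootstrap sketch for a mean AUC. It deliberately uses Python's standard-library `random.Random(42)` instead of NumPy's generator, so its numbers will not match those produced by the scripts; the AUC values are invented.

```python
import random


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean: resample with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(values) / n, lower, upper


# Invented per-cohort AUCs for one model.
aucs = [0.62, 0.71, 0.58, 0.80, 0.66, 0.74]
mean_auc, ci_low, ci_high = bootstrap_mean_ci(aucs)
print(round(mean_auc, 3), round(ci_low, 3), round(ci_high, 3))
```

Fixing the seed makes repeated calls return identical intervals, which is the reproducibility property the Random Seeds section relies on.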

6. `Methods_benchmarking.ipynb` — Walkthrough Notebook

A Jupyter notebook demonstrating a full end-to-end run of EXPRESSO-T and EXPRESSO-B for a single drug and cohort, including data loading, model training, AUC evaluation, and signature benchmarking. Intended as a reproducibility reference and example for new users.

Typical Full Workflow

Step 1: Biomarker discovery (produces per-fold biomarker gene lists)
  └─ expresso_biomarker_search.R  →  <cohort>_biomarkers.tsv

Step 2: LOCO-CV model evaluation
  ├─ expresso_all.R (mode=target)    →  EXPRESSO-T AUCs + features
  ├─ expresso_all.R (mode=biomarker) →  EXPRESSO-B AUCs + features
  └─ expresso_all.R (mode=noTarget)  →  Vanilla LASSO AUCs

Step 3: Alternative ML model evaluation
  ├─ expressoT_MLmodels_generate.py  →  RF, XGB, SVM, KNN, MLP AUCs (T context)
  └─ expressoB_MLmodels_generate.py  →  RF, XGB, SVM, KNN, MLP AUCs (B context)

Step 4: Signature benchmarking
  ├─ gene_signature_benchmarking.R        →  Published signature AUCs per cohort
  └─ Methods_benchmarking.ipynb           →  Published Methods AUCs per cohort

Step 5: Statistical comparison and visualization
  ├─ expressoT_MLmodels_generate.py  →  EXPRESSO-T boxplot vs ML models
  └─ expressoB_MLmodels_generate.py  →  EXPRESSO-B boxplot vs ML models

Notes

The helper functions in source_EXPRESSO_func.R must be in the same directory as the calling script, or the --source_file path must be updated.
Cohorts containing dbGaP-restricted data require data access approval; results for those cohorts will differ from the manuscript until access is granted.
Genes are restricted to those measured in at least 3 cohorts (controlled by --gene_cutoff/gene_cohort_cutoff_) to enable consistent cross-cohort model fitting.
The full list of cohorts used per drug is provided in Supplementary Table S1 of the manuscript.
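
The gene restriction mentioned in the notes (genes measured in at least 3 cohorts) amounts to a simple count over cohorts. A sketch with hypothetical cohort and gene names:

```python
from collections import Counter


def genes_in_min_cohorts(cohort_genes, min_cohorts=3):
    """Return genes measured in at least min_cohorts cohorts, sorted by name."""
    counts = Counter(
        gene
        for genes in cohort_genes.values()
        for gene in set(genes)  # each cohort contributes at most one count per gene
    )
    return sorted(gene for gene, count in counts.items() if count >= min_cohorts)


# Hypothetical gene coverage per cohort.
cohort_genes = {
    "cohortA": ["CD274", "GENE_A", "GENE_B"],
    "cohortB": ["CD274", "GENE_A"],
    "cohortC": ["CD274", "GENE_A", "GENE_C"],
    "cohortD": ["CD274", "GENE_B"],
}
kept = genes_in_min_cohorts(cohort_genes)
print(kept)
```

Raising the cutoff trades gene coverage for consistency: fewer genes survive, but each surviving gene is available in more training cohorts.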
