EXPRESSO (EXPression-based Response pREdiction via Supervised Signal Optimization) is a drug-specific supervised LASSO logistic regression framework for transcriptome-based treatment response prediction. It evaluates predictive power of drug targets and biomarkers across patient cohorts using a Leave-One-Cohort-Out Cross-Validation (LOCO-CV) strategy.
Two model variants are implemented:
- EXPRESSO-T: LASSO model with the known drug target gene(s) as penalty-free (regularization-exempt) features, with the rest of the transcriptome included as penalized predictors.
- EXPRESSO-B: Extends EXPRESSO-T by additionally including context-specific biomarkers — identified via a nested LOCO-CV differential expression procedure — as penalty-free features alongside the target gene(s).
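To make the notion of penalty-free features concrete, here is a minimal sketch in plain Python. It is not the repository's implementation; it fits an L1-penalized logistic regression by proximal gradient descent, where a per-feature `penalty_factor` of 0 exempts a feature (for example a drug target gene) from shrinkage. The function name, toy data, and hyperparameters are illustrative assumptions.

```python
import math

def fit_lasso_logistic(X, y, penalty_factor, lam=1.0, lr=0.1, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent.

    penalty_factor[j] scales the L1 penalty on coefficient j;
    a value of 0 leaves feature j unpenalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        # Gradient of the mean logistic loss
        grad = [0.0] * p
        grad0 = 0.0
        for xi, yi in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(beta, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad0 += err / n
            for j in range(p):
                grad[j] += err * xi[j] / n
        intercept -= lr * grad0  # the intercept is never penalized
        for j in range(p):
            step = beta[j] - lr * grad[j]
            # Soft-thresholding; the threshold is zero for penalty-free features
            thresh = lr * lam * penalty_factor[j]
            beta[j] = math.copysign(max(abs(step) - thresh, 0.0), step)
    return intercept, beta

# Toy data (hypothetical): column 0 plays the role of the target gene,
# columns 1-2 stand in for the rest of the transcriptome.
X = [[0.9, 0.2, 0.5], [0.8, 0.7, 0.1], [0.7, 0.4, 0.9],
     [0.3, 0.6, 0.4], [0.2, 0.1, 0.8], [0.1, 0.9, 0.3]]
y = [1, 1, 1, 0, 0, 0]

_, beta_target = fit_lasso_logistic(X, y, penalty_factor=[0, 1, 1])
_, beta_vanilla = fit_lasso_logistic(X, y, penalty_factor=[1, 1, 1])
print("target penalty-free:", beta_target)
print("all penalized:      ", beta_vanilla)
```

With this deliberately strong penalty, the fully penalized fit shrinks every coefficient to zero, while the fit with a penalty-free first column keeps a non-zero coefficient for it.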
Two baseline models are included for comparison:
- Vanilla LASSO (`noTarget` mode): Supervised LASSO with no biological prior; the full transcriptome is penalized equally.
- Unsupervised target-only: Ranks patients by the expression of the target gene alone, with no model training.

    project/
    │
    ├── expresso_all.R                   # Main LOCO-CV script (EXPRESSO-T / -B / vanilla)
    ├── expresso_biomarker_search.R      # Nested LOCO-CV biomarker discovery (EXPRESSO-B)
    ├── gene_signature_benchmarking.R    # Gene signature benchmarking (prospective)
    ├── expressoT_MLmodels_generate.py   # Train/evaluate alternative ML models (EXPRESSO-T context)
    ├── expressoB_MLmodels_generate.py   # Train/evaluate alternative ML models (EXPRESSO-B context)
    ├── Methods_benchmarking.ipynb       # Jupyter notebook: full benchmarking walkthrough
    ├── source_EXPRESSO_func.R           # All shared R helper functions (single merged source)
    │
    ├── data/
    │   ├── <cohort>/                    # One directory per drug cohort group
    │   │   ├── metadata.tsv             # Cohort metadata (cohort, cancer, type, drug, targets)
    │   │   ├── response.tsv             # Sample-level treatment response labels
    │   │   └── mrna/
    │   │       └── <cohort_id>.rds      # Expression matrix (genes × samples, numeric matrix)
    │   └── biomarkers/
    │       └── <cohort>_biomarkers.tsv  # Per-fold target/biomarker gene definitions
    │
    ├── comparisons/
    │   ├── final_signatures.tsv         # Index of all 16 published gene signatures
    │   ├── <signature_name>.tsv         # One file per gene signature (gene lists)
    │   └── ...
    │
    └── results/                         # All output TSV files written here

`metadata.tsv` columns:

| Column | Description |
|---|---|
| `cohort` | Unique cohort identifier (e.g. GSE78220) |
| `cancer` | Cancer type |
| `type` | Sample type |
| `drug` | Drug/intervention name |
| `targets` | Drug target gene(s) |

`response.tsv` columns:

| Column | Description |
|---|---|
| `cohort` | Cohort identifier (must match `metadata.tsv`) |
| `sample` | Sample ID (must match column names in expression matrix) |
| `response` | `Responder` or `Non-responder` |

Each `mrna/<cohort_id>.rds` file holds a numeric R matrix with genes as row names and sample IDs as column names. Expression values may be supplied on the raw or log scale; normalization is applied internally (see the Normalization section).
`<cohort>_biomarkers.tsv` columns:

| Column | Description |
|---|---|
| `test_id` | Cohort held out as test set for this fold |
| `target` | Comma-separated target gene(s) for `target` mode |
| `biomarkers` | Comma-separated biomarker genes for `biomarker` mode |

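As an illustration of how a per-fold definition file with these three columns could be consumed, the sketch below parses it into a dictionary keyed by held-out cohort and derives the penalty-free gene set for each run mode. The helper names are hypothetical, `GENE_A`/`GENE_B` are placeholder biomarkers, and the mapping from mode to gene set follows the EXPRESSO-T/-B descriptions above rather than the repository's code.

```python
import csv
import io

def read_fold_definitions(handle):
    """Parse per-fold gene definitions into {test_id: {"target": [...], "biomarkers": [...]}}."""
    folds = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        folds[row["test_id"]] = {
            # Both columns hold comma-separated gene symbols
            "target": [g.strip() for g in row["target"].split(",") if g.strip()],
            "biomarkers": [g.strip() for g in row["biomarkers"].split(",") if g.strip()],
        }
    return folds

def penalty_free_genes(fold, mode):
    """Genes exempt from the LASSO penalty for a given run mode (assumed mapping)."""
    if mode == "target":
        return list(fold["target"])
    if mode == "biomarker":
        return fold["target"] + fold["biomarkers"]
    if mode == "noTarget":
        return []
    raise ValueError(f"unknown mode: {mode}")

# In-memory stand-in for a <cohort>_biomarkers.tsv file
example = io.StringIO(
    "test_id\ttarget\tbiomarkers\n"
    "GSE78220\tCD274,PDCD1\tGENE_A,GENE_B\n"
)
folds = read_fold_definitions(example)
print(penalty_free_genes(folds["GSE78220"], "biomarker"))
```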
Raw gene expression values are rank-normalized within each sample (converting expression values to fractional ranks in [0, 1]) using the rank_normalization() function in source_EXPRESSO_func_merged.R. This within-sample normalization is applied uniformly across all cohorts prior to model training.
The --normalization argument (default: "ranked") controls this behavior:

| Value | Description |
|---|---|
| `ranked` | Within-sample rank normalization to [0,1] (default, recommended) |
| `NPN` | Nonparanormal transformation (rank + probit) |
| `raw` | No normalization applied |

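The three options can be sketched in plain Python as follows. This is an approximation for illustration, not a port of `rank_normalization()`: the tie handling (average ranks), the scaling of ranks by the number of genes, and the `n / (n + 1)` rescaling before the probit step are all assumptions.

```python
from statistics import NormalDist

def fractional_ranks(values):
    """Average-tie ranks of one sample's expression values, divided by the number of genes."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks

def npn_transform(values):
    """Rank + probit: map fractional ranks to standard-normal quantiles."""
    n = len(values)
    std_normal = NormalDist()
    # Rescale so the largest rank stays strictly below 1 before inverting the CDF
    return [std_normal.inv_cdf(r * n / (n + 1)) for r in fractional_ranks(values)]

def normalize_sample(values, method="ranked"):
    """Dispatch on the three documented normalization values."""
    if method == "ranked":
        return fractional_ranks(values)
    if method == "NPN":
        return npn_transform(values)
    if method == "raw":
        return list(values)
    raise ValueError(f"unknown normalization: {method}")

print(normalize_sample([5.0, 1.0, 3.0, 3.0], "ranked"))
print(normalize_sample([1.0, 2.0, 3.0], "NPN"))
```

Because ranks are computed within each sample, the transformed values depend only on the ordering of genes in that sample, not on the measurement scale of the cohort.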

    # CRAN packages ("parallel" ships with base R and needs no installation)
    install.packages(c(
      "tidyverse", "data.table", "glmnet", "caret",
      "rsample", "pROC", "metap"
    ))
    # Bioconductor packages are installed through BiocManager
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    # For biomarker search (Brown's method p-value combination):
    BiocManager::install("EmpiricalBrownsMethod")
    # For differential expression:
    BiocManager::install("limma")

    pip install numpy pandas scipy matplotlib

All stochastic procedures use fixed random seeds for full reproducibility:
- `set.seed(31052024)` is called immediately before each `cv.glmnet()` model build in `source_EXPRESSO_func_merged.R`
- The bootstrap CI functions in the Python comparison scripts use `np.random.default_rng(42)`
Runs LOCO-CV for a single drug and produces AUCs, odds ratios, and selected features.

    Rscript expresso_all.R <intervention> <cohort> <biomarker_file> <mode> [normalization] [response_weights]

| Argument | Description |
|---|---|
| `intervention` | Drug name (e.g. `ICB`, `trastuzumab`) |
| `cohort` | Cohort directory under `data/` (e.g. `pd1mel`) |
| `biomarker_file` | TSV file under `data/biomarkers/` with per-fold target/biomarker genes |
| `mode` | `target` (EXPRESSO-T), `biomarker` (EXPRESSO-B), or `noTarget` |
| `normalization` | (optional) `ranked` (default), `NPN`, or `raw` |
| `response_weights` | (optional) `TRUE` (default) or `FALSE` |

Drug-cohort reference table:

| Drug | intervention | cohort | biomarker_file |
|---|---|---|---|
| ICB – melanoma | `ICB` | `pd1mel` | `pd1mel_biomarkers.tsv` |
| ICB – non-melanoma | `ICB` | `pd1oth` | `pd1oth_biomarkers.tsv` |
| Trastuzumab | `trastuzumab` | `tras` | `tras_biomarkers.tsv` |
| Bevacizumab | `bevacizumab` | `beva` | `beva_biomarkers.tsv` |
| BRAFi | `BRAFi` | `brafi` | `brafi_biomarkers.tsv` |
| Paclitaxel | `paclitaxel` | `pacli` | `pacli_biomarkers.tsv` |
| Chemo-FAC-FEC | `cyclophos` | `cyclop` | `cyclop_biomarkers.tsv` |

Examples:

    # EXPRESSO-T for ICB-melanoma
    Rscript expresso_all.R ICB pd1mel pd1mel_biomarkers.tsv target
    # EXPRESSO-B for trastuzumab
    Rscript expresso_all.R trastuzumab tras tras_biomarkers.tsv biomarker
    # Vanilla LASSO (no target prior) for paclitaxel
    Rscript expresso_all.R paclitaxel pacli pacli_biomarkers.tsv noTarget

Outputs written to `results/`:

| File | Contents |
|---|---|
| `<intervention>_<cohort>_<mode>_expresso_AUCs.tsv` | Per-cohort AUC and odds ratio (`mid.OR`), plus Mean/Median summary rows |
| `<intervention>_<cohort>_<mode>_expresso_features.tsv` | Genes selected per fold with LASSO coefficients |
| `<intervention>_<cohort>_<mode>_expresso_features_frequency.tsv` | Aggregated gene selection frequency across all folds |

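The per-cohort AUC rows and the Mean/Median summary come from a leave-one-cohort-out loop. A minimal sketch of that loop in plain Python is shown below; the toy "model" only learns the direction of association between a single feature and response, and all names and data are hypothetical stand-ins rather than the logic of `expresso_all.R`.

```python
from statistics import mean, median

def auc(scores, labels):
    """AUC as the probability that a responder outscores a non-responder (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def loco_splits(cohort_of_sample):
    """Yield (held_out_cohort, train_idx, test_idx), holding out one whole cohort per fold."""
    for held_out in sorted(set(cohort_of_sample)):
        train = [i for i, c in enumerate(cohort_of_sample) if c != held_out]
        test = [i for i, c in enumerate(cohort_of_sample) if c == held_out]
        yield held_out, train, test

def loco_cv(cohort_of_sample, labels, fit, predict):
    """Train on all other cohorts, score the held-out cohort, and return its AUC."""
    results = {}
    for held_out, train, test in loco_splits(cohort_of_sample):
        model = fit(train)
        results[held_out] = auc(predict(model, test), [labels[i] for i in test])
    return results

# Toy data (hypothetical): three cohorts, one normalized feature per sample
cohorts = ["cohort_A"] * 4 + ["cohort_B"] * 4 + ["cohort_C"] * 4
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]
feature = [0.9, 0.7, 0.2, 0.4, 0.8, 0.3, 0.6, 0.1, 0.5, 0.9, 0.6, 0.2]

def fit(train_idx):
    # Stand-in model: sign of the responder vs non-responder mean difference
    resp = [feature[i] for i in train_idx if labels[i] == 1]
    non = [feature[i] for i in train_idx if labels[i] == 0]
    return 1.0 if mean(resp) >= mean(non) else -1.0

def predict(direction, test_idx):
    return [direction * feature[i] for i in test_idx]

per_cohort_auc = loco_cv(cohorts, labels, fit, predict)
summary = {"Mean": mean(list(per_cohort_auc.values())),
           "Median": median(list(per_cohort_auc.values()))}
print(per_cohort_auc, summary)
```

Holding out an entire cohort, rather than random samples, means each AUC reflects transfer to a dataset the model has never seen.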
Note on filename convention for downstream comparison scripts: The Python comparison scripts (`expresso_compare_models.py`, `expressoB_compare_models.py`) expect AUC files named `<intervention>_LOCOCV_lasso_AUCs.tsv`. Rename or symlink the output from `expresso_all.R` accordingly before running the comparison scripts.
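One way to stage the file under the expected name is a small copy helper such as the sketch below. The function and its `prefix` parameter are hypothetical; which `mode` output should serve as the `lasso` reference is not stated here, so it is left as an argument.

```python
from pathlib import Path
import shutil

def stage_lasso_aucs(results_dir, intervention, cohort, mode, prefix=None):
    """Copy an expresso_all.R AUC table to the name the comparison scripts look for.

    prefix is the <intervention> part of the target filename; it defaults to the
    intervention argument but can differ if the comparison run uses another label.
    """
    results_dir = Path(results_dir)
    src = results_dir / f"{intervention}_{cohort}_{mode}_expresso_AUCs.tsv"
    dst = results_dir / f"{prefix or intervention}_LOCOCV_lasso_AUCs.tsv"
    if not src.exists():
        raise FileNotFoundError(src)
    shutil.copyfile(src, dst)  # copy rather than move, so the original output is kept
    return dst

# Example call (assumes the source file already exists under results/):
# stage_lasso_aucs("results", "ICB", "pd1mel", "target")
```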
Runs the nested LOCO-CV biomarker discovery procedure used to produce the per-fold biomarker gene lists for EXPRESSO-B. For each outer fold, identifies differentially expressed genes from the training cohorts using limma, combines fold-level p-values using Brown's method (to account for inter-fold dependence from overlapping training sets), and selects genes that improve LOCO-CV AUC by ≥ 0.02 (ΔAUC criterion).

    Rscript expresso_biomarker_search.R <intervention> <cohort> <target> [options]

| Argument | Description |
|---|---|
| `intervention` | Drug name |
| `cohort` | Cohort identifier |
| `target` | Target gene(s), comma-separated (e.g. `CD274` or `CD274,PDCD1`) |

Key options:

| Option | Default | Description |
|---|---|---|
| `--de_fdr` | `0.05` | FDR cutoff for DE gene selection |
| `--de_logfc` | `0.4` | Minimum mean absolute log-fold-change |
| `--delta_auc` | `0.02` | Minimum ΔAUC to accept a biomarker gene |
| `--delta_pval` | `0.05` | Wilcoxon p-value threshold for ΔAUC test |
| `--max_biom` | `3` | Maximum biomarker genes to add per fold |
| `--ncores` | auto | Parallel cores for deltaAUC computation |
| `--result_dir` | `<cohort>_cv_fold_results` | Output directory |

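Read together, the thresholds above act as a two-stage filter on candidate genes. The sketch below shows one plausible reading in plain Python, assuming each candidate already carries its combined FDR, mean absolute log-fold-change, ΔAUC, and Wilcoxon p-value. The record layout, the inclusive comparisons, and the ranking of survivors by ΔAUC are assumptions, not the script's documented behaviour.

```python
def select_biomarkers(candidates, de_fdr=0.05, de_logfc=0.4,
                      delta_auc=0.02, delta_pval=0.05, max_biom=3):
    """Return up to max_biom gene names that pass the DE and ΔAUC thresholds."""
    # Stage 1: differential expression filter
    de_genes = [c for c in candidates
                if c["fdr"] <= de_fdr and c["mean_abs_logfc"] >= de_logfc]
    # Stage 2: keep genes whose AUC gain is large enough and statistically supported
    passing = [c for c in de_genes
               if c["delta_auc"] >= delta_auc and c["pval"] <= delta_pval]
    # Assumed tie-break: prefer the largest AUC improvement
    passing.sort(key=lambda c: c["delta_auc"], reverse=True)
    return [c["gene"] for c in passing[:max_biom]]

# Hypothetical candidates for one outer fold
candidates = [
    {"gene": "GENE_A", "fdr": 0.01, "mean_abs_logfc": 0.90, "delta_auc": 0.05, "pval": 0.010},
    {"gene": "GENE_B", "fdr": 0.20, "mean_abs_logfc": 0.80, "delta_auc": 0.06, "pval": 0.010},
    {"gene": "GENE_C", "fdr": 0.02, "mean_abs_logfc": 0.50, "delta_auc": 0.01, "pval": 0.010},
    {"gene": "GENE_D", "fdr": 0.03, "mean_abs_logfc": 0.60, "delta_auc": 0.08, "pval": 0.030},
    {"gene": "GENE_E", "fdr": 0.01, "mean_abs_logfc": 0.70, "delta_auc": 0.03, "pval": 0.040},
    {"gene": "GENE_F", "fdr": 0.01, "mean_abs_logfc": 0.45, "delta_auc": 0.04, "pval": 0.020},
    {"gene": "GENE_G", "fdr": 0.01, "mean_abs_logfc": 0.30, "delta_auc": 0.10, "pval": 0.001},
]
selected = select_biomarkers(candidates)
print(selected)
```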
Examples:

    Rscript expresso_biomarker_search.R ICB pd1mel CD274
    Rscript expresso_biomarker_search.R trastuzumab tras ERBB2 --ncores 8 --max_biom 3

Outputs written to `--result_dir`:

| File | Contents |
|---|---|
| `<intervention>_<cohort>_cv_result_<test_id>.rds` | Per-fold RDS with DE summary, deltaAUC results, selected biomarker genes |
| `<intervention>_<cohort>_summary_auc_results.tsv` | Summary AUC table across all folds |

The per-fold RDS files are used to populate the biomarkers column in <cohort>_biomarkers.tsv for input to expresso_all.R.
Evaluates a set of published gene signatures against all cohorts in a prospective (non-cross-validated) manner. Each signature is scored by the mean rank-normalized expression of its member genes; AUC is computed per cohort.

    Rscript gene_signature_benchmarking.R <intervention> <cohort> <signatures_file> [options]

| Argument | Description |
|---|---|
| `intervention` | Drug name (used in output filename) |
| `cohort` | Cohort identifier |
| `signatures_file` | TSV file with a `signatures` column listing signature names |

Key options:

| Option | Default | Description |
|---|---|---|
| `--sig_dir` | `comparisons` | Directory containing per-signature TSV gene list files |
| `--result_dir` | `comparisons` | Directory for output TSV |
| `--drug_label` | same as `intervention` | Human-readable label in output table |

Example:

    Rscript gene_signature_benchmarking.R trastuzumab tras comparisons/final_signatures.tsv \
      --sig_dir comparisons --result_dir results --drug_label "Trastuzumab"

Output: `<intervention>_<cohort>_aucs_comparison_gene_signatures_long_loco_pros.tsv`
Contains columns: cohort, AUC, drug, predictor (signature name).
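The mean-expression scoring and per-cohort AUC described for this script can be sketched in plain Python as below. The sample records, gene names, and the `toy_signature` label are hypothetical, and the sketch covers only mean scoring, not any signature-specific scoring procedure.

```python
def signature_score(normalized_sample, signature_genes):
    """Mean rank-normalized expression over the signature genes measured in the sample."""
    present = [normalized_sample[g] for g in signature_genes if g in normalized_sample]
    if not present:
        raise ValueError("no signature genes measured in this sample")
    return sum(present) / len(present)

def auc(scores, labels):
    """AUC as the probability that a responder outscores a non-responder (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def benchmark_signature(samples, signature_name, signature_genes, drug):
    """Return one output row (cohort, AUC, drug, predictor) per cohort."""
    rows = []
    for cohort in sorted({s["cohort"] for s in samples}):
        members = [s for s in samples if s["cohort"] == cohort]
        scores = [signature_score(s["expr"], signature_genes) for s in members]
        labels = [s["response"] for s in members]
        rows.append({"cohort": cohort, "AUC": auc(scores, labels),
                     "drug": drug, "predictor": signature_name})
    return rows

# Hypothetical rank-normalized samples from a single cohort
samples = [
    {"cohort": "cohort_1", "response": 1, "expr": {"GENE_A": 0.9, "GENE_B": 0.7, "GENE_C": 0.1}},
    {"cohort": "cohort_1", "response": 0, "expr": {"GENE_A": 0.2, "GENE_B": 0.4, "GENE_C": 0.8}},
    {"cohort": "cohort_1", "response": 1, "expr": {"GENE_A": 0.6, "GENE_B": 0.8, "GENE_C": 0.3}},
    {"cohort": "cohort_1", "response": 0, "expr": {"GENE_A": 0.5, "GENE_B": 0.1, "GENE_C": 0.9}},
]
# GENE_X is not measured and is skipped when averaging
rows = benchmark_signature(samples, "toy_signature", ["GENE_A", "GENE_B", "GENE_X"], "Trastuzumab")
print(rows)
```

No model is trained here, which is why the evaluation needs no cross-validation: each cohort is scored directly with a fixed gene list.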
The comparisons/ directory contains gene lists for all 16 published signatures used for benchmarking (e.g., TIDE, IMPRES, CYT, MammaPrint, OncotypeDX), each as a separate TSV file, along with references to the original publications. Signatures were implemented according to their original scoring procedures using the same rank-normalized expression data as EXPRESSO.
Generate LOCO-CV AUC results for alternative machine learning models (Random Forest, XGBoost, SVM, KNN, MLP) in the EXPRESSO-T and EXPRESSO-B contexts respectively, for direct comparison with EXPRESSO.
Output files follow the naming convention:
results/<intervention>_LOCOCV_<model>_AUCs.tsv
Both scripts load the AUC TSV files for all models, run paired Wilcoxon signed-rank tests against the LASSO reference, compute bootstrap 95% confidence intervals, and produce a publication-ready boxplot.

    python expressoT_MLmodels_generate.py \
      --results_dir results \
      --intervention ICBmel \
      --reference lasso \
      --output_dir figures
    python expressoB_MLmodels_generate.py \
      --results_dir results \
      --intervention ICBmel \
      --reference lasso \
      --output_dir figures

| Option | Default | Description |
|---|---|---|
| `--results_dir` | (required) | Directory with `<intervention>_LOCOCV_<model>_AUCs.tsv` files |
| `--intervention` | (required) | Drug name matching the file prefix |
| `--reference` | `lasso` | Reference model for Wilcoxon test |
| `--min_cohort_size` | `0` | Minimum cohort size filter |
| `--output_dir` | `figures` | Directory for output PDF/PNG and stats TSV |
| `--n_boot` | `2000` | Bootstrap iterations for CI computation |

Outputs:

| File | Contents |
|---|---|
| `<intervention>_model_stats.tsv` | Mean AUC, 95% CI, Wilcoxon statistic, p-value, significance per model |
| `<intervention>_boxplot_AUC.pdf` / `.png` | Boxplot with per-cohort AUCs, jittered points, and significance stars |

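For readers who want to see how a 95% CI on the mean AUC can be obtained, here is a minimal percentile-bootstrap sketch using only the Python standard library. It is not the scripts' code: the scripts are described as seeding NumPy's generator, whereas this sketch uses `random.Random(42)`, and the AUC values are hypothetical.

```python
import random
from statistics import mean

def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of per-cohort AUCs."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(values)
    # Resample cohorts with replacement and record the mean of each resample
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot_means[round(alpha / 2 * n_boot)]
    upper = boot_means[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

aucs = [0.62, 0.71, 0.58, 0.80, 0.67, 0.74]  # hypothetical per-cohort AUCs for one model
ci_low, ci_high = bootstrap_ci_mean(aucs)
print(f"mean AUC = {mean(aucs):.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The resampling unit is the cohort, so the interval reflects between-cohort variability rather than within-cohort sampling noise.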
A Jupyter notebook demonstrating a full end-to-end run of EXPRESSO-T and EXPRESSO-B for a single drug and cohort, including data loading, model training, AUC evaluation, and signature benchmarking. Intended as a reproducibility reference and example for new users.
Step 1: Biomarker discovery (produces per-fold biomarker gene lists)
└─ expresso_biomarker_search.R → <cohort>_biomarkers.tsv
Step 2: LOCO-CV model evaluation
├─ expresso_all.R (mode=target) → EXPRESSO-T AUCs + features
├─ expresso_all.R (mode=biomarker) → EXPRESSO-B AUCs + features
└─ expresso_all.R (mode=noTarget) → Vanilla LASSO AUCs
Step 3: Alternative ML model evaluation
├─ expressoT_MLmodels_generate.py → RF, XGB, SVM, KNN, MLP AUCs (T context)
└─ expressoB_MLmodels_generate.py → RF, XGB, SVM, KNN, MLP AUCs (B context)
Step 4: Signature benchmarking
├─ gene_signature_benchmarking.R → Published signature AUCs per cohort
└─ Methods_benchmarking.ipynb → Published Methods AUCs per cohort
Step 5: Statistical comparison and visualization
├─ expressoT_MLmodels_generate.py → EXPRESSO-T boxplot vs ML models
└─ expressoB_MLmodels_generate.py → EXPRESSO-B boxplot vs ML models
- The helper functions in `source_EXPRESSO_func.R` must be in the same directory as the calling script, or the `--source_file` path must be updated.
- Cohorts containing dbGaP-restricted data require data access approval; results for those cohorts will differ from the manuscript until access is granted.
- Genes are restricted to those measured in at least 3 cohorts (controlled by `--gene_cutoff` / `gene_cohort_cutoff_`) to enable consistent cross-cohort model fitting.
- The full list of cohorts used per drug is provided in Supplementary Table S1 of the manuscript.
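
The gene restriction mentioned in the notes above amounts to counting, for each gene, the number of cohorts in which it is measured. A minimal sketch follows; the function name and the toy cohort contents are hypothetical, and only the default cutoff of 3 comes from the notes.

```python
from collections import Counter

def genes_in_enough_cohorts(genes_by_cohort, gene_cutoff=3):
    """Return genes measured in at least gene_cutoff cohorts, sorted by name."""
    # set() guards against a gene being listed twice within one cohort
    counts = Counter(g for genes in genes_by_cohort.values() for g in set(genes))
    return sorted(g for g, n in counts.items() if n >= gene_cutoff)

# Hypothetical row names of four expression matrices
genes_by_cohort = {
    "cohort_1": ["CD274", "PDCD1", "ERBB2"],
    "cohort_2": ["CD274", "PDCD1"],
    "cohort_3": ["CD274", "ERBB2", "ERBB2"],
    "cohort_4": ["CD274", "PDCD1"],
}
kept = genes_in_enough_cohorts(genes_by_cohort)
print(kept)
```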