HPAseg

HPAseg is a batch segmentation and single-cell quantification tool for HPA-style (Human Protein Atlas) brightfield immunohistochemistry (IHC) images: RGB images stained with hematoxylin for nuclei and DAB for the target protein. For every image it:

- Separates the hematoxylin and DAB stains (color deconvolution).
- Detects nuclei and segments the image into individual cells.
- Measures the nuclear (hematoxylin) and protein (DAB) intensity of each cell.
- Classifies every cell as protein-positive or protein-negative at two thresholds (low and high).
- Writes per-cell measurements (cell_data.csv) and a set of visualization overlays.

It reads a batch description from batch.csv, processes each row independently (a failure on one image is logged and skipped, not fatal), and writes all results under a per-sample output folder.

Requirements

Python 3.12 (the project and Docker image are pinned to 3.12.13).
The Python packages in requirements.txt:
- numpy==1.26.4
- pandas==2.1.4
- Pillow==10.4.0
- scikit-image==0.22.0
- scipy==1.11.4

Installation

Local virtual environment

cd /home/fredbn/python-sandbox/hpaseg  # the author's checkout; use the path to your own clone
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Docker

A Dockerfile and docker-compose.yml are provided. The image is based on python:3.12.13-slim and installs requirements.txt at build time.

docker compose build

See Running the code → Docker below for how to launch it.

Setup

HPAseg reads the list of images to process from batch.csv. Each row is one image. The columns, in order, are:

image_uri,type,distribution,output_folder,output_prefix,parameters

Column	Required	Meaning
`image_uri`	yes	Path or URL of the source RGB image. Local paths or `https://…` URLs both work (read with `skimage.io.imread`). Images with an alpha channel are reduced to RGB.
`type`	yes	The protein localization type (e.g. `cytoplasm`, `nuclei`, `membrane`). Passed through to the analysis as `protein_type`.
`distribution`	yes	Staining distribution hint. Use `unknown` to let HPAseg detect it automatically, or set it explicitly (see Distribution values below).
`output_folder`	yes	Sub-path (relative to the data folder) where this sample's results are written. May contain sub-directories, e.g. `ASMTL/Cerebral_cortex_1`.
`output_prefix`	yes	Name used for the saved copy of the source image (`<output_prefix>.jpg`).
`parameters`	no	Per-row tuning, as a `key=value;key=value` string (see The `parameters` column). Leave empty to use defaults.

Example batch.csv:

image_uri,type,distribution,output_folder,output_prefix,parameters
https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15
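
The column layout above can be checked before starting a run. The following is a minimal sketch, assuming batch.csv is plain comma-separated text with a header row; `load_batch` and `REQUIRED` are illustrative names and not part of HPAseg.

```python
import csv
import io

# Columns the Setup table marks as required; "parameters" may be left empty.
REQUIRED = ["image_uri", "type", "distribution", "output_folder", "output_prefix"]

def load_batch(handle):
    """Return the batch rows as dicts; raise ValueError if a required value is missing."""
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"batch.csv line {line_no}: missing {missing}")
        row["parameters"] = (row.get("parameters") or "").strip()
        rows.append(row)
    return rows

# The example row from this README, held in memory instead of on disk.
example = io.StringIO(
    "image_uri,type,distribution,output_folder,output_prefix,parameters\n"
    "https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,"
    "ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15\n"
)
batch = load_batch(example)
print(batch[0]["output_folder"], batch[0]["parameters"])
```

With a real file, pass `open("batch.csv", newline="")` instead of the in-memory example.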

Parameters

There are three places where behavior can be tuned: the per-row parameters column, command-line options, and module-level constants.

The `parameters` column

A ;-separated list of key=value pairs. Only the following two keys are used; any other key is ignored with a warning printed to the log (>>> ignoring unsupported parameters: …):

Key	Type	Default	Effect
`nuclei_size`	int	`6`	Approximate nuclei radius used to size the blob detector (`blob_log`) and the autolevel footprint. Larger values detect larger / more separated nuclei.
`nuclei_expansion`	int	`20`	How many pixels each nucleus label is grown to approximate the full cell (`expand_labels`). Set to a negative value to disable expansion (cells = nuclei only).

Note: keys such as nuclei_definition or membrane_diameter are not supported and are ignored; including them only documents intent.
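
To make the format concrete, here is a minimal sketch of how such a string could be parsed. It is an assumption about the behavior described above, not HPAseg's actual code: `parse_parameters` and `SUPPORTED` are hypothetical names, and the exact text after the warning prefix is a guess.

```python
# Supported keys and their defaults, taken from the table above.
SUPPORTED = {"nuclei_size": 6, "nuclei_expansion": 20}

def parse_parameters(text):
    """Parse 'key=value;key=value' into a dict, starting from the defaults."""
    params = dict(SUPPORTED)
    ignored = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # empty column or trailing ';'
        key, _, value = item.partition("=")
        key = key.strip()
        if key in SUPPORTED:
            params[key] = int(value)  # both supported keys are integers
        else:
            ignored.append(item)
    if ignored:
        # Prefix copied from the README; the rest of the message is assumed.
        print(">>> ignoring unsupported parameters:", ", ".join(ignored))
    return params

print(parse_parameters("nuclei_size=15;membrane_diameter=4"))
```

An empty string returns the defaults unchanged, and `nuclei_expansion=-1` is kept as a negative integer so that expansion can be disabled downstream.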

Command-line options

Passed to hpaseg.py as -key=value arguments:

Option	Default	Effect
`-data=<path>`	`./data`	Root folder under which each row's `output_folder` is created.
`-full_run=<b>`	`True`	When truthy (`1`/`true`/`yes`), also writes the extra QC/visualization images (deconvolved channels and the binarized overlay masks). When false, only the core outputs and `cell_data.csv` are produced — faster, fewer files.

Example:

python -u hpaseg.py -data=./dataValidation -full_run=true
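
The `-key=value` style is not what `argparse` produces by default, so a hand-rolled parser is the likely shape. The sketch below is an illustration only: `parse_options` is a hypothetical function, and treating every spelling other than `1`/`true`/`yes` as false is an assumption.

```python
def parse_options(argv):
    """Parse -key=value arguments into a dict, starting from the documented defaults."""
    options = {"data": "./data", "full_run": True}
    for arg in argv:
        if not arg.startswith("-") or "=" not in arg:
            continue  # not a -key=value argument
        key, _, value = arg.lstrip("-").partition("=")
        if key == "data":
            options["data"] = value
        elif key == "full_run":
            # Truthy spellings listed in the options table.
            options["full_run"] = value.strip().lower() in ("1", "true", "yes")
    return options

print(parse_options(["-data=./dataValidation", "-full_run=true"]))
```

In a script, the argument list would typically be `sys.argv[1:]`.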

Module-level constants

These are not exposed on the command line; edit the source to change them.

In hpaseg.py:

Constant	Default	Meaning
`default_blob_reg_size`	`6`	Fallback for `nuclei_size` when a row doesn't specify it.
`distribution_weak_factor`	`3.0`	Sensitivity of the automatic `weak` classification (higher → fewer images called weak).
`distribution_localized_factor`	`5.0`	Sensitivity of the automatic `localized` classification.
`data_folder`	`./data`	Default data root (overridden by `-data=`).

In segmentation_lite.py:

Constant	Default	Meaning
`nuclei_expansion`	`20`	Default cell expansion (overridden per row by the `parameters` column).
`bin_percentage`	`100.0`	Scales the positive/negative threshold derived from the protein Otsu levels (lower → more permissive positivity calls).
`nuclei_marker`	`nuclei`	Marker name used internally for the nuclei channel.

Distribution values

The distribution column controls how aggressively the protein signal is thresholded. If set to unknown, HPAseg inspects the multi-Otsu levels of the DAB channel relative to the hematoxylin channel and appends one or more tags:

Detected tag	When	Effect on thresholding
(none)	DAB levels comparable to nuclei	Standard 3-level Otsu (`low`/`high` = levels 1/2).
`weak`	DAB levels sit below nuclei levels but mean signal is non-trivial	Finer 5-level Otsu (`low`/`high` = levels 2/4).
`localized`	The high-signal area is a small fraction of the low-signal area (added on top of `weak`)	Shifts the chosen levels upward to isolate the few strong spots.
`strong`	DAB levels sit above nuclei levels (marker dominates)	Standard 3-level Otsu.

You can also set these tags manually in the CSV (e.g. weak, weak,localized) to skip auto-detection for that row.
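
The table can be read as a small lookup from tags to Otsu settings. The sketch below only illustrates that reading: `otsu_plan` is a hypothetical helper, the level numbers are copied from the table without interpreting how HPAseg indexes them, and the upward shift applied for `localized` is not specified in this README, so it is reported as a flag rather than applied.

```python
def otsu_plan(distribution):
    """Map a distribution value to the Otsu settings listed in the table above."""
    tags = {t.strip().lower() for t in distribution.split(",") if t.strip()}
    if "unknown" in tags:
        return None  # auto-detection has to choose the tags first
    if "weak" in tags:
        plan = {"classes": 5, "low_level": 2, "high_level": 4}
    else:
        # No tag, or "strong": standard 3-level Otsu.
        plan = {"classes": 3, "low_level": 1, "high_level": 2}
    # The size of the shift for localized staining is not documented here.
    plan["localized"] = "localized" in tags
    return plan

print(otsu_plan("weak,localized"))
```

Because a value such as weak,localized contains a comma, a standard CSV reader needs it quoted ("weak,localized"); whether HPAseg expects that quoting is not stated here.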

Running the code

Local

Activate the virtual environment, then run hpaseg.py (it drives everything, including segmentation_lite.py):

source .venv/bin/activate
python -u hpaseg.py -data=./data -full_run=true

Process behavior:

- A lock file RUNNING (containing the PID) is created on start and removed on exit. If RUNNING already exists, HPAseg refuses to start ("Another HPAseg process seems to be running"). Delete it manually if a previous run crashed.
- If run_id.txt exists, its value is used as the run id; otherwise a random id is generated. The final id is written to LAST_RUN_ID at the end.
- Each row is processed in its own try/except; an error on one image is logged and the batch continues.
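
The lock-file and per-row error handling described above can be sketched as follows. This is an assumed reconstruction for illustration, not the code in hpaseg.py: `run_lock`, `process_batch` and `demo_row` are hypothetical names, and the log line format is invented.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def run_lock(work_dir):
    """Create RUNNING (holding the PID) on entry and remove it on exit."""
    lock_path = os.path.join(work_dir, "RUNNING")
    try:
        # Mode "x" fails if the file already exists, so two runs cannot both start.
        with open(lock_path, "x") as handle:
            handle.write(str(os.getpid()))
    except FileExistsError:
        raise RuntimeError("Another HPAseg process seems to be running") from None
    try:
        yield lock_path
    finally:
        os.remove(lock_path)

def process_batch(rows, process_row):
    """Run process_row on every row; log failures and keep going."""
    failed = []
    for index, row in enumerate(rows):
        try:
            process_row(row)
        except Exception as error:
            print(f"row {index}: {error!r} (skipped)")
            failed.append(index)
    return failed

def demo_row(row):
    # Stand-in for the real per-image work; "bad" simulates an unreadable image.
    if row == "bad":
        raise ValueError("cannot read image")

with tempfile.TemporaryDirectory() as work:
    with run_lock(work) as lock:
        lock_existed = os.path.exists(lock)
        failed_rows = process_batch(["ok", "bad", "ok"], demo_row)
    lock_removed = not os.path.exists(lock)
print(lock_existed, failed_rows, lock_removed)
```

A process that is killed outright never reaches the cleanup step, which is why a stale RUNNING file has to be deleted by hand after a crash.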

Docker

docker-compose.yml mounts a host work directory (set through the HPASEG_WORK environment variable, read from a .env file) into /opt/hpaseg/work and runs with -data=/opt/hpaseg/work/. When HPASEG_WORK is set, HPAseg reads ./work/batch.csv and keeps its RUNNING, run_id.txt and LAST_RUN_ID files in ./work/ instead of the project root. A log.txt is captured in the same work folder.

# .env must define HPASEG_WORK=/abs/path/to/work  (containing batch.csv)
docker compose up

Output / Results

Everything for a row is written under <data_folder>/<output_folder>/. Below, that base is shown as …/; its output/ subfolder holds the images, and output/analysis/ holds the segmentation and quantification.

Always produced

File	Description
`…/<output_prefix>.jpg`	A JPEG copy of the (RGB-reduced) source image.
`…/output/protein_high_mask_show.png`	RGBA overlay (red) of the strongly DAB-positive pixels (`high` Otsu level).
`…/output/protein_low_mask_show.png`	RGBA overlay (orange) of the weakly DAB-positive pixels (`low` Otsu level).
`…/output/nuclei_blobs.png`	The detected nuclei seed blobs used for segmentation.
`…/output/analysis/segmentation_mask.npy`	The integer label image (one id per cell) as a NumPy array — the canonical machine-readable segmentation.
`…/output/analysis/segmentation_mask_show.png`	RGBA overlay of the cell boundaries (green) for visual QC.
`…/output/analysis/cell_data.csv`	The main result: one row per segmented cell (see columns below).

Only when `full_run=true`

File	Description
`…/output/nuclei.jpg`	The deconvolved hematoxylin (nuclei) channel. (Only when `distribution=unknown`.)
`…/output/protein.jpg`	The deconvolved DAB (protein) channel. (Only when `distribution=unknown`.)
`…/output/analysis/segmentation_mask_binarized_low_show.png`	RGBA overlay (orange) of only the cells called positive at the low threshold.
`…/output/analysis/segmentation_mask_binarized_high_show.png`	RGBA overlay (red) of only the cells called positive at the high threshold.
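
The two tables above can be turned into a checklist of expected files for one batch row. This is a minimal sketch built only from the file names listed here; `expected_outputs` is a hypothetical helper, and it assumes that the deconvolved channel images depend on `distribution=unknown` exactly as noted in the table.

```python
from pathlib import PurePosixPath

def expected_outputs(data_folder, output_folder, output_prefix,
                     full_run=True, distribution="unknown"):
    """List the files this README says are written for one batch row."""
    base = PurePosixPath(data_folder) / output_folder
    out = base / "output"
    analysis = out / "analysis"
    files = [
        base / f"{output_prefix}.jpg",
        out / "protein_high_mask_show.png",
        out / "protein_low_mask_show.png",
        out / "nuclei_blobs.png",
        analysis / "segmentation_mask.npy",
        analysis / "segmentation_mask_show.png",
        analysis / "cell_data.csv",
    ]
    if full_run:
        if distribution == "unknown":
            # Deconvolved channels are only saved during auto-detection.
            files += [out / "nuclei.jpg", out / "protein.jpg"]
        files += [
            analysis / "segmentation_mask_binarized_low_show.png",
            analysis / "segmentation_mask_binarized_high_show.png",
        ]
    return [str(path) for path in files]

paths = expected_outputs("./data", "ASMTL/Cerebral_cortex_1", "ASMTL")
print("\n".join(paths))
```

Comparing this list with the files actually present is a quick way to spot rows that failed part-way through.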

`cell_data.csv` columns

Column	Description
`cell_id`	Unique integer label of the cell (matches `segmentation_mask.npy`).
`size`	Cell area in pixels.
`x`, `y`	Cell centroid coordinates (column, row).
`nuclei`	Mean hematoxylin (nuclei) intensity of the cell, 0–255.
`nuclei_local_90`	90th-percentile nuclei intensity over the cell's non-zero pixels.
`nuclei_bin_low`	`1` if the cell's `nuclei_local_90` reaches the nuclei threshold, else `0`.
`nuclei_bin_high`	`1` if the cell's mean nuclei intensity reaches the nuclei threshold, else `0`.
`protein`	Mean DAB (protein) intensity of the cell, 0–255.
`protein_local_90`	90th-percentile protein intensity over the cell's non-zero pixels.
`protein_bin_low`	`1` if the cell is protein-positive at the low (lenient, local-90 based) threshold.
`protein_bin_high`	`1` if the cell is protein-positive at the high (strict, mean based) threshold.

The *_bin_low / *_bin_high flags are the positive/negative calls: bin_low catches cells with a localized positive region, bin_high requires the whole cell to be above threshold.
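
The difference between the two flags is easiest to see on a single cell. The sketch below is an illustration under stated assumptions, not HPAseg's implementation: `cell_flags` is a hypothetical function, one threshold is used for both flags, the mean is taken over all pixels of the cell, and "reaches" is read as greater than or equal.

```python
import numpy as np

def cell_flags(pixels, threshold):
    """Compute mean, local-90 and the two positivity flags for one cell."""
    pixels = np.asarray(pixels, dtype=float)
    nonzero = pixels[pixels > 0]
    # 90th percentile over the cell's non-zero pixels only.
    local_90 = float(np.percentile(nonzero, 90)) if nonzero.size else 0.0
    mean = float(pixels.mean())
    return {
        "mean": mean,
        "local_90": local_90,
        "bin_low": int(local_90 >= threshold),   # lenient: a strong region is enough
        "bin_high": int(mean >= threshold),      # strict: the whole cell must be bright
    }

# A cell where only 20% of the pixels carry signal.
focal = cell_flags([0] * 80 + [200] * 20, threshold=100)
# A cell that is uniformly stained.
uniform = cell_flags([150] * 100, threshold=100)
print(focal, uniform)
```

The focal cell is positive at the low threshold only, while the uniformly stained cell is positive at both.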

How it works (pipeline summary)

- Load & deconvolve: read the image, strip alpha, and split into hematoxylin (nuclei) and DAB (protein) channels via separate_stains.
- Distribution detection: if distribution=unknown, compare the DAB and nuclei Otsu levels to tag the image weak / localized / strong.
- Protein masks: multi-Otsu threshold the normalized DAB channel into low/high regions and save the overlays.
- Nuclei detection: normalize and clip the hematoxylin channel, Otsu-mask it, smooth, autolevel, then detect nuclei with Laplacian-of-Gaussian blob detection (blob_log, sized by nuclei_size) to produce seed blobs.
- Cell segmentation: label the blobs and grow them by nuclei_expansion pixels (expand_labels) to approximate cell bodies.
- Quantification: for each cell, measure mean and local-90 intensity of both channels and derive the bin_low/bin_high positivity flags; write cell_data.csv.
- Binarized overlays (full_run): render the positive cells at the low and high thresholds as RGBA overlays.
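
Two of the steps above, stain separation and label expansion, map directly onto scikit-image calls. The following is a minimal sketch under assumptions: `deconvolve` and `grow_cells` are hypothetical wrappers, the hematoxylin + DAB matrix `hdx_from_rgb` is a guess at the stain matrix (the README does not say which one HPAseg passes to separate_stains), and the seeds are synthetic single pixels rather than real blob_log output.

```python
import numpy as np
from skimage.color import hdx_from_rgb, separate_stains
from skimage.segmentation import expand_labels

def deconvolve(rgb):
    """Split an RGB image into hematoxylin and DAB stain channels."""
    # Assumed stain matrix: hematoxylin in channel 0, DAB in channel 1.
    stains = separate_stains(rgb, hdx_from_rgb)
    return stains[..., 0], stains[..., 1]

def grow_cells(nuclei_labels, nuclei_expansion=20):
    """Grow each nucleus label outward; a negative value disables expansion."""
    if nuclei_expansion < 0:
        return nuclei_labels.copy()
    return expand_labels(nuclei_labels, distance=nuclei_expansion)

# Two synthetic one-pixel nuclei, ten pixels apart on the same row.
seeds = np.zeros((11, 21), dtype=int)
seeds[5, 5] = 1
seeds[5, 15] = 2
cells = grow_cells(seeds, nuclei_expansion=3)
print(cells[5])
```

With a small expansion the two cells stay separate and the pixels midway between them remain background; with a larger one, expand_labels assigns each pixel to its nearest nucleus without letting labels overlap.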
