HPAseg is a batch segmentation and single-cell quantification tool for HPA-style brightfield IHC images (RGB images stained with hematoxylin for nuclei and DAB for the target protein). For every image it:
- Separates the hematoxylin and DAB stains (color deconvolution).
- Detects and segments nuclei into individual cells.
- Measures the nuclei and protein intensity of each cell.
- Classifies every cell as protein-positive or negative (two thresholds: `low` and `high`).
- Writes per-cell measurements (`cell_data.csv`) and a set of visualization overlays.

It reads a batch description from batch.csv, processes each row independently
(a failure on one image is logged and skipped, not fatal), and writes all results
under a per-sample output folder.
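
The per-row isolation described above can be sketched as follows. This is a minimal illustration, not HPAseg's actual code: `process_row` and `run_batch` are hypothetical names, the rows are simplified dicts, and the log wording is invented for the example.

```python
import logging

log = logging.getLogger("hpaseg_sketch")


def process_row(row):
    """Hypothetical stand-in for the per-image analysis."""
    if not row["image_uri"]:
        raise ValueError("empty image_uri")
    return row["output_folder"]


def run_batch(rows, worker=process_row):
    """Run worker on every row; log failures and keep going."""
    done, failed = [], []
    for index, row in enumerate(rows, start=1):
        try:
            done.append(worker(row))
        except Exception as exc:  # one bad image must not stop the batch
            log.error("row %d failed: %s", index, exc)
            failed.append(index)
    return done, failed


rows = [
    {"image_uri": "a.jpg", "output_folder": "S1"},
    {"image_uri": "", "output_folder": "S2"},  # this row fails
    {"image_uri": "c.jpg", "output_folder": "S3"},
]
done, failed = run_batch(rows)
```

Here the second row raises, is logged, and the loop still reaches the third row.
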
- Python 3.12 (the project and Docker image are pinned to 3.12.13).
- The Python packages in `requirements.txt`:
  - `numpy==1.26.4`
  - `pandas==2.1.4`
  - `Pillow==10.4.0`
  - `scikit-image==0.22.0`
  - `scipy==1.11.4`

```bash
cd /home/fredbn/python-sandbox/hpaseg
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

A `Dockerfile` and `docker-compose.yml` are provided. The image is based on
`python:3.12.13-slim` and installs `requirements.txt` at build time.

```bash
docker compose build
```

See Running the code → Docker below for how to launch it.

HPAseg reads the list of images to process from batch.csv. Each row is one
image. The columns, in order, are:

```
image_uri,type,distribution,output_folder,output_prefix,parameters
```

| Column | Required | Meaning |
|---|---|---|
| `image_uri` | yes | Path or URL of the source RGB image. Local paths and `https://…` URLs both work (read with `skimage.io.imread`). Images with an alpha channel are reduced to RGB. |
| `type` | yes | The protein localization type (e.g. `cytoplasm`, `nuclei`, `membrane`). Passed through to the analysis as `protein_type`. |
| `distribution` | yes | Staining distribution hint. Use `unknown` to let HPAseg detect it automatically, or set it explicitly (see Distribution values below). |
| `output_folder` | yes | Sub-path (relative to the data folder) where this sample's results are written. May contain sub-directories, e.g. `ASMTL/Cerebral_cortex_1`. |
| `output_prefix` | yes | Name used for the saved copy of the source image (`<output_prefix>.jpg`). |
| `parameters` | no | Per-row tuning, as a `key=value;key=value` string (see The parameters column). Leave empty to use defaults. |

Example `batch.csv`:

```csv
image_uri,type,distribution,output_folder,output_prefix,parameters
https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15
```

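
To check a batch file before a long run, the column layout above can be validated with a short script. This is a sketch using only the standard `csv` module; `read_batch` is a hypothetical helper, and HPAseg's own loader may be more or less strict than this.

```python
import csv
import io

COLUMNS = ["image_uri", "type", "distribution",
           "output_folder", "output_prefix", "parameters"]
REQUIRED = COLUMNS[:-1]  # parameters may be left empty


def read_batch(handle):
    """Yield one dict per row, checking header order and required fields."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [c for c in REQUIRED if not (row[c] or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing {missing}")
        yield row


text = (
    "image_uri,type,distribution,output_folder,output_prefix,parameters\n"
    "https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,"
    "ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15\n"
)
rows = list(read_batch(io.StringIO(text)))
```

For a real file, pass `open("batch.csv", newline="")` instead of the in-memory sample.
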
There are three places where behavior can be tuned: the per-row parameters
column, command-line options, and module-level constants.
The `parameters` column is a `;`-separated list of `key=value` pairs. Only the
following two keys are used; any other key is ignored with a warning printed to
the log (`>>> ignoring unsupported parameters: …`):

| Key | Type | Default | Effect |
|---|---|---|---|
| `nuclei_size` | int | `6` | Approximate nuclei radius used to size the blob detector (`blob_log`) and the autolevel footprint. Larger values detect larger / more separated nuclei. |
| `nuclei_expansion` | int | `20` | How many pixels each nucleus label is grown to approximate the full cell (`expand_labels`). Set to a negative value to disable expansion (cells = nuclei only). |

Note: values such as `nuclei_definition` or `membrane_diameter` are not supported and will be ignored; they only document intent.

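
The parsing rule for the `parameters` column can be illustrated with a small function. This is a sketch, not HPAseg's parser: `parse_parameters` is a hypothetical name, and the text printed after the `>>> ignoring unsupported parameters:` prefix is an assumption.

```python
SUPPORTED = {"nuclei_size": 6, "nuclei_expansion": 20}  # documented defaults


def parse_parameters(text):
    """Return (settings, ignored) for a 'key=value;key=value' string."""
    settings = dict(SUPPORTED)
    ignored = []
    for item in filter(None, (part.strip() for part in text.split(";"))):
        key, _, value = item.partition("=")
        key = key.strip()
        if key in SUPPORTED:
            settings[key] = int(value)  # both supported keys are integers
        else:
            ignored.append(item)
    if ignored:
        print(">>> ignoring unsupported parameters:", ";".join(ignored))
    return settings, ignored


settings, ignored = parse_parameters("nuclei_size=15;membrane_diameter=4")
```

An empty string returns the defaults unchanged, which matches leaving the column empty.
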
Command-line options are passed to `hpaseg.py` as `-key=value` arguments:

| Option | Default | Effect |
|---|---|---|
| `-data=<path>` | `./data` | Root folder under which each row's `output_folder` is created. |
| `-full_run=<b>` | `True` | When truthy (`1`/`true`/`yes`), also writes the extra QC/visualization images (deconvolved channels and the binarized overlay masks). When false, only the core outputs and `cell_data.csv` are produced; this is faster and writes fewer files. |

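
The `-key=value` convention can be sketched as below. `parse_options` is a hypothetical function; treating the truthy values case-insensitively and silently skipping unrecognized arguments are assumptions of this sketch, not documented behavior.

```python
TRUTHY = {"1", "true", "yes"}


def parse_options(argv):
    """Parse '-key=value' arguments into a dict, starting from the defaults."""
    options = {"data": "./data", "full_run": True}
    for arg in argv:
        if not arg.startswith("-") or "=" not in arg:
            continue  # the sketch simply skips anything else
        key, _, value = arg[1:].partition("=")
        if key == "full_run":
            options[key] = value.strip().lower() in TRUTHY
        elif key == "data":
            options[key] = value
    return options


opts = parse_options(["-data=./dataValidation", "-full_run=no"])
```

In a script, the argument list would typically be `sys.argv[1:]`.
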
Example:

```bash
python -u hpaseg.py -data=./dataValidation -full_run=true
```

Module-level constants are not exposed on the command line; edit the source to change them.

In hpaseg.py:

| Constant | Default | Meaning |
|---|---|---|
| `default_blob_reg_size` | `6` | Fallback for `nuclei_size` when a row doesn't specify it. |
| `distribution_weak_factor` | `3.0` | Sensitivity of the automatic `weak` classification (higher → fewer images called weak). |
| `distribution_localized_factor` | `5.0` | Sensitivity of the automatic `localized` classification. |
| `data_folder` | `./data` | Default data root (overridden by `-data=`). |

In segmentation_lite.py:

| Constant | Default | Meaning |
|---|---|---|
| `nuclei_expansion` | `20` | Default cell expansion (overridden per row by the `parameters` column). |
| `bin_percentage` | `100.0` | Scales the positive/negative threshold derived from the protein Otsu levels (lower → more permissive positivity calls). |
| `nuclei_marker` | `nuclei` | Marker name used internally for the nuclei channel. |

The distribution column controls how aggressively the protein signal is
thresholded. If set to unknown, HPAseg inspects the multi-Otsu levels of the
DAB channel relative to the hematoxylin channel and appends one or more tags:

| Detected tag | When | Effect on thresholding |
|---|---|---|
| (none) | DAB levels comparable to nuclei | Standard 3-level Otsu (low/high = levels 1/2). |
| `weak` | DAB levels sit below nuclei levels but mean signal is non-trivial | Finer 5-level Otsu (low/high = levels 2/4). |
| `localized` | The high-signal area is a small fraction of the low-signal area (added on top of `weak`) | Shifts the chosen levels upward to isolate the few strong spots. |
| `strong` | DAB levels sit above nuclei levels (marker dominates) | Standard 3-level Otsu. |

You can also set these tags manually in the CSV (e.g. `weak` or `weak,localized`)
to skip auto-detection for that row.
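
The tag-to-threshold mapping in the table can be sketched in plain Python. Several things here are assumptions: that an n-level Otsu yields n-1 sorted thresholds, that "levels" are 1-based positions in that list, and that the tags have already been resolved (no `unknown`). The upward shift for `localized` is not modeled because its size is not specified. `otsu_plan` and `pick_thresholds` are hypothetical names.

```python
def otsu_plan(distribution):
    """Map distribution tags to (classes, low_level, high_level).

    Levels are 1-based positions in the sorted list of Otsu thresholds.
    """
    tags = {t.strip() for t in distribution.split(",") if t.strip()}
    if "weak" in tags:
        return 5, 2, 4  # finer 5-level Otsu
    return 3, 1, 2      # standard 3-level Otsu (no tag, or strong)


def pick_thresholds(distribution, thresholds):
    """Pick the low/high cut-offs from already computed thresholds."""
    classes, low, high = otsu_plan(distribution)
    if len(thresholds) != classes - 1:
        raise ValueError(f"expected {classes - 1} thresholds")
    ordered = sorted(thresholds)
    return ordered[low - 1], ordered[high - 1]


low, high = pick_thresholds("weak", [40, 80, 120, 160])
```

The thresholds themselves would come from a multi-Otsu routine applied to the DAB channel; the numbers above are made up for illustration.
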
Activate the virtual environment, then run hpaseg.py (it drives everything,
including segmentation_lite.py):

```bash
source .venv/bin/activate
python -u hpaseg.py -data=./data -full_run=true
```

Process behavior:
- A lock file `RUNNING` (containing the PID) is created on start and removed on exit. If `RUNNING` already exists, HPAseg refuses to start ("Another HPAseg process seems to be running"). Delete it manually if a previous run crashed.
- If `run_id.txt` exists, its value is used as the run id; otherwise a random id is generated. The final id is written to `LAST_RUN_ID` at the end.
- Each row is processed in its own `try`/`except`; an error on one image is logged and the batch continues.

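
The lock-file and run-id behavior listed above might look roughly like this. It is a sketch under assumptions: `acquire_lock` and `resolve_run_id` are hypothetical helpers, the random id format (a UUID hex string) is invented, and a temporary directory stands in for the real working folder.

```python
import os
import tempfile
import uuid
from pathlib import Path


def acquire_lock(folder):
    """Create RUNNING with our PID, or fail if it is already there."""
    lock = Path(folder) / "RUNNING"
    try:
        # "x" mode fails if the file exists, so check-and-create is one step
        with open(lock, "x") as handle:
            handle.write(str(os.getpid()))
    except FileExistsError:
        raise RuntimeError("Another HPAseg process seems to be running")
    return lock


def resolve_run_id(folder):
    """Use run_id.txt when present, otherwise generate a random id."""
    source = Path(folder) / "run_id.txt"
    if source.exists():
        return source.read_text().strip()
    return uuid.uuid4().hex  # id format is an assumption


with tempfile.TemporaryDirectory() as work:
    lock = acquire_lock(work)
    try:
        run_id = resolve_run_id(work)
        # ... process the batch here ...
        (Path(work) / "LAST_RUN_ID").write_text(run_id)
    finally:
        lock.unlink()  # removed on exit, even after an error
    lock_removed = not lock.exists()
    saved_id = (Path(work) / "LAST_RUN_ID").read_text()
```

A hard crash (for example `kill -9`) skips the `finally` block, which is why a stale `RUNNING` file may need to be deleted by hand.
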
docker-compose.yml mounts a host work directory (via the HPASEG_WORK
environment variable, read from a .env file) into /opt/hpaseg/work and runs
with -data=/opt/hpaseg/work/. When the HPASEG_WORK env var is present,
HPAseg reads ./work/batch.csv and writes its RUNNING, run_id.txt and
LAST_RUN_ID into ./work/ instead of the project root. A log.txt is captured
in the same work folder.

```bash
# .env must define HPASEG_WORK=/abs/path/to/work (containing batch.csv)
docker compose up
```

Everything for a row is written under `<data_folder>/<output_folder>/`. Below,
that base is shown as `…/`; its `output/` subfolder holds the images, and
`output/analysis/` holds the quantification.

Core outputs, always written:

| File | Description |
|---|---|
| `…/<output_prefix>.jpg` | A JPEG copy of the (RGB-reduced) source image. |
| `…/output/protein_high_mask_show.png` | RGBA overlay (red) of the strongly DAB-positive pixels (high Otsu level). |
| `…/output/protein_low_mask_show.png` | RGBA overlay (orange) of the weakly DAB-positive pixels (low Otsu level). |
| `…/output/nuclei_blobs.png` | The detected nuclei seed blobs used for segmentation. |
| `…/output/analysis/segmentation_mask.npy` | The integer label image (one id per cell) as a NumPy array; this is the canonical machine-readable segmentation. |
| `…/output/analysis/segmentation_mask_show.png` | RGBA overlay of the cell boundaries (green) for visual QC. |
| `…/output/analysis/cell_data.csv` | The main result: one row per segmented cell (see columns below). |

Additional outputs, written only when `-full_run` is truthy:

| File | Description |
|---|---|
| `…/output/nuclei.jpg` | The deconvolved hematoxylin (nuclei) channel. (Only when `distribution=unknown`.) |
| `…/output/protein.jpg` | The deconvolved DAB (protein) channel. (Only when `distribution=unknown`.) |
| `…/output/analysis/segmentation_mask_binarized_low_show.png` | RGBA overlay (orange) of only the cells called positive at the low threshold. |
| `…/output/analysis/segmentation_mask_binarized_high_show.png` | RGBA overlay (red) of only the cells called positive at the high threshold. |

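
The file layout above can be reproduced with `pathlib`, which is handy when collecting results across many samples. This is a sketch: `output_paths` is a hypothetical helper, and it assumes the second table lists the extras that appear only with a truthy `-full_run`.

```python
from pathlib import Path


def output_paths(data_folder, output_folder, output_prefix,
                 full_run=True, distribution="unknown"):
    """List the files documented for one row."""
    base = Path(data_folder) / output_folder
    images = base / "output"
    analysis = images / "analysis"
    paths = [
        base / f"{output_prefix}.jpg",
        images / "protein_high_mask_show.png",
        images / "protein_low_mask_show.png",
        images / "nuclei_blobs.png",
        analysis / "segmentation_mask.npy",
        analysis / "segmentation_mask_show.png",
        analysis / "cell_data.csv",
    ]
    if full_run:
        if distribution == "unknown":  # deconvolved channels
            paths += [images / "nuclei.jpg", images / "protein.jpg"]
        paths += [
            analysis / "segmentation_mask_binarized_low_show.png",
            analysis / "segmentation_mask_binarized_high_show.png",
        ]
    return paths


paths = output_paths("./data", "ASMTL/Cerebral_cortex_1", "ASMTL")
```
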
Columns of `cell_data.csv`:

| Column | Description |
|---|---|
| `cell_id` | Unique integer label of the cell (matches `segmentation_mask.npy`). |
| `size` | Cell area in pixels. |
| `x`, `y` | Cell centroid coordinates (column, row). |
| `nuclei` | Mean hematoxylin (nuclei) intensity of the cell, 0–255. |
| `nuclei_local_90` | 90th-percentile nuclei intensity over the cell's non-zero pixels. |
| `nuclei_bin_low` | `1` if the cell's `nuclei_local_90` reaches the nuclei threshold, else `0`. |
| `nuclei_bin_high` | `1` if the cell's mean nuclei intensity reaches the nuclei threshold, else `0`. |
| `protein` | Mean DAB (protein) intensity of the cell, 0–255. |
| `protein_local_90` | 90th-percentile protein intensity over the cell's non-zero pixels. |
| `protein_bin_low` | `1` if the cell is protein-positive at the low (lenient, local-90 based) threshold. |
| `protein_bin_high` | `1` if the cell is protein-positive at the high (strict, mean based) threshold. |

The `*_bin_low` / `*_bin_high` flags are the positive/negative calls: `bin_low`
catches cells with a localized positive region, `bin_high` requires the whole
cell to be above threshold.
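
The difference between the two flags is easiest to see on toy data. The sketch below is not HPAseg's implementation: `classify_cell` is a hypothetical function, the percentile uses the nearest-rank definition (other definitions interpolate), "reaches" is read as `>=`, and the pixel values and thresholds are invented.

```python
def percentile_90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


def classify_cell(pixels, low_threshold, high_threshold):
    """Return mean, local-90 and the two flags for one cell's pixels."""
    mean = sum(pixels) / len(pixels)
    nonzero = [p for p in pixels if p > 0]
    local_90 = percentile_90(nonzero) if nonzero else 0
    return {
        "mean": mean,
        "local_90": local_90,
        "bin_low": int(local_90 >= low_threshold),  # a strong region suffices
        "bin_high": int(mean >= high_threshold),    # the whole cell on average
    }


# A cell with a small bright spot on a faint background
spot = classify_cell([10] * 8 + [200] * 2, low_threshold=80, high_threshold=120)
# A uniformly stained cell
uniform = classify_cell([150] * 10, low_threshold=80, high_threshold=120)
```

The spotted cell is positive only under `bin_low`, while the uniformly stained cell is positive under both flags.
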

- Load & deconvolve: read the image, strip alpha, and split into hematoxylin (nuclei) and DAB (protein) channels via `separate_stains`.
- Distribution detection: if `distribution=unknown`, compare the DAB and nuclei Otsu levels to tag the image `weak` / `localized` / `strong`.
- Protein masks: multi-Otsu threshold the normalized DAB channel into `low`/`high` regions and save the overlays.
- Nuclei detection: normalize and clip the hematoxylin channel, Otsu-mask it, smooth, autolevel, then detect nuclei with Laplacian-of-Gaussian blob detection (`blob_log`, sized by `nuclei_size`) to produce seed blobs.
- Cell segmentation: label the blobs and grow them by `nuclei_expansion` pixels (`expand_labels`) to approximate cell bodies.
- Quantification: for each cell, measure mean and local-90 intensity of both channels and derive the `bin_low`/`bin_high` positivity flags; write `cell_data.csv`.
- Binarized overlays (`full_run`): render the positive cells at the low and high thresholds as RGBA overlays.