CellProfiling/hpaseg

HPAseg

HPAseg is a batch segmentation and single-cell quantification tool for HPA-style brightfield IHC images (RGB images stained with hematoxylin for nuclei and DAB for the target protein). For every image it:

  1. Separates the hematoxylin and DAB stains (color deconvolution).
  2. Detects and segments nuclei into individual cells.
  3. Measures the nuclei and protein intensity of each cell.
  4. Classifies every cell as protein-positive or negative (two thresholds: low and high).
  5. Writes per-cell measurements (cell_data.csv) and a set of visualization overlays.

It reads a batch description from batch.csv, processes each row independently (a failure on one image is logged and skipped, not fatal), and writes all results under a per-sample output folder.
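As a rough illustration of the "logged and skipped, not fatal" behavior, a batch loop could look like the sketch below. The names `run_batch` and `fake_process` are hypothetical and are not taken from hpaseg.py; the sketch only shows the error-isolation pattern.

```python
def run_batch(rows, process_row, log=print):
    """Process every row; log and skip failures instead of aborting."""
    failed = []
    for index, row in enumerate(rows, start=1):
        try:
            process_row(row)
        except Exception as error:  # one bad image must not stop the batch
            log(f"row {index} failed ({row.get('image_uri')}): {error!r}")
            failed.append(index)
    return failed


def fake_process(row):
    # Stand-in for the real per-image analysis (hypothetical).
    if row["image_uri"].endswith("broken.jpg"):
        raise IOError("cannot read image")


rows = [{"image_uri": "a.jpg"}, {"image_uri": "broken.jpg"}, {"image_uri": "c.jpg"}]
messages = []
failed = run_batch(rows, fake_process, log=messages.append)
```

Here the second row fails, its error is recorded in `messages`, and the third row is still processed.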

Requirements

  • Python 3.12 (the project and Docker image are pinned to 3.12.13).
  • The Python packages in requirements.txt:
    • numpy==1.26.4
    • pandas==2.1.4
    • Pillow==10.4.0
    • scikit-image==0.22.0
    • scipy==1.11.4

Installation

Local virtual environment

cd /path/to/hpaseg  # your local clone of the repository
python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Docker

A Dockerfile and docker-compose.yml are provided. The image is based on python:3.12.13-slim and installs requirements.txt at build time.

docker compose build

See Running the code → Docker below for how to launch it.

Setup

HPAseg reads the list of images to process from batch.csv. Each row is one image. The columns, in order, are:

image_uri,type,distribution,output_folder,output_prefix,parameters
| Column | Required | Meaning |
| --- | --- | --- |
| image_uri | yes | Path or URL of the source RGB image. Both local paths and https://… URLs work (read with skimage.io.imread). Images with an alpha channel are reduced to RGB. |
| type | yes | The protein localization type (e.g. cytoplasm, nuclei, membrane). Passed through to the analysis as protein_type. |
| distribution | yes | Staining distribution hint. Use unknown to let HPAseg detect it automatically, or set it explicitly (see Distribution values below). |
| output_folder | yes | Sub-path (relative to the data folder) where this sample's results are written. May contain sub-directories, e.g. ASMTL/Cerebral_cortex_1. |
| output_prefix | yes | Name used for the saved copy of the source image (<output_prefix>.jpg). |
| parameters | no | Per-row tuning, as a key=value;key=value string (see The parameters column). Leave empty to use defaults. |

Example batch.csv:

image_uri,type,distribution,output_folder,output_prefix,parameters
https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15
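Before launching a long run it can help to check that batch.csv has the expected shape. The following is a minimal sketch using only the standard library; `read_batch` and `REQUIRED` are hypothetical names, and HPAseg's own reader may behave differently (for example in how it reports missing values).

```python
import csv
import io

# Columns marked as required in the table above.
REQUIRED = ("image_uri", "type", "distribution", "output_folder", "output_prefix")


def read_batch(fileobj):
    """Return the rows of a batch.csv, raising if a required value is empty."""
    rows = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(fileobj), start=2):
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"batch.csv line {line_no}: missing {missing}")
        row["parameters"] = (row.get("parameters") or "").strip()
        rows.append(row)
    return rows


sample = io.StringIO(
    "image_uri,type,distribution,output_folder,output_prefix,parameters\n"
    "https://images.proteinatlas.org/3630/11417_B_7_5.jpg,cytoplasm,unknown,"
    "ASMTL/Cerebral_cortex_1,ASMTL,nuclei_size=15;nuclei_expansion=15\n"
)
rows = read_batch(sample)
```

In practice `sample` would be replaced by `open("batch.csv", newline="")`.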

Parameters

There are three places where behavior can be tuned: the per-row parameters column, command-line options, and module-level constants.

The parameters column

A ;-separated list of key=value pairs. Only the following two keys are used; any other key is ignored with a warning printed to the log (>>> ignoring unsupported parameters: …):

| Key | Type | Default | Effect |
| --- | --- | --- | --- |
| nuclei_size | int | 6 | Approximate nuclei radius used to size the blob detector (blob_log) and the autolevel footprint. Larger values detect larger / more separated nuclei. |
| nuclei_expansion | int | 20 | How many pixels each nucleus label is grown to approximate the full cell (expand_labels). Set to a negative value to disable expansion (cells = nuclei only). |

Note: keys such as nuclei_definition or membrane_diameter are not supported. The analysis ignores them, so in a batch file they serve only to document intent.
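The parsing rule described above can be sketched as follows. This is an illustration, not the code in hpaseg.py: `parse_parameters` is a hypothetical name, and the text printed after the `>>> ignoring unsupported parameters:` prefix is an assumption, since the source elides it.

```python
# Supported keys and their documented defaults.
DEFAULTS = {"nuclei_size": 6, "nuclei_expansion": 20}


def parse_parameters(text):
    """Parse 'key=value;key=value' into a dict, collecting unsupported items."""
    params = dict(DEFAULTS)
    ignored = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # empty column or trailing ';'
        key, sep, value = item.partition("=")
        key = key.strip()
        if sep and key in DEFAULTS:
            params[key] = int(value)  # both supported keys are integers
        else:
            ignored.append(item)
    if ignored:
        print(">>> ignoring unsupported parameters:", ", ".join(ignored))
    return params, ignored


params, ignored = parse_parameters("nuclei_size=15;membrane_diameter=3")
```

With this input, `nuclei_size` is overridden, `nuclei_expansion` keeps its default, and `membrane_diameter=3` is reported as ignored.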

Command-line options

Passed to hpaseg.py as -key=value arguments:

| Option | Default | Effect |
| --- | --- | --- |
| -data=<path> | ./data | Root folder under which each row's output_folder is created. |
| -full_run=<b> | True | When truthy (1/true/yes), also writes the extra QC/visualization images (deconvolved channels and the binarized overlay masks). When false, only the core outputs and cell_data.csv are produced, which is faster and writes fewer files. |

Example:

python -u hpaseg.py -data=./dataValidation -full_run=true
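For readers wrapping HPAseg in their own scripts, the `-key=value` convention can be mimicked as below. This is a sketch under the assumptions stated in the table (defaults `./data` and `True`, truthy values 1/true/yes); `parse_cli` is a hypothetical helper and may differ from how hpaseg.py actually reads `sys.argv`.

```python
def parse_cli(argv):
    """Parse -key=value arguments into an options dict with documented defaults."""
    options = {"data": "./data", "full_run": True}
    for arg in argv:
        if not arg.startswith("-") or "=" not in arg:
            continue  # not a -key=value argument
        key, _, value = arg.lstrip("-").partition("=")
        if key == "data":
            options["data"] = value
        elif key == "full_run":
            options["full_run"] = value.strip().lower() in ("1", "true", "yes")
    return options


options = parse_cli(["-data=./dataValidation", "-full_run=true"])
```

Calling `parse_cli(sys.argv[1:])` would apply the same rules to the real command line.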

Module-level constants

These are not exposed on the command line; edit the source to change them.

In hpaseg.py:

| Constant | Default | Meaning |
| --- | --- | --- |
| default_blob_reg_size | 6 | Fallback for nuclei_size when a row doesn't specify it. |
| distribution_weak_factor | 3.0 | Sensitivity of the automatic weak classification (higher → fewer images called weak). |
| distribution_localized_factor | 5.0 | Sensitivity of the automatic localized classification. |
| data_folder | ./data | Default data root (overridden by -data=). |

In segmentation_lite.py:

| Constant | Default | Meaning |
| --- | --- | --- |
| nuclei_expansion | 20 | Default cell expansion (overridden per row by the parameters column). |
| bin_percentage | 100.0 | Scales the positive/negative threshold derived from the protein Otsu levels (lower → more permissive positivity calls). |
| nuclei_marker | nuclei | Marker name used internally for the nuclei channel. |

Distribution values

The distribution column controls how aggressively the protein signal is thresholded. If set to unknown, HPAseg compares the multi-Otsu levels of the DAB channel with those of the hematoxylin channel and, depending on the result, appends zero or more of the following tags:

| Detected tag | When | Effect on thresholding |
| --- | --- | --- |
| (none) | DAB levels comparable to nuclei | Standard 3-level Otsu (low/high = levels 1/2). |
| weak | DAB levels sit below nuclei levels but mean signal is non-trivial | Finer 5-level Otsu (low/high = levels 2/4). |
| localized | The high-signal area is a small fraction of the low-signal area (added on top of weak) | Shifts the chosen levels upward to isolate the few strong spots. |
| strong | DAB levels sit above nuclei levels (marker dominates) | Standard 3-level Otsu. |

You can also set these tags manually in the CSV (for example weak, or the combination weak,localized) to skip auto-detection for that row.

Running the code

Local

Activate the virtual environment, then run hpaseg.py (it drives everything, including segmentation_lite.py):

source .venv/bin/activate
python -u hpaseg.py -data=./data -full_run=true

Process behavior:

  • A lock file RUNNING (containing the PID) is created on start and removed on exit. If RUNNING already exists, HPAseg refuses to start ("Another HPAseg process seems to be running"). Delete it manually if a previous run crashed.
  • If run_id.txt exists, its value is used as the run id; otherwise a random id is generated. The final id is written to LAST_RUN_ID at the end.
  • Each row is processed in its own try/except; an error on one image is logged and the batch continues.
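The lock-file and run-id behavior described above can be sketched with the standard library. This is an illustration only: `run_lock` and `resolve_run_id` are hypothetical names, the use of a UUID for the random id is an assumption (the real id format is not documented here), and the demo runs in a temporary folder rather than the project root.

```python
import os
import tempfile
import uuid
from contextlib import contextmanager


@contextmanager
def run_lock(path="RUNNING"):
    """Create a PID lock file on entry and remove it on exit."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("Another HPAseg process seems to be running") from None
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield path
    finally:
        os.remove(path)


def resolve_run_id(work_dir):
    """Use run_id.txt if present, otherwise generate a random id."""
    path = os.path.join(work_dir, "run_id.txt")
    if os.path.exists(path):
        with open(path) as handle:
            return handle.read().strip()
    return uuid.uuid4().hex  # assumption: any random id will do


with tempfile.TemporaryDirectory() as work:
    lock_path = os.path.join(work, "RUNNING")
    with run_lock(lock_path):
        held = os.path.exists(lock_path)
        try:
            with run_lock(lock_path):
                pass
            second_start_refused = False
        except RuntimeError:
            second_start_refused = True
        run_id = resolve_run_id(work)
        with open(os.path.join(work, "LAST_RUN_ID"), "w") as handle:
            handle.write(run_id)
    released = not os.path.exists(lock_path)
```

Because the lock is removed in a `finally` clause, it only lingers when the process is killed outright, which is the case where manual deletion is needed.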

Docker

docker-compose.yml mounts a host work directory (set through the HPASEG_WORK environment variable, read from a .env file) into /opt/hpaseg/work and runs with -data=/opt/hpaseg/work/. When HPASEG_WORK is set, HPAseg reads ./work/batch.csv and keeps its RUNNING, run_id.txt and LAST_RUN_ID files in ./work/ instead of the project root. A log.txt is captured in the same work folder.

# .env must define HPASEG_WORK=/abs/path/to/work  (containing batch.csv)
docker compose up

Output / Results

Everything for a row is written under <data_folder>/<output_folder>/, shown below as …/. Its output/ subfolder holds the images and overlays, and output/analysis/ holds the segmentation and quantification results.

Always produced

| File | Description |
| --- | --- |
| …/<output_prefix>.jpg | A JPEG copy of the (RGB-reduced) source image. |
| …/output/protein_high_mask_show.png | RGBA overlay (red) of the strongly DAB-positive pixels (high Otsu level). |
| …/output/protein_low_mask_show.png | RGBA overlay (orange) of the weakly DAB-positive pixels (low Otsu level). |
| …/output/nuclei_blobs.png | The detected nuclei seed blobs used for segmentation. |
| …/output/analysis/segmentation_mask.npy | The integer label image (one id per cell) as a NumPy array; this is the canonical machine-readable segmentation. |
| …/output/analysis/segmentation_mask_show.png | RGBA overlay of the cell boundaries (green) for visual QC. |
| …/output/analysis/cell_data.csv | The main result: one row per segmented cell (see columns below). |

Only when full_run=true

| File | Description |
| --- | --- |
| …/output/nuclei.jpg | The deconvolved hematoxylin (nuclei) channel. (Only when distribution=unknown.) |
| …/output/protein.jpg | The deconvolved DAB (protein) channel. (Only when distribution=unknown.) |
| …/output/analysis/segmentation_mask_binarized_low_show.png | RGBA overlay (orange) of only the cells called positive at the low threshold. |
| …/output/analysis/segmentation_mask_binarized_high_show.png | RGBA overlay (red) of only the cells called positive at the high threshold. |
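When checking a finished run, it can be convenient to list the files a row should have produced. The helper below is a hypothetical sketch built only from the two tables above; it assumes the distribution value is compared literally against unknown, which may not match every detail of HPAseg's behavior.

```python
from pathlib import PurePosixPath


def expected_outputs(data_folder, output_folder, output_prefix,
                     full_run=True, distribution="unknown"):
    """List the output files documented for one batch row."""
    base = PurePosixPath(data_folder) / output_folder
    out = base / "output"
    analysis = out / "analysis"
    files = [
        base / f"{output_prefix}.jpg",
        out / "protein_high_mask_show.png",
        out / "protein_low_mask_show.png",
        out / "nuclei_blobs.png",
        analysis / "segmentation_mask.npy",
        analysis / "segmentation_mask_show.png",
        analysis / "cell_data.csv",
    ]
    if full_run:
        if distribution == "unknown":
            # Deconvolved channels are only saved during auto-detection.
            files += [out / "nuclei.jpg", out / "protein.jpg"]
        files += [
            analysis / "segmentation_mask_binarized_low_show.png",
            analysis / "segmentation_mask_binarized_high_show.png",
        ]
    return [str(path) for path in files]


outputs = expected_outputs("./data", "ASMTL/Cerebral_cortex_1", "ASMTL")
```

Comparing this list against the files on disk gives a quick completeness check for each row.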

cell_data.csv columns

| Column | Description |
| --- | --- |
| cell_id | Unique integer label of the cell (matches segmentation_mask.npy). |
| size | Cell area in pixels. |
| x, y | Cell centroid coordinates (column, row). |
| nuclei | Mean hematoxylin (nuclei) intensity of the cell, 0–255. |
| nuclei_local_90 | 90th-percentile nuclei intensity over the cell's non-zero pixels. |
| nuclei_bin_low | 1 if the cell's nuclei_local_90 reaches the nuclei threshold, else 0. |
| nuclei_bin_high | 1 if the cell's mean nuclei intensity reaches the nuclei threshold, else 0. |
| protein | Mean DAB (protein) intensity of the cell, 0–255. |
| protein_local_90 | 90th-percentile protein intensity over the cell's non-zero pixels. |
| protein_bin_low | 1 if the cell is protein-positive at the low (lenient, local-90 based) threshold. |
| protein_bin_high | 1 if the cell is protein-positive at the high (strict, mean based) threshold. |

The *_bin_low / *_bin_high flags are the positive/negative calls: bin_low catches cells with a localized positive region, bin_high requires the whole cell to be above threshold.
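The difference between the two flags is easiest to see on a toy example. The sketch below is not the code in segmentation_lite.py: `quantify_cell` is a hypothetical function, and it assumes that the mean is taken over all pixels of the cell, that local-90 is taken over the non-zero pixels only, and that both flags are compared against the same threshold value.

```python
import numpy as np


def quantify_cell(labels, channel, cell_id, threshold):
    """Measure one cell of a label image on one intensity channel."""
    mask = labels == cell_id
    rows, cols = np.nonzero(mask)
    pixels = channel[mask].astype(float)
    nonzero = pixels[pixels > 0]
    mean = float(pixels.mean())
    local_90 = float(np.percentile(nonzero, 90)) if nonzero.size else 0.0
    return {
        "cell_id": cell_id,
        "size": int(pixels.size),
        "x": float(cols.mean()),  # centroid column
        "y": float(rows.mean()),  # centroid row
        "mean": mean,
        "local_90": local_90,
        "bin_low": int(local_90 >= threshold),  # lenient: local-90 based
        "bin_high": int(mean >= threshold),     # strict: mean based
    }


labels = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2]])
protein = np.array([[0, 0, 200, 200],
                    [0, 200, 200, 200]])
cell_1 = quantify_cell(labels, protein, 1, threshold=100)
cell_2 = quantify_cell(labels, protein, 2, threshold=100)
```

Cell 1 has a single bright pixel, so it is positive at the low threshold only; cell 2 is bright throughout and is positive at both.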

How it works (pipeline summary)

  1. Load & deconvolve — read the image, strip alpha, and split into hematoxylin (nuclei) and DAB (protein) channels via separate_stains.
  2. Distribution detection — if distribution=unknown, compare the DAB and nuclei Otsu levels to tag the image weak / localized / strong.
  3. Protein masks — multi-Otsu threshold the normalized DAB channel into low/high regions and save the overlays.
  4. Nuclei detection — normalize and clip the hematoxylin channel, Otsu-mask it, smooth, autolevel, then detect nuclei with Laplacian-of-Gaussian blob detection (blob_log, sized by nuclei_size) to produce seed blobs.
  5. Cell segmentation — label the blobs and grow them by nuclei_expansion pixels (expand_labels) to approximate cell bodies.
  6. Quantification — for each cell, measure mean and local-90 intensity of both channels and derive the bin_low/bin_high positivity flags; write cell_data.csv.
  7. Binarized overlays (full_run) — render the positive cells at the low and high thresholds as RGBA overlays.
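Step 1 relies on separate_stains. To show what color deconvolution computes, here is a NumPy-only sketch of the idea: convert RGB to optical density and project it onto stain vectors. It assumes the hematoxylin/eosin/DAB stain matrix that scikit-image ships as rgb_from_hed; whether HPAseg uses this exact matrix is not stated in this README, and the scaling here differs from scikit-image's implementation. `deconvolve` is a hypothetical name.

```python
import numpy as np

# Rows are the optical-density vectors of hematoxylin, eosin and DAB
# (assumed stain matrix, same values as skimage.color.rgb_from_hed).
RGB_FROM_HED = np.array([[0.65, 0.70, 0.29],
                         [0.07, 0.99, 0.11],
                         [0.27, 0.57, 0.78]])
HED_FROM_RGB = np.linalg.inv(RGB_FROM_HED)


def deconvolve(rgb):
    """Return per-pixel stain amounts (H, E, DAB) for an 8-bit-scale image."""
    rgb = np.asarray(rgb, dtype=float)[..., :3]  # drop an alpha channel if present
    # Beer-Lambert: optical density is the negative log of transmitted light.
    optical_density = -np.log(np.maximum(rgb, 1.0) / 255.0)
    stains = optical_density @ HED_FROM_RGB
    return np.maximum(stains, 0.0)  # negative amounts are not physical


# One pixel stained with hematoxylin only, next to one white background pixel.
pure_h = 255.0 * np.exp(-RGB_FROM_HED[0])
image = np.stack([pure_h, np.full(3, 255.0)])[np.newaxis]
stains = deconvolve(image)
hematoxylin, dab = stains[..., 0], stains[..., 2]
```

The hematoxylin-only pixel yields one unit of hematoxylin and no DAB, while the white pixel yields no stain at all; the later steps then threshold and segment these two channels.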
