[3/3] cp_measure features + coordinate alignment for calculate_image_features#1216
Open
timtreis wants to merge 12 commits into
Open
[3/3] cp_measure features + coordinate alignment for calculate_image_features#1216timtreis wants to merge 12 commits into
timtreis wants to merge 12 commits into
Conversation
7bc0f28 to
ef48286
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1216 +/- ##
==========================================
+ Coverage 76.55% 76.84% +0.29%
==========================================
Files 63 63
Lines 9067 9186 +119
Branches 1521 1532 +11
==========================================
+ Hits 6941 7059 +118
+ Misses 1541 1540 -1
- Partials 585 587 +2
🚀 New features to boost your workflow:
|
25df4eb to
0bc0c1f
Compare
Final piece of scverse#982 (split from scverse#1189, scverse#1190): wires cp_measure into calculate_image_features and adds image/labels coordinate-system alignment (align_mode strict/rasterize), with boundary-cell drop accounting. - deps: cp-measure>=0.1.19,<0.2, centrosome>=1.2.3 (hard) - features=None now runs all features across all backends - _tiling: bbox_y0/bbox_x0 on CellInfo; _Accum accumulator; shared yx_size - key_added (was adata_key_added); keyword-only params - skimage backend: single regionprops_table pass for morphology+intensity - dedupe skimage morphology vs cp_measure:sizeshape when both requested (cp wins, computed identically; skimage-only props kept) - purge non-ASCII em/en-dashes from text files repo-wide (ASCII-only policy)
0bc0c1f to
8bed3e7
Compare
cp_measure and skimage featurize per cell on tiny arrays, where NumPy's default multi-threaded OpenBLAS/OpenMP pools (sized to all cores) add pure thread-dispatch overhead with nothing to amortize it, and oversubscribe under parallelism. Wrapping the per-tile featurize body in threadpool_limits(1) cuts the serial cp:texture path ~2.6x end-to-end on an 8-core machine and removes the n_jobs>1 oversubscription slowdown. The clamp lives inside the per-tile function so it also applies in worker processes (a driver-side limit would not reach them), which the upcoming distributed execution engine relies on. Output is bit-identical.
The tiled featurization parallelism uses distributed.LocalCluster (the local "processes" scheduler does not fork for this graph) and clamps worker BLAS/OpenMP via threadpoolctl. Both were only present transitively; declare them. The dask[distributed] extra co-pins distributed to the resolved dask version (they release lockstep).
Extract the per-tile dispatch shared by the tiled featurizers into _run_tiled(specs, process_fn, *, n_jobs, kind, scatter): active-Client-first dispatch, n_jobs->worker resolution, serial fast path (no scheduler when one worker or one tile), streaming collection via as_completed with progress, and safe LocalCluster teardown. The execution kind is parametrized because the callers differ: tiling QC's per-tile work is numba `nogil` (threads scale), while image featurization is GIL-bound (needs processes). Migrate calculate_tiling_qc onto the engine with kind="threads" - behavior-preserving (its tests are unchanged). The duplicated _has_distributed_client probe is lifted into the shared module.
Replace the joblib threads loop in calculate_image_features with _run_tiled (kind="processes"). cp_measure is GIL-bound, so real parallelism needs worker processes via a distributed LocalCluster (an active Client is used if present); the local multiprocessing scheduler does not fork for this graph. n_jobs=1 and single-tile inputs stay serial in-process with zero cluster overhead. The image and labels arrays are passed as scattered arguments (broadcast once) rather than captured, so the backing graph is not re-embedded in every task. Per-tile crops are now materialized with an explicit synchronous scheduler (_materialize): a bare .values inside a distributed worker routes the inner compute back to the scheduler and deadlocks. Drop accounting is reduced from the returned results. Results are bit-identical to serial (verified on zarr-backed skimage and cp_measure runs). Remove the dead joblib/tqdm/time imports.
The n_jobs equivalence tests used the default tile_size, which yields a single tile and never leaves the serial path. Use a small tile_size so n_jobs>1 forks a real LocalCluster, and add TestParallelEngine: an active-Client run over zarr-backed data (scatter of a zarr graph + cp_measure in worker processes) and the align_mode="rasterize" path under a LocalCluster, each asserted bit-identical to serial. These lock the picklability of the scattered closures on the real zarr and rasterize paths, not just synthetic in-memory data.
Expand the n_jobs docstring (active Client else LocalCluster; n_jobs=1 serial; the macOS/Windows __main__ guard requirement and the reuse-your-own-Client tip for scale-out) and add a changelog entry noting the new distributed/threadpoolctl dependencies.
- Don't warn "n_jobs ignored" on the default path: the active-Client branch warned for any n_jobs not in (None, -1), but featurization defaults to 1, so every default call under a user's Client emitted a spurious warning. Exempt 1 (and -1, None) so only an explicit worker count triggers it. - Release each tile future as its result is gathered in _run_on_client, so worker-side results are freed incrementally instead of pinned for the whole batch (the docstring already claimed this; now it is true). - Fix a stale calculate_tiling_qc docstring that referenced dask.compute (the active-Client path now submits via as_completed).
Surfaced by an A-Z run on a real MIBI-TOF FOV:
- Warn when per-channel features are requested on an image whose channels are
positional integers (the default for unlabeled data), pointing users to set
c_coords; per-channel columns would otherwise be index-named.
- Join the channel onto skimage/squidpy feature names with a double underscore
(intensity_mean__dsDNA, summary_mean__dsDNA, texture_contrast__dsDNA) so the
marker is unambiguously separable, matching cp_measure's __ convention.
cp_measure names are left unchanged.
- Rename squidpy:color_hist -> squidpy:histogram (the classic sq.im name);
columns become histogram_bin{n}__{channel}.
- Show a dask ProgressBar on the serial path too (run via the synchronous
scheduler), not just the threaded path.
- obs_names are now the label-image cell IDs (str) instead of cell_<id>.
Feature values are unchanged; this only affects column/obs names, a new warning,
and progress output.
- _run_tiled local path: bind the scatter arrays into the task closure instead of passing them as dask.delayed positional args. A dask collection given to delayed is materialized in full before the call, so the previous form read the whole image for every tile (verified: 100-block image x N tiles) - defeating the lazy per-tile crop. Closing over keeps each task reading only its crop. - Unnamed-channel warning now fires only for the exact positional default ([0, 1, ...]) instead of any all-numeric names, so real numeric panels (e.g. wavelengths "405","488") no longer warn spuriously. - obs_names are now derived from label_id on both the labels and shapes paths (and in calculate_tiling_qc), so obs_names == label_id consistently across the sibling featurizers.
Behaviour-preserving cleanups from review: - Derive _CP_PER_CHANNEL from _CP_CORRELATION_KEYS instead of re-listing the four correlation names (removes a drift hazard). - Inline the single-use _CLIENT_OVERRIDES_NJOBS string. - Trim over-long docstrings/comments (_run_tiled, _materialize, _process_one, the unnamed-channel note) down to intent.
`distributed` is not in the intersphinx mapping, so the :class:/{class}
`dask.distributed.LocalCluster` cross-references failed under nitpicky + -W.
Render the dask/distributed class names as inline literals (matching how the
Client is already written), and point the calculate_tiling_qc reference at its
documented API entry.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Final piece of #982 (split from #1189, #1190). Adds cp_measure features and image/labels coordinate alignment.
New deps (hard)
cp-measure>=0.1.19,<0.2,centrosome>=1.2.3.Usage
features=None-> all features, all backends (needs an image).align_mode="strict"(default) raises on non-pixel-aligned image/labels;"rasterize"resamples onto the image grid.Notes
rasterizeraises (pre-align instead).