Fable review by constantinpape · Pull Request #54 · computational-cell-analytics/bioimage-cpp

constantinpape · 2026-06-12T04:53:53Z

No description provided.

B2: validate node ids at the UnionFind binding boundary. UnionFind::find/ merge/merge_to index their parent vectors without a bounds check (intentional for the hot path), but the scalar and array bindings passed user-supplied ids straight through, so an out-of-range id was undefined behavior / a segfault from Python. Add check_node() and route the scalar find/merge/merge_to binds and the merge_edges/find_nodes helpers through it, throwing std::invalid_argument before releasing the GIL. B1: route the mutex-watershed and semantic-mutex-watershed neighbor computation through detail::valid_offset_target instead of an unchecked flat-index shift. This bounds-checks the neighbor per axis, so the kernel is memory-safe and rejects row/plane wrap-around independently of the caller's valid_edges mask. Drops the now-unused offset_strides precompute. Add regression tests for the new UnionFind validation and for the existing binding-layer point-coordinate validation in non_maximum_distance_suppression. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

B3: argsort_by_first broke ties on the primary key nondeterministically (std::sort is unstable). Add a secondary tie-break by original index so the ordering is a total order. This makes the label-multiset downsample argmax and restrict_set truncation deterministic on count ties (smaller id wins). B4: GreedyAdditiveMulticutProposalGenerator derived its per-call seed as seed_ + call_count_, which collided with the fusion-move driver's per-slot seed + slot offset (different (slot, iteration) cells computed identical proposals). Mix the packed (seed, counter) pair through a SplitMix64 finalizer to remove the linearity. B7: WatershedProposalGenerator::reset re-seeded the RNG but left the cached normal-distribution deviate, so a reset generator was not identical to a fresh one. Add noise_.reset(). flow: round_to_flat_index used std::nearbyint (FP-rounding-mode dependent, half-to-even); switch to floor(x + 0.5) to match the half-up convention used in affine.hxx and watershed.hxx. Add a determinism regression test for the downsample count-tie case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

`_bp_##__LINE__` does not expand __LINE__ (token-paste suppresses operand expansion), so every BIOIMAGE_PROFILE_SCOPE declared the same identifier `_bp___LINE__`. It compiled only because each call site happened to wrap its scope in its own block; two scopes in one block (the documented per-phase pattern) would be a redefinition error in BIOIMAGE_PROFILE=ON builds. Add the standard two-level concat indirection so __LINE__ is expanded first. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ulation I1: GreedyAdditiveWorkspace::reset called heap.reset_capacity, and so did detail::initialize_dynamic_graph which every caller runs immediately after. The second call re-wiped the heap locator vector (O(E)) on every solve — hot under fusion-move (per fuse x slot x iteration). Drop it from reset() in both the multicut and lifted-multicut workspaces; initialize_dynamic_graph owns it. I3: BfsWorkspace::reset cleared the visited/distance buffers (O(N) over the whole graph) on every call, making the "k-hop neighborhood from every node" pattern O(N^2) of memset. Replace the boolean visited buffer with a uint32 generation stamp: reset just increments the generation (O(1)), clearing only on the rare wraparound. distance_ is sized but written before read. I4: accumulate_labels used an n_threads x n_nodes map-of-maps plus a per-node merge map. Switch to one combined-key (node, other) histogram per thread, merged once, with a single-pass per-node argmax. Far fewer container allocations; the argmax/tie-break/empty-node semantics are unchanged (covered by the existing accumulate_labels tests, incl. parallel-vs-single and nifty cross-checks). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

R1: three byte-identical number_of_pixels helpers (label_accumulation, feature_accumulation, node_label_projection) — the feature_accumulation copy was dead. Add detail::number_of_elements to detail/grid.hxx and route the two live call sites through it; drop the dead copy. R9: drop the redundant nb::cast wrappers inside nb::make_tuple in relabel_sequential_t (make_tuple already casts its arguments). Docs/naming: - AGENTS.md: correct the union-find path (util/union_find.hxx, namespace bioimage_cpp::util, not detail/) and list number_of_elements under grid.hxx. - agglomeration/detail.hxx: the rekey comment described the opposite of the code; reword to explain that a kRejectEdge'd edge is deliberately left out of the heap. - filters: the public kwarg was window_ratio while the Python API documents window_size; align the nb::arg keyword and the validation message to window_size (calls are positional, so behavior is unchanged). - feature_accumulation: note that float feature values are bit-reproducible only for a fixed thread count. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A/B benchmarks (pre-review build vs the review branch) showed two of the review's changes were net-negative, so they are reverted: I4 (label accumulation): the combined-key (node, other) histogram was 3-15x SLOWER than the per-node map-of-maps and scaled negatively with threads (a giant hashmap probed once per pixel, plus a serial argmax). Restore the per-node map-of-maps with the parallel merge. The detail::number_of_elements dedup (R1) is kept. Tests confirm identical output. B1 (mutex watershed): routing the neighbor through detail::valid_offset_target (per-axis bounds check) cost ~17% on 3D in the hot loop. The public API already guarantees valid_edges excludes out-of-bounds / wrapping edges, so restore the single precomputed offset-stride add and document that precondition instead. B2 (UnionFind bindings): keep the safety bounds check but replace the per-element validation loop with a single std::max_element scan, cutting bulk find overhead from ~1.35x to ~1.1x. The residual cost is the unavoidable extra read of the id array; it only affects the non-hot Python convenience API. BFS (I3) stays 4.5x faster; greedy-additive (I1) unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…, mutex-ws Standalone (no nifty/affogato) benchmarks for the areas touched by the code review that lacked timing coverage, used for the A/B perf validation: - utils/benchmark_union_find.py: bulk merge/find (B2 bounds-check cost) - graph/benchmark_accumulate_labels.py: accumulate_labels across 1/4/8 threads - graph/benchmark_bfs.py: lifted_edges_from_node_labels (workspace-reused BFS, I3) - segmentation/benchmark_mutex_watershed.py: mutex_watershed grid kernel (B1) Each follows the warmup + median/min pattern and exposes run() for a driver. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

constantinpape and others added 7 commits June 11, 2026 23:55

constantinpape merged commit 8463dcc into main Jun 12, 2026
6 checks passed

constantinpape deleted the fable-review branch June 12, 2026 05:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fable review#54

Fable review#54
constantinpape merged 7 commits into
mainfrom
fable-review

constantinpape commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

constantinpape commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant