Skip to content

Fable review#54

Merged
constantinpape merged 7 commits into
mainfrom
fable-review
Jun 12, 2026
Merged

Fable review#54
constantinpape merged 7 commits into
mainfrom
fable-review

Conversation

@constantinpape

Copy link
Copy Markdown
Contributor

No description provided.

constantinpape and others added 7 commits June 11, 2026 23:55
B2: validate node ids at the UnionFind binding boundary. UnionFind::find/
merge/merge_to index their parent vectors without a bounds check (intentional
for the hot path), but the scalar and array bindings passed user-supplied ids
straight through, so an out-of-range id was undefined behavior / a segfault
from Python. Add check_node() and route the scalar find/merge/merge_to binds
and the merge_edges/find_nodes helpers through it, throwing std::invalid_argument
before releasing the GIL.

B1: route the mutex-watershed and semantic-mutex-watershed neighbor computation
through detail::valid_offset_target instead of an unchecked flat-index shift.
This bounds-checks the neighbor per axis, so the kernel is memory-safe and
rejects row/plane wrap-around independently of the caller's valid_edges mask.
Drops the now-unused offset_strides precompute.

Add regression tests for the new UnionFind validation and for the existing
binding-layer point-coordinate validation in non_maximum_distance_suppression.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
B3: argsort_by_first broke ties on the primary key nondeterministically
(std::sort is unstable). Add a secondary tie-break by original index so the
ordering is a total order. This makes the label-multiset downsample argmax and
restrict_set truncation deterministic on count ties (smaller id wins).

B4: GreedyAdditiveMulticutProposalGenerator derived its per-call seed as
seed_ + call_count_, which collided with the fusion-move driver's per-slot
seed + slot offset (different (slot, iteration) cells computed identical
proposals). Mix the packed (seed, counter) pair through a SplitMix64 finalizer
to remove the linearity.

B7: WatershedProposalGenerator::reset re-seeded the RNG but left the cached
normal-distribution deviate, so a reset generator was not identical to a fresh
one. Add noise_.reset().

flow: round_to_flat_index used std::nearbyint (FP-rounding-mode dependent,
half-to-even); switch to floor(x + 0.5) to match the half-up convention used
in affine.hxx and watershed.hxx.

Add a determinism regression test for the downsample count-tie case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`_bp_##__LINE__` does not expand __LINE__ (token-paste suppresses operand
expansion), so every BIOIMAGE_PROFILE_SCOPE declared the same identifier
`_bp___LINE__`. It compiled only because each call site happened to wrap its
scope in its own block; two scopes in one block (the documented per-phase
pattern) would be a redefinition error in BIOIMAGE_PROFILE=ON builds. Add the
standard two-level concat indirection so __LINE__ is expanded first.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ulation

I1: GreedyAdditiveWorkspace::reset called heap.reset_capacity, and so did
detail::initialize_dynamic_graph which every caller runs immediately after.
The second call re-wiped the heap locator vector (O(E)) on every solve — hot
under fusion-move (per fuse x slot x iteration). Drop it from reset() in both
the multicut and lifted-multicut workspaces; initialize_dynamic_graph owns it.

I3: BfsWorkspace::reset cleared the visited/distance buffers (O(N) over the
whole graph) on every call, making the "k-hop neighborhood from every node"
pattern O(N^2) of memset. Replace the boolean visited buffer with a uint32
generation stamp: reset just increments the generation (O(1)), clearing only on
the rare wraparound. distance_ is sized but written before read.

I4: accumulate_labels used an n_threads x n_nodes map-of-maps plus a per-node
merge map. Switch to one combined-key (node, other) histogram per thread,
merged once, with a single-pass per-node argmax. Far fewer container
allocations; the argmax/tie-break/empty-node semantics are unchanged (covered
by the existing accumulate_labels tests, incl. parallel-vs-single and nifty
cross-checks).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
R1: three byte-identical number_of_pixels helpers (label_accumulation,
feature_accumulation, node_label_projection) — the feature_accumulation copy
was dead. Add detail::number_of_elements to detail/grid.hxx and route the two
live call sites through it; drop the dead copy.

R9: drop the redundant nb::cast wrappers inside nb::make_tuple in
relabel_sequential_t (make_tuple already casts its arguments).

Docs/naming:
- AGENTS.md: correct the union-find path (util/union_find.hxx, namespace
  bioimage_cpp::util, not detail/) and list number_of_elements under grid.hxx.
- agglomeration/detail.hxx: the rekey comment described the opposite of the
  code; reword to explain that a kRejectEdge'd edge is deliberately left out of
  the heap.
- filters: the public kwarg was window_ratio while the Python API documents
  window_size; align the nb::arg keyword and the validation message to
  window_size (calls are positional, so behavior is unchanged).
- feature_accumulation: note that float feature values are bit-reproducible
  only for a fixed thread count.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A/B benchmarks (pre-review build vs the review branch) showed two of the
review's changes were net-negative, so they are reverted:

I4 (label accumulation): the combined-key (node, other) histogram was 3-15x
SLOWER than the per-node map-of-maps and scaled negatively with threads (a giant
hashmap probed once per pixel, plus a serial argmax). Restore the per-node
map-of-maps with the parallel merge. The detail::number_of_elements dedup (R1)
is kept. Tests confirm identical output.

B1 (mutex watershed): routing the neighbor through detail::valid_offset_target
(per-axis bounds check) cost ~17% on 3D in the hot loop. The public API already
guarantees valid_edges excludes out-of-bounds / wrapping edges, so restore the
single precomputed offset-stride add and document that precondition instead.

B2 (UnionFind bindings): keep the safety bounds check but replace the
per-element validation loop with a single std::max_element scan, cutting bulk
find overhead from ~1.35x to ~1.1x. The residual cost is the unavoidable extra
read of the id array; it only affects the non-hot Python convenience API.

BFS (I3) stays 4.5x faster; greedy-additive (I1) unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, mutex-ws

Standalone (no nifty/affogato) benchmarks for the areas touched by the code
review that lacked timing coverage, used for the A/B perf validation:
- utils/benchmark_union_find.py: bulk merge/find (B2 bounds-check cost)
- graph/benchmark_accumulate_labels.py: accumulate_labels across 1/4/8 threads
- graph/benchmark_bfs.py: lifted_edges_from_node_labels (workspace-reused BFS, I3)
- segmentation/benchmark_mutex_watershed.py: mutex_watershed grid kernel (B1)

Each follows the warmup + median/min pattern and exposes run() for a driver.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@constantinpape constantinpape merged commit 8463dcc into main Jun 12, 2026
6 checks passed
@constantinpape constantinpape deleted the fable-review branch June 12, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant