Implement get_unit_spike_trains and performance improvements#4502

Open
alejoe91 wants to merge 24 commits intoSpikeInterface:mainfrom
alejoe91:get-unit-spike-trains

Conversation

@alejoe91
Member

@alejoe91 alejoe91 commented Apr 9, 2026

  • expose and propagate use_cache (to get_unit_spike_train_in_seconds)
  • fix wrong check in to_reordered_spike_vector
  • avoid lexsort when not needed in select_units

@grahamfindlay

TODO

  • Implement numpy/numba get_unit_spike_trains for PhyKilosortSortingExtractor

(maybe in follow up)

@alejoe91 alejoe91 requested a review from chrishalcrow April 9, 2026 15:41
@alejoe91 alejoe91 added core Changes to core module performance Performance issues/improvements labels Apr 9, 2026
alejoe91 and others added 8 commits April 9, 2026 17:58
- Drop unused `return_times` parameter from get_unit_spike_trains_in_seconds
- Clean up stale/truncated docstrings on get_unit_spike_train_in_seconds,
  get_unit_spike_trains, and get_unit_spike_trains_in_seconds
- Fix UnitsSelectionSortingSegment.get_unit_spike_trains to re-key the
  returned dict with child unit ids (was returning parent-keyed dict,
  breaking whenever renamed_unit_ids differ from parent ids)
- Fix test_get_unit_spike_trains: drop unused return_times kwarg, remove
  unused local variable, fix assertion.
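The re-keying fix described above can be illustrated with a minimal sketch. The helper name `rekey_spike_trains` and the toy ids are hypothetical, not the PR's code; the only assumption is that the i-th parent id of the selection corresponds to the i-th child (renamed) id.

```python
import numpy as np


def rekey_spike_trains(parent_trains, parent_unit_ids, child_unit_ids):
    """Return spike trains keyed by child ids instead of parent ids.

    Hypothetical helper: parent_unit_ids[i] is assumed to be the parent id
    that was renamed to child_unit_ids[i].
    """
    return {
        child_id: parent_trains[parent_id]
        for parent_id, child_id in zip(parent_unit_ids, child_unit_ids)
    }


# Toy parent sorting with three units (sample indices).
parent_trains = {"a": np.array([3, 10]), "b": np.array([5]), "c": np.array([7, 8])}

# Select units "c" and "a", renamed to "u0" and "u1".
child_trains = rekey_spike_trains(
    parent_trains, parent_unit_ids=["c", "a"], child_unit_ids=["u0", "u1"]
)
```

Returning `parent_trains` restricted to `"c"` and `"a"` would look correct as long as the ids are not renamed, which is presumably why the bug went unnoticed.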
The previous check `np.diff(self.ids_to_indices(self._renamed_unit_ids)).min() < 0`
was never `True`, because `ids_to_indices(self._renamed_unit_ids)` on a USS
always returns `[0, 1, ..., k-1]` (since `_main_ids == _renamed_unit_ids`), so the
diff was always positive and the lexsort branch was unreachable. Therefore the
cached spike vector was wrong whenever two units had co-temporal spikes and the
selection reordered them relative to the parent.

Replaced with a two-step check that attempts to avoid unnecessary lexsorts:
  1. O(k) `_is_order_preserving_selection()` -- Checks if USS `._unit_ids` is
     in the same relative order as in the parent. When True, the remapped vector
     is guaranteed sorted (boolean filtering preserves order; the remap only
     relabels unit_index values). This is the common case via `select_units()`
     with a boolean mask.
  2. O(n) `_is_spike_vector_sorted()` -- Checks if the remapped vector is still
     sorted by (segment, sample, unit). Catches the case where the selection is
     not order-preserving but no co-temporal (same exact sample) spikes exist.

Falls back to the original O(n log n) lexsort only when both checks fail.
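A rough sketch of the two-step check, written as standalone functions rather than the PR's private methods. The function names, the structured dtype (modeled on a spike vector with `sample_index`, `unit_index`, and `segment_index` fields), and the toy data are assumptions for illustration.

```python
import numpy as np

# Assumed spike-vector layout for this sketch.
spike_dtype = [("sample_index", "int64"), ("unit_index", "int64"), ("segment_index", "int64")]


def is_order_preserving_selection(parent_unit_ids, selected_unit_ids):
    """O(k): True if the selected ids appear in the same relative order as in the parent."""
    position = {uid: i for i, uid in enumerate(parent_unit_ids)}
    parent_positions = np.array([position[uid] for uid in selected_unit_ids])
    return bool(np.all(np.diff(parent_positions) > 0))


def is_spike_vector_sorted(spikes):
    """O(n): True if spikes are sorted by (segment, sample, unit)."""
    if len(spikes) < 2:
        return True
    d_seg = np.diff(spikes["segment_index"])
    d_smp = np.diff(spikes["sample_index"])
    d_unit = np.diff(spikes["unit_index"])
    ok = (d_seg > 0) | ((d_seg == 0) & ((d_smp > 0) | ((d_smp == 0) & (d_unit >= 0))))
    return bool(np.all(ok))


def sort_if_needed(spikes, parent_unit_ids, selected_unit_ids):
    """Run the cheap checks first; lexsort only when both fail."""
    if is_order_preserving_selection(parent_unit_ids, selected_unit_ids):
        return spikes
    if is_spike_vector_sorted(spikes):
        return spikes
    # np.lexsort treats the LAST key as the primary key.
    order = np.lexsort((spikes["unit_index"], spikes["sample_index"], spikes["segment_index"]))
    return spikes[order]


# Parent order is ["a", "b"]; the selection reverses it, so "b" -> 0 and "a" -> 1.
# Two co-temporal spikes at sample 5 end up out of order after the remap.
spikes = np.array([(5, 1, 0), (5, 0, 0), (9, 0, 0)], dtype=spike_dtype)
fixed = sort_if_needed(spikes, parent_unit_ids=["a", "b"], selected_unit_ids=["b", "a"])
```

In this toy case both checks fail, so the lexsort runs and the two spikes at sample 5 are swapped. Without the co-temporal pair, the second check would pass and no sort would be needed.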
`BaseSorting` builds the spike vector with a per-unit boolean scan
over spike_clusters, which is O(N*K).

If we already have the full flat spike time and spike cluster arrays, we can
do a lot better by building the spike vector in one shot.
(I think O(N log N) from the lexsort, which is also pessimistic,
because the lexsort doesn't always need to happen.
Under any circumstances I can dream of, K >> log N.)

Since Phy/Kilosort segments already load the full flat arrays when the
`PhyKilosortSorting` object is created, and keep them around as
`._all_spikes` and `._all_clusters`, we can just use those! :)

Also populates `_cached_spike_vector_segment_slices` directly, so
that `BaseSorting`'s `_get_spike_vector_segment_slices()` lazy
recomputation is skipped.
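To make the one-shot idea concrete, here is a minimal NumPy sketch. It is not the PR's `_compute_and_cache_spike_vector` override: the function name, the dtype, and the assumptions of a single segment and a non-empty `unit_ids` are all illustrative.

```python
import numpy as np

# Assumed spike-vector layout for this sketch.
spike_dtype = [("sample_index", "int64"), ("unit_index", "int64"), ("segment_index", "int64")]


def spike_vector_from_flat_arrays(all_spikes, all_clusters, unit_ids):
    """Build a single-segment spike vector from flat arrays in one pass.

    all_spikes:   sample index of every spike
    all_clusters: cluster id of every spike (same length as all_spikes)
    unit_ids:     non-empty array of cluster ids to keep; position = unit_index
    """
    unit_ids = np.asarray(unit_ids)
    # Map each cluster id to its unit index with one vectorized lookup,
    # instead of one boolean scan per unit.
    sorter = np.argsort(unit_ids)
    pos = np.searchsorted(unit_ids, all_clusters, sorter=sorter)
    pos = np.clip(pos, 0, len(unit_ids) - 1)
    unit_index = sorter[pos]
    # Drop spikes whose cluster id is not among unit_ids.
    keep = unit_ids[unit_index] == all_clusters

    spikes = np.zeros(int(keep.sum()), dtype=spike_dtype)
    spikes["sample_index"] = all_spikes[keep]
    spikes["unit_index"] = unit_index[keep]
    # segment_index stays 0 (single segment assumed).

    # Sort by sample, breaking ties by unit index (last key is primary).
    order = np.lexsort((spikes["unit_index"], spikes["sample_index"]))
    return spikes[order]


all_spikes = np.array([10, 10, 20, 5])
all_clusters = np.array([7, 3, 7, 99])  # cluster 99 is not a selected unit
vector = spike_vector_from_flat_arrays(all_spikes, all_clusters, unit_ids=[3, 7])
```

The sketch always lexsorts. As the commit message notes, that step can be skipped when the flat arrays are already in the required order.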
`BaseSortingSegment.get_unit_spike_trains()` loops over
`get_unit_spike_train`, which is O(N*K) because each call is a
boolean scan over _all_clusters/_all_spikes.

If we know we are going to be getting all the trains, we can do it
much faster. And if we can use numba, even faster still.

In fact, even if we only want _some_ spike trains, it is still often
faster to get all the trains and just discard the ones we don't need
than to get only the trains we need unit-by-unit (because we
only ever store or cache flat arrays of spike times/clusters).

Note that **only the use_cache=False path is affected**; the
use_cache=True path triggers the computation of the spike vector, which
I don't think can ever be the most efficient way to get spike trains.
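One possible NumPy-only way to get every train in a single pass is sketched below. It is not the PR's numpy or numba implementation; the function name and toy data are hypothetical, and it assumes `all_spikes` is already sorted in time.

```python
import numpy as np


def all_unit_spike_trains(all_spikes, all_clusters, unit_ids):
    """Return {unit_id: spike train} for every requested unit.

    Assumes all_spikes is sorted by time. A stable sort on the cluster ids
    then keeps each unit's spikes in time order.
    """
    order = np.argsort(all_clusters, kind="stable")
    sorted_clusters = all_clusters[order]
    sorted_spikes = all_spikes[order]

    # Each unit now occupies one contiguous block; find the block boundaries.
    unit_ids = np.asarray(unit_ids)
    starts = np.searchsorted(sorted_clusters, unit_ids, side="left")
    stops = np.searchsorted(sorted_clusters, unit_ids, side="right")
    return {
        uid: sorted_spikes[start:stop]
        for uid, start, stop in zip(unit_ids.tolist(), starts, stops)
    }


all_spikes = np.array([1, 2, 4, 4, 9])
all_clusters = np.array([7, 3, 7, 3, 7])
trains = all_unit_spike_trains(all_spikes, all_clusters, unit_ids=[3, 7, 5])

# "Get them all, then discard": keep only the units of interest.
wanted = {uid: trains[uid] for uid in [7]}
```

Unlike the spike vector, this never has to order co-temporal spikes across units. The cost is one sort over the cluster ids plus two binary searches per unit, rather than one full boolean scan per unit.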
…izations

- Fixed test_compute_and_cache_spike_vector: was comparing an array to
  itself (to_spike_vector use_cache=False still returns the cached
  vector). Now explicitly calls the USS override and the BaseSorting
  implementation, and compares the two.
- Added test_uss_get_unit_spike_trains_with_renamed_ids: also not a test
  of the optimization commits per se, but would have caught a mistake made
  along the way. Verifies get_unit_spike_trains returns child-keyed dicts
  (not parent-keyed).
- Added test_spike_vector_sorted_after_reorder_with_cotemporal_spikes:
  verifies the USS spike vector is correctly sorted when the selection
  reverses unit order and co-temporal spikes exist.
- Added test_phy_sorting_segment_get_unit_spike_trains: validates the
  new fast methods on PhySortingSegment.
- Added test_phy_compute_and_cache_spike_vector: verifies the Phy
  override of _compute_and_cache_spike_vector matches BaseSorting
  implementation.
@grahamfindlay
Contributor

@alejoe91 my changes PR'd to your fork whenever you're ready.

The only thing I should point out that isn't in the commit messages:
I mocked a minimal Phy folder for testing instead of using the phy_example_0 GIN dataset, just because it was quick, easy, and lightweight. I did feel a little guilty doing it, but I'm also not convinced it was a bad idea.

@alejoe91 alejoe91 marked this pull request as ready for review April 14, 2026 09:47
@h-mayorquin
Collaborator

I am curious what prompted this. What profiling did you guys do? Any chance we can have a discussion here on the repo, at least to know what the performance benchmarks, reasons, and validation were?

@alejoe91
Member Author

@grahamfindlay is doing very long chronic recordings. He does all the processing and at a second iteration wants to load the phy sorting object, select some units, and get all the spike trains.

Just caching the spike vector takes almost 4 minutes! Plus there were some additional lexsorts that can be avoided to speed up computation.

At least to give some context @h-mayorquin

@grahamfindlay maybe you can add some more details on benchmarks and profiling?

@grahamfindlay
Contributor

grahamfindlay commented Apr 16, 2026

Here are example timings for various operations using 1 example subject. This subject only has ~400 million spikes - I have some with many more. FWIW, you shouldn't need long chronic recordings to see tangible improvements from most of these changes. I'd have to dig through my notes, but I tested with 100M spikes and there were still clear gains.

"Parent before" = The KiloSortSortingExtractor (342 units), pre-PR
"Parent after" = The KiloSortSortingExtractor, with PR
"Leaf before" = Two layers of UnitSelectionSorting (first layer: 258 units, second layer: 258 units), pre-PR
"Leaf after" = UnitSelectionSorting, with PR

The two layers of UnitSelectionSorting come from 1) selecting based on quality, 2) selecting based on cell type (here I asked for all cell types, so it should effectively be a no-op, but it actually has a big cost).

| Operation | Parent Before | Parent After | Leaf Before | Leaf After | Notes |
| --- | --- | --- | --- | --- | --- |
| to_spike_vector() | 5m58s | 3m6s | 5m58s + 4m30s | 3m6s + 21s | Time for parent + marginal time for children |
| precompute_spike_trains() | +2m18s | +2m10s | +2m47s | +2m18s | Starting from a hot cache and precomputed parent spike trains (for leaf), i.e. best case |
| loop over get_unit_spike_train(use_cache=False) | 3m40s | 3m40s | ~13m16s (bug) | 3m15s | Bugs: wasn't available when return_times=True; use_cache was never respected by UnitSelectionSorting |
| get_unit_spike_trains(use_cache=True) | N/A | Same as precompute_spike_trains() | N/A | Same as precompute_spike_trains() | Just syntactic sugar; still relies on the spike vector |
| get_unit_spike_trains(use_cache=False) | 3m40s | 35s (numba) / 1m49s (numpy) | 3m15s | 35s (numba) / 1m49s (numpy) | Numba / NumPy should be 11s / 1m15s; must fix |

Comments:

  • The 4m30s to get the UnitSelectionSorting (USS) spike vector was 2m15s per layer, including the no-op layer.
  • The improvements in to_spike_vector() come from overriding the base class method to take advantage of the fact that the KS extractor already has access to the full flat arrays, and from checks to avoid needless lexsorting on the USS.
    • One of the things that can trigger lexsorting is if the user selects unit ids in a different relative order than they appeared in the parent. I handle this pathological case, but it might be worth discussing whether they should be allowed to do this in the first place, or whether we should always re-order the ids to match the parent. Hopefully it's uncommon in practice.
  • The fact that it still takes minutes to precompute the spike trains for a USS after precomputing the spike trains for the KiloSortSortingExtractor (i.e. all units) is conspicuous. You're asking for a subset of the trains you already precomputed -- it should be instant!
  • The best way to get spike trains before was to bypass the cache so you could bypass the spike vector computation, but because of bugs you couldn't do this from a USS, or if you wanted to return times in seconds (without accessing private properties like ._parent_segment).
  • The new get_unit_spike_trains() takes advantage of the fact that if you know you will get all spike trains, you can again use the full flat (sorted) arrays and just scatter them. Basically, to get spike trains, you don't have to figure out whether two spikes from different units occur on the same sample, which you do need to know for the spike vector. In fact, it's so much cheaper that even if you only want ~20 spike trains, it's better to just get them all and discard the ones you aren't interested in...
  • You could probably apply these same principles to get gains for other extractors besides the Phy/KiloSort ones.

There are rough edges with this PR I know about:

  1. Alessio and Samuel pointed out that the BasePhyKilosortSortingExtractor is never multi-segment, so I can remove a loop over segments and save a possibly expensive call to np.concatenate().
  2. When translating my prototype numba and numpy versions of get_unit_spike_trains() to the production versions, I made some minor changes to fit function signatures and style that I thought would be harmless, but apparently they add ~20s to the numba implementation and ~30s to the numpy one. This is pretty significant, as it took the numba path from 11s to 35s. I need to go back and figure out why the effect of these seemingly small changes was so dramatic.

Another question I haven't resolved:

  • Why is building the spike vector so costly in the first place, even using my new more efficient method for 1-shot'ing it from the full flat arrays? It just feels like there should be a better way. As Samuel pointed out, it's 1 big malloc as opposed to K (K = units) mallocs for a dictionary of trains. I don't think I'm just creating views onto the underlying flat arrays... so what gives? Is it really all in the O(N log N) lexsort, and the need to allocate a lot of space for the segment indices?

Labels

core Changes to core module performance Performance issues/improvements

3 participants