Not sure if an issue is the right place to discuss this — happy to move the conversation elsewhere if needed.
## Context
At Whatnot we use `Phoenix.PubSub` (backed by `Registry`) for livestreams. In #14654 we added `{:duplicate, :key}` partitioning, which helped when we had many small livestreams, but it doesn't help with large ones (as discussed in the PR): all subscribers for a key still end up in one `duplicate_bag` partition.
We hit a concrete problem: at a certain leave rate from a large livestream, the message queue of the PID partition's GenServer runs away, because each `:ets.match_delete/2` on the `duplicate_bag` has to scan every entry under that key to find the ones belonging to the leaving process. The partition can't keep up.
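To illustrate the cost, here is a minimal sketch of the `duplicate_bag` layout described above (table name, key, and entry counts are illustrative, not Registry's actual internals):

```elixir
# A duplicate_bag partition storing {key, {pid, value}} entries, as in the
# current Registry layout. All subscribers of a hot key share one partition.
table = :ets.new(:partition_sketch, [:duplicate_bag, :public])

key = "livestream:123"
subscribers = for _ <- 1..1_000, do: spawn(fn -> :ok end)

for pid <- subscribers, do: :ets.insert(table, {key, {pid, nil}})

# Unregistering one pid must scan *all* entries under the hot key to find
# its rows -- O(entries under key), paid again for every leaving process.
leaving = hd(subscribers)
:ets.match_delete(table, {key, {leaving, :_}})

# Only the leaving pid's entry is gone; the other 999 remain.
999 = :ets.info(table, :size)
```

At a high leave rate, each of these full-key scans lands on the same partition GenServer, which is where the queue builds up.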
## Idea
Use an `ordered_set` ETS table with composite keys `{key, pid, counter}` instead of a `duplicate_bag` with `{key, {pid, value}}`. Since `ordered_set` sorts entries, all entries for a `{key, pid}` prefix are adjacent in the tree, so unregistering only touches that process's entries instead of scanning the whole key.
This is combined with key-based partitioning, so a lookup hits one partition rather than all of them.
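The proposed layout can be sketched as follows (again with illustrative names; the `counter` here is just a per-registration index to keep composite keys unique):

```elixir
# An ordered_set keyed by {key, pid, counter}: a process's entries under a
# key sort adjacently, so deletes can target just that contiguous range.
table = :ets.new(:partition_sketch, [:ordered_set, :public])

key = "livestream:123"
subscribers = for _ <- 1..1_000, do: spawn(fn -> :ok end)

for {pid, counter} <- Enum.with_index(subscribers) do
  :ets.insert(table, {{key, pid, counter}, nil})
end

# Unregister one pid: select_delete with a {key, pid, :_} prefix only
# visits that process's entries, not every entry under the hot key.
leaving = hd(subscribers)
1 = :ets.select_delete(table, [{{{key, leaving, :_}, :_}, [], [true]}])

999 = :ets.info(table, :size)
```

Because the key prefix `{key, pid}` is fully bound, the tree traversal is confined to that process's entries, which is what turns unregister from O(entries under key) into O(entries for this pid).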
A PoC implementation and benchmarks are available at: https://github.com/studzien/elixir/tree/registry/duplicate-ordered-set
In our simulated load tests, the lookup-time increase was not noticeable for our workload, and we no longer saw the queue runaway.
## Benchmark results
Hot-key workload: all N entries under a single key, 16 partitions. `duplicate` = pid-partitioned `duplicate_bag`; `ordered` = key-partitioned `ordered_set`.
### Lookup

| N | duplicate | ordered | Relative |
| --- | --- | --- | --- |
| 100 | 4.0 μs | 6.4 μs | ordered 1.6x slower |
| 1,000 | 29.5 μs | 67.3 μs | ordered 2.3x slower |
| 10,000 | 315 μs | 404 μs | ordered 1.3x slower |
| 1,000,000 | 149 ms | 94.9 ms | ordered 1.6x faster |
Ordered is slower at small N because it uses `:ets.select/2` where `duplicate_bag` can use `:ets.lookup_element/3`. The gap closes as N grows, and at 1M entries ordered wins because a lookup hits one partition instead of sixteen.
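For concreteness, this is the shape of the two lookup paths being compared (a minimal sketch with illustrative table names, not the PoC's exact code):

```elixir
# duplicate_bag: one lookup_element call pulls every {pid, value} under a
# key. ordered_set: a select over the partially-bound {key, pid, counter}
# prefix, which carries extra match-spec overhead at small N.
bag = :ets.new(:bag_sketch, [:duplicate_bag, :public])
ord = :ets.new(:ord_sketch, [:ordered_set, :public])

key = "livestream:123"
me = self()

:ets.insert(bag, {key, {me, :v}})
:ets.insert(ord, {{key, me, 0}, :v})

# duplicate_bag path: single call, returns the stored {pid, value} tuples.
[{^me, :v}] = :ets.lookup_element(bag, key, 2)

# ordered_set path: select with the key bound, pid and counter free,
# rebuilding {pid, value} tuples in the match-spec body.
[{^me, :v}] =
  :ets.select(ord, [{{{key, :"$1", :_}, :"$2"}, [], [{{:"$1", :"$2"}}]}])
```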
### Unregister

| N | duplicate | ordered | Relative |
| --- | --- | --- | --- |
| 100 | 2.9 μs | 3.3 μs | ~tied |
| 1,000 | 4.1 μs | 3.6 μs | ordered 1.1x faster |
| 10,000 | 14.7 μs | 4.0 μs | ordered 3.7x faster |
| 1,000,000 | 5.1 ms | 5.4 μs | ordered ~940x faster |
This is the main win. At 1M entries, a single unregister drops from ~5 ms to ~5 μs; that per-unregister cost is what drives the queue runaway in the partition GenServer with `duplicate_bag`.
## Tradeoffs
Lookup is roughly 1.3-2.3x slower at small to medium N, while unregistration and crash cleanup are dramatically faster at scale. For our use case (large channels with a high join/leave rate), the tradeoff is clearly worth it.
## Next steps
We're happy to work on this and prepare a production-ready version, but would appreciate guidance on structuring the solution. The current Registry implementation already has a fair amount of branching for `unique`, `duplicate` partitioned by pid, and `duplicate` partitioned by key; adding another backend on top makes the code harder to follow. Is the current approach (more `case` branches) acceptable, or is there a preferred way to organize this (e.g., separate modules per backend, a behaviour)?
Related: #14654