SOLR-18195: Support for Collapse Results in Combined Query Component by ercsonusharma · Pull Request #4277 · apache/solr

ercsonusharma · 2026-04-11T09:30:49Z

https://issues.apache.org/jira/browse/SOLR-18195

This builds on the earlier Combined Query Component feature: https://issues.apache.org/jira/browse/SOLR-17319

Description

When using {!collapse} with CombinedQueryComponent (hybrid search / RRF), duplicate documents appear in the result set for the same collapse field value. Each sub-query independently collapses correctly via CollapsingPostFilter, but simpleCombine() merges results by Lucene doc ID only - it has no awareness of the collapse field. Different sub-queries may select different group heads for the same field value, and both survive the merge.
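To make the failure mode concrete, here is a minimal, self-contained sketch. It is a toy model, not Solr code: the `Hit` record, `mergeByDocId`, and the sample values are hypothetical stand-ins, and the real `simpleCombine()` is more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the reported bug; none of these names exist in Solr. */
final class DocIdOnlyMergeDemo {

  /** One hit: doc ID, value of the collapse field, and score. */
  record Hit(int docId, String collapseValue, float score) {}

  /** Merges per-query hit lists keyed by doc ID only, ignoring the collapse field. */
  static List<Hit> mergeByDocId(List<List<Hit>> perQueryHits) {
    Map<Integer, Hit> byDocId = new LinkedHashMap<>();
    for (List<Hit> hits : perQueryHits) {
      for (Hit hit : hits) {
        byDocId.putIfAbsent(hit.docId(), hit);
      }
    }
    return new ArrayList<>(byDocId.values());
  }

  public static void main(String[] args) {
    // Each sub-query is already collapsed on its own: one head per collapse value.
    List<Hit> q1 = List.of(new Hit(1, "groupA", 0.9f));
    List<Hit> q2 = List.of(new Hit(4, "groupA", 0.8f));
    // The doc IDs differ, so both heads of groupA end up in the merged list.
    System.out.println(mergeByDocId(List.of(q1, q2)));
  }
}
```

Each input list holds a single head for `groupA`, yet the merged list holds two, which is the duplicate described above.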

However, it is a well-known limitation that "In order to use these features with SolrCloud, the documents must be located on the same shard." (https://solr.apache.org/guide/solr/latest/query-guide/collapse-and-expand-results.html)

Solution

Collapsed duplicates are removed by delegating to SolrIndexSearcher with the CollapsingPostFilter. The query is built with PrecomputedScoreQuery to preserve the original scores and is filtered on the docSet from the combined sub-queries, so duplicates are removed while honouring the collapse query criteria.
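The intended effect can be modelled in plain Java. This is only a sketch of the outcome, not of the mechanism: it does not use SolrIndexSearcher, CollapsingPostFilter, or PrecomputedScoreQuery, and it assumes the collapse criterion is "highest score wins". The `Hit` record and `collapseByMaxScore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of re-collapsing the combined hits; not the PR's implementation. */
final class ReCollapseDemo {

  /** One hit: doc ID, value of the collapse field, and the score carried into the combine step. */
  record Hit(int docId, String collapseValue, float score) {}

  /** Keeps one hit per collapse value, the one with the highest score, leaving scores unchanged. */
  static List<Hit> collapseByMaxScore(List<Hit> combined) {
    Map<String, Hit> heads = new LinkedHashMap<>();
    for (Hit hit : combined) {
      // Assumption: ties keep the hit seen first.
      heads.merge(hit.collapseValue(), hit, (a, b) -> b.score() > a.score() ? b : a);
    }
    return new ArrayList<>(heads.values());
  }
}
```

Given hits for docs 1 and 4 that share a collapse value, only the higher-scoring one is returned, and its score is the one it had before the re-collapse.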

Tests

Added a comprehensive test with a collapse query.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide
I have added a changelog entry for my change

ercsonusharma · 2026-04-14T01:17:38Z

David, could you please take a look when you have a moment? Your past context on this feature would be super helpful. Thank you @dsmiley

dsmiley · 2026-04-16T04:30:24Z

    long approximateTotalHits = 0;
    Map<String, List<ShardDoc>> shardDocMap = new HashMap<>();
    String[] queriesToCombineKeys = rb.req.getParams().getParams(CombinerParams.COMBINER_QUERY);
+    // Build per-shard set of doc IDs retained after collapse in simpleCombine.


Can you elaborate on this please? I'm lost. Is this theoretical? Tested?

Yes, sure. Here is the overall flow:

Phase 1: Shard-level. Each shard runs the combined query request. In process():

Per-query execution: Each sub-query executes and applies the fq={!collapse} on its own, so each produces its own set of group heads. The individual results are added to crb.rsp under "response_per_query" — this is what the coordinator reads per-query later.

Cross-query dedup via simpleCombine: All per-query results are merged and collapse is re-applied across the union.

The result is set as crb.rsp's main "response". This is the shard's canonical post-collapse result — it contains only the docs that survived collapse across all sub-queries.

So the shard response sent to the coordinator contains both:

"response_per_query" — per-query doc lists (each collapsed individually, may have cross-query duplicates)

"response" — the simpleCombine output (deduplicated across queries)

Phase 2: Coordinator-level (mergeIds())

The coordinator iterates per-query responses ("response_per_query") from each shard to build shardDocMap (query -> list of ShardDocs), which is then passed to the RRF combiner.

The problem is: per-query responses can contain docs that were eliminated by simpleCombine on the shard. If we feed them into RRF, those eliminated docs get reintroduced in the final result.

Example: On shard-1, q1 picked id=1 as group head for mod3_idv=1, and q2 picked id=4. simpleCombine re-collapsed and kept only one (say id=4). But both id=1 (from q1's per-query response) and id=4 (from q2's) are still in "response_per_query". Without filtering, RRF would see both.
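A small sketch of why unfiltered per-query lists are a problem for rank fusion. It uses the standard Reciprocal Rank Fusion formula, score(d) = sum of 1 / (k + rank) over the lists that contain d. The class name, the use of integer ids, and k = 60 (a commonly used constant) are assumptions for illustration and are not taken from this PR's combiner.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy Reciprocal Rank Fusion over lists of document ids; not Solr's combiner. */
final class RrfDemo {

  /** Returns id -> fused score, where each list contributes 1 / (k + rank), rank starting at 1. */
  static Map<Integer, Double> fuse(List<List<Integer>> rankedLists, int k) {
    Map<Integer, Double> scores = new LinkedHashMap<>();
    for (List<Integer> ranked : rankedLists) {
      for (int i = 0; i < ranked.size(); i++) {
        scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
      }
    }
    return scores;
  }

  public static void main(String[] args) {
    // Per-query lists from the example: q1 returned id=1, q2 returned id=4.
    System.out.println(fuse(List.of(List.of(1), List.of(4)), 60));
  }
}
```

Both id=1 and id=4 receive a fused score, so both would appear in the final result even though the shard's combined response kept only one of them.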

The fix: combinedDocIdsPerShard extracts the doc IDs from each shard's "response" (the simpleCombine output). Then, while iterating per-query docs, any doc not in that set is skipped. This ensures RRF/Combiner only operates on docs that survived the shard-level cross-query collapse.
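Roughly, the filtering step for a single shard looks like the sketch below. It is a simplified model under stated assumptions: document ids are plain integers, there is one shard, and `keepSurvivors` is a hypothetical helper. The actual code works on ShardDoc objects inside mergeIds().

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of filtering per-query docs against one shard's combined response. */
final class SurvivorFilterDemo {

  /**
   * @param perQuery query key -> ranked ids from "response_per_query"
   * @param combinedResponse ids from the shard's main "response"
   * @return the per-query lists with ids absent from combinedResponse removed, order preserved
   */
  static Map<String, List<Integer>> keepSurvivors(
      Map<String, List<Integer>> perQuery, Collection<Integer> combinedResponse) {
    Set<Integer> survivors = new HashSet<>(combinedResponse);
    Map<String, List<Integer>> filtered = new LinkedHashMap<>();
    for (Map.Entry<String, List<Integer>> entry : perQuery.entrySet()) {
      List<Integer> kept = new ArrayList<>();
      for (Integer id : entry.getValue()) {
        if (survivors.contains(id)) {
          kept.add(id);
        }
      }
      filtered.put(entry.getKey(), kept);
    }
    return filtered;
  }
}
```

With q1 -> [1], q2 -> [4] and a combined response of [4], q1's list becomes empty and q2's list is unchanged, so the combiner never sees id=1.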

Tested in testCollapseWithCombinedQueryProducesDuplicates - sets up exactly this scenario with two queries scoring the same collapse group differently, and asserts no duplicates in the final result.

That helps a lot!
It may be strange to suggest this but I think simply removing the comment words "retained after collapse in simpleCombine" would be slightly clearer, as that references stuff happening at a shard level which confused me a little as we're processing at the coordinator here.

What I find confusing from your explanation is that a shard returns both response_per_query and the standard response key. I will suggest a comment to put somewhere (not precisely at the line here). Not sure if it's actually correct or useful but tell me:
"This component receives both "response" and "response_per_query". Only the former is deduplicated, and thus most (all?) processing of the latter must exclude docs not present in the former."
I wonder if we'd be better off with only "response" and then an additional key to provide info on which queries matched.

Added the comment for future reference. I considered this single-response-with-annotations approach; it looks cleaner and simpler at first glance, but it is non-trivial and touches core components. What I found is that it would require extending DocSlice/DocList (which has no per-doc metadata) and the SOLRDOCLIST wire format, which together would be a meaningful refactor across the per-shard response, the response builder, and the transformer layer. I think it's worth digging deeper as a follow-up.

I definitely do not suggest extending DocSlice/DocList. I mean adding extra internal fields on the returned docs, e.g. a field with an underscore prefix & suffix.

What a shard returns is a docListAndSet; SolrDocument is not available at that level, so there is nowhere to add internal fields. Each shard only calls setResult on the ResponseBuilder with a docListAndSet, which is inherently lossy in terms of metadata. IMO, FWIW, that opens up a separate discussion, as I mentioned earlier.

dsmiley · 2026-04-16T13:53:02Z


ercsonusharma added 8 commits April 9, 2026 18:57

Collapse for Combined Query Component

2c235d6

Collapse for Combined Query Component

8d500bd

Collapse for Combined Query Component

c511bc8

Collapse for Combined Query Component

4c72a2b

Collapse for Combined Query Component

9fb4c1a

Collapse for Combined Query Component

7eca692

Add cloud test

a30dc39

Added changelog

6016026

github-actions bot added the tests and cat:search labels on Apr 11, 2026

ercsonusharma commented Apr 14, 2026

Comment thread solr/core/src/java/org/apache/solr/handler/component/combine/QueryAndResponseCombiner.java Outdated

dsmiley reviewed Apr 14, 2026

ercsonusharma added 2 commits April 14, 2026 18:02

review comment impl

431e233

review comment impl

a82f65f

dsmiley reviewed Apr 15, 2026

ercsonusharma added 3 commits April 15, 2026 12:01

review comment impl

126e5df

review comment impl

d8491fb

coordinator combine

4bc0e42

dsmiley reviewed Apr 16, 2026

collapse fix and review impl

338eece

dsmiley reviewed Apr 16, 2026

review comments impl

25c3186
