SOLR-18195: Support for Collapse Results in Combined Query Component#4277

Open
ercsonusharma wants to merge 15 commits into apache:main from ercsonusharma:feat_combined_query_collapse

Contributor

@ercsonusharma ercsonusharma commented Apr 11, 2026

https://issues.apache.org/jira/browse/SOLR-18195

This builds on the previously added Query Component feature: https://issues.apache.org/jira/browse/SOLR-17319

Description

When using {!collapse} with CombinedQueryComponent (hybrid search / RRF), duplicate documents appear in the result set for the same collapse field value. Each sub-query independently collapses correctly via CollapsingPostFilter, but simpleCombine() merges results by Lucene doc ID only - it has no awareness of the collapse field. Different sub-queries may select different group heads for the same field value, and both survive the merge.
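The failure mode can be illustrated with a minimal plain-Java sketch. It is not the actual `simpleCombine()` code: `DocIdMergeSketch`, `Hit`, and `mergeByDocId` are hypothetical names, and the merge is assumed to deduplicate on doc ID alone, as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DocIdMergeSketch {
  /** Hypothetical stand-in for a scored hit; not a Solr class. */
  static final class Hit {
    final int docId;       // Lucene doc ID
    final String groupKey; // value of the collapse field
    final float score;

    Hit(int docId, String groupKey, float score) {
      this.docId = docId;
      this.groupKey = groupKey;
      this.score = score;
    }
  }

  /** Merges per-query group heads keyed by doc ID only; the collapse field is never consulted. */
  static List<Hit> mergeByDocId(List<List<Hit>> perQueryHeads) {
    Map<Integer, Hit> merged = new LinkedHashMap<>();
    for (List<Hit> heads : perQueryHeads) {
      for (Hit h : heads) {
        merged.putIfAbsent(h.docId, h);
      }
    }
    return new ArrayList<>(merged.values());
  }

  public static void main(String[] args) {
    // Each query collapsed correctly on its own, but chose a different head for group "1".
    List<Hit> q1 = List.of(new Hit(1, "1", 0.9f));
    List<Hit> q2 = List.of(new Hit(4, "1", 0.8f));
    for (Hit h : mergeByDocId(List.of(q1, q2))) {
      System.out.println("doc=" + h.docId + " group=" + h.groupKey);
    }
  }
}
```

Both hits survive the merge even though they share one collapse field value, which is the duplicate this PR addresses.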

However, it is a well-known limitation that "In order to use these features with SolrCloud, the documents must be located on the same shard." (https://solr.apache.org/guide/solr/latest/query-guide/collapse-and-expand-results.html)

Solution

Remove collapsed duplicates by delegating to SolrIndexSearcher with the CollapsingPostFilter. The query is built with PrecomputedScoreQuery to preserve the original scores and is filtered on the docSet from the combined sub-queries. Duplicates are removed honouring the collapse query criteria.
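As a rough analogue of the intended outcome, here is a plain-Java sketch that re-collapses the merged union by collapse field value. It does not use SolrIndexSearcher, CollapsingPostFilter, or PrecomputedScoreQuery; `RecollapseSketch`, `Hit`, and `recollapse` are hypothetical names, and picking the highest-scoring doc per group is an assumption standing in for whatever criteria the collapse query specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecollapseSketch {
  /** Hypothetical stand-in for a scored hit; not a Solr class. */
  static final class Hit {
    final int docId;
    final String groupKey; // value of the collapse field
    final float score;     // original score, carried through unchanged

    Hit(int docId, String groupKey, float score) {
      this.docId = docId;
      this.groupKey = groupKey;
      this.score = score;
    }
  }

  /** Keeps one hit per collapse value: the highest score wins, ties keep the earlier hit. */
  static List<Hit> recollapse(List<Hit> union) {
    Map<String, Hit> headPerGroup = new LinkedHashMap<>();
    for (Hit h : union) {
      headPerGroup.merge(h.groupKey, h, (current, next) -> next.score > current.score ? next : current);
    }
    return new ArrayList<>(headPerGroup.values());
  }

  public static void main(String[] args) {
    List<Hit> union = List.of(new Hit(1, "1", 0.9f), new Hit(4, "1", 0.8f), new Hit(2, "2", 0.5f));
    for (Hit h : recollapse(union)) {
      System.out.println("doc=" + h.docId + " group=" + h.groupKey + " score=" + h.score);
    }
  }
}
```

The point of the sketch is only that scores are preserved and one head remains per group; the PR itself reuses the existing collapse machinery rather than reimplementing it.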

Tests

Added a comprehensive test with a collapse query.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@ercsonusharma
Contributor Author

David, could you please take a look when you have a moment? Your past context on this feature would be super helpful. Thank you, @dsmiley.

Comment thread solr/core/src/java/org/apache/solr/handler/component/CombinedQueryComponent.java Outdated
long approximateTotalHits = 0;
Map<String, List<ShardDoc>> shardDocMap = new HashMap<>();
String[] queriesToCombineKeys = rb.req.getParams().getParams(CombinerParams.COMBINER_QUERY);
// Build per-shard set of doc IDs retained after collapse in simpleCombine.
Contributor

Can you elaborate on this please? I'm lost. Is this theoretical? Tested?

Contributor Author

Yes, sure. Here is the overall flow:

Phase 1: Shard-level (process())

Each shard runs the combined query request:

  1. Per-query execution: Each sub-query executes and applies the fq={!collapse} on its own, so each produces its own set of group heads. The individual results are added to crb.rsp under "response_per_query" — this is what the coordinator reads per-query later.
  2. Cross-query dedup via simpleCombine: All per-query results are merged and collapse is re-applied across the union.

The result is set as crb.rsp's main "response". This is the shard's canonical post-collapse result — it contains only the docs that survived collapse across all sub-queries.

So the shard response sent to the coordinator contains both:

  • "response_per_query" — per-query doc lists (each collapsed individually, may have cross-query duplicates)
  • "response" — the simpleCombine output (deduplicated across queries)

Phase 2: Coordinator-level (mergeIds())

The coordinator iterates per-query responses ("response_per_query") from each shard to build shardDocMap (query -> list of ShardDocs), which is then passed to the RRF combiner.

The problem is: per-query responses can contain docs that were eliminated by simpleCombine on the shard. If we feed them into RRF, those eliminated docs get reintroduced in the final result.

Example: On shard-1, q1 picked id=1 as group head for mod3_idv=1, and q2 picked id=4. simpleCombine re-collapsed and kept only one (say id=4). But both id=1 (from q1's per-query response) and id=4 (from q2's) are still in "response_per_query". Without filtering, RRF would see both.
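To see why both docs reach the final result, here is a minimal sketch of Reciprocal Rank Fusion in plain Java. It is not the combiner used in this PR: `RrfSketch` and `fuse` are hypothetical names, and k = 60 with 1-based ranks is an assumption following the usual formula, score(d) = sum over queries of 1 / (k + rank).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RrfSketch {
  /** Fuses ranked ID lists: each appearance of a doc contributes 1 / (k + rank). */
  static Map<String, Double> fuse(List<List<String>> rankedIdsPerQuery, int k) {
    Map<String, Double> fused = new LinkedHashMap<>();
    for (List<String> ranked : rankedIdsPerQuery) {
      for (int i = 0; i < ranked.size(); i++) {
        int rank = i + 1; // ranks are 1-based (assumption)
        fused.merge(ranked.get(i), 1.0 / (k + rank), Double::sum);
      }
    }
    return fused;
  }

  public static void main(String[] args) {
    // q1 returns id=1 and q2 returns id=4, both heads of the same collapse group.
    Map<String, Double> fused = fuse(List.of(List.of("1"), List.of("4")), 60);
    fused.forEach((id, score) -> System.out.println("id=" + id + " rrf=" + score));
  }
}
```

RRF only sees IDs and ranks, so nothing in it can tell that id=1 and id=4 belong to the same group; any deduplication has to happen before the lists are fused.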

The fix: combinedDocIdsPerShard extracts the doc IDs from each shard's "response" (the simpleCombine output). Then, while iterating per-query docs, any doc not in that set is skipped. This ensures RRF/Combiner only operates on docs that survived the shard-level cross-query collapse.
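The filtering step can be sketched with plain collections. This is a simplified illustration for a single shard, not the code in mergeIds(): `SurvivorFilterSketch` and `keepSurvivors` are hypothetical names, and string IDs stand in for ShardDoc entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SurvivorFilterSketch {
  /**
   * Drops every per-query doc that is absent from the shard's combined "response",
   * so only docs that survived the cross-query collapse are passed on.
   */
  static Map<String, List<String>> keepSurvivors(
      Map<String, List<String>> perQueryIds, Set<String> combinedIds) {
    Map<String, List<String>> filtered = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : perQueryIds.entrySet()) {
      List<String> kept = new ArrayList<>();
      for (String id : entry.getValue()) {
        if (combinedIds.contains(id)) {
          kept.add(id); // preserves the per-query rank order of the survivors
        }
      }
      filtered.put(entry.getKey(), kept);
    }
    return filtered;
  }

  public static void main(String[] args) {
    Map<String, List<String>> perQuery = new LinkedHashMap<>();
    perQuery.put("q1", List.of("1", "2")); // id=1 was eliminated on the shard
    perQuery.put("q2", List.of("4", "2"));
    System.out.println(keepSurvivors(perQuery, Set.of("4", "2")));
  }
}
```

Filtering before fusion keeps the combiner itself unchanged; it simply never sees the eliminated group heads.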

Tested in testCollapseWithCombinedQueryProducesDuplicates - sets up exactly this scenario with two queries scoring the same collapse group differently, and asserts no duplicates in the final result.

Contributor

That helps a lot!
It may be strange to suggest this but I think simply removing the comment words "retained after collapse in simpleCombine" would be slightly clearer, as that references stuff happening at a shard level which confused me a little as we're processing at the coordinator here.

What I find confusing from your explanation is that a shard returns both response_per_query and the standard response key. I will suggest a comment to put somewhere (not precisely at the line here). Not sure if it's actually correct or useful but tell me:
"This component receives both "response" and "response_per_query". Only the former is deduplicated, and thus most (all?) processing of the latter must exclude docs not present in the former."
I wonder if we'd be better off with only "response" and then an additional key to provide info on which queries matched.

Contributor Author

Added a comment for future reference. I considered this single-response-with-annotations approach. It looks cleaner and simpler at first glance, but it is non-trivial and touches core components. What I found was that it would require extending DocSlice/DocList (which has no per-doc metadata) and the SOLRDOCLIST wire format, which together would be a meaningful refactor across the per-shard response, the response builder, and the transformer layer. I think it's worth digging deeper as a follow-up.

Contributor

I definitely do not suggest extending DocSlice/DocList. I mean adding extra internal fields on the returned docs, e.g. a field with an underscore prefix and suffix.

Contributor Author

What a shard returns is a docListAndSet; SolrDocument is not available at that level, so there is nowhere to add internal fields. An individual shard only calls setResult on the ResponseBuilder, which holds a docListAndSet that is inherently lossy in terms of metadata. IMO, FWIW, that opens up a separate discussion, as I mentioned earlier.

Comment thread changelog/unreleased/SOLR-18195-collapse-results-combined-query.yml Outdated