[fix] Rewrite search queries to avoid parenthesised-step slow path #178
Merged
line-o merged 1 commit into eXist-db:master on May 6, 2026
Each search function in app.xqm was using the shape `$app:data//(branch1 | branch2 | ...)` -- a parenthesised union step expression. At runtime this defeats the structural index fast path: the engine materialises the full descendant axis under each `$app:data` root, then applies the parenthesised expression as a generic step, instead of dispatching each branch by qname through the structural index.

The split form `$app:data//branch1 | $app:data//branch2 | ...` is semantically identical (XPath's `|` is set union with document-order sort and dedup) but evaluates each branch as an independent path with its own structural-index lookup.

On a synthetic xqdoc corpus (~6,000 functions) the full `search-everywhere` query goes from ~35ms (parenthesised form) down to ~5ms (split form). The function reference UI's keystroke latency on large corpora drops correspondingly.

Two of the functions (`search-in-module-location`, `search-in-module-name`) had a single-branch parenthesised step that also hit the same slow path; they're rewritten by simply dropping the unnecessary parens.

For the upstream optimiser-side companion fix (which addresses the single-step `//(name)` shape automatically), see eXist-db/exist#6303. The union-of-steps distribution that this commit performs by hand is left as an upstream follow-up because it requires more invasive AST rewriting than the parser/optimiser currently support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
duncdrum approved these changes on May 6, 2026
Summary
Each search function in `modules/app.xqm` was using the shape `$app:data//(branch1 | branch2 | ...)` -- a parenthesised union step expression. At runtime this defeats the structural-index fast path: the engine materialises the full descendant axis under each `$app:data` root and applies the union as a generic step, instead of dispatching each branch by qname through the structural index.

This PR rewrites each function to the equivalent split form `$app:data//branch1 | $app:data//branch2 | ...`, where each branch is an independent path with its own structural-index lookup.

What changed
5 functions in `src/main/xar-resources/modules/app.xqm`:

- `search-in-module-location` -- single-branch parenthesised step, parens dropped
- `search-in-module-name` -- single-branch parenthesised step, parens dropped
- `search-in-description` -- 2-branch union, distributed over `$app:data//`
- `search-in-signature` -- 2-branch union, distributed
- `search-everywhere` -- 9-branch union, distributed

Plus a comment block explaining why the rewrite matters and pointing to the upstream optimiser PR.
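As a rough illustration of the two rewrite patterns listed above, here is a minimal sketch. The element names, the in-memory `$data` document, the `local:` function names, and the use of plain `contains()` are all assumptions made for the example; the actual predicates and data model in `app.xqm` may differ.

```xquery
xquery version "3.1";

(: Hypothetical stand-in for $app:data; the real data is a stored collection. :)
declare variable $data :=
    document {
        <xqdoc>
            <module>
                <uri>http://example.org/m</uri>
                <location>/db/m.xqm</location>
            </module>
            <function>
                <name>m:f</name>
                <signature>m:f($x as item())</signature>
                <description>Example function.</description>
            </function>
        </xqdoc>
    };

(: Single-branch case, before: the parens add nothing but change the step shape. :)
declare function local:by-location-before($q as xs:string) as element()* {
    $data//(location[contains(., $q)])
};

(: Single-branch case, after: parens dropped. :)
declare function local:by-location-after($q as xs:string) as element()* {
    $data//location[contains(., $q)]
};

(: Two-branch case, before: union inside a parenthesised step. :)
declare function local:by-text-before($q as xs:string) as element()* {
    $data//(description[contains(., $q)] | signature[contains(., $q)])
};

(: Two-branch case, after: the parent path is distributed over each branch. :)
declare function local:by-text-after($q as xs:string) as element()* {
    $data//description[contains(., $q)] | $data//signature[contains(., $q)]
};

(
    local:by-location-after("m.xqm"),
    local:by-text-after("item")
)
```

Because the sketch runs against in-memory nodes, it only shows the shape of the rewrite; the structural-index effect described in this PR applies to documents stored in the database.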
Why both forms are equivalent
XPath's `|` is a set union with document-order sort and duplicate elimination. For paths `P` and predicate-paths `A`, `B`:

`P//(A | B)` ≡ `P//A | P//B`

The right-hand form evaluates each path independently and unions the results; the left-hand form materialises the descendant axis once and applies the union per-node. Both produce the same node-set in document order.
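The equivalence can be checked by node identity rather than by value. The following is a small sketch under assumed names (`$doc`, `local:same-nodes`, and the `lib`/`fn`/`name`/`sig` elements are invented for illustration and are not part of the app):

```xquery
xquery version "3.1";

(: Hypothetical sample document; element names are illustrative only. :)
declare variable $doc :=
    document {
        <lib>
            <fn><name>alpha</name><sig>alpha($a)</sig></fn>
            <fn><name>beta</name><sig>beta($alpha)</sig></fn>
        </lib>
    };

(: True when both sequences hold the same nodes in the same order,
   compared by identity ("is"), not by string value. :)
declare function local:same-nodes($a as node()*, $b as node()*) as xs:boolean {
    count($a) eq count($b)
    and (every $i in 1 to count($a) satisfies $a[$i] is $b[$i])
};

let $q := "alpha"
(: Left-hand form: union applied as a step under the descendant axis. :)
let $grouped := $doc//(name[contains(., $q)] | sig[contains(., $q)])
(: Right-hand form: each branch evaluated as its own path, then unioned. :)
let $split := $doc//name[contains(., $q)] | $doc//sig[contains(., $q)]
return local:same-nodes($grouped, $split)
```

Comparing with `is` matters here: a value-based comparison such as `deep-equal` could pass even if the two forms returned different nodes with identical content.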
Numbers
Synthetic xqdoc-shaped corpus (200 modules, 30 functions each = 6,000 functions, ngram-indexed on description/name/signature/param/return), measured against an embedded eXist running `develop`:

- `search-in-description` (2-branch)
- `search-in-signature` (2-branch)
- `search-everywhere` (9-branch): ~35ms parenthesised, ~5ms split
- `search-in-module-location` (1-branch parens)
- `search-in-module-name` (1-branch parens)

The function-reference UI's keystroke latency on large corpora drops correspondingly.
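A comparison of this kind could be reproduced with a small timing harness along the following lines. This is only a sketch, not the harness used for the measurements above: the collection path, the xqdoc namespace binding and element names, the `local:time` helper, and the plain `contains()` predicates are assumptions, and it relies on eXist's `util:system-dateTime()` for wall-clock readings.

```xquery
xquery version "3.1";

import module namespace util = "http://exist-db.org/xquery/util";

(: Assumed namespace and element names for the stored xqdoc data. :)
declare namespace xqdoc = "http://www.xqdoc.org/1.0";

(: Hypothetical collection path; point this at the corpus under test. :)
declare variable $data := collection("/db/apps/fundocs/data");

(: Runs $query once and reports the hit count and elapsed milliseconds. :)
declare function local:time(
    $label as xs:string,
    $query as function() as node()*
) as map(*) {
    let $start := util:system-dateTime()
    let $hits := $query()
    let $end := util:system-dateTime()
    return map {
        "query": $label,
        (: counting the hits also makes sure the result is actually used :)
        "hits": count($hits),
        "ms": ($end - $start) div xs:dayTimeDuration("PT0.001S")
    }
};

let $q := "exist"
return (
    local:time("parenthesised", function() {
        $data//(xqdoc:name[contains(., $q)] | xqdoc:signature[contains(., $q)])
    }),
    local:time("split", function() {
        $data//xqdoc:name[contains(., $q)] | $data//xqdoc:signature[contains(., $q)]
    })
)
```

A single run is noisy, so it is worth repeating each query several times and discarding the first (cache-warming) run before comparing the two forms. The hit counts should match between the two entries; if they do not, the queries are not equivalent on that data.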
Related work
Companion PR upstream: eXist-db/exist#6303 -- adds an Optimizer pass that automatically unwraps the single-step parens shape `//(name)`, so future code that accidentally uses parens around a single step gets the win for free. The union-of-steps distribution that this PR does by hand is left as an upstream follow-up because it requires more invasive AST rewriting (distributing the parent path over union branches needs either a `PathExpr.replaceAllSteps`-style API or rewriting the outer `PathExpr` at its parent).

Investigation thread: eXist-db/exist#6295 -- @line-o reported residual ngram performance issues in this app after #6300 merged. Diagnosis pinned the slow path to the parenthesised-step shape in this app's queries, not to ngram or the optimizer's predicate-rewriting. This PR fixes the app side; #6303 fixes the engine side as far as it can.
Test plan
- `app.xqm`: 5 functions rewritten, semantics-preserving
- `fundoc_spec.cy.js` includes a search-everywhere case for "exist_home" -- run by maintainer / CI

[This PR was prepared with Claude Code. -Joe]
🤖 Generated with Claude Code