
[Feature] Add efficient filtering (knn.filter) support for vectorSearch() #5331

Open
mengweieric wants to merge 17 commits into opensearch-project:feature/vector-search-p0 from mengweieric:feature/sql-vector-search-efficient-filtering

Conversation

Collaborator

@mengweieric commented Apr 9, 2026

Summary

Adds a filter_type=post|efficient option to vectorSearch() that controls where the WHERE clause is placed: inside the knn clause (knn.filter) for efficient pre-filtering during ANN search, or outside it in bool.filter for post-filtering (the default). Also makes an explicit LIMIT mandatory for radial search.

What this PR adds

FilterType enum and option parsing

  • New FilterType enum (POST, EFFICIENT) with fromString() validation
  • filter_type added to allowed option keys in VectorSearchTableFunctionImplementation
  • filter_type is stripped from options before knn JSON generation — it's a SQL-layer directive, not a knn parameter
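
For reviewers who want the shape of the parsing without opening the diff, here is a minimal sketch of the enum and the option stripping described above. It is not the PR's source: the helper class `FilterTypeOptions`, its method names, the exception type, the error text, and the case-insensitive matching are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative stand-in for the FilterType enum; not copied from the PR. */
enum FilterType {
  POST,
  EFFICIENT;

  /** Parses a user-supplied value; case-insensitive matching is an assumption. */
  static FilterType fromString(String value) {
    if (value == null) {
      throw new IllegalArgumentException("filter_type must not be null");
    }
    switch (value.trim().toLowerCase(Locale.ROOT)) {
      case "post":
        return POST;
      case "efficient":
        return EFFICIENT;
      default:
        throw new IllegalArgumentException(
            "Unsupported filter_type '" + value + "'; expected 'post' or 'efficient'");
    }
  }
}

/** Hypothetical helper showing how filter_type can be kept out of the knn JSON. */
final class FilterTypeOptions {
  static final String KEY = "filter_type";

  /** Returns the explicitly requested type, or null when the option was omitted. */
  static FilterType explicitFilterType(Map<String, String> options) {
    String raw = options.get(KEY);
    return raw == null ? null : FilterType.fromString(raw);
  }

  /** Copies the options without filter_type; the input map is left untouched. */
  static Map<String, String> knnOptions(Map<String, String> options) {
    Map<String, String> copy = new LinkedHashMap<>(options);
    copy.remove(KEY);
    return copy;
  }

  public static void main(String[] args) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("k", "10");
    options.put("filter_type", "efficient");
    System.out.println(explicitFilterType(options));
    System.out.println(knnOptions(options));
  }
}
```

Returning null for an omitted option keeps "not specified" distinguishable from an explicit `post`, which is what the build-time validation needs.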

Efficient filter pushdown

  • VectorSearchQueryBuilder.pushDownFilter() branches on filter type:
    • POST (default): knn in bool.must + WHERE in bool.filter (post-filtering)
    • EFFICIENT: rebuilds knn query with WHERE embedded in knn.filter via callback
  • Function<QueryBuilder, QueryBuilder> callback keeps JSON serialization in VectorSearchIndex
  • buildKnnQueryJson() collapsed to accept optional filter JSON parameter — no duplication
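
The branching above can be sketched as follows. To stay self-contained, this version passes JSON strings where the PR uses `QueryBuilder`, so the callback is a `Function<String, String>` rather than `Function<QueryBuilder, QueryBuilder>`. The class name, the `Mode` enum, and the method signatures are assumptions; only the two resulting query shapes are taken from the description.

```java
import java.util.function.Function;

/** Illustrative sketch of POST vs EFFICIENT filter placement using JSON strings. */
final class FilterPlacementSketch {
  enum Mode { POST, EFFICIENT }

  /** Stand-in for buildKnnQueryJson(): filterJson is optional and may be null. */
  static String buildKnnQueryJson(String field, String vector, int k, String filterJson) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"knn\":{\"").append(field).append("\":{")
        .append("\"vector\":").append(vector)
        .append(",\"k\":").append(k);
    if (filterJson != null) {
      // EFFICIENT mode: the WHERE predicate travels inside the knn clause.
      sb.append(",\"filter\":").append(filterJson);
    }
    return sb.append("}}}").toString();
  }

  /**
   * Places the WHERE predicate according to the mode. The callback rebuilds the
   * knn query with an optional filter, so serialization stays with its owner.
   */
  static String pushDownFilter(
      Mode mode, String whereJson, Function<String, String> rebuildKnn) {
    if (mode == Mode.EFFICIENT) {
      return rebuildKnn.apply(whereJson);
    }
    // POST (default): knn in bool.must, WHERE in bool.filter.
    String knn = rebuildKnn.apply(null);
    return "{\"bool\":{\"must\":[" + knn + "],\"filter\":[" + whereJson + "]}}";
  }

  public static void main(String[] args) {
    String where = "{\"term\":{\"city\":\"Seattle\"}}";
    Function<String, String> rebuild =
        filter -> buildKnnQueryJson("embedding", "[1.0, 2.0, 3.0]", 10, filter);
    System.out.println(pushDownFilter(Mode.POST, where, rebuild));
    System.out.println(pushDownFilter(Mode.EFFICIENT, where, rebuild));
  }
}
```

Because the builder only ever calls the callback, a single code path produces the knn JSON in both modes, which matches the "no duplication" point above.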

Build-time validation

  • build() override rejects explicit filter_type without a pushdownable WHERE clause
  • Applies to both post and efficient — specifying the directive without a filter is always an error
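
A minimal sketch of that check, assuming the builder tracks two booleans: whether filter_type was given explicitly and whether a WHERE clause was actually pushed down. The class name, parameter names, exception type, and message are hypothetical and not taken from the PR.

```java
/** Illustrative build-time check; names and exception type are assumptions. */
final class FilterTypeBuildCheck {

  /** Rejects an explicit filter_type when no WHERE clause was pushed down. */
  static void validate(boolean filterTypeExplicit, boolean filterPushedDown) {
    if (filterTypeExplicit && !filterPushedDown) {
      throw new IllegalArgumentException(
          "filter_type requires a WHERE clause that can be pushed down");
    }
  }

  public static void main(String[] args) {
    validate(false, false); // filter_type omitted, no WHERE: accepted
    validate(true, true); // explicit filter_type with a pushed-down WHERE: accepted
    try {
      validate(true, false); // explicit filter_type without a WHERE: rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```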

Radial search LIMIT requirement

  • Radial search (max_distance or min_score) without an explicit LIMIT clause is rejected at build time with a clear error message
  • Prevents unbounded result sets from radial queries that could silently return up to maxResultWindow rows
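
The rule can be sketched as below, assuming the options arrive as a string map and the builder records whether a LIMIT was pushed down. The class name, method names, exception type, and error text are illustrative assumptions; only the option keys `max_distance` and `min_score` come from this PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative radial-search LIMIT check; not the PR's actual implementation. */
final class RadialLimitCheck {

  /** A query is radial when it uses max_distance or min_score instead of k. */
  static boolean isRadial(Map<String, String> options) {
    return options.containsKey("max_distance") || options.containsKey("min_score");
  }

  /** Rejects radial queries that were built without an explicit LIMIT. */
  static void validate(Map<String, String> options, boolean limitPushedDown) {
    if (isRadial(options) && !limitPushedDown) {
      throw new IllegalArgumentException(
          "Radial vectorSearch (max_distance or min_score) requires an explicit LIMIT clause");
    }
  }

  public static void main(String[] args) {
    Map<String, String> radial = new HashMap<>();
    radial.put("max_distance", "10.5");
    validate(radial, true); // LIMIT present: accepted
    try {
      validate(radial, false); // no LIMIT: rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Top-k queries pass through this check untouched, since k already bounds the result size.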

Engine support

  • knn.filter is supported for lucene and faiss engines (HNSW, IVF). Engine compatibility is not validated by the SQL plugin — unsupported engines reject at execution time.

SQL syntax

-- Post-filtering (default, same as omitting filter_type)
SELECT v._id, v._score
FROM vectorSearch(table='my-index', field='embedding',
     vector='[1.0, 2.0, 3.0]', option='k=10,filter_type=post') AS v
WHERE v.city = 'Seattle'
LIMIT 10

-- Efficient pre-filtering (WHERE inside knn.filter)
SELECT v._id, v._score
FROM vectorSearch(table='my-index', field='embedding',
     vector='[1.0, 2.0, 3.0]', option='k=10,filter_type=efficient') AS v
WHERE v.city = 'Seattle'
LIMIT 10

-- Radial search requires LIMIT
SELECT v._id, v._score
FROM vectorSearch(table='my-index', field='embedding',
     vector='[1.0, 2.0, 3.0]', option='max_distance=10.5') AS v
LIMIT 100

Test plan

  • ./gradlew spotlessCheck — PASS
  • ./gradlew :opensearch:test — PASS
  • ./gradlew :integ-test:integTest -Dtests.class="*VectorSearchIT" — PASS (25 tests)
  • Sole authorship verified — Eric Wei only, no Co-Authored-By

- Enforce exactly one of k, max_distance, or min_score
- Validate k is in [1, 10000] range
- Add 6 tests: mutual exclusivity (3 combos), k too small, k too
  large, k boundary values (1 and 10000)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
VectorSearchQueryBuilder now accepts options map and rejects
pushDownLimit when LIMIT exceeds k. Radial modes (max_distance,
min_score) have no LIMIT restriction.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson()
  for top-k, max_distance, min_score, nested fields, multi-element
  and single-element vectors, numeric option rendering
- Add edge case tests to VectorSearchTableFunctionImplementationTest:
  NaN vector component, empty option key/value, negative k, NaN for
  max_distance and min_score (6 new tests)
- Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT,
  pushDownSort delegation to parent (2 new tests)
- Extract buildKnnQueryJson() as package-private for direct testing

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Test too-many (5) and zero arguments paths in
VectorSearchTableFunctionResolver to complement existing
too-few (2) test.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Cap radial mode (max_distance/min_score) results at maxResultWindow
  to prevent unbounded result sets
- Reject ORDER BY on non-_score fields and _score ASC in vectorSearch
  since knn results are naturally sorted by _score DESC
- Add 12 integration tests: 4 _explain DSL shape verification tests
  and 8 validation error path tests

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Add multi-sort expression test: ORDER BY _score DESC, name ASC
  correctly rejects the non-_score field (VectorSearchQueryBuilderTest)
- Add case-insensitive argument name lookup test to verify
  TABLE='x' resolves same as table='x' (Implementation test)
- Add non-numeric option fallback test: verifies string options
  are quoted in JSON output (VectorSearchIndexTest)
- Add 4 integration tests: ORDER BY _score DESC succeeds,
  ORDER BY non-score rejects, ORDER BY _score ASC rejects,
  LIMIT within k succeeds (VectorSearchIT, now 16 tests)

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes
sort.getCount() as a limit when non-zero. Our override validated
_score DESC and returned true, but did not preserve this contract.

SQL always sets count=0, so this was not reachable today, but PPL
or future callers may set a non-zero count to combine sort+limit
in one LogicalSort node. Preserve the behavior defensively.

Add focused test: LogicalSort(count=7) with _score DESC verifies
the count is pushed down as request size.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
- Unit test: compound AND predicate survives pushdown into bool.filter
- Integration test: compound WHERE (term + range) produces bool query
- Integration test: radial max_distance with WHERE produces bool query

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…SearchIndex

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…on in VectorSearchQueryBuilder

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…fficient mode

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
…matting

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>
@mengweieric added the feature, skip-diff-analyzer, skip-diff-reviewer, and SQL labels on Apr 9, 2026
Radial search (max_distance or min_score) can return unbounded results.
Add build-time validation that rejects radial queries without an explicit
LIMIT clause, with a clear error message guiding the user.

Signed-off-by: Eric Wei <mengwei.eric@gmail.com>