🐛 Bugfix: re-embed chunk on content update by Lingxi-Li · Pull Request #3250 · ModelEngine-Group/nexent

Lingxi-Li · 2026-06-17T04:02:49Z

🐛 Bugfix: re-embed chunk on content update so stored vector stays in sync

ElasticSearchService.update_chunk accepted content edits but wrote the new text to Elasticsearch without regenerating the embedding vector.
The stored vector silently drifted out of sync with the stored text, degrading pure k-NN search and corrupting the vector half of hybrid search with no operator-visible signal.
This change regenerates and persists the embedding whenever the update payload includes a non-None content value, mirroring the resolution path that create_chunk already uses but with fail-loud semantics instead of create's silent-skip (see Follow-ups section).

New behavior

Metadata-only updates (no content key in payload, or content=None) take a fast path: no knowledge-record lookup, no embedding call, no embedding or embedding_model_id written.
Content updates re-resolve the embedding model from the KB's current record on every call, not from the existing chunk's stored embedding_model_id, to avoid re-using a possibly stale model id.
Any failure during model resolution or embedding generation aborts the entire update before vdb_core.update_chunk is invoked; the error is wrapped through the existing except block as Exception("Error updating chunk: ..."), matching the prior fail-loud convention.

Changes

backend/services/vectordatabase_service.py: ElasticSearchService.update_chunk now detects caller-set content, resolves the embedding model from the knowledge base record via get_knowledge_record then get_embedding_model_by_id, calls get_embeddings, and writes embedding plus embedding_model_id onto the same flat payload sent to vdb_core.update_chunk.
backend/consts/model.py: ChunkUpdateRequest.content gains min_length=1 so an explicit empty string is rejected at the schema boundary, mirroring ChunkCreateRequest.content.
Tests: nine new service tests covering the metadata-only fast path, content re-embedding, the content=None skip, and six fail-loud branches (missing tenant, missing knowledge record, missing embedding_model_id, unresolvable model, get_embeddings raising, empty embedding result).
Existing test updates:
1. test_update_chunk_builds_payload_and_calls_core was rewritten to use the real ChunkUpdateRequest, mock the two new resolution helpers, pass tenant_id, and assert the new embedding / embedding_model_id fields.
2. test_update_chunk_wrapped_exception was switched to a metadata-only payload so the core-failure path is still exercised.
3. test/backend/test_model_consts.py adds the empty-string rejection plus a positive metadata-only construction assertion.

Follow-ups

Make create_chunk fail-loud on embedding resolution failure, matching the convention this PR establishes for update_chunk.
Today it silently proceeds without an embedding field when tenant_id or embedding_model_id is missing or when get_embeddings raises, and only logs a warning.
Deferred because the asymmetry is load-bearing — create-skip leaves an auditable missing field while update-stale leaves a wrong field that looks healthy — and flipping create may break callers that tolerate vector-less creates during KB setup churn, so it deserves its own caller-impact review.
Reconcile the inconsistency where the bulk ingest path writes the per-document field embedding_model_name (matching the index mapping) but create_chunk writes embedding_model_id, which lands as a dynamic field.
The mismatch originates in create_chunk and predates this PR; update_chunk mirrors create_chunk here, so this PR perpetuates the mismatch rather than introducing or fixing it.

🐛 Bugfix: re-embed chunk on content update

2187f22

Lingxi-Li requested review from Dallas98 and WMC001 as code owners June 17, 2026 04:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Bugfix: re-embed chunk on content update#3250

🐛 Bugfix: re-embed chunk on content update#3250
Lingxi-Li wants to merge 1 commit into
ModelEngine-Group:developfrom
Lingxi-Li:dev

Lingxi-Li commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Lingxi-Li commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐛 Bugfix: re-embed chunk on content update so stored vector stays in sync

New behavior

Changes

Follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lingxi-Li commented Jun 17, 2026 •

edited

Loading