SOLR-18187: Document enrichment with LLMs #4259
nicolo-rinaldi wants to merge 16 commits into apache:main from
Conversation
…tUpdateProcessorFactory
- multivalued outputField
- outputField types other than Str/Text: numeric, boolean, and date
…t with LLMs' module
restTestHarness.delete(ManagedChatModelStore.REST_END_POINT + "/model1");
}

private UpdateRequestProcessor createUpdateProcessor(
Can't this always be generalised and used for all the tests? In some of them, you are now repeating this code with small changes...
This is the same as createUpdateProcessor apart from the creation of the request and getInstance().
Maybe we can exclude the Solr request + getInstance() and reuse that method here too, calling it something like "initializeUpdateProcessorFactory"?
What do you think?
I created a function initializeUpdateProcessorFactory that is used inside createUpdateProcessor. This way, the code in the former can be reused.
Why can't some tests use these new functions?
e.g. init_multipleInputFields_shouldInitAllFields
I kept them unrelated to the model creation, just to verify the proper initialization of the factory. I can change this if you want.
@Test
public void init_promptFileWithMissingPlaceholder_shouldThrowExceptionInInform() {
  NamedList<String> args = new NamedList<>();
This is the same as createUpdateProcessor apart from the creation of the request and getInstance().
Maybe we can exclude the Solr request + getInstance() and reuse that method here too, calling it something like "initializeUpdateProcessorFactory"?
What do you think?
changed and fixed tests
example above). These tokens are _mandatory_ for this module to work properly. Solr will throw an error if the
parameters are not properly defined.
For example, both the prompt and the content of the file `prompt.txt` must contain the text '{string_field}', which
will be substituted with the content of the `string_field` field for each document. An example of a valid prompt with
I think this part could be explained in a more schematic and more understandable way.
</updateRequestProcessorChain>
----

Another way of using more than one `inputField` is by using the following notation, instead of more than one parameter
Multiple `inputField` values could also be defined by using the following notation:
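For illustration, the two notations might look like the following in the processor configuration (a sketch only; the `inputFields` array element name is inferred from the `</arr>` snippet quoted in this diff and may not match the final implementation):

```xml
<!-- Option 1: repeat the single-valued parameter once per field -->
<str name="inputField">title</str>
<str name="inputField">description</str>

<!-- Option 2 (assumed): one multi-valued parameter -->
<arr name="inputFields">
  <str>title</str>
  <str>description</str>
</arr>
```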
</arr>
----

The LLM response is mapped to the specified `outputField`. Note that this module only supports a subset of Solr's
Maybe we should also specify that only one `outputField` is supported.
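If the docs take this suggestion, a one-line example could make the constraint concrete (the field name here is illustrative, not from the PR):

```xml
<!-- exactly one outputField; its schema type must be one of the supported
     Solr types (string, text, numeric, boolean, date) -->
<str name="outputField">category_s</str>
```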
====

=== Index first and enrich your documents on a second pass
LLM calls are usually quite slow, so, depending on your use case it could be a good idea to index first your documents
LLM calls are typically slow, so depending on your use case, it may be preferable to first index your documents and enrich them with LLM-generated fields at a later stage.
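To make the two-pass idea concrete, here is a sketch of what such a setup could look like in solrconfig.xml, assuming a dedicated enrichment chain that is only invoked explicitly (chain names are illustrative; `solr.LogUpdateProcessorFactory` and `solr.RunUpdateProcessorFactory` are standard Solr processors):

```xml
<!-- first pass: plain indexing, used by default -->
<updateRequestProcessorChain name="index-only" default="true">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- second pass: re-send the documents with update.chain=enrich
     so the LLM-generated fields are filled in later -->
<updateRequestProcessorChain name="enrich">
  <!-- the document-enrichment processor from this PR would go here -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```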
…fields and updated documentation
=== Models

* A model in this module is a chat model that answers with text given a prompt.
* A model in this Solr module is a reference to an external API that runs the Large Language Model responsible for chat
Exactly one of the following parameters is required: `prompt` or `promptFile`.
Another important feature of this module is that one (or more) `inputField` needs to be injected in the prompt. This is
boolean `enriched` field to `true`.

Faceting or querying on the boolean `enriched` field can also give you a quick idea of how many documents have been
enriched with the newly generated fields.
Added a note to link to the section of the documentation related to the use of update chains
}
----

== How to Trigger Document Enrichment during Indexing
This part has not been moved to the top, near the model configuration in the solrconfig.
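For reference, a chain is usually triggered either per update request via the standard `update.chain` parameter, or by marking it as the default in solrconfig.xml; a minimal sketch (the chain name is illustrative):

```xml
<!-- select per request: /update?update.chain=enrich-documents
     or make it the default for all updates: -->
<updateRequestProcessorChain name="enrich-documents" default="true">
  <!-- enrichment processor and RunUpdateProcessorFactory would go here -->
</updateRequestProcessorChain>
```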
|===
+
One (or more) `inputField` needs to be injected in the prompt. This is done by some special tokens that are the
I would just say this is the field whose content is used as input/passed to the LLM to enrich the document. And that there could be more than one inputField defined.
I would move the other part about the special tokens to the prompt parameter explanation.
At most, you could say that every inputField declared must be referred to in the prompt
module to work properly. Solr will throw an error if the parameters are not properly defined.
For example, both the prompt and the content of the file `prompt.txt` must contain the text '{string_field}', which
will be substituted with the content of the `string_field` field for each document. An example of a valid prompt with
multiple input fields is as follows:
Maybe here I would say:
"Multiple `inputField` values can also be defined by using one of the following notations:"
and then list the two ways.
These fields _can_ be multivalued. Solr uses structured output from LangChain4j to deal with LLMs' responses.

`prompt` or `promptFile`::
Here I would say that there are two ways of defining a prompt, one directly in the config and one through a file...
Then I would explain how the prompt should be structured, and the part related to the inputFields.
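Following that structure, the two options could be shown side by side (the `prompt`/`promptFile` parameter names and the `{string_field}` placeholder are taken from the docs in this PR; the values are illustrative):

```xml
<!-- Option 1: prompt defined inline in the config -->
<str name="prompt">Summarize the following text: {string_field}</str>

<!-- Option 2: prompt loaded from a file in the configset -->
<str name="promptFile">prompt.txt</str>
```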
https://issues.apache.org/jira/browse/SOLR-18187
Description
The goal of this PR is to add a way to integrate LLMs directly into Solr at index time to fill fields that might be useful (e.g., categories, tags, etc.)
Solution
This PR adds LLM-based document enrichment capabilities to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor allows users to enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response into an output field. The output field can be of different types (i.e., string, text, int, long, float, double, boolean, and date) and can be single-valued or multi-valued. LangChain4j's structured output is used to adapt the response to the output field type.
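Putting the pieces of this description together, a chain using the new processor might look roughly like this (the factory class name comes from this PR, but the package path, model name, prompt, and field names are illustrative assumptions, not verified against the final code):

```xml
<updateRequestProcessorChain name="enrich-documents">
  <processor class="solr.llm.DocumentEnrichmentUpdateProcessorFactory">
    <str name="model">model1</str>
    <str name="prompt">Pick a category for this product: {string_field}</str>
    <str name="inputField">string_field</str>
    <str name="outputField">category_s</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```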
The implementation takes inspiration from the text-to-vector feature in the same module, to stay consistent with conventions already established in the language-models module.
Note: this PR was developed with assistance from Claude Code (Anthropic).
Tests
Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches), and processor initialization.
Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.
Checklist
Please review the following and check all that apply:
- I have developed this patch against the main branch.
- I have run ./gradlew check.