feat(llm): make graph extraction split configurable #359
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,10 @@ | |
| from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode | ||
| from hugegraph_llm.nodes.hugegraph_node.schema import SchemaNode | ||
| from hugegraph_llm.nodes.llm_node.extract_info import ExtractNode | ||
| from hugegraph_llm.operators.document_op.chunk_split import ( | ||
|
Member
🧹 Sort the import block
Evidence:
Impact: the PR will fail the repository lint gate even though the targeted tests pass.
Requested fix: run Ruff import sorting on this file and commit the formatted import order.
Contributor
Author
Thanks. I ran Ruff import sorting and formatting on the touched files, and the import block has been reordered by
||
| SPLIT_TYPE_DOCUMENT, | ||
| VALID_SPLIT_TYPES, | ||
| ) | ||
| from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState | ||
| from hugegraph_llm.utils.log import log | ||
|
|
||
|
|
@@ -37,22 +41,43 @@ def prepare( | |
| texts, | ||
| example_prompt, | ||
| extract_type, | ||
| split_type=SPLIT_TYPE_DOCUMENT, | ||
| language="zh", | ||
| **kwargs, | ||
| ): | ||
| # prepare input data | ||
| prepared_input.texts = texts | ||
| prepared_input.language = language | ||
| - prepared_input.split_type = "document" | ||
| if split_type not in VALID_SPLIT_TYPES: | ||
| raise ValueError("split_type must be document, paragraph, or sentence") | ||
|
|
||
| prepared_input.split_type = split_type | ||
| prepared_input.example_prompt = example_prompt | ||
| prepared_input.schema = schema | ||
| prepared_input.extract_type = extract_type | ||
|
|
||
| - def build_flow(self, schema, texts, example_prompt, extract_type, language="zh", **kwargs): | ||
| def build_flow( | ||
| self, | ||
| schema, | ||
| texts, | ||
| example_prompt, | ||
| extract_type, | ||
| split_type=SPLIT_TYPE_DOCUMENT, | ||
| language="zh", | ||
| **kwargs, | ||
| ): | ||
| pipeline = GPipeline() | ||
| prepared_input = WkFlowInput() | ||
| # prepare input data | ||
| - self.prepare(prepared_input, schema, texts, example_prompt, extract_type, language) | ||
| self.prepare( | ||
| prepared_input, | ||
| schema, | ||
| texts, | ||
| example_prompt, | ||
| extract_type, | ||
| split_type, | ||
| language, | ||
| ) | ||
|
|
||
| pipeline.createGParam(prepared_input, "wkflow_input") | ||
| pipeline.createGParam(WkFlowState(), "wkflow_state") | ||
|
|
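The validation added in this hunk can be tried in isolation with the minimal sketch below. The diff imports `SPLIT_TYPE_DOCUMENT` and `VALID_SPLIT_TYPES` but does not show their definitions, so the values here are assumptions inferred from the replaced `"document"` literal and the error message. `validate_split_type` and `build_flow_args` are hypothetical stand-ins, not functions from the project.

```python
# Assumed values: the diff imports these names but does not show how they are defined.
SPLIT_TYPE_DOCUMENT = "document"
VALID_SPLIT_TYPES = ("document", "paragraph", "sentence")


def validate_split_type(split_type=SPLIT_TYPE_DOCUMENT):
    """Hypothetical helper mirroring the check added to prepare()."""
    if split_type not in VALID_SPLIT_TYPES:
        raise ValueError("split_type must be document, paragraph, or sentence")
    return split_type


def build_flow_args(schema, texts, example_prompt, extract_type,
                    split_type=SPLIT_TYPE_DOCUMENT, language="zh", **kwargs):
    """Hypothetical stand-in for build_flow(): returns what prepare() would store."""
    return {
        "split_type": validate_split_type(split_type),
        "language": language,
    }
```

Because `split_type` now sits before `language` in the signature, a caller that previously passed `language` positionally would have that value land in `split_type`. With the validation in place this raises `ValueError` instead of silently changing the chunking, and passing both arguments by keyword avoids the issue altogether.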
@@ -70,6 +95,8 @@ def post_deal(self, pipeline=None, **kwargs): | |
| res = pipeline.getGParamWithNoEmpty("wkflow_state").to_json() | ||
| vertices = res.get("vertices", []) | ||
| edges = res.get("edges", []) | ||
| chunk_count = len(res.get("chunks", [])) | ||
| log.info("Graph extraction chunk_count: %s", chunk_count) | ||
| if not vertices and not edges: | ||
| log.info("Please check the schema.(The schema may not match the Doc)") | ||
| return json.dumps( | ||
|
|
||
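As a rough illustration of the chunk-count logging added to `post_deal`, the sketch below applies the same expression to a plain dict. It assumes the JSON state is a dict whose `chunks` entry, when present, is a list. The logger and the `count_chunks` helper are hypothetical stand-ins for `hugegraph_llm.utils.log` and the inline code in the diff.

```python
import logging

# Stand-in logger; the project uses hugegraph_llm.utils.log instead.
log = logging.getLogger("graph_extract_demo")


def count_chunks(res):
    """Return the number of chunks in a state dict, or 0 when the key is absent."""
    chunk_count = len(res.get("chunks", []))
    log.info("Graph extraction chunk_count: %s", chunk_count)
    return chunk_count
```

Note that the default only applies when the key is missing. If the state could ever carry `"chunks": None`, `len(None)` would raise `TypeError`, and `res.get("chunks") or []` would tolerate that case. Whether `to_json()` can produce such a value is not visible in this diff.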
Evidence: the dropdown is wired into `extract_graph`, but the existing `store_prompt()` call only saves `doc`, `schema`, and `example_prompt`; reload also only restores those fields, and `BasePromptConfig.save_to_yaml()` has no split-type field.
Impact: after reload, a user who selected `paragraph` or `sentence` silently falls back to `document`, so the next extraction can run with different chunking than the UI state they expected.
Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.
Thanks for the review. I updated the demo prompt config path to persist `graph_extract_split_type` and reload it into the graph split dropdown. This should prevent the selected `paragraph` or `sentence` value from silently falling back to `document` after reload.
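The round-trip test requested in this thread could look roughly like the sketch below. It is not the project's code: `DemoPromptConfig` is a hypothetical stand-in for the prompt config class, JSON replaces YAML so the example needs only the standard library, and the allowed values are assumed from the error message in the diff. Falling back to `document` for a missing or unknown value is likewise an assumption about the desired behavior.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass

# Assumed values; the real constants live in the chunk_split operator module.
VALID_SPLIT_TYPES = ("document", "paragraph", "sentence")


@dataclass
class DemoPromptConfig:
    """Hypothetical stand-in for the prompt config, not the project's class."""

    doc: str = ""
    schema: str = ""
    example_prompt: str = ""
    graph_extract_split_type: str = "document"

    def save(self, path):
        # Persist every field, including the split type.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Older files may lack the field; unknown values fall back to "document".
        split_type = data.get("graph_extract_split_type", "document")
        if split_type not in VALID_SPLIT_TYPES:
            split_type = "document"
        data["graph_extract_split_type"] = split_type
        return cls(**data)


def round_trip(config):
    """Save the config to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prompt_config.json")
        config.save(path)
        return DemoPromptConfig.load(path)
```

A test would then assert that `round_trip(DemoPromptConfig(graph_extract_split_type="sentence"))` still reports `sentence`, which is the regression described in the review.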