feat: make graph extraction split configurable #359
I added an additional flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow. Updated local checks:
imbajin left a comment
Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.
graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")
vector_import_bt = gr.Button("Import into Vector", variant="primary")
graph_split_type = gr.Dropdown(
Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.
Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.
Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.
Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent the selected paragraph or sentence value from silently falling back to document after reload.
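A round-trip test of the kind requested above could look roughly like the sketch below. DemoPromptConfig is a hypothetical stand-in written only for illustration: it is not the project's BasePromptConfig, and it serializes to JSON rather than YAML so the example needs only the standard library. Only the field name graph_extract_split_type comes from the reply above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoPromptConfig:
    """Hypothetical stand-in for the prompt config (not the project's class)."""

    doc: str = ""
    graph_schema: str = ""
    example_prompt: str = ""
    # The field under test; "document" mirrors the default described in this PR.
    graph_extract_split_type: str = "document"

    def save(self, path: Path) -> None:
        # Assumption: JSON is used here only to keep the sketch stdlib-only.
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "DemoPromptConfig":
        return cls(**json.loads(path.read_text(encoding="utf-8")))


def round_trip(config: DemoPromptConfig) -> DemoPromptConfig:
    """Save the config to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "prompt_config.json"
        config.save(path)
        return DemoPromptConfig.load(path)


if __name__ == "__main__":
    restored = round_trip(DemoPromptConfig(graph_extract_split_type="sentence"))
    # A non-default value must survive the reload instead of reverting to "document".
    assert restored.graph_extract_split_type == "sentence"
```

The point of the check is that a non-default value is used: a test that only round-trips the default would pass even if the field were never saved.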
from hugegraph_llm.flows.common import BaseFlow
from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
from hugegraph_llm.operators.document_op.chunk_split import (
🧹 Sort the import block
Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.
Impact: the PR will fail the repository lint gate even though the targeted tests pass.
Requested fix: run Ruff import sorting on this file and commit the formatted import order.
Thanks. I ran Ruff import sorting and formatting on the touched files, and the import block has been reordered by ruff check --fix.
Purpose
Closes #343.
This PR makes the graph extraction split type configurable instead of always forcing document.
Design
The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.
Supported split strategies:
- document: keeps the current behavior and sends each uploaded/raw document as one chunk.
- paragraph: uses the existing ChunkSplit paragraph strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
- sentence: uses punctuation-based sentence-boundary splitting for ., ?, !, 。, ?, !, ;, and ;.
The selected value is passed from the demo UI to extract_graph(), then to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally to GraphExtractFlow.prepare() / build_flow().
Invalid split types fail early with a clear error message listing the supported values.
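As a rough illustration of the three strategies and the early validation described here, the sketch below shows one way the dispatch could be written. It is not the PR's implementation: split_text and SUPPORTED_SPLIT_TYPES are hypothetical names, the paragraph branch simply splits on blank lines instead of reusing ChunkSplit with chunk_size=500 and chunk_overlap=30, and the sentence pattern covers only the punctuation marks as they appear in the description.

```python
import re
from typing import List

# Hypothetical constant; the three values come from the PR description.
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")

# A run of non-terminator characters followed by any trailing terminators.
_SENTENCE = re.compile(r"[^.?!。;]+[.?!。;]*")


def split_text(text: str, split_type: str = "document") -> List[str]:
    """Split text into chunks according to split_type (illustrative only)."""
    if split_type not in SUPPORTED_SPLIT_TYPES:
        # Fail early and list the supported values in the message.
        raise ValueError(
            f"Unsupported split_type {split_type!r}; "
            f"expected one of: {', '.join(SUPPORTED_SPLIT_TYPES)}"
        )
    if split_type == "document":
        # The whole document is sent as a single chunk.
        return [text]
    if split_type == "paragraph":
        # Simplification: blank-line splitting, no chunk size or overlap.
        parts = re.split(r"\n\s*\n", text)
    else:
        # Punctuation-based sentence boundaries.
        parts = _SENTENCE.findall(text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    for chunk in split_text("Alice knows Bob. Does Bob know Carol? 是的。", "sentence"):
        print(chunk)
```

Keeping document as the default argument means callers that never pass split_type see no change in behavior, which matches the compatibility goal stated above.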
The existing vertices/edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
The selected graph extraction split type is persisted through the prompt config path and restored into the demo dropdown after reload.
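The idea of keeping the returned JSON limited to vertices and edges while still recording the chunk count can be sketched as follows. format_result and the logger name are hypothetical and chosen only for this example; the actual function in the PR may be shaped differently.

```python
import json
import logging
from typing import Any, Dict, List

# Hypothetical logger name used only in this sketch.
logger = logging.getLogger("graph_extract_sketch")


def format_result(
    vertices: List[Dict[str, Any]],
    edges: List[Dict[str, Any]],
    chunk_count: int,
) -> str:
    """Return the vertices/edges JSON; chunk_count goes to the log only."""
    logger.debug("graph extraction processed chunk_count=%d", chunk_count)
    # Only the two keys that downstream consumers expect are serialized.
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    print(format_result([{"label": "person", "name": "Alice"}], [], chunk_count=3))
```

Routing the count to the log keeps diagnostic detail available without changing the shape of the payload that existing consumers parse.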
For PDF compatibility, this PR treats extracted PDF text the same as other text input and includes representative PDF-like extracted text coverage in tests.
Tests
uv run ruff format --check .
uv run pytest src/tests/document/test_graph_extract_configurable_split.py
uv run pytest src/tests/document