feat: make graph extraction split configurable #359

Open
nw9663644-eng wants to merge 3 commits into apache:main from nw9663644-eng:feat-configurable-graph-split

@nw9663644-eng nw9663644-eng commented Jun 4, 2026

Purpose

Closes #343.

This PR makes the graph extraction split type configurable instead of always forcing document.

Design

The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.

Supported split strategies:

  • document: keeps the current behavior and sends each uploaded/raw document as one chunk.
  • paragraph: uses the existing ChunkSplit paragraph strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
  • sentence: splits at sentence-boundary punctuation such as ., ?, !, and ;.
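
As a rough illustration of how the three strategies differ, here is a minimal standalone sketch. It is not the PR's ChunkSplit code: split_text is a hypothetical helper, the paragraph branch simply splits on blank lines (it does not model chunk_size=500 or chunk_overlap=30), and the sentence branch only handles ASCII punctuation.

```python
import re

# Hypothetical helper for illustration only; not the project's ChunkSplit operator.
# Split after ., ?, ! or ; when followed by whitespace (ASCII punctuation only).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!;])\s+")


def split_text(text: str, split_type: str = "document") -> list[str]:
    """Return chunks of `text` according to the chosen strategy."""
    if split_type == "document":
        # Whole document as a single chunk (the default behavior).
        return [text]
    if split_type == "paragraph":
        # Simplification: paragraphs are runs of text separated by blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if split_type == "sentence":
        return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
    raise ValueError(f"Unsupported split_type: {split_type!r}")
```

With this sketch, split_text("A b. C d? E!", "sentence") yields one chunk per sentence, while the same input with "document" stays a single chunk.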

The selected value is passed from the demo UI to extract_graph(), then to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally to GraphExtractFlow.prepare() / build_flow().
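
The pass-through described above can be pictured with the following sketch. Every name ending in Sketch or _sketch is a hypothetical stand-in, and the real signatures of extract_graph(), schedule_flow(), and prepare() may differ; the point is only that split_type travels as a keyword argument and defaults to document at each layer.

```python
from dataclasses import dataclass


@dataclass
class FlowInputSketch:
    """Hypothetical stand-in for the workflow input object."""
    texts: list[str]
    split_type: str = "document"


class GraphExtractFlowSketch:
    """Hypothetical stand-in for the graph extraction flow."""

    def prepare(self, texts, split_type="document"):
        # The flow records the split type on its input so later nodes can read it.
        return FlowInputSketch(texts=list(texts), split_type=split_type)


def schedule_flow_sketch(flow, texts, **kwargs):
    # The scheduler forwards extra keyword arguments to the flow untouched.
    return flow.prepare(texts, **kwargs)


def extract_graph_sketch(texts, split_type="document"):
    # UI-facing entry point: passes the dropdown value down the chain.
    return schedule_flow_sketch(GraphExtractFlowSketch(), texts, split_type=split_type)
```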

Invalid split types fail early with a clear error message listing the supported values.
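
A minimal sketch of this kind of early validation follows. The names SUPPORTED_SPLIT_TYPES and validate_split_type are assumptions made for illustration, and the exact error wording in the PR may differ.

```python
# Hypothetical names; the PR's actual constant and function may be called something else.
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")


def validate_split_type(split_type: str) -> str:
    """Return split_type unchanged, or raise before any extraction work starts."""
    if split_type not in SUPPORTED_SPLIT_TYPES:
        supported = ", ".join(SUPPORTED_SPLIT_TYPES)
        raise ValueError(
            f"Unsupported split_type {split_type!r}; supported values: {supported}"
        )
    return split_type
```

Validating at the entry point means a typo in the split type surfaces immediately rather than after documents have already been sent for extraction.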

The existing vertices / edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
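
To make the contract concrete, here is a small sketch in which chunk_count goes to the log while the returned JSON keeps only vertices and edges. build_result and the logger name are hypothetical, not taken from the PR.

```python
import json
import logging

log = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def build_result(vertices: list, edges: list, chunks: list) -> str:
    """Serialize the extraction result without changing its top-level keys."""
    # chunk_count is diagnostic only, so it is logged rather than returned.
    log.debug("chunk_count=%d", len(chunks))
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)
```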

The selected graph extraction split type is persisted through the prompt config path and restored into the demo dropdown after reload.
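
The round trip can be sketched as below. This is an assumption-laden illustration: PromptConfigSketch is hypothetical, and it stores JSON to stay within the standard library, whereas the project's prompt config is saved as YAML. The field name graph_extract_split_type is the one mentioned in this PR.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class PromptConfigSketch:
    """Hypothetical stand-in for the prompt config; the real one is YAML-backed."""
    doc: str = ""
    graph_extract_split_type: str = "document"

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "PromptConfigSketch":
        data = json.loads(path.read_text(encoding="utf-8"))
        # Configs written before the field existed fall back to "document".
        return cls(
            doc=data.get("doc", ""),
            graph_extract_split_type=data.get("graph_extract_split_type", "document"),
        )


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "prompt_config.json"
    PromptConfigSketch(doc="sample", graph_extract_split_type="sentence").save(config_path)
    restored = PromptConfigSketch.load(config_path)
```

After the round trip, restored.graph_extract_split_type holds the saved value, which is what the dropdown would be initialized from on reload.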

For PDF compatibility, this PR treats extracted PDF text the same as other text input and includes representative PDF-like extracted text coverage in tests.

Tests

  • uv run ruff format --check .
  • uv run pytest src/tests/document/test_graph_extract_configurable_split.py
  • uv run pytest src/tests/document

@dosubot (bot) added the size:M and enhancement labels Jun 4, 2026
@github-actions (bot) added the llm label Jun 4, 2026
@nw9663644-eng
Author

I added an additional flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow.

Updated local checks:

  • uv run ruff format --check .
  • uv run pytest src/tests/document/test_graph_extract_configurable_split.py
  • uv run pytest src/tests/document

Member

@imbajin imbajin left a comment

Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.

Comment thread hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py Outdated
graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")

vector_import_bt = gr.Button("Import into Vector", variant="primary")
graph_split_type = gr.Dropdown(
Member

⚠️ Persist the selected split type

Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.

Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.

Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.

Author

Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent the selected paragraph or sentence value from silently falling back to document after reload.


from hugegraph_llm.flows.common import BaseFlow
from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
from hugegraph_llm.operators.document_op.chunk_split import (
Member

🧹 Sort the import block

Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.

Impact: the PR will fail the repository lint gate even though the targeted tests pass.

Requested fix: run Ruff import sorting on this file and commit the formatted import order.

Author

Thanks. I ran Ruff import sorting and formatting on the touched files, and the import block has been reordered by ruff check --fix.

@dosubot (bot) added the size:L label and removed the size:M label Jun 5, 2026
@nw9663644-eng nw9663644-eng requested a review from imbajin June 5, 2026 14:11

Labels

enhancement, llm, size:L

Development

Successfully merging this pull request may close these issues.

[Feature] Make graph extraction use configurable chunk splitting

3 participants