feat: make graph extraction split configurable by nw9663644-eng · Pull Request #359 · apache/hugegraph-ai

nw9663644-eng · 2026-06-04T07:23:25Z

Purpose

Closes #343.

This PR makes the graph extraction split type configurable instead of always forcing document.

Design

The graph extraction flow now accepts an optional split_type argument and keeps document as the default to preserve the existing behavior.

Supported split strategies:

document: keeps the current behavior and sends each uploaded/raw document as one chunk.
paragraph: uses the existing ChunkSplit paragraph strategy with chunk_size=500, chunk_overlap=30, and language-aware separators.
sentence: uses punctuation-based sentence-boundary splitting for ., ?, !, 。, ？, ！, ；, and ;.
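As a rough illustration of the sentence strategy, the sketch below splits on the terminators listed above. It is not the PR's implementation: the function name `split_sentences` and the regular expression are assumptions made for this example only.

```python
import re

# Terminators named in the PR description: . ? ! 。 ？ ！ ； ;
_TERMINATORS = ".?!。？！；;"

# One run of non-terminator characters followed by any trailing terminators.
_SENTENCE = re.compile(rf"[^{re.escape(_TERMINATORS)}]+[{re.escape(_TERMINATORS)}]*")


def split_sentences(text: str) -> list:
    """Hypothetical helper: split text at sentence-ending punctuation."""
    # Keep the terminator with its sentence and drop whitespace-only pieces.
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


if __name__ == "__main__":
    print(split_sentences("Hello world. How are you? 好的。谢谢！"))
```

A purely punctuation-based rule like this also splits inside decimals and abbreviations (for example `3.14`), which is the kind of edge case a sentence strategy has to decide how to handle.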

The selected value is passed from the demo UI to extract_graph(), then to SchedulerSingleton.schedule_flow(..., split_type=split_type), and finally to GraphExtractFlow.prepare() / build_flow().

Invalid split types fail early with a clear error message listing the supported values.
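A minimal sketch of the fail-early check, assuming a hypothetical helper named `validate_split_type`; the actual function name, exception type, and message wording in the PR may differ.

```python
from typing import Optional

# The three strategies listed in the PR description.
SUPPORTED_SPLIT_TYPES = ("document", "paragraph", "sentence")


def validate_split_type(split_type: Optional[str] = None) -> str:
    """Hypothetical helper: return a supported split type or raise early."""
    if split_type is None:
        # Omitting the argument preserves the existing behavior.
        return "document"
    if split_type not in SUPPORTED_SPLIT_TYPES:
        supported = ", ".join(SUPPORTED_SPLIT_TYPES)
        raise ValueError(
            f"Unsupported split_type {split_type!r}; supported values: {supported}"
        )
    return split_type
```

Validating before any chunking or LLM call means a typo in the split type surfaces immediately rather than after a partial extraction.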

The existing vertices / edges JSON contract used by “Load into GraphDB” is preserved. chunk_count is logged for debugging instead of being added to the returned JSON.
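The contract described above might look roughly like the following sketch. The function name `build_extract_response` and the logger name are assumptions for illustration; only the `vertices` / `edges` keys and the logged `chunk_count` come from the description.

```python
import json
import logging

log = logging.getLogger("graph_extract_sketch")  # hypothetical logger name


def build_extract_response(vertices: list, edges: list, chunks: list) -> str:
    """Hypothetical helper: return the unchanged JSON shape, log the chunk count."""
    # chunk_count is debugging information, so it goes to the log only.
    log.info("graph extraction chunk_count=%d", len(chunks))
    # Consumers of the JSON still see exactly two top-level keys.
    return json.dumps({"vertices": vertices, "edges": edges}, ensure_ascii=False)
```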

The selected graph extraction split type is persisted through the prompt config path and restored into the demo dropdown after reload.

For PDF compatibility, this PR treats extracted PDF text the same as other text input and includes representative PDF-like extracted text coverage in tests.

Tests

uv run ruff format --check .
uv run pytest src/tests/document/test_graph_extract_configurable_split.py
uv run pytest src/tests/document

nw9663644-eng · 2026-06-04T07:51:04Z

I added an additional flow-level test to verify that a non-default graph extraction split type is passed into the workflow input used by the graph extraction flow.
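A flow-level forwarding check of this kind could be sketched with `unittest.mock`. The `extract_graph` stand-in, its signature, and the `"graph_extract"` flow key below are assumptions for illustration and are not copied from the PR's test file.

```python
from unittest.mock import Mock


def extract_graph(scheduler, text: str, split_type: str = "document"):
    """Hypothetical stand-in for the demo entry point that forwards split_type."""
    return scheduler.schedule_flow("graph_extract", text, split_type=split_type)


def test_non_default_split_type_is_forwarded():
    scheduler = Mock()  # plays the role of the scheduler singleton
    extract_graph(scheduler, "some text", split_type="sentence")
    # The non-default value must reach the scheduler unchanged.
    scheduler.schedule_flow.assert_called_once_with(
        "graph_extract", "some text", split_type="sentence"
    )
```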

Updated local checks:

uv run ruff format --check .
uv run pytest src/tests/document/test_graph_extract_configurable_split.py
uv run pytest src/tests/document

imbajin

Blocking: yes. Summary: the new split option has sentence-semantics and lint regressions that should be fixed before merge. Evidence: targeted pytest passed, but local chunk-split repro and ruff check exposed the issues.

imbajin · 2026-06-05T04:03:52Z

                    graph_data_btn0 = gr.Button("Clear Graph Data", size="sm")

            vector_import_bt = gr.Button("Import into Vector", variant="primary")
+            graph_split_type = gr.Dropdown(


⚠️ Persist the selected split type

Evidence: the dropdown is wired into extract_graph, but the existing store_prompt() call only saves doc, schema, and example_prompt; reload also only restores those fields, and BasePromptConfig.save_to_yaml() has no split-type field.

Impact: after reload, a user who selected paragraph or sentence silently falls back to document, so the next extraction can run with different chunking than the UI state they expected.

Requested fix: save and reload this split type through the prompt config path, or make the control explicitly transient. A prompt-config round-trip test would cover the regression.

nw9663644-eng · Jun 5, 2026

Thanks for the review. I updated the demo prompt config path to persist graph_extract_split_type and reload it into the graph split dropdown. This should prevent a selected paragraph or sentence value from silently falling back to document after reload.
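The round-trip test suggested in the review could be sketched as follows. This is an illustration only: `PromptConfigSketch` is a hypothetical class, and JSON stands in for the YAML file so the example needs nothing beyond the standard library. The field names `doc`, `schema`, `example_prompt`, and `graph_extract_split_type` are taken from the thread.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class PromptConfigSketch:
    """Hypothetical stand-in for the prompt config; JSON replaces YAML here."""

    doc: str = ""
    schema: str = ""
    example_prompt: str = ""
    graph_extract_split_type: str = "document"

    def dumps(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def loads(cls, raw: str) -> "PromptConfigSketch":
        data = json.loads(raw)
        known = {f.name for f in fields(cls)}
        # Files written before the field existed fall back to the default.
        return cls(**{k: v for k, v in data.items() if k in known})


def test_split_type_round_trip():
    saved = PromptConfigSketch(graph_extract_split_type="sentence").dumps()
    assert PromptConfigSketch.loads(saved).graph_extract_split_type == "sentence"
```

Defaulting the missing field to `document` keeps older config files loadable while still restoring an explicit `paragraph` or `sentence` choice.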

imbajin · 2026-06-05T04:03:53Z


 from hugegraph_llm.flows.common import BaseFlow
 from hugegraph_llm.nodes.document_node.chunk_split import ChunkSplitNode
+from hugegraph_llm.operators.document_op.chunk_split import (


🧹 Sort the import block

Evidence: uv run --project .. --extra llm --extra dev ruff check src/hugegraph_llm/flows/graph_extract.py src/hugegraph_llm/utils/graph_index_utils.py src/hugegraph_llm/operators/document_op/chunk_split.py src/tests/document/test_graph_extract_configurable_split.py fails with I001 Import block is un-sorted or un-formatted on this file.

Impact: the PR will fail the repository lint gate even though the targeted tests pass.

Requested fix: run Ruff import sorting on this file and commit the formatted import order.

nw9663644-eng · Jun 5, 2026

Thanks. I ran Ruff import sorting and formatting on the touched files; ruff check --fix reordered the import block.

feat: make graph extraction split configurable

c009344

dosubot Bot added the size:M (this PR changes 30-99 lines, ignoring generated files) and enhancement (new feature or request) labels Jun 4, 2026

github-actions Bot added the llm label Jun 4, 2026

chore: cover graph split flow forwarding

c2890de

imbajin reviewed Jun 5, 2026


fix: address graph split review comments

574a637

dosubot Bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) and removed the size:M label (this PR changes 30-99 lines, ignoring generated files) Jun 5, 2026

nw9663644-eng requested a review from imbajin June 5, 2026 14:11

nw9663644-eng wants to merge 3 commits into apache:main from nw9663644-eng:feat-configurable-graph-split

3 participants
