feat(llm): add /graph/extract API for programmatic graph extraction by Nishieee · Pull Request #351 · apache/hugegraph-ai

Nishieee · 2026-05-31T04:31:14Z

Summary

Closes #348.

HugeGraph-LLM already supports graph extraction through the Gradio demo, but there was no public REST endpoint for it. This PR adds POST /graph/extract to the existing FastAPI app, routing requests through SchedulerSingleton and FlowName.GRAPH_EXTRACT — the same path the demo uses.

Key changes

Add GraphExtractRequest with validation for texts, schema, split_type, and related options
Add graph_http_api and register it on the existing auth router
Make split_type configurable in GraphExtractFlow (default "document", so demo behavior is unchanged)
Return structured JSON (vertices / edges as arrays), with optional warning and meta

Example request

{
  "texts": "Sarah is 30 and works as an attorney.",
  "schema": { "vertexlabels": [], "edgelabels": [], "propertykeys": [] },
  "split_type": "document",
  "include_meta": true
}

Invalid or empty input returns 422; scheduler failures return 500.

Test plan

cd hugegraph-llm && SKIP_EXTERNAL_SERVICES=true uv run pytest src/tests/api/test_graph_api.py -v --tb=short
Regression check: /rag, /text2gremlin, /config/graph, and /graph/extract all register
Ruff format and lint pass
Manual curl against running app with extract LLM configured
Compare API output with Gradio graph extraction on the same input

Copilot

Pull request overview

Adds a public FastAPI endpoint for graph extraction so clients can programmatically invoke the existing GRAPH_EXTRACT scheduler flow instead of using only the Gradio demo.

Changes:

Adds /graph/extract API wiring and request/response handling.
Adds GraphExtractRequest validation and tests for routing, validation, and scheduler errors.
Makes split_type configurable in GraphExtractFlow while preserving the default.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`hugegraph-llm/src/hugegraph_llm/api/graph_api.py`	Adds the new graph extraction REST endpoint.
`hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py`	Adds request model and input normalization for graph extraction.
`hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py`	Threads configurable `split_type` into graph extraction flow preparation.
`hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py`	Registers the graph extraction API router with the existing app.
`hugegraph-llm/src/tests/api/test_graph_api.py`	Adds API tests for successful extraction, validation failures, errors, and route registration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

imbajin

Blocking: yes. Summary: The new graph-name extraction path needs request-scoped graph configuration before it is safe for programmatic use. Evidence: static review; targeted graph API tests passed.

imbajin

Blocking: yes. Summary: request-scoped graph config still mutates shared HugeGraph settings. Evidence: static review of hugegraph-llm/src/hugegraph_llm/api/graph_api.py lines 29-35.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

+    @model_validator(mode="after")
+    def require_client_config_for_named_schema(self):
+        # A named-graph schema needs request-scoped connection settings; inline JSON
+        # schemas (starting with "{") are self-contained and never hit HugeGraph.
+        schema = self.graph_schema
+        is_named_schema = isinstance(schema, str) and not schema.strip().startswith("{")
+        if is_named_schema and self.client_config is None:
+            raise ValueError(
+                "client_config is required when 'schema' refers to an existing graph name; "
+                "provide inline schema JSON instead to extract without a HugeGraph connection."
+            )
+        return self


Reject client_config when 'schema' is inline JSON (it never connects to HugeGraph, so it was silently ignored), and require client_config.graph to match a named-graph schema. Also fix GraphConfigRequest.gs to be Optional[str]. Adds tests for both rejection paths and the triples extract_type forwarding. Co-authored-by: Cursor <cursoragent@cursor.com>

imbajin

Blocking: yes. Summary: The named-schema request path can still inherit a stale global graphspace when client_config.gs is omitted. Evidence: static review of SchemaManager graphspace fallback plus targeted graph API/schema-manager tests passing.

Omitting client_config.gs no longer inherits the global huge_settings graphspace. WkFlowInput now carries the whole connection as one dict (None = use globals), and SchemaManager applies it wholesale instead of per-field None fallback. Adds tests for the gs-omitted case with a non-empty global graphspace and the no-connection fallback path. Co-authored-by: Cursor <cursoragent@cursor.com>

LRriver

Suggested changes before merging:

Strongly Recommended API Contract Adjustments

Move the graph extraction request/response models out of the RAG request module.

GraphExtractRequest currently lives in hugegraph_llm/api/models/rag_requests.py, but graph extraction is a separate API surface from RAG. Please move the graph extraction models into dedicated files, for example:

# hugegraph_llm/api/models/graph_extract_requests.py
class GraphExtractRequest(BaseModel):
    ...

# hugegraph_llm/api/models/graph_extract_responses.py
class GraphExtractResponse(BaseModel):
    ...

Move the route registration into a dedicated graph extraction API module.

Instead of keeping this endpoint in a generic graph_api.py, please use a dedicated module such as:

# hugegraph_llm/api/graph_extract_api.py
def graph_extract_http_api(router: APIRouter):
    ...

Then demo/rag_demo/app.py can register graph_extract_http_api(api_auth). This keeps the graph extraction API boundary separate from RAG and leaves a clear place for related graph extraction endpoints.

Define a stable response envelope instead of returning the raw flow result.

The endpoint currently returns a raw flow-shaped response like:

{
  "vertices": [],
  "edges": [],
  "meta": {}
}

Please define a minimal response model from the start, for example:

class GraphExtractResponse(BaseModel):
    status: Literal["succeeded"] = "succeeded"
    result: Dict[str, Any]
    warnings: List[str] = Field(default_factory=list)
    meta: Dict[str, Any] = Field(default_factory=dict)

Example response:

{
  "status": "succeeded",
  "result": {
    "vertices": [],
    "edges": []
  },
  "warnings": [],
  "meta": {
    "vertex_count": 0,
    "edge_count": 0,
    "text_count": 1
  }
}

Make the triples contract explicit, or keep this endpoint property-graph only.

The request model currently accepts both "triples" and "property_graph", but the response handling is mainly shaped around vertices and edges.

Please either:

remove "triples" from the public request enum for this PR, or
normalize triples responses to a clear shape, for example:

{
  "status": "succeeded",
  "result": {
    "triples": []
  },
  "warnings": [],
  "meta": {}
}

Avoid reusing GraphConfigRequest directly for this endpoint.

GraphConfigRequest has defaults such as url="127.0.0.1:8080" and graph="hugegraph". For request-scoped graph extraction, it is clearer to avoid implicit defaults and use a dedicated config model:

class GraphExtractClientConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")

    url: Optional[str] = None
    graph: Optional[str] = None
    user: Optional[str] = None
    pwd: Optional[str] = None
    gs: Optional[str] = None

Restrict request-supplied client_config.url.

The endpoint should not allow arbitrary request-supplied HugeGraph URLs. At minimum, either omit request-level URL override in this PR or validate it against the configured HugeGraph server URL. If multiple HugeGraph servers need to be supported, that should be controlled by a server-side allowlist.

Keep the strict request-local graph config checks already added.

The current direction is good:

named schema requires request-local config
client_config.graph must match the schema name
request-scoped connection fields do not fall back field-by-field into global settings

Please keep these checks.

Should add if feasible

Add a small service boundary without broadening this PR.

A tiny GraphExtractService.extract_sync(req) wrapper would make the route thinner and keep scheduler/error/response normalization out of the FastAPI handler. It does not need to introduce any async job or import logic.

Add/update tests for the public API contract.

Please cover:

response envelope shape: status, result, warnings, meta
texts accepting both a string and a list
invalid or arbitrary client_config.url rejected
named schema + mismatched client_config.graph rejected
inline schema + client_config behavior explicitly tested
either property-graph-only behavior, or a real triples response shape if triples remains public
existing /rag, /text2gremlin, and /config/graph routes still register

Nishieee · 2026-06-03T20:22:09Z

@LRriver

thanks for the detailed review. here's what i'm taking and where i landed on the open calls.

api contract:

remove triples, property-graph only for this PR. it never actually worked here: post_deal only returns vertices/edges, and InfoExtract only emits triples when no schema is set, but this flow always sets one. can add it later with a real triples shape.
add GraphExtractResponse (status / result / warnings / meta) instead of returning the raw flow dict.
add a dedicated GraphExtractClientConfig (all-optional, extra="forbid") so the implicit 127.0.0.1:8080 / hugegraph defaults can't kick in.
drop the request-level url override and use the configured server url. can add a server-side allowlist later if multi-server support is needed.
keeping the strict request-local checks already added (named schema requires local config, client_config.graph must match the schema name, no field-by-field fallback into globals).

tests:

response envelope shape (status / result / warnings / meta)
texts accepting both a string and a list
mismatched client_config.graph rejected
inline schema + client_config behavior
property-graph-only output
/rag, /text2gremlin, /config/graph still registering

structural:

moving the models to graph_extract_requests.py / graph_extract_responses.py and the route to graph_extract_api.py, both in this PR.
the GraphExtractService.extract_sync(req) wrapper is the one i'd like your call on: happy to add it here, or land it as a quick follow-up so this PR stays focused on the contract + security fixes. which do you prefer?

Separate graph extraction from the RAG API and harden its request-scoped graph config contract. - move GraphExtractRequest/GraphExtractClientConfig to api/models/graph_extract_requests.py and GraphExtractResponse to api/models/graph_extract_responses.py - move the route to api/graph_extract_api.py (graph_extract_http_api); register it from demo/rag_demo/app.py and drop graph_api.py - return a stable GraphExtractResponse envelope (status/result/ warnings/meta) instead of the raw flow dict - make the endpoint property-graph only (remove "triples" from the enum) - add a dedicated GraphExtractClientConfig (extra="forbid", all-optional, no url field) so 127.0.0.1:8080/hugegraph defaults can't leak in - force the connection url to the configured server url; reject request-supplied urls - keep strict checks: named schema requires client_config, client_config.graph must match the schema name, inline schema + client_config rejected, no field-by-field fallback into globals - rewrite tests as test_graph_extract_api.py covering the public contract Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

LRriver · 2026-06-04T10:27:27Z

@LRriver

thanks for the detailed review. here's what i'm taking and where i landed on the open calls.

api contract:

remove triples, property-graph only for this PR. it never actually worked here: post_deal only returns vertices/edges, and InfoExtract only emits triples when no schema is set, but this flow always sets one. can add it later with a real triples shape.

add GraphExtractResponse (status / result / warnings / meta) instead of returning the raw flow dict.

add a dedicated GraphExtractClientConfig (all-optional, extra="forbid") so the implicit 127.0.0.1:8080 / hugegraph defaults can't kick in.

drop the request-level url override and use the configured server url. can add a server-side allowlist later if multi-server support is needed.

keeping the strict request-local checks already added (named schema requires local config, client_config.graph must match the schema name, no field-by-field fallback into globals).

tests:

response envelope shape (status / result / warnings / meta)

texts accepting both a string and a list

mismatched client_config.graph rejected

inline schema + client_config behavior

property-graph-only output

/rag, /text2gremlin, /config/graph still registering

structural:

moving the models to graph_extract_requests.py / graph_extract_responses.py and the route to graph_extract_api.py, both in this PR.

the GraphExtractService.extract_sync(req) wrapper is the one i'd like your call on: happy to add it here, or land it as a quick follow-up so this PR stays focused on the contract + security fixes. which do you prefer?

I agree with the proposed scope choices:

Keep this PR property-graph only and leave triples for a later PR with a real request/response contract.
Add the GraphExtractResponse envelope instead of returning the raw flow dict.
Use dedicated graph extract request/response/config models instead of reusing RAG or global graph config models.
Drop request-level URL override for this PR.
Move the route and models into graph-extract-specific modules.

The one point I would adjust is the service boundary: please include the thin GraphExtractService.extract_sync(req) wrapper in this PR.

The wrapper should only move the existing synchronous extraction path out of the FastAPI handler: scheduler invocation, raw result parsing, response normalization, and error mapping. It should not add async jobs, graph import, extract-and-import, or a broader service abstraction in this PR. This keeps the public route small and makes the API behavior easier to test before merge.

A few details should also be covered before merging:

Since request-level URL override is being dropped, please make sure url is not accepted in GraphExtractClientConfig; with extra="forbid", a request containing client_config.url should fail validation instead of being silently ignored.
Inline schema validation should validate the actual schema shape, not only the presence of vertexlabels and edgelabels. Reusing CheckSchema(deepcopy(schema)).run() or the existing schema validation path in the request validator would keep invalid schema input as a 422 before the scheduler runs.
WkFlowInput.reset() should clear graph_client_config to avoid leaking request-scoped connection settings across reused flow inputs.

With those changes, the PR can stay focused while still landing a clean public API boundary.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

- retain graph extraction split_type validation from main - preserve request-scoped HugeGraph client config from PR - add regression coverage for build_flow parameter forwarding

- allow the vector index test scheduler stub to record keyword arguments - assert graph extraction PDF entrypoint forwards the default split type - keep coverage focused on the main and PR contract merge

imbajin

Thanks for your contribution

Add POST /graph/extract REST API for programmatic graph extraction.

e9db218

dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels May 31, 2026

github-actions Bot added the llm label May 31, 2026

imbajin requested a review from Copilot May 31, 2026 13:38

Copilot started reviewing on behalf of imbajin May 31, 2026 13:39 View session

Copilot AI reviewed May 31, 2026

View reviewed changes

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated

Nishieee and others added 2 commits May 31, 2026 18:31

Potential fix for pull request finding

73d983c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Validate graph schema shape at request layer

a71dc3c

imbajin reviewed Jun 1, 2026

View reviewed changes

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated

Add request-scoped graph config to graph extract API.

28d9afc

imbajin reviewed Jun 2, 2026

View reviewed changes

Comment thread hugegraph-llm/src/hugegraph_llm/api/graph_api.py Outdated

fix(llm): make /graph/extract graph config request-scoped

92fef67

imbajin requested a review from Copilot June 2, 2026 03:21

Copilot started reviewing on behalf of imbajin June 2, 2026 03:22 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

imbajin reviewed Jun 2, 2026

View reviewed changes

Comment thread hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py Outdated

Nishieee requested a review from imbajin June 3, 2026 04:14

LRriver reviewed Jun 3, 2026

View reviewed changes

imbajin requested a review from Copilot June 4, 2026 03:21

Copilot started reviewing on behalf of imbajin June 4, 2026 03:21 View session

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_requests.py Outdated

Comment thread hugegraph-llm/src/hugegraph_llm/state/ai_state.py

Nishieee and others added 3 commits June 4, 2026 13:03

Potential fix for pull request finding

c23ed08

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

feat(llm): add GraphExtractService + harden /graph/extract validation

808a668

fix(llm): merge main into graph extract API

ea5c3ed

- retain graph extraction split_type validation from main - preserve request-scoped HugeGraph client config from PR - add regression coverage for build_flow parameter forwarding

fix(llm): update graph extract test scheduler

0519277

- allow the vector index test scheduler stub to record keyword arguments - assert graph extraction PDF entrypoint forwards the default split type - keep coverage focused on the main and PR contract merge

imbajin approved these changes Jun 8, 2026

View reviewed changes

dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 8, 2026

imbajin merged commit eb8af86 into apache:main Jun 8, 2026
24 of 32 checks passed

imbajin changed the title ~~Add POST /graph/extract REST API for programmatic graph extraction.~~ feat(llm): add /graph/extract API for programmatic graph extraction Jun 8, 2026

Conversation

Nishieee commented May 31, 2026

Summary

Key changes

Example request

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LRriver left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Strongly Recommended API Contract Adjustments

Should add if feasible

Uh oh!

Nishieee commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

LRriver commented Jun 4, 2026

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

LRriver left a comment •

edited

Loading