feat(llm): add /graph/extract API for programmatic graph extraction #351

Merged: imbajin merged 12 commits into apache:main from Nishieee:feat/graph-extract-rest-api on Jun 8, 2026

Conversation

@Nishieee
Contributor

Summary

Closes #348.

HugeGraph-LLM already supports graph extraction through the Gradio demo, but there was no public REST endpoint for it. This PR adds POST /graph/extract to the existing FastAPI app, routing requests through SchedulerSingleton and FlowName.GRAPH_EXTRACT — the same path the demo uses.

Key changes

  • Add GraphExtractRequest with validation for texts, schema, split_type, and related options
  • Add graph_http_api and register it on the existing auth router
  • Make split_type configurable in GraphExtractFlow (default "document", so demo behavior is unchanged)
  • Return structured JSON (vertices / edges as arrays), with optional warning and meta

Example request

{
  "texts": "Sarah is 30 and works as an attorney.",
  "schema": { "vertexlabels": [], "edgelabels": [], "propertykeys": [] },
  "split_type": "document",
  "include_meta": true
}

Invalid or empty input returns 422; scheduler failures return 500.
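
A client call for the example request above might look like the following sketch. It only builds the HTTP request with the Python standard library and does not send it. The base URL `http://127.0.0.1:8001` and the helper name `build_extract_request` are assumptions for illustration, and any credentials the auth router expects are left out.

```python
import json
import urllib.request

# Hypothetical base URL; point this at wherever the HugeGraph-LLM FastAPI app runs.
BASE_URL = "http://127.0.0.1:8001"


def build_extract_request(texts, schema, split_type="document", include_meta=False):
    """Build (but do not send) a POST /graph/extract request."""
    payload = {
        "texts": texts,
        "schema": schema,
        "split_type": split_type,
        "include_meta": include_meta,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/graph/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_extract_request(
    "Sarah is 30 and works as an attorney.",
    {"vertexlabels": [], "edgelabels": [], "propertykeys": []},
    include_meta=True,
)
# To send it against a running app: urllib.request.urlopen(req)
```

Per the description, a 422 would indicate invalid or empty input and a 500 a scheduler failure, so a caller would probably want to handle `urllib.error.HTTPError` for both.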

Test plan

  • cd hugegraph-llm && SKIP_EXTERNAL_SERVICES=true uv run pytest src/tests/api/test_graph_api.py -v --tb=short
  • Regression check: /rag, /text2gremlin, /config/graph, and /graph/extract all register
  • Ruff format and lint pass
  • Manual curl against running app with extract LLM configured
  • Compare API output with Gradio graph extraction on the same input

@dosubot dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files) and enhancement (New feature or request) labels May 31, 2026
@github-actions github-actions Bot added the llm label May 31, 2026
@imbajin imbajin requested a review from Copilot May 31, 2026 13:38
Contributor

Copilot AI left a comment

Pull request overview

Adds a public FastAPI endpoint for graph extraction so clients can programmatically invoke the existing GRAPH_EXTRACT scheduler flow instead of using only the Gradio demo.

Changes:

  • Adds /graph/extract API wiring and request/response handling.
  • Adds GraphExtractRequest validation and tests for routing, validation, and scheduler errors.
  • Makes split_type configurable in GraphExtractFlow while preserving the default.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| hugegraph-llm/src/hugegraph_llm/api/graph_api.py | Adds the new graph extraction REST endpoint. |
| hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py | Adds request model and input normalization for graph extraction. |
| hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py | Threads configurable split_type into graph extraction flow preparation. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py | Registers the graph extraction API router with the existing app. |
| hugegraph-llm/src/tests/api/test_graph_api.py | Adds API tests for successful extraction, validation failures, errors, and route registration. |

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated
Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated
Nishieee and others added 2 commits May 31, 2026 18:31
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Member

@imbajin imbajin left a comment

Blocking: yes. Summary: The new graph-name extraction path needs request-scoped graph configuration before it is safe for programmatic use. Evidence: static review; targeted graph API tests passed.

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py Outdated
Member

@imbajin imbajin left a comment

Blocking: yes. Summary: request-scoped graph config still mutates shared HugeGraph settings. Evidence: static review of hugegraph-llm/src/hugegraph_llm/api/graph_api.py lines 29-35.

Comment thread hugegraph-llm/src/hugegraph_llm/api/graph_api.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines +220 to +231
@model_validator(mode="after")
def require_client_config_for_named_schema(self):
    # A named-graph schema needs request-scoped connection settings; inline JSON
    # schemas (starting with "{") are self-contained and never hit HugeGraph.
    schema = self.graph_schema
    is_named_schema = isinstance(schema, str) and not schema.strip().startswith("{")
    if is_named_schema and self.client_config is None:
        raise ValueError(
            "client_config is required when 'schema' refers to an existing graph name; "
            "provide inline schema JSON instead to extract without a HugeGraph connection."
        )
    return self

Reject client_config when 'schema' is inline JSON (it never connects to
HugeGraph, so it was silently ignored), and require client_config.graph
to match a named-graph schema. Also fix GraphConfigRequest.gs to be
Optional[str]. Adds tests for both rejection paths and the triples
extract_type forwarding.

Co-authored-by: Cursor <cursoragent@cursor.com>
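
The three rejection rules described in this commit (named schema without client_config, named schema with a mismatched client_config.graph, and inline schema with a client_config) can be sketched in plain Python as below. The PR implements them inside a Pydantic `model_validator`; the function names here, and the choice to compare against the stripped schema string, are assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Union


def is_named_schema(schema: Union[str, Dict[str, Any]]) -> bool:
    """A string that does not start with '{' is treated as an existing graph name."""
    return isinstance(schema, str) and not schema.strip().startswith("{")


def check_schema_and_client_config(
    schema: Union[str, Dict[str, Any]],
    client_config: Optional[Dict[str, Any]],
) -> None:
    """Raise ValueError for the combinations the commit message says are rejected."""
    if is_named_schema(schema):
        if client_config is None:
            raise ValueError("client_config is required when 'schema' is a graph name")
        # Assumption: the graph name is compared after stripping whitespace.
        if client_config.get("graph") != schema.strip():
            raise ValueError("client_config.graph must match the named schema")
    elif client_config is not None:
        # Inline JSON never connects to HugeGraph, so a config would be silently ignored.
        raise ValueError("client_config is not allowed with an inline schema")


# Accepted combinations: inline schema alone, or a graph name with a matching config.
check_schema_and_client_config('{"vertexlabels": [], "edgelabels": []}', None)
check_schema_and_client_config("hugegraph", {"graph": "hugegraph"})
```

In a FastAPI route backed by a Pydantic model, a `ValueError` raised from a validator surfaces as a 422 response, which matches the behavior the PR description gives for invalid input.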
Member

@imbajin imbajin left a comment

Blocking: yes. Summary: The named-schema request path can still inherit a stale global graphspace when client_config.gs is omitted. Evidence: static review of SchemaManager graphspace fallback plus targeted graph API/schema-manager tests passing.

Comment thread hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py Outdated
Omitting client_config.gs no longer inherits the global huge_settings
graphspace. WkFlowInput now carries the whole connection as one dict
(None = use globals), and SchemaManager applies it wholesale instead of
per-field None fallback. Adds tests for the gs-omitted case with a
non-empty global graphspace and the no-connection fallback path.

Co-authored-by: Cursor <cursoragent@cursor.com>
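
The difference between per-field fallback and applying the connection wholesale can be illustrated with a small sketch. `GLOBAL_SETTINGS` stands in for the global huge_settings, and its values, the field tuple, and both function names are made up for the example; this is not the SchemaManager code.

```python
from typing import Any, Dict, Optional

# Stand-in for the global settings; values are invented for this example.
GLOBAL_SETTINGS: Dict[str, Any] = {
    "url": "127.0.0.1:8080",
    "graph": "hugegraph",
    "user": "admin",
    "pwd": "secret",
    "gs": "stale_space",
}
FIELDS = ("url", "graph", "user", "pwd", "gs")


def resolve_per_field(request_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Old behavior: every field that is missing or None falls back to the global value."""
    resolved = {}
    for key in FIELDS:
        value = request_cfg.get(key)
        resolved[key] = GLOBAL_SETTINGS[key] if value is None else value
    return resolved


def resolve_wholesale(request_cfg: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """New behavior: None means 'use globals'; otherwise the request config is used as-is."""
    if request_cfg is None:
        return dict(GLOBAL_SETTINGS)
    return {key: request_cfg.get(key) for key in FIELDS}


# A request-scoped config that omits the graphspace.
cfg = {"url": "127.0.0.1:8080", "graph": "g1", "user": "u", "pwd": "p"}
leaky = resolve_per_field(cfg)   # inherits the global graphspace
clean = resolve_wholesale(cfg)   # leaves the graphspace unset
```

With the per-field version the omitted `gs` silently picks up the global graphspace, which is the stale-graphspace case flagged in the review; the wholesale version keeps it unset.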
@Nishieee Nishieee requested a review from imbajin June 3, 2026 04:14
Contributor

@LRriver LRriver left a comment

Suggested changes before merging:

Strongly Recommended API Contract Adjustments

  1. Move the graph extraction request/response models out of the RAG request module.

GraphExtractRequest currently lives in hugegraph_llm/api/models/rag_requests.py, but graph extraction is a separate API surface from RAG. Please move the graph extraction models into dedicated files, for example:

# hugegraph_llm/api/models/graph_extract_requests.py
class GraphExtractRequest(BaseModel):
    ...

# hugegraph_llm/api/models/graph_extract_responses.py
class GraphExtractResponse(BaseModel):
    ...

  2. Move the route registration into a dedicated graph extraction API module.

Instead of keeping this endpoint in a generic graph_api.py, please use a dedicated module such as:

# hugegraph_llm/api/graph_extract_api.py
def graph_extract_http_api(router: APIRouter):
    ...

Then demo/rag_demo/app.py can register graph_extract_http_api(api_auth). This keeps the graph extraction API boundary separate from RAG and leaves a clear place for related graph extraction endpoints.

  3. Define a stable response envelope instead of returning the raw flow result.

The endpoint currently returns a raw flow-shaped response like:

{
  "vertices": [],
  "edges": [],
  "meta": {}
}

Please define a minimal response model from the start, for example:

class GraphExtractResponse(BaseModel):
    status: Literal["succeeded"] = "succeeded"
    result: Dict[str, Any]
    warnings: List[str] = Field(default_factory=list)
    meta: Dict[str, Any] = Field(default_factory=dict)

Example response:

{
  "status": "succeeded",
  "result": {
    "vertices": [],
    "edges": []
  },
  "warnings": [],
  "meta": {
    "vertex_count": 0,
    "edge_count": 0,
    "text_count": 1
  }
}

  4. Make the triples contract explicit, or keep this endpoint property-graph only.

The request model currently accepts both "triples" and "property_graph", but the response handling is mainly shaped around vertices and edges.

Please either:

  • remove "triples" from the public request enum for this PR, or
  • normalize triples responses to a clear shape, for example:
{
  "status": "succeeded",
  "result": {
    "triples": []
  },
  "warnings": [],
  "meta": {}
}

  5. Avoid reusing GraphConfigRequest directly for this endpoint.

GraphConfigRequest has defaults such as url="127.0.0.1:8080" and graph="hugegraph". For request-scoped graph extraction, it is clearer to avoid implicit defaults and use a dedicated config model:

class GraphExtractClientConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")

    url: Optional[str] = None
    graph: Optional[str] = None
    user: Optional[str] = None
    pwd: Optional[str] = None
    gs: Optional[str] = None

  6. Restrict request-supplied client_config.url.

The endpoint should not allow arbitrary request-supplied HugeGraph URLs. At minimum, either omit request-level URL override in this PR or validate it against the configured HugeGraph server URL. If multiple HugeGraph servers need to be supported, that should be controlled by a server-side allowlist.

  7. Keep the strict request-local graph config checks already added.

The current direction is good:

  • named schema requires request-local config
  • client_config.graph must match the schema name
  • request-scoped connection fields do not fall back field-by-field into global settings

Please keep these checks.
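
For the response envelope suggested in item 3, the normalization from a raw flow result could be sketched as follows. The function name `to_envelope` and the way the meta counts are derived are assumptions for illustration; only the envelope keys and the example values come from the suggestion above.

```python
from typing import Any, Dict, List, Optional


def to_envelope(
    raw: Dict[str, Any],
    text_count: int,
    warnings: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Wrap a raw flow result ({"vertices": [...], "edges": [...]}) in the envelope."""
    # Treat missing or None collections as empty so the shape is always stable.
    vertices = raw.get("vertices") or []
    edges = raw.get("edges") or []
    return {
        "status": "succeeded",
        "result": {"vertices": vertices, "edges": edges},
        "warnings": list(warnings or []),
        "meta": {
            "vertex_count": len(vertices),
            "edge_count": len(edges),
            "text_count": text_count,
        },
    }


envelope = to_envelope({"vertices": [], "edges": [], "meta": {}}, text_count=1)
```

Keeping this mapping in one place means the public shape stays the same even if the flow's internal result changes.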

Should add if feasible

  1. Add a small service boundary without broadening this PR.

A tiny GraphExtractService.extract_sync(req) wrapper would make the route thinner and keep scheduler/error/response normalization out of the FastAPI handler. It does not need to introduce any async job or import logic.

  2. Add/update tests for the public API contract.

Please cover:

  • response envelope shape: status, result, warnings, meta
  • texts accepting both a string and a list
  • invalid or arbitrary client_config.url rejected
  • named schema + mismatched client_config.graph rejected
  • inline schema + client_config behavior explicitly tested
  • either property-graph-only behavior, or a real triples response shape if triples remains public
  • existing /rag, /text2gremlin, and /config/graph routes still register
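
If request-supplied URLs were kept and validated against a server-side allowlist (one of the options in item 6), the check might look like this sketch. The helper names and the host-and-port comparison are assumptions; the review does not prescribe how the match should be done, and the follow-up in this thread drops the url field instead.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlsplit


def host_port(url: str) -> Tuple[str, Optional[int]]:
    """Return (hostname, port) for a URL that may be written without a scheme."""
    # HugeGraph URLs are often written like "127.0.0.1:8080", so add a scheme first.
    parts = urlsplit(url if "://" in url else f"http://{url}")
    # Accessing .port raises ValueError when the port is not a valid integer.
    return (parts.hostname or "").lower(), parts.port


def is_allowed_url(requested: str, allowed: Iterable[str]) -> bool:
    """True only if the requested host and port match an allowlisted server."""
    try:
        return host_port(requested) in {host_port(u) for u in allowed}
    except ValueError:
        return False


ALLOWED = ["127.0.0.1:8080"]  # hypothetical server-side allowlist
ok = is_allowed_url("http://127.0.0.1:8080", ALLOWED)
rejected = is_allowed_url("10.0.0.5:8080", ALLOWED)
```

A test for "invalid or arbitrary client_config.url rejected" could then assert that a request carrying a non-allowlisted URL fails validation before the scheduler runs.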

@Nishieee
Contributor Author

Nishieee commented Jun 3, 2026

@LRriver

thanks for the detailed review. here's what i'm taking and where i landed on the open calls.

api contract:

  • remove triples, property-graph only for this PR. it never actually worked here: post_deal only returns vertices/edges, and InfoExtract only emits triples when no schema is set, but this flow always sets one. can add it later with a real triples shape.
  • add GraphExtractResponse (status / result / warnings / meta) instead of returning the raw flow dict.
  • add a dedicated GraphExtractClientConfig (all-optional, extra="forbid") so the implicit 127.0.0.1:8080 / hugegraph defaults can't kick in.
  • drop the request-level url override and use the configured server url. can add a server-side allowlist later if multi-server support is needed.
  • keeping the strict request-local checks already added (named schema requires local config, client_config.graph must match the schema name, no field-by-field fallback into globals).

tests:

  • response envelope shape (status / result / warnings / meta)
  • texts accepting both a string and a list
  • mismatched client_config.graph rejected
  • inline schema + client_config behavior
  • property-graph-only output
  • /rag, /text2gremlin, /config/graph still registering

structural:

  • moving the models to graph_extract_requests.py / graph_extract_responses.py and the route to graph_extract_api.py, both in this PR.
  • the GraphExtractService.extract_sync(req) wrapper is the one i'd like your call on: happy to add it here, or land it as a quick follow-up so this PR stays focused on the contract + security fixes. which do you prefer?

Separate graph extraction from the RAG API and harden its
request-scoped graph config contract.

- move GraphExtractRequest/GraphExtractClientConfig to
  api/models/graph_extract_requests.py and GraphExtractResponse to
  api/models/graph_extract_responses.py
- move the route to api/graph_extract_api.py (graph_extract_http_api);
  register it from demo/rag_demo/app.py and drop graph_api.py
- return a stable GraphExtractResponse envelope (status/result/
  warnings/meta) instead of the raw flow dict
- make the endpoint property-graph only (remove "triples" from the enum)
- add a dedicated GraphExtractClientConfig (extra="forbid", all-optional,
  no url field) so 127.0.0.1:8080/hugegraph defaults can't leak in
- force the connection url to the configured server url; reject
  request-supplied urls
- keep strict checks: named schema requires client_config,
  client_config.graph must match the schema name, inline schema +
  client_config rejected, no field-by-field fallback into globals
- rewrite tests as test_graph_extract_api.py covering the public contract

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_requests.py Outdated
Comment thread hugegraph-llm/src/hugegraph_llm/state/ai_state.py
@LRriver
Contributor

LRriver commented Jun 4, 2026

I agree with the proposed scope choices:

  • Keep this PR property-graph only and leave triples for a later PR with a real request/response contract.
  • Add the GraphExtractResponse envelope instead of returning the raw flow dict.
  • Use dedicated graph extract request/response/config models instead of reusing RAG or global graph config models.
  • Drop request-level URL override for this PR.
  • Move the route and models into graph-extract-specific modules.

The one point I would adjust is the service boundary: please include the thin GraphExtractService.extract_sync(req) wrapper in this PR.

The wrapper should only move the existing synchronous extraction path out of the FastAPI handler: scheduler invocation, raw result parsing, response normalization, and error mapping. It should not add async jobs, graph import, extract-and-import, or a broader service abstraction in this PR. This keeps the public route small and makes the API behavior easier to test before merge.
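
A thin wrapper along these lines might look like the sketch below. The `run_flow` callable stands in for the real scheduler call, whose signature is not shown in this thread, and `GraphExtractError` is a hypothetical exception that the route could map to a 500. The sketch covers only the four responsibilities listed above.

```python
import json
from typing import Any, Callable, Dict, List, Union


class GraphExtractError(Exception):
    """Raised when the flow fails or returns unparseable output."""


class GraphExtractService:
    def __init__(self, run_flow: Callable[..., Any]):
        # run_flow is an assumed stand-in for the scheduler invocation.
        self._run_flow = run_flow

    def extract_sync(self, texts: Union[str, List[str]], schema: Any) -> Dict[str, Any]:
        text_list = [texts] if isinstance(texts, str) else list(texts)
        try:
            raw = self._run_flow(texts=text_list, schema=schema)
            # Assumption: the flow may hand back either a dict or a JSON string.
            if isinstance(raw, str):
                raw = json.loads(raw)
        except Exception as exc:
            # Error mapping: callers see one exception type regardless of the cause.
            raise GraphExtractError(str(exc)) from exc
        vertices = raw.get("vertices") or []
        edges = raw.get("edges") or []
        return {
            "status": "succeeded",
            "result": {"vertices": vertices, "edges": edges},
            "warnings": [],
            "meta": {
                "vertex_count": len(vertices),
                "edge_count": len(edges),
                "text_count": len(text_list),
            },
        }


# Usage with a fake flow, as a unit test might do.
fake_flow = lambda **kwargs: json.dumps({"vertices": [{"label": "person"}], "edges": []})
out = GraphExtractService(fake_flow).extract_sync("Sarah is 30.", schema={})
```

Because the scheduler is injected, the service can be tested without FastAPI or a running LLM, which is the testability benefit mentioned above.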

A few details should also be covered before merging:

  • Since request-level URL override is being dropped, please make sure url is not accepted in GraphExtractClientConfig; with extra="forbid", a request containing client_config.url should fail validation instead of being silently ignored.
  • Inline schema validation should validate the actual schema shape, not only the presence of vertexlabels and edgelabels. Reusing CheckSchema(deepcopy(schema)).run() or the existing schema validation path in the request validator would keep invalid schema input as a 422 before the scheduler runs.
  • WkFlowInput.reset() should clear graph_client_config to avoid leaking request-scoped connection settings across reused flow inputs.

With those changes, the PR can stay focused while still landing a clean public API boundary.
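
The reset concern in the last bullet can be illustrated with a simplified stand-in for WkFlowInput. The class below models only three fields and is not the project's actual class; the defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FlowInput:
    """Simplified stand-in for WkFlowInput; the real class has more fields."""

    texts: List[str] = field(default_factory=list)
    split_type: str = "document"
    graph_client_config: Optional[Dict[str, Any]] = None

    def reset(self) -> None:
        self.texts = []
        self.split_type = "document"
        # Without this line a reused instance would keep the previous
        # request's connection settings.
        self.graph_client_config = None


shared = FlowInput()
shared.graph_client_config = {"graph": "tenant_a", "user": "u", "pwd": "p"}
shared.reset()
```

After `reset()`, a later request that sends no client_config sees `None` and falls back to the global settings rather than another request's connection.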

Nishieee and others added 3 commits June 4, 2026 13:03
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- retain graph extraction split_type validation from main

- preserve request-scoped HugeGraph client config from PR

- add regression coverage for build_flow parameter forwarding
- allow the vector index test scheduler stub to record keyword arguments

- assert graph extraction PDF entrypoint forwards the default split type

- keep coverage focused on the main and PR contract merge
Member

@imbajin imbajin left a comment

Thanks for your contribution

@dosubot dosubot Bot added the lgtm (This PR has been approved by a maintainer) label Jun 8, 2026
@imbajin imbajin merged commit eb8af86 into apache:main Jun 8, 2026
24 of 32 checks passed
@imbajin imbajin changed the title from "Add POST /graph/extract REST API for programmatic graph extraction." to "feat(llm): add /graph/extract API for programmatic graph extraction" Jun 8, 2026

Labels

enhancement (New feature or request), lgtm (This PR has been approved by a maintainer), llm, size:L (This PR changes 100-499 lines, ignoring generated files)

Development

Successfully merging this pull request may close these issues.

[Feature] Add a public REST API for graph extraction

4 participants