feat(llm): add /graph/extract API for programmatic graph extraction#351
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a public FastAPI endpoint for graph extraction so clients can programmatically invoke the existing GRAPH_EXTRACT scheduler flow instead of using only the Gradio demo.
Changes:
- Adds
/graph/extractAPI wiring and request/response handling. - Adds
GraphExtractRequestvalidation and tests for routing, validation, and scheduler errors. - Makes
split_typeconfigurable inGraphExtractFlowwhile preserving the default.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
hugegraph-llm/src/hugegraph_llm/api/graph_api.py |
Adds the new graph extraction REST endpoint. |
hugegraph-llm/src/hugegraph_llm/api/models/rag_requests.py |
Adds request model and input normalization for graph extraction. |
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py |
Threads configurable split_type into graph extraction flow preparation. |
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py |
Registers the graph extraction API router with the existing app. |
hugegraph-llm/src/tests/api/test_graph_api.py |
Adds API tests for successful extraction, validation failures, errors, and route registration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
imbajin
left a comment
There was a problem hiding this comment.
Blocking: yes. Summary: The new graph-name extraction path needs request-scoped graph configuration before it is safe for programmatic use. Evidence: static review; targeted graph API tests passed.
imbajin
left a comment
There was a problem hiding this comment.
Blocking: yes. Summary: request-scoped graph config still mutates shared HugeGraph settings. Evidence: static review of hugegraph-llm/src/hugegraph_llm/api/graph_api.py lines 29-35.
| @model_validator(mode="after") | ||
| def require_client_config_for_named_schema(self): | ||
| # A named-graph schema needs request-scoped connection settings; inline JSON | ||
| # schemas (starting with "{") are self-contained and never hit HugeGraph. | ||
| schema = self.graph_schema | ||
| is_named_schema = isinstance(schema, str) and not schema.strip().startswith("{") | ||
| if is_named_schema and self.client_config is None: | ||
| raise ValueError( | ||
| "client_config is required when 'schema' refers to an existing graph name; " | ||
| "provide inline schema JSON instead to extract without a HugeGraph connection." | ||
| ) | ||
| return self |
Reject client_config when 'schema' is inline JSON (it never connects to HugeGraph, so it was silently ignored), and require client_config.graph to match a named-graph schema. Also fix GraphConfigRequest.gs to be Optional[str]. Adds tests for both rejection paths and the triples extract_type forwarding. Co-authored-by: Cursor <cursoragent@cursor.com>
imbajin
left a comment
There was a problem hiding this comment.
Blocking: yes. Summary: The named-schema request path can still inherit a stale global graphspace when client_config.gs is omitted. Evidence: static review of SchemaManager graphspace fallback plus targeted graph API/schema-manager tests passing.
Omitting client_config.gs no longer inherits the global huge_settings graphspace. WkFlowInput now carries the whole connection as one dict (None = use globals), and SchemaManager applies it wholesale instead of per-field None fallback. Adds tests for the gs-omitted case with a non-empty global graphspace and the no-connection fallback path. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Suggested changes before merging:
Strongly Recommended API Contract Adjustments
- Move the graph extraction request/response models out of the RAG request module.
GraphExtractRequest currently lives in hugegraph_llm/api/models/rag_requests.py, but graph extraction is a separate API surface from RAG. Please move the graph extraction models into dedicated files, for example:
# hugegraph_llm/api/models/graph_extract_requests.py
class GraphExtractRequest(BaseModel):
...
# hugegraph_llm/api/models/graph_extract_responses.py
class GraphExtractResponse(BaseModel):
...- Move the route registration into a dedicated graph extraction API module.
Instead of keeping this endpoint in a generic graph_api.py, please use a dedicated module such as:
# hugegraph_llm/api/graph_extract_api.py
def graph_extract_http_api(router: APIRouter):
...Then demo/rag_demo/app.py can register graph_extract_http_api(api_auth). This keeps the graph extraction API boundary separate from RAG and leaves a clear place for related graph extraction endpoints.
- Define a stable response envelope instead of returning the raw flow result.
The endpoint currently returns a raw flow-shaped response like:
{
"vertices": [],
"edges": [],
"meta": {}
}Please define a minimal response model from the start, for example:
class GraphExtractResponse(BaseModel):
status: Literal["succeeded"] = "succeeded"
result: Dict[str, Any]
warnings: List[str] = Field(default_factory=list)
meta: Dict[str, Any] = Field(default_factory=dict)Example response:
{
"status": "succeeded",
"result": {
"vertices": [],
"edges": []
},
"warnings": [],
"meta": {
"vertex_count": 0,
"edge_count": 0,
"text_count": 1
}
}- Make the
triplescontract explicit, or keep this endpoint property-graph only.
The request model currently accepts both "triples" and "property_graph", but the response handling is mainly shaped around vertices and edges.
Please either:
- remove
"triples"from the public request enum for this PR, or - normalize triples responses to a clear shape, for example:
{
"status": "succeeded",
"result": {
"triples": []
},
"warnings": [],
"meta": {}
}- Avoid reusing
GraphConfigRequestdirectly for this endpoint.
GraphConfigRequest has defaults such as url="127.0.0.1:8080" and graph="hugegraph". For request-scoped graph extraction, it is clearer to avoid implicit defaults and use a dedicated config model:
class GraphExtractClientConfig(BaseModel):
model_config = ConfigDict(extra="forbid")
url: Optional[str] = None
graph: Optional[str] = None
user: Optional[str] = None
pwd: Optional[str] = None
gs: Optional[str] = None- Restrict request-supplied
client_config.url.
The endpoint should not allow arbitrary request-supplied HugeGraph URLs. At minimum, either omit request-level URL override in this PR or validate it against the configured HugeGraph server URL. If multiple HugeGraph servers need to be supported, that should be controlled by a server-side allowlist.
- Keep the strict request-local graph config checks already added.
The current direction is good:
- named schema requires request-local config
client_config.graphmust match the schema name- request-scoped connection fields do not fall back field-by-field into global settings
Please keep these checks.
Should add if feasible
- Add a small service boundary without broadening this PR.
A tiny GraphExtractService.extract_sync(req) wrapper would make the route thinner and keep scheduler/error/response normalization out of the FastAPI handler. It does not need to introduce any async job or import logic.
- Add/update tests for the public API contract.
Please cover:
- response envelope shape:
status,result,warnings,meta textsaccepting both a string and a list- invalid or arbitrary
client_config.urlrejected - named schema + mismatched
client_config.graphrejected - inline schema +
client_configbehavior explicitly tested - either property-graph-only behavior, or a real
triplesresponse shape iftriplesremains public - existing
/rag,/text2gremlin, and/config/graphroutes still register
|
thanks for the detailed review. here's what i'm taking and where i landed on the open calls. api contract:
tests:
structural:
|
Separate graph extraction from the RAG API and harden its request-scoped graph config contract. - move GraphExtractRequest/GraphExtractClientConfig to api/models/graph_extract_requests.py and GraphExtractResponse to api/models/graph_extract_responses.py - move the route to api/graph_extract_api.py (graph_extract_http_api); register it from demo/rag_demo/app.py and drop graph_api.py - return a stable GraphExtractResponse envelope (status/result/ warnings/meta) instead of the raw flow dict - make the endpoint property-graph only (remove "triples" from the enum) - add a dedicated GraphExtractClientConfig (extra="forbid", all-optional, no url field) so 127.0.0.1:8080/hugegraph defaults can't leak in - force the connection url to the configured server url; reject request-supplied urls - keep strict checks: named schema requires client_config, client_config.graph must match the schema name, inline schema + client_config rejected, no field-by-field fallback into globals - rewrite tests as test_graph_extract_api.py covering the public contract Co-authored-by: Cursor <cursoragent@cursor.com>
I agree with the proposed scope choices:
The one point I would adjust is the service boundary: please include the thin The wrapper should only move the existing synchronous extraction path out of the FastAPI handler: scheduler invocation, raw result parsing, response normalization, and error mapping. It should not add async jobs, graph import, extract-and-import, or a broader service abstraction in this PR. This keeps the public route small and makes the API behavior easier to test before merge. A few details should also be covered before merging:
With those changes, the PR can stay focused while still landing a clean public API boundary. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- retain graph extraction split_type validation from main - preserve request-scoped HugeGraph client config from PR - add regression coverage for build_flow parameter forwarding
- allow the vector index test scheduler stub to record keyword arguments - assert graph extraction PDF entrypoint forwards the default split type - keep coverage focused on the main and PR contract merge
Summary
Closes #348.
HugeGraph-LLM already supports graph extraction through the Gradio demo, but there was no public REST endpoint for it. This PR adds
POST /graph/extractto the existing FastAPI app, routing requests throughSchedulerSingletonandFlowName.GRAPH_EXTRACT— the same path the demo uses.Key changes
GraphExtractRequestwith validation fortexts,schema,split_type, and related optionsgraph_http_apiand register it on the existing auth routersplit_typeconfigurable inGraphExtractFlow(default"document", so demo behavior is unchanged)vertices/edgesas arrays), with optionalwarningandmetaExample request
{ "texts": "Sarah is 30 and works as an attorney.", "schema": { "vertexlabels": [], "edgelabels": [], "propertykeys": [] }, "split_type": "document", "include_meta": true }Invalid or empty input returns
422; scheduler failures return500.Test plan
cd hugegraph-llm && SKIP_EXTERNAL_SERVICES=true uv run pytest src/tests/api/test_graph_api.py -v --tb=short/rag,/text2gremlin,/config/graph, and/graph/extractall registercurlagainst running app with extract LLM configured