feat: support PDF uploads for RAG documents by nw9663644-eng · Pull Request #356 · apache/hugegraph-ai

nw9663644-eng · 2026-06-01T15:19:56Z

Summary

Support text-based PDF uploads in the RAG document upload path.

Changes

Add PDF text extraction support in read_documents()
Use pypdf to extract text from PDF files page by page
Handle encrypted, unreadable, and scanned-image-only PDFs with clear Gradio errors
Keep existing TXT and DOCX behavior unchanged
Update the demo upload copy to mention TXT, DOCX, and PDF

Test

python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py

Closes #345

nw9663644-eng · 2026-06-01T16:12:04Z

I have covered the requested scope and suggested tests.

Current test coverage includes:

TXT file reading regression
DOCX file reading regression
text-based PDF reading
PDF files without extractable text
unreadable PDF behavior
encrypted PDF behavior
unsupported file type behavior

The implementation also covers:

adding pypdf to hugegraph-llm/pyproject.toml
extracting PDF text page by page in stable order
replacing the previous PDF TODO error path
keeping existing TXT and DOCX behavior unchanged
updating the demo upload copy to mention TXT, DOCX, and PDF

I also checked the dependency lock situation. The repository did not have an existing uv.lock file before running uv lock; running it locally generated a new root-level uv.lock. To avoid introducing a large new lock file unrelated to this focused change, I did not include it in this PR.

Local checks:

python -m py_compile hugegraph-llm/src/tests/test_vector_index_utils.py
python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
git diff upstream/main --check

Copilot

Pull request overview

Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).

Changes:

Add pypdf-based PDF text extraction (read_pdf_text()) and wire it into read_documents().
Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
Update demo UI copy to include PDF alongside TXT and DOCX.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
hugegraph-llm/src/tests/test_vector_index_utils.py	Adds unit tests for `read_documents()` across TXT/DOCX/PDF and PDF error cases.
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py	Implements PDF text extraction via `pypdf` and updates extension handling + error messages.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py	Updates upload instructions to mention PDF support.
hugegraph-llm/pyproject.toml	Adds `pypdf` dependency required for PDF parsing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nw9663644-eng · 2026-06-02T06:23:37Z

+    try:
+        reader = PdfReader(full_path)
+
+        if reader.is_encrypted:
+            raise gr.Error(
+                "Encrypted PDF files are not supported. "
+                "Please upload an unencrypted PDF."
+            )
+
+        page_texts = []
+        for page in reader.pages:
+            page_text = page.extract_text() or ""
+            if page_text.strip():
+                page_texts.append(page_text)
+
+        text = "\n".join(page_texts).strip()
+        if not text:
+            raise gr.Error(
+                "No extractable text was found in this PDF. "
+                "Scanned-image PDFs are not supported without OCR."
+            )
+
+        return text


I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

+    for offset in offsets:
+        pdf += f"{offset:010d} 00000 n \n".encode()
+
+    pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()


+def test_read_documents_reads_txt_file(tmp_path):
+    txt_path = tmp_path / "sample.txt"
+    txt_path.write_text("hello hugegraph", encoding="utf-8")
+
+    result = read_documents([SimpleNamespace(name=str(txt_path))], "")


imbajin

Blocking: no. Summary: PDF upload still rejects common uppercase file extensions; one line comment requests case-insensitive suffix handling. Evidence: local read_documents([sample.PDF], '') returned Please input txt, docx, or pdf file.; targeted pytest and Ruff passed.

imbajin · 2026-06-02T19:37:55Z

@@ -46,10 +74,9 @@ def read_documents(input_file, input_text):
                    text += "\n"
                texts.append(text)
            elif full_path.endswith(".pdf"):


⚠️ Normalize upload suffixes before dispatch

Evidence: read_documents() now accepts PDFs only when full_path.endswith(".pdf") matches exactly; I reproduced that a valid upload named sample.PDF falls through to Please input txt, docx, or pdf file. instead of reaching read_pdf_text(). Impact: the UI advertises PDF uploads, but common uppercase or mixed-case filenames are still rejected in both vector-index and graph-extraction paths that share this helper. Please normalize the suffix once, for example with Path(full_path).suffix.lower(), and add a mixed-case PDF regression test.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

imbajin

Blocking: no. Summary: PDF parsing is covered, but the upload entrypoint contract still needs a small regression test. Evidence: targeted pytest passed (8 passed), and Ruff check/format passed for the changed parser/test files.

imbajin · 2026-06-03T07:29:16Z

+    assert result == ["hello hugegraph"]
+
+
+def test_read_documents_reads_pdf_file(tmp_path):


⚠️ Cover the upload entrypoints too

Evidence: this suite proves read_documents() parses PDFs, but the UI exposes the change through build_vector_index() and extract_graph() via the shared upload component. Impact: a regression in caller wiring, argument forwarding, or the scheduler boundary could still ship while these helper tests pass. Please add contract-level tests that mock the scheduler and verify a PDF upload is accepted and forwarded as extracted text through both entrypoints.

Thanks for the review. I added contract-level tests for both upload entrypoints:

build_vector_index() now verifies that a PDF upload is parsed and forwarded to the scheduler as extracted text.

extract_graph() now verifies that a PDF upload is parsed and forwarded to the graph extraction scheduler flow with the schema and prompt preserved.

Local checks:

uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py

uv run ruff format --check .

imbajin

Non-blocking dependency reproducibility note.

imbajin · 2026-06-04T10:13:05Z

    "gradio",
    "jieba",
    "python-docx",
+    "pypdf",


⚠️ Consider constraining the new PDF dependency

pypdf is added to hugegraph-llm/pyproject.toml, but it is not added to the root tool.uv.constraint-dependencies block where the repo already constrains adjacent LLM/runtime dependencies such as python-docx, gradio, and langchain-text-splitters.

This is not a correctness blocker, but it leaves uv sync --extra llm --extra dev free to resolve different pypdf versions over time. Please consider adding a root constraint, for example pypdf~=6.12.0, to keep dependency resolution reproducible with the rest of the LLM stack.

Thanks for the note. The root uv constraint has been added for pypdf as pypdf~=6.12.0 under tool.uv.constraint-dependencies, so the PDF parser dependency is now resolved consistently with the rest of the LLM runtime stack.

- bump the PyPI pycgraph dependency from 3.2.2 to 3.2.4 - update the aarch64 CGraph source tag fallback to v3.2.4 - verify pycgraph imports and flow construction with uv

- add pypdf to root uv constraint dependencies - keep PDF parser resolution aligned with LLM runtime deps - verify pypdf import, PDF tests, and Ruff checks

LRriver · 2026-06-05T13:23:24Z

Thanks for the updates. The PDF upload implementation itself looks good to me: it adds pypdf-based extraction in the shared read_documents() path, handles mixed-case suffixes, returns clear Gradio errors for encrypted/unreadable/no-text PDFs, updates the upload copy, and adds coverage for TXT/DOCX/PDF plus both build_vector_index() and extract_graph() entrypoints.

I have one blocking scope/dependency concern before merge:

hugegraph-llm/pyproject.toml also changes pycgraph from 3.2.2 to 3.2.4, including the aarch64 git source tag. This does not appear related to #345 or PDF upload support. pycgraph is the workflow engine used across many HugeGraph-LLM flows, so this broadens the PR blast radius beyond the document upload path. The current tests verify the PDF/document behavior, but they do not establish that the scheduler/flow dependency upgrade is safe across RAG, graph extraction, Text2Gremlin, import/update flows, and the aarch64 source path.

Please either revert the pycgraph changes from this PR, or explain why PDF support requires this upgrade and add the corresponding broader flow/integration verification. My preference is to keep this PR focused on PDF upload support and move any pycgraph upgrade to a separate PR.

Non-blocking improvement: the PDF tests are already sufficient for the main upload path, but they could be strengthened with a real encrypted PDF fixture, e.g. using PdfWriter.encrypt(), and a multi-page PDF test that asserts page-order-preserving extraction. This would make the issue’s page-by-page/stable-order acceptance criteria more explicit, but I do not think it needs to block this PR once the unrelated dependency bump is resolved.

This reverts commit 3f4f61d.

nw9663644-eng · 2026-06-05T14:27:01Z

Thanks for the updates. The PDF upload implementation itself looks good to me: it adds pypdf-based extraction in the shared read_documents() path, handles mixed-case suffixes, returns clear Gradio errors for encrypted/unreadable/no-text PDFs, updates the upload copy, and adds coverage for TXT/DOCX/PDF plus both build_vector_index() and extract_graph() entrypoints.感谢您提供的更新。PDF 上传功能的实现本身看起来不错：它在共享的 read_documents() 路径中添加了基于 pypdf 提取功能，能够处理大小写混合的后缀，对于加密/不可读/无文本的 PDF 文件返回清晰的 Gradio 错误信息，更新了上传副本，并增加了对 TXT/DOCX/PDF 格式的支持，同时还添加了 build_vector_index() 和 extract_graph() 这两个入口点。

I have one blocking scope/dependency concern before merge:合并前我有一个阻塞性的范围/依赖关系问题：

hugegraph-llm/pyproject.toml also changes pycgraph from 3.2.2 to 3.2.4, including the aarch64 git source tag. This does not appear related to #345 or PDF upload support. pycgraph is the workflow engine used across many HugeGraph-LLM flows, so this broadens the PR blast radius beyond the document upload path. The current tests verify the PDF/document behavior, but they do not establish that the scheduler/flow dependency upgrade is safe across RAG, graph extraction, Text2Gremlin, import/update flows, and the aarch64 source path.hugegraph-llm/pyproject.toml 还将 pycgraph 从 3.2.2 更新到 3.2.4 ，包括 aarch64 的 git 源代码标签。这似乎与 #345 或 PDF 上传支持无关。pycgraph 是许多 HugeGraph-LLM 流程中使用的工作流引擎，因此此次 pycgraph 扩大了 PR 的影响范围，使其超出文档上传路径。目前的测试验证了 PDF/文档的行为，但并未证明调度器/流程依赖项的升级在 RAG、图提取、Text2Gremlin、导入/更新流程以及 aarch64 源代码路径中是安全的。

Please either revert the pycgraph changes from this PR, or explain why PDF support requires this upgrade and add the corresponding broader flow/integration verification. My preference is to keep this PR focused on PDF upload support and move any pycgraph upgrade to a separate PR.请要么撤销此 PR 中对 pycgraph 更改，要么解释为什么 PDF 支持需要此次升级，并添加相应的更全面的流程/集成验证。我倾向于让此 PR 专注于 PDF 上传支持，并将任何 pycgraph 升级移至单独的 PR 中。

Non-blocking improvement: the PDF tests are already sufficient for the main upload path, but they could be strengthened with a real encrypted PDF fixture, e.g. using PdfWriter.encrypt(), and a multi-page PDF test that asserts page-order-preserving extraction. This would make the issue’s page-by-page/stable-order acceptance criteria more explicit, but I do not think it needs to block this PR once the unrelated dependency bump is resolved.非阻塞性改进：目前针对主上传路径的 PDF 测试已经足够完善，但可以通过添加真正的加密 PDF 测试用例（例如使用 PdfWriter.encrypt() 以及一个多页 PDF 测试用例来增强其安全性，该测试用例能够确保提取过程保持页面顺序。这将使问题的逐页/稳定顺序验收标准更加明确，但我认为一旦无关的依赖项增加问题得到解决，它就不需要阻塞此 PR。

Thanks for the review. I reverted the unrelated pycgraph version/source-tag update from this PDF upload PR and kept the PDF-specific dependency changes only, including the root pypdf~=6.12.0 constraint.

Local checks passed:

uv run ruff format --check .
uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py

nw9663644-eng · 2026-06-05T14:48:00Z

Thanks for the suggestion. I added follow-up PDF coverage for this PR:

an encrypted PDF fixture using PdfWriter.encrypt()
a multi-page PDF fixture that verifies page-order-preserving extraction

Local checks passed:

uv run ruff format --check .
uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py

feat: support PDF uploads for RAG documents

f511c44

dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Jun 1, 2026

github-actions Bot added the llm label Jun 1, 2026

wn12222 added 3 commits June 1, 2026 23:45

chore: add PDF document reading tests

295d50c

chore: add document upload regression tests

837ece9

chore: add PDF upload edge case tests

f112387

nw9663644-eng mentioned this pull request Jun 1, 2026

[Feature] Support PDF uploads for RAG index and graph extraction input #345

Open

3 tasks

chore: add license header to test file

5aacb8c

imbajin requested a review from Copilot June 1, 2026 21:54

Copilot started reviewing on behalf of imbajin June 1, 2026 21:55 View session

Copilot AI reviewed Jun 1, 2026

View reviewed changes

wn12222 added 3 commits June 2, 2026 14:07

chore: close PDF file stream after reading

5f95a27

fix: repair PDF text join syntax

73e6987

chore: format PDF upload changes

91410d2

imbajin requested a review from Copilot June 2, 2026 19:10

Copilot started reviewing on behalf of imbajin June 2, 2026 19:10 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

imbajin reviewed Jun 2, 2026

View reviewed changes

wn12222 added 2 commits June 3, 2026 11:56

fix: normalize upload file suffixes

92d0ca2

chore: include PDF upload tests in CI

b895cb5

nw9663644-eng requested a review from imbajin June 3, 2026 04:01

chore: merge upstream main

e302d0e

imbajin requested a review from Copilot June 3, 2026 07:17

Copilot started reviewing on behalf of imbajin June 3, 2026 07:17 View session

Copilot AI reviewed Jun 3, 2026

View reviewed changes

imbajin reviewed Jun 3, 2026

View reviewed changes

nw9663644-eng requested a review from imbajin June 4, 2026 05:38

chore: cover PDF upload entrypoints

7765135

imbajin reviewed Jun 4, 2026

View reviewed changes

imbajin added 2 commits June 4, 2026 18:15

chore(hugegraph-llm): update pycgraph to 3.2.4

3f4f61d

- bump the PyPI pycgraph dependency from 3.2.2 to 3.2.4 - update the aarch64 CGraph source tag fallback to v3.2.4 - verify pycgraph imports and flow construction with uv

chore: constrain pypdf dependency

ceddcb6

- add pypdf to root uv constraint dependencies - keep PDF parser resolution aligned with LLM runtime deps - verify pypdf import, PDF tests, and Ruff checks

LRriver mentioned this pull request Jun 4, 2026

[Feature] Make graph extraction use configurable chunk splitting #343

Open

3 tasks

Revert "chore(hugegraph-llm): update pycgraph to 3.2.4"

8419303

This reverts commit 3f4f61d.

test: strengthen PDF upload coverage

faedf8f

nw9663644-eng requested a review from imbajin June 5, 2026 14:48

		assert result == ["hello hugegraph"]


		def test_read_documents_reads_pdf_file(tmp_path):

Conversation

nw9663644-eng commented Jun 1, 2026

Summary

Changes

Test

Uh oh!

nw9663644-eng commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nw9663644-eng Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

imbajin Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

imbajin Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

nw9663644-eng Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

imbajin left a comment

Choose a reason for hiding this comment

Uh oh!

imbajin Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

nw9663644-eng Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

LRriver commented Jun 5, 2026

Uh oh!

nw9663644-eng commented Jun 5, 2026

Uh oh!

nw9663644-eng commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nw9663644-eng commented Jun 1, 2026 •

edited

Loading