Skip to content

feat: support PDF uploads for RAG documents#356

Open
nw9663644-eng wants to merge 16 commits into
apache:mainfrom
nw9663644-eng:feat-support-pdf-upload
Open

feat: support PDF uploads for RAG documents#356
nw9663644-eng wants to merge 16 commits into
apache:mainfrom
nw9663644-eng:feat-support-pdf-upload

Conversation

@nw9663644-eng
Copy link
Copy Markdown

Summary

Support text-based PDF uploads in the RAG document upload path.

Changes

  • Add PDF text extraction support in read_documents()
  • Use pypdf to extract text from PDF files page by page
  • Handle encrypted, unreadable, and scanned-image-only PDFs with clear Gradio errors
  • Keep existing TXT and DOCX behavior unchanged
  • Update the demo upload copy to mention TXT, DOCX, and PDF

Test

  • python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py

Closes #345

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Jun 1, 2026
@github-actions github-actions Bot added the llm label Jun 1, 2026
@nw9663644-eng
Copy link
Copy Markdown
Author

nw9663644-eng commented Jun 1, 2026

I have covered the requested scope and suggested tests.

Current test coverage includes:

  • TXT file reading regression
  • DOCX file reading regression
  • text-based PDF reading
  • PDF files without extractable text
  • unreadable PDF behavior
  • encrypted PDF behavior
  • unsupported file type behavior

The implementation also covers:

  • adding pypdf to hugegraph-llm/pyproject.toml
  • extracting PDF text page by page in stable order
  • replacing the previous PDF TODO error path
  • keeping existing TXT and DOCX behavior unchanged
  • updating the demo upload copy to mention TXT, DOCX, and PDF

I also checked the dependency lock situation. The repository did not have an existing uv.lock file before running uv lock; running it locally generated a new root-level uv.lock. To avoid introducing a large new lock file unrelated to this focused change, I did not include it in this PR.

Local checks:

  • python -m py_compile hugegraph-llm/src/tests/test_vector_index_utils.py
  • python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
  • git diff upstream/main --check

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).

Changes:

  • Add pypdf-based PDF text extraction (read_pdf_text()) and wire it into read_documents().
  • Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
  • Update demo UI copy to include PDF alongside TXT and DOCX.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
hugegraph-llm/src/tests/test_vector_index_utils.py Adds unit tests for read_documents() across TXT/DOCX/PDF and PDF error cases.
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py Implements PDF text extraction via pypdf and updates extension handling + error messages.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py Updates upload instructions to mention PDF support.
hugegraph-llm/pyproject.toml Adds pypdf dependency required for PDF parsing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +33 to +55
try:
reader = PdfReader(full_path)

if reader.is_encrypted:
raise gr.Error(
"Encrypted PDF files are not supported. "
"Please upload an unencrypted PDF."
)

page_texts = []
for page in reader.pages:
page_text = page.extract_text() or ""
if page_text.strip():
page_texts.append(page_text)

text = "\n".join(page_texts).strip()
if not text:
raise gr.Error(
"No extractable text was found in this PDF. "
"Scanned-image PDFs are not supported without OCR."
)

return text
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

for offset in offsets:
pdf += f"{offset:010d} 00000 n \n".encode()

pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode()
Comment thread hugegraph-llm/src/tests/document/test_vector_index_utils.py
Comment on lines +55 to +59
def test_read_documents_reads_txt_file(tmp_path):
txt_path = tmp_path / "sample.txt"
txt_path.write_text("hello hugegraph", encoding="utf-8")

result = read_documents([SimpleNamespace(name=str(txt_path))], "")
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: no. Summary: PDF upload still rejects common uppercase file extensions; one line comment requests case-insensitive suffix handling. Evidence: local read_documents([sample.PDF], '') returned Please input txt, docx, or pdf file.; targeted pytest and Ruff passed.

@@ -46,10 +74,9 @@ def read_documents(input_file, input_text):
text += "\n"
texts.append(text)
elif full_path.endswith(".pdf"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Normalize upload suffixes before dispatch

Evidence: read_documents() now accepts PDFs only when full_path.endswith(".pdf") matches exactly; I reproduced that a valid upload named sample.PDF falls through to Please input txt, docx, or pdf file. instead of reaching read_pdf_text(). Impact: the UI advertises PDF uploads, but common uppercase or mixed-case filenames are still rejected in both vector-index and graph-extraction paths that share this helper. Please normalize the suffix once, for example with Path(full_path).suffix.lower(), and add a mixed-case PDF regression test.

@nw9663644-eng nw9663644-eng requested a review from imbajin June 3, 2026 04:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: no. Summary: PDF parsing is covered, but the upload entrypoint contract still needs a small regression test. Evidence: targeted pytest passed (8 passed), and Ruff check/format passed for the changed parser/test files.

assert result == ["hello hugegraph"]


def test_read_documents_reads_pdf_file(tmp_path):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Cover the upload entrypoints too

Evidence: this suite proves read_documents() parses PDFs, but the UI exposes the change through build_vector_index() and extract_graph() via the shared upload component. Impact: a regression in caller wiring, argument forwarding, or the scheduler boundary could still ship while these helper tests pass. Please add contract-level tests that mock the scheduler and verify a PDF upload is accepted and forwarded as extracted text through both entrypoints.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I added contract-level tests for both upload entrypoints:

  • build_vector_index() now verifies that a PDF upload is parsed and forwarded to the scheduler as extracted text.
  • extract_graph() now verifies that a PDF upload is parsed and forwarded to the graph extraction scheduler flow with the schema and prompt preserved.

Local checks:

  • uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py
  • uv run ruff format --check .

@nw9663644-eng nw9663644-eng requested a review from imbajin June 4, 2026 05:38
Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking dependency reproducibility note.

"gradio",
"jieba",
"python-docx",
"pypdf",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Consider constraining the new PDF dependency

pypdf is added to hugegraph-llm/pyproject.toml, but it is not added to the root tool.uv.constraint-dependencies block where the repo already constrains adjacent LLM/runtime dependencies such as python-docx, gradio, and langchain-text-splitters.

This is not a correctness blocker, but it leaves uv sync --extra llm --extra dev free to resolve different pypdf versions over time. Please consider adding a root constraint, for example pypdf~=6.12.0, to keep dependency resolution reproducible with the rest of the LLM stack.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the note. The root uv constraint has been added for pypdf as pypdf~=6.12.0 under tool.uv.constraint-dependencies, so the PDF parser dependency is now resolved consistently with the rest of the LLM runtime stack.

imbajin added 2 commits June 4, 2026 18:15
- bump the PyPI pycgraph dependency from 3.2.2 to 3.2.4

- update the aarch64 CGraph source tag fallback to v3.2.4

- verify pycgraph imports and flow construction with uv
- add pypdf to root uv constraint dependencies

- keep PDF parser resolution aligned with LLM runtime deps

- verify pypdf import, PDF tests, and Ruff checks
@LRriver
Copy link
Copy Markdown
Contributor

LRriver commented Jun 5, 2026

Thanks for the updates. The PDF upload implementation itself looks good to me: it adds pypdf-based extraction in the shared read_documents() path, handles mixed-case suffixes, returns clear Gradio errors for encrypted/unreadable/no-text PDFs, updates the upload copy, and adds coverage for TXT/DOCX/PDF plus both build_vector_index() and extract_graph() entrypoints.

I have one blocking scope/dependency concern before merge:

hugegraph-llm/pyproject.toml also changes pycgraph from 3.2.2 to 3.2.4, including the aarch64 git source tag. This does not appear related to #345 or PDF upload support. pycgraph is the workflow engine used across many HugeGraph-LLM flows, so this broadens the PR blast radius beyond the document upload path. The current tests verify the PDF/document behavior, but they do not establish that the scheduler/flow dependency upgrade is safe across RAG, graph extraction, Text2Gremlin, import/update flows, and the aarch64 source path.

Please either revert the pycgraph changes from this PR, or explain why PDF support requires this upgrade and add the corresponding broader flow/integration verification. My preference is to keep this PR focused on PDF upload support and move any pycgraph upgrade to a separate PR.

Non-blocking improvement: the PDF tests are already sufficient for the main upload path, but they could be strengthened with a real encrypted PDF fixture, e.g. using PdfWriter.encrypt(), and a multi-page PDF test that asserts page-order-preserving extraction. This would make the issue’s page-by-page/stable-order acceptance criteria more explicit, but I do not think it needs to block this PR once the unrelated dependency bump is resolved.

@nw9663644-eng
Copy link
Copy Markdown
Author

Thanks for the updates. The PDF upload implementation itself looks good to me: it adds pypdf-based extraction in the shared read_documents() path, handles mixed-case suffixes, returns clear Gradio errors for encrypted/unreadable/no-text PDFs, updates the upload copy, and adds coverage for TXT/DOCX/PDF plus both build_vector_index() and extract_graph() entrypoints.感谢您提供的更新。PDF 上传功能的实现本身看起来不错:它在共享的 read_documents() 路径中添加了基于 pypdf 提取功能,能够处理大小写混合的后缀,对于加密/不可读/无文本的 PDF 文件返回清晰的 Gradio 错误信息,更新了上传副本,并增加了对 TXT/DOCX/PDF 格式的支持,同时还添加了 build_vector_index()extract_graph() 这两个入口点。

I have one blocking scope/dependency concern before merge:合并前我有一个阻塞性的范围/依赖关系问题:

hugegraph-llm/pyproject.toml also changes pycgraph from 3.2.2 to 3.2.4, including the aarch64 git source tag. This does not appear related to #345 or PDF upload support. pycgraph is the workflow engine used across many HugeGraph-LLM flows, so this broadens the PR blast radius beyond the document upload path. The current tests verify the PDF/document behavior, but they do not establish that the scheduler/flow dependency upgrade is safe across RAG, graph extraction, Text2Gremlin, import/update flows, and the aarch64 source path.hugegraph-llm/pyproject.toml 还将 pycgraph3.2.2 更新到 3.2.4 ,包括 aarch64 的 git 源代码标签。这似乎与 #345 或 PDF 上传支持无关。pycgraph 是许多 HugeGraph-LLM 流程中使用的工作流引擎,因此此次 pycgraph 扩大了 PR 的影响范围,使其超出文档上传路径。目前的测试验证了 PDF/文档的行为,但并未证明调度器/流程依赖项的升级在 RAG、图提取、Text2Gremlin、导入/更新流程以及 aarch64 源代码路径中是安全的。

Please either revert the pycgraph changes from this PR, or explain why PDF support requires this upgrade and add the corresponding broader flow/integration verification. My preference is to keep this PR focused on PDF upload support and move any pycgraph upgrade to a separate PR.请要么撤销此 PR 中对 pycgraph 更改,要么解释为什么 PDF 支持需要此次升级,并添加相应的更全面的流程/集成验证。我倾向于让此 PR 专注于 PDF 上传支持,并将任何 pycgraph 升级移至单独的 PR 中。

Non-blocking improvement: the PDF tests are already sufficient for the main upload path, but they could be strengthened with a real encrypted PDF fixture, e.g. using PdfWriter.encrypt(), and a multi-page PDF test that asserts page-order-preserving extraction. This would make the issue’s page-by-page/stable-order acceptance criteria more explicit, but I do not think it needs to block this PR once the unrelated dependency bump is resolved.非阻塞性改进:目前针对主上传路径的 PDF 测试已经足够完善,但可以通过添加真正的加密 PDF 测试用例(例如使用 PdfWriter.encrypt() 以及一个多页 PDF 测试用例来增强其安全性,该测试用例能够确保提取过程保持页面顺序。这将使问题的逐页/稳定顺序验收标准更加明确,但我认为一旦无关的依赖项增加问题得到解决,它就不需要阻塞此 PR。

Thanks for the review. I reverted the unrelated pycgraph version/source-tag update from this PDF upload PR and kept the PDF-specific dependency changes only, including the root pypdf~=6.12.0 constraint.

Local checks passed:

  • uv run ruff format --check .
  • uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py

@nw9663644-eng
Copy link
Copy Markdown
Author

Thanks for the suggestion. I added follow-up PDF coverage for this PR:

  • an encrypted PDF fixture using PdfWriter.encrypt()
  • a multi-page PDF fixture that verifies page-order-preserving extraction

Local checks passed:

  • uv run ruff format --check .
  • uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py

@nw9663644-eng nw9663644-eng requested a review from imbajin June 5, 2026 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request llm size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support PDF uploads for RAG index and graph extraction input

5 participants