feat: support PDF uploads for RAG documents#356
Conversation
|
I have covered the requested scope and suggested tests. Current test coverage includes:
The implementation also covers:
I also checked the dependency lock situation. The repository did not have an existing Local checks:
|
There was a problem hiding this comment.
Pull request overview
Adds text-based PDF support to the RAG document upload path so PDFs can be used in both vector index building and graph extraction (closes #345).
Changes:
- Add
pypdf-based PDF text extraction (read_pdf_text()) and wire it intoread_documents(). - Add unit tests covering TXT/DOCX/PDF success cases and common PDF failure modes (encrypted, unreadable, no extractable text).
- Update demo UI copy to include PDF alongside TXT and DOCX.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| hugegraph-llm/src/tests/test_vector_index_utils.py | Adds unit tests for read_documents() across TXT/DOCX/PDF and PDF error cases. |
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Implements PDF text extraction via pypdf and updates extension handling + error messages. |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py | Updates upload instructions to mention PDF support. |
| hugegraph-llm/pyproject.toml | Adds pypdf dependency required for PDF parsing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| reader = PdfReader(full_path) | ||
|
|
||
| if reader.is_encrypted: | ||
| raise gr.Error( | ||
| "Encrypted PDF files are not supported. " | ||
| "Please upload an unencrypted PDF." | ||
| ) | ||
|
|
||
| page_texts = [] | ||
| for page in reader.pages: | ||
| page_text = page.extract_text() or "" | ||
| if page_text.strip(): | ||
| page_texts.append(page_text) | ||
|
|
||
| text = "\n".join(page_texts).strip() | ||
| if not text: | ||
| raise gr.Error( | ||
| "No extractable text was found in this PDF. " | ||
| "Scanned-image PDFs are not supported without OCR." | ||
| ) | ||
|
|
||
| return text |
There was a problem hiding this comment.
I updated read_pdf_text() to open PDF files with a context manager and pass the binary stream to PdfReader, so the file handle is closed after extraction, including error paths. I also fixed the PDF text join syntax and verified the updated files with py_compile.
| for offset in offsets: | ||
| pdf += f"{offset:010d} 00000 n \n".encode() | ||
|
|
||
| pdf += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\nstartxref\n{xref_offset}\n%%EOF\n").encode() |
| def test_read_documents_reads_txt_file(tmp_path): | ||
| txt_path = tmp_path / "sample.txt" | ||
| txt_path.write_text("hello hugegraph", encoding="utf-8") | ||
|
|
||
| result = read_documents([SimpleNamespace(name=str(txt_path))], "") |
imbajin
left a comment
There was a problem hiding this comment.
Blocking: no. Summary: PDF upload still rejects common uppercase file extensions; one line comment requests case-insensitive suffix handling. Evidence: local read_documents([sample.PDF], '') returned Please input txt, docx, or pdf file.; targeted pytest and Ruff passed.
| @@ -46,10 +74,9 @@ def read_documents(input_file, input_text): | |||
| text += "\n" | |||
| texts.append(text) | |||
| elif full_path.endswith(".pdf"): | |||
There was a problem hiding this comment.
Evidence: read_documents() now accepts PDFs only when full_path.endswith(".pdf") matches exactly; I reproduced that a valid upload named sample.PDF falls through to Please input txt, docx, or pdf file. instead of reaching read_pdf_text(). Impact: the UI advertises PDF uploads, but common uppercase or mixed-case filenames are still rejected in both vector-index and graph-extraction paths that share this helper. Please normalize the suffix once, for example with Path(full_path).suffix.lower(), and add a mixed-case PDF regression test.
imbajin
left a comment
There was a problem hiding this comment.
Blocking: no. Summary: PDF parsing is covered, but the upload entrypoint contract still needs a small regression test. Evidence: targeted pytest passed (8 passed), and Ruff check/format passed for the changed parser/test files.
| assert result == ["hello hugegraph"] | ||
|
|
||
|
|
||
| def test_read_documents_reads_pdf_file(tmp_path): |
There was a problem hiding this comment.
Evidence: this suite proves read_documents() parses PDFs, but the UI exposes the change through build_vector_index() and extract_graph() via the shared upload component. Impact: a regression in caller wiring, argument forwarding, or the scheduler boundary could still ship while these helper tests pass. Please add contract-level tests that mock the scheduler and verify a PDF upload is accepted and forwarded as extracted text through both entrypoints.
There was a problem hiding this comment.
Thanks for the review. I added contract-level tests for both upload entrypoints:
- build_vector_index() now verifies that a PDF upload is parsed and forwarded to the scheduler as extracted text.
- extract_graph() now verifies that a PDF upload is parsed and forwarded to the graph extraction scheduler flow with the schema and prompt preserved.
Local checks:
- uv run pytest hugegraph-llm/src/tests/document/test_vector_index_utils.py
- uv run ruff format --check .
imbajin
left a comment
There was a problem hiding this comment.
Non-blocking dependency reproducibility note.
| "gradio", | ||
| "jieba", | ||
| "python-docx", | ||
| "pypdf", |
There was a problem hiding this comment.
pypdf is added to hugegraph-llm/pyproject.toml, but it is not added to the root tool.uv.constraint-dependencies block where the repo already constrains adjacent LLM/runtime dependencies such as python-docx, gradio, and langchain-text-splitters.
This is not a correctness blocker, but it leaves uv sync --extra llm --extra dev free to resolve different pypdf versions over time. Please consider adding a root constraint, for example pypdf~=6.12.0, to keep dependency resolution reproducible with the rest of the LLM stack.
There was a problem hiding this comment.
Thanks for the note. The root uv constraint has been added for pypdf as pypdf~=6.12.0 under tool.uv.constraint-dependencies, so the PDF parser dependency is now resolved consistently with the rest of the LLM runtime stack.
- bump the PyPI pycgraph dependency from 3.2.2 to 3.2.4 - update the aarch64 CGraph source tag fallback to v3.2.4 - verify pycgraph imports and flow construction with uv
- add pypdf to root uv constraint dependencies - keep PDF parser resolution aligned with LLM runtime deps - verify pypdf import, PDF tests, and Ruff checks
|
Thanks for the updates. The PDF upload implementation itself looks good to me: it adds I have one blocking scope/dependency concern before merge:
Please either revert the Non-blocking improvement: the PDF tests are already sufficient for the main upload path, but they could be strengthened with a real encrypted PDF fixture, e.g. using |
This reverts commit 3f4f61d.
Thanks for the review. I reverted the unrelated Local checks passed:
|
|
Thanks for the suggestion. I added follow-up PDF coverage for this PR:
Local checks passed:
|
Summary
Support text-based PDF uploads in the RAG document upload path.
Changes
read_documents()pypdfto extract text from PDF files page by pageTest
python -m py_compile hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.pyCloses #345