
Add maxPages option to PDFParserConfig to limit page processing #2803

Draft
jnioche wants to merge 2 commits into apache:main from DigitalPebble:feature/pdf-max-pages


@jnioche commented May 7, 2026

Summary

  • Adds maxPages field to PDFParserConfig (default -1, no limit)
  • AbstractPDF2XHTML.processPages() breaks out of the page loop early when the limit is reached, skipping all text extraction, font mapping, and content stream work for subsequent pages
  • Setter validates that the value is -1 or >= 1, throwing IllegalArgumentException otherwise
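The behaviour described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `MaxPagesConfigSketch` and `PageLoopSketch` are hypothetical stand-ins for `PDFParserConfig` and the page loop in `AbstractPDF2XHTML.processPages()`, and pages are modelled as plain integers rather than PDFBox page objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PDFParserConfig; only the maxPages field is sketched. */
class MaxPagesConfigSketch {
    private int maxPages = -1; // -1 means "no limit", as in the PR summary

    public int getMaxPages() {
        return maxPages;
    }

    public void setMaxPages(int maxPages) {
        // Accept -1 (no limit) or any value >= 1; reject 0 and other negatives.
        if (maxPages != -1 && maxPages < 1) {
            throw new IllegalArgumentException(
                    "maxPages must be -1 (no limit) or >= 1, got " + maxPages);
        }
        this.maxPages = maxPages;
    }
}

/** Hypothetical stand-in for the page loop; returns the 1-based page numbers it visited. */
class PageLoopSketch {
    static List<Integer> processPages(int totalPages, MaxPagesConfigSketch config) {
        List<Integer> processed = new ArrayList<>();
        int maxPages = config.getMaxPages();
        for (int page = 1; page <= totalPages; page++) {
            if (maxPages != -1 && page > maxPages) {
                break; // limit reached: no work at all is done for later pages
            }
            processed.add(page); // placeholder for text extraction on this page
        }
        return processed;
    }
}
```

Breaking out of the loop, rather than processing each page and discarding its output, is what would avoid the per-page extraction cost for everything past the limit.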

Performance

Benchmarked on a 738-page PDF:

  • Full parse: ~1,800 ms
  • First 5 pages: ~135 ms (13× faster)

jnioche added 2 commits May 7, 2026 20:11

Adds a configurable maxPages field to PDFParserConfig (default -1, no limit).
When set, AbstractPDF2XHTML.processPages() breaks out of the page loop early,
avoiding all text extraction, font mapping, and content stream work for pages
beyond the limit. Validated on a 738-page PDF: parsing 5 pages is 13x faster
than a full parse.

Adds maxPages to the PDF parser full config example, which is included
in docs/configuration/parsers/pdf-parser.adoc and validated by
ConfigExamplesTest#testPdfParserFullConfig.
@jnioche jnioche marked this pull request as draft May 7, 2026 19:21
@tballison
Contributor

Should we do a page range instead? pageStart and pageEnd, 1-indexed? That would allow iterating through a document in chunks.

I'm ok with this as is. wdyt?
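To make the page-range idea concrete, here is a rough sketch of how a caller might split a document into chunks with 1-indexed, inclusive bounds. Nothing here exists in the PR: `PageRangeSketch`, `PageRange`, and `chunks` are hypothetical names, and only the `pageStart`/`pageEnd` naming comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

class PageRangeSketch {

    /** Hypothetical inclusive, 1-indexed page range. */
    static final class PageRange {
        final int pageStart;
        final int pageEnd;

        PageRange(int pageStart, int pageEnd) {
            if (pageStart < 1 || pageEnd < pageStart) {
                throw new IllegalArgumentException(
                        "expected 1 <= pageStart <= pageEnd, got " + pageStart + ".." + pageEnd);
            }
            this.pageStart = pageStart;
            this.pageEnd = pageEnd;
        }
    }

    /** Splits pages 1..totalPages into consecutive ranges of at most chunkSize pages. */
    static List<PageRange> chunks(int totalPages, int chunkSize) {
        if (totalPages < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and chunkSize >= 1");
        }
        List<PageRange> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            int end = Math.min(start + chunkSize - 1, totalPages);
            ranges.add(new PageRange(start, end));
        }
        return ranges;
    }
}
```

Under this sketch, the current maxPages behaviour would correspond to the single range 1..maxPages, so a range option could subsume it.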
