Add maxPages option to PDFParserConfig to limit page processing by jnioche · Pull Request #2803 · apache/tika

jnioche · 2026-05-07T19:17:27Z

Summary

Adds maxPages field to PDFParserConfig (default -1, no limit)
AbstractPDF2XHTML.processPages() breaks out of the page loop early when the limit is reached, skipping all text extraction, font mapping, and content stream work for subsequent pages
Setter validates that the value is -1 or >= 1, throwing IllegalArgumentException otherwise
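
A minimal sketch of the behaviour described above, not the actual patch. `MaxPagesSketch` is a hypothetical stand-in for PDFParserConfig, and its `processPages(int)` only imitates the page loop in AbstractPDF2XHTML.processPages(). The exception message and the list of processed page numbers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PDFParserConfig; only the maxPages handling is sketched.
class MaxPagesSketch {
    private int maxPages = -1; // -1 means no limit (the default described in the summary)

    void setMaxPages(int maxPages) {
        // Accept -1 (no limit) or any positive count; reject 0 and other negatives.
        if (maxPages != -1 && maxPages < 1) {
            throw new IllegalArgumentException(
                    "maxPages must be -1 (no limit) or >= 1, got: " + maxPages);
        }
        this.maxPages = maxPages;
    }

    int getMaxPages() {
        return maxPages;
    }

    // Imitates the page loop: returns the 1-based numbers of the pages that would be processed.
    List<Integer> processPages(int totalPages) {
        List<Integer> processed = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {
            if (maxPages != -1 && page > maxPages) {
                break; // limit reached: no work at all is done for the remaining pages
            }
            processed.add(page); // placeholder for text extraction, font mapping, etc.
        }
        return processed;
    }
}
```

With the limit set to 5, the loop in this sketch touches pages 1 to 5 of a 738-page document and never visits the other 733.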

Performance

Benchmarked on a 738-page PDF:

Full parse: ~1,800 ms
First 5 pages: ~135 ms (13× faster)

Adds a configurable maxPages field to PDFParserConfig (default -1, no limit). When set, AbstractPDF2XHTML.processPages() breaks out of the page loop early, avoiding all text extraction, font mapping, and content stream work for pages beyond the limit. Benchmarked on a 738-page PDF: parsing the first 5 pages is about 13× faster than a full parse.

Adds maxPages to the PDF parser full config example, which is included in docs/configuration/parsers/pdf-parser.adoc and validated by ConfigExamplesTest#testPdfParserFullConfig.

tballison · 2026-05-07T19:38:29Z

Should we do a page range instead? pageStart and pageEnd, 1-indexed? That would allow iterating through a document in chunks.

I'm ok with this as is. wdyt?
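
To make the page-range idea concrete, here is a rough sketch of how a caller might split a document into 1-indexed, inclusive (pageStart, pageEnd) chunks. None of this is in the PR: `PageRangeSketch`, `chunks`, and the chunk size are hypothetical, and pageStart/pageEnd are only the names floated in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Tika or of this PR.
class PageRangeSketch {

    // Returns {pageStart, pageEnd} pairs, 1-indexed and inclusive, covering totalPages.
    static List<int[]> chunks(int totalPages, int chunkSize) {
        if (totalPages < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and chunkSize >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            int end = Math.min(start + chunkSize - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```

Under this scheme, a maxPages limit of N would correspond to the single range pageStart = 1, pageEnd = N.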

jnioche added 2 commits May 7, 2026 20:11

Document maxPages option in pdf-parser-full.json example

1443de9

jnioche marked this pull request as draft May 7, 2026 19:21

jnioche wants to merge 2 commits into apache:main from DigitalPebble:feature/pdf-max-pages
