All notable changes in pdfminer.six will be documented in this file.
The format is based on Keep a Changelog.
- Restore `cmap` and `cmap_clean` targets to `Makefile` (#1256)
- Reproducibility issue when generating cmap `.json.gz` files (#1242)
- Switch to `bytearray` in `apply_png_predictor` to avoid out of memory error (and speed it up) (#1247)
- Check the type of `N` when creating an `ICCBased` color space (#1252)
- Endless recursion issue with circular or corrupted `Prev` chains in cross-reference tables (#1253)
- Support cmap types 6, 10 and 12 (#598)
- Encapsulate error when failing to get attribute or data from stream (#1225, #1226)
- Validate maximum size of xref start (#1227)
- Use lazy %-style formatting for logging (#1234)
- Unused methods close, tell and poll on PSBaseParser (#1230)
- Eliminated arbitrary code execution vulnerability (CVE-2025-64512) by replacing pickle CMap storage with JSON; users with custom pickle CMaps can use `tools/convert_cmaps_to_json.py` to convert to JSON format (#1172)
- Support for colored and uncolored tiling patterns per ISO 32000 (#1171)
- Pre-commit hooks for automated code quality checks (#1215)
- Ruff rules for modernized Python syntax (#1218)
- Using makefile instead of nox for local development (#1222)
- Fix `struct.error` when processing PDFs with odd-length font encoding buffers (#1169)
- `PSBaseParser` combines tokens split across streams (#1158)
- Improve exception handling in `PDFDocument` with more precise error propagation (#1220)
- Support for Python 3.14 (#1209)
- Refuse to execute circular references to content streams (including Form XObjects) (#1143)
- `IndexError` when saving image with no filters (#1117)
- Copying color space scs and ncs (#1140)
- Correct linewidth calculation in `PDFPageInterpreter.do_w` (#1165)
- Support for Python 3.9 (#1208)
- Arbitrary code execution when loading pickle cmaps (issue)
- Support for extracting images with TIFF predictor (#1058)
- Correct tightest fitting bounding boxes for rotated content (#1114)
- `TypeError` when passing wrong number of arguments to `safe_rgb` (#1118)
- `OverflowError` in `safe_float` when input is too large (#1121)
- Saving colour spaces on the graphics stack (#1119)
- Remove padding from AES-encrypted strings (#1123)
- TrueType fonts without encoding now correctly default to WinAnsiEncoding (#1164)
- `TypeError` when parsing font width with indirect object references (#1098)
- `ValueError` when loading xref with invalid position or generation numbers that cannot be parsed as int (#1099)
- Safely converting PDF stack objects to float or int in PDFInterpreter (#1100)
- `TypeError` when parsing font bbox with incorrect values (#1103)
- `ValueError` on incorrect stream lengths for ASCII85 data (#1112)
- Support for Python 3.13 (#1092)
- Reduce memory overhead on runlength encoding by using lists (#1055)
- Using `pyproject.toml` instead of `setup.py` (#1028)
- `TypeError` when CID character widths are not parseable as floats (#1001)
- `TypeError` raised by extract_text method with compressed PDF file (#1029)
- `PSBaseParser` can't handle tokens split across end of buffer (#1030)
- `TypeError` when CropBox is an indirect object reference (#1004)
- Remove redundant line to be able to recognize rectangles (#1066)
- Support indirect objects for filters (#1062)
- Make sure `bytes` is `bytes` where it counts (#1069)
- Support for Python 3.8 (#1091)
- Using absolute instead of relative imports (#995)
- Using standard library functions for ascii85 and asciihex (#1031)
- The third argument (generation number) to `PDFObjRef` (#972)
- `TypeError` when corrupt PDF object reference cannot be parsed as int (#972)
- `TypeError` when corrupt PDF literal cannot be converted to str (#978)
- `ValueError` when corrupt PDF specifies a negative xref location (#980)
- `ValueError` when corrupt PDF specifies an invalid mediabox (#987)
- `RecursionError` when corrupt PDF specifies a recursive /Pages object (#998)
- `TypeError` when corrupt PDF specifies text-positioning operators with invalid values (#1000)
- Inline image parsing fails when stream data contains "EI\n" (#1008)
- `TypeError` when parsing object reference as mediabox (#1082)
- Deprecated tools, functions and classes (#974)
- Support for zipped JPEGs (#938)
- Fuzzing harnesses for integration into Google's OSS-Fuzz (#949)
- Support for setuptools-git-versioning version 2.0.0 (#957)
- Resolving mediabox and pdffont (#834)
- Keywords that aren't terminated by the pattern `END_KEYWORD` before end-of-stream are parsed (#885)
- `ValueError` wrong error message when specifying codec for text output (#902)
- Resolve stream filter parameters (#906)
- Reading cmaps with whitespace in the name (#935)
- Optimize `apply_png_predictor` by using lists (#912)
- Updated Python 3.7 syntax to 3.8 (#956)
- Updated all Python version specifications to a minimum of 3.8 (#969)
- Support for Python 3.6 and 3.7 (#921)
- Output converter for the hOCR format (#651)
- Font name aliases for Arial, Courier New and Times New Roman (#790)
- Documentation on why special characters can sometimes not be extracted (#829)
- Storing Bezier path and dashing style of line in LTCurve (#801)
- Broken CI/CD pipeline by setting upper version limit for black, mypy, pip and setuptools (#921)
- `flake8` failures (#921)
- `ValueError` when bmp images with 1 bit channel are decoded (#773)
- `ValueError` when trying to decrypt empty metadata values (#766)
- Sphinx errors during building of documentation (#760)
- `TypeError` when getting default width of font (#720)
- Installing typing-extensions on Python 3.6 and 3.7 (#775)
- `TypeError` in cmapdb.py when parsing null characters (#768)
- Color "convenience operators" now (per spec) also set color space (#794)
- `ValueError` when extracting images, due to breaking changes in Pillow (#827)
- Small typos and issues in the documentation (#828)
- Ignore non-Unicode cmaps in TrueType fonts (#806)
- Using non-hardcoded version string and setuptools-git-versioning to enable installation from source and building on Python 3.12 (#922)
- Usage of `if __name__ == "__main__"` where it was only intended for testing purposes (#756)
- Support for Python 3.6 and 3.7 because they are end-of-life (#923)
- Ignoring (invalid) path constructors that do not begin with `m` (#749)
- Removed upper version bounds (#755)
- `IndexError` when handling invalid bfrange code map in CMap (#731)
- `TypeError` in lzw.py when `self.table` is not set (#732)
- `TypeError` in encodingdb.py when name of unicode is not str (#733)
- `TypeError` in HTMLConverter when using a bytes fontname (#734)
- Exporting images without any specific encoding (#737)
- Using charset-normalizer instead of chardet for less restrictive license (#744)
- Export type annotations from pypi package per PEP561 (#679)
- Support for identity cmaps (#626)
- Add support for PDF page labels (#680)
- Installation of Pillow as an optional extra dependency (#714)
- Handle decompression error due to CRC checksum error (#637)
- Regression (since 20191107) in `LTLayoutContainer.group_textboxes` that returned some text lines out of order (#659)
- Add handling of JPXDecode filter to enable extraction of images for some pdfs (#645)
- Fix extraction of jbig2 files, which was producing invalid files (#652)
- Crash in `pdf2txt.py --boxes-flow=disabled` (#682)
- Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (#684)
- Ignore empty characters when analyzing layout (#499)
- Replace warnings.warn with logging.Logger.warning in line with recommended use (#673)
- Switched from nose to pytest, from tox to nox and from Travis CI to GitHub Actions (#704)
- Unnecessary return statements without argument at the end of functions (#707)
- Add support for PDF 2.0 (ISO 32000-2) AES-256 encryption (#614)
- Support for Paeth PNG filter compression (predictor value = 4) (#537)
- Type annotations (#661)
- `KeyError` when `'Encrypt'` but not `'ID'` present in `trailer` (#594)
- Fix issue of ValueError and KeyError raised in PDFdocument and PDFparser (#573)
- Fix issue of TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' (#529)
- Fix `PermissionError` when creating temporary filepaths on windows when running tests (#484)
- Fix `AttributeError` when dumping a TOC with bytes destinations (#600)
- Fix issue where some Chinese characters cannot be extracted correctly (#593)
- Detecting trailer correctly when surrounded with needless whitespace (#535)
- Fix `.paint_path` logic for handling single line segments and extracting point-on-curve positions of Bézier path commands (#530)
- Raising `UnboundLocalError` when a bad `--output-type` is used (#610)
- `TypeError` when using `TagExtractor` with non-string or non-bytes tag values (#610)
- Using `io.TextIOBase` as the file to write to (#616)
- Parsing \r\n after the escape character in a literal string (#616)
- Support for Python 3.4 and 3.5 (#522)
- Unused dependency on `sortedcontainers` package (#525)
- Support for non-standard output streams that are not binary (#523)
- Dependency on typing-extensions introduced by #661 (#677)
- Support for Python 3.4 and 3.5 (#507)
- Option to disable boxes flow layout analysis when using pdf2txt (#479)
- Support for `pathlib.PurePath` in `open_filename` (#492)
- Pass caching parameter to PDFResourceManager in `high_level` functions (#475)
- Fix `.paint_path` logic for handling non-rect quadrilaterals and decomposing complex paths (#512)
- Fix out-of-bound access on some PDFs (#483)
- Remove unused rijndael encryption implementation (#465)
- Rename PDFTextExtractionNotAllowedError to PDFTextExtractionNotAllowed to revert breaking change (#461)
- Always try to get CMap, not only for identity encodings (#438)
- Support for painting multiple rectangles at once (#371)
- Validate image object in do_EI is a PDFStream (#451)
- Hiding fallback xref by default from dumppdf.py output (#431)
- Raise a warning instead of an error when extracting text from a non-extractable PDF (#453)
- Switched from pycryptodome to cryptography package for AES decryption (#456)
- Python3 shebang line to script in tools (#408)
- Fix ordering of textlines within a textbox when `boxes_flow=None` (#412)
- Allow boxes_flow LAParam to be passed as None, validate the input, and update documentation (#396)
- Also accept file-like objects in high level functions `extract_text` and `extract_pages` (#393)
- Text no longer comes in reverse order when advanced layout analysis is disabled (#399)
- Updated misleading documentation for `word_margin` and `char_margin` (#407)
- Ignore ValueError when converting font encoding differences (#389)
- Grouping of text lines outside of parent container bounding box (#386)
- Group text lines if they are centered (#384)
- Removed samples/issue-00152-embedded-pdf.pdf because it contains a possible security threat: a JavaScript-enabled object (#364)
- Interpret two's complement integer as unsigned integer (#352)
- Fix font name in html output such that it is recognized by browser (#357)
- Compute correct font height by removing scaling with font bounding box height (#348)
- KeyError when extracting embedded files and a Unicode file specification is missing (#338)
- The command-line utility latin2ascii.py (#360)
- Support for Python 2 (#346)
- Enforce pep8 coding style by adding flake8 to CI (#345)
- Wrong order of text box grouping introduced by PR #315 (#335)
- Simple wrapper to easily extract text from a PDF file (#330)
- Support for extracting JBIG2 encoded images (#311 and #46)
- Sphinx documentation that is published on Read the Docs (#329)
- Unhandled AssertionError when dumping pdf containing reference to object id 0 (#318)
- Debug flag actually changes logging level to debug for pdf2txt.py and dumppdf.py (#325)
- Using argparse instead of getopt for command line interface of dumppdf.py (#321)
- Refactor `LTLayoutContainer.group_textboxes` for a significant speed up in layout analysis (#315)
- Files for external applications such as django, cgi and pyinstaller (#320)
- Support for Python 2 is dropped on January 1st, 2020 (#307)
- Contribution guidelines in CONTRIBUTING.md (#259)
- Support new encodings OneByteEncoding and DLIdent for CMaps (#283)
- Use `six.iteritems()` instead of `dict().iteritems()` to ensure Python2 and Python3 compatibility (#274)
- Properly convert Adobe Glyph names to unicode characters (#263)
- Allow CMap to be a content stream (#283)
- Resolve indirect objects for width and bounding boxes for fonts (#273)
- Actually updating stroke color in graphic state (#298)
- Interpret (invalid) negative font descent as a positive descent (#203)
- Correct colorspace comparison for images (#132)
- Allow for bounding boxes with zero height or width by removing assertion (#246)