This project is a Python-based tool designed to process and classify PDF documents. It extracts text from PDFs using OCR (Optical Character Recognition), leverages a Large Language Model (LLM) to identify key information, and then renames the PDFs based on the extracted metadata.
- PDF Processing: Converts PDF files into images, extracts text using Tesseract OCR, and stores the extracted data.
- LLM-Based Classification: Uses Google's Gemini LLM to analyze the extracted text and identify key information such as document date, document type, sender, and invoice number.
- Intelligent Renaming: Renames PDF files based on the extracted metadata, creating a structured and informative naming convention.
- Data Management: Loads and manages classification data from CSV files.
- Robust Error Handling: Includes error handling for PDF processing, OCR, and LLM interactions.
- Testability: Comes with a comprehensive suite of unit tests to ensure code quality and prevent regressions.
- Python 3.10+
- Code dependencies are listed in `requirements.txt` (see installation):
link:requirements.txt[role=include]
- Tesseract OCR Engine (version 4.1.0 or higher recommended)
- A Google API key (for using the Gemini LLM)
This section describes how to install and set up the PDF Classifier project.
- Python 3.10 or higher
- Git (for cloning the repository)
- Tesseract OCR Engine (version 4.1.0 or higher recommended)
- A Google API key (for using the Gemini LLM)
- Clone the repository:

  git clone <repository-url>
  cd pdfclassifier
- Create a virtual environment (recommended):

  python3 -m venv venv
  source venv/bin/activate   # On Linux/macOS
  venv\Scripts\activate      # On Windows
- Install the Python dependencies:

  pip install -r requirements.txt
- Install Tesseract OCR:

* Linux:

  sudo apt-get update
  sudo apt-get install tesseract-ocr
  sudo apt-get install libtesseract-dev

* macOS:

  brew install tesseract tesseract-lang

* Windows: Download the installer from the official Tesseract OCR website and follow the installation instructions.
- Set up the environment variables:

* Run the `env-make.sh` script to create a `.env` file and set the `GOOGLE_API_KEY` and `TESSDATA_PREFIX` environment variables:

  bash env-make.sh

* This script will prompt you to enter your `GOOGLE_API_KEY`.

NOTE: `TESSDATA_PREFIX` is set to `/Users/thst/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata` in the `env-make.sh` script. If you installed Tesseract in a different location, you need to adjust this path in the script.
- Activate the environment variables:

* Source the `.env` file to activate the environment variables:

  source .env

The application will load the `.env` file on startup.
- Virtual Environments: Using a virtual environment is highly recommended to isolate the project's dependencies from other projects and from the system's Python installation.
- Tesseract OCR: Make sure that the `tesseract` command is available in your system's PATH after installation. You can test this by running `tesseract --version` in your terminal.
- Google API Key: You can obtain a Google API key from the Google Cloud Console. Gemini API keys can specifically be obtained through Google AI Studio: https://aistudio.google.com/app/apikey
- Replace Placeholders: Remember to replace `<repository-url>` with your actual repository URL.
The PDF Classifier project provides a set of modules within the src/python/classifier directory that can be used to build a larger application for processing, classifying, and renaming PDF documents. The core modules are:
- `data.py`: Handles data loading and management, including reading and writing CSV files.
- `llm_classifier.py`: Contains the logic for using the Gemini LLM to analyze text and extract key information.
- `pdfprocessor.py`: Contains the logic for processing PDF files, extracting images, and performing OCR.
- `renamer.py`: Contains the logic for renaming PDF files based on extracted metadata.
These modules are designed to be used together within a larger application, not as standalone scripts.
The intended workflow for using the PDF Classifier modules is as follows:
- PDF Processing: Use the `pdfprocessor.py` module to process PDF files in a folder, extract images from each page, and then extract text from each image using Tesseract OCR. The extracted text is stored in temporary CSV files.
- LLM-Based Classification: Use the `llm_classifier.py` module to analyze the extracted text from the temporary CSV files. It extracts key information such as document date, document type, sender, and invoice number. The extracted information is stored in a pandas DataFrame.
- Intelligent Renaming: Use the `renamer.py` module to rename the PDF files based on the extracted metadata. It reads the classification data from a CSV file and applies the extracted information to rename the PDF files.
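The three-stage workflow above can be sketched as follows. The stage functions are illustrative stand-ins, not the actual APIs of `pdfprocessor.py`, `llm_classifier.py`, or `renamer.py`:

```python
# Sketch of the three-stage workflow; every function below is a
# hypothetical stand-in for the corresponding classifier module.

def ocr_stage(pdf_path: str) -> str:
    """Stand-in for pdfprocessor: pretend the PDF was rendered and OCRed."""
    return f"text of {pdf_path}"

def llm_stage(text: str) -> dict:
    """Stand-in for llm_classifier: pretend the LLM extracted features."""
    return {"docdate": "2022-08-01", "doctype": "invoice"}

def rename_stage(pdf_path: str, features: dict) -> str:
    """Stand-in for renamer: build a new file name from the features."""
    return f"{features['docdate']}-{features['doctype']}.pdf"

def classify(pdf_paths: list) -> dict:
    """Run every PDF through the three stages in order."""
    results = {}
    for pdf_path in pdf_paths:
        text = ocr_stage(pdf_path)
        features = llm_stage(text)
        results[pdf_path] = rename_stage(pdf_path, features)
    return results
```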
- Initialize an LLM with a model and your API key:

link:src/python/pdfclassify.py[role=include]
- Run the PDF processing on the input files. This creates temporary files and data files. If you do not clean up, you can reuse the files in a detached run later.

link:src/python/pdfclassify.py[role=include]
- The PDF processing produces a list of `PdfData` structures that collect the generated image files and the Tesseract OCR data files. For convenience, the OCR data is also kept in memory. The next step is creating features on the files using the LLM. Since PDFs are split into pages, the naive approach is to pick the page with the best average feature quality.

link:src/python/pdfclassify.py[role=include]
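The best-page idea can be sketched like this, assuming each page yields (key, value, quality) tuples matching the feature CSV columns; this is an illustration, not the module's actual code:

```python
def best_page(pages: dict) -> int:
    """Return the index of the page with the highest average feature
    quality. `pages` maps a page index to a list of
    (key, value, quality) tuples, mirroring the feature CSV columns."""
    def avg_quality(features):
        if not features:
            return 0.0
        return sum(quality for _, _, quality in features) / len(features)
    # max() over a dict iterates its keys, i.e. the page indices
    return max(pages, key=lambda index: avg_quality(pages[index]))
```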
- The LLM part yields a feature set per PDF, here collected in a `features_list`. In the next step, the feature data is converted to `FileData` to start the renaming. Sanitize the data for filename use.

link:src/python/pdfclassify.py[role=include]
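A sanitizer along these lines could be used for the filename step; the rules below (keep ASCII letters, digits, dots, and dashes) are an assumption, not necessarily what the renamer module actually does:

```python
import re

def sanitize(value: str) -> str:
    """Hypothetical filename sanitizer: keep ASCII letters, digits,
    dots, and dashes; collapse every other run of characters (spaces,
    slashes, umlauts, ...) into a single dash."""
    cleaned = re.sub(r"[^A-Za-z0-9.-]+", "-", value)
    # Avoid leading/trailing separators in the final name
    return cleaned.strip("-")
```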
- The sanitized list of `FileData` objects can now be injected into the renamer module to finally rename the PDFs:

link:src/python/pdfclassify.py[role=include]
The functionality is collected in a command-line utility called `pdfclassify.py`. The script has a command-line help that can be displayed with `--help`:
link:src/doc/help.txt[role=include]

NOTE: When changing the command arguments during development, call the `update-help.sh` script to update the documentation.
The application will try to use data found at the `--pdf-out` location if it is newer than the accompanying PDF. If you want to overwrite the intermediate files, set the `--force` flag.
The `--dry-run` flag will do all processing but not copy or move the PDF.
python pdfclassify.py \
--pdf-in /location/of/scanned/pdf \
--pdf-out /target/location/of/classified/files \
--copy \
--results \
--features
python pdfclassify.py \
--pdf-in /location/of/scanned/pdf \
--pdf-out /target/location/of/classified/files \
--move \
--force
python pdfclassify.py \
--pdf-in /location/of/scanned/pdf \
--pdf-out /target/location/of/classified/files \
--move \
--force \
--dry-run
The PDF processor creates a folder `work.d` in the `--pdf-out` folder. For each PDF consumed from `--pdf-in`, a folder with the basename of the PDF is created.
|===
| PDF read | Work folder | Data files | Image files
|===
If the flag `--features` is set, the results from the LLM are written to a file in the `--pdf-out` folder. The name can be defined using the `--features-name-fmt` setting. The placeholder `{pdf_name}` can be used in the format to include the source PDF name in the feature file name.
The default feature file name is `{pdf_name}-feature.csv`.
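Filling the placeholder amounts to ordinary Python string formatting. A sketch under the assumption that the basename of the source PDF is used (the script's own implementation may differ):

```python
from pathlib import Path

def feature_file_name(pdf_path: str,
                      fmt: str = "{pdf_name}-feature.csv") -> str:
    """Fill the {pdf_name} placeholder with the source PDF's basename,
    mirroring the --features-name-fmt behaviour described above."""
    return fmt.format(pdf_name=Path(pdf_path).stem)
```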
| key | value | quality |
|---|---|---|
| id | /Volumes/scans/SCAN_0001.pdf | 1.0 |
| Document Date | 2022-08-01 | 1.0 |
| Document Type | rechnung | 1.0 |
| Sender | OTTO | 1.0 |
| Invoice Number | A-BC-0004711 | 1.0 |
If using the `pdfprocessor` class, the feature is a pandas `DataFrame` object.
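The example table above corresponds to a DataFrame like the following; the literal values are taken from the example, and the construction is only a sketch of the shape of the data:

```python
import pandas as pd

# The example feature table, rebuilt as a pandas DataFrame with the
# same key/value/quality columns as the feature CSV.
features = pd.DataFrame(
    [
        ("id", "/Volumes/scans/SCAN_0001.pdf", 1.0),
        ("Document Date", "2022-08-01", 1.0),
        ("Document Type", "rechnung", 1.0),
        ("Sender", "OTTO", 1.0),
        ("Invoice Number", "A-BC-0004711", 1.0),
    ],
    columns=["key", "value", "quality"],
)
```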
The script collects all features produced by the LLM in the file `all-features.csv`. The file is only written if `--features` is set.
The features are further processed into a renaming table, the `results.csv`.
The results must have the following columns.
- `scanfile`: The name of the PDF file.
- `docdate`: The document date in `YYYY-MM-DD` format.
- `doctype`: The document type (e.g., "invoice", "account statement", "other").
- `sendername`: The name of the sender.
- `docid`: The document ID (e.g., invoice number, account number).
- `receivername`: The name of the receiver.
- `dateoffile`: The date of the file in `YYYY-MM-DD` format.
- `extension`: The file extension (e.g., "pdf").
The `scanfile` is the name of the PDF in the `--pdf-in` folder. It is used to copy or move the file to the filename constructed from the remaining fields.
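The README does not spell out the exact naming convention, so the pattern below is purely illustrative of how a target name could be built from a `results.csv` row:

```python
def target_name(row: dict) -> str:
    """Build a target filename from a results.csv row. The field order
    and separators here are an assumption, not the renamer's actual
    format string."""
    return "{docdate}-{doctype}-{sendername}-{docid}.{extension}".format(**row)
```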
The script searches for the files needed to continue processing before the actual processing starts. If the files are found, the source PDF has not changed since the data files were created, and the `--force` flag is not set, the script reconstructs the data from the previous run and uses it.
In theory, you can stop the script at any time and restart it. It should continue where it left off and will not run the LLM on files for which a valid feature CSV is present. It will call `pdf2images`, though, on all PDFs, but the images will not be written if they already exist. Subsequently, OCR will not happen, unless forced, given the OCR results are already there and fresh.
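The freshness rule boils down to a modification-time comparison. A sketch of the idea (the script's actual check may differ in details):

```python
import os

def can_reuse(pdf_path: str, data_path: str, force: bool = False) -> bool:
    """Reuse previous-run data only if it exists, is at least as new as
    the source PDF, and force is off."""
    if force or not os.path.exists(data_path):
        return False
    return os.path.getmtime(data_path) >= os.path.getmtime(pdf_path)
```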
The logging is quite comprehensive about when a file from a previous run is reused instead of new data being produced.
There are no build steps defined. You can start to hack away immediately.
pdfclassifier/
├─ requirements.txt              # Project dependencies
├─ update-help.sh                # Script to create the src/doc/help.txt
├─ env-make.sh                   # Script to create the .env file
├─ src/
│  ├─ doc/
│  │  └─ help.txt                # rendered --help output for docs
│  ├─ python/
│  │  ├─ classifier/
│  │  │  ├─ __init__.py
│  │  │  ├─ data.py              # Data loading and management
│  │  │  ├─ llm_classifier.py    # LLM-based classification
│  │  │  ├─ pdfprocessor.py      # PDF processing and OCR
│  │  │  └─ renamer.py           # PDF renaming
│  │  └─ pdfclassify.py          # the classification application
│  └─ test/
│     └─ python/
│        ├─ __init__.py
│        ├─ test_data.py         # Unit tests for data.py
│        ├─ test_pdfprocessor.py # Unit tests for pdfprocessor.py
│        └─ test_renamer.py      # Unit tests for renamer.py
└─ README.adoc                   # This file
Feel free to fork the application if you want to contribute. Forks are welcome; please adhere to the license.
As you might have noticed, there is a lot of German in use: this application was created for my own needs. If you want new features and have ideas, pull requests are welcome.
The application solves a real-world problem for me and is an interesting experiment in using LLMs for something more useful than generating memes.