Extract structured data from PDF documents with local-language content and turn it into reusable corpora for local-language AI.
This project is especially focused on documents that are not already available in clean, structured form — for example scanned dictionaries, glossaries, word lists, school subject references, and other PDFs that contain valuable language data but are hard to reuse directly.
The extracted data is intended to support downstream tasks such as retrieval, evaluation, translation, question answering, and representation learning. One target use case is the Twi-English quality estimation / embedding work shown in the demo below:
- Demo: https://huggingface.co/spaces/ghananlpcommunity/twi-eng-qe-e5-demo
- Model: https://huggingface.co/ghananlpcommunity/twi-eng-qe-e5
The repository currently contains:
- sources/: source PDFs to extract from
- data/: extracted CSV outputs and cleaned corpora
- extract_data_gemini.py: batch PDF extractor powered by Gemini
- build_parallel_corpus.py: corpus cleaner/merger for parallel CSV files
- document_recipe.json: the current extraction recipe used by the extractor
The bundled sample data is centered on Twi and related local-language material, including:
- dictionaries
- glossaries
- bilingual word lists
- school subject glossaries
- language guides
- historical/legal texts with local-language content
If you are looking for good starter documents, prioritize:
- scanned or OCR-friendly PDFs
- dictionaries and glossaries
- well-structured bilingual or parallel content
- older documents that are still useful but not widely available in structured format
- documents you are allowed to redistribute or process
- PDFs are placed in sources/<language>/.
- extract_data_gemini.py scans the PDFs and uses Gemini to:
  - identify the document structure
  - detect the languages present
  - generate or update an extraction recipe
  - extract rows into CSV format
- The output CSVs are written to data/<language>/.
- build_parallel_corpus.py cleans and merges the parallel CSVs into corpus files such as:
  - parallel_sentences.csv
  - parallel_words.csv
- Python 3.10+
- A valid Gemini API key
- The Python packages used by the extractor:
  - google-genai
  - pypdfium2
  - Pillow
Create a virtual environment and install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install google-genai pypdfium2 Pillow
Set your Gemini API key:
export GEMINI_API_KEY="your-api-key-here"
The current scripts use absolute paths in their configuration:
- extract_data_gemini.py points SOURCES_DIR and DATA_DIR to a local machine path
- build_parallel_corpus.py points DATA_DIR to a local machine path
Before running the scripts in a new checkout, update those paths so they point to your local copy of this repository.
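As a possible alternative to editing the paths by hand in each checkout, the directories could be derived from a single repository root. The sketch below is not taken from the scripts: `resolve_dirs`, `read_api_key`, and the `ARCHIVES2DATA_ROOT` variable are hypothetical names, and only `SOURCES_DIR`, `DATA_DIR`, and `GEMINI_API_KEY` come from this README.

```python
import os
from pathlib import Path


def resolve_dirs(repo_root):
    """Return (sources_dir, data_dir) located under the given repository root."""
    root = Path(repo_root).expanduser().resolve()
    return root / "sources", root / "data"


def read_api_key(env=os.environ):
    """Fetch GEMINI_API_KEY and fail early with a readable message if it is missing."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; export it before running.")
    return key


# ARCHIVES2DATA_ROOT is a hypothetical override; the default is the current directory,
# which assumes the scripts are launched from the repository root.
SOURCES_DIR, DATA_DIR = resolve_dirs(os.environ.get("ARCHIVES2DATA_ROOT", "."))
```

With something like this in place, running the scripts from the repository root would need no path edits, and the override variable would cover other layouts.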
After updating the local paths in extract_data_gemini.py, run:
python extract_data_gemini.py
The extractor runs in batch mode:
- it scans the sources/ folder for PDFs
- it skips PDFs that already have a matching CSV in data/
- it writes one CSV per PDF, using a name like:
  - <pdf_stem>_parallel.csv
  - <pdf_stem>_mono.csv
The extractor also saves the generated recipe to document_recipe.json.
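The scan-and-skip behaviour described above can be sketched roughly as follows. This is an illustration only, not the extractor's actual code: the function name `pending_pdfs` is hypothetical, and it assumes that outputs mirror the language sub-folder of the source PDF (for example sources/twi/ to data/twi/).

```python
from pathlib import Path

# Output suffixes taken from the naming scheme described in this README.
OUTPUT_SUFFIXES = ("_parallel.csv", "_mono.csv")


def pending_pdfs(sources_dir, data_dir):
    """Return PDFs under sources_dir that have no matching CSV under data_dir."""
    sources_dir, data_dir = Path(sources_dir), Path(data_dir)
    pending = []
    for pdf in sorted(sources_dir.rglob("*.pdf")):
        # Assumed layout: sources/<language>/x.pdf maps to data/<language>/x_*.csv
        out_dir = data_dir / pdf.parent.relative_to(sources_dir)
        already_done = any(
            (out_dir / f"{pdf.stem}{suffix}").exists() for suffix in OUTPUT_SUFFIXES
        )
        if not already_done:
            pending.append(pdf)
    return pending
```

Under this assumption, deleting a CSV from data/ would be enough to have its PDF picked up again on the next run.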
After extraction, update the local path in build_parallel_corpus.py and run:
python build_parallel_corpus.py
This script:
- reads all CSV files in the data directory
- identifies the canonical language set
- cleans noisy dictionary markup and cross references
- separates entries into sentence-like and word-like corpora
- writes:
  - parallel_sentences.csv
  - parallel_words.csv
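This README does not state how the script decides whether an entry is sentence-like or word-like. One plausible heuristic, shown purely as an assumption, is to look at token count and closing punctuation; `is_sentence_like`, `split_rows`, and the `min_tokens` threshold are hypothetical.

```python
def is_sentence_like(text, min_tokens=4):
    """Guess whether text is a sentence: enough tokens, or sentence-final punctuation."""
    tokens = text.split()
    return len(tokens) >= min_tokens or text.rstrip().endswith((".", "?", "!"))


def split_rows(rows):
    """Split (source, target) pairs into sentence-like and word-like lists."""
    sentences, words = [], []
    for source, target in rows:
        # A pair counts as sentence-like if either side looks like a sentence.
        if is_sentence_like(source) or is_sentence_like(target):
            sentences.append((source, target))
        else:
            words.append((source, target))
    return sentences, words
```

A rule this simple would misfile entries that end in an abbreviation period, so a real cleaner would likely need extra checks.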
archives2data/
├── build_parallel_corpus.py
├── document_recipe.json
├── extract_data_gemini.py
├── data/
│   └── twi/
└── sources/
    └── twi/
Contributions are welcome. Because this repository is meant to grow useful local-language data, the most helpful contribution is often a new PDF source plus the corresponding extracted CSV.
Please consider adding PDFs that are:
- dictionaries or glossaries
- bilingual or parallel
- scanned but readable
- high quality and text-rich
- useful for local-language AI
- legally shareable or otherwise permitted for redistribution
- Fork the repository and create a feature branch.
- Add the PDF to the appropriate folder under sources/.
  - For example: sources/twi/
- If needed, run extract_data_gemini.py to generate the matching CSV.
- Run build_parallel_corpus.py if the new data should be included in the shared parallel corpora.
- Verify that the output looks correct and that special characters are preserved.
- Open a pull request.
Please describe:
- the document title
- source or URL, if available
- language(s)
- whether it is monolingual or parallel
- why the document is useful
- any licensing or redistribution notes
- Do not commit API keys or secrets.
- Prefer descriptive filenames for new PDFs and CSVs.
- Avoid adding low-quality scans unless they are uniquely valuable.
- If you are adding generated CSVs, make sure the source PDF is also included or clearly referenced.
This project is intentionally designed to preserve local-language characters and orthography. When contributing new documents, please be careful to keep:
- diacritics
- special characters such as ɛ, ɔ, and ŋ
- standard orthography for the target language
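A quick way to spot damaged characters before opening a pull request might look like the sketch below. It is not part of the repository: `check_text` is a hypothetical helper, the choice of NFC as the expected normalization form is an assumption, and the mojibake test only catches UTF-8 text that was mis-decoded as Windows-1252.

```python
import unicodedata


def check_text(text):
    """Return a list of likely encoding problems found in text (empty if none)."""
    problems = []
    if "\ufffd" in text:
        # U+FFFD appears when bytes could not be decoded.
        problems.append("replacement character")
    if not unicodedata.is_normalized("NFC", text):
        problems.append("not NFC-normalized")
    try:
        # If the text survives a cp1252 -> UTF-8 round trip and changes,
        # it was probably UTF-8 read with the wrong codec.
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        repaired = text
    if repaired != text:
        problems.append("possible mojibake")
    return problems
```

Running such a check over each cell of a generated CSV would flag rows worth inspecting by hand; an empty result does not prove the text is correct.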
If a document contains multiple variants, synonyms, or semicolon-separated subentries, prefer a clean structured extraction that splits them into usable rows.
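Splitting semicolon-separated subentries into rows could be sketched as below. The pairing rules are assumptions made for illustration (`split_subentries` is a hypothetical helper): equal counts are paired by position, a single headword is repeated for each gloss, and anything ambiguous is left as one row for manual review.

```python
def split_subentries(source, target, sep=";"):
    """Expand one (source, target) pair with separated subentries into several rows."""
    sources = [s.strip() for s in source.split(sep) if s.strip()]
    targets = [t.strip() for t in target.split(sep) if t.strip()]
    if len(sources) == len(targets):
        # Assumption: same number of parts means they align position by position.
        return list(zip(sources, targets))
    if len(sources) == 1:
        # One headword with several glosses: one row per gloss.
        return [(sources[0], t) for t in targets]
    if len(targets) == 1:
        # Several variants sharing one gloss: one row per variant.
        return [(s, targets[0]) for s in sources]
    # Ambiguous counts: keep the original pair untouched.
    return [(source.strip(), target.strip())]
```

For example, `split_subentries("nsuo", "water; rain")` yields one row for each gloss, both keyed to the same headword.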
Only add documents that you have permission to process and redistribute. If a PDF has unclear licensing, check the source first or avoid including it in a public pull request.
This work is aligned with the broader GhanaNLP effort to build better resources for local-language NLP and evaluation.