A Python tool that extracts university ranking data from the Webometrics Ranking PDF and converts it into a queryable CSV format.
This project extracts structured data from the Webometrics Ranking of Universities PDF (July 2025 edition) containing over 31,000 ranked educational institutions worldwide. The extraction tool processes the PDF's text-based format and produces a clean CSV file that can be efficiently queried with Python pandas.
- Extracts ranking data from a 921-page PDF document
- Converts to CSV format for easy querying and integration
- Handles edge cases including entries with and without ROR (Research Organization Registry) URLs
- Preserves data fidelity - captures all entries, including multiple institutions sharing the same rank
- Provides progress feedback during extraction with real-time statistics
The extracted CSV contains the following fields:
- world_rank: Numeric ranking from 1 to 31,869 (multiple institutions can share the same rank)
- name: Full name of the educational institution
- ror_url: Research Organization Registry identifier URL (may be empty for some entries)
- page: Source page number in the original PDF
Data Range:
- Highest ranked: Harvard University (#1)
- Lowest ranked: Various institutions at rank #31,869
- Total entries: Over 28,000 institutions
The extraction process uses pdfplumber to parse text from the PDF since the document contains text-based tables rather than structured table objects. The script:
- Skips content pages: Automatically skips the first 4 pages which contain metadata and documentation
- Extracts text: Processes each page's text content line by line
- Pattern matching: Uses regex patterns to identify data rows in two formats:
  - With ROR URL: rank name https://ror.org/... optional_rank
  - Without ROR URL: rank name rank (common on later pages)
- Data validation: Filters out empty entries and validates rank consistency
- Output generation: Creates a clean CSV file sorted by world rank
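The two row formats described above can be sketched with a pair of regular expressions. This is a minimal illustration, not the code in extract_to_csv.py: the pattern names, the `parse_line` helper, and the sample lines are hypothetical, and it assumes that "rank consistency" means a trailing rank, when present, must equal the leading rank.

```python
import re
from typing import Optional

# Hypothetical patterns; the real script's expressions may differ.
# Format 1: rank, name, ROR URL, optional trailing rank.
WITH_ROR = re.compile(r"^(\d+)\s+(.+?)\s+(https://ror\.org/\S+)(?:\s+(\d+))?$")
# Format 2: rank, name, trailing rank (no ROR URL).
WITHOUT_ROR = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)$")


def parse_line(line: str, page: int) -> Optional[dict]:
    """Return a row dict for a data line, or None for anything else."""
    text = line.strip()
    match = WITH_ROR.match(text)
    if match:
        rank, name, ror_url, trailing = match.groups()
    else:
        match = WITHOUT_ROR.match(text)
        if not match:
            return None  # header, footer, or other non-data line
        rank, name, trailing = match.groups()
        ror_url = ""
    # Assumed validation rule: a trailing rank must agree with the leading one.
    if trailing is not None and int(trailing) != int(rank):
        return None
    return {"world_rank": int(rank), "name": name, "ror_url": ror_url, "page": page}


# Made-up sample lines, not taken from the PDF.
print(parse_line("1 Example University https://ror.org/0example00 1", page=5))
print(parse_line("31869 Example College 31869", page=921))
print(parse_line("Ranking Web of Universities", page=5))
```

The URL pattern is tried first so that a line containing a ROR URL is never mistaken for the shorter format.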
- Python 3.7+
- Virtual environment (recommended)
- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
  - pdfplumber: PDF text extraction
  - pandas: Data manipulation and CSV export
- Ensure Webometrics.pdf is in the project root directory
- Run the extraction script:
python extract_to_csv.py
The script will:
- Show real-time progress with page count, percentage complete, rows extracted, and estimated time remaining
- Process all 917 data pages (pages 5-921)
- Generate webometrics_rankings.csv in the project root
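As a rough illustration of the progress line described above, the sketch below builds one status string from the page count, rows extracted, and elapsed time. The `format_progress` name and the output layout are assumptions, and the time estimate simply assumes the remaining pages take the average time per page seen so far; the actual script may compute and print this differently.

```python
def format_progress(pages_done: int, total_pages: int, rows: int, elapsed_seconds: float) -> str:
    """Build a one-line progress message with a linear time estimate."""
    percent = 100.0 * pages_done / total_pages
    if pages_done > 0:
        # Average seconds per page so far, multiplied by the pages left.
        remaining = (elapsed_seconds / pages_done) * (total_pages - pages_done)
    else:
        remaining = 0.0
    return (
        f"Page {pages_done}/{total_pages} ({percent:.1f}%) | "
        f"rows: {rows} | ETA: {remaining:.0f}s"
    )


print(format_progress(50, 100, 10, 5.0))
```

In a real loop the elapsed time would typically come from `time.monotonic()` sampled before the first page and again after each page.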
Once extracted, use pandas to query the CSV:
import pandas as pd
# Load the CSV
df = pd.read_csv("webometrics_rankings.csv")
# Search by university name
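def get_rank_by_name(name: str):
    # Case-insensitive literal substring match; regex=False keeps characters
    # such as "(" or "." in a name from being read as a pattern
    mask = df["name"].str.lower().str.contains(name.lower(), regex=False, na=False)
    result = df[mask].sort_values("world_rank").head(10)
    return result[["world_rank", "name", "ror_url"]]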
# Example queries
print(get_rank_by_name("Stanford"))
print(get_rank_by_name("Harvard"))
# Find all institutions at a specific rank
rank_1 = df[df["world_rank"] == 1]
print(f"Institutions ranked #1: {len(rank_1)}")
# Get top 10 universities
top_10 = df.head(10)
print(top_10[["world_rank", "name"]])
The script generates webometrics_rankings.csv with the following characteristics:
- Format: Standard CSV with headers
- Encoding: UTF-8
- Sorting: Ordered by world rank (ascending)
- Empty ROR URLs: Represented as empty strings for institutions without ROR identifiers
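One practical consequence of storing missing ROR URLs as empty strings: with default settings, `pd.read_csv` loads empty fields as NaN rather than `""`. The sketch below shows one way to normalize them back after loading. The inline sample rows are made up for illustration; in practice the path would be webometrics_rankings.csv.

```python
import io

import pandas as pd

# Made-up sample in the documented column layout (not real ranking data).
sample_csv = io.StringIO(
    "world_rank,name,ror_url,page\n"
    "1,Example University,https://ror.org/0example00,5\n"
    "31869,Example College,,921\n"
)

df = pd.read_csv(sample_csv)
# Empty fields arrive as NaN by default; convert them back to empty strings.
df["ror_url"] = df["ror_url"].fillna("")

# Institutions without a ROR identifier.
without_ror = df[df["ror_url"] == ""]
print(without_ror[["world_rank", "name"]])
```

Normalizing once after loading keeps later filters simple, since comparisons against `""` then behave as expected.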
webometrics-rankings/
├── README.md # This file
├── requirements.txt # Python dependencies
├── extract_to_csv.py # Main extraction script
├── inspect_pdf.py # PDF inspection utility (optional)
├── Webometrics.pdf # Source PDF file
└── webometrics_rankings.csv # Generated output (after running)
- Primary source: Ranking Web of Universities (webometrics.info), July 2025 edition
- ROR information: Research Organization Registry
- Additional data: Zenodo records
- This is a one-time extraction tool - the CSV becomes the canonical dataset
- The PDF's first 4 pages contain metadata/documentation and are automatically skipped
- Many institutions share the same rank (especially at rank 31,869)
- Some entries on later pages do not have ROR URLs assigned
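Because ranks are shared, a rank value does not identify a single row. A small sketch of how shared ranks could be counted with pandas follows; the rows are made up for illustration, and on the real data the DataFrame would come from `pd.read_csv("webometrics_rankings.csv")`.

```python
import pandas as pd

# Made-up rows in the documented column layout (not real ranking data).
df = pd.DataFrame(
    {
        "world_rank": [1, 2, 31869, 31869, 31869],
        "name": ["Example A", "Example B", "Example C", "Example D", "Example E"],
    }
)

# Number of institutions at each rank, keeping only ranks held by more than one.
rank_sizes = df.groupby("world_rank").size()
shared_ranks = rank_sizes[rank_sizes > 1]
print(shared_ranks)
```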
This extraction tool is provided as-is. The underlying ranking data is from Webometrics and subject to their terms of use.