
Webometrics Rankings Extractor

A Python tool that extracts university ranking data from the Webometrics Ranking PDF and converts it into a queryable CSV format.

Overview

This project extracts structured data from the Webometrics Ranking of Universities PDF (July 2025 edition) containing over 31,000 ranked educational institutions worldwide. The extraction tool processes the PDF's text-based format and produces a clean CSV file that can be efficiently queried with Python pandas.

What It Does

  • Extracts ranking data from a 921-page PDF document
  • Converts to CSV format for easy querying and integration
  • Handles edge cases including entries with and without ROR (Research Organization Registry) URLs
  • Preserves data fidelity: captures all entries, including multiple institutions that share the same rank
  • Provides progress feedback during extraction with real-time statistics

Data Structure

The extracted CSV contains the following fields:

  • world_rank: Numeric ranking from 1 to 31,869 (multiple institutions can share the same rank)
  • name: Full name of the educational institution
  • ror_url: Research Organization Registry identifier URL (may be empty for some entries)
  • page: Source page number in the original PDF
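To make the four-column layout concrete, here is a minimal sketch of parsing rows in that shape. The sample values below are illustrative only (the ROR path is a placeholder, not a real identifier):

```python
import csv
import io

# Illustrative sample in the same four-column layout the extractor emits;
# the ROR URL below is a placeholder, not a real identifier.
sample = io.StringIO(
    "world_rank,name,ror_url,page\n"
    "1,Harvard University,https://ror.org/example,5\n"
    "2,Example Institute,,5\n"
)

rows = list(csv.DictReader(sample))
print(rows[0]["name"])     # Harvard University
print(rows[1]["ror_url"])  # empty string: no ROR identifier assigned
```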

Data Range:

  • Highest ranked: Harvard University (#1)
  • Lowest ranked: Various institutions at rank #31,869
  • Total entries: Over 28,000 institutions

How It Works

The extraction process uses pdfplumber to parse text from the PDF since the document contains text-based tables rather than structured table objects. The script:

  1. Skips front matter: Automatically skips the first 4 pages, which contain metadata and documentation
  2. Extracts text: Processes each page's text content line by line
  3. Pattern matching: Uses regex patterns to identify data rows in two formats:
    • With ROR URL: rank name https://ror.org/... optional_rank
    • Without ROR URL: rank name rank (common on later pages)
  4. Data validation: Filters out empty entries and validates rank consistency
  5. Output generation: Creates a clean CSV file sorted by world rank
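The two-format pattern matching in step 3 can be sketched as follows. These regexes are an assumption based on the row formats listed above; the exact patterns in extract_to_csv.py may differ:

```python
import re

# Format 1: rank, name, ROR URL, optional trailing duplicate rank.
ROR_ROW = re.compile(r"^(\d+)\s+(.+?)\s+(https://ror\.org/\S+)(?:\s+\d+)?$")
# Format 2: rank, name, trailing rank (rows without a ROR URL).
PLAIN_ROW = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)$")

def parse_line(line: str):
    """Return (rank, name, ror_url) for a data row, or None for non-data lines."""
    m = ROR_ROW.match(line)
    if m:
        return int(m.group(1)), m.group(2), m.group(3)
    m = PLAIN_ROW.match(line)
    if m and m.group(1) == m.group(3):  # rank consistency check
        return int(m.group(1)), m.group(2), ""
    return None

print(parse_line("1 Harvard University https://ror.org/example 1"))
print(parse_line("31869 Example Institute 31869"))
print(parse_line("Webometrics Ranking"))  # None: not a data row
```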

Setup

Requirements

  • Python 3.7+
  • Virtual environment (recommended)

Installation

  1. Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt

Dependencies

  • pdfplumber - PDF text extraction
  • pandas - Data manipulation and CSV export

Usage

Running the Extraction

  1. Ensure Webometrics.pdf is in the project root directory
  2. Run the extraction script:
python extract_to_csv.py

The script will:

  • Show real-time progress with page count, percentage complete, rows extracted, and estimated time remaining
  • Process all 917 data pages (pages 5-921)
  • Generate webometrics_rankings.csv in the project root

Querying the Data

Once extracted, use pandas to query the CSV:

import pandas as pd

# Load the CSV
df = pd.read_csv("webometrics_rankings.csv")

# Search by university name (case-insensitive; na=False guards against missing names)
def get_rank_by_name(name: str):
    mask = df["name"].str.contains(name, case=False, na=False)
    result = df[mask].sort_values("world_rank").head(10)
    return result[["world_rank", "name", "ror_url"]]

# Example queries
print(get_rank_by_name("Stanford"))
print(get_rank_by_name("Harvard"))

# Find all institutions at a specific rank
rank_1 = df[df["world_rank"] == 1]
print(f"Institutions ranked #1: {len(rank_1)}")

# Get top 10 universities
top_10 = df.head(10)
print(top_10[["world_rank", "name"]])

Output

The script generates webometrics_rankings.csv with the following characteristics:

  • Format: Standard CSV with headers
  • Encoding: UTF-8
  • Sorting: Ordered by world rank (ascending)
  • Empty ROR URLs: Represented as empty strings for institutions without ROR identifiers
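A sketch of a final write step consistent with these characteristics, assuming pandas produces the output (the sample rows are hypothetical; in the real script they come from the PDF parse):

```python
import io
import pandas as pd

# Hypothetical extracted rows; the ROR URL is a placeholder.
rows = [
    {"world_rank": 2, "name": "B University", "ror_url": None, "page": 5},
    {"world_rank": 1, "name": "A University", "ror_url": "https://ror.org/example", "page": 5},
]

df = pd.DataFrame(rows)
df = df.sort_values("world_rank")         # ascending by world rank
df["ror_url"] = df["ror_url"].fillna("")  # empty string, not NaN, for missing ROR

# In the script this would target the file directly:
# df.to_csv("webometrics_rankings.csv", index=False, encoding="utf-8")
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # header row: world_rank,name,ror_url,page
```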

Project Structure

webometrics-rankings/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── extract_to_csv.py           # Main extraction script
├── inspect_pdf.py              # PDF inspection utility (optional)
├── Webometrics.pdf             # Source PDF file
└── webometrics_rankings.csv    # Generated output (after running)

Notes

  • This is a one-time extraction tool - the CSV becomes the canonical dataset
  • The PDF's first 4 pages contain metadata/documentation and are automatically skipped
  • Many institutions share the same rank (especially at rank 31,869)
  • Some entries on later pages do not have ROR URLs assigned

License

This extraction tool is provided as-is. The underlying ranking data is from Webometrics and subject to their terms of use.
