
Webometrics Rankings Extractor

A Python tool that extracts university ranking data from the Webometrics Ranking PDF and converts it into a queryable CSV format.

Overview

This project extracts structured data from the Webometrics Ranking of Universities PDF (July 2025 edition) containing over 31,000 ranked educational institutions worldwide. The extraction tool processes the PDF's text-based format and produces a clean CSV file that can be efficiently queried with Python pandas.

What It Does

  • Extracts ranking data from a 921-page PDF document
  • Converts to CSV format for easy querying and integration
  • Handles edge cases including entries with and without ROR (Research Organization Registry) URLs
  • Preserves data fidelity: captures all entries, including multiple institutions that share the same rank
  • Provides progress feedback during extraction with real-time statistics

Data Structure

The extracted CSV contains the following fields:

  • world_rank: Numeric ranking from 1 to 31,869 (multiple institutions can share the same rank)
  • name: Full name of the educational institution
  • ror_url: Research Organization Registry identifier URL (may be empty for some entries)
  • page: Source page number in the original PDF
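To make the four-column layout concrete, here is a minimal sketch of parsing rows in that shape. The sample values below are illustrative only (the ROR path is a placeholder, not a real identifier):

```python
import csv
import io

# Illustrative sample in the same four-column layout the extractor emits;
# the ROR URL below is a placeholder, not a real identifier.
sample = io.StringIO(
    "world_rank,name,ror_url,page\n"
    "1,Harvard University,https://ror.org/example,5\n"
    "2,Example Institute,,5\n"
)

rows = list(csv.DictReader(sample))
print(rows[0]["name"])     # Harvard University
print(rows[1]["ror_url"])  # empty string: no ROR identifier assigned
```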

Data Range:

  • Highest ranked: Harvard University (#1)
  • Lowest ranked: Various institutions at rank #31,869
  • Total entries: Over 28,000 institutions

How It Works

The extraction process uses pdfplumber to parse text from the PDF since the document contains text-based tables rather than structured table objects. The script:

  1. Skips front matter: Automatically skips the first 4 pages, which contain metadata and documentation
  2. Extracts text: Processes each page's text content line by line
  3. Pattern matching: Uses regex patterns to identify data rows in two formats:
    • With ROR URL: rank name https://ror.org/... optional_rank
    • Without ROR URL: rank name rank (common on later pages)
  4. Data validation: Filters out empty entries and validates rank consistency
  5. Output generation: Creates a clean CSV file sorted by world rank
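The two-format pattern matching in step 3 can be sketched as follows. These regexes are an assumption based on the row formats listed above; the exact patterns in extract_to_csv.py may differ:

```python
import re

# Format 1: rank, name, ROR URL, optional trailing duplicate rank.
ROR_ROW = re.compile(r"^(\d+)\s+(.+?)\s+(https://ror\.org/\S+)(?:\s+\d+)?$")
# Format 2: rank, name, trailing rank (rows without a ROR URL).
PLAIN_ROW = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)$")

def parse_line(line: str):
    """Return (rank, name, ror_url) for a data row, or None for non-data lines."""
    m = ROR_ROW.match(line)
    if m:
        return int(m.group(1)), m.group(2), m.group(3)
    m = PLAIN_ROW.match(line)
    if m and m.group(1) == m.group(3):  # rank consistency check
        return int(m.group(1)), m.group(2), ""
    return None

print(parse_line("1 Harvard University https://ror.org/example 1"))
print(parse_line("31869 Example Institute 31869"))
print(parse_line("Webometrics Ranking"))  # None: not a data row
```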

Setup

Requirements

  • Python 3.7+
  • Virtual environment (recommended)

Installation

  1. Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  2. Install dependencies:
pip install -r requirements.txt

Dependencies

  • pdfplumber - PDF text extraction
  • pandas - Data manipulation and CSV export

Usage

Running the Extraction

  1. Ensure Webometrics.pdf is in the project root directory
  2. Run the extraction script:
python extract_to_csv.py

The script will:

  • Show real-time progress with page count, percentage complete, rows extracted, and estimated time remaining
  • Process all 917 data pages (pages 5-921)
  • Generate webometrics_rankings.csv in the project root

Querying the Data

Once extracted, use pandas to query the CSV:

import pandas as pd

# Load the CSV
df = pd.read_csv("webometrics_rankings.csv")

# Search by university name (case-insensitive; na=False guards against missing names)
def get_rank_by_name(name: str):
    mask = df["name"].str.contains(name, case=False, na=False)
    result = df[mask].sort_values("world_rank").head(10)
    return result[["world_rank", "name", "ror_url"]]

# Example queries
print(get_rank_by_name("Stanford"))
print(get_rank_by_name("Harvard"))

# Find all institutions at a specific rank
rank_1 = df[df["world_rank"] == 1]
print(f"Institutions ranked #1: {len(rank_1)}")

# Get top 10 universities
top_10 = df.head(10)
print(top_10[["world_rank", "name"]])

Output

The script generates webometrics_rankings.csv with the following characteristics:

  • Format: Standard CSV with headers
  • Encoding: UTF-8
  • Sorting: Ordered by world rank (ascending)
  • Empty ROR URLs: Represented as empty strings for institutions without ROR identifiers
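A sketch of a final write step consistent with these characteristics, assuming pandas produces the output (the sample rows are hypothetical; in the real script they come from the PDF parse):

```python
import io
import pandas as pd

# Hypothetical extracted rows; the ROR URL is a placeholder.
rows = [
    {"world_rank": 2, "name": "B University", "ror_url": None, "page": 5},
    {"world_rank": 1, "name": "A University", "ror_url": "https://ror.org/example", "page": 5},
]

df = pd.DataFrame(rows)
df = df.sort_values("world_rank")         # ascending by world rank
df["ror_url"] = df["ror_url"].fillna("")  # empty string, not NaN, for missing ROR

# In the script this would target the file directly:
# df.to_csv("webometrics_rankings.csv", index=False, encoding="utf-8")
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # header row: world_rank,name,ror_url,page
```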

Project Structure

webometrics-rankings/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── extract_to_csv.py           # Main extraction script
├── inspect_pdf.py              # PDF inspection utility (optional)
├── Webometrics.pdf             # Source PDF file
└── webometrics_rankings.csv    # Generated output (after running)

Notes

  • This is a one-time extraction tool - the CSV becomes the canonical dataset
  • The PDF's first 4 pages contain metadata/documentation and are automatically skipped
  • Many institutions share the same rank (especially at rank 31,869)
  • Some entries on later pages do not have ROR URLs assigned

License

This extraction tool is provided as-is. The underlying ranking data is from Webometrics and subject to their terms of use.
