
📑 DataScope: Multi-Source Salary Analytics



Executive Summary

| KPI | Value |
| --- | --- |
| Unified records | 5,370 (LinkedIn + Kaggle DS + Spain multi-source) |
| Countries | 50 with PPP-adjusted salary comparisons |
| Spain median salary | €43,200/yr (PPP-adjusted: $66,683 USD) |
| Best-paid role (Spain) | Data Scientist (Senior median €61.5K) |
| Global experience → salary | r = 0.49 (moderate-strong, p < 0.00001) |
| Title normalization | 1,037 → 15 roles, 99.8% coverage |
| Eurostat PPP countries | 47 (1960–2025 population + GDP) |
| Test coverage | 79 tests (70 title normalization + pipeline + CI) |
| Dashboard | Streamlit + Plotly interactive |

Dataset Composition

| Source | Records | Geography | Year |
| --- | --- | --- | --- |
| LinkedIn Job Postings (Kaggle) | 1,831 | Global (US/UK centric) | 2023 |
| Kaggle DS Salaries | 607 | Global multi-year | 2020–2024 |
| Manfred 2026 Salary Guide | 2,700 | Spain (triangular sampling from published bands) | 2026 |
| Kaggle DS 2024 Spain | 127 | Spain | 2024 |
| Glassdoor ES | 105 | Spain | 2024 |
| Eurostat PPP | 1,268 rows × 47 countries | EU + neighbours | 1960–2025 |
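
The Manfred row above notes that its records come from triangular sampling over published salary bands. A minimal sketch of that idea follows, assuming each band is described by a low, most-likely, and high value in EUR. The function name `sample_band` and the band values are hypothetical illustrations; they are not taken from the Manfred guide or from `load_spain_2026.py`.

```python
import random


def sample_band(low: float, mode: float, high: float, n: int, seed: int = 42) -> list[float]:
    """Draw n synthetic salaries from a triangular distribution over one band."""
    rng = random.Random(seed)  # seeded so the synthetic records are reproducible
    # random.triangular takes its arguments in the order (low, high, mode)
    return [round(rng.triangular(low, high, mode), 2) for _ in range(n)]


# Hypothetical band for illustration only: EUR 35K to 55K, most likely 43K
salaries = sample_band(low=35_000, mode=43_000, high=55_000, n=5)
print(salaries)
```

Records produced this way are synthetic draws consistent with a band, not observed salaries, which is worth keeping in mind when comparing the Manfred rows with the Kaggle and Glassdoor rows.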

The global unified dataset (data/global/global_salaries_unified.csv) merges the LinkedIn, Kaggle DS, and Spain sources into a single schema (5,370 records). The Spain 2026 dataset (data/espana/spain_salaries_2026_unified.csv) combines Manfred, Glassdoor ES, and Kaggle ES (2,932 records) and applies title normalization.
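
Merging sources into a single schema usually comes down to renaming each source's columns to shared names and tagging every row with its origin. The sketch below illustrates this under assumptions: the column names in `COLUMN_MAP` and the sample rows are hypothetical, since the README does not list the actual schema used by `load_global_salaries.py`.

```python
# Hypothetical per-source column names mapped onto one shared schema.
COLUMN_MAP = {
    "linkedin": {"title": "job_title", "med_salary": "salary", "country": "country"},
    "kaggle_ds": {
        "job_title": "job_title",
        "salary_in_usd": "salary",
        "company_location": "country",
    },
}


def to_unified(row: dict, source: str) -> dict:
    """Rename a source row's columns to the shared schema and tag its origin."""
    mapping = COLUMN_MAP[source]
    unified = {target: row[original] for original, target in mapping.items()}
    unified["source"] = source
    return unified


# Illustrative rows only; not taken from the real datasets
rows = [
    to_unified({"title": "Data Analyst", "med_salary": 70000, "country": "US"}, "linkedin"),
    to_unified(
        {"job_title": "Data Scientist", "salary_in_usd": 120000, "company_location": "GB"},
        "kaggle_ds",
    ),
]
print(rows)
```

Keeping a `source` column makes it possible to check later how much each source contributes to a given statistic.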


Pipeline Architecture

LinkedIn ──┐
Kaggle DS ─┤ → load_global_salaries.py → global_salaries_unified.csv
Spain ─────┘                                │
                                            ├──→ DuckDB analytics (5 SQL queries)
                                            │     ├── 01_median_salary_by_role
                                            │     ├── 02_median_salary_by_country
                                            │     ├── 03_seniority_premium
Manfred 2026 ─┐                             │     ├── 04_top_roles_by_country
Kaggle ES ────┤ → load_spain_2026.py        │     └── 05_market_sizing
Glassdoor ES ─┘                             │
                                            └──→ Streamlit dashboard
Eurostat ──────→ load_eurostat.py → ppp_rates.csv → PPP-adjusted analysis
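
The SQL behind the five analytics queries is not shown in this README. As a rough illustration of what a query named `01_median_salary_by_role` presumably computes, here is a plain-Python sketch of a grouped median. The field names `role` and `salary` and the sample records are assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def median_salary_by_role(records: list[dict]) -> dict[str, float]:
    """Group salaries by role and return the median of each group."""
    groups: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        if rec["salary"] is not None:  # skip records with no disclosed salary
            groups[rec["role"]].append(rec["salary"])
    return {role: median(values) for role, values in sorted(groups.items())}


# Toy records for illustration only
records = [
    {"role": "Data Scientist", "salary": 60_000},
    {"role": "Data Scientist", "salary": 64_000},
    {"role": "Data Scientist", "salary": 70_000},
    {"role": "Data Analyst", "salary": 40_000},
    {"role": "Data Analyst", "salary": 44_000},
    {"role": "Data Engineer", "salary": None},
]
print(median_salary_by_role(records))
```

Medians are a natural choice for salary data because a few very high values shift them far less than they shift a mean.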

Key scripts

| Script | Purpose |
| --- | --- |
| scripts/load_global_salaries.py | Unified global dataset builder (LinkedIn + Kaggle + Spain) |
| scripts/load_spain_2026.py | Spain multi-source loader (Manfred + Kaggle + Glassdoor) |
| scripts/load_eurostat.py | Eurostat TSV downloader + PPP rate computation |
| scripts/normalize_titles.py | Title normalization (1,037 → 15 roles, seniority extraction, threshold grouping) |
| scripts/run_analytics.py | DuckDB analytics engine (5 queries) |
| dashboard/app.py | Streamlit interactive dashboard |
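
The table describes title normalization as three steps: mapping raw titles to canonical roles, extracting seniority, and grouping rare roles under a threshold. The sketch below shows one way those steps could fit together. The patterns, the `"Mid"` default, and the function names are hypothetical; the actual rules in `normalize_titles.py` are not shown in this README.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the real rule set is larger (15 roles).
ROLE_PATTERNS = [
    (re.compile(r"data\s+scien", re.I), "Data Scientist"),
    (re.compile(r"data\s+engineer", re.I), "Data Engineer"),
    (re.compile(r"data\s+analy", re.I), "Data Analyst"),
]
SENIORITY_PATTERNS = [
    (re.compile(r"\b(senior|sr|lead|principal)\b", re.I), "Senior"),
    (re.compile(r"\b(junior|jr|intern|entry)\b", re.I), "Junior"),
]


def normalize(title: str) -> tuple[str, str]:
    """Return (canonical role, seniority) for a raw job title."""
    role = next((name for pat, name in ROLE_PATTERNS if pat.search(title)), "Other")
    # Assumption: titles with no seniority keyword default to "Mid"
    seniority = next((lvl for pat, lvl in SENIORITY_PATTERNS if pat.search(title)), "Mid")
    return role, seniority


def group_rare(roles: list[str], min_count: int) -> list[str]:
    """Threshold grouping: relabel roles seen fewer than min_count times as 'Other'."""
    counts = Counter(roles)
    return [role if counts[role] >= min_count else "Other" for role in roles]


print(normalize("Sr. Data Scientist (Remote)"))
print(group_rare(["Data Analyst", "Data Analyst", "Data Engineer"], min_count=2))
```

Coverage in this framing would be the share of titles that map to something other than `"Other"`.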

Key Findings

Spain Market (2026)

  • Data Scientist is Spain's best-paid data role (Senior median: €61.5K)
  • Spain median salary: €43,200/yr raw → $66,683 USD PPP-adjusted (+54% vs EU average)
  • Manfred 2026 makes up 92% of the Spain dataset (2,700 of 2,932 records); it is also the most current and Spain-specific source
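
A PPP adjustment divides a local-currency salary by a PPP conversion rate expressed in local currency units per PPP dollar. The sketch below shows the arithmetic, assuming that convention. The rate used here is simply the one implied by the two figures quoted above (€43,200 and $66,683); it is not read from `ppp_rates.csv`, and `ppp_adjust` is a hypothetical helper name.

```python
def ppp_adjust(salary_local: float, ppp_rate: float) -> float:
    """Convert a local-currency salary to PPP dollars.

    ppp_rate is expressed as local currency units per PPP dollar.
    """
    return salary_local / ppp_rate


# Rate implied by the README's own figures, used here only to show the arithmetic
implied_rate = 43_200 / 66_683
print(round(implied_rate, 3))                   # 0.648 EUR per PPP dollar
print(round(ppp_adjust(43_200, implied_rate)))  # 66683
```

A rate below 1 means a euro buys more in Spain than a dollar buys in the reference economy, so the PPP-adjusted figure is higher than the raw one.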

Global Patterns

  • Experience → Salary: r = 0.49 (p < 0.00001), the strongest factor across all roles
  • Remote premium: −1.2% (p = 0.46), no significant effect for data roles
  • Views → Applies: r = 0.91, so visibility is strongly associated with applications
  • Salary MNAR (missing not at random): 70.87% missing; juniors are disproportionately affected (47.33% hide salary vs 23.86% of seniors)

Methodological Correction

Initial analysis used all LinkedIn professions (~75K salaries). Corrected metrics isolate Data Roles only (1,831 records):

| Metric | All Professions | Data Roles Only |
| --- | --- | --- |
| Remote premium | +45.1% | −1.2% (p=0.46) |
| Views → Applies | r = 0.62 | r = 0.91 |
| Experience → Salary | r = 0.43 | r = 0.49 |
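
A remote premium can be expressed as the relative gap between typical remote and typical on-site pay. The README does not say whether means or medians were used, so the sketch below assumes medians; `remote_premium` and the toy salaries are hypothetical.

```python
from statistics import median


def remote_premium(remote: list[float], onsite: list[float]) -> float:
    """Relative gap between median remote and median on-site salary, in percent."""
    return (median(remote) / median(onsite) - 1) * 100


# Toy salaries for illustration only
remote = [98_000, 100_000, 101_000]
onsite = [99_000, 101_000, 103_000]
print(f"{remote_premium(remote, onsite):+.1f}%")
```

Because the result depends entirely on which postings fall into each group, restricting the input to data roles can change both the size and the sign of the premium, as the table shows.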

Interactive Dashboard

uv run streamlit run dashboard/app.py

Or with Docker:

docker compose up

Five tabs: LinkedIn Salaries → Global Unified → Spain 2026 → Analytics → Bias Analysis.


Local Setup

# Clone
git clone https://github.com/juandelaf1/DataScope.git
cd DataScope

# Install with uv
uv sync

# Run the full pipeline (optional; data is pre-computed)
uv run python scripts/load_global_salaries.py
uv run python scripts/run_analytics.py

# Launch dashboard
uv run streamlit run dashboard/app.py

# Run tests
uv run pytest tests/ -v

Docker Deployment

# Build and run
docker compose up

# Or pull from DockerHub
docker pull juandelaf1/datascope
docker run -p 8501:8501 juandelaf1/datascope

Then open http://localhost:8501.


Tech Stack

| Category | Tools |
| --- | --- |
| Language | Python 3.11 |
| Data | pandas, numpy, DuckDB |
| Analytics | DuckDB SQL, scipy |
| Visualization | Plotly, Streamlit |
| Scraping | scrapling (Chrome TLS impersonation) |
| Eurostat | Custom TSV downloader (load_eurostat.py) |
| Pipeline | uv, modular Python scripts |
| CI/CD | GitHub Actions (79 tests) |
| Deployment | Docker, docker-compose |
| Dataset | Kaggle (global + Spain) |

Project Structure

DataScope/
├── .github/workflows/ci.yml           # CI pipeline (79 tests)
├── dashboard/app.py                    # Streamlit interactive dashboard
├── Dockerfile + docker-compose.yml     # Containerized deployment
├── notebooks/
│   ├── datascope_eda_linkedin.ipynb
│   ├── datascope_eda_enhanced.ipynb
│   ├── datascope_bias_analysis.ipynb
│   └── datascope_audit.ipynb
├── scripts/
│   ├── load_global_salaries.py         # Global dataset builder
│   ├── load_spain_2026.py              # Spain multi-source loader
│   ├── load_eurostat.py                # Eurostat + PPP rates
│   ├── normalize_titles.py             # Title normalization engine
│   ├── run_analytics.py                # DuckDB analytics
│   ├── pipeline/                       # Modular pipeline modules
│   └── generate_*.py                   # Visualization generators
├── queries/                            # 5 DuckDB SQL queries
├── tests/                              # 79 unit tests
├── data/
│   ├── global/global_salaries_unified.csv     # 5,370 records, 50 countries
│   ├── espana/spain_salaries_2026_unified.csv # 2,932 Spain records
│   └── eurostat/ppp_rates.csv                 # 1,268 rows, 47 countries
├── slides/datascope_presentacion.pptx
├── ROADMAP.md
├── CHANGELOG.md
└── pyproject.toml

Credits

Original Team Project (Phase 1, May 2026)

| Role | Name | GitHub |
| --- | --- | --- |
| Data Wrangler & Product Owner | Juan de la Fuente | @juandelaf1 |
| Statistical Analysis | Isabela Téllez | @Isabela-Tellez |
| Data Visualization & Scrum Master | Anas Fady | @Anasfady |
| Ethics & Strategy | Vanessa García | @garciaguadalupevanessa-bit |

Personal Extension (Phase 2–3, June 2026)

Extended fork by Juan de la Fuente:

  • Spain multi-source salary integration (Manfred 2026, Glassdoor ES, Kaggle ES)
  • DuckDB analytics engine (5 SQL queries)
  • Title normalization (1,037 → 15 roles, 99.8% coverage)
  • Eurostat PPP loader (47 countries)
  • Unit tests + CI/CD (79 tests)
  • Docker deployment
  • Full repo cleanup + rebranding

License

MIT License; see LICENSE.
