| KPI | Value |
|---|---|
| Unified records | 5,370 (LinkedIn + Kaggle DS + Spain multi-source) |
| Countries | 50 with PPP-adjusted salary comparisons |
| Spain median salary | €43,200/yr → PPP-adjusted $66,683 USD |
| Best-paid role (Spain) | Data Scientist (Senior median €61.5K) |
| Global experience ↔ salary | r = 0.49 (moderate-strong, p < 0.00001) |
| Title normalization | 1,037 → 15 roles, 99.8% coverage |
| Eurostat PPP countries | 47 (1960-2025 population + GDP) |
| Test coverage | 79 tests (70 title normalization + pipeline + CI) |
| Dashboard | Streamlit + Plotly interactive |
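
A PPP-adjusted salary is a nominal salary divided by a PPP conversion factor (local currency units per international dollar). As a rough illustration only, the sketch below backs out the factor implied by the two Spain figures in the table and applies it again. The function names are hypothetical and are not taken from `load_eurostat.py`.

```python
def ppp_adjust(salary_local: float, ppp_factor: float) -> float:
    """Convert a local-currency salary to PPP-adjusted international dollars.

    ppp_factor is expressed as local currency units per international dollar.
    """
    if ppp_factor <= 0:
        raise ValueError("ppp_factor must be positive")
    return salary_local / ppp_factor


def implied_ppp_factor(salary_local: float, salary_ppp: float) -> float:
    """Back out the conversion factor implied by a published pair of figures."""
    return salary_local / salary_ppp


# Figures quoted in the KPI table: EUR 43,200 nominal, USD 66,683 PPP-adjusted.
factor = implied_ppp_factor(43_200, 66_683)
uplift_pct = (66_683 / 43_200 - 1) * 100

print(f"implied factor: {factor:.4f} EUR per PPP dollar")
print(f"PPP uplift over the nominal figure: {uplift_pct:.0f}%")
print(f"round trip: {ppp_adjust(43_200, factor):,.0f}")
```

The same division applies to any country once its factor is looked up in the PPP rates file.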

| Source | Records | Geography | Year |
|---|---|---|---|
| LinkedIn Job Postings (Kaggle) | 1,831 | Global (US/UK centric) | 2023 |
| Kaggle DS Salaries | 607 | Global multi-year | 2020–2024 |
| Manfred 2026 Salary Guide | 2,700 | Spain (triangular sampling from published bands) | 2026 |
| Kaggle DS 2024 Spain | 127 | Spain | 2024 |
| Glassdoor ES | 105 | Spain | 2024 |
| Eurostat PPP | 1,268 rows × 47 countries | EU + neighbours | 1960–2025 |

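
The Manfred rows are described as triangular samples drawn from published salary bands. A minimal sketch of that idea follows, assuming each band can be read as a (low, mode, high) triple. The band values and the `sample_band` helper are made up for illustration and are not the guide's figures or the project's loader.

```python
import random


def sample_band(low: float, mode: float, high: float, n: int, seed: int = 42) -> list[float]:
    """Draw n synthetic salaries from a triangular distribution over one band."""
    if not low <= mode <= high:
        raise ValueError("expected low <= mode <= high")
    rng = random.Random(seed)  # seeded so the synthetic rows are reproducible
    # Note the argument order of triangular(): (low, high, mode).
    return [round(rng.triangular(low, high, mode), 2) for _ in range(n)]


# Hypothetical band for one role/seniority pair, in EUR per year.
samples = sample_band(low=35_000, mode=45_000, high=60_000, n=200)

print(min(samples), max(samples))
print(sum(samples) / len(samples))  # should sit near (low + mode + high) / 3
```

Because the rows are synthetic draws rather than observed salaries, statistics computed on them reflect the published bands and the sampling assumption, not individual reports.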
The global unified dataset (`data/global/global_salaries_unified.csv`) merges LinkedIn, Kaggle DS, and the Spain multi-source records into a single schema (1,831 + 607 + 2,932 = 5,370 rows). The Spain 2026 dataset (`data/espana/spain_salaries_2026_unified.csv`) combines Manfred, Glassdoor, and Kaggle ES (2,700 + 105 + 127 = 2,932 rows) with title normalization.
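
Merging sources into a single schema mostly comes down to renaming each source's columns to shared names and tagging every row with its origin. The sketch below shows that pattern with plain dictionaries. The per-source column names and the unified column list are assumptions, and the real CSV headers may differ.

```python
# Hypothetical per-source column names; the real CSV headers may differ.
COLUMN_MAPS = {
    "linkedin": {"title": "job_title", "med_salary": "salary", "country": "country"},
    "kaggle_ds": {"job_title": "job_title", "salary_in_usd": "salary", "company_location": "country"},
}
UNIFIED_COLUMNS = ("source", "job_title", "salary", "country")


def to_unified(row: dict, source: str) -> dict:
    """Rename one source row's columns to the shared schema and tag its origin."""
    mapping = COLUMN_MAPS[source]
    unified = {"source": source}
    for src_col, dst_col in mapping.items():
        unified[dst_col] = row.get(src_col)
    # Every unified row carries the same keys, even if a source lacks a field.
    return {col: unified.get(col) for col in UNIFIED_COLUMNS}


def unify(sources: dict[str, list[dict]]) -> list[dict]:
    """Concatenate all sources into one list of unified rows."""
    return [to_unified(row, name) for name, rows in sources.items() for row in rows]


rows = unify({
    "linkedin": [{"title": "Data Analyst", "med_salary": 70000, "country": "US"}],
    "kaggle_ds": [{"job_title": "Data Scientist", "salary_in_usd": 95000, "company_location": "ES"}],
})
for r in rows:
    print(r)
```

Keeping the `source` column makes it possible to filter or weight by origin later, which matters when one source dominates a dataset.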

    LinkedIn  ─┐
    Kaggle DS ─┤ → load_global_salaries.py → global_salaries_unified.csv
    Spain ─────┘                             │
                                             ├──→ DuckDB analytics (5 SQL queries)
                                             │     ├── 01_median_salary_by_role
                                             │     ├── 02_median_salary_by_country
                                             │     ├── 03_seniority_premium
    Manfred 2026 ─┐                          │     ├── 04_top_roles_by_country
    Kaggle ES ────┤ → load_spain_2026.py     │     └── 05_market_sizing
    Glassdoor ES ─┘                          │
                                             └──→ Streamlit dashboard
    Eurostat ──────→ load_eurostat.py → ppp_rates.csv → PPP-adjusted analysis

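
The analytics queries are named after what they compute. As a plain-Python stand-in (not the project's SQL), the sketch below shows one way a median-by-role summary and a seniority premium could be derived. The rows, the field names, and the premium definition (senior median over junior median, minus one) are all assumptions.

```python
from collections import defaultdict
from statistics import median

# Toy rows; field names are hypothetical.
rows = [
    {"role": "Data Scientist", "seniority": "junior", "salary": 32_000},
    {"role": "Data Scientist", "seniority": "junior", "salary": 36_000},
    {"role": "Data Scientist", "seniority": "senior", "salary": 60_000},
    {"role": "Data Scientist", "seniority": "senior", "salary": 64_000},
    {"role": "Data Analyst", "seniority": "junior", "salary": 28_000},
    {"role": "Data Analyst", "seniority": "senior", "salary": 42_000},
]


def median_by(rows: list[dict], *keys: str) -> dict[tuple, float]:
    """Group rows by the given keys and return the median salary per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["salary"])
    return {group: median(values) for group, values in groups.items()}


def seniority_premium(rows: list[dict]) -> dict[str, float]:
    """Percent by which the senior median exceeds the junior median, per role."""
    medians = median_by(rows, "role", "seniority")
    roles = {role for role, _ in medians}
    return {
        role: (medians[(role, "senior")] / medians[(role, "junior")] - 1) * 100
        for role in sorted(roles)
        if (role, "senior") in medians and (role, "junior") in medians
    }


print(median_by(rows, "role"))
print(seniority_premium(rows))
```

Medians are used rather than means so that a few very high salaries do not dominate a role's summary.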

| Script | Purpose |
|---|---|
| `scripts/load_global_salaries.py` | Unified global dataset builder (LinkedIn + Kaggle + Spain) |
| `scripts/load_spain_2026.py` | Spain multi-source loader (Manfred + Kaggle + Glassdoor) |
| `scripts/load_eurostat.py` | Eurostat TSV downloader + PPP rate computation |
| `scripts/normalize_titles.py` | Title normalization (1,037 → 15 roles, seniority extraction, threshold grouping) |
| `scripts/run_analytics.py` | DuckDB analytics engine (5 queries) |
| `dashboard/app.py` | Streamlit interactive dashboard |

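
Title normalization of this kind is commonly done with ordered keyword rules plus a separate seniority pass. The sketch below illustrates the approach on four titles. The rules, role names, and the `coverage` helper are hypothetical and far smaller than a mapping that reduces 1,037 titles to 15 roles.

```python
import re

# Hypothetical keyword rules, checked in order; the real mapping covers 15 roles.
ROLE_RULES = [
    (r"machine learning|\bml\b", "ML Engineer"),
    (r"data scien", "Data Scientist"),
    (r"data engineer", "Data Engineer"),
    (r"analyst|analytics", "Data Analyst"),
]
SENIORITY_RULES = [
    (r"\b(sr|senior|lead|principal|staff)\b", "senior"),
    (r"\b(jr|junior|intern|trainee|graduate)\b", "junior"),
]


def normalize_title(raw: str) -> tuple[str, str]:
    """Map a raw job title to (canonical role, seniority)."""
    # Lowercase and replace punctuation with spaces so word boundaries behave.
    text = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    role = next((name for pat, name in ROLE_RULES if re.search(pat, text)), "Other")
    level = next((name for pat, name in SENIORITY_RULES if re.search(pat, text)), "mid")
    return role, level


def coverage(titles: list[str]) -> float:
    """Share of titles mapped to something other than the fallback role."""
    mapped = sum(normalize_title(t)[0] != "Other" for t in titles)
    return mapped / len(titles)


titles = ["Sr. Data Scientist (NLP)", "Junior Data Analyst", "ML Engineer II", "Office Manager"]
for t in titles:
    print(t, "->", normalize_title(t))
print(f"coverage: {coverage(titles):.1%}")
```

Rule order matters: the first matching pattern wins, so more specific roles should be listed before broader ones.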

- Data Scientist is Spain's best-paid data role (Senior median: €61.5K)
- Spain median salary: €43,200/yr raw → $66,683 USD PPP-adjusted (+54% vs EU average)
- Manfred 2026 dominates the Spain dataset (92%) but is also the most current and Spain-specific
- Experience ↔ Salary: r = 0.49, the dominant factor across all roles
- Remote premium: −1.2% (p = 0.46), no significant effect for data roles
- Views → Applies: r = 0.91, applications track visibility closely
- Salary MNAR: 70.87% missing; juniors disproportionately affected (47.33% hide salary vs 23.86% of seniors)

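
The missing-salary finding rests on comparing how often salary is absent within each seniority group. Here is a small sketch of that comparison on toy rows. The field names and data are invented, and only the method is illustrated.

```python
from collections import Counter


def missing_rate_by_group(rows: list[dict], group_key: str, value_key: str) -> dict[str, float]:
    """Fraction of rows per group whose value is missing (None)."""
    totals, missing = Counter(), Counter()
    for row in rows:
        totals[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Toy postings: salary is hidden more often for juniors than for seniors.
postings = (
    [{"seniority": "junior", "salary": None}] * 3
    + [{"seniority": "junior", "salary": 30_000}] * 2
    + [{"seniority": "senior", "salary": None}] * 1
    + [{"seniority": "senior", "salary": 60_000}] * 4
)

rates = missing_rate_by_group(postings, "seniority", "salary")
print(rates)
```

A gap between groups shows that missingness depends on an observed variable. Whether it also depends on the hidden salary itself, which is what MNAR strictly means, cannot be confirmed from the observed rows alone.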
Initial analysis used all LinkedIn professions (~75K salaries). Corrected metrics isolate Data Roles only (1,831 records):
| Metric | All Professions | Data Roles Only |
|---|---|---|
| Remote premium | +45.1% | −1.2% (p = 0.46) |
| Views → Applies | r = 0.62 | r = 0.91 |
| Experience ↔ Salary | r = 0.43 | r = 0.49 |

    uv run streamlit run dashboard/app.py

Or with Docker:

    docker compose up

Five tabs: LinkedIn Salaries, Global Unified, Spain 2026, Analytics, Bias Analysis.


    # Clone (into a DataScope/ directory so the next step works)
    git clone https://github.com/juandelaf1/Pearsons_Four.git DataScope
    cd DataScope

    # Install with uv
    uv sync

    # Run the full pipeline (optional: data is pre-computed)
    uv run python scripts/load_global_salaries.py
    uv run python scripts/run_analytics.py

    # Launch dashboard
    uv run streamlit run dashboard/app.py

    # Run tests
    uv run pytest tests/ -v

    # Build and run
    docker compose up

    # Or pull from Docker Hub
    docker pull juandelaf1/datascope
    docker run -p 8501:8501 juandelaf1/datascope

Then open http://localhost:8501.

| Category | Tools |
|---|---|
| Language | Python 3.11 |
| Data | pandas, numpy, DuckDB |
| Analytics | DuckDB SQL, scipy |
| Visualization | Plotly, Streamlit |
| Scraping | scrapling (Chrome TLS impersonation) |
| Eurostat | Custom TSV downloader (load_eurostat.py) |
| Pipeline | uv, modular Python scripts |
| CI/CD | GitHub Actions (79 tests) |
| Deployment | Docker, docker-compose |
| Dataset | Kaggle (global + Spain) |

    DataScope/
    ├── .github/workflows/ci.yml           # CI pipeline (79 tests)
    ├── dashboard/app.py                   # Streamlit interactive dashboard
    ├── Dockerfile + docker-compose.yml    # Containerized deployment
    ├── notebooks/
    │   ├── datascope_eda_linkedin.ipynb
    │   ├── datascope_eda_enhanced.ipynb
    │   ├── datascope_bias_analysis.ipynb
    │   └── datascope_audit.ipynb
    ├── scripts/
    │   ├── load_global_salaries.py        # Global dataset builder
    │   ├── load_spain_2026.py             # Spain multi-source loader
    │   ├── load_eurostat.py               # Eurostat + PPP rates
    │   ├── normalize_titles.py            # Title normalization engine
    │   ├── run_analytics.py               # DuckDB analytics
    │   ├── pipeline/                      # Modular pipeline modules
    │   └── generate_*.py                  # Visualization generators
    ├── queries/                           # 5 DuckDB SQL queries
    ├── tests/                             # 79 unit tests
    ├── data/
    │   ├── global/global_salaries_unified.csv      # 5,370 records, 50 countries
    │   ├── espana/spain_salaries_2026_unified.csv  # 2,932 Spain records
    │   └── eurostat/ppp_rates.csv                  # 1,268 rows, 47 countries
    ├── slides/datascope_presentacion.pptx
    ├── ROADMAP.md
    ├── CHANGELOG.md
    └── pyproject.toml

| Role | Name | GitHub |
|---|---|---|
| Data Wrangler & Product Owner | Juan de la Fuente | @juandelaf1 |
| Statistical Analysis | Isabela Téllez | @Isabela-Tellez |
| Data Visualization & Scrum Master | Anas Fady | @Anasfady |
| Ethics & Strategy | Vanessa García | @garciaguadalupevanessa-bit |

Extended fork by Juan de la Fuente:
- Spain multi-source salary integration (Manfred 2026, Glassdoor ES, Kaggle ES)
- DuckDB analytics engine (5 SQL queries)
- Title normalization (1,037 → 15 roles, 99.8% coverage)
- Eurostat PPP loader (47 countries)
- Unit tests + CI/CD (79 tests)
- Docker deployment
- Full repo cleanup + rebranding

MIT License. See LICENSE.
