The ELO Router is a rating-based routing method that ranks LLMs using the Elo rating system, originally developed for chess. It converts historical performance data into pairwise comparisons and computes a global ranking. All queries are routed to the single highest-rated LLM.
This router is inspired by the Elo rating system and RouteLLM:

- RouteLLM: Learning to Route LLMs with Preference Data
  - Ong, I., et al. (2024). arXiv:2406.18665. Published at ICLR 2025.
  - Implements the `sw_ranking` router using a weighted Elo calculation.
- Original Elo System:
  - Elo, A. E. (1978). *The Rating of Chessplayers, Past and Present*. Arco Publishing.
- Application to LLMs:
  - Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS.
  - Bradley-Terry Model: Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs." Biometrika.
```
Historical Data → Pairwise Battles → Elo Computation → Single Best Model Selection
                  (winner/loser)     (Logistic Regression MLE)
```
Training Phase:

- For each query in the training data, identify the best-performing LLM
- Create pairwise "battles": winner (best LLM) vs. each loser (other LLMs)
- Use logistic regression to estimate Elo scores via Maximum Likelihood Estimation
- Save the Elo scores to disk

Inference Phase:

- Load the precomputed Elo scores
- Always select the LLM with the highest Elo rating
- Route ALL queries to this single model (query-independent routing)
- Global Ranking: Computes a single global rating for each LLM
- Query-Independent: Unlike KNN/MLP/SVM, ignores query content during inference
- Pairwise Comparisons: Based on relative performance, not absolute scores
- Statistical Foundation: Grounded in Bradley-Terry model and MLE
For each pairwise battle (model A vs. model B):
P(A wins) = 1 / (1 + 10^((Rating_B - Rating_A) / 400))
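As a quick illustration, the expected-score formula in code (the helper name here is ours, not part of the package):

```python
def expected_win_prob(rating_a: float, rating_b: float,
                      scale: float = 400.0, base: float = 10.0) -> float:
    """Elo expected probability that model A beats model B."""
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))

# Equal ratings give a 50% expected win rate;
# a 200-point edge gives roughly 76%.
print(expected_win_prob(1000.0, 1000.0))            # 0.5
print(round(expected_win_prob(1200.0, 1000.0), 2))  # 0.76
```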
The trainer uses logistic regression to find Elo ratings that maximize the likelihood of observed battle outcomes.
For each query:
- Identify the best-performing model (winner)
- Create battles: winner vs. all other models (losers)
- Generate symmetric battles (A vs B and B vs A) for balanced training
Example:

```
Query: "Explain gravity"
Performance: GPT-4 (0.95), Claude (0.85), Llama (0.70)

Battles created:
  GPT-4 vs Claude → GPT-4 wins
  GPT-4 vs Llama  → GPT-4 wins
  Claude vs GPT-4 → Claude loses
  Llama vs GPT-4  → Llama loses
```
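The battle-creation step above can be sketched like this (an illustrative helper, not the package's exact internals):

```python
def make_battles(performance: dict) -> list:
    """Build symmetric winner-vs-loser battles for one query.

    `performance` maps model name -> score; the top scorer is the winner.
    """
    winner = max(performance, key=performance.get)
    battles = []
    for loser in performance:
        if loser == winner:
            continue
        # Symmetric pair (A vs B and B vs A) recording the same outcome.
        battles.append({"model_a": winner, "model_b": loser, "winner": "model_a"})
        battles.append({"model_a": loser, "model_b": winner, "winner": "model_b"})
    return battles

battles = make_battles({"GPT-4": 0.95, "Claude": 0.85, "Llama": 0.70})
print(len(battles))  # 4: two losers x two orderings
```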
Uses logistic regression MLE to find Elo ratings that best explain battle outcomes:
- Initialize all models at 1000 rating
- Fit logistic regression to predict battle winners
- Convert coefficients to Elo scores (scaled by 400)
Saves Elo scores as a dictionary: {"GPT-4": 1250, "Claude": 1180, "Llama": 950}
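The steps above can be sketched as follows. This is an illustrative reimplementation of the Bradley-Terry MLE approach (as popularized by Chatbot Arena), not the package's actual trainer code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def compute_elo_mle(battles: pd.DataFrame, scale: float = 400.0,
                    base: float = 10.0, init_rating: float = 1000.0) -> dict:
    """Fit Elo ratings by logistic-regression MLE over pairwise battles.

    `battles` needs columns 'model_a', 'model_b', and 'winner'
    ('model_a' or 'model_b').
    """
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # One row per battle: +log(base) for model_a, -log(base) for model_b,
    # so the fitted log-odds reproduce the Elo win-probability formula.
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for row, (_, b) in enumerate(battles.iterrows()):
        X[row, idx[b["model_a"]]] = np.log(base)
        X[row, idx[b["model_b"]]] = -np.log(base)
        y[row] = 1.0 if b["winner"] == "model_a" else 0.0

    lr = LogisticRegression(fit_intercept=False, solver="lbfgs",
                            C=1e6, max_iter=1000)
    lr.fit(X, y)

    # Coefficients live on the log-odds scale; rescale to Elo points.
    return {m: float(init_rating + scale * lr.coef_[0][idx[m]]) for m in models}
```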
No hyperparameters to tune! The Elo computation is deterministic given the training data.
Fixed Constants (in trainer code):

- `SCALE` (400.0): standard Elo scale factor
- `BASE` (10.0): Elo probability base
- `INIT_RATING` (1000.0): starting rating for all models
| Parameter | Description |
|---|---|
| `query_data_train` | Training queries in JSONL format |
| `routing_data_train` | Historical routing performance data (query-LLM pairs with performance scores) |
| `llm_data` | LLM candidate information (models, API names, metadata) |
| Parameter | Purpose | Usage |
|---|---|---|
| `save_model_path` | Where to save computed Elo scores | Training: saves `{model_name: elo_score}` dictionary |
| `load_model_path` | Elo scores to load for inference | Testing: path to saved `.pkl` file |
During inference, the router:

- Loads Elo scores from `load_model_path`
- Selects the model with the highest rating
- Routes all queries to this single model
- Makes no query-specific routing decisions
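Conceptually, the whole inference step is a single argmax over the loaded scores (the values below are illustrative):

```python
# Scores as loaded from the saved .pkl file (illustrative values)
elo_scores = {"GPT-4": 1250, "Claude": 1180, "Llama": 950}

# Every query, regardless of content, routes to the top-rated model.
best_model = max(elo_scores, key=elo_scores.get)
print(best_model)  # GPT-4
```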
The ELO Router can be used via the llmrouter command-line interface:
```bash
# Compute Elo rankings
llmrouter train --router elorouter --config configs/model_config_train/elorouter.yaml

# Train with quiet mode
llmrouter train --router elorouter --config configs/model_config_train/elorouter.yaml --quiet

# Route a single query (always selects highest-rated model)
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --query "What is the meaning of life?"

# Route queries from a file
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --input queries.jsonl --output results.json

# Route only (without calling LLM API)
llmrouter infer --router elorouter --config configs/model_config_test/elorouter.yaml \
    --query "Explain quantum mechanics" --route-only

# Launch chat interface
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml

# Launch with custom port
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml --port 8080

# Create a public shareable link
llmrouter chat --router elorouter --config configs/model_config_test/elorouter.yaml --share
```

Training via the Python API:

```python
from llmrouter.models import EloRouter, EloRouterTrainer

# Initialize router with training configuration
router = EloRouter(yaml_path="configs/model_config_train/elorouter.yaml")

# Create trainer
trainer = EloRouterTrainer(router=router, device="cpu")

# Compute Elo scores
trainer.train()
# Elo scores will be saved to the path specified in save_model_path

# View the computed rankings
print("Elo Rankings:")
for model, score in sorted(router.elo_scores.items(), key=lambda x: -x[1]):
    print(f"  {model}: {score:.2f}")
```

Command Line Training:

```bash
python tests/train_test/test_elorouter.py --yaml_path configs/model_config_train/elorouter.yaml
```

Single-query inference:

```python
from llmrouter.models import EloRouter

# Initialize router with test configuration (loads Elo scores)
router = EloRouter(yaml_path="configs/model_config_test/elorouter.yaml")

# Route a single query
query = {"query": "What is the meaning of life?"}
result = router.route_single(query)
print(f"Selected Model: {result['model_name']}")
# Note: This will ALWAYS be the same model (highest Elo rating)
```

Batch routing:

```python
from llmrouter.models import EloRouter

# Initialize router
router = EloRouter(yaml_path="configs/model_config_test/elorouter.yaml")

# Prepare batch of queries
queries = [
    {"query": "Explain quantum mechanics", "ground_truth": "..."},
    {"query": "Write a poem about AI", "ground_truth": "..."},
    {"query": "Solve x^2 + 5x + 6 = 0", "ground_truth": "..."},
]

# Route and execute (all queries go to the same best model)
results = router.route_batch(batch=queries, task_name="general")

# All queries routed to the same model
unique_models = set(r["model_name"] for r in results)
print(f"Number of unique models used: {len(unique_models)}")  # Always 1
```

Training Configuration (`configs/model_config_train/elorouter.yaml`):

```yaml
data_path:
  query_data_train: 'data/example_data/query_data/default_query_train.jsonl'
  routing_data_train: 'data/example_data/routing_data/default_routing_train_data.jsonl'
  llm_data: 'data/example_data/llm_candidates/default_llm.json'

model_path:
  ini_model_path: ''
  save_model_path: 'saved_models/elorouter/elorouter.pkl'

metric:
  weights:
    performance: 1  # Primary criterion for determining winners
    cost: 0
    llm_judge: 0
```

Testing Configuration (`configs/model_config_test/elorouter.yaml`):

```yaml
data_path:
  llm_data: 'data/example_data/llm_candidates/default_llm.json'

model_path:
  load_model_path: 'saved_models/elorouter/elorouter.pkl'
```

Advantages:

- ✅ Simple and Interpretable: Single global ranking that's easy to understand
- ✅ Statistically Grounded: Based on Bradley-Terry model and MLE
- ✅ No Hyperparameters: No tuning required, fully deterministic
- ✅ Handles Imbalanced Comparisons: Elo naturally handles varying numbers of battles per model
- ✅ Battle-Tested: Proven system used in chess, sports, and now LLM leaderboards
- ✅ Fast Inference: Just a dictionary lookup (O(1))
Limitations:

- ❌ Query-Agnostic: Ignores query content, always routes to the same model
- ❌ No Specialization: Cannot leverage model strengths for specific query types
- ❌ Single Model: Cannot distribute load or use ensembles
- ❌ Assumes Transitivity: Assumes if A > B and B > C, then A > C (may not hold for LLMs)
- ❌ Static Rankings: Must retrain to update Elo scores
- ❌ No Cost-Performance Trade-off: Always chooses highest-rated model regardless of cost
- ❌ Data Hungry: Needs sufficient pairwise comparisons for accurate rankings
Good Use Cases:
- Want a simple baseline that always uses the "best" model
- Need a global ranking of LLM capabilities
- Have abundant training data with consistent evaluation metrics
- Don't need query-specific routing (all queries are similar)
- Want interpretable, explainable routing (just show the Elo ranking)
NOT Recommended When:
- Queries have diverse types (coding, math, creative writing, etc.)
- Need to optimize cost (Elo always picks highest-rated, often most expensive model)
- Want to leverage specialized model strengths
- Need to distribute load across multiple models
- Have limited training data
| Elo Score | Meaning |
|---|---|
| 1400+ | Dominant model, wins most battles |
| 1200-1400 | Strong model, competitive |
| 1000-1200 | Average model, mixed performance |
| 800-1000 | Weak model, loses most battles |
| <800 | Very weak model, rarely wins |
| Elo Diff | Expected Win Rate |
|---|---|
| 0 | 50% |
| 100 | 64% |
| 200 | 76% |
| 400 | 91% |
If Model A has Elo 1200 and Model B has Elo 1000 (diff = 200), Model A is expected to win ~76% of battles.
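These expected win rates follow directly from the formula above; a quick check:

```python
BASE, SCALE = 10.0, 400.0

for diff in (0, 100, 200, 400):
    p = 1.0 / (1.0 + BASE ** (-diff / SCALE))
    print(f"Elo diff {diff:>3}: {p:.0%}")
# Elo diff   0: 50%
# Elo diff 100: 64%
# Elo diff 200: 76%
# Elo diff 400: 91%
```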
| Aspect | ELO Router | KNN Router | MLP/SVM Router |
|---|---|---|---|
| Query-Specific | ❌ No | ✅ Yes | ✅ Yes |
| Training Speed | Fast | None | Medium |
| Inference Speed | Instant | Medium | Fast |
| Interpretability | High (rankings) | High (neighbors) | Low |
| Model Diversity | Single model | Multiple models | Multiple models |
| Hyperparameters | None | Few | Many |
| Data Efficiency | Medium | High | Medium |
- Framework: Custom implementation using scikit-learn's LogisticRegression
- Battle Generation: Symmetric battles (A vs B and B vs A) for balanced training
- MLE Solver: LBFGS optimizer for logistic regression
- Output: Dictionary mapping model names to Elo scores
- Serialization: Saved as `.pkl` files using pickle
- Training Data Quality:
  - Ensure performance metrics are reliable and consistent
  - Include diverse queries to avoid bias
  - Use sufficient queries (50+ recommended) for stable rankings
- Performance Metric Selection:
  - Use `metric.weights.performance = 1` for accuracy-based ranking
  - Cost can be incorporated if needed, but doing so defeats the purpose of a pure performance ranking
  - Ensure the metric is comparable across different query types
- Model Pool:
  - Works best with 3-10 models
  - Too few models → limited routing value
  - Too many models → sparse battle data per pair
- Retraining Strategy:
  - Retrain periodically as new data arrives
  - Monitor whether model capabilities change over time
  - Update when adding new models to the pool
- Use as a Baseline:
  - The ELO Router is excellent as a baseline for comparison
  - Compare query-specific routers against ELO to measure the value of personalization
This router is directly inspired by Chatbot Arena (LMSYS):
- Chatbot Arena uses Elo ratings to rank LLMs based on human preferences
- Users vote on pairwise comparisons → Elo scores computed
- Creates public LLM leaderboard
Key Difference:
- Chatbot Arena: Human preference battles
- ELO Router: Automated performance metric battles
While the defaults work well, you can modify the Elo constants in trainer.py:
```python
# Larger SCALE → bigger rating differences
elo_scores = compute_elo_mle(battles_df, SCALE=500.0, BASE=10.0, INIT_RATING=1500.0)
```

You can also modify the battle generation to consider cost-adjusted performance:

```python
# In a custom trainer
df["adjusted_performance"] = df["performance"] / (df["cost"] ** 0.5)
# Then use adjusted_performance to determine winners
```

Another option is multi-metric ranking: compute separate Elo rankings for different metrics (accuracy, speed, cost-efficiency) and combine them.
- Largest LLM Router: Always picks the largest model (simpler heuristic)
- Smallest LLM Router: Always picks the smallest model (cost-focused)
- Hybrid LLM Router: Weighted combination of multiple routing strategies
- Matrix Factorization Router: Learns query-model affinity (query-specific alternative)
For questions or issues, please refer to the main LLMRouter documentation or open an issue on GitHub.