feat: add classification metrics (F1, Precision, Recall) for dataset-level evaluation #6081
Conversation
```python
class TestF1Score:
    def test_perfect_predictions(self):
        metric = F1Score(average="macro")
        result = metric.score(
            predictions=["cat", "cat", "dog"],
            references=["cat", "cat", "dog"],
        )
```
Can we rename TestF1Score.test_perfect_predictions and the other new tests to the SDK naming convention test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT, using test_WHAT__happyflow for happy paths?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/metrics/test_classification.py around lines 9-15, the
test method test_perfect_predictions in class TestF1Score does not follow the required
naming convention. Rename that method to test_f1_score__happyflow (keeping the same body
and assertions). Then, across the same file, rename the other test methods to the
pattern test_<what>__<case_description>__<expected_result> (e.g.
test_all_wrong_predictions -> test_f1_score__all_wrong__zero, test_weighted_average ->
test_f1_score__weighted_average__range, test_micro_average ->
test_f1_score__micro_average__range, and similar for PrecisionScore and RecallScore
tests) while preserving their bodies, assertions, and test coverage.
```python
from typing import Any, List, Literal, Optional

from sklearn.metrics import f1_score, precision_score, recall_score
```
Should we replace from sklearn.metrics import f1_score, precision_score, recall_score with import sklearn.metrics as metrics and call metrics.f1_score()/metrics.precision_score()/metrics.recall_score() to follow sdks/python/AGENTS.md's module-style import guideline?
from sklearn.metrics import f1_score, precision_score, recall_score => import sklearn.metrics as metrics
f1_score(...) => metrics.f1_score(...)
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around line 4, the
file does `from sklearn.metrics import f1_score, precision_score, recall_score` which
violates the SDK import-style guideline. Replace that named import with a module import
`import sklearn.metrics as metrics` and update all call sites in this file to use the
module prefix (e.g. change `f1_score(...)` to `metrics.f1_score(...)`,
`precision_score(...)` to `metrics.precision_score(...)`, and `recall_score(...)` to
`metrics.recall_score(...)`) so the three score() methods continue to work and the code
follows module-style imports.
```python
class F1Score(base_metric.BaseMetric):
    """
    A metric that computes the F1-score for classification tasks at the dataset level.

    Unlike per-sample metrics, this metric requires all predictions and references
    to be collected first, then scored together.

    Args:
        average: Averaging strategy - 'macro', 'micro', or 'weighted'. Defaults to 'macro'.
        name: The name of the metric. Defaults to "f1_score_metric".
    """
```
F1Score, PrecisionScore, and RecallScore aren't exported from metrics/__init__.py, should we import them and add them to __all__?
```python
# Suggested addition to metrics/__init__.py:
from .heuristics.classification import F1Score, PrecisionScore, RecallScore

__all__ += ["F1Score", "PrecisionScore", "RecallScore"]
```
Finding type: Breaking Changes | Severity: 🟠 Medium
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 10-79
the new classes F1Score, PrecisionScore, and RecallScore are implemented and meant to be
public. Update the package export file
sdks/python/src/opik/evaluation/metrics/__init__.py (around the existing imports at
lines ~24-32) to import these classes (e.g. from .heuristics.classification import
F1Score, PrecisionScore, RecallScore) and add their names to the module's __all__ list
so they are re-exported and accessible via from opik.evaluation.metrics import F1Score,
PrecisionScore, RecallScore.
```python
        if len(predictions) != len(references):
            raise ValueError(
                f"predictions and references must have the same length, "
                f"got {len(predictions)} and {len(references)}"
            )

        value = f1_score(
            references, predictions, average=self._average, zero_division=0
        )
```
ValueError for length mismatch and sklearn errors (empty/invalid inputs) leaks through — should we raise MetricComputationError instead and wrap the sklearn call?
Finding types: Logical Bugs | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 65-74
(and analogous checks in PrecisionScore.score and RecallScore.score), replace the bare
ValueError for len(predictions) != len(references) with
opik.exceptions.MetricComputationError, preserving the existing message. Then guard for
empty predictions/references and raise MetricComputationError with a clear message if
empty. Wrap the sklearn.metrics call (f1_score, precision_score, recall_score) in
try/except ValueError as e and raise MetricComputationError("failed to compute
[metric-name]: {e}") from e. Add an import for MetricComputationError at the top of the
file. Update any unit tests that assert ValueError for these conditions to expect
MetricComputationError instead.
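The guarded flow this prompt describes can be sketched as follows. Everything here is a stand-in: `MetricComputationError` is a local stub for `opik.exceptions.MetricComputationError`, and the injected `compute` callable stands in for the sklearn call, so the sketch stays dependency-free.

```python
from typing import Callable, List


class MetricComputationError(Exception):
    """Local stub for opik.exceptions.MetricComputationError."""


def guarded_score(
    predictions: List[str],
    references: List[str],
    compute: Callable[[List[str], List[str]], float],
    metric_name: str,
) -> float:
    # Length mismatch: raise the SDK error type instead of a bare ValueError.
    if len(predictions) != len(references):
        raise MetricComputationError(
            f"predictions and references must have the same length, "
            f"got {len(predictions)} and {len(references)}"
        )
    # Empty inputs would make sklearn raise deep inside; fail early instead.
    if not predictions:
        raise MetricComputationError("predictions and references must not be empty")
    # Wrap the underlying computation so sklearn's ValueError does not leak.
    try:
        return compute(references, predictions)
    except ValueError as e:
        raise MetricComputationError(f"failed to compute {metric_name}: {e}") from e
```

The same wrapper shape applies to all three score() methods; only the wrapped callable and the metric name change.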
I don't get this PR. You are implementing these metrics as a BaseMetric, but those types of metrics are for per-sample evaluation. The issue this stems from already pointed out that these kinds of metrics should be experiment scoring functions instead, which makes sense to me. How is this intended to be used with the
Hi @kriogenia — thank you for the feedback, very helpful.
All follow the same pattern as compute_std_deviation in test_experiment_scoring_functions.py.
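The contract behind an experiment scoring function can be sketched as follows. All class names, the `output`/`reference` keys, and the factory name are illustrative stand-ins, not the SDK's actual API; the point is the shape — the function receives all test results at once and returns one dataset-level score, unlike a per-sample BaseMetric. For single-label multiclass data, micro-averaged F1 equals plain accuracy, which keeps the stub dependency-free.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Minimal stand-ins for the SDK's test_result/score_result containers.
@dataclass
class TestCase:
    task_output: Dict[str, str]


@dataclass
class TestResult:
    test_case: TestCase


@dataclass
class ScoreResult:
    name: str
    value: float


def f1_micro_scoring_function(
    output_key: str = "output", reference_key: str = "reference"
) -> Callable[[List[TestResult]], List[ScoreResult]]:
    """Build a dataset-level scoring function: it sees ALL results at once."""

    def _compute(test_results: List[TestResult]) -> List[ScoreResult]:
        predictions = [r.test_case.task_output[output_key] for r in test_results]
        references = [r.test_case.task_output[reference_key] for r in test_results]
        # Micro-averaged F1 over single-label multiclass data reduces to
        # accuracy, so a plain correct/total ratio is enough for this sketch.
        correct = sum(p == ref for p, ref in zip(predictions, references))
        value = correct / len(predictions) if predictions else 0.0
        return [ScoreResult(name="f1_micro", value=value)]

    return _compute
```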
```python
    def test_f1_scoring_function__weighted_average__range(self):
        fn = f1_scoring_function(average="weighted")
        results = _make_test_results(
            predictions=["cat", "dog", "cat"],
            references=["cat", "cat", "cat"],
        )
        scores = fn(results)
        assert 0.0 <= scores[0].value <= 1.0
```
test_f1_scoring_function__weighted_average__range only asserts 0.0 <= scores[0].value <= 1.0, should we assert the actual expected scores like weighted F1 ≃ 0.8, micro F1 ≃ 0.667, macro precision ≃ 0.5, weighted recall ≃ 0.667 instead?
Finding type: Assert expected behavior | Severity: 🟠 Medium
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/test_classification_scoring.py around lines 63 to 70,
the test_f1_scoring_function__weighted_average__range only asserts 0.0 <= score <= 1.0;
change it to assert the weighted F1 equals approximately 0.8 using pytest.approx(0.8,
rel=1e-3) (or pytest.approx(0.8, abs=1e-3)). In the same file around lines 72 to 79
(test_f1_scoring_function__micro_average__range) replace the trivial range assertion
with an assert that the micro F1 is approximately 0.667 using pytest.approx(0.667,
rel=1e-3). Around lines 111 to 118 (test_precision_scoring_function__partial__range)
assert the macro precision is approximately 0.5 with pytest.approx(0.5, rel=1e-3).
Around lines 137 to 144 (test_recall_scoring_function__partial__range) assert the
weighted recall is approximately 0.667 with pytest.approx(0.667, rel=1e-3). Keep the
test inputs unchanged and use pytest.approx for stable floating comparisons.
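The approx targets in the prompt can be checked by hand. As a sanity-check sketch, the helper below recomputes weighted F1 from first principles (per-class F1 weighted by each class's support in the references), with no sklearn dependency:

```python
from collections import Counter
from typing import List


def weighted_f1(references: List[str], predictions: List[str]) -> float:
    """Per-class F1, weighted by each class's support in the references."""
    support = Counter(references)
    total = sum(support.values())
    score = 0.0
    for label, count in support.items():
        tp = sum(p == label and r == label for p, r in zip(predictions, references))
        fp = sum(p == label and r != label for p, r in zip(predictions, references))
        fn = sum(p != label and r == label for p, r in zip(predictions, references))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (count / total) * f1
    return score
```

For references `["cat", "cat", "cat"]` and predictions `["cat", "dog", "cat"]`, the only supported class is `cat` with precision 1.0 and recall 2/3, giving per-class F1 = 0.8 and hence weighted F1 = 0.8 — matching the suggested `pytest.approx(0.8)` target.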
```python
    def _compute(
        test_results: List[test_result.TestResult],
    ) -> List[score_result.ScoreResult]:
        predictions = [
            str(r.test_case.task_output.get(output_key, ""))
            for r in test_results
            if r.test_case.task_output is not None
        ]
        references = [
            str(r.test_case.task_output.get(reference_key, ""))
            for r in test_results
            if r.test_case.task_output is not None
        ]
```
task_output.get(..., "") turns a missing reference_key into an empty-string ground truth — should we skip results missing reference_key or pull the reference from the dataset item/scoring inputs instead?
Finding type: Logical Bugs | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/classification_scoring.py around lines 75 to 88, the
inner _compute of f1_scoring_function builds predictions and references using
task_output.get(..., "") which inserts empty-string sentinels when the key is missing.
Change the list comprehensions to skip any test_result where test_case.task_output is
None or does not contain the requested key (e.g. use `if r.test_case.task_output is not
None and output_key in r.test_case.task_output`), and access the value by indexing
(r.test_case.task_output[output_key]) instead of using a default. Also ensure the same
precise change is applied to the equivalent comprehensions in precision_scoring_function
and recall_scoring_function so missing keys are skipped rather than treated as empty
labels.
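The fix the prompt describes amounts to filtering on key presence before indexing. A minimal sketch, with plain dicts standing in for task_output (key names are illustrative): pairing the two labels in one loop, rather than two separate comprehensions, also guarantees predictions and references stay aligned even when rows are skipped.

```python
from typing import Dict, List, Optional, Tuple


def paired_labels(
    task_outputs: List[Optional[Dict[str, str]]],
    output_key: str,
    reference_key: str,
) -> Tuple[List[str], List[str]]:
    """Collect prediction/reference pairs, skipping rows that lack either key.

    Indexing (rather than .get(..., "")) means a missing reference can never
    silently become an empty-string ground-truth label.
    """
    predictions, references = [], []
    for out in task_outputs:
        if out is None or output_key not in out or reference_key not in out:
            continue  # skip incomplete rows instead of inventing "" labels
        predictions.append(str(out[output_key]))
        references.append(str(out[reference_key]))
    return predictions, references
```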
@anshuchowdaryalapati will you address the baz-reviewer comments?
Closes #5988
What this PR does
Adds three new dataset-level classification metrics to the Opik Python SDK:
- `F1Score` — supports macro, micro, and weighted averaging
- `PrecisionScore` — supports macro, micro, and weighted averaging
- `RecallScore` — supports macro, micro, and weighted averaging

Motivation
LLM pipelines are not just generation — they include classification components like:
These need dataset-level metrics (F1, Precision, Recall), not per-sample scores.
Currently users have to compute these outside Opik in notebooks/scripts.
This PR brings everything into one place.
Changes
- `sdks/python/src/opik/evaluation/metrics/heuristics/classification.py` — 3 new metric classes
- `sdks/python/tests/unit/evaluation/metrics/test_classification.py` — 12 unit tests

Testing
python -m pytest tests/unit/evaluation/metrics/test_classification.py -v
12 passed