
feat: add classification metrics (F1, Precision, Recall) for dataset-level evaluation #6081

Open
anshuchowdaryalapati wants to merge 2 commits into comet-ml:main from anshuchowdaryalapati:main

Conversation

@anshuchowdaryalapati

Closes #5988

What this PR does

Adds three new dataset-level classification metrics to the Opik Python SDK:

  • F1Score — supports macro, micro, weighted averaging
  • PrecisionScore — supports macro, micro, weighted averaging
  • RecallScore — supports macro, micro, weighted averaging
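
For context, the three averaging strategies differ only in how per-class scores are combined. A dependency-free sketch for precision (pure Python for illustration; the PR's implementation delegates to sklearn.metrics, and `precision_by_average` is a hypothetical helper, not SDK code):

```python
# Minimal illustration of macro vs micro vs weighted averaging, shown for
# precision. Pure Python for clarity; the PR itself calls sklearn.metrics.
from collections import Counter

def precision_by_average(preds, refs, average="macro"):
    labels = sorted(set(refs) | set(preds))
    support = Counter(refs)  # per-class count in the references
    stats = {}
    for label in labels:
        tp = sum(p == r == label for p, r in zip(preds, refs))
        fp = sum(p == label != r for p, r in zip(preds, refs))
        stats[label] = (tp, fp)
    if average == "micro":
        # pool TP/FP counts over all classes, then take one global ratio
        tp = sum(t for t, _ in stats.values())
        fp = sum(f for _, f in stats.values())
        return tp / (tp + fp) if tp + fp else 0.0
    per_class = {
        label: (tp / (tp + fp) if tp + fp else 0.0)
        for label, (tp, fp) in stats.items()
    }
    if average == "weighted":
        # weight each class's precision by its support in the references
        total = sum(support.values())
        return sum(per_class[l] * support[l] for l in labels) / total
    # macro: unweighted mean over classes
    return sum(per_class.values()) / len(labels)
```

The same pooling-vs-averaging distinction applies to recall and F1.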

Motivation

LLM pipelines are not just generation — they include classification components like:

  • query routing
  • label assignment
  • document filtering

These need dataset-level metrics (F1, Precision, Recall), not per-sample scores.
Currently users have to compute these outside Opik in notebooks/scripts.
This PR brings everything into one place.

Changes

  • sdks/python/src/opik/evaluation/metrics/heuristics/classification.py — 3 new metric classes
  • sdks/python/tests/unit/evaluation/metrics/test_classification.py — 12 unit tests

Testing

python -m pytest tests/unit/evaluation/metrics/test_classification.py -v
12 passed

@anshuchowdaryalapati anshuchowdaryalapati requested a review from a team as a code owner April 6, 2026 08:23
@github-actions github-actions bot added the python (Pull requests that update Python code), tests (Including test files, or tests related like configuration.), and Python SDK labels Apr 6, 2026
Comment on lines +9 to +15
class TestF1Score:
    def test_perfect_predictions(self):
        metric = F1Score(average="macro")
        result = metric.score(
            predictions=["cat", "cat", "dog"],
            references=["cat", "cat", "dog"],
        )
Contributor

Can we rename TestF1Score.test_perfect_predictions and the other new tests to the SDK naming convention test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT, using test_WHAT__happyflow for happy paths?

Finding type: AI Coding Guidelines | Severity: 🟢 Low



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/metrics/test_classification.py around lines 9-15, the
test method test_perfect_predictions in class TestF1Score does not follow the required
naming convention. Rename that method to test_f1_score__happyflow (keeping the same body
and assertions). Then, across the same file, rename the other test methods to the
pattern test_<what>__<case_description>__<expected_result> (e.g.
test_all_wrong_predictions -> test_f1_score__all_wrong__zero, test_weighted_average ->
test_f1_score__weighted_average__range, test_micro_average ->
test_f1_score__micro_average__range, and similar for PrecisionScore and RecallScore
tests) while preserving their bodies, assertions, and test coverage.

from typing import Any, List, Literal, Optional

from sklearn.metrics import f1_score, precision_score, recall_score
Contributor

Should we replace from sklearn.metrics import f1_score, precision_score, recall_score with import sklearn.metrics as metrics and call metrics.f1_score()/metrics.precision_score()/metrics.recall_score() to follow sdks/python/AGENTS.md's module-style import guideline?

from sklearn.metrics import f1_score, precision_score, recall_score => import sklearn.metrics as metrics
f1_score(...) => metrics.f1_score(...)

Finding type: AI Coding Guidelines | Severity: 🟢 Low



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around line 4, the
file does `from sklearn.metrics import f1_score, precision_score, recall_score` which
violates the SDK import-style guideline. Replace that named import with a module import
`import sklearn.metrics as metrics` and update all call sites in this file to use the
module prefix (e.g. change `f1_score(...)` to `metrics.f1_score(...)`,
`precision_score(...)` to `metrics.precision_score(...)`, and `recall_score(...)` to
`metrics.recall_score(...)`) so the three score() methods continue to work and the code
follows module-style imports.

Comment on lines +10 to +21
class F1Score(base_metric.BaseMetric):
    """
    A metric that computes the F1-score for classification tasks at the dataset level.

    Unlike per-sample metrics, this metric requires all predictions and references
    to be collected first, then scored together.

    Args:
        average: Averaging strategy - 'macro', 'micro', or 'weighted'. Defaults to 'macro'.
        name: The name of the metric. Defaults to "f1_score_metric".
Contributor

F1Score, PrecisionScore, and RecallScore aren't exported from metrics/__init__.py, should we import them and add them to __all__?

Suggested change (in sdks/python/src/opik/evaluation/metrics/__init__.py):

from .heuristics.classification import F1Score, PrecisionScore, RecallScore
__all__ += ["F1Score", "PrecisionScore", "RecallScore"]

Finding type: Breaking Changes | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 10-79
the new classes F1Score, PrecisionScore, and RecallScore are implemented and meant to be
public. Update the package export file
sdks/python/src/opik/evaluation/metrics/__init__.py (around the existing imports at
lines ~24-32) to import these classes (e.g. from .heuristics.classification import
F1Score, PrecisionScore, RecallScore) and add their names to the module's __all__ list
so they are re-exported and accessible via from opik.evaluation.metrics import F1Score,
PrecisionScore, RecallScore.

Comment on lines +65 to +74
if len(predictions) != len(references):
    raise ValueError(
        f"predictions and references must have the same length, "
        f"got {len(predictions)} and {len(references)}"
    )

value = f1_score(
    references, predictions, average=self._average, zero_division=0
)

Contributor

ValueError for length mismatch and sklearn errors (empty/invalid inputs) leaks through — should we raise MetricComputationError instead and wrap the sklearn call?

Finding types: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 65-74
(and analogous checks in PrecisionScore.score and RecallScore.score), replace the bare
ValueError for len(predictions) != len(references) with
opik.exceptions.MetricComputationError, preserving the existing message. Then guard for
empty predictions/references and raise MetricComputationError with a clear message if
empty. Wrap the sklearn.metrics call (f1_score, precision_score, recall_score) in
try/except ValueError as e and raise MetricComputationError("failed to compute
[metric-name]: {e}") from e. Add an import for MetricComputationError at the top of the
file. Update any unit tests that assert ValueError for these conditions to expect
MetricComputationError instead.
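
The error-handling pattern this review asks for could be sketched as below. The MetricComputationError class here is a local stand-in for opik.exceptions.MetricComputationError, and safe_f1 is a hypothetical helper, not the PR's actual score() method:

```python
# Sketch of the requested error handling: validate inputs up front and wrap
# the underlying computation so only MetricComputationError escapes.
from typing import Callable, List


class MetricComputationError(Exception):
    """Local stand-in for opik.exceptions.MetricComputationError."""


def safe_f1(
    references: List[str],
    predictions: List[str],
    compute: Callable[[List[str], List[str]], float],
) -> float:
    if len(predictions) != len(references):
        raise MetricComputationError(
            f"predictions and references must have the same length, "
            f"got {len(predictions)} and {len(references)}"
        )
    if not predictions:
        raise MetricComputationError("predictions and references must be non-empty")
    try:
        # `compute` stands in for sklearn.metrics.f1_score(references, predictions, ...)
        return compute(references, predictions)
    except ValueError as e:
        raise MetricComputationError(f"failed to compute f1_score: {e}") from e
```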

@kriogenia

I don't get this PR. You are implementing those metrics as a BaseMetric, but those types of metrics are for per-sample evaluation. The issue this stems from already pointed out that these kinds of metrics should be experiment scoring functions instead, which makes sense to me.

How is this intended to be used with evaluate, for example? Shouldn't there be some integration tests for them instead of just unit tests?

@anshuchowdaryalapati
Author

anshuchowdaryalapati commented Apr 6, 2026

Hi @kriogenia — thank you for the feedback, very helpful.
I've refactored the implementation completely. Instead of BaseMetric, I've implemented experiment-level scoring functions — plain callables that take List[TestResult] and return List[ScoreResult], designed to be passed directly to experiment_scoring_functions in opik.evaluate().
New file: sdks/python/src/opik/evaluation/classification_scoring.py

  • f1_scoring_function(average="macro"|"micro"|"weighted")
  • precision_scoring_function(average="macro"|"micro"|"weighted")
  • recall_scoring_function(average="macro"|"micro"|"weighted")

All follow the same pattern as compute_std_deviation in test_experiment_scoring_functions.py.
13 unit tests passing. Regarding integration tests — I'd like to add them in tests/e2e/evaluation/test_experiment_scoring_functions.py following the existing pattern. Could you confirm if that's the right place? I don't have a running backend to validate e2e locally — happy to add them if you can confirm the setup.
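
The factory-returning-a-callable shape described above could look roughly like this. TestResult and ScoreResult here are simplified stand-ins for the Opik SDK types (the real TestResult nests task_output under test_case), and only macro averaging is implemented:

```python
# Illustrative sketch of the experiment-level scoring-function pattern:
# a factory that returns a callable consuming all test results at once.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestResult:            # stand-in; the SDK nests this under test_case
    task_output: dict


@dataclass
class ScoreResult:           # stand-in for the SDK's ScoreResult
    name: str
    value: float


def f1_scoring_function(
    average: str = "macro",
) -> Callable[[List[TestResult]], List[ScoreResult]]:
    def _compute(test_results: List[TestResult]) -> List[ScoreResult]:
        preds = [r.task_output["output"] for r in test_results]
        refs = [r.task_output["reference"] for r in test_results]
        labels = sorted(set(refs) | set(preds))
        f1s = []
        for label in labels:
            tp = sum(p == r == label for p, r in zip(preds, refs))
            fp = sum(p == label != r for p, r in zip(preds, refs))
            fn = sum(r == label != p for p, r in zip(preds, refs))
            denom = 2 * tp + fp + fn          # F1 = 2TP / (2TP + FP + FN)
            f1s.append(2 * tp / denom if denom else 0.0)
        # macro average: unweighted mean over classes
        return [ScoreResult(name=f"f1_{average}", value=sum(f1s) / len(f1s))]
    return _compute
```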

Comment on lines +63 to +70
def test_f1_scoring_function__weighted_average__range(self):
    fn = f1_scoring_function(average="weighted")
    results = _make_test_results(
        predictions=["cat", "dog", "cat"],
        references=["cat", "cat", "cat"],
    )
    scores = fn(results)
    assert 0.0 <= scores[0].value <= 1.0
Contributor

test_f1_scoring_function__weighted_average__range only asserts 0.0 <= scores[0].value <= 1.0, should we assert the actual expected scores like weighted F1 ≃ 0.8, micro F1 ≃ 0.667, macro precision ≃ 0.5, weighted recall ≃ 0.667 instead?

Finding type: Assert expected behavior | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/test_classification_scoring.py around lines 63 to 70,
the test_f1_scoring_function__weighted_average__range only asserts 0.0 <= score <= 1.0;
change it to assert the weighted F1 equals approximately 0.8 using pytest.approx(0.8,
rel=1e-3) (or pytest.approx(0.8, abs=1e-3)). In the same file around lines 72 to 79
(test_f1_scoring_function__micro_average__range) replace the trivial range assertion
with an assert that the micro F1 is approximately 0.667 using pytest.approx(0.667,
rel=1e-3). Around lines 111 to 118 (test_precision_scoring_function__partial__range)
assert the macro precision is approximately 0.5 with pytest.approx(0.5, rel=1e-3).
Around lines 137 to 144 (test_recall_scoring_function__partial__range) assert the
weighted recall is approximately 0.667 with pytest.approx(0.667, rel=1e-3). Keep the
test inputs unchanged and use pytest.approx for stable floating comparisons.
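
The expected values cited in the suggestion can be verified by hand for predictions=["cat", "dog", "cat"] against references=["cat", "cat", "cat"], mirroring sklearn's formulas with zero_division=0:

```python
# Hand-check of the review's expected scores.
preds = ["cat", "dog", "cat"]
refs = ["cat", "cat", "cat"]

def stats(label):
    tp = sum(p == r == label for p, r in zip(preds, refs))
    fp = sum(p == label != r for p, r in zip(preds, refs))
    fn = sum(r == label != p for p, r in zip(preds, refs))
    return tp, fp, fn

# "cat": tp=2, fp=0, fn=1 -> F1 = 2*2 / (2*2 + 0 + 1) = 0.8
# "dog": tp=0, fp=1, fn=0 -> F1 = 0.0 (with zero_division=0)
tp, fp, fn = stats("cat")
f1_cat = 2 * tp / (2 * tp + fp + fn)

# weighted F1 weights by reference support: (3*0.8 + 0*0.0) / 3 = 0.8
weighted_f1 = (3 * f1_cat + 0 * 0.0) / 3

# micro F1 pools counts across classes: tp=2, fp=1, fn=1 -> 4/6 = 0.667
micro_f1 = 2 * 2 / (2 * 2 + 1 + 1)
```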

Comment on lines +75 to +86
def _compute(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    predictions = [
        str(r.test_case.task_output.get(output_key, ""))
        for r in test_results
        if r.test_case.task_output is not None
    ]
    references = [
        str(r.test_case.task_output.get(reference_key, ""))
        for r in test_results
        if r.test_case.task_output is not None
Contributor

task_output.get(..., "") turns a missing reference_key into an empty-string ground truth — should we skip results missing reference_key or pull the reference from the dataset item/scoring inputs instead?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/classification_scoring.py around lines 75 to 88, the
inner _compute of f1_scoring_function builds predictions and references using
task_output.get(..., "") which inserts empty-string sentinels when the key is missing.
Change the list comprehensions to skip any test_result where test_case.task_output is
None or does not contain the requested key (e.g. use `if r.test_case.task_output is not
None and output_key in r.test_case.task_output`), and access the value by indexing
(r.test_case.task_output[output_key]) instead of using a default. Also ensure the same
precise change is applied to the equivalent comprehensions in precision_scoring_function
and recall_scoring_function so missing keys are skipped rather than treated as empty
labels.
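
The suggested fix could be sketched as follows. collect_labels is a hypothetical helper, and plain dicts stand in for the SDK's TestResult objects:

```python
# Sketch of the suggested fix: skip results whose task_output is missing or
# lacks either key, instead of substituting "" as a class label.
from typing import List, Tuple


def collect_labels(
    test_results: List[dict],
    output_key: str = "output",
    reference_key: str = "reference",
) -> Tuple[List[str], List[str]]:
    usable = [
        r["task_output"]
        for r in test_results
        if r.get("task_output") is not None
        and output_key in r["task_output"]
        and reference_key in r["task_output"]
    ]
    # index directly instead of .get(..., ""): missing keys were filtered out
    predictions = [str(out[output_key]) for out in usable]
    references = [str(out[reference_key]) for out in usable]
    return predictions, references
```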

@dsblank
Contributor

dsblank commented Apr 7, 2026

@anshuchowdaryalapati you will address the baz-reviewer comments?




Development

Successfully merging this pull request may close these issues.

[FR]: Support for classification metrics (Precision, Recall, F1) with dataset-level evaluation

3 participants