
feat: add classification metrics (F1, Precision, Recall) for dataset-level evaluation #6081

Open
anshuchowdaryalapati wants to merge 2 commits into comet-ml:main from anshuchowdaryalapati:main

Conversation

@anshuchowdaryalapati

Closes #5988

What this PR does

Adds three new dataset-level classification metrics to the Opik Python SDK:

  • F1Score — supports macro, micro, weighted averaging
  • PrecisionScore — supports macro, micro, weighted averaging
  • RecallScore — supports macro, micro, weighted averaging
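
For context, the three averaging strategies differ only in how per-class scores are combined. A dependency-free sketch for precision (pure Python for illustration; the PR's implementation delegates to sklearn.metrics, and `precision_by_average` is a hypothetical helper, not SDK code):

```python
# Minimal illustration of macro vs micro vs weighted averaging, shown for
# precision. Pure Python for clarity; the PR itself calls sklearn.metrics.
from collections import Counter

def precision_by_average(preds, refs, average="macro"):
    labels = sorted(set(refs) | set(preds))
    support = Counter(refs)  # per-class count in the references
    stats = {}
    for label in labels:
        tp = sum(p == r == label for p, r in zip(preds, refs))
        fp = sum(p == label != r for p, r in zip(preds, refs))
        stats[label] = (tp, fp)
    if average == "micro":
        # pool TP/FP counts over all classes, then take one global ratio
        tp = sum(t for t, _ in stats.values())
        fp = sum(f for _, f in stats.values())
        return tp / (tp + fp) if tp + fp else 0.0
    per_class = {
        label: (tp / (tp + fp) if tp + fp else 0.0)
        for label, (tp, fp) in stats.items()
    }
    if average == "weighted":
        # weight each class's precision by its support in the references
        total = sum(support.values())
        return sum(per_class[l] * support[l] for l in labels) / total
    # macro: unweighted mean over classes
    return sum(per_class.values()) / len(labels)
```

The same pooling-vs-averaging distinction applies to recall and F1.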

Motivation

LLM pipelines are not just generation — they include classification components like:

  • query routing
  • label assignment
  • document filtering

These need dataset-level metrics (F1, Precision, Recall), not per-sample scores.
Currently users have to compute these outside Opik in notebooks/scripts.
This PR brings everything into one place.

Changes

  • sdks/python/src/opik/evaluation/metrics/heuristics/classification.py — 3 new metric classes
  • sdks/python/tests/unit/evaluation/metrics/test_classification.py — 12 unit tests

Testing

python -m pytest tests/unit/evaluation/metrics/test_classification.py -v
12 passed

@anshuchowdaryalapati anshuchowdaryalapati requested a review from a team as a code owner April 6, 2026 08:23
@github-actions github-actions bot added the python (Pull requests that update Python code), tests (Including test files, or tests related like configuration.), and Python SDK labels Apr 6, 2026
Comment on lines +9 to +15
class TestF1Score:
    def test_perfect_predictions(self):
        metric = F1Score(average="macro")
        result = metric.score(
            predictions=["cat", "cat", "dog"],
            references=["cat", "cat", "dog"],
        )
Contributor

Can we rename TestF1Score.test_perfect_predictions and the other new tests to the SDK naming convention test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT, using test_WHAT__happyflow for happy paths?

Finding type: AI Coding Guidelines | Severity: 🟢 Low



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/metrics/test_classification.py around lines 9-15, the
test method test_perfect_predictions in class TestF1Score does not follow the required
naming convention. Rename that method to test_f1_score__happyflow (keeping the same body
and assertions). Then, across the same file, rename the other test methods to the
pattern test_<what>__<case_description>__<expected_result> (e.g.
test_all_wrong_predictions -> test_f1_score__all_wrong__zero, test_weighted_average ->
test_f1_score__weighted_average__range, test_micro_average ->
test_f1_score__micro_average__range, and similar for PrecisionScore and RecallScore
tests) while preserving their bodies, assertions, and test coverage.

from typing import Any, List, Literal, Optional

from sklearn.metrics import f1_score, precision_score, recall_score
Contributor

Should we replace from sklearn.metrics import f1_score, precision_score, recall_score with import sklearn.metrics as metrics and call metrics.f1_score()/metrics.precision_score()/metrics.recall_score() to follow sdks/python/AGENTS.md's module-style import guideline?

from sklearn.metrics import f1_score, precision_score, recall_score => import sklearn.metrics as metrics
f1_score(...) => metrics.f1_score(...)

Finding type: AI Coding Guidelines | Severity: 🟢 Low



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around line 4, the
file does `from sklearn.metrics import f1_score, precision_score, recall_score` which
violates the SDK import-style guideline. Replace that named import with a module import
`import sklearn.metrics as metrics` and update all call sites in this file to use the
module prefix (e.g. change `f1_score(...)` to `metrics.f1_score(...)`,
`precision_score(...)` to `metrics.precision_score(...)`, and `recall_score(...)` to
`metrics.recall_score(...)`) so the three score() methods continue to work and the code
follows module-style imports.

Comment on lines +10 to +21
class F1Score(base_metric.BaseMetric):
    """
    A metric that computes the F1-score for classification tasks at the dataset level.

    Unlike per-sample metrics, this metric requires all predictions and references
    to be collected first, then scored together.

    Args:
        average: Averaging strategy - 'macro', 'micro', or 'weighted'. Defaults to 'macro'.
        name: The name of the metric. Defaults to "f1_score_metric".
Contributor

F1Score, PrecisionScore, and RecallScore aren't exported from metrics/__init__.py, should we import them and add them to __all__?

Suggested change (in sdks/python/src/opik/evaluation/metrics/__init__.py):

from .heuristics.classification import F1Score, PrecisionScore, RecallScore
__all__ += ["F1Score", "PrecisionScore", "RecallScore"]

Finding type: Breaking Changes | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 10-79
the new classes F1Score, PrecisionScore, and RecallScore are implemented and meant to be
public. Update the package export file
sdks/python/src/opik/evaluation/metrics/__init__.py (around the existing imports at
lines ~24-32) to import these classes (e.g. from .heuristics.classification import
F1Score, PrecisionScore, RecallScore) and add their names to the module's __all__ list
so they are re-exported and accessible via from opik.evaluation.metrics import F1Score,
PrecisionScore, RecallScore.

Comment on lines +65 to +74
if len(predictions) != len(references):
    raise ValueError(
        f"predictions and references must have the same length, "
        f"got {len(predictions)} and {len(references)}"
    )

value = f1_score(
    references, predictions, average=self._average, zero_division=0
)

Contributor

ValueError for length mismatch and sklearn errors (empty/invalid inputs) leaks through — should we raise MetricComputationError instead and wrap the sklearn call?

Finding types: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/metrics/heuristics/classification.py around lines 65-74
(and analogous checks in PrecisionScore.score and RecallScore.score), replace the bare
ValueError for len(predictions) != len(references) with
opik.exceptions.MetricComputationError, preserving the existing message. Then guard for
empty predictions/references and raise MetricComputationError with a clear message if
empty. Wrap the sklearn.metrics call (f1_score, precision_score, recall_score) in
try/except ValueError as e and raise MetricComputationError("failed to compute
[metric-name]: {e}") from e. Add an import for MetricComputationError at the top of the
file. Update any unit tests that assert ValueError for these conditions to expect
MetricComputationError instead.
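
The error-handling pattern this review asks for could be sketched as below. The MetricComputationError class here is a local stand-in for opik.exceptions.MetricComputationError, and safe_f1 is a hypothetical helper, not the PR's actual score() method:

```python
# Sketch of the requested error handling: validate inputs up front and wrap
# the underlying computation so only MetricComputationError escapes.
from typing import Callable, List


class MetricComputationError(Exception):
    """Local stand-in for opik.exceptions.MetricComputationError."""


def safe_f1(
    references: List[str],
    predictions: List[str],
    compute: Callable[[List[str], List[str]], float],
) -> float:
    if len(predictions) != len(references):
        raise MetricComputationError(
            f"predictions and references must have the same length, "
            f"got {len(predictions)} and {len(references)}"
        )
    if not predictions:
        raise MetricComputationError("predictions and references must be non-empty")
    try:
        # `compute` stands in for sklearn.metrics.f1_score(references, predictions, ...)
        return compute(references, predictions)
    except ValueError as e:
        raise MetricComputationError(f"failed to compute f1_score: {e}") from e
```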

@kriogenia

I don't get this PR. You are implementing those metrics as a BaseMetric, but those types of metrics are for per-sample evaluation. The issue this stems from already pointed out that these kinds of metrics should be experiment scoring functions instead, which makes sense to me.

How is this intended to be used with evaluate, for example? Shouldn't there be some integration tests for them instead of just unit tests?

@anshuchowdaryalapati
Author

anshuchowdaryalapati commented Apr 6, 2026

Hi @kriogenia — thank you for the feedback, very helpful.
I've refactored the implementation completely. Instead of BaseMetric, I've implemented experiment-level scoring functions — plain callables that take List[TestResult] and return List[ScoreResult], designed to be passed directly to experiment_scoring_functions in opik.evaluate().
New file: sdks/python/src/opik/evaluation/classification_scoring.py

  • f1_scoring_function(average="macro"|"micro"|"weighted")
  • precision_scoring_function(average="macro"|"micro"|"weighted")
  • recall_scoring_function(average="macro"|"micro"|"weighted")

All follow the same pattern as compute_std_deviation in test_experiment_scoring_functions.py.
13 unit tests passing. Regarding integration tests — I'd like to add them in tests/e2e/evaluation/test_experiment_scoring_functions.py following the existing pattern. Could you confirm if that's the right place? I don't have a running backend to validate e2e locally — happy to add them if you can confirm the setup.
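
The factory-returning-a-callable shape described above could look roughly like this. TestResult and ScoreResult here are simplified stand-ins for the Opik SDK types (the real TestResult nests task_output under test_case), and only macro averaging is implemented:

```python
# Illustrative sketch of the experiment-level scoring-function pattern:
# a factory that returns a callable consuming all test results at once.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestResult:            # stand-in; the SDK nests this under test_case
    task_output: dict


@dataclass
class ScoreResult:           # stand-in for the SDK's ScoreResult
    name: str
    value: float


def f1_scoring_function(
    average: str = "macro",
) -> Callable[[List[TestResult]], List[ScoreResult]]:
    def _compute(test_results: List[TestResult]) -> List[ScoreResult]:
        preds = [r.task_output["output"] for r in test_results]
        refs = [r.task_output["reference"] for r in test_results]
        labels = sorted(set(refs) | set(preds))
        f1s = []
        for label in labels:
            tp = sum(p == r == label for p, r in zip(preds, refs))
            fp = sum(p == label != r for p, r in zip(preds, refs))
            fn = sum(r == label != p for p, r in zip(preds, refs))
            denom = 2 * tp + fp + fn          # F1 = 2TP / (2TP + FP + FN)
            f1s.append(2 * tp / denom if denom else 0.0)
        # macro average: unweighted mean over classes
        return [ScoreResult(name=f"f1_{average}", value=sum(f1s) / len(f1s))]
    return _compute
```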

Comment on lines +63 to +70
def test_f1_scoring_function__weighted_average__range(self):
    fn = f1_scoring_function(average="weighted")
    results = _make_test_results(
        predictions=["cat", "dog", "cat"],
        references=["cat", "cat", "cat"],
    )
    scores = fn(results)
    assert 0.0 <= scores[0].value <= 1.0
Contributor

test_f1_scoring_function__weighted_average__range only asserts 0.0 <= scores[0].value <= 1.0, should we assert the actual expected scores like weighted F1 ≃ 0.8, micro F1 ≃ 0.667, macro precision ≃ 0.5, weighted recall ≃ 0.667 instead?

Finding type: Assert expected behavior | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/tests/unit/evaluation/test_classification_scoring.py around lines 63 to 70,
the test_f1_scoring_function__weighted_average__range only asserts 0.0 <= score <= 1.0;
change it to assert the weighted F1 equals approximately 0.8 using pytest.approx(0.8,
rel=1e-3) (or pytest.approx(0.8, abs=1e-3)). In the same file around lines 72 to 79
(test_f1_scoring_function__micro_average__range) replace the trivial range assertion
with an assert that the micro F1 is approximately 0.667 using pytest.approx(0.667,
rel=1e-3). Around lines 111 to 118 (test_precision_scoring_function__partial__range)
assert the macro precision is approximately 0.5 with pytest.approx(0.5, rel=1e-3).
Around lines 137 to 144 (test_recall_scoring_function__partial__range) assert the
weighted recall is approximately 0.667 with pytest.approx(0.667, rel=1e-3). Keep the
test inputs unchanged and use pytest.approx for stable floating comparisons.
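
The expected values cited in the suggestion can be verified by hand for predictions=["cat", "dog", "cat"] against references=["cat", "cat", "cat"], mirroring sklearn's formulas with zero_division=0:

```python
# Hand-check of the review's expected scores.
preds = ["cat", "dog", "cat"]
refs = ["cat", "cat", "cat"]

def stats(label):
    tp = sum(p == r == label for p, r in zip(preds, refs))
    fp = sum(p == label != r for p, r in zip(preds, refs))
    fn = sum(r == label != p for p, r in zip(preds, refs))
    return tp, fp, fn

# "cat": tp=2, fp=0, fn=1 -> F1 = 2*2 / (2*2 + 0 + 1) = 0.8
# "dog": tp=0, fp=1, fn=0 -> F1 = 0.0 (with zero_division=0)
tp, fp, fn = stats("cat")
f1_cat = 2 * tp / (2 * tp + fp + fn)

# weighted F1 weights by reference support: (3*0.8 + 0*0.0) / 3 = 0.8
weighted_f1 = (3 * f1_cat + 0 * 0.0) / 3

# micro F1 pools counts across classes: tp=2, fp=1, fn=1 -> 4/6 = 0.667
micro_f1 = 2 * 2 / (2 * 2 + 1 + 1)
```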

Comment on lines +75 to +86
def _compute(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    predictions = [
        str(r.test_case.task_output.get(output_key, ""))
        for r in test_results
        if r.test_case.task_output is not None
    ]
    references = [
        str(r.test_case.task_output.get(reference_key, ""))
        for r in test_results
        if r.test_case.task_output is not None
Contributor

task_output.get(..., "") turns a missing reference_key into an empty-string ground truth — should we skip results missing reference_key or pull the reference from the dataset item/scoring inputs instead?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/classification_scoring.py around lines 75 to 88, the
inner _compute of f1_scoring_function builds predictions and references using
task_output.get(..., "") which inserts empty-string sentinels when the key is missing.
Change the list comprehensions to skip any test_result where test_case.task_output is
None or does not contain the requested key (e.g. use `if r.test_case.task_output is not
None and output_key in r.test_case.task_output`), and access the value by indexing
(r.test_case.task_output[output_key]) instead of using a default. Also ensure the same
precise change is applied to the equivalent comprehensions in precision_scoring_function
and recall_scoring_function so missing keys are skipped rather than treated as empty
labels.
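
The suggested fix could be sketched as follows. collect_labels is a hypothetical helper, and plain dicts stand in for the SDK's TestResult objects:

```python
# Sketch of the suggested fix: skip results whose task_output is missing or
# lacks either key, instead of substituting "" as a class label.
from typing import List, Tuple


def collect_labels(
    test_results: List[dict],
    output_key: str = "output",
    reference_key: str = "reference",
) -> Tuple[List[str], List[str]]:
    usable = [
        r["task_output"]
        for r in test_results
        if r.get("task_output") is not None
        and output_key in r["task_output"]
        and reference_key in r["task_output"]
    ]
    # index directly instead of .get(..., ""): missing keys were filtered out
    predictions = [str(out[output_key]) for out in usable]
    references = [str(out[reference_key]) for out in usable]
    return predictions, references
```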

@dsblank
Contributor

dsblank commented Apr 7, 2026

@anshuchowdaryalapati you will address the baz-reviewer comments?




Development

Successfully merging this pull request may close these issues.

[FR]: Support for classification metrics (Precision, Recall, F1) with dataset-level evaluation

3 participants