This PR adds POLLUX, a criteria-based LLM judge suitable for any generative task, with customizable criterion descriptions.
`PolluxLLMJudgeMetric` is exposed as a custom metric class in `lighteval.metrics.metrics_sample` (not registered on the built-in `Metrics` enum), because both the scoring scale and the criterion description are defined at initialization time.

### Summary

- **Metric class:** `PolluxLLMJudgeMetric`, built on `JudgeLM` (OpenAI-compatible HTTP API by default, plus the existing LightEval judge backends).
- **Base class:** `SampleLevelComputation`, not `JudgeLLM`, so any judge model id (e.g. Hugging Face repo ids for POLLUX checkpoints) works with the `openai` backend without the OpenAI-model whitelist enforced on `JudgeLLM`.
- **Prompt construction:** `get_judge_prompt_pollux` and `_build_pollux_prompt_text` in `judge_utils`.
- **Response parsing:** `make_pollux_score_parser(pattern)` and `make_pollux_feedback_parser(pattern)`, with defaults `POLLUX_DEFAULT_SCORE_RE` (bare numeric answer) and no feedback (`POLLUX_DEFAULT_FEEDBACK_RE` is `None`). For tagged judge output, use `POLLUX_TAGGED_SCORE_RE` and `POLLUX_TAGGED_FEEDBACK_RE` (`[RESULT]…[END]`, `[FEEDBACK]…[RESULT]`). `process_judge_response_pollux` / `parse_pollux_feedback` remain aliases for the default factories.
- **Outputs:** `pollux_score`. Optional `pollux_feedback` when `include_feedback=True` and a non-`None` `feedback_pattern` is supplied (otherwise feedback is the empty string). Corpus aggregation should average only `pollux_score` (or use a custom `corpus_level_fn`), not the free-text feedback.
- **Inputs:** query from `doc.query`, answer from `response.final_text[0]`, optional reference from `doc.specific["reference_answer"]` passed as the POLLUX gold; `options` are always `None` for this metric.
- **Docs:** usage and patterns in `docs/source/metric-list.mdx`; `[[autodoc]]` for `PolluxLLMJudgeMetric` in `docs/source/package_reference/metrics.mdx`.
- **Tests:** `tests/unit/metrics/test_pollux_judge.py` (mocked `JudgeLM`, no network). `tests/unit/metrics/test_cases/pollux_judge.json` is a deliberate placeholder with `metric_class` `pollux_llm_judge_custom`, so automated JSON runs skip it until a `Metrics` enum entry exists.

### Why this metric is not on the `Metrics` enum

- `Metrics` lists preset metrics with fixed wiring (`metric_name`, `corpus_level_fn`, …).
- Users wrap the metric themselves as `SampleLevelMetric(..., sample_level_fn=PolluxLLMJudgeMetric(...), batched_compute=True, ...)`, like other custom metrics. Adding an enum member without an agreed preset would bloat the API.
- If a canonical preset appears later, we could add an enum entry and, if needed, list it in `SKIPPED_METRICS` for automated JSON (same idea as `simpleqa_judge`).