arxiv:2602.13110

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Published on Feb 19

Abstract

SCOPE framework uses Bidirectional Preference Entropy to improve LLM-based pairwise evaluation by providing better uncertainty estimation and maintaining low error rates while achieving higher coverage compared to traditional methods.

AI-generated summary

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level α. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at α = 0.10, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk approximately 0.097 to 0.099), while retaining substantial coverage, reaching 0.89 on RewardBench with Qwen-14B and 0.98 on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to 2.4× more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
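The two mechanisms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify how the two position-wise probabilities are aggregated or how the threshold is calibrated, so the sketch assumes a plain average for the aggregation and a simple conservative error estimate, (errors + 1) / (accepted + 1), for the threshold. That estimate is a stand-in and does not reproduce the paper's finite-sample guarantee. The function names `binary_entropy`, `bpe_score`, and `calibrate_threshold` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bpe_score(p_a_shown_first: float, p_a_shown_second: float) -> tuple:
    """Return (aggregated P(A preferred), entropy-based uncertainty).

    Both inputs are the judge's probability that response A wins; they
    differ only in the position at which A was shown to the judge.
    ASSUMPTION: aggregation is a plain average of the two probabilities.
    """
    p_bar = 0.5 * (p_a_shown_first + p_a_shown_second)
    return p_bar, binary_entropy(p_bar)


def calibrate_threshold(scores, correct, alpha):
    """Pick the largest uncertainty threshold t such that, on the
    calibration set, judgments with score <= t have a conservative
    error estimate (errors + 1) / (accepted + 1) of at most alpha.

    ASSUMPTION: this estimate is illustrative only, not the paper's rule.
    Returns None when no threshold qualifies (abstain on everything).
    """
    pairs = sorted(zip(scores, correct))  # most confident judgments first
    best = None
    errors = 0
    for i, (score, ok) in enumerate(pairs, start=1):
        if not ok:
            errors += 1
        # Evaluate only at the last item of a run of tied scores.
        if i < len(pairs) and pairs[i][0] == score:
            continue
        if (errors + 1) / (i + 1) <= alpha:
            best = score
    return best


def accept(uncertainty: float, threshold) -> bool:
    """Keep a judgment only if its uncertainty is within the threshold."""
    return threshold is not None and uncertainty <= threshold
```

In this sketch, a judge that flips its preference when the responses are swapped (for example 0.9 in one order and 0.1 in the other) averages to 0.5 and receives the maximum uncertainty of 1 bit, so position-driven preferences are the first to be abstained on.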


