Papers
arxiv:2502.18018

Verdict: A Library for Scaling Judge-Time Compute

Published on Nov 5, 2025
Authors:

Abstract

Verdict is an open-source library that enhances LLM-based evaluation systems by scaling judge-time compute through modular reasoning units, achieving competitive performance with much larger models across tasks like content moderation and fact-checking.

AI-generated summary

The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce Verdict, an open-source library for scaling judge-time compute to enhance the accuracy, reliability, and interpretability of automated evaluators. Verdict leverages the composition of modular reasoning units (such as verification, debate, and aggregation) and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, Verdict judges achieves performance competitive with orders-of-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Our framework establishes a foundation for scalable, interpretable, and reliable LLM-based evaluation systems for both researchers and practitioners.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2502.18018
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.18018 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.18018 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.18018 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.