ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
Abstract
ACES tackles the challenge of selecting correct code candidates from LLM-generated outputs: it ranks tests by their ability to distinguish correct from incorrect code, using leave-one-out evaluation and AUC consistency scoring.
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
Selecting the best candidate among multiple LLM-generated code solutions typically relies on LLM-generated tests, yet those tests may themselves be wrong, so counting all test votes equally can promote incorrect candidates. ACES introduces a principled leave-one-out methodology that scores each test's reliability directly from the binary pass matrix, so that candidate rankings reflect genuinely discriminative tests rather than artifacts of faulty ones.
Key Idea
Determining whether a test is correct would require knowing which code candidates are correct, a circular dependency. ACES sidesteps this by measuring discriminative power instead: hold out one test, rank the code candidates by their aggregate pass scores on all remaining tests, and check whether the held-out test's pass/fail pattern agrees with that ranking. This agreement is quantified as the leave-one-out AUC (LOO-AUC), whose expected value is proportional to the test's ability to separate correct from incorrect code.
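The leave-one-out scoring described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the paper's exact formulation; all names here are illustrative.

```python
# Sketch of leave-one-out AUC (LOO-AUC) scoring from a binary pass matrix.
# P[i][j] = 1 if code candidate i passes test j. Illustrative only; the
# paper's exact formulation may differ.

def loo_auc(P, j):
    """LOO-AUC of test j: hold out test j, score each code candidate by
    its total passes on the remaining tests, and measure how well those
    scores separate candidates that pass test j from those that fail it."""
    n_codes = len(P)
    # Aggregate score of each candidate on all tests except j.
    scores = [sum(row) - row[j] for row in P]
    passed = [scores[i] for i in range(n_codes) if P[i][j] == 1]
    failed = [scores[i] for i in range(n_codes) if P[i][j] == 0]
    if not passed or not failed:
        return 0.5  # degenerate test: all candidates pass or all fail
    # AUC = probability that a passing candidate outscores a failing one,
    # counting ties as 1/2.
    wins = 0.0
    for p in passed:
        for f in failed:
            if p > f:
                wins += 1.0
            elif p == f:
                wins += 0.5
    return wins / (len(passed) * len(failed))
```

On a toy pass matrix where two candidates are correct and one test votes erratically, the erratic test scores near chance (0.5) while tests aligned with the consensus score higher.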
Method / Approach
ACES operates solely on the binary pass matrix of code candidates against tests and comes in two complementary variants. ACES-C derives closed-form test weights that provably approximate the oracle weighting in expectation, under a mild assumption on average test quality. ACES-O drops this assumption and instead iteratively optimizes a differentiable LOO-AUC objective. In both cases, the learned weights turn raw test votes into a ranking over candidates at negligible computational overhead.
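To illustrate how per-test scores feed into candidate selection, the sketch below weights each test by its margin over chance LOO-AUC and ranks candidates by weighted pass counts. This weighting scheme and the function are hypothetical stand-ins, not the closed-form ACES-C weights the paper derives.

```python
# Hypothetical candidate ranking from precomputed LOO-AUC scores.
# Stand-in for ACES-C's closed-form weights (which the paper derives
# exactly): weight each test by how far its LOO-AUC exceeds chance
# (0.5), so uninformative tests contribute nothing to the vote.

def rank_candidates(P, aucs):
    """Rank code candidates (best first) by weighted test votes.
    P[i][j] = 1 if candidate i passes test j; aucs[j] is test j's LOO-AUC."""
    # Clip at zero so tests at or below chance get no say.
    weights = [max(a - 0.5, 0.0) for a in aucs]

    def weighted_score(i):
        return sum(w * P[i][j] for j, w in enumerate(weights))

    return sorted(range(len(P)), key=weighted_score, reverse=True)
```

With this scheme, a test whose LOO-AUC is 0.5 (chance) is effectively ignored, which is the qualitative behavior the paper's weighting aims for.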
Results
Both variants achieve state-of-the-art Pass@k on multiple code generation benchmarks. This shows that ranking tests by their discriminative power, rather than counting all votes equally or filtering tests with ad-hoc heuristics, yields more reliable selection of correct code candidates.
arXiv: 2604.03922