ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
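The GP-surrogate pipeline described above can be sketched in a few lines: fit a GP to scores observed on tested inputs, approximate the expected performance by averaging the posterior mean over a candidate pool (a simple Monte Carlo stand-in for Bayesian quadrature under a uniform input measure), and probe for failures with an upper-confidence-bound rule over the score's superlevel set. This is a minimal illustrative sketch, not the paper's implementation: the RBF kernel, one-dimensional toy severity function, pool-average BQ estimate, and UCB acquisition are all assumptions chosen for brevity.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.5):
    """Squared-exponential kernel on 1-D inputs (prior variance 1)."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

class GPSurrogate:
    """Minimal GP regression surrogate for a score function f(x)."""
    def __init__(self, lengthscale=0.5, noise=1e-4):
        self.lengthscale = lengthscale
        self.noise = noise

    def fit(self, X, y):
        self.X = X
        K = rbf_kernel(X, X, self.lengthscale) + self.noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))
        return self

    def predict(self, Xs):
        """Posterior mean and variance at query inputs Xs."""
        Ks = rbf_kernel(Xs, self.X, self.lengthscale)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - np.sum(v**2, axis=0)  # k(x,x) = 1 for this RBF kernel
        return mean, np.maximum(var, 0.0)

def bq_estimate(gp, pool):
    """Pool-average approximation of E[f(x)]: integrate the GP posterior
    mean under a uniform measure represented by the candidate pool."""
    mean, _ = gp.predict(pool)
    return float(mean.mean())

def next_failure_probe(gp, pool, beta=2.0):
    """Superlevel-set style acquisition: test the input where a high
    (e.g. severe) score is most plausible under the posterior."""
    mean, var = gp.predict(pool)
    ucb = mean + beta * np.sqrt(var)
    return pool[np.argmax(ucb)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    severity = lambda x: np.sin(3 * x)     # hypothetical stand-in score function
    X = rng.uniform(0, 2, size=8)          # inputs tested so far
    gp = GPSurrogate().fit(X, severity(X))
    pool = np.linspace(0, 2, 200)          # untested candidate inputs
    print("estimated mean score:", bq_estimate(gp, pool))
    print("next input to probe:", next_failure_probe(gp, pool))
```

In the paper's setting the surrogate is pre-trained and transferred across models rather than fit from scratch, and the quadrature weights come from the chosen input distribution; the pool average here is just the simplest uniform-weight instance of that idea.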