Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents
Abstract
A comprehensive benchmark evaluates enterprise data agents' ability to integrate and analyze multi-database data through natural language, revealing significant challenges in real-world applications.
Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text. Existing benchmarks tackle only individual pieces of this problem -- e.g., translating natural-language questions into SQL queries, or answering questions over small tables provided in context -- but do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries across 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are published at github.com/ucbepic/DataAgentBench.
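For reference, pass@1 is the fraction of queries the agent answers correctly on a single attempt. A minimal sketch of the standard unbiased pass@k estimator (the common HumanEval-style formulation) is shown below; the function name and example numbers are illustrative and not taken from the DAB codebase.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-sample
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, the plain per-attempt accuracy,
# i.e., the kind of figure behind the 38% pass@1 reported on DAB.
print(pass_at_k(n=5, c=2, k=1))  # 0.4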
Community
The following similar papers were recommended by the Semantic Scholar API:
- Arming Data Agents with Tribal Knowledge (2026)
- TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas (2026)
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? (2026)
- ReViSQL: Achieving Human-Level Text-to-SQL (2026)
- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents (2026)
- SQL-Trail: Multi-Turn Reinforcement Learning with Interleaved Feedback for Text-to-SQL (2026)
- GISA: A Benchmark for General Information-Seeking Assistant (2026)