arxiv:2602.04064

The CitizenQuery Benchmark: A Novel Dataset and Evaluation Pipeline for Measuring LLM Performance in Citizen Query Tasks

Published on Feb 3

AI-generated summary

Large language models face significant challenges in responding accurately to citizen queries, and the risk of misinformation makes careful evaluation of their trustworthiness and reliability essential in public sector applications.

Abstract

"Citizen queries" are questions asked by an individual about government policies, guidance, and services that are relevant to their circumstances, encompassing a range of topics including benefits, taxes, immigration, employment, public health, and more. This represents a compelling use case for Large Language Models (LLMs) that respond to citizen queries with information that is adapted to a user's context and communicated according to their needs. However, in this use case, any misinformation could have severe, negative, likely invisible ramifications for an individual placing their trust in a model's response. To this effect, we introduce CitizenQuery-UK, a benchmark dataset of 22 thousand pairs of citizen queries and responses that have been synthetically generated from the swathes of public information on gov.uk about government in the UK. We present the curation methodology behind CitizenQuery-UK and an overview of its contents. We also introduce a methodology for the benchmarking of LLMs with the dataset, using an adaptation of FActScore to benchmark 11 models for factuality, abstention frequency, and verbosity. We document these results, and interpret them in the context of the public sector, finding that: (i) there are distinct performance profiles across model families, but each is competitive; (ii) high variance undermines utility; (iii) abstention is low and verbosity is high, with implications on reliability; and (iv) more trustworthy AI requires acknowledged "fallibility" in the way it interacts with users. The contribution of our research lies in assessing the trustworthiness of LLMs in citizen query tasks; as we see a world of increasing AI integration into day-to-day life, our benchmark, built entirely on open data, lays the foundations for better evidenced decision-making regarding AI and the public sector.
