arxiv:2604.23290

An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations

Published on Apr 25

AI-generated summary

Research examines active learning techniques with noisy oracles using real crowd-sourced text classification annotations, revealing performance challenges in real-world deployment scenarios.

Abstract

Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and substantially reduce the human annotation effort needed to induce a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances. This cannot be guaranteed in real-world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect or noisy oracles. Existing research on active learning with noisy oracles typically simulates the oracles using machine learning models; however, real-world situations are much more challenging, and ML models that simulate annotation patterns may not capture the nuances of real-world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd-sourced workers through a crowd-sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses shed light on the performance of these techniques under real-world challenges, where annotators can provide incorrect labels and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real-world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.
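To make the setup in the abstract concrete, here is a minimal sketch of one active learning query round in which the oracle is a table of recorded crowd annotations and an annotator may refuse to label. It is an illustration under stated assumptions, not the paper's method: entropy-based uncertainty sampling is used only as a familiar example of a query strategy, and the names `entropy`, `select_most_uncertain`, `query_annotations`, `pool_probs`, and `crowd_labels` are hypothetical. The actual format of the released annotations is not described here, so a plain dictionary with `None` marking a refusal stands in for it.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(
    pool_probs: Dict[int, Sequence[float]], batch_size: int
) -> List[int]:
    """Return the ids of the batch_size pool samples with the highest entropy.

    Ties are broken by sample id so the selection is deterministic.
    """
    ranked = sorted(pool_probs, key=lambda i: (-entropy(pool_probs[i]), i))
    return ranked[:batch_size]


def query_annotations(
    sample_ids: Sequence[int], crowd_labels: Dict[int, Optional[int]]
) -> Tuple[Dict[int, int], List[int]]:
    """Look up recorded crowd labels for the queried samples.

    Assumption: None (or a missing id) means the annotator refused to label.
    Returned labels may still be incorrect; nothing here verifies them.
    """
    labeled: Dict[int, int] = {}
    refused: List[int] = []
    for sid in sample_ids:
        label = crowd_labels.get(sid)
        if label is None:
            refused.append(sid)
        else:
            labeled[sid] = label
    return labeled, refused


# Hypothetical model predictions over a four-sample unlabeled pool.
pool_probs = {0: [0.5, 0.5], 1: [0.9, 0.1], 2: [0.6, 0.4], 3: [0.99, 0.01]}
# Hypothetical recorded annotations; sample 0 was refused by the annotator.
crowd_labels = {0: None, 1: 0, 2: 1, 3: 0}

queried = select_most_uncertain(pool_probs, batch_size=2)
labeled, refused = query_annotations(queried, crowd_labels)
print(queried, labeled, refused)  # [0, 2] {2: 1} [0]
```

In this toy round the most uncertain sample is also the one the annotator declined to label, so only one of the two queries yields a training label. A real loop would have to decide whether to re-query, skip, or reassign refused samples, and would still have to cope with incorrect labels among those returned.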

Community

Paper author

The proposed dataset can be accessed at this URL: https://github.com/varuntotakura/al_rcta/.

To appear in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2026)


