arxiv:2605.30288

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Published on May 29 · Submitted by Jian Yang on Jun 3

Abstract

MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

Community

Paper submitter

🧠 MIRA is a data selection framework for the mid-training stage of LLM development, the phase between pretraining and post-training that uses large-scale curated data to strengthen capabilities like coding, reasoning, and tool use. 💡 The core challenge is that mid-training corpora are extremely heterogeneous, mixing web documents, code, math, agent traces, and tool-use logs, so no single quality criterion works across all sources.

🔑 MIRA's key insight is to make rubric construction itself part of data selection, rather than relying on fixed or global quality criteria. It operates in four steps:

1️⃣ Source Clustering 🗂️: Groups 21 data sources into capability-coherent clusters based on content-embedding similarity;

2️⃣ Self-Anchored Rubric Discovery 🔍: A frontier teacher model (Kimi-K2.6) freely proposes quality dimensions for sampled records, which are then clustered into group-specific anchor rubrics;

3️⃣ Anchored Judge Distillation 🎓: These fixed rubrics are used to generate structured teacher labels (~2M scored records), which are distilled into lightweight group-specific student scorers (Qwen3.5-35B-A3B) for full-corpus inference;

4️⃣ Source-Aware Filtering 🎯: Reliability masking suppresses unreliable scoring dimensions, and per-group retention thresholds preserve source diversity.
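
Step 4 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `filter_by_group`, the per-dimension `reliability` values, the `min_reliability` cutoff, the mean aggregation over reliable dimensions, and the `keep_fraction` retention rule are all assumptions made for illustration, since the post does not specify how reliability is measured or how thresholds are set.

```python
from collections import defaultdict
from statistics import mean


def filter_by_group(records, reliability, min_reliability=0.5, keep_fraction=0.5):
    """Hypothetical source-aware filter.

    records: list of dicts with 'id', 'group', and 'scores' (dimension -> float).
    reliability: dict mapping group -> {dimension -> reliability in [0, 1]}
                 (assumed here to be given, e.g. from a held-out comparison).
    Returns the ids of the retained records.
    """
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec)

    kept = []
    for group, recs in by_group.items():
        # Reliability masking: ignore dimensions below the cutoff.
        reliable = [d for d, r in reliability[group].items() if r >= min_reliability]
        if not reliable:
            # Assumption: with no trustworthy dimension, keep the group unfiltered.
            kept.extend(rec["id"] for rec in recs)
            continue
        # Rank records by their mean score over the reliable dimensions only.
        ranked = sorted(
            recs,
            key=lambda rec: mean(rec["scores"][d] for d in reliable),
            reverse=True,
        )
        # Per-group retention: every group keeps its own top fraction.
        n_keep = max(1, round(len(ranked) * keep_fraction))
        kept.extend(rec["id"] for rec in ranked[:n_keep])
    return kept


records = [
    {"id": "code-1", "group": "code", "scores": {"correctness": 0.9, "style": 0.1}},
    {"id": "code-2", "group": "code", "scores": {"correctness": 0.2, "style": 0.9}},
    {"id": "math-1", "group": "math", "scores": {"rigor": 0.3}},
    {"id": "math-2", "group": "math", "scores": {"rigor": 0.8}},
]
reliability = {
    "code": {"correctness": 0.8, "style": 0.3},  # 'style' is masked out
    "math": {"rigor": 0.7},
}
kept = filter_by_group(records, reliability)
print(kept)
```

In this toy data, masking the low-reliability `style` dimension changes the outcome for the `code` group: averaged over both dimensions `code-2` would rank first, but on `correctness` alone `code-1` is retained. Applying the retention rule within each group, rather than over one global ranking, is what keeps any single source group from being filtered out entirely.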

📊 In code-oriented mid-training experiments on Qwen2.5-Coder-14B, MIRA-Group uses only 25B tokens, half the full 50B-token corpus 🔥, yet achieves the best macro average (64.20) across nine benchmarks, outperforming perplexity filtering, DSIR, DataMan, and random selection, while matching the unfiltered full-corpus run.

🔬 Analysis shows that MIRA's scores are robust to sequence length 📏, its discovered rubrics are source-adaptive while still subsuming generic quality criteria ✅, and its reliability masking effectively identifies and suppresses poorly calibrated scoring dimensions 🛡️.
