MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
Abstract
MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
Community
MIRA is a data selection framework for the mid-training stage of LLM development, the phase between pretraining and post-training that uses large-scale curated data to strengthen capabilities like coding, reasoning, and tool use. The core challenge is that mid-training corpora are extremely heterogeneous, mixing web documents, code, math, agent traces, and tool-use logs, so no single quality criterion works across all sources.
MIRA's key insight is to make rubric construction itself part of data selection, rather than relying on fixed or global quality criteria. It operates in four steps:
1. Source Clustering: Groups 21 data sources into capability-coherent clusters based on content-embedding similarity;
2. Self-Anchored Rubric Discovery: A frontier teacher model (Kimi-K2.6) freely proposes quality dimensions for sampled records, which are then clustered into group-specific anchor rubrics;
3. Anchored Judge Distillation: These fixed rubrics are used to generate structured teacher labels (~2M scored records), which are distilled into lightweight group-specific student scorers (Qwen3.5-35B-A3B) for full-corpus inference;
4. Source-Aware Filtering: Reliability masking suppresses unreliable scoring dimensions, and per-group retention thresholds preserve source diversity.
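The final filtering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`group`, `scores`), the function names, the plain average over reliable dimensions, and the record-count retention fraction are all assumptions made for illustration. The summary above does not say how per-dimension scores are aggregated or how thresholds are set.

```python
from collections import defaultdict


def masked_score(dim_scores, reliable_dims):
    """Average only the rubric dimensions marked reliable (assumed aggregation)."""
    kept = [s for d, s in dim_scores.items() if d in reliable_dims]
    return sum(kept) / len(kept) if kept else 0.0


def filter_by_group(records, reliable, keep_fraction=0.5):
    """Keep the top `keep_fraction` of records inside each source group.

    records:  list of dicts with hypothetical keys "group" and "scores"
    reliable: dict mapping group name -> set of reliable dimension names
    Ranking per group, rather than globally, keeps every group represented.
    """
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec)

    selected = []
    for group, recs in by_group.items():
        ranked = sorted(
            recs,
            key=lambda r: masked_score(r["scores"], reliable[group]),
            reverse=True,
        )
        # Retain at least one record so that no group disappears entirely.
        n_keep = max(1, int(len(recs) * keep_fraction))
        selected.extend(ranked[:n_keep])
    return selected
```

Because each group is ranked against its own records and its own reliable dimensions, a group whose scores run systematically lower is not crowded out by a higher-scoring group, which is one simple way to preserve source diversity.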
In code-oriented mid-training experiments on Qwen2.5-Coder-14B, MIRA-Group uses only 25B tokens, half the full 50B-token corpus, yet achieves the best macro average (64.20) across nine benchmarks, outperforming perplexity filtering, DSIR, DataMan, and random selection, while matching the unfiltered full-corpus run.
Analysis shows that MIRA's scores are robust to sequence length, its discovered rubrics are source-adaptive while still subsuming generic quality criteria, and its reliability masking effectively identifies and suppresses poorly calibrated scoring dimensions.
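One plausible way to identify poorly calibrated dimensions is to compare student and teacher scores on held-out records and mask any dimension where the two disagree. The sketch below assumes Pearson correlation as the agreement measure and a hypothetical cutoff `min_corr`; the summary does not state which reliability criterion MIRA actually uses, so treat both as illustrative choices.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def reliable_dimensions(teacher, student, min_corr=0.5):
    """Return the dimensions whose student scores track the teacher's.

    teacher, student: dict mapping dimension name -> list of scores on the
    same held-out records (hypothetical layout). `min_corr` is an assumed
    cutoff, not a value taken from the paper.
    """
    return {
        dim
        for dim in teacher
        if pearson(teacher[dim], student[dim]) >= min_corr
    }
```

Dimensions left out of the returned set would simply be ignored when a record's overall score is computed, so a scorer that is unreliable on one rubric dimension does not distort selection for the whole group.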