arxiv:2605.30288

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Published on May 29

Abstract

MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

Community

csjiaya

Paper submitter

🧠 MIRA is a data selection framework for the mid-training stage of LLM development, the phase between pretraining and post-training that uses large-scale curated data to strengthen capabilities like coding, reasoning, and tool use.

💡 The core challenge is that mid-training corpora are extremely heterogeneous, mixing web documents, code, math, agent traces, and tool-use logs, so no single quality criterion works across all sources.

🔑 MIRA's key insight is to make rubric construction itself part of data selection, rather than relying on fixed or global quality criteria. It operates in four steps:

1️⃣ Source Clustering 🗂️: Groups 21 data sources into capability-coherent clusters based on content-embedding similarity;

2️⃣ Self-Anchored Rubric Discovery 🔍: A frontier teacher model (Kimi-K2.6) freely proposes quality dimensions for sampled records, which are then clustered into group-specific anchor rubrics;

3️⃣ Anchored Judge Distillation 🎓: These fixed rubrics are used to generate structured teacher labels (~2M scored records), which are distilled into lightweight group-specific student scorers (Qwen3.5-35B-A3B) for full-corpus inference;

4️⃣ Source-Aware Filtering 🎯: Reliability masking suppresses unreliable scoring dimensions, and per-group retention thresholds preserve source diversity.
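To make step 4 concrete, here is a minimal sketch of how masked scoring and per-group retention could fit together. It is an illustration under assumptions, not the paper's implementation: the record layout, the function names (`masked_score`, `filter_by_group`), the plain mean over reliable dimensions, and the "keep the top fraction per group" rule are all hypothetical choices made for this example.

```python
from collections import defaultdict
from statistics import mean


def masked_score(scores, reliable):
    """Average only the rubric dimensions marked reliable (assumed aggregation)."""
    kept = [value for dim, value in scores.items() if reliable.get(dim, False)]
    return mean(kept) if kept else 0.0


def filter_by_group(records, reliable_by_group, keep_frac_by_group):
    """Keep the top-scoring fraction of records inside each source group.

    records: list of dicts with 'id', 'group', and 'scores' (dimension -> float).
    All field names here are hypothetical.
    """
    by_group = defaultdict(list)
    for rec in records:
        mask = reliable_by_group[rec["group"]]
        by_group[rec["group"]].append((masked_score(rec["scores"], mask), rec["id"]))

    kept_ids = []
    for group, scored in by_group.items():
        # Rank within the group only, so no group can crowd out another.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        n_keep = max(1, round(len(scored) * keep_frac_by_group[group]))
        kept_ids.extend(rec_id for _, rec_id in scored[:n_keep])
    return kept_ids


# Toy data: 'style' is treated as an unreliable dimension for the code group.
records = [
    {"id": "c1", "group": "code", "scores": {"correctness": 0.9, "style": 0.1}},
    {"id": "c2", "group": "code", "scores": {"correctness": 0.4, "style": 0.9}},
    {"id": "m1", "group": "math", "scores": {"rigor": 0.2}},
    {"id": "m2", "group": "math", "scores": {"rigor": 0.8}},
]
reliable = {"code": {"correctness": True, "style": False}, "math": {"rigor": True}}
keep_frac = {"code": 0.5, "math": 0.5}

kept = filter_by_group(records, reliable, keep_frac)
print(kept)
```

In this toy setup, masking `style` changes which code record survives, and ranking inside each group guarantees that both groups stay represented in the retained set.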

📊 In code-oriented mid-training experiments on Qwen2.5-Coder-14B, MIRA-Group uses only 25B tokens — half the full 50B-token corpus 🔥 — yet achieves the best macro average (64.20) across nine benchmarks, outperforming perplexity filtering, DSIR, DataMan, and random selection, while matching the unfiltered full-corpus run.
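For readers unfamiliar with the metric, a macro average is the unweighted mean of per-benchmark scores, so each benchmark counts equally regardless of its size. The sketch below uses placeholder benchmark names and numbers that are purely illustrative and are not results from the paper.

```python
from statistics import mean


def macro_average(scores_by_benchmark):
    """Unweighted mean over benchmarks: every benchmark has equal weight."""
    return mean(scores_by_benchmark.values())


# Placeholder values for illustration only (not from the paper).
placeholder_scores = {"bench_a": 50.0, "bench_b": 70.0, "bench_c": 90.0}
macro = macro_average(placeholder_scores)
print(macro)
```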

🔬 Analysis shows that MIRA's scores are robust to sequence length 📏, its discovered rubrics are source-adaptive while still subsuming generic quality criteria ✅, and its reliability masking effectively identifies and suppresses poorly-calibrated scoring dimensions 🛡️.
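One plausible way to detect poorly calibrated dimensions is to compare student and teacher scores on held-out records and mask any dimension where they disagree. The sketch below assumes Pearson correlation as the agreement measure and a threshold of 0.5; both are assumptions for illustration, as are the helper names `pearson` and `reliability_mask`. The summary above does not specify the criterion the paper actually uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def reliability_mask(teacher, student, min_corr=0.5):
    """Mark a dimension reliable when student and teacher scores agree.

    teacher / student: dimension -> list of scores on the same held-out records.
    min_corr is a hypothetical threshold chosen for this example.
    """
    return {dim: pearson(teacher[dim], student[dim]) >= min_corr for dim in teacher}


# Toy held-out scores: the student tracks the teacher on 'correctness' only.
teacher = {"correctness": [1, 2, 3, 4, 5], "style": [1, 2, 3, 4, 5]}
student = {"correctness": [1.1, 2.0, 2.9, 4.2, 5.0], "style": [3, 1, 4, 1, 3]}

mask = reliability_mask(teacher, student)
print(mask)
```

The same `pearson` helper could be pointed at record lengths versus scores as a rough check for length bias, where a correlation near zero would be consistent with length robustness.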