Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated 20 days ago • 19
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano v3. • 7 items • Updated 13 days ago • 54
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 8
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published Jun 2, 2025 • 6
open-sci-ref-0.01 Collection Research baseline models trained on various open reference datasets • 12 items • Updated Jul 23, 2025 • 4
Article Releasing Common Corpus: the largest public domain dataset for training LLMs Mar 20, 2024 • 30
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 75
Article Assisted Generation: a new direction toward low-latency text generation May 11, 2023 • 74
Common Models Collection The first generation of models pretrained on Common Corpus. • 5 items • Updated Dec 5, 2024 • 41
Qwen2.5 Collection Qwen2.5 language models, pretrained and instruction-tuned, in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated 6 days ago • 672
GEITje 7B: A Large Open Dutch Language Model Collection All models and datasets relating to GEITje • 8 items • Updated Jan 25, 2025 • 5