Article: Ulysses Sequence Parallelism: Training with Million-Token Contexts
Article: FlashHead: Accelerating Language Model Inference. Efficient drop-in replacement for the classification head.
Collection: Nemotron-Pre-Training-Datasets. Large-scale pre-training datasets used in the Nemotron family of models (12 items).
Paper: Lost in Backpropagation: The LM Head is a Gradient Bottleneck (2603.10145).
Collection: NVIDIA Nemotron v3. Open, production-ready enterprise models (12 items).
Collection: MixtureVitae study models and datasets. Models and datasets related to MixtureVitae, an open and fully reproducible pretraining dataset built from permissive sources (16 items).
Space: The Synthetic Data Playbook: Generating Trillions of the Finest Tokens. Explore synthetic data experiments in a bookshelf view.
Article: Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens
Collection: 🤏 Smol-Data. Tried and tested mixes for strong pretraining, inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing (14 items).