TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior Paper • 2512.20757 • Published 12 days ago • 16
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26, 2025 • 75
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5, 2025 • 59
Concept Lancet: Image Editing with Compositional Representation Transplant Paper • 2504.02828 • Published Apr 3, 2025 • 16
mStyleDistance: Multilingual Style Embeddings and their Evaluation Paper • 2502.15168 • Published Feb 21, 2025 • 3
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation Paper • 2502.14846 • Published Feb 20, 2025 • 14
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4, 2025 • 253
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25, 2024 • 121
ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer Paper • 2308.15459 • Published Aug 29, 2023 • 1
Large Language Models Can Self-Improve At Web Agent Tasks Paper • 2405.20309 • Published May 30, 2024 • 2
TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings Paper • 2406.15586 • Published Jun 21, 2024 • 2
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 99
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16, 2024 • 31