Pretraining Data
opencsg/Fineweb-Edu-Chinese-V2.1 • 958M • 59.1k • 65
• 135k • 33
• 3.8B • 7.92k • 108
allenai/dolma3_dolmino_pool • 54.6k • 7
allenai/dolma3_longmino_pool • 31.8k • 12
• 476M • 28.4k • 818
• 4.48B • 82.6k • 762
• 61.6M • 7.13k • 285
• 819M • 53.1k • 12
tokyotech-llm/swallow-code-v2 • 147M • 173k • 33
ByteDance-Seed/Code-Contests-Plus • 49.2k • 24.8k • 60
• 7.09M • 4.44k • 161
nvidia/Nemotron-Pretraining-Code-v2 • 836M • 1.85k • 106
nvidia/Nemotron-Pretraining-Specialized-v1 • 60.7M • 4.04k • 73
nvidia/Nemotron-CC-Math-v1 • 190M • 3.2k • 67
nvidia/Nemotron-Pretraining-SFT-v1 • 299M • 1.7k • 62
• 1.86M • 18.3k • 229
EssentialAI/essential-web-v1.0 • 104k • 218
EssentialAI/eai-taxonomy-stem-w-dclm • 18.9k • 6
EssentialAI/eai-taxonomy-med-w-dclm • 81.2M • 3.23k • 8
EssentialAI/eai-taxonomy-code-w-dclm • 274M • 85k • 9
EssentialAI/eai-taxonomy-math-w-fm • 21.6M • 194 • 5
• 27.9B • 34 • 3
DataMuncher-Labs/UltiMath • 32.9B • 7.37k • 43
HuggingFaceFW/finetranslations • 3.33B • 30.6k • 272
• 69.9k • 98.2k • 382
JetBrains-Research/commit-chronicle • 10.9M • 2.08k • 11
ASSERT-KTH/repairllama-datasets • 460k • 29 • 2
• 16.6k • 77
• 778k • 30
• 3.5M • 152
nick007x/github-code-2025 • 147M • 8.81k • 114
tokyotech-llm/swallow-code • 129M • 11.8k • 63
loubnabnl/github-code-duplicate • 115M • 322 • 1
macrocosm-os/code-parrot-github-code • 115M • 356 • 13
nyuuzyou/google-code-archive • 65.8M • 4.76k • 72
ad6398/Deepmind-CodeContest-Unrolled • 13.2M • 47 • 2
datablations/python-megatron • 11.3k • 1
nomic-ai/cornstack-python-v1 • 23.6M • 2.71k • 21
• 22.6M • 4.78k
• 552k • 203
utter-project/github-code-2025-above-2-stars • 103M • 3.21k • 3
tokyotech-llm/swallow-math-v2 • 17.4M • 4.91k • 27
• 181M • 81.9k • 257
• 513k • 42 • 6
• 97.7k
• 91.6k • 87
CodedotAI/code_clippy_github • 2.4M • 699k • 20
Lichess/standard-chess-games • 7.14B • 444k • 63
jablonkagroup/chempile-mlift • 51.5M • 12.6k • 13
• 49.8M • 13
jablonkagroup/chempile-code • 2.27M • 1.48k • 4
jablonkagroup/chempile-paper • 11.7M • 404 • 6
jablonkagroup/chempile-education • 66.9k • 602 • 6
• 164k • 19 • 2
institutional/institutional-books-1.0 • 983k • 3.7k • 267
• 34.5k • 68
common-pile/youtube_filtered • 986k • 45 • 5
common-pile/ubuntu_irc_filtered • 216k • 19 • 1
common-pile/pubmed_filtered • 4.77M • 361 • 3
common-pile/github_archive_filtered • 23.3M • 40 • 1
common-pile/biodiversity_heritage_library_filtered • 16.5M • 81 • 1
common-pile/uspto_filtered • 14.4M • 900 • 3
common-pile/usgpo_filtered • 2.34M • 78 • 1
soda-research/emilia-mm-pretrain-fix • 12.6M • 2.16k
fineinstructions/fineinstructions_nemotron • 1.23B • 2.72k • 4