- Humanity's Last Exam
  Paper • 2501.14249 • Published • 77
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  Paper • 2206.04615 • Published • 5
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
  Paper • 2210.09261 • Published • 1
- BIG-Bench Extra Hard
  Paper • 2502.19187 • Published • 10
Collections
Collections including paper arxiv:2311.07911
- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
  Paper • 2211.04325 • Published • 1
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 25
- On the Opportunities and Risks of Foundation Models
  Paper • 2108.07258 • Published • 2
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
  Paper • 2204.07705 • Published • 2

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 31
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 22
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69

- Holistic Evaluation of Text-To-Image Models
  Paper • 2311.04287 • Published • 15
- MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
  Paper • 2311.07463 • Published • 15
- Trusted Source Alignment in Large Language Models
  Paper • 2311.06697 • Published • 12
- DiLoCo: Distributed Low-Communication Training of Language Models
  Paper • 2311.08105 • Published • 16

- On the Theoretical Limitations of Embedding-Based Retrieval
  Paper • 2508.21038 • Published • 20
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
  Paper • 2507.21509 • Published • 32
- Why Language Models Hallucinate
  Paper • 2509.04664 • Published • 195
- Introduction to Multi-Armed Bandits
  Paper • 1904.07272 • Published

- Instruction-Following Evaluation for Large Language Models
  Paper • 2311.07911 • Published • 22
- HuggingFaceH4/mt_bench_prompts
  Viewer • Updated • 80 • 3.73k • 21
- vectara/hallucination_evaluation_model
  Text Classification • 0.1B • Updated • 165k • 338
- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 243

- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  Paper • 2303.16634 • Published • 3
- miracl/miracl-corpus
  Viewer • Updated • 77.2M • 2.54k • 51
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  Paper • 2306.05685 • Published • 39
- How is ChatGPT's behavior changing over time?
  Paper • 2307.09009 • Published • 24