Papers
arxiv:2605.13511

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Published on May 13 · Submitted by Cindy on May 14
Abstract

Many-shot in-context learning for reasoning tasks exhibits different scaling behaviors than non-reasoning tasks, with demonstration ordering and selection significantly impacting performance.

AI-generated summary

In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by these principles, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.
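For context, the similarity-based retrieval baseline that the abstract reports failing on reasoning tasks is typically implemented by embedding the query and all candidate demonstrations, then selecting the top-k by cosine similarity. This is a minimal, hypothetical sketch of that standard baseline (not the paper's CDS method); the function name and toy 2-d embeddings are illustrative assumptions:

```python
import numpy as np

def retrieve_top_k(query_emb, demo_embs, k):
    """Rank candidate demonstrations by cosine similarity to the query.

    query_emb: shape (d,) embedding of the test question.
    demo_embs: shape (n, d) embeddings of candidate demonstrations.
    Returns indices of the k most similar demonstrations, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ q  # cosine similarity of each demo to the query
    return np.argsort(-sims)[:k]

# Toy example with 2-d embeddings (illustrative only).
demos = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
print(retrieve_top_k(query, demos, 2))  # -> [0 2]
```

The paper's point is that ranking by this semantic score does not track whether a demonstration's chain of thought is procedurally compatible with the target problem, which is why such retrieval can fail on reasoning tasks.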

Community

Paper submitter

We believe this work provides a step toward reframing ICL from pattern matching to in-context test-time learning. We propose two principles, showing that reasoning performance relies on demonstrations being both understandable to the model and smoothly sequenced to facilitate conceptual progression.

Really impressive work! The reframing of many-shot CoT-ICL as in-context test-time learning is a compelling perspective, and the finding that similarity-based retrieval actually hurts reasoning tasks is quite eye-opening. CDS is also an elegantly simple yet effective solution.

May I ask if you plan to open-source the code and experimental scripts? Would love to reproduce and build on this.


Thank you for your encouraging feedback! We're really glad the perspective on many-shot CoT-ICL as in-context test-time learning was interesting to you.

To answer your question: yes, we will open-source the code and experimental scripts by next week. I'll add the GitHub link right here once it's ready. Really appreciate your interest in building on our work.



