s1: Simple test-time scaling
Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing
Paper:
https://arxiv.org/pdf/2501.19393
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This ca
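The budget forcing idea described above can be illustrated with a small sketch. This is not the authors' implementation: the `END_THINK` delimiter string, the `step` callable (a stand-in for one decoding step of a language model), the one-token-per-step budget accounting, and the toy scripted model are all assumptions made for illustration. The sketch only shows the two controls the abstract names: cutting thinking off once a budget is reached, and appending "Wait" when the model tries to stop early.

```python
from typing import Callable, List

END_THINK = "</think>"  # hypothetical end-of-thinking delimiter (assumption)
WAIT = "Wait"           # string appended to prolong thinking, per the abstract


def budget_forced_thinking(
    step: Callable[[List[str]], str],
    max_tokens: int,
    num_waits: int = 0,
) -> List[str]:
    """Collect a thinking trace under a token budget.

    step(trace) returns the next token given the trace so far; it returns
    END_THINK when the model tries to stop thinking (hypothetical interface).
    """
    trace: List[str] = []
    waits_used = 0
    while len(trace) < max_tokens:
        token = step(trace)
        if token == END_THINK:
            if waits_used < num_waits:
                # Lengthen: suppress the stop and append "Wait" instead.
                trace.append(WAIT)
                waits_used += 1
                continue
            break  # the model is allowed to stop
        trace.append(token)
    # Terminate: close the thinking section, whether the model stopped
    # on its own or the budget ran out.
    trace.append(END_THINK)
    return trace


def make_toy_model(script: List[str]) -> Callable[[List[str]], str]:
    """Toy stand-in for a model: replays a fixed script, then keeps stopping."""
    it = iter(script)

    def step(trace: List[str]) -> str:
        return next(it, END_THINK)

    return step


SCRIPT = ["2+2", "=", "5", END_THINK, "actually", "4", END_THINK]

# Forcing one extra "Wait" lets the toy model continue past its first stop.
extended = budget_forced_thinking(make_toy_model(SCRIPT), max_tokens=100, num_waits=1)

# A budget of two tokens cuts the thinking off early.
truncated = budget_forced_thinking(make_toy_model(SCRIPT), max_tokens=2)
```

In this sketch the appended "Wait" counts toward the budget, which is a design choice of the example rather than something stated in the abstract. With a real model, `step` would wrap the tokenizer and decoding call, and the delimiter would be whatever end-of-thinking marker that model uses.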




