
Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep making progress together with AI.
Today's topic: CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Summary
The "CodeMonkeys" paper introduces a system that improves large language model (LLM) performance on software engineering tasks by scaling test-time compute. This scaling is achieved by iteratively generating and testing code edits, both serially (more iterations per attempt) and in parallel (multiple attempts simultaneously). The system identifies relevant code context, generates candidate edits with accompanying tests, and selects the best edit through voting and a dedicated selection process. By amortizing the cost of context identification and using a combination of test-based voting and model-based selection, CodeMonkeys achieves competitive results on the SWE-bench Verified dataset. The paper also explores combining edits from multiple sources, demonstrating the effectiveness of their selection method in heterogeneous ensembles. Furthermore, an exploration of DeepSeek-V3 as a cheaper alternative to Claude Sonnet 3.5 is analyzed for potential benefits.
Original paper: https://arxiv.org/abs/2501.14723