This briefing document summarises the key findings and implications of the research paper "A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks" by Shakya et al. The study investigates the capabilities of two leading Large Language Models (LLMs), ChatGPT o3-mini and DeepSeek-R1, in solving competitive programming problems from Codeforces. The evaluation focuses on solution accuracy, memory efficiency, and runtime performance across easy, medium, and hard difficulty levels.
Study Limitations:
The study acknowledges several limitations:
Single-shot prompting: The lack of follow-up prompts might have limited the refinement of generated outputs, as "LLM-assisted programming requires human intervention to ensure correctness".
Model versions: The study compared ChatGPT o3-mini against DeepSeek-R1 rather than the more programming-focused DeepSeek-Coder, which "could have demonstrated better results than the R1".
Limited task set: The use of only 29 programming tasks might limit the generalisability of the results.
Single programming language: Focusing solely on C++ might limit the applicability of the findings to other languages and coding environments.
Prompt formulation: While a consistent prompt was used for every task, exploring different prompt formulations could yield further insights (a hypothetical sketch of the single-shot prompt style appears after this list).
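To make the single-shot setup concrete, the sketch below shows what such a prompt-construction step might look like. The paper's actual prompt wording is not reproduced in this briefing, so the template text, field names, and the build_prompt helper are all hypothetical illustrations of the approach, not the study's prompt.

```python
# Hypothetical sketch of a single-shot (zero-shot) prompt for one Codeforces
# task; the template wording is an assumption, not the paper's actual prompt.

PROMPT_TEMPLATE = """Solve the following competitive programming problem in C++.
Return only the complete, compilable source code.

Problem statement:
{statement}

Input format:
{input_spec}

Output format:
{output_spec}

Constraints:
{constraints}
"""

def build_prompt(statement: str, input_spec: str,
                 output_spec: str, constraints: str) -> str:
    """Fill the template for a single problem. No follow-up turns are sent,
    so the model gets exactly one chance to produce a correct solution."""
    return PROMPT_TEMPLATE.format(
        statement=statement,
        input_spec=input_spec,
        output_spec=output_spec,
        constraints=constraints,
    )
```

The key property is that the same template is filled mechanically for every task, which keeps the comparison between models consistent but, as the authors note, forgoes the iterative refinement that LLM-assisted programming usually relies on.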
The authors suggest that future research should address these limitations by using more diverse problem sets, exploring multiple programming languages, testing different prompting strategies, and comparing more recent versions of these and other LLMs.
Key Takeaways:
ChatGPT o3-mini demonstrates superior performance in solving medium-difficulty competitive programming tasks compared to DeepSeek-R1 in a zero-shot setting.
Both models struggle significantly with hard programming tasks, indicating the current limitations of LLMs in handling high-complexity problems without further human guidance or advanced prompting techniques.
ChatGPT generally exhibits better runtime performance, while DeepSeek sometimes shows lower memory consumption, though often at the cost of correctness (a sketch of how such metrics can be measured locally follows this list).
The study highlights the ongoing need for human intervention and advanced prompting strategies to effectively utilise LLMs for solving programming tasks, particularly those beyond the easy difficulty level.
Future research should explore the impact of different prompting techniques, model versions (such as DeepSeek-Coder), and a wider range of tasks and programming languages to gain a more comprehensive understanding of LLM capabilities in code generation.
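The runtime and memory figures in the study come from the Codeforces judge, which reports both per submission. For readers who want to approximate similar measurements locally, a minimal sketch follows, assuming a POSIX system and a solution already compiled to ./solution; the harness, the run_solution helper, and the test file name are hypothetical and not part of the paper's methodology.

```python
# Minimal sketch, assuming a POSIX system and a solution compiled to
# ./solution; the study itself used the Codeforces judge, so this is an
# illustrative local approximation, not the paper's measurement setup.
import resource
import subprocess
import time

def run_solution(binary: str, input_path: str) -> tuple[float, int]:
    """Run one test case; return (wall-clock seconds, peak child RSS)."""
    with open(input_path) as stdin:
        start = time.perf_counter()
        # check=True raises CalledProcessError if the solution crashes.
        subprocess.run([binary], stdin=stdin,
                       stdout=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
    # ru_maxrss aggregates finished children; with one child per harness
    # process it reflects that solution's peak (KiB on Linux, bytes on macOS).
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak

if __name__ == "__main__":
    # Hypothetical binary and test file names, for illustration only.
    seconds, peak_rss = run_solution("./solution", "test1.txt")
    print(f"runtime: {seconds:.3f}s, peak memory: {peak_rss} (ru_maxrss units)")
```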