This research paper examines the limitations of large language models (LLMs) in solving grade-school math problems, focusing on their ability to perform multi-step reasoning. The authors introduce a new benchmark called "Compositional GSM," which chains together two simple math problems so that the LLM must use the answer to the first question as an input to the second. They find that most LLMs struggle with this task, exhibiting a significant gap between the compositional accuracy their performance on the individual problems would predict and the accuracy they actually achieve on the composed pairs. This gap is particularly pronounced in smaller, cost-efficient models, and it appears even in models specialized for math problem-solving. The paper also investigates the effects of instruction tuning and fine-tuning on compositional reasoning, finding that while these techniques can improve performance on individual problems, they can also lead to overfitting and reduced generalization. Ultimately, the authors argue that current methods of evaluating LLMs' mathematical reasoning abilities may be overly optimistic, and that more complex, "out-of-distribution" tasks are needed to better understand these models' true capabilities.
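To make the benchmark construction concrete, here is a minimal sketch of how two GSM-style problems could be chained in the way the summary describes, with the answer to the first question referenced as a variable X in the second. The field names and prompt wording are illustrative assumptions, not the paper's exact format:

```python
def compose(q1: dict, q2: dict) -> dict:
    """Chain two GSM-style problems: the answer to Q1 becomes a
    variable X that a modified Q2 refers to.

    Assumes each item has "question" and "answer" fields, and that
    q2["question"] has already had one of its numbers replaced by
    the placeholder X (hypothetical schema, for illustration only).
    """
    prompt = (
        f"Question 1: {q1['question']}\n"
        "Let X be the answer to Question 1.\n"
        f"Question 2: {q2['question']}\n"  # Q2 references X
        "Solve Question 2, showing your reasoning step by step."
    )
    # Grading checks only the final answer to the composed task.
    return {"prompt": prompt, "final_answer": q2["answer"]}

q1 = {"question": "Sam has 3 boxes with 4 apples each. How many apples does he have?",
      "answer": 12}
q2 = {"question": "A crate holds X apples. How many apples are in 5 crates?",
      "answer": 60}
print(compose(q1, q2)["prompt"])
```

Because the model only receives credit if it answers the second question correctly, any error on the first question propagates to the composed task.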
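The gap the summary refers to can be quantified by comparing observed compositional accuracy against a baseline derived from the per-question accuracies. The sketch below assumes errors on the two sub-problems are independent, so the expected compositional accuracy is the product of the individual accuracies; this baseline and the function are my own framing, not necessarily the paper's exact definition:

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_comp: float) -> float:
    """Difference between observed compositional accuracy and the
    accuracy expected if errors on the two sub-problems were
    independent. Negative values mean the model solves composed
    pairs less often than its per-question accuracies predict.
    """
    expected = acc_q1 * acc_q2  # independence assumption
    return acc_comp - expected

# E.g. a model at 90% on each question type but 70% on composed
# pairs: expected 0.9 * 0.9 = 0.81, so the gap is -0.11.
print(f"{reasoning_gap(0.90, 0.90, 0.70):+.2f}")  # -0.11
```

Under this framing, a model with no compositional weakness would score near zero; the paper's finding is that many models, especially smaller and math-specialized ones, fall well below that baseline.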