This research paper investigates the mathematical reasoning abilities of Large Language Models (LLMs) and finds that they exhibit significant limitations. The authors introduce a new benchmark, GSM-Symbolic, which generates controlled variations of questions from the existing GSM8K dataset by templating names and numerical values, enabling a more reliable evaluation of LLM performance. Their findings show that accuracy varies noticeably across different instantiations of the same question, degrades as the number of clauses in a question grows, and drops sharply when a single seemingly relevant but ultimately irrelevant clause is added. The authors suggest that LLMs might be performing a form of pattern matching rather than true logical reasoning, highlighting the need for further research into developing more robust and generalizable problem-solving skills in AI models.
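The general approach behind such a benchmark is to convert each grade-school word problem into a symbolic template whose names and numeric values are placeholders, then sample many concrete instances whose answers can be computed programmatically. The sketch below illustrates this idea in Python; the template text, the name pool, and the value ranges are invented for illustration and are not drawn from the released benchmark.

```python
import random

# Hypothetical GSM8K-style template: names and numbers are placeholders.
# Everything here (question wording, value pools, ranges) is illustrative.
TEMPLATE = (
    "{name} has {x} marbles. {name} buys {y} bags of marbles, and each bag "
    "contains {z} marbles. How many marbles does {name} have now?"
)
NAMES = ["Ava", "Liam", "Maya", "Omar"]  # hypothetical value pool


def sample_instance(rng: random.Random) -> dict:
    """Sample one concrete question/answer pair from the symbolic template."""
    x = rng.randint(5, 100)   # initial marble count
    y = rng.randint(2, 10)    # number of bags bought
    z = rng.randint(2, 20)    # marbles per bag
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z)
    # The ground-truth answer follows directly from the template's structure.
    return {"question": question, "answer": x + y * z}


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        inst = sample_instance(rng)
        print(inst["question"])
        print("Answer:", inst["answer"], "\n")
```

Because every sampled instance shares the same underlying reasoning steps while surface details change, comparing a model's accuracy across instances gives a measure of how sensitive it is to superficial variation rather than to the reasoning itself.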