This research paper examines the limitations of current large language models (LLMs) on tasks that require multiple, interwoven skills, which the authors call cross capabilities. They argue that while LLMs can excel at individual capabilities such as reasoning, coding, or image recognition, their performance often falls short when several of these skills must be combined to complete a single task. To study this gap, they introduce CrossEval, a benchmark designed to evaluate both individual and cross capabilities: it covers a broad set of prompts, and human annotators rate model responses to each. Across models, the study finds a consistent “Law of the Weakest Link” effect: an LLM’s performance on a cross-capability task is largely capped by its weakest constituent capability, so strength in one area cannot compensate for weakness in another. This finding suggests that future research should prioritize strengthening LLMs’ weakest capabilities, since these bottleneck performance on the multi-skill tasks common in real-world applications.
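
To make the “Law of the Weakest Link” concrete, the sketch below compares hypothetical cross-capability scores against the weaker of the two component scores. All numbers here are illustrative assumptions, not results from the paper, and the min-based comparison is only a simplified reading of the effect; the paper’s actual evaluation relies on human and model-based ratings over the CrossEval prompts.

```python
# Minimal sketch of the "Law of the Weakest Link" effect described above.
# All scores below are hypothetical, chosen only to illustrate the pattern.

individual_scores = {
    "reasoning": 82.0,
    "coding": 75.0,
    "image_recognition": 61.0,
}

# Hypothetical measured scores on tasks that combine two capabilities.
cross_scores = {
    ("reasoning", "coding"): 74.0,
    ("coding", "image_recognition"): 60.0,
}

for (cap_a, cap_b), observed in cross_scores.items():
    # Under the weakest-link hypothesis, the cross-capability score
    # should track the weaker of the two individual scores.
    weakest = min(individual_scores[cap_a], individual_scores[cap_b])
    print(f"{cap_a} x {cap_b}: observed={observed:.1f}, "
          f"weakest individual={weakest:.1f}, gap={observed - weakest:+.1f}")
```

In this toy setup, each cross-capability score sits at or just below the weaker component score, which is the qualitative pattern the paper reports: improving the stronger capability would not move the combined score, while lifting the weaker one would.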