This research paper explores whether large language models (LLMs) can be equipped with the ability to "think" before responding to user instructions. The authors propose Thought Preference Optimization (TPO), a training method that prompts LLMs to generate internal thoughts and then uses iterative preference optimization to improve both the thoughts and the responses they lead to. TPO requires no labeled thought data: a judge model scores only the final responses, so the thoughts are optimized indirectly through the quality of the answers they produce. The researchers show that TPO outperforms a comparable model trained to respond directly, without thoughts, on general instruction-following benchmarks, with gains extending to non-reasoning categories such as language translation and marketing, suggesting that Thinking LLMs have broad utility beyond traditional reasoning tasks.
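
To make the training loop concrete, the sketch below shows what one TPO-style iteration could look like under stated assumptions: `generate`, `judge_score`, and `dpo_update` are hypothetical stand-ins for the model sampler, the response-only judge, and the preference-optimization step, so this is an illustrative sketch rather than the paper's actual implementation.

```python
"""Minimal sketch of one TPO-style iteration (hypothetical helpers, not the paper's code)."""
import random

THOUGHT_PROMPT = (
    "Respond to the user. First write your internal thoughts, "
    "then write 'Response:' followed by the final answer.\n\nUser: {instruction}"
)

def generate(model, prompt, n_samples=4):
    # Hypothetical: sample several thought+response completions from the model.
    return [f"(thought+response sample {i} for) {prompt}" for i in range(n_samples)]

def split_response(completion):
    # The thought precedes the 'Response:' marker; only the response part is judged.
    _, _, response = completion.partition("Response:")
    return response.strip() or completion

def judge_score(instruction, response):
    # Hypothetical judge model that scores the response alone, never the thought.
    return random.random()

def dpo_update(model, instruction, chosen, rejected):
    # Hypothetical preference-optimization step (e.g. a DPO-style loss) that pushes
    # the model toward the chosen completion and away from the rejected one.
    pass

def tpo_iteration(model, instructions, n_samples=4):
    for instruction in instructions:
        prompt = THOUGHT_PROMPT.format(instruction=instruction)
        completions = generate(model, prompt, n_samples)
        # Score each candidate by its visible response only, so thoughts are
        # optimized indirectly through the answers they lead to.
        scored = sorted(
            (judge_score(instruction, split_response(c)), c) for c in completions
        )
        worst, best = scored[0][1], scored[-1][1]
        # Build a preference pair: best-scoring completion (thought + response)
        # is preferred over the worst-scoring one.
        dpo_update(model, instruction, chosen=best, rejected=worst)

if __name__ == "__main__":
    tpo_iteration(model=None, instructions=["Translate 'hello' into French."])
```

In practice the loop above would be repeated for several rounds, with the model retrained on the accumulated preference pairs after each round; the stub functions would be replaced by an instruction-tuned LLM, an LLM judge, and a preference-optimization trainer.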