Researchers have developed a scalable method called the
Recursive Feature Machine (RFM) to identify and manipulate the internal knowledge of
artificial intelligence models. By extracting
linear concept representations, this approach allows for
model steering, which adjusts model behavior toward specific concepts such as languages, political stances, or coding proficiency. The study demonstrates that this technique improves
AI safety and performance across various architectures, often surpassing the effectiveness of traditional prompting. Furthermore, these internal features prove highly effective for
monitoring hallucinations and toxic content, outperforming even advanced judge models such as GPT-4o. Ultimately, the findings suggest that
model capabilities can be significantly enhanced by directly engaging with their internal activation spaces rather than relying solely on external text interactions.
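The general mechanics are easier to see in code. The sketch below is a minimal, self-contained illustration of activation-space steering and monitoring, not the authors' RFM procedure: it uses a toy PyTorch model, a simple difference-of-means direction as a stand-in for the learned concept feature, and a linear probe on hidden activations as a stand-in for the hallucination/toxicity monitor. All module names, dimensions, and hyperparameters here are hypothetical.

```python
# Minimal sketch of activation-space steering and monitoring.
# NOTE: this is NOT the RFM algorithm from the paper; the concept direction
# below is a simple difference of class-mean activations, used only to show
# where a learned concept feature would plug in.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a language model: encoder -> hidden block -> output head.
class ToyModel(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.hidden = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, d_hidden))
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = self.hidden(self.encoder(x))
        return self.head(h), h

model = ToyModel()

# Placeholder activations for inputs that do / do not express a concept
# (in practice these would be hidden states collected from real prompts).
with torch.no_grad():
    _, h_concept = model(torch.randn(128, 32) + 0.5)
    _, h_neutral = model(torch.randn(128, 32))

# 1) Extract a linear concept direction: difference of mean activations.
direction = h_concept.mean(0) - h_neutral.mean(0)
direction = direction / direction.norm()

# 2) Steering: add the scaled direction to the hidden activation at inference
#    time via a forward hook, nudging the model's behavior toward the concept.
alpha = 4.0
def steer_hook(module, inputs, output):
    return output + alpha * direction

hook = model.hidden.register_forward_hook(steer_hook)
steered_logits, _ = model(torch.randn(4, 32))
hook.remove()

# 3) Monitoring: a linear probe on hidden activations, scoring how strongly an
#    input expresses the concept (e.g. hallucinated or toxic content).
feats = torch.cat([h_concept, h_neutral])
labels = torch.cat([torch.ones(128), torch.zeros(128)])
probe = nn.Linear(feats.shape[1], 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(feats).squeeze(-1), labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    score = torch.sigmoid(probe(h_concept[:1])).item()
print(f"probe concept score on a concept-bearing example: {score:.2f}")
```

In the study itself, the concept directions come from RFM-learned features rather than class-mean differences, and steering and monitoring are evaluated on real language models; the sketch only shows where such a direction enters the activation space.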
References:
- Beaglehole D, Radhakrishnan A, Boix-Adsera E, et al. Toward universal steering and monitoring of AI models. Science, 2026, 391(6787): 787-792.