Best AI papers explained

Interpretable Reward Modeling with Active Concept Bottlenecks



This academic paper introduces Concept Bottleneck Reward Models (CB-RM), a framework designed to make the reward functions used in Reinforcement Learning from Human Feedback (RLHF) interpretable. Unlike traditional reward models, which act as opaque scorers, CB-RM decomposes reward prediction into human-understandable concepts such as helpfulness or correctness. To contain the high cost of concept annotation, the authors propose an active learning (AL) strategy that uses an Expected Information Gain (EIG) acquisition function to select the most informative concept labels to query. Experiments on the UltraFeedback dataset show that this approach significantly improves concept accuracy and sample efficiency without compromising overall preference-prediction accuracy, a step toward more transparent and auditable AI alignment. The authors also caution against potential information leakage when large language models pre-trained on the evaluation datasets are used as annotators or judges.
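The two ideas above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function names, the logistic concept head, the linear reward read-out, and the use of predictive entropy as a stand-in for the EIG acquisition score are all assumptions made for clarity.

```python
import numpy as np

def concept_probs(x, W_c):
    # Hypothetical concept bottleneck: predict k interpretable concept
    # probabilities (e.g. helpfulness, correctness) from features x.
    return 1.0 / (1.0 + np.exp(-(W_c @ x)))

def reward(x, W_c, w_r):
    # The reward is read out only through the concept bottleneck, so each
    # concept's contribution to the final score can be inspected.
    return float(w_r @ concept_probs(x, W_c))

def bernoulli_entropy(p):
    # Entropy of per-concept Bernoulli predictions, in nats.
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def acquisition_scores(pool, W_c):
    # Rank unlabeled (example, concept) pairs by predictive entropy, used
    # here as a simple proxy for expected information gain: query the
    # concept labels the model is currently most uncertain about.
    return [bernoulli_entropy(concept_probs(x, W_c)) for x in pool]
```

In an active-learning loop, one would repeatedly score the unlabeled pool with `acquisition_scores`, request human labels for the highest-scoring (example, concept) pairs, and retrain the bottleneck on the enlarged label set.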


By Enoch H. Kang