
This paper provides a formal theoretical framework for success conditioning, a reinforcement learning heuristic widely used in Decision Transformers and language model alignment. The author proves that this technique is not merely a heuristic but exactly solves a trust-region optimization problem under a chi-squared divergence constraint. A central contribution is the Action-Influence Identity, which shows that the magnitude of policy improvement equals the statistical variability in success rates attributable to the behavior policy's choice of actions. This identity reveals that success conditioning is inherently conservative: it avoids dangerous distribution shift by design, and it fails only by becoming overly cautious when the data carry too little signal. Furthermore, the research explains how return thresholding acts as a proxy that can amplify these improvements, provided the chosen success criterion remains aligned with the true objective. Ultimately, the work bridges the gap between simple supervised fine-tuning on successful outcomes and the rigorous mathematical foundations of policy optimization.
By Enoch H. Kang
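
To make the mechanism concrete, here is a minimal one-step sketch of success conditioning in a toy bandit setting. The behavior policy, the per-action success probabilities, and all numbers are illustrative assumptions, not values from the paper. By Bayes' rule, conditioning the behavior policy on success reweights each action by its success probability; the resulting improvement then matches a normalized form of the Action-Influence Identity described above.

```python
import numpy as np

# Toy one-step illustration of success conditioning (a sketch, not the
# paper's implementation). Assumed setup: three actions, a behavior
# policy beta, and a known success probability for each action.
beta = np.array([0.5, 0.3, 0.2])        # behavior policy pi_beta(a)
p_success = np.array([0.2, 0.6, 0.9])   # P(success | a)

# Overall success rate of the behavior policy.
base_rate = beta @ p_success

# Success conditioning: by Bayes' rule, P(a | success) is proportional
# to pi_beta(a) * P(success | a).
pi_cond = beta * p_success / base_rate

# Success rate of the conditioned policy.
new_rate = pi_cond @ p_success

# Action-Influence Identity (in this toy form): the improvement equals
# the variance of per-action success rates under the behavior policy,
# normalized by the base success rate.
variance = beta @ (p_success - base_rate) ** 2
print(new_rate - base_rate)   # ~0.1704
print(variance / base_rate)   # matches the improvement exactly
```

Both printed values come out to roughly 0.1704 here: when the behavior policy's actions differ a lot in their success rates, conditioning on success buys a large improvement, and when they barely differ, it buys almost nothing, which is exactly the conservative behavior described above. The paper's precise statement of the identity may normalize differently, but the variance-driven character of the improvement is the point this sketch illustrates.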