


This paper introduces Off-Policy Generative Policy Optimization (OGPO), a reinforcement learning algorithm for efficiently fine-tuning generative control policies (GCPs) on complex robotic tasks. By treating action generation as a denoising MDP nested within the environment's own decision process, the method uses off-policy critics as terminal rewards to optimize the full generative process without expensive backpropagation. This approach combines sample efficiency with the performance of expressive policies, and it outperforms existing techniques such as residual learning and simple policy steering. Enhanced versions, OGPO+ and OGPO+CA, add success-based regularization and conservative advantages to mitigate critic over-exploitation and the performance dips that occur during the transition from offline to online learning. Ultimately, the research demonstrates that OGPO can fine-tune poorly initialized models to near-perfect success rates in contact-rich manipulation environments, even when expert data is unavailable during the online phase.
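To make the "denoising MDP with a critic as terminal reward" idea concrete, here is a minimal toy sketch. It is not the paper's implementation: `toy_critic`, `toy_denoise_step`, and `rollout_denoising_mdp` are hypothetical names, the state and action are scalars, and the critic is a hand-written function standing in for a learned off-policy Q-function. It only illustrates the structure described above, where each denoising step is a transition in an inner MDP and the only nonzero reward is the critic's score of the final action.

```python
import random


def toy_critic(state, action):
    """Hypothetical stand-in for a learned off-policy critic Q(s, a)."""
    return -(action - state) ** 2


def toy_denoise_step(state, noisy_action, rng, noise_scale=0.1):
    """Hypothetical stochastic denoising step.

    Moves the action halfway toward the state and adds Gaussian noise,
    so each step is a sampled transition rather than a deterministic map.
    """
    mean = noisy_action + 0.5 * (state - noisy_action)
    return mean + noise_scale * rng.gauss(0.0, 1.0)


def rollout_denoising_mdp(state, num_steps, rng):
    """Roll out the inner denoising MDP for one environment state.

    Returns a list of (step, action_before, action_after, reward) tuples.
    The reward is zero at every step except the last, where the critic
    scores the fully denoised action.
    """
    action = rng.gauss(0.0, 1.0)  # start from pure noise
    transitions = []
    for k in range(num_steps):
        next_action = toy_denoise_step(state, action, rng)
        is_terminal = k == num_steps - 1
        reward = toy_critic(state, next_action) if is_terminal else 0.0
        transitions.append((k, action, next_action, reward))
        action = next_action
    return transitions


rng = random.Random(0)
transitions = rollout_denoising_mdp(state=1.0, num_steps=5, rng=rng)
final_action = transitions[-1][2]
```

Because the reward arrives as a scalar at the end of the chain, a policy-gradient style update could in principle use these per-step transitions directly, without differentiating the critic through every denoising step. How OGPO actually performs that update is not specified in this summary.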
By Enoch H. Kang