
The October 25, 2025 ByteDance paper introduces **Open-o3 Video**, a framework developed by researchers from **Peking University** and **ByteDance** that advances video reasoning by incorporating explicit spatio-temporal evidence. Unlike prior models that generate only textual rationales, Open-o3 Video highlights key **timestamps** and **bounding boxes** to ground its answers in visual observations. To achieve this, the authors curate two new datasets, **STGR-CoT-30k** and **STGR-RL-36k**, and use a two-stage training strategy: supervised fine-tuning followed by **Group Sequence Policy Optimization (GSPO)** with specialized rewards, including adaptive temporal proximity and temporal gating mechanisms. This approach significantly improves performance on the **V-STAR benchmark** and other video understanding tasks, making video reasoning more accurate and verifiable.
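To make the reward design more concrete, here is a minimal Python sketch of a temporal-proximity reward combined with a temporal gate over a spatial IoU term. The Gaussian decay, the `gate_threshold`, and the equal weighting of the two terms are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import math

def temporal_proximity_reward(t_pred: float, t_gt: float, sigma: float = 2.0) -> float:
    """Reward that decays smoothly as the predicted timestamp drifts
    from the annotated one (Gaussian decay is an illustrative choice)."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2 * sigma ** 2))

def iou(box_a, box_b) -> float:
    """Standard IoU between two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(t_pred, t_gt, box_pred, box_gt,
                    sigma: float = 2.0, gate_threshold: float = 0.5) -> float:
    """Temporal gating: the spatial (IoU) term only contributes when the
    predicted timestamp is close enough to the ground truth, so the model
    cannot earn box credit for evidence localized at the wrong moment."""
    r_time = temporal_proximity_reward(t_pred, t_gt, sigma)
    r_space = iou(box_pred, box_gt) if r_time >= gate_threshold else 0.0
    return r_time + r_space
```

In a GSPO-style setup, a scalar like this would be computed per sampled response (alongside answer-correctness and format rewards) and used to score the group of rollouts; the gate keeps spatial credit from leaking into temporally wrong predictions.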
Source:
https://arxiv.org/pdf/2510.20579
By mcgrof