Recent advances in text-to-video generation have achieved impressive fidelity and complex temporal dynamics in short clips. However, many state-of-the-art video diffusion models operate in a **non-causal** fashion: they generate an entire video in one go with bidirectional attention across time. This means future frames can influence past ones during generation, yielding high quality but precluding real-time use cases where future information isn’t available at inference time. By contrast, **...