1) Modifies only the inference logic of full attention, achieving a global receptive field within a single layer.
2) Multi-scale window attention reduces the attention computation complexity from O(HW) to O(hwN), where hw denotes the window size and N denotes the number of windows (see the sketch after this list).
3) Plug-and-play via an elegant one-line code replacement, and compatible with existing attention acceleration libraries such as FlashAttention and SageAttention.
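Below is a minimal sketch of the window-partition idea behind item 2, assuming a PyTorch-style attention layer over an H×W token grid. The function name `window_attention`, the tensor shapes, and the single fixed window size are illustrative assumptions (the actual method is multi-scale), not the T3-Video implementation.

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, H, W, window=(8, 8)):
    """q, k, v: (B, H*W, C). Attention is computed independently inside each window."""
    B, L, C = q.shape
    wh, ww = window
    assert L == H * W and H % wh == 0 and W % ww == 0

    def to_windows(x):
        # (B, H*W, C) -> (B * num_windows, wh*ww, C)
        x = x.view(B, H // wh, wh, W // ww, ww, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, wh * ww, C)

    qw, kw, vw = map(to_windows, (q, k, v))
    # Each query attends to the hw tokens of its own window rather than all H*W tokens,
    # so the cost scales with the window size times the number of windows.
    out = F.scaled_dot_product_attention(qw, kw, vw)

    # Undo the window partition: (B * num_windows, wh*ww, C) -> (B, H*W, C)
    out = out.view(B, H // wh, W // ww, wh, ww, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H * W, C)
```

Because the inner call is still a standard scaled-dot-product attention, accelerated kernels such as FlashAttention or SageAttention can be dropped in for the per-window computation (item 3).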
T3-Video is much faster (over 10× for naive 4K video generation) and produces better results
Pseudocode for the one-line code replacement:
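The original pseudocode figure is not reproduced here; the sketch below shows what such a one-line replacement could look like inside a pretrained transformer block at inference time, reusing the hypothetical `window_attention` helper from above. It is an illustration under those assumptions, not the actual T3-Video API.

```python
def attention_forward(q, k, v, H, W):
    # Original inference path: full attention over all H*W tokens.
    # return F.scaled_dot_product_attention(q, k, v)

    # One-line replacement: window attention at inference time; the trained
    # weights are left untouched, only the attention call changes.
    return window_attention(q, k, v, H, W, window=(8, 8))
```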
Compatible with multiple models
@misc{t3video,
      title={Transform Trained Transformer: Accelerating Naive 4K Video Generation Over 10$\times$},
      author={Jiangning Zhang and Junwei Zhu and Teng Hu and Yabiao Wang and Donghao Luo and Weijian Cao and Zhenye Gan and Xiaobin Hu and Zhucun Xue and Chengjie Wang},
      year={2025},
      eprint={2512.13492},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13492},
}