1) Modifies only the inference logic of Full-Attention, while still achieving a global receptive field within a single layer.
2) Multi-scale window attention reduces the attention computation complexity from O(HW) to O(hwN), where hw denotes the window size and N the number of windows (see the sketch after this list).
3) Plug-and-play via an elegant one-line code replacement, and compatible with existing attention acceleration libraries such as FlashAttention and SageAttention.
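The complexity claim in (2) can be made concrete with a small sketch. The snippet below is a minimal illustration, not the project's implementation; the function name, tensor shapes, and the single fixed window size are all assumptions. It partitions an H x W token grid into non-overlapping h x w windows and runs attention inside each window, so every query attends to hw keys instead of HW.

```python
# Minimal sketch of window-partitioned attention over an H x W token grid.
# Not the official T3-Video code: names, shapes, and the single fixed window
# size are illustrative assumptions.
import torch
import torch.nn.functional as F

def window_attention(q, k, v, H, W, h, w):
    """q, k, v: (B, H*W, D) tokens laid out on an H x W grid.
    Attention is computed independently inside each h x w window,
    so each query attends to h*w keys instead of H*W."""
    B, L, D = q.shape
    assert L == H * W and H % h == 0 and W % w == 0
    N = (H // h) * (W // w)  # number of windows

    def to_windows(x):
        # (B, H*W, D) -> (B*N, h*w, D)
        x = x.view(B, H // h, h, W // w, w, D)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B * N, h * w, D)

    qw, kw, vw = map(to_windows, (q, k, v))
    # Per-window attention: the cost per query scales with the window
    # size h*w rather than the full sequence length H*W.
    out = F.scaled_dot_product_attention(qw, kw, vw)

    # (B*N, h*w, D) -> (B, H*W, D): undo the window partition.
    out = out.view(B, H // h, W // w, h, w, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, L, D)
```

In the actual method the window size is varied (hence "multi-scale") to build up a larger receptive field; a single fixed window is used here only to illustrate the cost reduction.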
T3-Video is much faster and better
Pseudocode for the one-line code replacement
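Below is a hedged sketch of what that replacement could look like inside an attention block. The `window_attention` helper here uses a simplified 1D window partition, and the function names, signatures, and call site are assumptions rather than the project's actual API.

```python
# Hedged sketch of the one-line swap inside a transformer attention block.
# `window_attention` is a simplified 1D stand-in for multi-scale window
# attention; the real function name, signature, and window schedule may differ.
import torch
import torch.nn.functional as F

def window_attention(q, k, v, window=256):
    # q, k, v: (B, heads, L, D); attention runs inside non-overlapping windows.
    B, H, L, D = q.shape
    assert L % window == 0, "sequence length must be divisible by the window size"
    split = lambda x: x.view(B, H, L // window, window, D)
    out = F.scaled_dot_product_attention(split(q), split(k), split(v))
    return out.reshape(B, H, L, D)

class AttentionBlock(torch.nn.Module):
    def forward(self, q, k, v):
        # Before: global Full-Attention over the whole token sequence.
        # out = F.scaled_dot_product_attention(q, k, v)
        # After: the one-line replacement with window attention; each per-window
        # call can still be served by FlashAttention / SageAttention kernels.
        out = window_attention(q, k, v)
        return out
```

As a quick sanity check, calling the block with `q = k = v = torch.randn(1, 8, 1024, 64)` returns a tensor of the same shape as the full-attention version.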
Compatible with multiple models
@article{ultravideo,
  title={UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions},
  author={Xue, Zhucun and Zhang, Jiangning and Hu, Teng and He, Haoyang and Chen, Yinan and Cai, Yuxuan and Wang, Yabiao and Wang, Chengjie and Liu, Yong and Li, Xiangtai and Tao, Dacheng},
  journal={arXiv preprint arXiv:2506.13691},
  year={2025}
}