T3-Video: Transform Trained Transformer: Accelerating Native 4K Video Generation by Over 10×

¹Youtu Lab, Tencent   ²Zhejiang University

T3-Video is capable of native UHD-4K video generation after efficient fine-tuning on only the UltraVideo dataset.

A 4K Vision World demo generated by our T3-Video-Wan2.1-T2V-1.3B model, where the prompt for each video is generated by GPT-4o and the videos are ordered by GDP.

Highlights of the T3 Module

1) T3 only modifies the inference logic of full attention, achieving a global receptive field within a single layer.
2) Multi-scale window attention reduces the attention complexity from O((HW)^2) for full attention to O((hw)^2 * N), i.e., linear in the number of windows, where hw denotes the window size and N denotes the number of windows (see the sketch after this list).
3) Plug-and-play via an elegant one-line code replacement, while remaining compatible with existing attention acceleration libraries such as FlashAttention and SageAttention.
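
To make points 2) and 3) concrete, here is a minimal sketch, assuming a (B, heads, T*H*W, D) token layout, of how a full-attention call over flattened video tokens can be restricted to non-overlapping spatial windows using PyTorch's F.scaled_dot_product_attention (whose backend dispatch keeps FlashAttention-style kernels usable). This is an illustration only, not the official T3 module, which further mixes multiple window scales to retain the global receptive field.

    import torch
    import torch.nn.functional as F


    def window_attention(q, k, v, grid, window):
        """Attend within non-overlapping spatial windows instead of over the full frame.

        q, k, v: (B, heads, T*H*W, D) flattened video tokens (layout is an assumption).
        grid:    (T, H, W) latent resolution.
        window:  (h, w) spatial window size; H % h == 0 and W % w == 0 is assumed.
        """
        B, heads, L, D = q.shape
        T, H, W = grid
        h, w = window
        assert L == T * H * W

        def to_windows(x):
            # (B, heads, T*H*W, D) -> (B * T * num_windows, heads, h*w, D)
            x = x.view(B, heads, T, H // h, h, W // w, w, D)
            x = x.permute(0, 2, 3, 5, 1, 4, 6, 7)   # B, T, nH, nW, heads, h, w, D
            return x.reshape(-1, heads, h * w, D)

        def from_windows(x):
            x = x.view(B, T, H // h, W // w, heads, h, w, D)
            x = x.permute(0, 4, 1, 2, 5, 3, 6, 7)   # B, heads, T, nH, h, nW, w, D
            return x.reshape(B, heads, T * H * W, D)

        # Full attention costs O((T*H*W)^2); each window costs only O((h*w)^2), so the
        # total is linear in the number of windows. The inner call still dispatches to
        # FlashAttention/SageAttention-style kernels where available.
        out = F.scaled_dot_product_attention(to_windows(q), to_windows(k), to_windows(v))
        return from_windows(out)

Because only the call site changes, any library that exposes the same attention interface can be substituted for the inner call without touching the trained weights.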


T3-Video is much faster and better


Pseudocode with one-line code replacement
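
The original pseudocode figure is not reproduced here; the following is a hedged sketch of what the one-line replacement could look like inside a generic self-attention block, reusing the window_attention sketch above. The block layout and names are assumptions for illustration, not the actual Wan2.1 or T3-Video code.

    import torch
    import torch.nn.functional as F


    class AttentionBlock(torch.nn.Module):
        # A generic self-attention block; the layout is an assumption, not Wan2.1's code.
        def __init__(self, dim=64, heads=8):
            super().__init__()
            self.heads = heads
            self.qkv = torch.nn.Linear(dim, dim * 3)
            self.proj = torch.nn.Linear(dim, dim)

        def forward(self, x, grid=(4, 16, 16), window=(8, 8)):
            B, L, C = x.shape
            q, k, v = self.qkv(x).view(B, L, 3, self.heads, C // self.heads).permute(2, 0, 3, 1, 4)
            # out = F.scaled_dot_product_attention(q, k, v)   # original full attention
            out = window_attention(q, k, v, grid, window)      # the one-line replacement
            return self.proj(out.transpose(1, 2).reshape(B, L, C))


    # Usage: same input/output shapes as the full-attention version.
    x = torch.randn(1, 4 * 16 * 16, 64)
    y = AttentionBlock()(x)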


Compatible with multiple models

BibTeX


      @article{ultravideo,
        title={UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions},
        author={Xue, Zhucun and Zhang, Jiangning and Hu, Teng and He, Haoyang and Chen, Yinan and Cai, Yuxuan and Wang, Yabiao and Wang, Chengjie and Liu, Yong and Li, Xiangtai and Tao, Dacheng}, 
        journal={arXiv preprint arXiv:2506.13691},
        year={2025}
      }