T3-Video: Transform Trained Transformer: Accelerating Native 4K Video Generation by Over 10×

¹Youtu Lab, Tencent   ²Zhejiang University

T3-Video is capable of native UHD-4K video generation after efficient fine-tuning on only the UltraVideo dataset.

A 4K Vision World demo generated by our T3-Video-Wan2.1-T2V-1.3B model, where the prompt for each video is generated by GPT-4o and the videos are ordered by GDP.

Highlights of the T3 Module

1) T3 only modifies the inference logic of full attention, achieving a global receptive field within a single layer.
2) Multi-scale window attention reduces the attention complexity from O((HW)^2) for full attention to O((hw)^2 * N), i.e., linear in the number of windows, where hw denotes the window size and N denotes the number of windows (see the sketch after this list).
3) Plug-and-play via an elegant one-line code replacement, while remaining compatible with existing attention acceleration libraries such as FlashAttention and SageAttention.
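
To make points 2) and 3) concrete, here is a minimal sketch, assuming a (B, heads, T*H*W, D) token layout, of how a full-attention call over flattened video tokens can be restricted to non-overlapping spatial windows using PyTorch's F.scaled_dot_product_attention (whose backend dispatch keeps FlashAttention-style kernels usable). This is an illustration only, not the official T3 module, which further mixes multiple window scales to retain the global receptive field.

    import torch
    import torch.nn.functional as F


    def window_attention(q, k, v, grid, window):
        """Attend within non-overlapping spatial windows instead of over the full frame.

        q, k, v: (B, heads, T*H*W, D) flattened video tokens (layout is an assumption).
        grid:    (T, H, W) latent resolution.
        window:  (h, w) spatial window size; H % h == 0 and W % w == 0 is assumed.
        """
        B, heads, L, D = q.shape
        T, H, W = grid
        h, w = window
        assert L == T * H * W

        def to_windows(x):
            # (B, heads, T*H*W, D) -> (B * T * num_windows, heads, h*w, D)
            x = x.view(B, heads, T, H // h, h, W // w, w, D)
            x = x.permute(0, 2, 3, 5, 1, 4, 6, 7)   # B, T, nH, nW, heads, h, w, D
            return x.reshape(-1, heads, h * w, D)

        def from_windows(x):
            x = x.view(B, T, H // h, W // w, heads, h, w, D)
            x = x.permute(0, 4, 1, 2, 5, 3, 6, 7)   # B, heads, T, nH, h, nW, w, D
            return x.reshape(B, heads, T * H * W, D)

        # Full attention costs O((T*H*W)^2); each window costs only O((h*w)^2), so the
        # total is linear in the number of windows. The inner call still dispatches to
        # FlashAttention/SageAttention-style kernels where available.
        out = F.scaled_dot_product_attention(to_windows(q), to_windows(k), to_windows(v))
        return from_windows(out)

Because only the call site changes, any library that exposes the same attention interface can be substituted for the inner call without touching the trained weights.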


T3-Video is much faster and better


Pseudocode with one-line code replacement
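
The original pseudocode figure is not reproduced here; the following is a hedged sketch of what the one-line replacement could look like inside a generic self-attention block, reusing the window_attention sketch above. The block layout and names are assumptions for illustration, not the actual Wan2.1 or T3-Video code.

    import torch
    import torch.nn.functional as F


    class AttentionBlock(torch.nn.Module):
        # A generic self-attention block; the layout is an assumption, not Wan2.1's code.
        def __init__(self, dim=64, heads=8):
            super().__init__()
            self.heads = heads
            self.qkv = torch.nn.Linear(dim, dim * 3)
            self.proj = torch.nn.Linear(dim, dim)

        def forward(self, x, grid=(4, 16, 16), window=(8, 8)):
            B, L, C = x.shape
            q, k, v = self.qkv(x).view(B, L, 3, self.heads, C // self.heads).permute(2, 0, 3, 1, 4)
            # out = F.scaled_dot_product_attention(q, k, v)   # original full attention
            out = window_attention(q, k, v, grid, window)      # the one-line replacement
            return self.proj(out.transpose(1, 2).reshape(B, L, C))


    # Usage: same input/output shapes as the full-attention version.
    x = torch.randn(1, 4 * 16 * 16, 64)
    y = AttentionBlock()(x)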


Compatible with multiple models

BibTeX


      @article{ultravideo,
        title={UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions},
        author={Xue, Zhucun and Zhang, Jiangning and Hu, Teng and He, Haoyang and Chen, Yinan and Cai, Yuxuan and Wang, Yabiao and Wang, Chengjie and Liu, Yong and Li, Xiangtai and Tao, Dacheng}, 
        journal={arXiv preprint arXiv:2506.13691},
        year={2025}
      }