Highlights
Soul: A multimodal-driven framework that generates semantically coherent, high-fidelity (1080P), long-duration (minute-level), and real-world (walking) digital human animation from 1) a single-frame human or anthropomorphic image, 2) a text prompt, and 3) audio, achieving i) accurate lip synchronization, ii) vivid facial expressions, and iii) stable identity preservation.
Soul-1M: A million-scale, finely annotated dataset covering 1) human portraits, 2) upper bodies, 3) full bodies, and 4) multi-person scenarios, built with an automated annotation pipeline to alleviate data scarcity.
Soul-Bench: A 1) comprehensive and 2) fair evaluation system for audio/text-guided digital human animation methods.
High-efficiency: i) Native 1080P generation, ii) step/CFG distillation, and iii) efficient VAE decoder.
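
For readers unfamiliar with the term, the sketch below illustrates what classifier-free guidance (CFG) costs at inference time, which is the overhead that CFG distillation generally aims to remove. It is a generic illustration, not Soul's implementation: the `Denoiser` signature, `cfg_predict`, `toy_model`, and the guidance scale are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (noisy latent, optional condition) -> prediction.
Denoiser = Callable[[Vector, Optional[str]], Vector]


def cfg_predict(model: Denoiser, x: Vector, cond: str, scale: float) -> Vector:
    """Standard classifier-free guidance: two forward passes per sampling step."""
    uncond = model(x, None)       # pass 1: condition dropped
    conditioned = model(x, cond)  # pass 2: condition applied
    # Extrapolate from the unconditional prediction toward the conditional one.
    return [u + scale * (c - u) for u, c in zip(uncond, conditioned)]


def toy_model(x: Vector, cond: Optional[str]) -> Vector:
    """Toy stand-in for a real network: shifts the input when a condition is given."""
    shift = 0.0 if cond is None else 1.0
    return [v + shift for v in x]


calls = 0


def counting_model(x: Vector, cond: Optional[str]) -> Vector:
    """Wraps toy_model to count how many forward passes are made."""
    global calls
    calls += 1
    return toy_model(x, cond)


guided = cfg_predict(counting_model, [0.0, 2.0], "a person walking", 3.0)
print(guided, calls)
```

In general, a CFG-distilled model is trained to reproduce the guided output in a single forward pass, and step distillation reduces the number of sampling steps, so the two savings multiply. How Soul realizes either technique is not specified in this list.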