Wan2.1 is an open-source AI video generation model built on a diffusion Transformer architecture. For text-to-video (T2V) tasks, it synthesizes high-quality video clips from natural-language prompts, using Wan-VAE, a 3D spatiotemporal variational autoencoder designed to improve spatial-temporal expressiveness and decoding efficiency.
The Wan2.1-T2V variant supports resolutions up to 720p and handles a wide range of styles and scenarios, producing detailed, natural-looking video. Its lightweight model runs smoothly on consumer GPUs with as little as 8 GB of VRAM, balancing accessibility and performance.