Designed for image-referenced consistency generation: the model extracts a subject's identity and appearance from 1–4 reference images and, conditioned on a text prompt, generates a temporally coherent video in which that identity and appearance (facial features, clothing, etc.) remain stable across frames. It emphasizes identity stability and supports multi-subject consistency.
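The input contract above (1–4 reference images plus a text prompt) can be sketched as a small request structure. This is a minimal illustration only; `SubjectVideoRequest` and its fields are hypothetical names, not the model's actual API.

```python
from dataclasses import dataclass

@dataclass
class SubjectVideoRequest:
    # Hypothetical request shape: reference images supply the subject's
    # identity/appearance; the prompt describes the video to generate.
    reference_images: list  # paths or image handles, 1-4 items
    prompt: str

    def __post_init__(self):
        # Enforce the 1-4 reference-image constraint from the description.
        if not 1 <= len(self.reference_images) <= 4:
            raise ValueError(
                f"expected 1-4 reference images, got {len(self.reference_images)}"
            )

# Usage: two reference views of one subject plus a text prompt.
req = SubjectVideoRequest(
    reference_images=["subject_front.png", "subject_side.png"],
    prompt="the subject walking through a rainy street at night",
)
```

A real client would then pass such a request to the model's inference endpoint; the validation simply makes the 1–4 image constraint explicit.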