- Core concept: An audio-driven lip-sync model that uses a sparse-frame/keyframe strategy for video dubbing; it maintains identity consistency over long durations and naturally couples head motion, facial expressions, and body pose to the audio. It also supports an “image + audio → talking video” mode that starts from a single reference image, with no upper limit on video length.
- Input/Output: Inputs speech audio (optionally with text/phoneme alignments) and a reference portrait (a video or a single image); outputs a talking-face video whose lip motion closely follows the audio while preserving natural head and expression dynamics and the subject’s identity beyond the lip region alone (see the interface sketch after this list).
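
To make the I/O contract and the sparse-keyframe idea concrete, here is a minimal Python sketch of what an inference interface could look like. Every name in it (`DubbingRequest`, `keyframe_indices`, the fixed 12-frame stride) is a hypothetical stand-in rather than the model's published API, and the fixed-stride schedule merely illustrates the sparse-frame concept; the model's actual keyframe selection strategy is not specified in this summary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical interface sketch: the names and the fixed-stride schedule
# are assumptions for illustration, not the model's actual API.

@dataclass
class DubbingRequest:
    audio_path: str                               # driving speech audio
    reference_path: str                           # portrait video or a single image
    phoneme_alignment_path: Optional[str] = None  # optional text/phoneme timings

def keyframe_indices(num_frames: int, stride: int = 12) -> list[int]:
    """Choose a sparse set of anchor frames.

    The generator would synthesize the in-between frames conditioned on
    the audio and the surrounding keyframes; because each chunk depends
    only on nearby anchors, the output length is effectively unbounded.
    """
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)                # always anchor the final frame
    return idx

if __name__ == "__main__":
    # 10 s of output video at 25 fps -> 250 frames
    req = DubbingRequest(audio_path="speech.wav", reference_path="portrait.png")
    print(f"request: {req}")
    print(f"keyframes: {keyframe_indices(250)}")
```

Running the sketch only prints the anchor schedule; in a real pipeline those indices would mark where full keyframes are generated and identity re-anchored, with audio-conditioned frames filled in between.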