Lip-synced video from just an audio clip and a base video? We've got you!

LatentSync is an advanced lip sync framework that creates natural-looking speech by analyzing audio and generating matching lip movements. It uses audio-conditioned models for accuracy and runs inside ComfyUI, a powerful node-based AI workflow tool.

One of the biggest challenges is maintaining smooth and consistent lip movements across frames for a realistic result, something we will explore today!


Verified to work on ThinkDiffusion Build: Feb 27, 2025

ComfyUI v0.3.18 with LTX v0.9.1 model support.
Why do we specify the build date? ComfyUI and custom node updates released after this date may change the behavior or outputs of this workflow.

We've also written another guide on an alternate lip syncing technique with Automatic1111 and SadTalker:

SadTalker - Talking head videos
SadTalker enables the creation of lifelike talking head videos through a user-friendly platform on ThinkDiffusion.

The Framework

Source: arXiv paper
💡
The LatentSync framework takes a video, uses AI to predict the correct lip movements from the audio input, and refines the result by comparing it against real frames for accuracy.
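To make that flow concrete, here is a rough Python-shaped sketch of the idea. This is illustrative pseudocode only, not LatentSync's actual code: the names (audio_encoder, unet, vae) are stand-ins, and the real framework works on windows of latents with a noise schedule and a sync loss against real frames.

```python
def lip_sync(video_frames, audio, audio_encoder, unet, vae):
    """Illustrative pseudocode of audio-conditioned latent lip sync.

    All objects here are hypothetical stand-ins, not real APIs.
    """
    audio_features = audio_encoder(audio)             # e.g. Whisper-style embeddings
    output_frames = []
    for frame, features in zip(video_frames, audio_features):
        latent = vae.encode(frame)                    # compress frame to latent space
        latent = unet.denoise(latent, cond=features)  # repaint lips to match audio
        output_frames.append(vae.decode(latent))      # decode back to pixels
    return output_frames
```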

The Difference Between LatentSync and LivePortrait

💡
LatentSync focuses solely on achieving precise lip syncing, which completes in just a few minutes. It is ideal for projects where speed is crucial. In contrast, LivePortrait generates full facial expressions, but I find that this process takes significantly longer.

How to run LatentSync in ComfyUI

Installation guide

💡
Download the workflow and drag & drop it into your ComfyUI window, whether locally or on ThinkDiffusion. If you're using ThinkDiffusion, you'll need at least the Turbo 24GB machine, though we recommend the Ultra 48GB machine.

Custom Node

If there are red nodes in the workflow, your installation is missing required custom nodes. Install them so the workflow can run:

  1. Go to ComfyUI Manager > Click Install Missing Custom Nodes
  2. Check the list for any custom nodes that need to be installed and click Install.

Models

For this guide you'll need two models, both auto-downloaded by the custom node. If that fails, you can download them from Hugging Face and upload them manually (see Option 2).

Option 1

On the first run, the workflow will automatically fetch the necessary models. If the automatic download does not succeed, proceed with Option 2.

Option 2

💡
Download the two model files (latentsync_unet.pt and tiny.pt) from Hugging Face and upload them to the ThinkDiffusion directories listed below.

| Model Name | ThinkDiffusion Upload Directory |
| --- | --- |
| latentsync_unet.pt | ...comfyui/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints/ |
| tiny.pt | ...comfyui/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints/whisper/ |
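If you prefer to script the manual download, here is a minimal sketch using huggingface_hub. The ByteDance/LatentSync repo id and file layout are assumptions based on the file names above; check the custom node's README for the authoritative links.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and paths; verify against the custom node's README.
CKPT_DIR = "comfyui/custom_nodes/ComfyUI-LatentSyncWrapper/checkpoints"

hf_hub_download(repo_id="ByteDance/LatentSync",
                filename="latentsync_unet.pt", local_dir=CKPT_DIR)
# "whisper/tiny.pt" keeps its subfolder, landing in checkpoints/whisper/
hf_hub_download(repo_id="ByteDance/LatentSync",
                filename="whisper/tiny.pt", local_dir=CKPT_DIR)
```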

Step-by-step Workflow Guide

This workflow is easy to set up and runs well on its default settings. Here are a few steps where you'll want to pay extra attention.

1. Load a Video

Load a realistic video with a clear, front-facing face, shot at 25 fps, with no more than one face in frame. The resolution should not exceed 1080p. If your clip doesn't match, see the conversion sketch below.

[Image: the Load Video node]
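A minimal re-encode sketch, assuming ffmpeg is installed and on your PATH and a landscape source (the file names are placeholders):

```python
import subprocess

# Re-encode to 25 fps and cap the width at 1920 px (1080p landscape),
# keeping the aspect ratio; -2 rounds the height to an even number.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-r", "25",
    "-vf", "scale='min(1920,iw)':-2",
    "input_25fps.mp4",
], check=True)
```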
2. Load an Audio

Load an audio file with no background music or noise; it should be a clear vocal track. A conversion sketch follows below.

[Image: the Load Audio node]
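A minimal sketch for extracting a clean voice track with ffmpeg (16 kHz mono matches what Whisper-based audio encoders expect; the file names are placeholders):

```python
import subprocess

# Extract a mono 16 kHz WAV voice track; -vn drops any video stream.
subprocess.run([
    "ffmpeg", "-i", "speech_source.mp4",
    "-vn", "-ac", "1", "-ar", "16000",
    "speech_16k.wav",
], check=True)
```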
3. Configure the Settings

The Video Length Adjuster node offers three modes: "normal" passes frames through with padding to prevent loss, ideal for standard lip-syncing; "pingpong" creates a forward-backward loop for back-and-forth animations; and "loop_to_audio" repeats frames to match longer audio durations while maintaining synchronization. The sketch below illustrates the three behaviors.

[Image: the Video Length Adjuster settings]
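An illustrative sketch of how each mode fills a frame list to a target length (behavior paraphrased from the node's description, not taken from its source code):

```python
def adjust_length(frames, target_len, mode="normal"):
    if mode == "normal":
        # pad by repeating the last frame so no frames are lost
        return frames + [frames[-1]] * max(0, target_len - len(frames))
    if mode == "pingpong":
        # play forward then backward, repeating until the target length
        cycle = frames + frames[-2:0:-1]
        return [cycle[i % len(cycle)] for i in range(target_len)]
    if mode == "loop_to_audio":
        # restart the clip from the beginning to cover longer audio
        return [frames[i % len(frames)] for i in range(target_len)]
    raise ValueError(f"unknown mode: {mode}")

print(adjust_length(list("abcd"), 10, "pingpong"))
# ['a', 'b', 'c', 'd', 'c', 'b', 'a', 'b', 'c', 'd']
```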
4. Check the Generated Video

Check the results of your generation. If unsatisfactory, generate again.
[Image: the generated video output]
💡
If you encounter a "RuntimeError: Face not detected", set the frame load cap under 150 in the Load Video node and keep the clip under 1 minute. With those limits in place, the tool should detect faces as intended. A quick preflight sketch follows this tip.
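A hedged preflight sketch using OpenCV's stock face detector (pip install opencv-python; input.mp4 is a placeholder, and the Haar cascade is a rough stand-in for whatever detector LatentSync actually uses):

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
ok, first_frame = cap.read()
cap.release()

# Check that a front-facing face is detectable in the first frame.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = detector.detectMultiScale(
    cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)) if ok else []

print(f"fps={fps:.1f} (want 25), frames={frame_count} (cap under 150)")
print(f"faces detected in first frame: {len(faces)} (want exactly 1)")
```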
💡
If you encounter a "tuple index out of range" error, update both ComfyUI and the LatentSync custom nodes; outdated versions of either are the usual cause.
💡
When the final result has visible flickering or isn't in sync, I simply generate it again; I find this re-roll crucial to hitting the quality bar I want for my videos.

Limitations of LatentSync

While testing this, I noticed a few limitations:

  • It works best with videos showing clear, front-facing views of faces.
  • It doesn't support anime or cartoon faces.
  • The video should be at 25 frames per second.
  • The face should stay visible for the whole video, and videos with more than one face are not supported.

Examples


Man reciting a number countdown.


Woman as Optimus Prime.


Man at a desk initiating a sequence.


If you’re having issues with installation or slow hardware, you can try any of these workflows on a more powerful GPU in your browser with ThinkDiffusion.

If you enjoy ComfyUI and you want to test out creating awesome animations, then feel free to check out some more workflows below. And have fun out there with your videos!

AI Video Speed: How LTX is Reshaping Video2Video as We Know It
AI video workflows and models are everywhere, and it might be hard to select the one for you. If you're reading this, you're already on your path to using one of the top open source models currently available.

Unleashing Creativity: How Hunyuan Redefines Video Generation
Hey there, video enthusiasts! It's a thrill to see how quickly things are changing, especially in the way we create videos. Picture this: with just a few clicks, you can transform your existing clips into fresh, creative videos.

Turning Words Into Action: The Magic of CogVideoX
CogVideoX is perfect for beginners who want to start making amazing videos from ANY still image. This Image2Video and Text2Video model can be used inside ComfyUI, a powerful interface that makes working with image-to-video easy and fun.