Unlock Lip-Synced Cartoon Avatar Videos with This AI-Powered Workflow
1. Workflow Overview

This workflow generates lip-synced cartoon avatar videos (e.g., Sonic) at any resolution. It syncs the avatar's mouth movements with the input audio. A ~10 s video takes about 8 minutes to generate on an RTX 4090.
2. Core Models
Model/Plugin | Function | Source/Installation |
---|---|---|
SVD XT 1.1 | Base video generation model | Download |
SONIC UNet | Lip-sync specialized UNet | Load |
VHS Video | Video synthesis plugin | Install via ComfyUI Manager |
3. Key Nodes
Node Name | Function | Installation | Dependencies |
---|---|---|---|
| Load base model | Built-in | SVD XT 1.1 model |
| Load lip-sync UNet | Manual SONIC plugin install | |
`SONIC_PreData` | Preprocess audio/image data | SONIC plugin | CLIP vision encoder |
`VHS_VideoCombine` | Merge video/audio | Install via ComfyUI Manager | FFmpeg required |
4. Workflow Groups
Group 1: Data Loading
Inputs:
Image (e.g., `45b437ee...png`)
Audio (e.g., `10s-aijuxi.wav`)
Outputs: Preprocessed data
Key Nodes: `LoadImage`, `LoadAudio`, `SONIC_PreData`
Group 2: Lip-Sync Generation
Inputs: Preprocessed data + model
Outputs: Frames with mouth movements
Key Node: `SONICSampler` (controls FPS/seed)
Group 3: Video Export
Inputs: Frames + original audio
Outputs: MP4 (H.264 encoded)
Key Node: `VHS_VideoCombine`
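To make the export step concrete, the sketch below builds an FFmpeg command that combines numbered frames with the original audio into an H.264 MP4. It illustrates an equivalent manual step and is not a description of how `VHS_VideoCombine` works internally. The function name, the `frame_%05d.png` naming pattern, and the file paths are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def build_mux_command(frames_dir, audio_path, output_path, fps=25):
    """Build an ffmpeg command that muxes numbered PNG frames with an audio track."""
    # Hypothetical frame naming; adjust to however your frames are actually saved.
    frame_pattern = str(Path(frames_dir) / "frame_%05d.png")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-framerate", str(fps),   # frame rate of the input image sequence
        "-i", frame_pattern,      # video input: numbered PNG frames
        "-i", str(audio_path),    # audio input: the original WAV file
        "-c:v", "libx264",        # encode video as H.264
        "-pix_fmt", "yuv420p",    # pixel format most players can decode
        "-c:a", "aac",            # encode audio for the MP4 container
        "-shortest",              # stop at the end of the shorter stream
        str(output_path),
    ]


if __name__ == "__main__":
    cmd = build_mux_command("frames", "10s-aijuxi.wav", "output.mp4")
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.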
5. Inputs & Outputs
Input Parameters:
Image: 1080x1920 PNG (clear mouth area required)
Audio: 10s WAV file
Frame Rate: Default 25 FPS (adjustable)
Seed: Random or fixed (e.g., `837794266`)
Output: MP4 video (e.g., `output/Sonic/aijuxi_xxxx.mp4`)
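Before queuing a run, it can help to confirm that the inputs match the parameters listed above. The following is a minimal sketch using only the Python standard library: it reads the image dimensions from the PNG header and the clip length from the WAV header. The helper names are illustrative and are not part of the workflow or the SONIC plugin.

```python
import struct
import wave

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)  # 8-byte signature + chunk length + "IHDR" + width + height
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    return struct.unpack(">II", header[16:24])  # two big-endian 32-bit integers


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()
```

For example, `png_size("avatar.png")` would return `(1080, 1920)` for an image of the size listed above, and `wav_duration_seconds("10s-aijuxi.wav")` should come out close to `10.0` for a 10 s clip.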
6. Notes
⚠️ Hardware: NVIDIA GPU (RTX 4090 recommended, ≥16 GB VRAM)
⚠️ Model Prep:
Place `svd_xt_1_1` in `models/checkpoints`
`unet.pth` must be in the SONIC plugin directory
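A quick pre-flight check can catch a misplaced model file before a long run fails. The sketch below assumes the checkpoint filename starts with `svd_xt_1_1` (the extension is not stated here) and that `unet.pth` sits directly inside the SONIC plugin folder. Both directory arguments and the function name are hypothetical and must be adapted to your installation.

```python
from pathlib import Path


def find_missing_models(comfyui_root, sonic_plugin_dir):
    """Return descriptions of required model files that could not be found."""
    missing = []

    checkpoints = Path(comfyui_root) / "models" / "checkpoints"
    # The file extension is an unknown, so accept any file whose name starts with svd_xt_1_1.
    if not (checkpoints.is_dir() and any(checkpoints.glob("svd_xt_1_1*"))):
        missing.append(f"svd_xt_1_1 checkpoint in {checkpoints}")

    # Assumption: unet.pth lives directly in the plugin directory, not a subfolder.
    unet = Path(sonic_plugin_dir) / "unet.pth"
    if not unet.is_file():
        missing.append(f"unet.pth in {sonic_plugin_dir}")

    return missing
```

An empty list means both files were found where this sketch expects them.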
✅ Optimization:
Shorter audio reduces generation time
Set `weight_dtype` to `fp16` in `SONICSampler`
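To get a rough sense of how audio length affects generation time, the sketch below scales the reference figure from the overview (about 8 minutes for a 10 s clip on an RTX 4090). Linear scaling with audio length is an assumption, not a measured result, and the function name is illustrative.

```python
def estimate_generation(audio_seconds, fps=25, ref_audio_seconds=10.0, ref_minutes=8.0):
    """Estimate frame count and generation time for a clip.

    Assumes generation time grows linearly with audio length, anchored to
    the reference point of ~8 minutes for a 10 s clip.
    """
    frames = round(audio_seconds * fps)
    minutes = ref_minutes * audio_seconds / ref_audio_seconds
    return frames, minutes


print(estimate_generation(10))  # (250, 8.0)
print(estimate_generation(5))   # (125, 4.0)
```

Actual timings depend on the GPU, the resolution, and the `weight_dtype` setting, so treat the output as a planning figure only.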