Unlock the Power of Lip-Synced Talking Avatars with the Sonic Digital Human Workflow
1. Workflow Overview

This "Sonic Digital Human" workflow generates lip-synced talking avatar videos by combining input images (e.g. portraits) with audio (e.g. speech). Based on Stable Video Diffusion (SVD) framework, it outputs MP4 videos with synchronized facial animations.
2. Core Models
| Model/Component | Function | Source |
|---|---|---|
| svd_xt_1_1 | Base video diffusion model | Download to … |
| Sonic model (unet.pth) | Lip-sync control | Quark/Baidu links in workflow |
| CLIP Vision | Image feature extraction | Built-in |
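Before the first run it helps to confirm the model files are where ComfyUI will look for them. The paths and the `svd_xt_1_1.safetensors` filename below are assumptions based on the usual ComfyUI layout; the Sonic node pack may expect a different subfolder, so adjust them to its documentation.

```python
from pathlib import Path

# Assumed locations following a standard ComfyUI layout; adjust to your install.
COMFY_ROOT = Path("ComfyUI")
EXPECTED_FILES = [
    COMFY_ROOT / "models" / "checkpoints" / "svd_xt_1_1.safetensors",  # SVD base model
    COMFY_ROOT / "models" / "sonic" / "unet.pth",                      # Sonic lip-sync model
]

missing = [p for p in EXPECTED_FILES if not p.exists()]
if missing:
    print("Missing model files:")
    for p in missing:
        print(f"  {p}")
else:
    print("All expected model files are in place.")
```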
3. Key Nodes
| Node | Purpose | Installation |
|---|---|---|
| SONICTLoader | Load Sonic adapter | Install … |
| SONIC_PreData | Fuse audio/image data | Same as above |
| VHS_VideoCombine | Video compositing | ComfyUI-VideoHelperSuite custom node |
| LoadAudio | Audio file loader | Built-in |
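A quick way to verify the custom nodes installed correctly is to query ComfyUI's /object_info endpoint, which returns every registered node class. The host/port assumes a default local server; the node names come from the tables and pipeline below.

```python
import json
import urllib.request

COMFY_HOST = "http://127.0.0.1:8188"  # assumed default local ComfyUI server

# Node classes this workflow relies on (names taken from this article).
REQUIRED_NODES = [
    "SONICTLoader", "SONIC_PreData", "SONICSampler",
    "VHS_VideoCombine", "LoadAudio", "LoadImage",
]

with urllib.request.urlopen(f"{COMFY_HOST}/object_info") as resp:
    registered = json.loads(resp.read())  # dict: node class name -> definition

missing = [name for name in REQUIRED_NODES if name not in registered]
print("missing nodes:", missing or "none")
```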
4. Pipeline Structure
Input Group
- Image: LoadImage (e.g. image.png)
- Audio: LoadAudio (e.g. April28.MP3)
Processing Group
- Data fusion: SONIC_PreData encodes temporal data
- Config: image size 768x768, audio weight = 0.5
Generation Group
- SONICSampler: 25 steps, 25 fps
- Output: 8 fps H.264 video (CRF=19)
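Read together, the three groups form a simple directed graph. The sketch below only restates that flow as plain Python data; the node names come from this article, but the exact wiring of the Sonic loader and the socket names are assumptions that may differ in the actual node pack.

```python
# Conceptual data flow of the Sonic Digital Human workflow.
# Socket/input names are omitted; the wiring of SONICTLoader is approximate.
PIPELINE = {
    "LoadImage":        ["SONIC_PreData"],                  # portrait input
    "LoadAudio":        ["SONIC_PreData"],                  # speech input
    "SONICTLoader":     ["SONIC_PreData", "SONICSampler"],  # Sonic adapter weights
    "SONIC_PreData":    ["SONICSampler"],                   # fused audio/image conditioning
    "SONICSampler":     ["VHS_VideoCombine"],               # 25-step sampling
    "VHS_VideoCombine": [],                                 # writes the final MP4
}

for src, targets in PIPELINE.items():
    for dst in targets:
        print(f"{src} -> {dst}")
```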
5. I/O Specifications
Input Requirements:
- Image: 1139x1151 PNG recommended
- Audio: MP3/WAV with clear speech
Output:
- Video: ComfyUI/output/AnimateDiff_xxxx-audio.mp4
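A short pre-flight check of the inputs can catch format problems before a long render. This sketch uses Pillow, which is an assumption on my part (it is not part of the workflow); the recommended 1139x1151 PNG and MP3/WAV formats come from the list above.

```python
from pathlib import Path
from PIL import Image  # pip install pillow

def preflight(image_path: str, audio_path: str) -> None:
    # Portrait: the workflow recommends a 1139x1151 PNG.
    with Image.open(image_path) as img:
        w, h = img.size
        print(f"image: {w}x{h} {img.format}")
        if (w, h) != (1139, 1151):
            print("note: 1139x1151 PNG is the recommended portrait size")

    # Audio: MP3 or WAV with clear speech.
    if Path(audio_path).suffix.lower() not in {".mp3", ".wav"}:
        print("warning: audio should be MP3 or WAV")

preflight("image.png", "April28.MP3")
```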
6. Critical Notes
Model Setup:
- Download the Sonic model from the provided cloud links
- Verify the svd_xt_1_1 model path
Performance:
- VRAM ≥ 16 GB required
- Reduce FPS to 8 for lower resource usage
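To check whether your GPU meets the 16 GB guideline before queuing a render, a quick PyTorch check (PyTorch is already installed alongside ComfyUI) looks like this:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 16:
        print("warning: below the recommended 16 GB; consider reducing FPS to 8")
else:
    print("no CUDA GPU detected")
```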
Troubleshooting:
- Lips out of sync: check the audio sample rate (44.1 kHz)
- Choppy video: adjust CRF (18-23)
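Both troubleshooting items can also be handled outside ComfyUI with ffmpeg, assuming it is on your PATH: resample the speech track to 44.1 kHz before loading it, and re-encode a choppy result at a different CRF. A minimal sketch:

```python
import subprocess

def resample_audio(src: str, dst: str = "speech_44k.wav") -> None:
    # Force a 44.1 kHz sample rate so the lip-sync stays aligned.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "44100", dst], check=True)

def reencode_video(src: str, dst: str, crf: int = 19) -> None:
    # Re-encode with H.264 at the chosen CRF (lower = higher quality; 18-23 is typical).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), "-c:a", "copy", dst],
        check=True,
    )

resample_audio("April28.MP3")
reencode_video("ComfyUI/output/AnimateDiff_xxxx-audio.mp4", "smoother.mp4", crf=18)
```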