Unlock the Power of Lip-Synced Talking Avatars with Sonic Digital Human Workflow

ComfyUI.org
2025-05-12 10:19:13

1. Workflow Overview

(Workflow demo GIF)

This "Sonic Digital Human" workflow generates lip-synced talking-avatar videos by combining an input image (e.g. a portrait) with an audio clip (e.g. speech). Built on the Stable Video Diffusion (SVD) framework, it outputs MP4 videos with synchronized facial animation.

2. Core Models

| Model/Component | Function | Source |
| --- | --- | --- |
| svd_xt_1_1 | Base video diffusion model | Download to models/checkpoints |
| Sonic model (unet.pth) | Lip-sync control | Quark/Baidu links in workflow |
| CLIP Vision | Image feature extraction | Built-in |
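
Before running the workflow, it is worth confirming the model files are in place. Below is a minimal sketch; the exact filenames (`svd_xt_1_1.safetensors`) and the `models/sonic` directory are assumptions based on the table above, so adjust them to match your install:

```python
from pathlib import Path

def check_models(comfyui_root: str) -> list[str]:
    """Return the expected model files that are missing under a ComfyUI root."""
    root = Path(comfyui_root)
    expected = [
        root / "models" / "checkpoints" / "svd_xt_1_1.safetensors",  # assumed filename
        root / "models" / "sonic" / "unet.pth",                      # assumed location
    ]
    return [str(p) for p in expected if not p.exists()]
```

If the returned list is non-empty, download the missing files from the links in the workflow before queuing a generation.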

3. Key Nodes

| Node | Purpose | Installation |
| --- | --- | --- |
| SONICTLoader | Load Sonic adapter | Install ComfyUI_Sonic |
| SONIC_PreData | Fuse audio/image data | Install ComfyUI_Sonic |
| VHS_VideoCombine | Video compositing | VideoHelperSuite plugin |
| LoadAudio | Audio file loader | Built-in |

4. Pipeline Structure

  1. Input Group

    • Image: LoadImage (e.g. image.png)

    • Audio: LoadAudio (e.g. April28.MP3)

  2. Processing Group

    • Data fusion: SONIC_PreData encodes temporal data

    • Config: Image size 768x768, audio weight=0.5

  3. Generation Group

    • SONICSampler: 25 steps, 25fps

    • Output: 8fps H.264 video (CRF=19)
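
The settings above can be summarized in one place. This is an illustrative sketch, not ComfyUI node parameters: the dict keys are my own names, and `plan_generation` simply shows the arithmetic relating audio length to frame counts (the sampler generates at 25 fps, the combiner writes at 8 fps):

```python
# Illustrative configuration; key names are assumptions, not node inputs.
PIPELINE_CONFIG = {
    "image_size": (768, 768),  # SONIC_PreData resize target
    "audio_weight": 0.5,       # how strongly audio drives the animation
    "steps": 25,               # SONICSampler diffusion steps
    "sample_fps": 25,          # frame rate the sampler generates at
    "output_fps": 8,           # frame rate VHS_VideoCombine writes
    "crf": 19,                 # H.264 quality (lower = higher quality)
}

def plan_generation(audio_seconds: float, cfg: dict = PIPELINE_CONFIG) -> dict:
    """Estimate frame counts for an audio clip of the given length."""
    return {
        "sampled_frames": round(audio_seconds * cfg["sample_fps"]),
        "output_frames": round(audio_seconds * cfg["output_fps"]),
    }
```

For a 10-second clip this plans 250 sampled frames but only 80 written frames, which is why dropping the output FPS to 8 (see the notes below on performance) keeps file sizes and encode times modest.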

5. I/O Specifications

  • Input Requirements:

    • Image: 1139x1151 PNG recommended

    • Audio: MP3/WAV with clear speech

  • Output:

    • Video: ComfyUI/output/AnimateDiff_xxxx-audio.mp4
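
To verify an input image matches the recommended dimensions before loading it, you can read a PNG's size directly from its IHDR chunk with the standard library. This is a generic PNG-parsing sketch, not part of the workflow itself:

```python
import struct

def png_size(path: str) -> tuple[int, int]:
    """Read (width, height) from a PNG file's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(24)  # 8-byte signature + length + "IHDR" + width + height
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

A mismatch is not fatal (SONIC_PreData resizes to 768x768), but starting from a portrait close to the recommended aspect ratio avoids distortion.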

6. Critical Notes

  1. Model Setup:

    • Download Sonic model from provided cloud links

    • Verify svd_xt_1_1 model path

  2. Performance:

    • VRAM ≥16GB required

    • Reduce FPS to 8 for lower resource usage

  3. Troubleshooting:

    • Lips out of sync: check the audio sample rate (44.1 kHz expected)

    • Choppy video: Adjust CRF (18-23)
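
For the sample-rate check, WAV files can be inspected with the standard library's `wave` module (MP3 headers need a third-party parser, so this sketch covers WAV only):

```python
import wave

def check_wav_rate(path: str, expected: int = 44100) -> bool:
    """Return True if the WAV file's sample rate matches the expected rate."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
    if rate != expected:
        print(f"lip-sync risk: {path} is {rate} Hz; resample to {expected} Hz")
        return False
    return True
```

If the rate is wrong, resample the audio (e.g. with ffmpeg's `-ar 44100`) before loading it with LoadAudio.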
