Unlock Lip-Synced Cartoon Avatar Videos with This AI-Powered Workflow

ComfyUI.org
2025-04-02 10:33:01

1. Workflow Overview


This workflow generates lip-synced cartoon avatar videos (e.g., Sonic) at any resolution. It syncs the avatar's mouth movements to the input audio. A ~10 s video takes about 8 minutes to generate on an RTX 4090.


2. Core Models

Model/Plugin | Function                    | Source/Installation
------------ | --------------------------- | ------------------------------
SVD XT 1.1   | Base video generation model | Download svd_xt_1_1 checkpoint
SONIC UNet   | Lip-sync specialized UNet   | Load unet.pth
VHS Video    | Video synthesis plugin      | Install via ComfyUI Manager


3. Key Nodes

Node Name                 | Function                    | Installation                     | Dependencies
------------------------- | --------------------------- | -------------------------------- | ------------------
ImageOnlyCheckpointLoader | Load base model             | Built-in                         | SVD XT 1.1 model
SONICTLoader              | Load lip-sync UNet          | Manual SONIC plugin install      | unet.pth file
SONIC_PreData             | Preprocess audio/image data | SONIC plugin                     | CLIP vision encoder
VHS_VideoCombine          | Merge video/audio           | Install ComfyUI-VideoHelperSuite | FFmpeg required
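
Several of the dependencies above are files or binaries that must exist before the workflow is queued. A minimal pre-flight sketch, assuming you pass in the actual paths yourself (the function name `preflight` and its parameters are hypothetical and not part of ComfyUI or the SONIC plugin):

```python
import shutil
from pathlib import Path


def preflight(checkpoint: Path, sonic_unet: Path, ffmpeg_name: str = "ffmpeg") -> list[str]:
    """Return a list of problems found; an empty list means every check passed.

    The paths are supplied by the caller. This helper is illustrative only
    and does not know where ComfyUI or the SONIC plugin keep their files.
    """
    problems = []
    if not checkpoint.is_file():
        problems.append(f"missing SVD checkpoint: {checkpoint}")
    if not sonic_unet.is_file():
        problems.append(f"missing SONIC UNet: {sonic_unet}")
    # VHS_VideoCombine needs FFmpeg, so check that the binary is on PATH.
    if shutil.which(ffmpeg_name) is None:
        problems.append(f"{ffmpeg_name} not found on PATH")
    return problems
```

Call it with the location of your svd_xt_1_1 checkpoint and of unet.pth, then print whatever it returns. An empty list means the three file and binary dependencies are in place. It does not check the CLIP vision encoder.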


4. Workflow Groups

  • Group 1: Data Loading

    • Inputs:

      • Image (e.g., 45b437ee...png)

      • Audio (e.g., 10s-aijuxi.wav)

    • Outputs: Preprocessed data

    • Key Nodes: LoadImage, LoadAudio, SONIC_PreData

  • Group 2: Lip-Sync Generation

    • Inputs: Preprocessed data + model

    • Outputs: Frames with mouth movements

    • Key Node: SONICSampler (controls FPS/seed)

  • Group 3: Video Export

    • Inputs: Frames + original audio

    • Outputs: MP4 (H.264 encoded)

    • Key Node: VHS_VideoCombine
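
For readers who want to see what Group 3 amounts to outside ComfyUI, here is a rough sketch of an equivalent FFmpeg call that combines numbered frames with the original audio into an H.264 MP4. This is not VHS_VideoCombine's own implementation. The helper name `build_mux_command`, the frame filename pattern, and the choice of AAC for the audio track are assumptions made for illustration.

```python
from pathlib import Path


def build_mux_command(frame_pattern: str, audio: Path, output: Path, fps: int = 25) -> list[str]:
    """Build an ffmpeg argument list that muxes image frames and audio into an MP4.

    frame_pattern is an ffmpeg image-sequence pattern such as "frames/%05d.png"
    (a hypothetical layout; adjust it to wherever your frames are saved).
    """
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-framerate", str(fps),      # input frame rate of the image sequence
        "-i", frame_pattern,         # video input: numbered frames
        "-i", str(audio),            # audio input: the original audio file
        "-c:v", "libx264",           # encode video as H.264
        "-pix_fmt", "yuv420p",       # widely compatible pixel format
        "-c:a", "aac",               # assumption: AAC audio in the MP4
        "-shortest",                 # stop at the end of the shorter stream
        str(output),
    ]
```

Pass the returned list to `subprocess.run(cmd, check=True)` to run the encode. The `yuv420p` pixel format needs even frame dimensions, which 1080x1920 satisfies.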


5. Inputs & Outputs

  • Input Parameters:

    • Image: 1080x1920 PNG (clear mouth area required)

    • Audio: 10s WAV file

    • Frame Rate: Default 25 FPS (adjustable)

    • Seed: Random or fixed (e.g., 837794266)

  • Output: MP4 video (e.g., output/Sonic/aijuxi_xxxx.mp4)
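
The audio length and frame rate together determine how many frames must be generated. The sketch below reads a WAV file's duration with Python's standard `wave` module and converts it to a frame count. Rounding up is an assumption, and the sampler may count frames differently. Both function names are hypothetical.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frame_count(duration_s: float, fps: int = 25) -> int:
    """Number of video frames needed to cover the audio (rounded up, by assumption)."""
    return math.ceil(duration_s * fps)
```

With the defaults listed above, a 10 s WAV at 25 FPS gives `frame_count(10.0)` = 250 frames.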


6. Notes

  • ⚠️ Hardware: NVIDIA GPU (RTX 4090 recommended, ≥16 GB VRAM)

  • ⚠️ Model Prep:

    • Place svd_xt_1_1 in models/checkpoints

    • Place unet.pth in the SONIC plugin directory

  • Optimization:

    • Shorter audio reduces generation time

    • Set weight_dtype to fp16 in SONICSampler
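
To get a feel for how much time shorter audio saves, the overview's figure (about 8 minutes for a ~10 s video on an RTX 4090) can be turned into a rough estimate. The sketch assumes generation time scales linearly with audio length, which the article does not state. Real timings depend on resolution, GPU, and settings.

```python
# Reference point from the workflow overview: ~10 s of video in ~8 minutes on an RTX 4090.
REFERENCE_CLIP_S = 10.0
REFERENCE_GEN_MIN = 8.0


def estimate_generation_minutes(audio_s: float) -> float:
    """Rough generation-time estimate, assuming linear scaling with audio length."""
    return audio_s * REFERENCE_GEN_MIN / REFERENCE_CLIP_S
```

Under that assumption a 5 s clip would take roughly 4 minutes, about 48 seconds of compute per second of output.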