Unlock Pro-Level Lip-Sync Videos: A Step-by-Step Workflow

ComfyUI.org
2025-04-30 09:15:09

1. Workflow Overview

This is a professional lip-sync video generation workflow that combines the WanVideo model with the FantasyTalking adapter. Key features:

  • Audio-driven precise lip synchronization (using wav2vec2 ASR)

  • Multimodal conditioning (text+image+audio)

  • Dual output formats (MP4 + GIF)

Core Models:

  • Wan2_1-I2V-14B-720P_fp8: 14B-parameter video model

  • fantasytalking_fp16.safetensors: Lip-sync adapter

  • facebook/wav2vec2-base-960h: Audio feature extractor
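
To get a feel for how much of the download is the video model itself, the weight size can be estimated from the parameter count and the bytes per parameter. This is a minimal back-of-envelope sketch: the helper name weight_size_gb is made up for illustration, and the estimate assumes every weight is stored at the stated precision, which real checkpoints may not do.

```python
# Back-of-envelope weight sizes. Ignores activations, caches, and file
# overhead, and assumes every weight uses the stated precision.
BYTES_PER_PARAM = {"fp8": 1, "fp16": 2, "fp32": 4}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp8", "fp16"):
        size = weight_size_gb(14e9, dtype)
        print(f"14B parameters at {dtype}: ~{size:.0f} GB")
```

Under these assumptions the 14B model is roughly 14 GB at FP8, about half of what the same weights would take at FP16.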

2. Node Breakdown

Critical Nodes:

  1. FantasyTalkingWav2VecEmbeds

    • Converts audio to lip movement parameters

    • Key params: 81 frames, audio_cfg_scale=23

  2. WanVideoSampler

    • Advanced sampler with UniPC scheduler

    • Config: 30 steps, CFG=5

  3. WanVideoImageToVideoEncode

    • Temporal image encoder

    • Default res: 832x480 (roughly 16:9)

  4. VHS_VideoCombine

    • Requires VideoHelperSuite

    • Outputs both H.264 MP4 and GIF
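
The default resolution listed above is close to 16:9 but not exactly 16:9. A small sketch makes the gap concrete; the function names are illustrative only and not part of any ComfyUI node.

```python
from fractions import Fraction

WIDESCREEN = Fraction(16, 9)


def aspect_ratio(width: int, height: int) -> Fraction:
    """Exact aspect ratio as a reduced fraction."""
    return Fraction(width, height)


def relative_gap(width: int, height: int, target: Fraction = WIDESCREEN) -> float:
    """Relative difference between the frame's aspect ratio and the target."""
    return float(abs(aspect_ratio(width, height) - target) / target)


if __name__ == "__main__":
    ratio = aspect_ratio(832, 480)
    gap = relative_gap(832, 480)
    print(f"832x480 reduces to {ratio.numerator}:{ratio.denominator}")
    print(f"Difference from 16:9: {gap:.1%}")
```

The frame reduces to 26:15, which is 2.5% narrower than true 16:9. That is worth knowing if the clip will be composited into a strict 16:9 timeline.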

Dependencies:

  • Must install ComfyUI-WanVideoWrapper (plus VideoHelperSuite for VHS_VideoCombine and KJNodes for the resize step)

  • ~35GB model downloads required
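
Before loading the workflow, it can help to confirm that the custom nodes are present and that the disk has room for the downloads. The sketch below is a hedged example: the folder names under custom_nodes are assumed to match the repository names (ComfyUI-VideoHelperSuite and ComfyUI-KJNodes in particular are assumptions), and both helper functions are hypothetical.

```python
import shutil
from pathlib import Path

# Assumed folder names; adjust them if your installs use different names.
REQUIRED_NODE_DIRS = [
    "ComfyUI-WanVideoWrapper",
    "ComfyUI-VideoHelperSuite",
    "ComfyUI-KJNodes",
]
REQUIRED_FREE_GB = 35  # approximate download size quoted above


def missing_custom_nodes(comfy_root: Path) -> list[str]:
    """Return the required custom node folders that are not installed."""
    custom_nodes = comfy_root / "custom_nodes"
    return [name for name in REQUIRED_NODE_DIRS if not (custom_nodes / name).is_dir()]


def has_enough_disk(path: Path, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """Check whether the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_gb * 1e9


if __name__ == "__main__":
    root = Path("ComfyUI")  # assumed install location
    if root.is_dir():
        print("Missing nodes:", missing_custom_nodes(root) or "none")
        print("Enough disk space:", has_enough_disk(root))
    else:
        print(f"{root} not found; point `root` at your ComfyUI folder")
```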

3. Workflow Structure

Processing Stages:

  1. Input Preparation

    • Load character image (512x768) → resize via KJNodes

    • Load WAV audio

  2. Feature Extraction

    • CLIP vision encoding (vit_h)

    • T5 text encoding (Chinese umt5-xxl)

    • wav2vec2 audio processing

  3. Video Generation

    • TeaCache for VRAM optimization

    • FP8 mixed precision acceleration

  4. Output

    • 23fps MP4 + looped GIF
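
The frame count and the output frame rate together determine how long the generated clip is. A minimal sketch of that arithmetic, assuming all 81 frames are written out at the 23fps output rate:

```python
NUM_FRAMES = 81  # frame count from the FantasyTalkingWav2VecEmbeds settings
OUTPUT_FPS = 23  # frame rate of the MP4 output


def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Length of a clip in seconds, given its frame count and frame rate."""
    return num_frames / fps


if __name__ == "__main__":
    seconds = clip_duration_seconds(NUM_FRAMES, OUTPUT_FPS)
    print(f"{NUM_FRAMES} frames at {OUTPUT_FPS} fps = {seconds:.2f} s")
```

With these settings one pass yields about 3.5 seconds of video, so a longer audio file will not be covered in full unless the frame count is raised.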

4. I/O Specification

Inputs:

  • Source image: ComfyUI_temp_nupri_00001_.png

  • Audio: [jok老师]说得好像您带我以来我考好过几次一样.wav

  • Prompt:

    Positive: "A woman talking to camera"  
    Negative: "Overexposed, static, blurry details..."  
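
Since the audio drives the lip movement, it is useful to know how many video frames a given WAV file would need. The sketch below uses Python's standard wave module, which only reads uncompressed PCM WAV files. The function names are illustrative, and the 23fps default is taken from the output settings in this article.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def video_frames_needed(path: str, fps: int = 23) -> int:
    """Number of video frames needed to cover the whole audio file."""
    return math.ceil(wav_duration_seconds(path) * fps)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        wav_path = sys.argv[1]
        print(f"Audio length: {wav_duration_seconds(wav_path):.2f} s")
        print(f"Frames needed at 23 fps: {video_frames_needed(wav_path)}")
```

If the reported frame count is well above 81, the audio is longer than a single generated clip and may need trimming.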

Outputs:

  • MP4: WanVideoWrapper_I2V_FantasyTalking_[timestamp].mp4

  • GIF: same prefix, with a .gif extension
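
After a few runs the output folder fills up, and matching each MP4 with its GIF by hand gets tedious. The following is a rough sketch that pairs them by file name. It assumes both files of a run share the exact same name apart from the extension, which may not hold for every VHS_VideoCombine setup, and paired_outputs is a hypothetical helper.

```python
from pathlib import Path

PREFIX = "WanVideoWrapper_I2V_FantasyTalking"


def paired_outputs(output_dir: Path, prefix: str = PREFIX) -> list[tuple[Path, Path]]:
    """Return (mp4, gif) pairs whose names differ only by extension."""
    pairs = []
    for mp4 in sorted(output_dir.glob(f"{prefix}*.mp4")):
        gif = mp4.with_suffix(".gif")
        if gif.exists():
            pairs.append((mp4, gif))
    return pairs


if __name__ == "__main__":
    out_dir = Path("ComfyUI/output")  # assumed output location
    if out_dir.is_dir():
        for mp4, gif in paired_outputs(out_dir):
            print(mp4.name, "<->", gif.name)
```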

5. Critical Notes

  1. Hardware Requirements

    • Min VRAM: 12GB (FP16 mode)

    • Recommended: RTX 3090/4090

  2. Troubleshooting

    • For CUDA OOM: reduce block_size in WanVideoTorchCompileSettings (currently 128)

    • Lip sync issues: Adjust audio_cfg_scale

  3. Model Paths

    • Wan models: ComfyUI/models/wanvideo/

    • Audio models auto-download to: ComfyUI/models/wav2vec2/
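
The hardware and path notes above can be turned into a quick preflight check. This is a sketch under stated assumptions: it reads total VRAM through the nvidia-smi command line tool (so it only works with an NVIDIA driver installed), applies a 5% tolerance because drivers often report slightly less than the nominal capacity, and lists .safetensors files in the model folder. All function names are hypothetical.

```python
import subprocess
from pathlib import Path

MIN_VRAM_GB = 12  # minimum quoted in the hardware requirements


def parse_total_vram_mib(nvidia_smi_csv: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_csv.splitlines() if line.strip()]


def query_total_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_total_vram_mib(result.stdout)


def meets_minimum(total_mib: int, min_gb: float = MIN_VRAM_GB) -> bool:
    """True if the card is within 5% of the minimum or above it."""
    return total_mib >= 0.95 * min_gb * 1024


def list_safetensors(model_dir: Path) -> dict[str, float]:
    """Map each .safetensors file name in `model_dir` to its size in GB."""
    return {p.name: p.stat().st_size / 1e9 for p in sorted(model_dir.glob("*.safetensors"))}


if __name__ == "__main__":
    try:
        for index, mib in enumerate(query_total_vram_mib()):
            print(f"GPU {index}: {mib} MiB, meets minimum: {meets_minimum(mib)}")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
    wan_dir = Path("ComfyUI/models/wanvideo")
    if wan_dir.is_dir():
        for name, size_gb in list_safetensors(wan_dir).items():
            print(f"{name}: {size_gb:.1f} GB")
```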