Unlock Pro-Level Lip-Sync Videos: A Step-by-Step Workflow
1. Workflow Overview

This is a professional lip-sync video generation workflow that combines the WanVideo model with the FantasyTalking adapter. Key features:
Audio-driven precise lip synchronization (using wav2vec2 ASR)
Multimodal conditioning (text+image+audio)
Dual output formats (MP4 + GIF)
Core Models:
Wan2_1-I2V-14B-720P_fp8: 14B-parameter video model
fantasytalking_fp16.safetensors: Lip-sync adapter
facebook/wav2vec2-base-960h: Audio feature extractor
2. Node Breakdown
Critical Nodes:
FantasyTalkingWav2VecEmbeds
Converts audio to lip movement parameters
Key params: 81 frames, audio_cfg_scale=23
WanVideoSampler
Advanced sampler with UniPC scheduler
Config: 30 steps, CFG=5
WanVideoImageToVideoEncode
Temporal image encoder
Default res: 832x480 (approximately 16:9)
VHS_VideoCombine
Requires VideoHelperSuite
Outputs both H.264 MP4 and GIF
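For reference, the node settings listed above can be collected in one plain Python record. This is only a sketch: the field names are hypothetical labels chosen for this article, not the actual widget names in the ComfyUI nodes, and the values are simply the ones quoted above.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class LipSyncSettings:
    """Values quoted in the node list; field names are illustrative only."""

    num_frames: int = 81         # FantasyTalkingWav2VecEmbeds
    audio_cfg_scale: float = 23  # FantasyTalkingWav2VecEmbeds
    steps: int = 30              # WanVideoSampler
    cfg: float = 5               # WanVideoSampler
    width: int = 832             # WanVideoImageToVideoEncode
    height: int = 480            # WanVideoImageToVideoEncode

    def aspect_ratio(self) -> Fraction:
        # Fraction reduces width:height to lowest terms automatically.
        return Fraction(self.width, self.height)


settings = LipSyncSettings()
ratio = settings.aspect_ratio()
print(f"{settings.width}x{settings.height} reduces to "
      f"{ratio.numerator}:{ratio.denominator}")
```

The default resolution reduces to 26:15 (about 1.73), which is close to but slightly narrower than 16:9 (about 1.78).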
Dependencies:
Must install ComfyUI-WanVideoWrapper
~35GB model downloads required
3. Workflow Structure
Processing Stages:
Input Preparation
Load character image (512x768) → resize via KJNodes
Load WAV audio
Feature Extraction
CLIP vision encoding (vit_h)
T5 text encoding (umt5-xxl, multilingual)
wav2vec2 audio processing
Video Generation
TeaCache for VRAM optimization
FP8 mixed precision acceleration
Output
23fps MP4 + looped GIF
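As a quick sanity check on these numbers, 81 frames played back at 23 fps gives a clip of roughly 3.5 seconds, so the driving audio only needs to cover about that much speech. The sketch below does the arithmetic. The 16 kHz sample rate is an assumption based on the rate wav2vec2-base-960h was trained on; whether the workflow resamples audio internally is not covered here.

```python
import math

NUM_FRAMES = 81       # frame count set on FantasyTalkingWav2VecEmbeds
FPS = 23              # output frame rate of the MP4
SAMPLE_RATE = 16_000  # assumption: wav2vec2-base-960h expects 16 kHz audio


def clip_seconds(num_frames: int, fps: float) -> float:
    """Duration of the generated clip in seconds."""
    return num_frames / fps


def audio_samples_needed(num_frames: int, fps: float, sample_rate: int) -> int:
    """Audio samples required to cover every video frame (rounded up)."""
    return math.ceil(num_frames * sample_rate / fps)


print(f"clip length: {clip_seconds(NUM_FRAMES, FPS):.2f} s")
print(f"samples needed: {audio_samples_needed(NUM_FRAMES, FPS, SAMPLE_RATE)}")
```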
4. I/O Specification
Inputs:
Source image: ComfyUI_temp_nupri_00001_.png
Audio: [jok老师]说得好像您带我以来我考好过几次一样.wav
Prompt:
Positive: "A woman talking to camera"
Negative: "Overexposed, static, blurry details..."
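Before loading an audio file into the workflow, it can help to confirm its sample rate and length. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files; the helper names are hypothetical, and the 23 fps default is taken from the output stage described above.

```python
import wave
from typing import NamedTuple


class WavInfo(NamedTuple):
    sample_rate: int
    channels: int
    num_samples: int  # samples per channel


def wav_info(path) -> WavInfo:
    """Read basic properties from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return WavInfo(
            sample_rate=wav.getframerate(),
            channels=wav.getnchannels(),
            num_samples=wav.getnframes(),
        )


def video_frames_covered(info: WavInfo, fps: int = 23) -> int:
    """Whole video frames the audio can drive at the given frame rate."""
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    return info.num_samples * fps // info.sample_rate


# Example usage (the file name is a placeholder):
#   info = wav_info("speech.wav")
#   print(info, video_frames_covered(info))
```

If the audio covers fewer frames than the frame count set on FantasyTalkingWav2VecEmbeds, a longer recording or a lower frame count is probably the simplest fix.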
Outputs:
MP4: WanVideoWrapper_I2V_FantasyTalking_[timestamp].mp4
GIF: Same prefix, with a .gif extension
5. Critical Notes
Hardware Requirements
Min VRAM: 12GB (FP16 mode)
Recommended: RTX 3090/4090
Troubleshooting
CUDA OOM: Reduce block_size in WanVideoTorchCompileSettings (currently 128)
Lip-sync issues: Adjust audio_cfg_scale
Model Paths
Wan models: ComfyUI/models/wanvideo/
Audio models auto-download to: ComfyUI/models/wav2vec2/
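A small script can confirm that the Wan model is where the workflow expects it before a long run fails at load time. This is a sketch under assumptions: the folder layout is the one given above, files are matched by name prefix because the exact file extension is not stated here, and `find_missing` is a hypothetical helper, not part of ComfyUI. The wav2vec2 folder is left out because those files download automatically.

```python
from pathlib import Path

# Relative folder -> file name prefixes expected inside it.
EXPECTED = {
    "models/wanvideo": ["Wan2_1-I2V-14B-720P_fp8"],
}


def find_missing(comfy_root, expected=EXPECTED):
    """Return a list of 'folder/prefix*' entries with no matching file."""
    root = Path(comfy_root)
    missing = []
    for rel_dir, prefixes in expected.items():
        folder = root / rel_dir
        names = [p.name for p in folder.iterdir()] if folder.is_dir() else []
        for prefix in prefixes:
            if not any(name.startswith(prefix) for name in names):
                missing.append(f"{rel_dir}/{prefix}*")
    return missing


# Example usage (adjust the path to your ComfyUI install):
#   for entry in find_missing("ComfyUI"):
#       print("missing:", entry)
```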