Unlock the Power of AI-Generated Videos: A Comprehensive Workflow Guide
1. Workflow Overview

Purpose: Generate videos from text (T2V) or images (I2V) using Alibaba's Wan 2.1 (Tongyi Wanxiang) models, with post-processing such as super-resolution and frame interpolation.
Core Models:
- Wan2_1-T2V-1.3B: Text-to-video base model (1.3B parameters).
- Wan2_1-I2V-14B: Image-to-video model (14B parameters, FP8 quantized).
- UMT5-XXL: Multilingual text encoder (supports Chinese prompts).
- FILM-Net: Frame interpolation model (for smoother videos).
2. Key Nodes
| Node Name | Function | Installation | Dependencies |
|---|---|---|---|
| | Video sampling | Install | Requires |
| | Handles Chinese/English prompts | Same as above | Needs |
| | Combines frames into MP4 | Install | Requires FFmpeg |
| | Frame interpolation | Install | |
3. Workflow Structure
Group 1: Text-to-Video (T2V)
- Input: Prompts (e.g., "vibrant cartoon style"), negative prompts, seed.
- Process: Text encoding via UMT5-XXL, video generation by Wan2_1-T2V.
- Output: Raw video (720p).
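Workflows like this are usually driven through ComfyUI's HTTP API (a POST to `/prompt` on port 8188 with the graph in API-JSON format). Below is a minimal sketch of assembling such a payload for the T2V group; the node ids, `class_type` names, and input fields are illustrative assumptions, not this workflow's actual graph (export the real one via "Save (API Format)" in ComfyUI):

```python
import json

def build_t2v_payload(prompt, negative, seed, width=720, height=1280, frames=81):
    """Assemble a ComfyUI API-format payload for a two-node T2V sketch.

    Node ids and class_type/input names are hypothetical placeholders;
    replace them with the ones exported from your own workflow.
    """
    workflow = {
        "1": {"class_type": "WanVideoTextEncode",   # hypothetical node name
              "inputs": {"positive_prompt": prompt,
                         "negative_prompt": negative}},
        "2": {"class_type": "WanVideoSampler",      # hypothetical node name
              "inputs": {"seed": seed, "width": width, "height": height,
                         "num_frames": frames,
                         # ComfyUI links are [source_node_id, output_index]
                         "text_embeds": ["1", 0]}},
    }
    return {"prompt": workflow}

payload = build_t2v_payload("vibrant cartoon style", "blurry, static", seed=42)
body = json.dumps(payload)  # this JSON is what gets POSTed to /prompt
```

Fixing the seed in the payload is what makes a run reproducible; randomize it per submission if you want variation.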
Group 2: Image-to-Video (I2V)
- Input: Reference image (e.g., ComfyUI_06397_.png), prompts.
- Process: Uses CLIP-Vision for image features, Wan2_1-I2V for generation.
- Output: Dynamic video (81+ frames).
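The 81-frame figure reflects a pattern common to Wan 2.1 setups: valid frame counts take the form 4n+1 (81 = 4×20+1) because of the VAE's 4x temporal compression. That constraint is an assumption worth checking against your sampler node's tooltip; if it holds, a small helper can round any requested count up to the nearest valid value:

```python
def snap_frame_count(requested: int) -> int:
    """Round up to the nearest frame count of the form 4n+1 (e.g., 81).

    Assumes the Wan 2.1 convention of 4x temporal compression; verify
    the rule against your specific model/node before relying on it.
    """
    if requested <= 1:
        return 1
    n = (requested - 2) // 4 + 1   # smallest n with 4n+1 >= requested
    return 4 * n + 1

print(snap_frame_count(80))  # 81
print(snap_frame_count(81))  # 81
```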
Group 3: Post-Processing
- Upscaling: ESRGAN_4x for higher resolution.
- Frame Interpolation: Boosts FPS from 16 to 32 via FILM-Net.
- Output: Final MP4 (H.264, CRF=19).
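FILM-Net predicts in-between frames with a learned model, but the 16-to-32 fps timing pattern itself is simple: one new frame inserted between each original pair. A naive sketch using a linear blend as a stand-in for the learned in-betweening (frames represented as flat lists of pixel values):

```python
def double_fps(frames):
    """Insert a midpoint frame between each consecutive pair.

    FILM-Net synthesizes these in-betweens with a neural network; the
    linear average here only illustrates the 16 -> 32 fps structure.
    N input frames become 2N - 1 output frames.
    """
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])  # naive blend
    out.append(frames[-1])
    return out

clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(len(double_fps(clip)))  # 5: 3 originals + 2 in-betweens
```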
4. Input & Output
Key Inputs:
- Prompts: Mix Chinese and English (e.g., "产品摄影,拉近镜头", i.e., "product photography, zoom in the camera").
- Resolution: Default 720x1280 (portrait) or 1280x720 (landscape).
- Frames: I2V requires ≥81 frames.
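The two default resolutions can be wrapped in a tiny helper. Note that matching the reference image's orientation for I2V is an assumption here, not something the workflow is stated to enforce:

```python
def default_resolution(image_width=None, image_height=None):
    """Pick the workflow's default resolution as (width, height).

    T2V defaults to portrait 720x1280; for I2V we match the reference
    image's orientation (assumed behavior; adjust to your node settings).
    """
    if image_width is None or image_height is None:
        return (720, 1280)        # portrait default for T2V
    if image_width >= image_height:
        return (1280, 720)        # landscape reference image
    return (720, 1280)            # portrait reference image

print(default_resolution())            # (720, 1280)
print(default_resolution(1920, 1080))  # (1280, 720)
```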
Output:
- Format: MP4 (H.264).
- Path: Saved to ComfyUI/output/Hunyuan/videos/.
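The video-combine step ultimately drives FFmpeg to produce the H.264/CRF=19 output. For illustration, here is an equivalent command line assembled in Python; the flags are standard FFmpeg/libx264 options, though the exact invocation the ComfyUI node uses internally may differ:

```python
def h264_encode_cmd(frames_pattern, out_path, fps=32, crf=19):
    """Build an FFmpeg command list for H.264 encoding at a given CRF.

    Lower CRF = higher quality and larger files (19 is near-lossless
    visually); yuv420p keeps the MP4 playable in common players.
    """
    return [
        "ffmpeg",
        "-framerate", str(fps),        # input frame rate after interpolation
        "-i", frames_pattern,          # e.g. "frame_%05d.png"
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        out_path,
    ]

cmd = h264_encode_cmd("frame_%05d.png", "out.mp4")
```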
5. Notes
VRAM Requirements:
- The I2V model needs a GPU with ≥16GB of VRAM (FP8 quantization still demands a high-end card).
- Enable torch.compile for speed (requires CUDA 12+).
Common Issues:
- Frame count <81: Adjust WanVideoBlockSwap settings.
- OOM errors: Reduce resolution or disable bf16.
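One way to act on the "reduce resolution" advice is to scale down to a fixed pixel budget while preserving aspect ratio. Snapping dimensions to multiples of 16 is an assumed latent-space alignment constraint, common for diffusion models but worth verifying against this one:

```python
def fit_resolution(width, height, max_pixels=720 * 1280, multiple=16):
    """Scale a resolution down to a pixel budget, keeping aspect ratio.

    Dimensions are snapped down to a multiple of 16 (assumed alignment
    requirement; check your model's actual constraints).
    """
    scale = min(1.0, (max_pixels / (width * height)) ** 0.5)
    w = int(width * scale) // multiple * multiple
    h = int(height * scale) // multiple * multiple
    return max(w, multiple), max(h, multiple)

print(fit_resolution(1280, 720))  # already within budget: (1280, 720)
```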
Optimization:
- Use the TeaCache node to save VRAM (enabled by default).
- Include motion keywords in prompts (e.g., "zoom in").