Unlock the Power of AI-Generated Videos: A Comprehensive Workflow Guide
1. Workflow Overview

Purpose: Generates videos from text (T2V) or images (I2V) using Alibaba's Wan 2.1 (Tongyi Wanxiang) models, with post-processing such as super-resolution and frame interpolation.
Core Models:
- Wan2_1-T2V-1.3B: Text-to-video base model (1.3B params).
- Wan2_1-I2V-14B: Image-to-video model (14B params, FP8 quantized).
- UMT5-XXL: Multilingual text encoder (supports Chinese prompts).
- FILM-Net: Frame interpolation model (for smoother videos).
2. Key Nodes
| Node Name | Function | Installation | Dependencies |
|---|---|---|---|
| | Video sampling | Install | Requires |
| | Handles Chinese/English prompts | Same as above | Needs |
| | Combines frames into MP4 | Install | Requires FFmpeg |
| | Frame interpolation | Install | |
3. Workflow Structure
Group 1: Text-to-Video (T2V)
Input: Prompts (e.g., "vibrant cartoon style"), negative prompts, seed.
Process: Text encoding via UMT5-XXL, video generation by Wan2_1-T2V.
Output: Raw video (720p).
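During sampling, the model runs both the positive and negative prompt branches and combines them with classifier-free guidance, which is how the negative prompt steers generation away from unwanted content. A minimal sketch of that combination step, applied elementwise over the latent tensor (the function name and guidance value are illustrative, not taken from the workflow):

```python
def cfg_combine(uncond, cond, guidance_scale=6.0):
    """Classifier-free guidance: push the prediction away from the
    negative-prompt (unconditional) branch toward the positive prompt.
    Plain lists stand in for latent tensors to keep the sketch dependency-free."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# Toy one-step model outputs for the two branches.
uncond = [0.0, 0.0, 0.0]
cond = [1.0, 1.0, 1.0]
print(cfg_combine(uncond, cond))  # [6.0, 6.0, 6.0]
```

A guidance scale above 1 amplifies the difference between the two branches; at scale 1 the negative prompt has no extra effect.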
Group 2: Image-to-Video (I2V)
Input: Reference image (e.g., ComfyUI_06397_.png), prompts.
Process: Uses CLIP-Vision for image features, Wan2_1-I2V for generation.
Output: Dynamic video (81+ frames).
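Frame counts like 81 come from the video VAE's temporal compression: assuming a 4x temporal stride plus the initial frame (my assumption about this model family, not stated in the guide), valid counts have the form 4k + 1, and 81 pixel frames map to 21 latent frames:

```python
def latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for a causal video VAE that keeps the first frame
    and compresses the rest by `temporal_stride` (assumed to be 4 here)."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be {temporal_stride}*k + 1, got {num_frames}")
    return (num_frames - 1) // temporal_stride + 1

print(latent_frames(81))  # 21
```

This is why an arbitrary frame count such as 80 is rejected while 81 works.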
Group 3: Post-Processing
Upscaling: ESRGAN_4x for higher resolution.
Frame Interpolation: Boosts FPS from 16 to 32 via FILM-Net.
Output: Final MP4 (H.264, CRF=19).
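FILM-Net is a learned interpolator; as a rough illustration of how inserting a frame between each pair doubles the effective frame rate, here is a naive linear blend (a stand-in for the idea only, not FILM's actual method):

```python
def interpolate_midpoints(frames):
    """Insert a blended frame between each consecutive pair:
    n frames -> 2n - 1 frames. Played back at double the FPS (16 -> 32),
    the clip's duration stays roughly constant."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])  # naive linear blend
    out.append(frames[-1])
    return out

clip = [[0.0], [1.0], [2.0]]          # three 1-pixel "frames"
print(interpolate_midpoints(clip))    # [[0.0], [0.5], [1.0], [1.5], [2.0]]
```

Learned interpolators like FILM replace the linear blend with motion-aware synthesis, which avoids the ghosting this naive version produces on fast motion.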
4. Input & Output
Key Inputs:
Prompts: Mix Chinese/English (e.g., "产品摄影,拉近镜头", i.e. "product photography, zoom in").
Resolution: Default 720x1280 (portrait) or 1280x720 (landscape).
Frames: I2V requires ≥81 frames.
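The ≥81 minimum combines with the 4k + 1 frame constraint (my assumption about this model family); a small helper that snaps a requested count to the nearest valid value:

```python
def snap_frame_count(requested, minimum=81, stride=4):
    """Round up to the nearest count of the form stride*k + 1,
    never going below `minimum`."""
    n = max(requested, minimum)
    k = -(-(n - 1) // stride)  # ceiling division
    return stride * k + 1

print(snap_frame_count(60))   # 81
print(snap_frame_count(100))  # 101
```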
Output:
Format: MP4 (H.264).
Path: Saved to ComfyUI/output/Hunyuan/videos/.
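The H.264/CRF=19 output settings correspond to an FFmpeg invocation roughly like the one built below (the flags are standard FFmpeg options; the frame-pattern and output filenames are illustrative):

```python
def build_ffmpeg_cmd(pattern="frame_%05d.png", fps=32, crf=19, out="output.mp4"):
    """Assemble an ffmpeg argv that muxes numbered frames into H.264 MP4.
    CRF 19 is near-visually-lossless; lower CRF = higher quality, larger file."""
    return [
        "ffmpeg", "-framerate", str(fps), "-i", pattern,
        "-c:v", "libx264", "-crf", str(crf), "-pix_fmt", "yuv420p", out,
    ]

print(" ".join(build_ffmpeg_cmd()))
```

`-pix_fmt yuv420p` keeps the file playable in common players, which often reject other pixel formats.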
5. Notes
VRAM Requirements:
I2V model needs ≥16GB GPU (FP8 quantization still demands high-end cards).
Enable torch.compile for speed (requires CUDA 12+).
Common Issues:
Frame count <81: Adjust WanVideoBlockSwap settings.
OOM errors: Reduce resolution or disable bf16.
Optimization:
Use the TeaCache node to speed up sampling by caching redundant diffusion steps (enabled by default).
Include motion keywords in prompts (e.g., "zoom in").
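For OOM recovery, resolution usually cannot be reduced to arbitrary values; a helper that scales both sides down while keeping them divisible by 16 (divisibility by 16 is my assumption about the VAE/patching constraints; adjust to the model's real requirement):

```python
def downscale_resolution(width, height, scale=0.75, multiple=16):
    """Shrink both dimensions by `scale`, snapping each down to the
    nearest multiple of `multiple` so the model accepts the size."""
    w = max(multiple, int(width * scale) // multiple * multiple)
    h = max(multiple, int(height * scale) // multiple * multiple)
    return w, h

print(downscale_resolution(720, 1280))  # (528, 960)
```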