Unlock the Power of Text-to-Video Generation with Aliyun's Wan2.1 Model

ComfyUI.org
2025-04-02 09:23:24

1. Workflow Overview


This workflow uses Aliyun's Wan2.1 model for Text-to-Video (T2V) generation. It chains text encoding, video diffusion, and VAE decoding to turn a text prompt into a video clip. Key features:

  • Supports Chinese prompts (e.g., "滑雪的男人" - "a man skiing")

  • Configurable frame rate (default: 16fps) and resolution (480x768)

  • Includes negative prompts for quality filtering

2. Core Models

Model Name         Function                    Installation
Wan2.1-T2V-1.3B    Video diffusion backbone    Manual download (.safetensors)
umt5-xxl-enc       Chinese text encoder        Place in models/wan_t5
Wan2.1_VAE         Latent space decoder        Manual download

3. Key Nodes

  • LoadWanVideoT5TextEncoder
    Loads the Chinese text encoder (umt5-xxl-enc). Use bf16 precision to save VRAM.

  • WanVideoTextEncode
    Encodes the positive and negative prompts. The example negative prompt filters out low-quality content.

  • WanVideoModelLoader
    Loads the main video model with options for fp32/fp16 and VRAM optimization.

  • WanVideoSampler
    Core sampler parameters:

    • steps: 10 (lower values generate faster, at some cost in quality)

    • cfg_scale: 6 (lower values follow the prompt less strictly, leaving more creative freedom)

    • sampler: dpm++

  • VHS_VideoCombine
    Combines frames into an MP4 video. Configurable settings:

    • Frame rate (16fps)

    • Output format (H.264, CRF=19)

    • Filename prefix (WanVideo2_1_T2V)
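
To make the export settings above concrete, here is a minimal sketch that builds a roughly equivalent ffmpeg command line (16 fps, H.264, CRF 19). It assumes ffmpeg is installed and that the frames were saved as numbered PNG files; the function name, frame pattern, and output filename are hypothetical, and this is not a description of how VHS_VideoCombine works internally.

```python
def build_ffmpeg_command(frame_pattern, output_path, fps=16, crf=19):
    """Build an ffmpeg argument list that encodes numbered frames to H.264 MP4."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate (16 fps in this workflow)
        "-i", frame_pattern,      # e.g. frames/frame_00001.png, frame_00002.png, ...
        "-c:v", "libx264",        # H.264 encoder
        "-crf", str(crf),         # constant rate factor: lower means higher quality
        "-pix_fmt", "yuv420p",    # pixel format most players can decode
        output_path,
    ]


# Hypothetical paths, for illustration only.
cmd = build_ffmpeg_command("frames/frame_%05d.png", "WanVideo2_1_T2V_00001.mp4")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the frame files exist.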

4. Workflow Structure

Group 1: Text Processing

  • Input: Chinese prompt

  • Output: Text embeddings

  • Key nodes: LoadWanVideoT5TextEncoder, WanVideoTextEncode

Group 2: Video Generation

  • Input: Text embeds + empty image embeds (480x768)

  • Output: Latent video data

  • Key nodes: WanVideoSampler

Group 3: Video Export

  • Input: Latent video data (decoded into an image sequence by WanVideoDecode)

  • Output: MP4 file

  • Key nodes: WanVideoDecode, VHS_VideoCombine
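
The three groups form a chain in which each group's output feeds the next. The sketch below models that chain in plain Python and checks that the stages connect. The node names come from this article; the data labels ("prompt", "text_embeds", "latents", "mp4") and the `Stage` class are hypothetical and are not ComfyUI identifiers. Group 3 is modeled as starting from latents because it contains WanVideoDecode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    nodes: tuple
    consumes: str   # hypothetical label for the stage input
    produces: str   # hypothetical label for the stage output


PIPELINE = (
    Stage("Text Processing",
          ("LoadWanVideoT5TextEncoder", "WanVideoTextEncode"),
          "prompt", "text_embeds"),
    Stage("Video Generation", ("WanVideoSampler",), "text_embeds", "latents"),
    Stage("Video Export", ("WanVideoDecode", "VHS_VideoCombine"), "latents", "mp4"),
)


def check_chain(stages):
    """Raise ValueError if a stage does not consume what the previous one produces."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"{prev.name} produces {prev.produces!r}, "
                f"but {nxt.name} expects {nxt.consumes!r}"
            )
    return True


check_chain(PIPELINE)
for stage in PIPELINE:
    print(f"{stage.name}: {stage.consumes} -> {stage.produces}")
```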

5. I/O Specifications

Input Parameters:

  • Resolution: 480x768 (set in WanVideoEmptyEmbeds)

  • Seed: Fixed/Random (example: 1057359483639287)

  • Prompts: Natural-language Chinese (avoid complex syntax)

Output:

  • MP4 video (saved to ComfyUI output folder)

  • Includes generation metadata
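
As an illustration of the input parameters, the sketch below resolves a fixed or random seed and computes the latent grid size for a given resolution. The 8x spatial compression factor is an assumption about the Wan2.1 VAE, not something stated in this workflow, and both function names are hypothetical helpers rather than part of any node.

```python
import random

VAE_SPATIAL_FACTOR = 8  # assumption: the VAE downsamples each spatial dimension by 8


def resolve_seed(seed=None):
    """Return the given fixed seed, or draw a random 64-bit seed when none is given."""
    if seed is None:
        return random.randrange(2**64)
    return seed


def latent_size(dim_a, dim_b, factor=VAE_SPATIAL_FACTOR):
    """Return the latent grid size for a pixel resolution, under the assumed factor."""
    if dim_a % factor or dim_b % factor:
        raise ValueError(f"{dim_a}x{dim_b} is not divisible by {factor}")
    return dim_a // factor, dim_b // factor


print(resolve_seed(1057359483639287))  # fixed seed from the example above
print(latent_size(480, 768))
```

Reusing the same fixed seed with identical settings is the usual way to reproduce a result; a random seed gives a different video on each run.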

6. Notes

⚠️ VRAM Requirements

  • Minimum 12GB (16GB recommended)

  • Enable offload_device to reduce VRAM usage
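
For a rough sense of how precision affects memory, the sketch below estimates the size of the diffusion model weights alone. It assumes about 1.3 billion parameters, inferred from the model name Wan2.1-T2V-1.3B. The estimate excludes the text encoder, the VAE, and activations, so it does not replace the VRAM figures above.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params, precision):
    """Estimate weight storage in GiB for a parameter count and precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


NUM_PARAMS = 1.3e9  # assumption based on the "1.3B" in the model name

for precision in ("fp32", "fp16"):
    print(f"{precision}: {weight_memory_gib(NUM_PARAMS, precision):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why fp16 or bf16 is the usual choice on 12GB cards.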

⚠️ Model Installation

  • Download Wan2.1 models manually from official sources

  • Text encoder path: models/wan_t5/umt5-xxl-enc-bf16.safetensors
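
A quick way to confirm the text encoder is in place before loading the workflow is a small path check like the one below. The relative path is the one given above; the ComfyUI root directory and the `find_missing` helper are hypothetical and should be adapted to your installation.

```python
from pathlib import Path

# Path from this article, relative to the ComfyUI root directory.
TEXT_ENCODER_RELPATH = Path("models/wan_t5/umt5-xxl-enc-bf16.safetensors")


def find_missing(comfy_root, relpaths):
    """Return the relative paths (as strings) that are not files under comfy_root."""
    root = Path(comfy_root)
    return [str(p) for p in relpaths if not (root / p).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder for wherever ComfyUI is installed.
    missing = find_missing("ComfyUI", [TEXT_ENCODER_RELPATH])
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("Text encoder found.")
```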

⚠️ Dependencies

  • Requires ComfyUI-WanVideoWrapper & VideoHelperSuite

  • Install via ComfyUI Manager