From Pose to Playback: Mastering Video Generation with Tongyi Wanxiang's Fun-ControlNet

ComfyUI.org
2025-05-30 07:45:57

1. Workflow Overview


This workflow, titled "Tongyi Wanxiang-WAN2.1-Fun ControlNet Video Generation [Pose/Depth Control]", is designed for:

  • Video Generation: Creates dynamic videos from input control signals (e.g., pose/depth maps).

  • Content Control: Uses Fun-ControlNet to make the generated content follow the control signal (e.g., character motion).

  • Post-Processing: Includes video upscaling, frame interpolation, and final rendering.

2. Core Models

  • WAN2.1-Fun-ControlNet: Main video generation model; accepts control signals such as pose and depth maps.

  • Meta-Llama-3.1-8B: Language model used by Joy_caption_two to write captions for reference images.

  • FILM VFI: Frame interpolation model for smoother motion.

  • 4x_foolhardy_Remacri: 4x upscaling model applied to the generated frames.

3. Key Nodes

Video Generation

  • WanVideoModelLoader: Loads the WAN2.1-Fun-ControlNet model.

  • WanVideoSampler: Generates video frames with configurable parameters (steps, CFG scale).

  • WanVideoDecode: Decodes latent frames to images.

Control Signal Processing

  • AIO_Preprocessor: Preprocesses control maps (e.g., pose/depth).

  • WanVideoControlEmbeds: Encodes the control signals into embeddings for WanVideoSampler.

Post-Processing

  • FILM VFI: Interpolates frames for smoother playback.

  • ImageUpscaleWithModel: Enhances video resolution.

  • VHS_VideoCombine: Renders final video (supports audio merging).

Utilities

  • Joy_caption_two: Generates text prompts from reference images.

  • easy cleanGpuUsed: Frees GPU memory between stages to reduce the risk of out-of-memory errors.
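
To illustrate the kind of cleanup a node like easy cleanGpuUsed performs, here is a minimal sketch in plain Python. It is not the node's actual source: the function name free_gpu_memory is hypothetical, and the sketch assumes PyTorch is the backend (it returns False when PyTorch or CUDA is unavailable).

```python
import gc


def free_gpu_memory():
    """Collect unreferenced Python objects, then release PyTorch's cached CUDA memory.

    Returns True if a CUDA cache was emptied, False otherwise.
    (Hypothetical helper; not the implementation of easy cleanGpuUsed.)
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Return cached, currently unused blocks to the GPU driver.
    torch.cuda.empty_cache()
    return True


if __name__ == "__main__":
    print("CUDA cache cleared:", free_gpu_memory())
```

Note that emptying the cache only releases memory that no live tensor still references, so models that remain loaded keep their VRAM.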

4. Workflow Structure (Groups)

  1. Input Control Video Group

    • Input: Uploaded video or control images (e.g., pose maps).

    • Key Nodes: VHS_LoadVideo, ImageResizeKJ (resizes input).

  2. Fun-Control Group

    • Input: Control signals, prompts, model parameters.

    • Key Nodes: WanVideoSampler, WanVideoControlEmbeds.

  3. Reference Image Captioning Group

    • Input: Reference image.

    • Key Node: Joy_caption_two (generates descriptive text).

  4. Post-Processing Group

    • Input: Raw generated frames.

    • Key Nodes: FILM VFI (interpolation), VHS_VideoCombine (final render).
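
The four groups above form a small dependency graph. The sketch below models it with Python's standard graphlib module and derives a valid execution order. The edges are an assumption read off the group descriptions (prompts from captioning and control signals from the input group both feed Fun-Control); they are not exported from the workflow file.

```python
from graphlib import TopologicalSorter

# Each group maps to the set of groups whose output it consumes.
# Assumed edges, inferred from the group descriptions above.
GROUP_DEPENDENCIES = {
    "Input Control Video": set(),
    "Reference Image Captioning": set(),
    "Fun-Control": {"Input Control Video", "Reference Image Captioning"},
    "Post-Processing": {"Fun-Control"},
}


def execution_order(dependencies):
    """Return one valid order in which the groups can run."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(GROUP_DEPENDENCIES)
print(" -> ".join(order))
```

Any order produced this way runs both input groups before Fun-Control and leaves Post-Processing for last.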

5. Inputs & Outputs

  • Input Parameters:

    • Control video, resolution (default: 480x832), prompts, frame limit (default: 49).

  • Output:

    • Final video (MP4), optionally upscaled and interpolated.
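
The arithmetic behind these parameters can be sketched as follows. Several points are assumptions rather than facts stated in this article: that width and height should be multiples of 16, that frame counts should have the form 4n + 1 (the defaults 480x832 and 49 satisfy both), and that the interpolator inserts (multiplier - 1) new frames between each pair of consecutive frames. All function names are hypothetical helpers.

```python
def snap_to_multiple(value, multiple=16):
    """Round value to the nearest positive multiple of `multiple` (assumed constraint)."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_frame_count(frames):
    """Largest count of the form 4n + 1 that does not exceed `frames` (assumed constraint)."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    return (frames - 1) // 4 * 4 + 1


def interpolated_frame_count(frames, multiplier):
    """Frame count after interpolation, assuming (multiplier - 1) frames are
    inserted between each pair of consecutive frames."""
    return (frames - 1) * multiplier + 1


def upscaled_size(width, height, scale=4):
    """Output size after a fixed-factor upscaler such as a 4x model."""
    return width * scale, height * scale


print(snap_to_multiple(480), snap_to_multiple(832))  # 480 832
print(snap_frame_count(49))                          # 49
print(interpolated_frame_count(49, 2))               # 97
print(upscaled_size(480, 832))                       # (1920, 3328)
```

With the defaults, a 2x interpolation pass followed by the 4x upscaler would give 97 frames at 1920x3328 under these assumptions, which is worth checking against available VRAM and disk space before enabling both.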

6. Notes & Tips

  • VRAM Requirement: A GPU with at least 16 GB of VRAM is recommended (e.g., the 24 GB RTX 3090).

  • Dependencies: Install ComfyUI-WanVideoWrapper and ComfyUI-VideoHelperSuite manually.

  • Common Issues:

    • Missing model files: Confirm that Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors has been downloaded and is visible to WanVideoModelLoader.

    • Resolution mismatch: Resize the input video and control maps to the same width and height.
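
Both issues can be caught before queueing the workflow. The following is a minimal pre-flight sketch, not part of the workflow itself: find_model and dimensions_match are hypothetical helpers, and the models directory passed to find_model is whatever your ComfyUI installation uses (the path in the usage example is only a placeholder).

```python
from pathlib import Path

REQUIRED_MODEL = "Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors"


def find_model(models_root, filename=REQUIRED_MODEL):
    """Search models_root recursively; return the first matching path, or None."""
    root = Path(models_root)
    if not root.is_dir():
        return None
    return next(root.rglob(filename), None)


def dimensions_match(video_size, control_size):
    """True when the (width, height) of the video and the control map are identical."""
    return tuple(video_size) == tuple(control_size)


if __name__ == "__main__":
    # Placeholder path: point this at your own ComfyUI models directory.
    location = find_model("ComfyUI/models")
    print("model found at:", location if location else "not found")
    print("sizes match:", dimensions_match((480, 832), (480, 832)))
```

The recursive search is deliberate: it avoids hard-coding a particular subfolder, since the expected location depends on how the loader node is configured.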