Discover the Ultimate Video Transformation Workflow: Wan2.1 VACE Unleashed

ComfyUI.org
2025-04-25 09:40:35

1. Workflow Overview

  • Purpose:
    This workflow transforms input videos into stylized animations using Wan2.1 VACE with:

    • Pose Control (OpenPose) and Depth Control (Depth Map)

    • Frame interpolation (FILM VFI) and video upscaling

    • Auto-prompt generation via Florence2

  • Core Models:

    • Wan2.1 VACE: Main video generation model for style transfer

    • Florence2: Image captioning model for auto-prompts

    • DepthAnything V2: Depth map generator for structural control

    • FILM VFI: Frame interpolation model (16FPS → 32FPS)


2. Key Nodes

Node                | Function               | Installation             | Dependencies
--------------------|------------------------|--------------------------|------------------------------------------------
WanVideoModelLoader | Loads Wan2.1 model     | ComfyUI-WanVideoWrapper  | Download models: HuggingFace
DepthAnything_V2    | Generates depth maps   | ComfyUI-DepthAnythingV2  | Requires depth_anything_v2_vitl_fp16.safetensors
Florence2Run        | Auto-generates prompts | ComfyUI-Florence2        | Load Florence-2-Flux-Large model
FILM VFI            | Frame interpolation    | Built-in                 | Download film_net_fp32.pt
VHS_VideoCombine    | Video rendering/export | ComfyUI-VideoHelperSuite | Requires FFmpeg
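
Before loading the workflow, it can save time to confirm that the dependencies listed for these nodes are actually present. The following is a minimal sketch, not part of the workflow itself: the function names are hypothetical, and the models directory ("ComfyUI/models") is an assumed install path. Because this article does not say which subfolder each file belongs in, the check simply searches the whole directory tree for the file names given above.

```python
import shutil
from pathlib import Path

# File names taken from the dependency list above. Their exact subfolders
# are not given in this article, so the search below is recursive.
REQUIRED_FILES = [
    "depth_anything_v2_vitl_fp16.safetensors",
    "film_net_fp32.pt",
]


def find_missing_files(models_root, required=REQUIRED_FILES):
    """Return the required file names not found anywhere under models_root."""
    root = Path(models_root)
    if not root.is_dir():
        # Nothing to search: every required file counts as missing.
        return list(required)
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in required if name not in present]


def ffmpeg_available():
    """True if an `ffmpeg` executable is on PATH (needed for video export)."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    # "ComfyUI/models" is an assumption; point this at your own install.
    missing = find_missing_files("ComfyUI/models")
    print("Missing model files:", missing or "none")
    print("FFmpeg on PATH:", ffmpeg_available())
```

Any file reported as missing can then be fetched through ComfyUI Manager or downloaded manually.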


3. Workflow Structure

Group 1: Input Setup

  • Inputs: Video file, reference image, seed, resolution cap (e.g., 1280x720)

  • Outputs: Preprocessed frames
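
One plausible reading of the resolution cap is "shrink the frame so its longer side does not exceed the cap, keeping the aspect ratio". The sketch below illustrates that reading only; the function name is hypothetical, and rounding each side down to a multiple of 16 is an assumption rather than something this workflow documents.

```python
def cap_resolution(width, height, cap=1280, multiple=16):
    """Fit (width, height) under `cap` on the longer side, keeping aspect ratio.

    Integer arithmetic avoids floating-point rounding surprises.
    Rounding down to `multiple` is an assumption for illustration.
    """
    if width <= 0 or height <= 0 or cap <= 0:
        raise ValueError("width, height and cap must be positive")
    longest = max(width, height)
    if longest > cap:
        # Scale both sides by cap / longest; never upscale smaller inputs.
        width = width * cap // longest
        height = height * cap // longest
    # Snap each side down to the nearest multiple (at least one multiple).
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h


print(cap_resolution(1920, 1080))  # (1280, 720)
print(cap_resolution(1080, 1920))  # (720, 1280)
print(cap_resolution(640, 480))    # (640, 480), already under the cap
```

Under these assumptions, a 1080p source with a cap of 1280 ends up at the 1280x720 size mentioned above.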

Group 2: Control Generation

  • Pose Control: OpenPose keypoints via DWPreprocessor

  • Depth Control: Depth maps via DepthAnything_V2

  • Prompts: Manual input or auto-generated by Florence2

Group 3: Video Generation

  • VACE Encoding: Encodes the control frames and reference image as conditioning for the model

  • Wan2.1 Model: Generates latent video frames from the encoded inputs and the prompt

Group 4: Post-Processing

  • Frame Interpolation: Doubles the frame rate from 16FPS to 32FPS with FILM VFI

  • Video Export: Combines frames into MP4
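
The effect of the interpolation step on clip length is simple arithmetic. The sketch below assumes the interpolator inserts (multiplier - 1) new frames between each pair of consecutive frames, which is a common convention but is not confirmed by this article for FILM VFI; the function name is hypothetical.

```python
def interpolated_clip(frame_count, fps=16.0, multiplier=2):
    """Return (output_frames, output_fps) after frame interpolation.

    Assumes (multiplier - 1) frames are inserted between every pair of
    neighbouring input frames, so the last input frame stays last.
    """
    if frame_count < 1 or multiplier < 1:
        raise ValueError("frame_count and multiplier must be >= 1")
    out_frames = (frame_count - 1) * multiplier + 1
    out_fps = fps * multiplier
    return out_frames, out_fps


frames, fps = interpolated_clip(81)  # e.g. an 81-frame clip at 16 FPS
print(frames, fps)                   # 161 32.0
print(round(frames / fps, 2))        # 5.03 seconds, close to 81 / 16
```

The clip duration stays almost unchanged; only the motion becomes smoother because twice as many frames are shown per second.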


4. Inputs & Outputs

  • Required Inputs:

    • Video file (MP4)

    • Reference image (e.g., Girl_85_Highres.png)

    • Positive prompt (e.g., "Night scene, a dancing girl")

    • Resolution cap (default: 1280)

  • Output:

    • Final video (saved to output/Video)

    • Intermediate results (depth maps, pose keypoints)


5. Notes

  1. Hardware:

    • ≥12GB VRAM (use BlockSwap for lower VRAM)

    • Enable Triton/SageAttn for a 20%-50% speed boost

  2. Troubleshooting:

    • Download missing models via ComfyUI Manager

    • Depth control is more stable than pose control

  3. Optimization:

    • On low-VRAM GPUs, set blocks_to_swap to a value in the 30-40 range in the WanVideoBlockSwap node
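
As a rough starting point, the hardware and optimization notes can be combined into a small helper. This is only a sketch under stated assumptions: the 12GB threshold and the 30-40 range come from the notes above, but the function name and the linear mapping from VRAM to a swap count are hypothetical, not a documented rule of WanVideoBlockSwap.

```python
def suggest_blocks_to_swap(vram_gb, threshold_gb=12.0, low=30, high=40):
    """Suggest a starting blocks_to_swap value for a given amount of VRAM.

    Returns 0 (no block swapping) at or above the threshold. Below it,
    less VRAM maps linearly to more swapped blocks within [low, high].
    The linear mapping is an illustrative assumption.
    """
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    if vram_gb >= threshold_gb:
        return 0
    fraction = vram_gb / threshold_gb  # strictly between 0 and 1
    return round(high - (high - low) * fraction)


print(suggest_blocks_to_swap(16))  # 0, enough VRAM
print(suggest_blocks_to_swap(6))   # 35
```

Treat the result as a first guess: if generation still runs out of memory, raise the value toward 40; if there is headroom, lower it to reduce the slowdown from swapping.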