Transform Your Videos into Stylized Animations with Advanced AI Technology

ComfyUI.org
2025-04-08 12:55:45

1. Workflow Overview

  • Purpose: Transforms input videos into stylized animations using the Wan2.1 model, with dual control from line art (AnimeLineArt) and depth maps (DepthAnything).

  • Key Tech: Combines ControlNet, T5 text encoding, and frame interpolation for dynamic content.

2. Core Models

| Model Name | Function |
| --- | --- |
| Wan2.1-Fun-Control-14B | Main model for video generation (FP8 optimized). |
| AnimeLineArtPreprocessor | Extracts line art from input video for style control. |
| DepthAnythingPreprocessor | Generates depth maps for spatial consistency. |
| Florence2-Flux-Large | Auto-generates captions for video frames. |

3. Key Nodes & Installation

| Node Name | Function | Installation |
| --- | --- | --- |
| WanVideoWrapper | Core nodes for video generation (model loading, sampling, encoding). | GitHub: ComfyUI-WanVideoWrapper |
| ControlNet Aux | Preprocessors for line art and depth maps. | ComfyUI Manager: comfyui-controlnet-aux |
| Video Helper Suite | Video loading/combining tools. | ComfyUI Manager: comfyui-videohelpersuite |
| Florence2 | Image captioning. | GitHub: comfyui-florence2 |

Required Models:

  • Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors (main model).

  • umt5-xxl-enc-bf16.safetensors (T5 encoder).
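Before opening the workflow, it can help to confirm that both files are where ComfyUI will look for them. The sketch below is a minimal check, assuming a layout in which the main model sits under models/diffusion_models and the T5 encoder under models/text_encoders. These folder names, the "ComfyUI" root path, and the helper `find_missing_models` are assumptions for illustration, not something the workflow itself specifies, so adjust them to match your installation.

```python
from pathlib import Path

# Assumed folder layout (not stated in the article); edit to match your setup.
REQUIRED_MODELS = {
    "diffusion_models": "Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors",
    "text_encoders": "umt5-xxl-enc-bf16.safetensors",
}


def find_missing_models(comfy_root, required=REQUIRED_MODELS):
    """Return the relative paths of required model files that are absent."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for subfolder, filename in required.items():
        candidate = models_dir / subfolder / filename
        if not candidate.is_file():
            missing.append(f"models/{subfolder}/{filename}")
    return missing


if __name__ == "__main__":
    # "ComfyUI" is a placeholder for the installation root.
    for path in find_missing_models("ComfyUI"):
        print(f"missing: {path}")
```

An empty result means both files were found in the assumed folders; anything printed is a file still to be downloaded or moved.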

4. Workflow Structure

  1. Input Group (Upload video and reference image):

    • Inputs: Raw video (VHS_LoadVideo), reference image (LoadImage).

    • Process:

      • Frame extraction → Line art + depth map generation.

      • Caption generation via Florence2Run.

    • Outputs: Preprocessed images + text prompts.

  2. Model Loading (Wan model):

    • Loads Wan2.1, T5 encoder, VAE, and configures optimizations (TorchCompile, BlockSwap).

  3. Generation Group (Sampling and generation):

    • Inputs: Preprocessed images, text prompts, control args.

    • Process:

      • Text encoding (WanVideoTextEncode) → Image encoding (WanVideoImageToVideoEncode) → Sampling (WanVideoSampler).

    • Outputs: Latent video representation.

  4. Output Group:

    • Decodes latent to images (WanVideoDecode) → Combines video (VHS_VideoCombine).
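The four groups above form a dependency graph: each node can only run once the nodes feeding it have finished. As a rough illustration, the sketch below writes that graph down as a plain dictionary and derives a valid execution order with Python's standard `graphlib`. The node names come from this article, but the exact wiring between them is an approximation, and the model-loading nodes are left out for brevity. ComfyUI resolves this ordering internally; nothing here calls ComfyUI.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes whose outputs it consumes.
# Wiring is approximate and assumed for illustration only.
WORKFLOW = {
    "VHS_LoadVideo": set(),
    "LoadImage": set(),
    "AnimeLineArtPreprocessor": {"VHS_LoadVideo"},
    "DepthAnythingPreprocessor": {"VHS_LoadVideo"},
    "Florence2Run": {"VHS_LoadVideo"},
    "WanVideoTextEncode": {"Florence2Run"},
    "WanVideoImageToVideoEncode": {
        "LoadImage",
        "AnimeLineArtPreprocessor",
        "DepthAnythingPreprocessor",
    },
    "WanVideoSampler": {"WanVideoTextEncode", "WanVideoImageToVideoEncode"},
    "WanVideoDecode": {"WanVideoSampler"},
    "VHS_VideoCombine": {"WanVideoDecode"},
}


def execution_order(graph):
    """Return one valid order in which every node runs after its inputs."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, node in enumerate(execution_order(WORKFLOW), start=1):
        print(f"{step:2d}. {node}")
```

Reading the graph this way makes it easier to see which upstream node to inspect when a later stage fails: a sampling error, for example, can only originate in the sampler itself or in one of the encoders and preprocessors that feed it.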

5. Inputs & Outputs

  • Inputs:

    • Video (MP4), reference image (PNG).

    • Resolution: 768x768 (adjusted via ImageResizeKJ).

    • Prompts: Auto-generated (Florence2) or written manually (the example workflow includes positive and negative prompts).

  • Output:

    • Stylized video (H.264 MP4, 16fps).
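The output frame rate fixes how long the result will be for a given number of frames. The short sketch below works through that arithmetic at 16 fps. It also includes a helper that rounds a frame count down to the form 4n + 1, which is an assumption about what Wan2.1 expects rather than something stated in this workflow; both function names are hypothetical.

```python
def clip_duration_seconds(num_frames, fps=16):
    """Length of the output clip in seconds at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def nearest_valid_frame_count(num_frames):
    """Round down to a count of the form 4n + 1 (assumed Wan2.1 constraint)."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return ((num_frames - 1) // 4) * 4 + 1


if __name__ == "__main__":
    frames = nearest_valid_frame_count(100)
    print(frames, clip_duration_seconds(frames))
```

With the 81-frame cap mentioned in the notes, the clip runs for 81 / 16 = 5.0625 seconds, so longer source videos are trimmed to roughly their first five seconds of frames.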

6. Notes

  • VRAM: Minimum 16GB (recommended 24GB+ due to Wan2.1 size).

  • Common Errors:

    • Frame limit exceeded: Adjust frame_load_cap (currently 81 frames).

    • Line art failure: Ensure input video has motion.

  • Optimization:

    • Enable fp8 mode for lower VRAM usage.

    • Tweak BlockSwap for memory management.
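To see why fp8 and BlockSwap matter, a back-of-the-envelope estimate of weight memory is useful. The sketch below assumes a nominal 14 billion parameters (taken from the model name), 1 byte per parameter for fp8 versus 2 bytes for bf16, and a model of 40 equally sized transformer blocks. The block count and the even split are assumptions, and the estimate ignores activations, the T5 encoder, and the VAE, so treat the numbers as a rough guide only.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


def resident_after_block_swap(total_gib, total_blocks, blocks_to_swap):
    """Weights left on the GPU if some blocks are offloaded to system RAM.

    Assumes every block is the same size, which is a simplification.
    """
    if not 0 <= blocks_to_swap <= total_blocks:
        raise ValueError("blocks_to_swap must be between 0 and total_blocks")
    return total_gib * (total_blocks - blocks_to_swap) / total_blocks


if __name__ == "__main__":
    params = 14e9  # nominal size implied by "14B"
    fp8 = weight_memory_gib(params, 1)
    bf16 = weight_memory_gib(params, 2)
    print(f"fp8 weights:  {fp8:.1f} GiB")
    print(f"bf16 weights: {bf16:.1f} GiB")
    # 40 blocks is an assumed figure; check your model before relying on it.
    print(f"fp8 with 20 of 40 blocks swapped: "
          f"{resident_after_block_swap(fp8, 40, 20):.1f} GiB")
```

Under these assumptions the fp8 weights alone take about 13 GiB, which is consistent with the 16GB minimum above, and swapping more blocks trades GPU memory for slower sampling because offloaded blocks must be moved back before use.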