Transform Your Videos into Stylized Animations with Advanced AI Technology
1. Workflow Overview

Purpose: Transforms input videos into stylized animations using the Wan2.1 model, with dual control via line art (`AnimeLineArt`) and depth maps (`DepthAnything`).
Key Tech: Combines ControlNet, T5 text encoding, and frame interpolation for dynamic content.
2. Core Models
Model Name | Function |
---|---|
Wan2.1-Fun-Control-14B | Main model for video generation (FP8 optimized). |
AnimeLineArtPreprocessor | Extracts line art from input video for style control. |
DepthAnythingPreprocessor | Generates depth maps for spatial consistency. |
Florence2-Flux-Large | Auto-generates captions for video frames. |
3. Key Nodes & Installation
Node Name | Function | Installation |
---|---|---|
WanVideoWrapper | Core nodes for video generation (model loading, sampling, encoding). | GitHub: |
ControlNet Aux | Preprocessors for line art and depth maps. | ComfyUI Manager: |
Video Helper Suite | Video loading/combining tools. | ComfyUI Manager: |
Florence2 | Image captioning. | GitHub: |
Required Models:
- `Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors` (main model, FP8)
- `umt5-xxl-enc-bf16.safetensors` (T5 encoder)
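Before loading the workflow, it can help to confirm that both required files are in place. The following is a minimal sketch: the file names come from the list above, but the subfolder names (`diffusion_models`, `text_encoders`) and the `ComfyUI/models` root are assumptions about a typical ComfyUI layout and may differ in your installation.

```python
from pathlib import Path

# File names come from the Required Models list; the subfolder names are
# ASSUMPTIONS about a typical ComfyUI layout, not something the workflow states.
REQUIRED_MODELS = {
    "Wan2.1-Fun-Control-14B_fp8_e4m3fn.safetensors": "diffusion_models",
    "umt5-xxl-enc-bf16.safetensors": "text_encoders",
}


def find_missing_models(models_root):
    """Return the relative paths of required model files that are absent."""
    root = Path(models_root)
    missing = []
    for filename, subfolder in REQUIRED_MODELS.items():
        candidate = root / subfolder / filename
        if not candidate.is_file():
            missing.append(str(Path(subfolder) / filename))
    return missing


# "ComfyUI/models" is a placeholder path; point it at your own install.
for path in find_missing_models("ComfyUI/models"):
    print(f"missing: {path}")
```

An empty result means both files were found under the assumed folders.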
4. Workflow Structure
Input Group (`上传视频和参考图`, "Upload video and reference image"):
- Inputs: Raw video (`VHS_LoadVideo`), reference image (`LoadImage`).
- Process: Frame extraction → line art + depth map generation, then caption generation via `Florence2Run`.
- Outputs: Preprocessed images + text prompts.
Model Loading (`wan模型`, "Wan model"): Loads Wan2.1, the T5 encoder, and the VAE, and configures optimizations (`TorchCompile`, `BlockSwap`).
Generation Group (`采样生成`, "Sampling and generation"):
- Inputs: Preprocessed images, text prompts, control args.
- Process: Text encoding (`WanVideoTextEncode`) → image encoding (`WanVideoImageToVideoEncode`) → sampling (`WanVideoSampler`).
- Outputs: Latent video representation.
Output Group: Decodes the latents to images (`WanVideoDecode`) → combines them into a video (`VHS_VideoCombine`).
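The four groups form a simple dependency chain: each one can only run once the data it consumes has been produced. The sketch below models that chain at group level and checks that an ordering is valid. The stage and data labels are informal names chosen for this illustration (not ComfyUI identifiers), and the exact wiring, such as which group consumes the VAE, is an assumption.

```python
# (stage name, data it needs, data it produces).
# Labels are informal and hypothetical; they only mirror the group descriptions.
STAGES = [
    ("input", {"raw_video", "reference_image"},
     {"preprocessed_images", "text_prompts"}),
    ("model_loading", set(), {"wan_model", "t5_encoder", "vae"}),
    ("generation",
     {"preprocessed_images", "text_prompts", "wan_model", "t5_encoder"},
     {"latents"}),
    ("output", {"latents", "vae"}, {"stylized_video"}),
]


def check_data_flow(stages, external_inputs):
    """Walk the stages in order and fail if one needs data nobody produced yet."""
    available = set(external_inputs)
    for name, needs, produces in stages:
        missing = needs - available
        if missing:
            raise ValueError(f"stage {name!r} is missing inputs: {sorted(missing)}")
        available |= produces
    return available


result = check_data_flow(STAGES, {"raw_video", "reference_image"})
print("stylized_video" in result)
```

Running the stages in reverse order raises a `ValueError`, because the output group would be asked to decode latents that have not been sampled yet.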
5. Inputs & Outputs
Inputs:
- Video (MP4), reference image (PNG).
- Resolution: 768x768 (adjusted via `ImageResizeKJ`).
- Prompts: Auto-generated (Florence2) or manual (the example includes positive and negative prompts).
Output:
- Stylized video (H.264 MP4, 16 fps).
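If the source video is not square, one option is to fit it inside the 768x768 target while keeping the aspect ratio. The helper below is a hypothetical illustration of that arithmetic, not the implementation of `ImageResizeKJ`; rounding each side down to a multiple of 16 is an assumption made for this sketch.

```python
def fit_resolution(width, height, target=768, multiple=16):
    """Scale (width, height) to fit inside target x target, keeping aspect ratio.

    Each side is rounded down to a multiple of `multiple` (an assumed
    constraint for this sketch) and never drops below one multiple.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled_w = width * target // longest
    scaled_h = height * target // longest
    new_w = max(multiple, scaled_w // multiple * multiple)
    new_h = max(multiple, scaled_h // multiple * multiple)
    return new_w, new_h


print(fit_resolution(1920, 1080))  # (768, 432)
print(fit_resolution(1000, 1000))  # (768, 768)
```

A 1920x1080 clip therefore maps to 768x432 under these assumptions, while a square input fills the full 768x768.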
6. Notes
VRAM: Minimum 16GB (recommended 24GB+ due to Wan2.1 size).
Common Errors:
- Frame limit exceeded: Adjust `frame_load_cap` (currently 81 frames).
- Line art failure: Ensure the input video has motion.
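When adjusting `frame_load_cap`, it helps to know how many frames will actually be used and how long the resulting clip is at 16 fps. The sketch below assumes that valid frame counts have the form 4n + 1 (81 = 4·20 + 1), which is a common convention for Wan-family video models but is not stated in this workflow; both function names are hypothetical.

```python
def snap_frame_count(requested, cap=81):
    """Largest frame count of the form 4*n + 1 that is <= min(requested, cap).

    The 4*n + 1 rule is an ASSUMPTION for this sketch.
    """
    limit = min(requested, cap)
    if limit < 1:
        raise ValueError("need at least one frame")
    return (limit - 1) // 4 * 4 + 1


def clip_seconds(frames, fps=16):
    """Playback duration in seconds for a given frame count."""
    return frames / fps


frames = snap_frame_count(100)        # capped at 81
print(frames, clip_seconds(frames))   # 81 5.0625
```

With the current cap of 81 frames, the output clip is therefore just over 5 seconds long at 16 fps.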
Optimization:
- Enable `fp8` mode for lower VRAM usage.
- Tweak `BlockSwap` for memory management.
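A rough back-of-the-envelope estimate shows why `fp8` and `BlockSwap` matter for a 14B-parameter model. The sketch below counts weight storage only; activations, the T5 encoder, and the VAE are ignored. The block-swap estimate assumes all parameters sit in equally sized blocks, and the block count of 40 in the example is an assumed value for illustration.

```python
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2, "fp16": 2, "fp32": 4}


def weight_gb(num_params, dtype):
    """Size of the model weights in decimal gigabytes (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def resident_weight_gb(num_params, dtype, total_blocks, blocks_swapped):
    """Weights left on the GPU after offloading some blocks to system RAM.

    Simplification: assumes every parameter lives in one of `total_blocks`
    equally sized blocks, which is only approximately true in practice.
    """
    if not 0 <= blocks_swapped <= total_blocks:
        raise ValueError("blocks_swapped must be between 0 and total_blocks")
    kept_fraction = (total_blocks - blocks_swapped) / total_blocks
    return weight_gb(num_params, dtype) * kept_fraction


params = 14e9  # "14B" from the model name
print(weight_gb(params, "fp8"))   # 14.0
print(weight_gb(params, "bf16"))  # 28.0
# 40 blocks is an assumed count for illustration only.
print(resident_weight_gb(params, "fp8", total_blocks=40, blocks_swapped=20))  # 7.0
```

Under these assumptions the weights alone take about 14 GB in fp8 versus 28 GB in bf16, which is consistent with the 16GB minimum above, and swapping half of the blocks would roughly halve the resident weights at the cost of extra transfers between system RAM and the GPU.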