Bring Your Images to Life: AI-Driven Video Generation with Sonic Diffusion and NTCosyVoice
Workflow Overview

This workflow is an Image-to-Video (I2V) pipeline built on the Sonic Diffusion model, combined with voice cloning. It turns a static image and an audio track into a dynamic video, and in parallel generates synthetic speech that matches the style of a reference recording. The workflow splits into two parts: voice cloning (NTCosyVoice for TTS) and digital human generation (Sonic Diffusion for video). The final output is a 256x256, 25-frame MP4 video, suitable for digital human animation or short-video creation.
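For readers who drive ComfyUI programmatically rather than through the UI, the whole graph can also be queued over ComfyUI's HTTP API once it has been exported in API format. Below is a minimal sketch; the server address (the default 127.0.0.1:8188) and the export file name sonic_workflow.json are assumptions for illustration.

```python
# Minimal sketch: queue an API-format workflow export on a local ComfyUI server.
# Assumes ComfyUI runs at the default 127.0.0.1:8188 and the graph was saved
# via "Save (API Format)" as sonic_workflow.json -- both are assumptions.
import json
import urllib.request

with open("sonic_workflow.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The response includes the prompt_id of the queued job.
    print(json.loads(resp.read()))
```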
Core Models
Sonic Diffusion (svd_xt_1_1.safetensors)
Function: An I2V model based on Stable Video Diffusion (SVD), extending static images into video frames.
Source: Download from the official Sonic release or Hugging Face and place it in ComfyUI/models/checkpoints/.
Sonic UNet (unet.pth)
Function: The core UNet network for Sonic Diffusion, driving video frame generation.
Source: Download from the official Sonic repository and place it in the path the plugin expects (typically ComfyUI/models/unet/).
NTCosyVoice (Built-in Model)
Function: A Text-to-Speech (TTS) model with voice cloning and emotion control, generating synthetic speech matching input audio.
Source: Loaded automatically by the ComfyUI_NTCosyVoice plugin; no manual download is required.
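Before running anything, it helps to verify that both model files sit where the loaders will look for them. A minimal sanity check, assuming the standard directory layout described above (adjust COMFYUI_ROOT to your install):

```python
# Sanity check for the two model files this workflow needs.
# COMFYUI_ROOT is an assumption -- point it at your actual ComfyUI folder.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")
required = [
    COMFYUI_ROOT / "models" / "checkpoints" / "svd_xt_1_1.safetensors",
    COMFYUI_ROOT / "models" / "unet" / "unet.pth",
]
for path in required:
    print(f"{'OK' if path.is_file() else 'MISSING'}: {path}")
```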
Component Explanation
ImageOnlyCheckpointLoader
Purpose: Loads the Sonic Diffusion checkpoint.
Function: Outputs model, CLIP Vision, and VAE for video generation.
Installation: Built into ComfyUI.
Dependencies: Requires svd_xt_1_1.safetensors.
SONICTLoader
Purpose: Loads the Sonic UNet model and sets precision.
Function: Outputs a Sonic-specific model (MODEL_SONIC) and data type (fp16) for optimized generation.
Installation: Requires ComfyUI_Sonic plugin, install via ComfyUI Manager (search “Sonic”) or GitHub (https://github.com/smthemex/ComfyUI_Sonic).
Dependencies: Requires unet.pth.
SONIC_PreData
Purpose: Prepares data for Sonic Diffusion.
Function: Combines image, audio, CLIP Vision, and VAE data, setting frame count (25) and conditioning strength (0.5).
Installation: Requires ComfyUI_Sonic plugin.
SONICSampler
Purpose: Executes Sonic Diffusion sampling to generate video frames.
Function: Produces the image sequence and frame rate (25 FPS) from the preprocessed data.
Installation: Requires ComfyUI_Sonic plugin.
LoadImage
Purpose: Loads the input image.
Function: Provides the static image as the basis for video generation.
Installation: Built into ComfyUI.
LoadAudio
Purpose: Loads input audio files.
Function: Supplies audio for cloning or video conditioning.
Installation: Built into ComfyUI.
NTCosyVoiceInstruct2Sampler
Purpose: Generates synthetic speech (TTS).
Function: Creates speech with a specified emotion (e.g., “happy”) from the reference audio and a text prompt.
Installation: Requires ComfyUI_NTCosyVoice plugin, install via ComfyUI Manager (search “NTCosyVoice”) or GitHub (https://github.com/muxueChen/ComfyUI_NTCosyVoice).
PreviewAudio
Purpose: Previews the generated TTS audio.
Function: Used for debugging or verifying speech output.
Installation: Built into ComfyUI.
VHS_VideoCombine
Purpose: Combines image sequences into a video.
Function: Outputs an MP4 video; supports audio muxing (unused here) and frame-rate adjustment (25 FPS).
Installation: Requires ComfyUI-VideoHelperSuite, install via ComfyUI Manager (search “VideoHelperSuite”) or GitHub (https://github.com/kosinkadink/ComfyUI-VideoHelperSuite).
Workflow Structure
Voice Cloning Group (Group 1: 声音克隆)
Nodes: LoadAudio (18) → NTCosyVoiceInstruct2Sampler → PreviewAudio
Role: Loads the reference audio (hy.WAV), generates synthetic speech with a “happy” emotion from the prompt “你好,我是马斯克,我爱你们” (“Hello, I’m Musk, I love you all”), and previews it.
Input Parameters: Audio file, text prompt, emotion (happy).
Output: Synthetic speech (AUDIO).
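In API-format terms the group reduces to three nodes. The sketch below shows them as a Python dict; the node ids other than 18 and the input names of NTCosyVoiceInstruct2Sampler are illustrative assumptions, so verify them against the widgets of your installed plugin version.

```python
# Rough API-format sketch of the voice-cloning group.
# Ids "19"/"20" and the sampler's input names are assumptions.
voice_clone_group = {
    "18": {  # LoadAudio: the reference voice
        "class_type": "LoadAudio",
        "inputs": {"audio": "hy.WAV"},
    },
    "19": {  # NTCosyVoiceInstruct2Sampler: TTS with emotion control
        "class_type": "NTCosyVoiceInstruct2Sampler",
        "inputs": {
            "audio": ["18", 0],                # reference audio from node 18
            "text": "你好,我是马斯克,我爱你们",   # text to synthesize
            "instruct": "happy",               # emotion instruction (name assumed)
        },
    },
    "20": {  # PreviewAudio: listen to the synthesized speech
        "class_type": "PreviewAudio",
        "inputs": {"audio": ["19", 0]},
    },
}
```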
Digital Human Generation Group (Group 2: 数字人生成)
Nodes: ImageOnlyCheckpointLoader → SONICTLoader → LoadImage → LoadAudio (11) → SONIC_PreData → SONICSampler → VHS_VideoCombine
Role: Loads the model and input data (image and audio), generates video frames, and combines them into an MP4 file.
Input Parameters: Image (ComfyUI_temp_kbxmh_00003_.png), audio (杨幂.WAV), frame count (25), resolution (256x256).
Output: 25-frame 256x256 MP4 video.
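The same style of sketch for this group; the input names of the three Sonic nodes are assumptions, so check them against the node definitions shipped with ComfyUI_Sonic before reusing this.

```python
# Rough API-format sketch of the digital-human group.
# Input names on SONICTLoader / SONIC_PreData / SONICSampler are assumptions.
digital_human_group = {
    "1": {"class_type": "ImageOnlyCheckpointLoader",
          "inputs": {"ckpt_name": "svd_xt_1_1.safetensors"}},
    "2": {"class_type": "SONICTLoader",  # loads unet.pth at fp16 (names assumed)
          "inputs": {"model": ["1", 0], "sonic_unet": "unet.pth", "dtype": "fp16"}},
    "10": {"class_type": "LoadImage",
           "inputs": {"image": "ComfyUI_temp_kbxmh_00003_.png"}},
    "11": {"class_type": "LoadAudio",
           "inputs": {"audio": "杨幂.WAV"}},
    "3": {"class_type": "SONIC_PreData",  # bundles image/audio/CLIP Vision/VAE
          "inputs": {"clip_vision": ["1", 1], "vae": ["1", 2],
                     "image": ["10", 0], "audio": ["11", 0],
                     "frames": 25, "strength": 0.5}},
    "4": {"class_type": "SONICSampler",  # emits the frame sequence and FPS
          "inputs": {"model": ["2", 0], "data": ["3", 0]}},
    "5": {"class_type": "VHS_VideoCombine",  # muxes frames into an MP4
          "inputs": {"images": ["4", 0], "frame_rate": 25,
                     "format": "video/h264-mp4",
                     "filename_prefix": "AnimateDiff"}},
}
```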
Inputs and Outputs
Expected Inputs:
Image: ComfyUI_temp_kbxmh_00003_.png (256x256).
Audio (Digital Human): 杨幂.WAV (for conditioning).
Audio (Voice Cloning): hy.WAV (for TTS reference).
Text Prompt (TTS): “你好,我是马斯克,我爱你们” (“Hello, I’m Musk, I love you all”).
Emotion (TTS): “happy”.
Resolution: 256x256.
Frame Count: 25.
Frame Rate: 25 FPS.
Final Output:
A 256x256, 25-frame MP4 video (AnimateDiff_00005.mp4) without an audio track (the TTS output is not wired in).
Optional: the synthesized TTS speech (previewed, but not included in the video).
Notes and Tips
Resource Requirements: Sonic Diffusion requires at least 8GB VRAM; reduce frame count or resolution if VRAM is limited.
Model Files: Ensure svd_xt_1_1.safetensors and unet.pth are in the locations described above, or the loader nodes will fail.
Plugin Installation: Install ComfyUI_Sonic, ComfyUI_NTCosyVoice, and ComfyUI-VideoHelperSuite, or nodes will be unavailable.
Audio Integration: The TTS audio isn’t connected to VHS_VideoCombine in this workflow; connect it manually to add sound to the video (see the sketch after this list).
Performance Optimization: Reduce frames (e.g., from 25 to 15) or use fp16 precision if generation is slow.
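To act on the audio-integration tip above, route the TTS output into VHS_VideoCombine's optional audio input. Continuing the earlier API-format sketches (node ids and input names remain assumptions):

```python
# Give VHS_VideoCombine (node "5") the NTCosyVoice output (node "19") as its
# optional audio input so the final MP4 carries a sound track.
# Ids and the "audio" input name follow the sketches above -- assumptions.
digital_human_group["5"]["inputs"]["audio"] = ["19", 0]
```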