Bring Your Images to Life: AI-Driven Video Generation with Sonic Diffusion and NTCosyVoice
Workflow Overview

This workflow is an Image-to-Video (I2V) pipeline built on the Sonic Diffusion model, combined with voice cloning. It turns a static image and an audio track into a dynamic video, and in parallel generates synthetic speech that matches the style of a reference recording. The workflow splits into two parts: voice cloning (NTCosyVoice for TTS) and digital human generation (Sonic Diffusion for video). The final output is a 256x256, 25-frame MP4 video, suitable for digital human animation or short-video creation.
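For readers who drive ComfyUI programmatically rather than through the UI, the whole graph can also be queued over ComfyUI's HTTP API once it has been exported in API format. Below is a minimal sketch; the server address (the default 127.0.0.1:8188) and the export file name sonic_workflow.json are assumptions for illustration.

```python
# Minimal sketch: queue an API-format workflow export on a local ComfyUI server.
# Assumes ComfyUI runs at the default 127.0.0.1:8188 and the graph was saved
# via "Save (API Format)" as sonic_workflow.json -- both are assumptions.
import json
import urllib.request

with open("sonic_workflow.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The response includes the prompt_id of the queued job.
    print(json.loads(resp.read()))
```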
Core Models
Sonic Diffusion (svd_xt_1_1.safetensors)
Function: An I2V model based on Stable Video Diffusion (SVD), extending static images into video frames.
Source: Download from the official Sonic release or Hugging Face and place it in ComfyUI/models/checkpoints/.
Sonic UNet (unet.pth)
Function: The core UNet network for Sonic Diffusion, driving video frame generation.
Source: Download from the official Sonic repository and place it in the path the plugin expects (typically ComfyUI/models/unet/).
NTCosyVoice (Built-in Model)
Function: A Text-to-Speech (TTS) model with voice cloning and emotion control, generating synthetic speech matching input audio.
Source: Loaded automatically by the ComfyUI_NTCosyVoice plugin; no manual download is required.
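Before running anything, it helps to verify that both model files sit where the loaders will look for them. A minimal sanity check, assuming the standard directory layout described above (adjust COMFYUI_ROOT to your install):

```python
# Sanity check for the two model files this workflow needs.
# COMFYUI_ROOT is an assumption -- point it at your actual ComfyUI folder.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")
required = [
    COMFYUI_ROOT / "models" / "checkpoints" / "svd_xt_1_1.safetensors",
    COMFYUI_ROOT / "models" / "unet" / "unet.pth",
]
for path in required:
    print(f"{'OK' if path.is_file() else 'MISSING'}: {path}")
```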
Component Explanation
ImageOnlyCheckpointLoader
Purpose: Loads the Sonic Diffusion checkpoint.
Function: Outputs model, CLIP Vision, and VAE for video generation.
Installation: Built into ComfyUI.
Dependencies: Requires svd_xt_1_1.safetensors.
SONICTLoader
Purpose: Loads the Sonic UNet model and sets precision.
Function: Outputs a Sonic-specific model (MODEL_SONIC) and data type (fp16) for optimized generation.
Installation: Requires ComfyUI_Sonic plugin, install via ComfyUI Manager (search “Sonic”) or GitHub (https://github.com/smthemex/ComfyUI_Sonic).
Dependencies: Requires unet.pth.
SONIC_PreData
Purpose: Prepares data for Sonic Diffusion.
Function: Combines image, audio, CLIP Vision, and VAE data, setting frame count (25) and conditioning strength (0.5).
Installation: Requires ComfyUI_Sonic plugin.
SONICSampler
Purpose: Executes Sonic Diffusion sampling to generate video frames.
Function: Produces the image sequence and frame rate (25 FPS) from the preprocessed data.
Installation: Requires ComfyUI_Sonic plugin.
LoadImage
Purpose: Loads the input image.
Function: Provides the static image as the basis for video generation.
Installation: Built into ComfyUI.
LoadAudio
Purpose: Loads input audio files.
Function: Supplies audio for cloning or video conditioning.
Installation: Built into ComfyUI.
NTCosyVoiceInstruct2Sampler
Purpose: Generates synthetic speech (TTS).
Function: Creates speech with a specified emotion (e.g., “happy”) from the reference audio and a text prompt.
Installation: Requires ComfyUI_NTCosyVoice plugin, install via ComfyUI Manager (search “NTCosyVoice”) or GitHub (https://github.com/muxueChen/ComfyUI_NTCosyVoice).
PreviewAudio
Purpose: Previews the generated TTS audio.
Function: Used for debugging or verifying speech output.
Installation: Built into ComfyUI.
VHS_VideoCombine
Purpose: Combines image sequences into a video.
Function: Outputs an MP4 video; supports audio muxing (unused here) and frame-rate adjustment (25 FPS).
Installation: Requires ComfyUI-VideoHelperSuite, install via ComfyUI Manager (search “VideoHelperSuite”) or GitHub (https://github.com/kosinkadink/ComfyUI-VideoHelperSuite).
Workflow Structure
Voice Cloning Group (Group 1: 声音克隆)
Nodes: LoadAudio (18) → NTCosyVoiceInstruct2Sampler → PreviewAudio
Role: Loads the reference audio (hy.WAV), generates synthetic speech with a “happy” emotion from the prompt “你好,我是马斯克,我爱你们” (“Hello, I’m Musk, I love you all”), and previews it.
Input Parameters: Audio file, text prompt, emotion (happy).
Output: Synthetic speech (AUDIO).
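In API-format terms the group reduces to three nodes. The sketch below shows them as a Python dict; the node ids other than 18 and the input names of NTCosyVoiceInstruct2Sampler are illustrative assumptions, so verify them against the widgets of your installed plugin version.

```python
# Rough API-format sketch of the voice-cloning group.
# Ids "19"/"20" and the sampler's input names are assumptions.
voice_clone_group = {
    "18": {  # LoadAudio: the reference voice
        "class_type": "LoadAudio",
        "inputs": {"audio": "hy.WAV"},
    },
    "19": {  # NTCosyVoiceInstruct2Sampler: TTS with emotion control
        "class_type": "NTCosyVoiceInstruct2Sampler",
        "inputs": {
            "audio": ["18", 0],                # reference audio from node 18
            "text": "你好,我是马斯克,我爱你们",   # text to synthesize
            "instruct": "happy",               # emotion instruction (name assumed)
        },
    },
    "20": {  # PreviewAudio: listen to the synthesized speech
        "class_type": "PreviewAudio",
        "inputs": {"audio": ["19", 0]},
    },
}
```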
Digital Human Generation Group (Group 2: 数字人生成)
Nodes: ImageOnlyCheckpointLoader → SONICTLoader → LoadImage → LoadAudio (11) → SONIC_PreData → SONICSampler → VHS_VideoCombine
Role: Loads the model and input data (image and audio), generates video frames, and combines them into an MP4 file.
Input Parameters: Image (ComfyUI_temp_kbxmh_00003_.png), audio (杨幂.WAV), frame count (25), resolution (256x256).
Output: 25-frame 256x256 MP4 video.
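The same style of sketch for this group; the input names of the three Sonic nodes are assumptions, so check them against the node definitions shipped with ComfyUI_Sonic before reusing this.

```python
# Rough API-format sketch of the digital-human group.
# Input names on SONICTLoader / SONIC_PreData / SONICSampler are assumptions.
digital_human_group = {
    "1": {"class_type": "ImageOnlyCheckpointLoader",
          "inputs": {"ckpt_name": "svd_xt_1_1.safetensors"}},
    "2": {"class_type": "SONICTLoader",  # loads unet.pth at fp16 (names assumed)
          "inputs": {"model": ["1", 0], "sonic_unet": "unet.pth", "dtype": "fp16"}},
    "10": {"class_type": "LoadImage",
           "inputs": {"image": "ComfyUI_temp_kbxmh_00003_.png"}},
    "11": {"class_type": "LoadAudio",
           "inputs": {"audio": "杨幂.WAV"}},
    "3": {"class_type": "SONIC_PreData",  # bundles image/audio/CLIP Vision/VAE
          "inputs": {"clip_vision": ["1", 1], "vae": ["1", 2],
                     "image": ["10", 0], "audio": ["11", 0],
                     "frames": 25, "strength": 0.5}},
    "4": {"class_type": "SONICSampler",  # emits the frame sequence and FPS
          "inputs": {"model": ["2", 0], "data": ["3", 0]}},
    "5": {"class_type": "VHS_VideoCombine",  # muxes frames into an MP4
          "inputs": {"images": ["4", 0], "frame_rate": 25,
                     "format": "video/h264-mp4",
                     "filename_prefix": "AnimateDiff"}},
}
```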
Inputs and Outputs
Expected Inputs:
Image: ComfyUI_temp_kbxmh_00003_.png (256x256).
Audio (Digital Human): 杨幂.WAV (for conditioning).
Audio (Voice Cloning): hy.WAV (for TTS reference).
Text Prompt (TTS): “你好,我是马斯克,我爱你们” (“Hello, I’m Musk, I love you all”).
Emotion (TTS): “happy”.
Resolution: 256x256.
Frame Count: 25.
Frame Rate: 25 FPS.
Final Output:
A 256x256, 25-frame MP4 video (AnimateDiff_00005.mp4) without an audio track (the TTS output is not wired in).
Optional: the synthesized TTS speech (previewed, but not included in the video).
Notes and Tips
Resource Requirements: Sonic Diffusion requires at least 8GB VRAM; reduce frame count or resolution if VRAM is limited.
Model Files: Ensure svd_xt_1_1.safetensors and unet.pth are in the locations described above, or the loader nodes will fail.
Plugin Installation: Install ComfyUI_Sonic, ComfyUI_NTCosyVoice, and ComfyUI-VideoHelperSuite, or nodes will be unavailable.
Audio Integration: The TTS audio isn’t connected to VHS_VideoCombine in this workflow; connect it manually to add sound to the video (see the sketch after this list).
Performance Optimization: Reduce frames (e.g., from 25 to 15) or use fp16 precision if generation is slow.
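To act on the audio-integration tip above, route the TTS output into VHS_VideoCombine's optional audio input. Continuing the earlier API-format sketches (node ids and input names remain assumptions):

```python
# Give VHS_VideoCombine (node "5") the NTCosyVoice output (node "19") as its
# optional audio input so the final MP4 carries a sound track.
# Ids and the "audio" input name follow the sketches above -- assumptions.
digital_human_group["5"]["inputs"]["audio"] = ["19", 0]
```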