Bring Your Images to Life: AI-Driven Video Generation with Sonic Diffusion and NTCosyVoice

ComfyUI.org
2025-03-11 08:14:19

Workflow Overview

[Animated demo of the generated digital-human video]

This workflow is an Image-to-Video (I2V) generation pipeline built on the Sonic Diffusion model, combined with voice cloning. It turns a static image and a driving audio track into an animated video, while a separate branch generates synthetic speech in the style of a reference recording. The workflow is split into two parts: voice cloning (NTCosyVoice for TTS) and digital-human generation (Sonic Diffusion for video). The final output is a 256x256, 25-frame MP4 video, suitable for digital-human animation or short-video creation.
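For orientation, the whole graph can also be queued programmatically instead of through the UI. Below is a minimal sketch, assuming a local ComfyUI server on the default port (8188) and a workflow exported via "Save (API Format)"; the filename sonic_i2v_api.json is hypothetical:

# Minimal sketch: queue an exported workflow against a local ComfyUI server.
# Assumes ComfyUI runs on its default port (8188) and the graph was exported
# with "Save (API Format)" as sonic_i2v_api.json (hypothetical filename).
import json
import urllib.request

with open("sonic_i2v_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The server replies with a prompt_id that identifies the queued job.
    print(json.loads(resp.read()))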

Core Models

  1. Sonic Diffusion (svd_xt_1_1.safetensors)

    • Function: An I2V model based on Stable Video Diffusion (SVD), extending static images into video frames.

    • Source: Download from the official Sonic repository or Hugging Face and place it in ComfyUI/models/checkpoints/.

  2. Sonic UNet (unet.pth)

    • Function: The core UNet network for Sonic Diffusion, driving video frame generation.

    • Source: Download from the official Sonic repository and place it in the path the plugin expects (typically ComfyUI/models/unet/); a preflight sketch after this list shows how to verify both files.

  3. NTCosyVoice (Built-in Model)

    • Function: A Text-to-Speech (TTS) model with voice cloning and emotion control; it generates synthetic speech that matches a reference recording.

    • Source: Automatically loaded via ComfyUI_NTCosyVoice plugin, no manual download required.
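Before running the workflow, it can save time to confirm the two Sonic model files are in place. A minimal preflight sketch, assuming ComfyUI is installed in a ComfyUI/ directory relative to where the script runs:

# Preflight sketch: confirm the two Sonic model files are where this
# workflow expects them. The ComfyUI root below is an assumption; adjust it.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # assumed install location
required = [
    COMFYUI_ROOT / "models" / "checkpoints" / "svd_xt_1_1.safetensors",
    COMFYUI_ROOT / "models" / "unet" / "unet.pth",
]
for path in required:
    status = "OK" if path.is_file() else "MISSING"
    print(f"{status}: {path}")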

Component Explanation

  1. ImageOnlyCheckpointLoader

    • Purpose: Loads the Sonic Diffusion checkpoint.

    • Function: Outputs model, CLIP Vision, and VAE for video generation.

    • Installation: Built into ComfyUI.

    • Dependencies: Requires svd_xt_1_1.safetensors.

  2. SONICTLoader

    • Purpose: Loads the Sonic UNet model and sets precision.

    • Function: Outputs a Sonic-specific model (MODEL_SONIC) and data type (fp16) for optimized generation.

    • Installation: Requires ComfyUI_Sonic plugin, install via ComfyUI Manager (search “Sonic”) or GitHub (https://github.com/smthemex/ComfyUI_Sonic).

    • Dependencies: Requires unet.pth.

  3. SONIC_PreData

    • Purpose: Prepares data for Sonic Diffusion.

    • Function: Combines image, audio, CLIP Vision, and VAE data, setting frame count (25) and conditioning strength (0.5).

    • Installation: Requires ComfyUI_Sonic plugin.

  4. SONICSampler

    • Purpose: Executes Sonic Diffusion sampling to generate video frames.

    • Function: Produces the image sequence and frame rate (25 FPS) from the preprocessed data.

    • Installation: Requires ComfyUI_Sonic plugin.

  5. LoadImage

    • Purpose: Loads the input image.

    • Function: Provides the static image as the basis for video generation.

    • Installation: Built into ComfyUI.

  6. LoadAudio

    • Purpose: Loads input audio files.

    • Function: Supplies audio for cloning or video conditioning.

    • Installation: Built into ComfyUI.

  7. NTCosyVoiceInstruct2Sampler

    • Purpose: Generates synthetic speech (TTS).

    • Function: Creates emotional speech (e.g., “happy”) based on input audio and text prompt.

    • Installation: Requires ComfyUI_NTCosyVoice plugin, install via ComfyUI Manager (search “NTCosyVoice”) or GitHub (https://github.com/muxueChen/ComfyUI_NTCosyVoice).

  8. PreviewAudio

    • Purpose: Previews the generated TTS audio.

    • Function: Used for debugging or verifying speech output.

    • Installation: Built into ComfyUI.

  9. VHS_VideoCombine

    • Purpose: Combines image sequences into a video.

    • Function: Outputs an MP4 video; supports audio syncing (unused here) and frame-rate adjustment (25 FPS).

    • Installation: Requires ComfyUI-VideoHelperSuite, install via ComfyUI Manager (search “VideoHelperSuite”) or GitHub (https://github.com/kosinkadink/ComfyUI-VideoHelperSuite). A quick check for all three plugins is sketched after this list.
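Missing custom nodes are the most common reason this workflow fails to load. A small sketch that checks for the three plugin directories, assuming the default folder names produced by a git clone under ComfyUI/custom_nodes:

# Sketch: verify the three custom-node packages are installed under
# ComfyUI/custom_nodes. Folder names assume a default git clone; ComfyUI
# Manager typically uses the same names.
from pathlib import Path

CUSTOM_NODES = Path("ComfyUI") / "custom_nodes"  # assumed install root
plugins = ["ComfyUI_Sonic", "ComfyUI_NTCosyVoice", "ComfyUI-VideoHelperSuite"]
for name in plugins:
    status = "installed" if (CUSTOM_NODES / name).is_dir() else "missing"
    print(f"{name}: {status}")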

Workflow Structure

  1. Voice Cloning Group (Group 1: 声音克隆)

    • Nodes: LoadAudio (18) → NTCosyVoiceInstruct2Sampler → PreviewAudio

    • Role: Loads reference audio (hy.WAV), generates synthetic speech with a “happy” emotion from the prompt (“你好,我是马斯克,我爱你们”), and previews it.

    • Input Parameters: Audio file, text prompt, emotion (happy).

    • Output: Synthetic speech (AUDIO).

  2. Digital Human Generation Group (Group 2: 数字人生成)

    • Nodes: ImageOnlyCheckpointLoader → SONICTLoader → LoadImage → LoadAudio (11) → SONIC_PreData → SONICSampler → VHS_VideoCombine (an API-format sketch of this chain follows the list).

    • Role: Loads the model and input data (image and audio), generates video frames, and combines them into an MP4 file.

    • Input Parameters: Image (ComfyUI_temp_kbxmh_00003_.png), audio (杨幂.WAV), frame count (25), resolution (256x256).

    • Output: 25-frame 256x256 MP4 video.
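To make the wiring of Group 2 concrete, here is a hedged sketch of the digital-human chain in ComfyUI's API format: a dict mapping node IDs to class_type and inputs, with links written as ["source_node_id", output_index]. The class names come from the component list above, but the exact input names of the Sonic nodes are assumptions inferred from their descriptions; check the real node widgets before using this.

# Hedged sketch of the Group 2 node graph in ComfyUI API format.
# Sonic node input names ("dtype", "frames", "strength", "data", ...) are
# assumptions; the class names and values come from this article.
chain = {
    "1": {"class_type": "ImageOnlyCheckpointLoader",
          "inputs": {"ckpt_name": "svd_xt_1_1.safetensors"}},
    "2": {"class_type": "SONICTLoader",
          "inputs": {"model": ["1", 0], "dtype": "fp16"}},      # assumed names
    "3": {"class_type": "LoadImage",
          "inputs": {"image": "ComfyUI_temp_kbxmh_00003_.png"}},
    "4": {"class_type": "LoadAudio",
          "inputs": {"audio": "杨幂.WAV"}},
    "5": {"class_type": "SONIC_PreData",                        # frame count and
          "inputs": {"clip_vision": ["1", 1], "vae": ["1", 2],  # strength per the text
                     "image": ["3", 0], "audio": ["4", 0],
                     "frames": 25, "strength": 0.5}},           # assumed names
    "6": {"class_type": "SONICSampler",
          "inputs": {"model": ["2", 0], "data": ["5", 0]}},     # assumed names
    "7": {"class_type": "VHS_VideoCombine",
          "inputs": {"images": ["6", 0], "frame_rate": 25,
                     "filename_prefix": "AnimateDiff",
                     "format": "video/h264-mp4"}},
}

The Group 1 chain (LoadAudio → NTCosyVoiceInstruct2Sampler → PreviewAudio) would be added to the same dict, and the whole prompt submitted as in the overview sketch above.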

Inputs and Outputs

  • Expected Inputs:

    • Image: ComfyUI_temp_kbxmh_00003_.png (256x256).

    • Audio (Digital Human): 杨幂.WAV (for conditioning).

    • Audio (Voice Cloning): hy.WAV (for TTS reference).

    • Text Prompt (TTS): “你好,我是马斯克,我爱你们” (“Hello, I’m Musk, I love you all”).

    • Emotion (TTS): “happy”.

    • Resolution: 256x256.

    • Frame Count: 25.

    • Frame Rate: 25 FPS.

  • Final Output:

    • 256x256, 25-frame MP4 video (AnimateDiff_00005.mp4), without audio (TTS not integrated).

    • Optional: TTS synthetic speech (previewed, not in video).

Notes and Tips

  1. Resource Requirements: Sonic Diffusion requires at least 8GB VRAM; reduce frame count or resolution if VRAM is limited.

  2. Model Files: Ensure svd_xt_1_1.safetensors and unet.pth are correctly placed, or errors will occur.

  3. Plugin Installation: Install ComfyUI_Sonic, ComfyUI_NTCosyVoice, and ComfyUI-VideoHelperSuite, or nodes will be unavailable.

  4. Audio Integration: The TTS audio isn’t connected to VHS_VideoCombine, so the output video is silent; connect it manually for in-video sound, or mux it in afterwards (see the sketch after these notes).

  5. Performance Optimization: Reduce frames (e.g., from 25 to 15) or use fp16 precision if generation is slow.
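As a workaround for Note 4, the previewed TTS track can also be muxed into the silent MP4 after generation with ffmpeg. A sketch, assuming the TTS audio was saved as tts_output.wav (hypothetical name) and ffmpeg is on the PATH:

# Sketch for Note 4's workaround: mux the TTS track into the silent MP4.
# File names match this workflow's output; tts_output.wav is an assumed
# name for the audio saved from the PreviewAudio step.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "AnimateDiff_00005.mp4",   # silent video from VHS_VideoCombine
    "-i", "tts_output.wav",          # assumed filename for the saved TTS audio
    "-c:v", "copy",                  # keep the video stream as-is
    "-c:a", "aac",                   # encode audio for the MP4 container
    "-shortest",                     # stop at the shorter of the two streams
    "AnimateDiff_00005_with_audio.mp4",
], check=True)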