Unlock the Power of Text-to-Speech with Index-TTS Workflow

ComfyUI.org
2025-05-13 12:43:06

1. Workflow Overview

This workflow converts text to natural speech using Index-TTS, supporting voice cloning and audio enhancement. Key features:

  • Text-to-Speech: Processes long texts (e.g., novels) into fluent speech.

  • Voice Cloning: Mimics speaker timbre from reference audio (e.g., 蔡徐坤.wav).

  • Noise Reduction: Cleans background noise for professional output.

Core Models:

  • Index-TTS: Main model for speech synthesis (requires plugin ComfyUI-Index-TTS).

  • Audio Tools: Noise removal (AudioCleanupNode), timbre loading (TimbreAudioLoader).


2. Key Nodes & Installation

| Node | Function | Installation |
| --- | --- | --- |
| IndexTTSNode | Converts text to speech with voice cloning. | Install plugin ComfyUI-Index-TTS (GitHub: chenpipi0807/ComfyUI-Index-TTS). |
| TimbreAudioLoader | Loads timbre templates (e.g., 抖音-读小说.wav). | Place audio files in ComfyUI/input. |
| AudioCleanupNode | Reduces noise (strength 0.7) and enhances audio. | Included in plugin. |
| LoadAudio | Loads reference audio (e.g., 蔡徐坤.wav). | Built-in node. |

Dependencies:

  • Index-TTS models (~2-3GB) auto-download on first use.


3. Workflow Structure

Group 1: Input & Voice Cloning

  • Inputs:

    • Reference audio via LoadAudio.

    • Timbre template via TimbreAudioLoader.

  • Steps:

    1. IndexTTSNode generates speech from text (e.g., novel chapters).

    2. Parameters: Speed (1.0), emotion (0.8), seed (1155511506).

Group 2: Post-Processing

  • Input: Raw generated audio.

  • Steps:

    1. AudioCleanupNode applies noise reduction, keeping the 100-8000 Hz range.

    2. SaveAudio exports the cleaned audio as a WAV file to audio/ComfyUI.

Group 3: Preview & Output

  • Preview: Listen via PreviewAudio.

  • Output: WAV file (e.g., ComfyUI_20240513_142301.wav).
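
The three groups above can be pictured as a small node graph. The sketch below writes that graph as a Python dictionary in the style of ComfyUI's API-format workflows, where each node has a class_type and a link is a [node_id, output_index] pair. The node class names and parameter values come from this article; every input key (reference_audio, emotion, strength, low_hz, high_hz, and so on) is a hypothetical placeholder, so check the node definitions in your installed plugin for the real names. The optional TimbreAudioLoader is left out because the article does not say how it connects.

```python
# Sketch of the workflow graph. Input key names are ASSUMPTIONS for
# illustration only; they are not taken from the plugin's source.
workflow = {
    "1": {"class_type": "LoadAudio",
          "inputs": {"audio": "reference.wav"}},          # hypothetical file name
    "2": {"class_type": "IndexTTSNode",
          "inputs": {"text": "Chapter 1 ...",
                     "reference_audio": ["1", 0],          # link to node 1, output 0
                     "speed": 1.0,
                     "emotion": 0.8,
                     "seed": 1155511506}},
    "3": {"class_type": "AudioCleanupNode",
          "inputs": {"audio": ["2", 0],
                     "strength": 0.7,
                     "low_hz": 100,
                     "high_hz": 8000}},
    "4": {"class_type": "SaveAudio",
          "inputs": {"audio": ["3", 0],
                     "filename_prefix": "audio/ComfyUI"}},
    "5": {"class_type": "PreviewAudio",
          "inputs": {"audio": ["3", 0]}},
}


def upstream(node_id, graph):
    """List class types needed to produce node_id, from source to sink."""
    order = []

    def visit(nid):
        for value in graph[nid]["inputs"].values():
            # A two-element list whose first item is a node id is a link.
            if isinstance(value, list) and len(value) == 2 and value[0] in graph:
                visit(value[0])
        name = graph[nid]["class_type"]
        if name not in order:
            order.append(name)

    visit(node_id)
    return order


print(upstream("4", workflow))
```

Walking upstream from the SaveAudio node lists the nodes in the order the groups describe: load the reference, synthesize, clean up, save.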


4. Inputs & Outputs

Inputs:

  • Text: Supports long texts (example: 4-chapter novel).

  • Reference Audio: Clear voice sample (≥10 sec recommended).

  • Timbre Template (optional): Style template (e.g., 抖音-读小说.wav).
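
To check the 10-second recommendation before loading a reference clip, a short script can read the clip's length. This is a minimal sketch using Python's standard wave module, which reads uncompressed PCM WAV files only; the function names and the MIN_SECONDS constant are made up for this example and are not part of the plugin.

```python
import wave

MIN_SECONDS = 10.0  # recommended minimum length for reference audio


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """True if the clip meets the recommended minimum duration."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, is_long_enough("ComfyUI/input/reference.wav") returns False for a clip shorter than 10 seconds (the path here is only an illustration).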

Output:

  • WAV file written by SaveAudio to audio/ComfyUI.


5. Notes

  1. VRAM:

    • Index-TTS requires ~4 GB of VRAM; if memory runs out on long texts, split them into smaller chunks and synthesize each chunk separately.

  2. Quality Tips:

    • Adjust frequency_range in AudioCleanupNode (100-8000 Hz in this workflow) to preserve voice clarity; a range that is too narrow can make speech sound muffled.

  3. Voice Control:

    • Change seed in IndexTTSNode for different voice variations.

  4. Debugging:

    • Avoid special characters in Chinese text to prevent garbled speech.

    • Pre-clean noisy reference audio for better cloning.
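
Two of the notes above (splitting long texts and avoiding special characters) can be handled in a small preprocessing step before the text reaches IndexTTSNode. The sketch below is one possible approach, not part of the plugin: the set of characters kept, the 200-character default, and the function names are all assumptions to adjust for your own material.

```python
import re

# ASSUMPTION: keep CJK ideographs, ASCII letters and digits, whitespace,
# and common Chinese/English punctuation; drop everything else.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s，。！？、；：,.!?;:]")
# Zero-width split point right after sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。！？.!?])")


def clean_text(text):
    """Remove characters outside the allowed set and collapse whitespace."""
    text = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def split_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


if __name__ == "__main__":
    raw = "第一句★。第二句！#第三句？"
    for chunk in split_text(clean_text(raw), max_chars=8):
        print(chunk)
```

Each chunk can then be synthesized on its own and the resulting audio files joined afterwards, which keeps any single run short.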