Unlock High-Quality Image Generation with Stable Cascade and CLIP Vision

ComfyUI.org
2025-03-13 16:58:17

1. Workflow Overview

This workflow implements Stable Cascade image generation in ComfyUI. Stage C generates the initial image and Stage B refines it into the final, higher-quality output. The key feature is the use of CLIP Vision to pass an input image to the model as conditioning. This makes it possible to generate new variations of an image with unCLIP techniques, much as with SD2 unCLIP or SDXL, but with faster generation.


2. Core Models

The workflow uses the following core models:

  1. Stable Cascade Stage C (Image Generation)

    • Model File: stable_cascade_stage_c.safetensors

    • Function: Generates the initial image based on text and image conditioning inputs.

  2. Stable Cascade Stage B (Image Enhancement)

    • Model File: stable_cascade_stage_b.safetensors

    • Function: Refines the generated image from Stage C for higher clarity and detail.

  3. CLIP Vision (Image Conditioning)

    • Function: Encodes the input image so that it can serve as conditioning for generating new variations.


3. Key Components

1. unCLIPCheckpointLoader (Loading Stable Cascade C Model)

  • Function: Loads the Stable Cascade Stage C model.

  • Output: MODEL (Stable Cascade C), together with the CLIP, VAE, and CLIP_VISION outputs used by the nodes below

2. CLIPTextEncode (Text Conditioning Encoder)

  • Function: Encodes the text prompt into conditioning.

  • Output: CONDITIONING

3. CLIPVisionEncode (Image Conditioning Encoder)

  • Function: Encodes the input image with the CLIP Vision model.

  • Output: CLIP_VISION_OUTPUT

4. unCLIPConditioning (Image Conditioning Transformation)

  • Function: Combines the CLIP Vision output with an existing conditioning so that Stable Cascade can use the image as guidance.

  • Output: CONDITIONING

5. KSampler (Sampler for Image Generation)

  • Function: Runs the sampling (denoising) process with the loaded model to produce a latent image from the given conditioning.

  • Configuration: Inputs include model, positive conditioning, negative conditioning, and latent image.

6. VAEDecode (Decoding Latent Image into Final Image)

  • Function: Decodes the latent output from Stable Cascade into the final image.

  • Output: IMAGE

7. SaveImage (Saving Generated Image)

  • Function: Saves the generated image to disk.
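
To make the wiring between these components concrete, the sketch below writes the conditioning path as a Python dictionary in the style of ComfyUI's API ("prompt") format, where a link is a two-item list of source node id and output slot. The input names, the output slot indices, the LoadImage node, the prompt text, and the file name reference.png are assumptions made for illustration; check them against the node definitions in your own ComfyUI install before relying on them.

```python
# Sketch of the Stage C conditioning path, API-format style.
# Input names and slot indices are assumptions; verify against your install.
prompt = {
    "1": {"class_type": "unCLIPCheckpointLoader",
          "inputs": {"ckpt_name": "stable_cascade_stage_c.safetensors"}},
    "2": {"class_type": "LoadImage",
          "inputs": {"image": "reference.png"}},  # hypothetical file name
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a watercolor landscape",  # example prompt
                     "clip": ["1", 1]}},
    "4": {"class_type": "CLIPVisionEncode",
          "inputs": {"clip_vision": ["1", 3], "image": ["2", 0]}},
    "5": {"class_type": "unCLIPConditioning",
          "inputs": {"conditioning": ["3", 0],
                     "clip_vision_output": ["4", 0],
                     "strength": 1.0,
                     "noise_augmentation": 0.0}},
}


def dangling_links(graph):
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in graph:
                missing.append((node_id, name))
    return missing


print(dangling_links(prompt))
```

An empty list means every link resolves to a node in the graph. The helper only checks that referenced node ids exist; it does not validate output types or slot numbers.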


4. Workflow Structure

The workflow is divided into two main stages:

  1. Stage C (Base Image Generation)

    • Feeds the text conditioning and the CLIP Vision image conditioning into Stable Cascade Stage C to generate an initial image.

  2. Stage B (Refinement Stage)

    • Uses the output from Stage C as conditioning for Stage B to refine the image with higher clarity and detail.
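
The ordering implied by the two stages can be sketched as a small dependency graph. The step names below are illustrative labels, not ComfyUI node identifiers, and the graph is a simplification of the structure described above rather than the exact workflow file.

```python
from graphlib import TopologicalSorter

# Each key is a step; the set lists the steps whose output it needs.
# Step names are illustrative labels, not ComfyUI node identifiers.
dependencies = {
    "text_conditioning": set(),
    "image_conditioning": set(),
    "stage_c_sampling": {"text_conditioning", "image_conditioning"},
    "stage_b_sampling": {"stage_c_sampling"},
    "decode_image": {"stage_b_sampling"},
    "save_image": {"decode_image"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Whatever order the two conditioning steps come out in, Stage C sampling always precedes Stage B sampling, and decoding happens only after Stage B has finished.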


5. Notes

  • GPU memory requirements are high; a GPU with at least 16 GB of VRAM is recommended.

  • Ensure that the Stable Cascade Stage C and Stage B models are properly installed before running the workflow.

  • Make sure the input image is readable by ComfyUI; the CLIP Vision step cannot encode an image it cannot access.
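
A quick pre-flight check for the model files might look like the following sketch. The directory ComfyUI/models/checkpoints is an assumed location, and missing_models is a hypothetical helper written for this example; adjust the path to match your installation.

```python
from pathlib import Path

REQUIRED_MODELS = [
    "stable_cascade_stage_c.safetensors",
    "stable_cascade_stage_b.safetensors",
]


def missing_models(checkpoint_dir, required=REQUIRED_MODELS):
    """Return the required model file names not found in checkpoint_dir."""
    folder = Path(checkpoint_dir)
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    # Assumed location; change it to wherever your install keeps checkpoints.
    missing = missing_models("ComfyUI/models/checkpoints")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("Both Stable Cascade checkpoints found.")
```

The check only confirms that the files exist; it does not verify that they are complete or uncorrupted.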