Boost Your Image Generation Game with Stable Diffusion, JOY Caption Two, and LORA
📝 Workflow Overview
This workflow is designed to reverse engineer prompts from reference images and generate new images using Stable Diffusion. It combines JOY Caption Two for prompt inference and FLUX with LORA models for enhanced image generation, producing high-quality images and allowing for comparison between input and output images.
🧠 Core Models
1️⃣ UNet (Stable Diffusion)
Function: The primary neural network responsible for noise removal and image generation.
Model Used:
基础算法_F.1
Installation:
Install via ComfyUI Manager.
Or manually download the .safetensors file and place it in models/checkpoints.
2️⃣ VAE (Variational Autoencoder)
Function: Enhances image quality, particularly in detail and color.
Model Used:
ae.sft
Installation:
Install via ComfyUI Manager.
Or manually download the .vae.pt file and place it in models/vae.
3️⃣ CLIP (Text Encoder)
Function: Converts text prompts into vectors for image generation.
Model Used:
t5xxl_fp8_e4m3fn
Installation:
Install via ComfyUI Manager.
Or manually download the .pt files and place them in models/clip.
4️⃣ JOY Caption Two (Prompt Inference)
Function: Describes input images and suggests suitable prompts for generation.
Model Used:
unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
Installation:
Requires the JOY Caption Two custom-node plugin and the Llama 3.1 model listed above.
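For context, the sketch below shows how a pre-quantized 4-bit Llama 3.1 checkpoint of this kind is typically loaded. It is an illustration only, assuming the Hugging Face transformers and bitsandbytes packages are installed; the JOY Caption Two plugin handles model loading internally.

```python
# Minimal sketch, not the plugin's actual code: loading the pre-quantized
# 4-bit Llama 3.1 checkpoint with Hugging Face transformers (bitsandbytes
# must be installed for the 4-bit weights to load).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place the quantized layers on the available GPU
)
```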
5️⃣ LORA (Style Enhancement)
Function: Enhances specific styles such as Chinese New Year themes or Floral Snake aesthetics.
Models Used:
J_3D图标素材2_中国新年_V_Flux
趣味-F.1- | 花样美蛇_V1
Installation:
Install via ComfyUI Manager.
Or manually place them in models/lora (see the path-check sketch below).
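If you install any of the files manually, a quick listing of the model folders helps confirm everything landed in the right place before running the workflow. A minimal sketch, assuming a default ComfyUI directory layout; adjust the root path to your install location.

```python
# Minimal sketch: list the contents of the ComfyUI model folders used by this
# workflow to confirm the manually placed files are where the loaders expect them.
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # adjust to your install location

for folder in ["models/checkpoints", "models/vae", "models/clip", "models/lora"]:
    path = COMFYUI_ROOT / folder
    files = sorted(p.name for p in path.glob("*")) if path.exists() else []
    print(f"{folder}: {files if files else 'MISSING OR EMPTY'}")
```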
📦 Key Components (Nodes)
| Node | Function |
|---|---|
| UNETLoader | Loads the UNet model. |
| VAELoader | Loads the VAE model. |
| DualCLIPLoader | Loads the CLIP language model. |
| LoraLoaderModelOnly | Loads LORA models for style enhancement. |
| LoadImage | Loads the reference image. |
| ImageResizeKJ | Resizes the input image. |
| Joy_caption_two_load | Loads the JOY Caption Two model. |
| Joy_caption_two | Generates descriptive text from the input image. |
| ShowText | Displays the inferred prompt. |
| CLIPTextEncode | Converts the inferred text prompt into vector form. |
| KSampler | Handles the sampling and image generation process. |
| VAEEncode | Encodes the input image into a latent space. |
| VAEDecode | Decodes the latent space into the final image. |
| SaveImage | Saves the generated output image. |
| Image Comparer (rgthree) | Compares the input and output images. |
📂 Major Workflow Groups
1️⃣ JOY Caption Two - Prompt Inference
Function: Uses JOY Caption Two to generate descriptive prompts from the input image.
Key Components:
Joy_caption_two_load
Joy_caption_two
ShowText
Input: Image
Output: Descriptive text (for Stable Diffusion)
2️⃣ Base Model Loading
Function: Loads the UNet, VAE, and CLIP models.
Key Components:
UNETLoader
VAELoader
DualCLIPLoader
3️⃣ Reference Image Input
Function: Loads and resizes the reference image.
Key Components:
LoadImage
ImageResizeKJ
4️⃣ LORA Model Selection
Function: Selects LORA models for style enhancement.
Key Components:
LoraLoaderModelOnly
5️⃣ Prompt Inference Result Input
Function: Encodes the JOY Caption Two-generated text into vectors for Stable Diffusion.
Key Components:
CLIPTextEncode
ConditioningZeroOut
6️⃣ Image Generation
Function: Generates the final image using UNet and VAE.
Key Components:
KSampler
VAEDecode
SaveImage
7️⃣ Image Comparison
Function: Compares the reference image with the generated image.
Key Components:
Image Comparer (rgthree)
🔢 Inputs & Outputs
📥 Main Inputs
Reference Image (for prompt inference)
LORA Selection (for style enhancement)
Sampling Parameters (see the API sketch after this section):
Seed Value (randomization control)
Sampling Method (Euler, DPM++, etc.)
Sampling Steps (default 25 steps)
Text Prompt (generated via JOY Caption Two)
📤 Main Outputs
Final high-quality generated image
Reverse-engineered descriptive text
Comparison between the reference and generated images
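The inputs above can also be set programmatically once the workflow is assembled. The sketch below is an illustration only: it assumes a local ComfyUI server on the default port 8188 and a workflow exported with "Save (API Format)"; the file name and node id are placeholders, not part of this workflow.

```python
# Minimal sketch: queue the workflow through ComfyUI's HTTP API and override
# the KSampler seed/steps. "workflow_api.json" and node id "3" are
# placeholders -- use your exported file and the real node id found inside it.
import json
import urllib.request

with open("workflow_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

workflow["3"]["inputs"]["seed"] = 42    # randomization control
workflow["3"]["inputs"]["steps"] = 25   # default 25 steps

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=payload)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```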
⚠️ Important Considerations
Hardware Requirements
Requires a GPU with at least 8 GB of VRAM (12 GB+ recommended).
JOY Caption Two can be memory-intensive; consider 4-bit quantized models.
LORA Model Compatibility
Different LORA models can noticeably change the result; experiment with combinations for the best output.
Prompt Optimization
Reverse-engineered prompts may need manual refinement for best results.
Sampling Parameters
Lower sampling steps may lead to loss of detail (recommended 25–50 steps).
Euler is faster, while DPM++ provides higher quality.
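As a rough starting point, the presets below reflect the speed/quality trade-off described above. They are illustrative values, not settings prescribed by this workflow, and the sampler names follow ComfyUI's internal identifiers.

```python
# Illustrative KSampler presets only -- tune the sampler and step count per LORA and prompt.
fast_preview = {"sampler_name": "euler",    "steps": 25}  # quicker, slightly softer detail
high_quality = {"sampler_name": "dpmpp_2m", "steps": 40}  # slower, crisper fine detail
```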