ComfyUI integration for BAAI's Emu3.5 multimodal models
✅ STATUS: ALL NODES FULLY WORKING - Verified December 19, 2025
This repository provides ComfyUI custom nodes for running BAAI's Emu3.5 models for:
- Text-to-Image (T2I) - Generate images from text descriptions ✅
- Image Editing (X2I) - Transform/edit existing images ✅
- Interleaved Generation (Story Mode) - Create illustrated stories with text and images ✅
- Visual Q&A - Answer questions about images, OCR, image comparison ✅
| T2I / Basic Workflow | X2I (Image Edit) | Story Mode |
|---|---|---|
| ![]() | ![]() | ![]() |
Models Supported:
- Emu3.5-Image (34B params - T2I/X2I) - ✅ Working
- Emu3.5-Base (65B params - Story/Interleaved/VQA) - ✅ Working
- Vision Tokenizer (VQ-VAE for image encoding/decoding) - ✅ Working
Nodes Included:
- Emu 3.5 Loader V2 - Improved model loading with memory management
- Emu 3.5 T2I Sampler V2 - Text-to-image with tiled decoding for large images
- Emu 3.5 X2I (Image Edit) - Transform/edit images with text prompts
- Emu 3.5 Interleaved - Generate stories/tutorials with multiple images
- Emu 3.5 VQA - Visual question answering
- Emu 3.5 Memory Manager - VRAM management utilities
This project is built upon:
- BAAI Emu3.5 - Original model and codebase
- Paper: Emu3.5: Native Multimodal Models are World Learners
- Authors: Emu3.5 Team, Beijing Academy of Artificial Intelligence
- License: Apache 2.0
Development Contributors:
- Eric Rollei - ComfyUI node development and integration
- Claude Opus 4.5 (Anthropic) - AI pair programming assistant for debugging, compatibility fixes, and feature implementation
Technical Contributions:
- Transformers 4.57+ compatibility patches (DynamicCache API changes)
- Blackwell GPU (sm_120) eager attention workaround
- VQA task presets and multi-image comparison support
- Memory management and VRAM optimization
All model weights and architecture remain property of BAAI under Apache 2.0 license.
- ComfyUI installed
- Python 3.10+
- CUDA-capable GPU:
- Full BF16: 48GB+ VRAM (RTX A6000, RTX 6000 Ada/Blackwell)
- NF4 Quantized: 24GB+ VRAM (RTX 4090, RTX A5000)
- 100GB+ disk space for model weights
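As a rough way to sanity-check the VRAM tiers above, weight memory scales linearly with bytes per parameter. This is an illustrative sketch (weights only; real usage is higher because of activations, KV cache, and the VQ model):

```python
def model_weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-memory footprint in GiB (weights only,
    no activations or KV cache)."""
    return n_params * bytes_per_param / 1024**3

# 34B-parameter Emu3.5-Image:
print(round(model_weight_gb(34e9, 2.0), 1))  # BF16 (2 bytes/param): ~63.3 GiB
print(round(model_weight_gb(34e9, 0.5), 1))  # NF4 (~0.5 bytes/param): ~15.8 GiB
```

These estimates line up with the observed ~65GB (BF16) and ~22GB (NF4) figures in the performance tables below once runtime overhead is added.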
```
cd ComfyUI/custom_nodes
git clone --recursive https://github.com/EricRollei/Emu35-Comfyui-Nodes.git emu35
cd emu35

# Install dependencies
pip install -r requirements.txt
```

Note: The `--recursive` flag automatically clones the patched Emu3.5 submodule with the transformers 4.57+ compatibility fixes.
- Download this repository
- Extract to `ComfyUI/custom_nodes/emu35/`
- Download the Emu3.5 repo and place it in `emu35/Emu3_5_repo/`
- Install requirements: `pip install -r requirements.txt`
Place models in `ComfyUI/models/emu35/`:

Option A: Full BF16 (48GB+ VRAM - Best Quality)

```
huggingface-cli download BAAI/Emu3.5-Image --local-dir models/emu35/Emu3.5-Image
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir models/emu35/vision_tokenizer
```

Option B: NF4 Quantized (24GB+ VRAM)

```
huggingface-cli download wikeeyang/Emu35-Image-NF4 --local-dir models/emu35/Emu3.5-Image-NF4
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir models/emu35/vision_tokenizer
```

Directory structure:
```
ComfyUI/
├── custom_nodes/
│   └── emu35/
│       ├── nodes.py
│       ├── nodes_v2.py                  # New V2 nodes
│       ├── __init__.py
│       ├── patched_tokenization_emu3.py
│       └── Emu3_5_repo/                 # Official repo
└── models/
    └── emu35/
        ├── Emu3.5-Image/                # (or Emu3.5-Image-NF4/)
        └── vision_tokenizer/
```
Emu 3.5 Loader V2: Loads the Emu3.5 model with improved memory management.
| Input | Type | Description |
|---|---|---|
| `model_name` | dropdown | Model folder (e.g., "Emu3.5-Image") |
| `vq_model_name` | dropdown | Vision tokenizer folder |
| `precision` | dropdown | bf16, fp16, fp32, or nf4 |
| `device` | dropdown | cuda:0, cuda:1, cpu |
| `vq_device` | dropdown | Device for VQ model |
| `attention_implementation` | dropdown | eager (recommended), sdpa |
| Output | Type | Description |
|---|---|---|
| `EMU35_MODEL` | model | Loaded language model |
| `EMU35_TOKENIZER` | tokenizer | Text tokenizer |
| `EMU35_VQ` | model | Vision tokenizer |
Emu 3.5 T2I Sampler V2: Text-to-image generation with improved quality and tiled decoding.
| Input | Type | Description |
|---|---|---|
| `model/tokenizer/vq_model` | - | From loader |
| `prompt` | string | Text description |
| `aspect_ratio` | dropdown | 1:1, 4:3, 3:4, 16:9, 9:16, etc. |
| `cfg_scale` | float | Guidance scale (default: 5.0) |
| `seed` | int | Random seed |
| `image_top_k` | int | Sampling top-k (default: 5120) |
| `image_temperature` | float | Sampling temperature (default: 1.0) |
| `tiled_decode` | bool | Use tiled VQ decoding (faster for large images) |
| `tile_size` | int | Tile size for decoding (default: 32) |
| Output | Type | Description |
|---|---|---|
| `IMAGE` | image | Generated image |
| `TEXT_RESPONSE` | string | Any text response |
| `REASONING` | string | Chain-of-thought reasoning (if any) |
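Tiled decoding caps peak VRAM by decoding the latent grid in fixed-size tiles instead of all at once. A minimal sketch of how such partitioning could work (the helper name is hypothetical; the actual node's implementation may differ):

```python
def tile_coords(height: int, width: int, tile: int = 32):
    """Yield (y0, y1, x0, x1) spans covering a latent grid in tiles.
    Illustrates how tiled_decode could partition VQ decoding; the
    default tile of 32 matches the node's tile_size default."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Clamp the last tile so partial edges are still covered
            yield y0, min(y0 + tile, height), x0, min(x0 + tile, width)

# A 1024x1024 image at 16x downsampling is a 64x64 latent grid:
tiles = list(tile_coords(64, 64, 32))
print(len(tiles))  # → 4
```

Each tile is decoded independently, so peak decoder memory is bounded by the tile size rather than the full image resolution.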
Emu 3.5 X2I (Image Edit): Transform or edit existing images based on text instructions.
| Input | Type | Description |
|---|---|---|
| `model/tokenizer/vq_model` | - | From loader |
| `prompt` | string | Edit instruction (e.g., "Make the background a sunset") |
| `reference_image_1` | image | Primary reference image |
| `image_area` | dropdown | Token resolution: 256x256 to 1024x1024 |
| `cfg_scale` | float | Guidance (default: 2.0 for X2I) |
| `seed` | int | Random seed |
| `reference_image_2` | image | Optional second reference |
| `reference_image_3` | image | Optional third reference |
| `tiled_decode` | bool | Use tiled VQ decoding |
Example Prompts for X2I:
- "Transform this image into a realistic photo"
- "Change the background to a beach sunset"
- "Add sunglasses to the person"
- "Replace the dog with a cat"
- With 2+ images: "Replace the [object] in first image with [object] from second image"
| Output | Type | Description |
|---|---|---|
| `IMAGE` | image | Edited image |
| `TEXT_RESPONSE` | string | Any text response |
| `REASONING` | string | Chain-of-thought reasoning |
Emu 3.5 Interleaved: Generate text with multiple embedded images (stories, tutorials).
| Input | Type | Description |
|---|---|---|
| `model/tokenizer/vq_model` | - | From loader |
| `prompt` | string | Topic/story to generate |
| `task_type` | dropdown | story, howto, explore |
| `max_images` | int | Number of images to generate (1-10) |
| `cfg_scale` | float | Guidance scale |
| `seed` | int | Random seed |
| `reference_image` | image | Optional reference for context |
| Output | Type | Description |
|---|---|---|
| `IMAGES` | image batch | Generated images |
| `TEXT_RESPONSE` | string | Full text with [IMAGE_N] markers |
| `REASONING` | string | Reasoning if present |
Emu 3.5 VQA: Analyze images, answer questions, describe content, and read text (OCR).
| Input | Type | Description |
|---|---|---|
| `model/tokenizer/vq_model` | - | From loader (use emu35-base) |
| `image` | image | Image to analyze |
| `question` | string | Question about the image |
| `max_tokens` | int | Max response length (default: 512) |
| `task_type` | dropdown | Preset task types (see below) |
| `temperature` | float | Response creativity (0.1-0.3 for accuracy, 0.5-0.7 for creativity) |
| `image2` | image | Optional second image for comparison tasks |
| Output | Type | Description |
|---|---|---|
| `response` | string | Model's answer |
Task Types:
| Task Type | Description | Best For |
|---|---|---|
| `caption` | One-sentence summary | Quick descriptions |
| `describe` | Detailed description of subjects, background, colors, composition | Comprehensive analysis |
| `analyze` | Context, mood, artistic style analysis | Art/photo critique |
| `ocr` | Read and transcribe text from images | Screenshots, signs, documents |
| `question` | Free-form Q&A | Specific questions |
| `custom` | Use your own question directly | Any task |
Example Questions for VQA:
Basic Understanding:
- "Describe this image in detail."
- "What is the main subject of this image?"
- "What colors are dominant in this image?"
Object Identification:
- "What objects are visible in this image?"
- "Is there a person in this image? What are they doing?"
- "What type of animal is in the photo?"
Counting & Spatial:
- "How many people are in this image?"
- "What is to the left of the car?"
- "Where is the sun in this scene?"
OCR (Text Reading):
- "What text appears in this image?"
- "Read the sign in the background."
- "What does the label say?"
Analysis & Style:
- "What artistic style is this image?"
- "What time of day is shown?"
- "What emotion does this scene convey?"
- "Is this image indoor or outdoor?"
Comparison (with 2 images):
- "What's different between these two images?"
- "Which image shows more people?"
- "Compare the lighting in both photos."
Tips for Best Results:
- Use emu35-base model - The Image model is optimized for generation, not understanding
- Lower temperature (0.1-0.3) for factual answers like counting, OCR, or identification
- Higher temperature (0.5-0.7) for creative descriptions or artistic analysis
- Be specific - "What brand is the laptop?" works better than "What's in the image?"
- For OCR, use task_type="ocr" which has an optimized prompt
- For comparisons, connect both images (image + image2)
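The preset/temperature pairing in the tips above can be sketched as a lookup table. This is an illustrative mapping, not the node's actual code: the preset names mirror the task-type table, but the exact prompt strings and temperatures are assumptions:

```python
# Hypothetical VQA presets: task_type -> (prompt template, temperature).
# Factual tasks (ocr, caption) get low temperature; analysis gets higher.
VQA_PRESETS = {
    "caption":  ("Describe this image in one sentence.", 0.2),
    "describe": ("Describe the subjects, background, colors, and composition in detail.", 0.3),
    "analyze":  ("Analyze the context, mood, and artistic style of this image.", 0.6),
    "ocr":      ("Transcribe all text visible in this image exactly as written.", 0.1),
    "question": ("{question}", 0.3),
}

def build_vqa_prompt(task_type: str, question: str = "") -> tuple[str, float]:
    """Resolve a task_type preset; unknown types fall back to free-form Q&A."""
    template, temperature = VQA_PRESETS.get(task_type, ("{question}", 0.3))
    return template.format(question=question), temperature

prompt, temp = build_vqa_prompt("ocr")
print(temp)  # → 0.1
```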
Emu 3.5 Memory Manager: Utilities for VRAM management.
| Input | Type | Description |
|---|---|---|
| `action` | dropdown | clear_cache, report_memory, gc_collect |
| `any_input` | any | Pass-through connection |

| Output | Type | Description |
|---|---|---|
| `any_output` | any | Pass-through |
| `memory_info` | string | Memory status report |
The original nodes are still available for compatibility:
- `Emu35Loader` - Original loader
- `Emu35Sampler` - Original T2I sampler
- `Emu35VQA` - Original VQA node
- `Emu35ClearCache` - Cache clearing
| Configuration | GPU | VRAM Used | Speed |
|---|---|---|---|
| Full BF16 + T2I | RTX 6000 Blackwell 96GB | ~65GB | ~5-6 tok/s |
| Full BF16 + X2I (1 image) | RTX 6000 Blackwell 96GB | ~82GB | ~5.4 tok/s |
| NF4 + X2I (2 images @ 1024) | RTX 6000 Blackwell 96GB | ~50GB | ~4-5 tok/s |
| NF4 + T2I | RTX 4090 24GB | ~22GB | ~3-4 tok/s |
| Task | Resolution | Tokens | Time |
|---|---|---|---|
| T2I 1:1 | 1024x1024 | ~4096 | ~12 min |
| T2I 4:3 | 1168x880 | ~4000 | ~11 min |
| X2I | Same as input | ~4000 | ~13 min |
- Model Size: 34B parameters
- Training: 10T+ multimodal tokens
- Image Tokenization: VQ-VAE (IBQ) with 262,144 codebook
- Visual Tokens: Token IDs 151855-413998
- Max Resolution: 2048x2048 (128x128 latents)
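Given the 16x spatial downsampling implied above (2048x2048 pixels → 128x128 latents), the visual-token cost of an image can be estimated directly. A sketch of that arithmetic:

```python
def visual_token_count(height_px: int, width_px: int, downsample: int = 16) -> int:
    """Approximate visual tokens for an image, assuming the VQ-VAE's
    16x spatial downsampling (2048x2048 -> 128x128 latents)."""
    return (height_px // downsample) * (width_px // downsample)

print(visual_token_count(1024, 1024))  # → 4096 (matches the T2I 1:1 benchmark)
print(visual_token_count(2048, 2048))  # → 16384 (max resolution)
```

This matches the ~4096 tokens reported for 1024x1024 generation in the timing table.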
```
BOS = 151849           # <|extra_203|>  Begin generation
EOS = 151850           # <|extra_204|>  End generation
IMG = 151851           # <|image token|>
BOI = 151852           # <|image start|>
EOI = 151853           # <|image end|>
EOL = 151846           # <|extra_200|>  End of line
VISUAL_START = 151854  # First visual token
```

T2I (Text-to-Image):

```
<|extra_203|>You are a helpful assistant for t2i task. USER: {prompt} ASSISTANT: <|extra_100|>
```

X2I (Image Edit):

```
<|extra_203|>You are a helpful assistant for x2i task. USER: <|IMAGE|>{prompt} ASSISTANT: <|extra_100|>
```
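The templates above can be assembled with simple string formatting. An illustrative sketch (the helper names are hypothetical; the special-token strings come straight from the templates):

```python
BOS_TOKEN = "<|extra_203|>"  # begin-generation token from the table above

def build_t2i_prompt(prompt: str) -> str:
    """Assemble the T2I prompt template shown above."""
    return (f"{BOS_TOKEN}You are a helpful assistant for t2i task. "
            f"USER: {prompt} ASSISTANT: <|extra_100|>")

def build_x2i_prompt(prompt: str) -> str:
    """Assemble the X2I template; <|IMAGE|> marks where the encoded
    reference image's visual tokens are inserted."""
    return (f"{BOS_TOKEN}You are a helpful assistant for x2i task. "
            f"USER: <|IMAGE|>{prompt} ASSISTANT: <|extra_100|>")

print(build_t2i_prompt("a red apple on a wooden table"))
```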
Issue: SDPA attention produces noise/garbage on Blackwell (sm_120) with CUDA 12.8.
Solution: Use attention_implementation="eager" (default in V2 loader).
Issue: Missing visual tokens in tokenizer. Solution: Patched tokenizer auto-synthesizes missing tokens.
Issue: Out of memory (CUDA OOM). Solutions:
- Use NF4 quantization (24GB VRAM)
- Reduce `image_area` in the X2I node
- Use smaller aspect ratios
- Enable tiled decoding
Issue: Stopping criteria triggered by reference image tokens. Solution: Fixed in V2 - stopping criteria now ignores input tokens.
| GPU Architecture | SDPA | Eager | Recommended |
|---|---|---|---|
| Ampere (sm_80) | ✅ | ✅ | SDPA |
| Ada Lovelace (sm_89) | ✅ | ✅ | SDPA |
| Blackwell (sm_120) | ❌ | ✅ | Eager |
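The table above can be turned into a small runtime check. A sketch, assuming compute capability is the right discriminator (with torch you would call it as `pick_attention_impl(*torch.cuda.get_device_capability())`):

```python
def pick_attention_impl(sm_major: int, sm_minor: int) -> str:
    """Choose an attention implementation from the GPU's compute
    capability, per the compatibility table: Blackwell (sm_120)
    needs eager; Ampere/Ada can use SDPA."""
    if (sm_major, sm_minor) >= (12, 0):  # Blackwell and newer
        return "eager"
    return "sdpa"

print(pick_attention_impl(12, 0))  # → eager
print(pick_attention_impl(8, 9))   # → sdpa
```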
```
[Emu 3.5 Loader V2] → [Emu 3.5 T2I Sampler V2] → [Preview Image]
                              ↑
                    prompt: "a red apple on a wooden table,
                             studio lighting, photorealistic"
```

```
[Load Image] → [Emu 3.5 X2I] → [Preview Image]
                     ↑
          [Emu 3.5 Loader V2]
                     ↑
          prompt: "Transform into an oil painting"
```

```
[Load Image 1] →
                  [Emu 3.5 X2I] → [Preview Image]
[Load Image 2] →        ↑
                 prompt: "Replace the background
                          from image 1 with the
                          scene from image 2"
```
```
emu35/
├── nodes.py                       # V1 nodes (legacy)
├── nodes_v2.py                    # V2 nodes (recommended)
├── __init__.py                    # ComfyUI registration
├── patched_tokenization_emu3.py   # Fixed tokenizer
├── download_nf4.py                # NF4 download helper
├── Emu3_5_repo/                   # Official Emu3.5 code
└── dev/                           # Development/test scripts
```
```
cd ComfyUI/custom_nodes/emu35/dev

# Test tokenizer
python test_tokenizer.py

# Test minimal generation
python test_minimal.py
```

Contributions welcome! Priority areas:
- vLLM integration for faster generation
- Additional sampling strategies
- Workflow examples
- Documentation improvements
- This integration code: MIT License
- Emu3.5 models and code: Apache 2.0 (BAAI)
```
@article{emu3.5,
  title={Emu3.5: Native Multimodal Models are World Learners},
  author={Emu3.5 Team},
  journal={arXiv preprint arXiv:2510.26583},
  year={2025}
}
```

- Emu3.5 Project: https://emu.world/
- Paper: https://arxiv.org/pdf/2510.26583
- Official Code: https://github.com/baaivision/Emu3.5
- Model Weights: https://huggingface.co/BAAI/Emu3.5-Image
- NF4 Quantized: https://huggingface.co/wikeeyang/Emu35-Image-NF4
- ComfyUI: https://github.com/comfyanonymous/ComfyUI



