Is there an existing issue for this problem?
Install method
Manual
Operating system
Linux
GPU vendor
AMD (ROCm)
GPU model
AMD Radeon 8060S (gfx1151 / RDNA 3.5 / Strix Halo) — integrated APU in GMKtec EVO-X2
GPU VRAM
48 GB (UMA carve-out from 128 GB LPDDR5X system RAM)
Version number
6.12.0
Browser
N/A — reproduced entirely via REST API
System Information
OS: Ubuntu Noble 24.04 (kernel 6.17.0-20-generic)
ROCm: 7.2.1
PyTorch: 2.7.1+rocm7.2.1.git1dab218d
Docker base image: rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.7.1
HSA_OVERRIDE_GFX_VERSION: 11.5.1
HSA_ENABLE_SDMA: 0
PYTORCH_HIP_ALLOC_CONF: garbage_collection_threshold:0.6
amdgpu.lockup_timeout: 600000 (kernel boot param)
invokeai.yaml:

```yaml
schema_version: 4.0.2
device: cuda
precision: float16
enable_partial_loading: true
device_working_mem_gb: 20
force_tiled_decode: true
```
What happened
The FLUX VAE decode step (the `flux_vae_decode` invocation) causes an immediate GPU hang on the AMD Radeon 8060S (gfx1151) running ROCm 7.2.1. The crash follows a consistent pattern:
- First FLUX generation after starting InvokeAI: succeeds (denoising + VAE decode complete normally)
- Second FLUX generation in the same InvokeAI process: denoising completes, but the GPU hangs immediately when `flux_vae_decode` begins
The crash produces:

```
HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang
HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang
```
This occurs immediately after the VAE model is loaded onto the device:

```
[MODEL CACHE] Loaded model 'cfc36333-...:vae' (AutoEncoder) onto cuda device in 0.02s
HW Exception by GPU node-1 reason :GPU Hang
```
Critical finding: ComfyUI does NOT have this issue. We ran 3+ consecutive FLUX generations through ComfyUI on the exact same hardware, same PyTorch build, same GGUF model, same VAE file — all completed successfully with zero crashes and zero restarts. This confirms the issue is specific to InvokeAI's model cache / session management, not ROCm or the GPU hardware.
Additionally, the Z-Image pipeline in InvokeAI (z_image_latents_to_image.py) uses the same VAE model file and same FluxAutoEncoder class but does not crash on subsequent runs. Only the FLUX pipeline (flux_vae_decode.py) is affected.
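The crash signature above is consistent enough to detect mechanically when scanning service logs. The following is a minimal sketch, assuming log lines shaped like the excerpts in this report (the exact prefixes are taken from our logs, not from any documented InvokeAI logging contract):

```python
import re

# Patterns matching the log excerpts in this report (assumed, not a stable API).
VAE_LOADED = re.compile(r"\[MODEL CACHE\] Loaded model '.*:vae'")
GPU_HANG = re.compile(r"HW Exception by GPU node-\d+.*GPU Hang")

def find_hang_after_vae_load(log_lines):
    """Return the index of the first GPU-hang line that follows a VAE load,
    or None if no such pairing occurs."""
    vae_seen = False
    for i, line in enumerate(log_lines):
        if VAE_LOADED.search(line):
            vae_seen = True
        elif vae_seen and GPU_HANG.search(line):
            return i
    return None

sample = [
    "[MODEL CACHE] Loaded model 'cfc36333-...:vae' (AutoEncoder) onto cuda device in 0.02s",
    "HW Exception by GPU node-1 reason :GPU Hang",
]
print(find_hang_after_vae_load(sample))  # -> 1
```

We used this kind of scan to confirm that the hang always lands on the line immediately after the VAE load, i.e. before any decode output is produced.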
Investigation summary
We tested extensively to isolate the cause:
| Attempted fix | Result |
| --- | --- |
| Increase `amdgpu.lockup_timeout` (10s → 120s → 600s) | No effect — crash is instant, not a timeout |
| Convert VAE weights FP32 → FP16 | No effect on the FLUX path (did help Z-Image separately) |
| Patch `flux_vae_decode.py` to add `TorchDevice.empty_cache()` + `torch.inference_mode()` (matching `z_image_latents_to_image.py`) | No effect |
| Call `POST /api/v2/models/empty_model_cache` between runs | No effect |
| Different resolutions (512x512 through 2000x1120) | All crash on 2nd run |
| Different FLUX models (GGUF Q5_K_S, BnB NF4) | All crash on 2nd run |
| Restart InvokeAI container between runs | Partially effective — usually works, but not guaranteed |
| Use ComfyUI instead | Fully effective — zero crashes across all tests |
Code comparison
`flux_vae_decode.py` (crashes on 2nd run):

```python
with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    assert isinstance(vae, AutoEncoder)
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    img = vae.decode(latents)  # <-- GPU hang on 2nd+ run
```

`z_image_latents_to_image.py` (never crashes, same VAE class):

```python
with seamless_context, vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    TorchDevice.empty_cache()
    with torch.inference_mode():
        if isinstance(vae, FluxAutoEncoder):
            img = vae.decode(latents)  # <-- works on repeated runs
```
Note: We patched flux_vae_decode.py to match the Z-Image pattern (adding empty_cache() + inference_mode()), but this did not fix the issue, suggesting the root cause is deeper — likely in how the model cache manages GPU state between FLUX pipeline runs.
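For completeness, the decode guard we back-ported is sketched below. This is not InvokeAI's actual code: `DummyDecoder` is a stand-in for the real `FluxAutoEncoder`, and the cache clear is written directly against `torch.cuda` rather than InvokeAI's `TorchDevice` wrapper, so the snippet runs on CPU as well.

```python
import torch

class DummyDecoder(torch.nn.Module):
    """Stand-in for FluxAutoEncoder: maps 4-channel latents to 3-channel images."""
    def __init__(self) -> None:
        super().__init__()
        self.proj = torch.nn.Conv2d(4, 3, kernel_size=1)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z)

def guarded_decode(vae: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    # Match the decode dtype to the loaded weights, as both invocations do.
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(dtype=vae_dtype)
    # Z-Image clears the allocator cache before decoding; on a CUDA/HIP build
    # this is torch.cuda.empty_cache(). It is a guarded no-op on CPU.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    with torch.inference_mode():
        return vae.decode(latents)

img = guarded_decode(DummyDecoder(), torch.randn(1, 4, 8, 8))
```

Even with this guard in place in `flux_vae_decode.py`, the second-run hang persisted, which is why we point at the model cache rather than the decode call itself.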
What you expected to happen
Multiple consecutive FLUX generations should complete without GPU hangs, as they do in ComfyUI on the same hardware.
How to reproduce the problem
Steps to reproduce
- Run InvokeAI 6.12.0 in Docker with ROCm 7.2.1 on an AMD gfx1151 (RDNA 3.5) GPU
- Set `HSA_OVERRIDE_GFX_VERSION=11.5.1` and `HSA_ENABLE_SDMA=0`
- Generate any image using a FLUX model (e.g., `flux1-dev-Q5_K_S.gguf`) at any resolution
- The first generation succeeds
- Without restarting InvokeAI, generate a second FLUX image
- The second generation completes denoising but crashes with `HW Exception: GPU Hang` when `flux_vae_decode` runs
Note: This is 100% reproducible on our hardware. If testing on a different AMD GPU, the behavior may differ since gfx1151 is a new ISA target with limited ROCm testing.
Additional context
- The GPU is an integrated APU (Radeon 8060S in a GMKtec EVO-X2) using unified memory (48 GB UMA carve-out from 128 GB LPDDR5X)
- gfx1151 is only supported in ROCm 7.2+ via the `gfx11-generic` ISA target
- The crash is non-deterministic on the first run after a restart — sometimes the first run also crashes, suggesting GPU state from a prior InvokeAI session may persist across container restarts
- After repeated GPU hangs, the GPU's MES (Micro Engine Scheduler) enters an unrecoverable state requiring a full server reboot
- ComfyUI uses the same `rocm/pytorch:rocm7.2.1` base image and the same model files, confirming the issue is not in ROCm, PyTorch, or the model weights
- We suspect InvokeAI's persistent model cache leaves behind HIP/ROCm runtime state (streams, memory mappings, or kernel dispatch state) that becomes corrupted on gfx1151 between FLUX VAE decode calls
Discord username
No response