[bug]: FLUX VAE decode causes GPU Hang on AMD gfx1151 (RDNA 3.5) — second+ generation always crashes #9053

@camerono

Description

Is there an existing issue for this problem?

  • I have searched the existing issues

Install method

Manual

Operating system

Linux

GPU vendor

AMD (ROCm)

GPU model

AMD Radeon 8060S (gfx1151 / RDNA 3.5 / Strix Halo) — integrated APU in GMKtec EVO-X2

GPU VRAM

48 GB (UMA carve-out from 128 GB LPDDR5X system RAM)

Version number

6.12.0

Browser

N/A — reproduced entirely via REST API

System Information

OS: Ubuntu Noble 24.04 (kernel 6.17.0-20-generic)
ROCm: 7.2.1
PyTorch: 2.7.1+rocm7.2.1.git1dab218d
Docker base image: rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.7.1
HSA_OVERRIDE_GFX_VERSION: 11.5.1
HSA_ENABLE_SDMA: 0
PYTORCH_HIP_ALLOC_CONF: garbage_collection_threshold:0.6
amdgpu.lockup_timeout: 600000 (kernel boot param)

invokeai.yaml:
schema_version: 4.0.2
device: cuda
precision: float16
enable_partial_loading: true
device_working_mem_gb: 20
force_tiled_decode: true

What happened

The FLUX VAE decode step (flux_vae_decode invocation) causes an immediate GPU Hang on AMD Radeon 8060S (gfx1151) running ROCm 7.2.1. The crash follows a consistent pattern:

  1. First FLUX generation after starting InvokeAI: succeeds (denoising + VAE decode complete normally)
  2. Second FLUX generation in the same InvokeAI process: denoising completes, but the GPU immediately hangs when flux_vae_decode begins

The crash produces:

HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang
HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang

This occurs immediately after the VAE model is loaded onto the device:

[MODEL CACHE] Loaded model 'cfc36333-...:vae' (AutoEncoder) onto cuda device in 0.02s
HW Exception by GPU node-1 reason :GPU Hang

Critical finding: ComfyUI does NOT have this issue. We ran 3+ consecutive FLUX generations through ComfyUI on the exact same hardware, same PyTorch build, same GGUF model, same VAE file — all completed successfully with zero crashes and zero restarts. This confirms the issue is specific to InvokeAI's model cache / session management, not ROCm or the GPU hardware.

Additionally, the Z-Image pipeline in InvokeAI (z_image_latents_to_image.py) uses the same VAE model file and same FluxAutoEncoder class but does not crash on subsequent runs. Only the FLUX pipeline (flux_vae_decode.py) is affected.

Investigation summary

We tested extensively to isolate the cause:

| Attempted fix | Result |
| --- | --- |
| Increase `amdgpu.lockup_timeout` (10 s → 120 s → 600 s) | No effect — crash is instant, not a timeout |
| Convert VAE weights FP32 → FP16 | No effect on the FLUX path (did help Z-Image separately) |
| Patch `flux_vae_decode.py` to add `TorchDevice.empty_cache()` + `torch.inference_mode()` (matching `z_image_latents_to_image.py`) | No effect |
| Call `POST /api/v2/models/empty_model_cache` between runs | No effect |
| Different resolutions (512x512 through 2000x1120) | All crash on 2nd run |
| Different FLUX models (GGUF Q5_K_S, BnB NF4) | All crash on 2nd run |
| Restart the InvokeAI container between runs | Partially effective — usually works, but not guaranteed |
| Use ComfyUI instead | Fully effective — zero crashes across all tests |
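For reference, the cache-clear attempt from the table can be scripted as below. This is a hedged sketch: the endpoint path is the one we tested, but the base URL (a default local install on port 9090) is an assumption you may need to adjust.

```python
# Hedged sketch of the cache-clear request tried between runs.
# BASE_URL is an assumption (default local InvokeAI port); adjust as needed.
import urllib.request

BASE_URL = "http://localhost:9090"

def empty_model_cache_url(base_url: str = BASE_URL) -> str:
    """Build the URL for InvokeAI's model-cache-clear endpoint."""
    return base_url.rstrip("/") + "/api/v2/models/empty_model_cache"

def empty_model_cache(base_url: str = BASE_URL) -> bool:
    """POST to the endpoint; returns True on a 2xx response."""
    req = urllib.request.Request(empty_model_cache_url(base_url), method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return 200 <= resp.status < 300

print(empty_model_cache_url())
```

In our tests this call returned successfully, but the second FLUX generation still hung, so the cached model objects themselves do not appear to be the whole story.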

Code comparison

flux_vae_decode.py (crashes on 2nd run):

with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    assert isinstance(vae, AutoEncoder)
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    img = vae.decode(latents)  # <-- GPU Hang on 2nd+ run

z_image_latents_to_image.py (never crashes, same VAE class):

with seamless_context, vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    TorchDevice.empty_cache()
    with torch.inference_mode():
        if isinstance(vae, FluxAutoEncoder):
            img = vae.decode(latents)  # <-- works on repeated runs

Note: We patched flux_vae_decode.py to match the Z-Image pattern (adding empty_cache() + inference_mode()), but this did not fix the issue, suggesting the root cause is deeper — likely in how the model cache manages GPU state between FLUX pipeline runs.

What you expected to happen

Multiple consecutive FLUX generations should complete without GPU hangs, as they do in ComfyUI on the same hardware.

How to reproduce the problem

Steps to reproduce

  1. Run InvokeAI 6.12.0 in Docker with ROCm 7.2.1 on AMD gfx1151 (RDNA 3.5) GPU
  2. Set HSA_OVERRIDE_GFX_VERSION=11.5.1, HSA_ENABLE_SDMA=0
  3. Generate any image using a FLUX model (e.g., flux1-dev-Q5_K_S.gguf) at any resolution
  4. First generation succeeds
  5. Without restarting InvokeAI, generate a second FLUX image
  6. Second generation completes denoising but crashes with HW Exception: GPU Hang when flux_vae_decode runs

Note: This is 100% reproducible on our hardware. If testing on a different AMD GPU, the behavior may differ since gfx1151 is a new ISA target with limited ROCm testing.
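The two-run call pattern can be sketched as a standalone script. This is an illustration of the sequence only, not a crash reproducer: a trivial stand-in decoder replaces the real FluxAutoEncoder, and all names below are ours.

```python
# Standalone sketch of the two-run call pattern (illustration only: a tiny
# stand-in decoder replaces the real FluxAutoEncoder, so this will not
# trigger the hang by itself).
import torch

def decode_once(decoder: torch.nn.Module, latents: torch.Tensor,
                device: torch.device) -> torch.Tensor:
    """Mirror the per-generation steps: move latents to the device in the
    decoder's dtype, decode under inference_mode, return the image on CPU."""
    dtype = next(decoder.parameters()).dtype
    x = latents.to(device=device, dtype=dtype)
    with torch.inference_mode():
        img = decoder(x)  # with the real VAE, the second call here hangs
    return img.cpu()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
decoder = torch.nn.Conv2d(16, 3, kernel_size=1).to(device)  # stand-in "VAE"
latents = torch.randn(1, 16, 32, 32)

for run in (1, 2):  # first run succeeds; the second is where the hang occurs
    img = decode_once(decoder, latents, device)
    print(run, tuple(img.shape))
```

Running an analogous loop against the real `AutoEncoder` inside the InvokeAI process is what reliably hangs on the second iteration.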

Additional context

  • The GPU is an integrated APU (Radeon 8060S in GMKtec EVO-X2) using unified memory (48 GB UMA carve-out from 128 GB LPDDR5X)
  • gfx1151 is only supported in ROCm 7.2+ via the gfx11-generic ISA target
  • The crash is non-deterministic on "first after restart" — sometimes the first run also crashes, suggesting GPU state from a prior InvokeAI session may persist across container restarts
  • After repeated GPU hangs, the GPU's MES (Micro Engine Scheduler) enters an unrecoverable state requiring a full server reboot
  • ComfyUI uses the same rocm/pytorch:rocm7.2.1 base image and same model files, confirming the issue is not in ROCm, PyTorch, or the model weights
  • We suspect InvokeAI's persistent model cache leaves HIP/ROCm runtime state (streams, memory mappings, or kernel dispatch state) that corrupts on gfx1151 between FLUX VAE decode calls
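One further experiment consistent with the stale-state suspicion above would be to force a full device synchronize and allocator flush immediately before each decode. This is a hedged diagnostic sketch, not a proposed fix, and the helper name is ours:

```python
# Hedged diagnostic (not a fix): flush pending GPU work and cached allocator
# blocks before the decode, to test whether stale stream/allocator state from
# the previous run is involved.
import torch

def decode_with_fresh_device_state(decoder: torch.nn.Module,
                                   latents: torch.Tensor) -> torch.Tensor:
    if torch.cuda.is_available():  # on ROCm builds, torch.cuda maps to HIP
        torch.cuda.synchronize()   # wait for all queued kernels to finish
        torch.cuda.empty_cache()   # release cached allocator blocks
    with torch.inference_mode():
        return decoder(latents)

# Usage with a stand-in module (illustrative names only):
out = decode_with_fresh_device_state(torch.nn.Identity(), torch.randn(1, 16, 8, 8))
print(tuple(out.shape))
```

We have not yet run this exact variant inside the FLUX path; we note it here in case it helps a maintainer narrow down which piece of runtime state is going stale.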

Discord username

No response

Labels

bug (Something isn't working)