[bug]: FLUX VAE decode causes GPU Hang on AMD gfx1151 (RDNA 3.5) — second+ generation always crashes #9053

@camerono

Description

Is there an existing issue for this problem?

  • I have searched the existing issues

Install method

Manual

Operating system

Linux

GPU vendor

AMD (ROCm)

GPU model

AMD Radeon 8060S (gfx1151 / RDNA 3.5 / Strix Halo) — integrated APU in GMKtec EVO-X2

GPU VRAM

48 GB (UMA carve-out from 128 GB LPDDR5X system RAM)

Version number

6.12.0

Browser

N/A — reproduced entirely via REST API

System Information

OS: Ubuntu Noble 24.04 (kernel 6.17.0-20-generic)
ROCm: 7.2.1
PyTorch: 2.7.1+rocm7.2.1.git1dab218d
Docker base image: rocm/pytorch:rocm7.2.1_ubuntu24.04_py3.12_pytorch_release_2.7.1
HSA_OVERRIDE_GFX_VERSION: 11.5.1
HSA_ENABLE_SDMA: 0
PYTORCH_HIP_ALLOC_CONF: garbage_collection_threshold:0.6
amdgpu.lockup_timeout: 600000 (kernel boot param)

invokeai.yaml:
schema_version: 4.0.2
device: cuda
precision: float16
enable_partial_loading: true
device_working_mem_gb: 20
force_tiled_decode: true

What happened

The FLUX VAE decode step (flux_vae_decode invocation) causes an immediate GPU Hang on AMD Radeon 8060S (gfx1151) running ROCm 7.2.1. The crash follows a consistent pattern:

  1. First FLUX generation after starting InvokeAI: succeeds (denoising + VAE decode complete normally)
  2. Second FLUX generation in the same InvokeAI process: denoising completes, but the GPU immediately hangs when flux_vae_decode begins

The crash produces:

HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang
HW Exception by GPU node-1 (Agent handle: 0x...) reason :GPU Hang

This occurs immediately after the VAE model is loaded onto the device:

[MODEL CACHE] Loaded model 'cfc36333-...:vae' (AutoEncoder) onto cuda device in 0.02s
HW Exception by GPU node-1 reason :GPU Hang

Critical finding: ComfyUI does NOT have this issue. We ran 3+ consecutive FLUX generations through ComfyUI on the exact same hardware, same PyTorch build, same GGUF model, same VAE file — all completed successfully with zero crashes and zero restarts. This confirms the issue is specific to InvokeAI's model cache / session management, not ROCm or the GPU hardware.

Additionally, the Z-Image pipeline in InvokeAI (z_image_latents_to_image.py) uses the same VAE model file and same FluxAutoEncoder class but does not crash on subsequent runs. Only the FLUX pipeline (flux_vae_decode.py) is affected.

Investigation summary

We tested extensively to isolate the cause:

| Attempted fix | Result |
| --- | --- |
| Increase `amdgpu.lockup_timeout` (10 s → 120 s → 600 s) | No effect — crash is instant, not a timeout |
| Convert VAE weights FP32 → FP16 | No effect on the FLUX path (did help Z-Image separately) |
| Patch `flux_vae_decode.py` to add `TorchDevice.empty_cache()` + `torch.inference_mode()` (matching `z_image_latents_to_image.py`) | No effect |
| Call `POST /api/v2/models/empty_model_cache` between runs | No effect |
| Different resolutions (512x512 through 2000x1120) | All crash on 2nd run |
| Different FLUX models (GGUF Q5_K_S, BnB NF4) | All crash on 2nd run |
| Restart the InvokeAI container between runs | Partially effective — usually works, but not guaranteed |
| Use ComfyUI instead | Fully effective — zero crashes across all tests |
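For reference, the cache-clear attempt from the table can be scripted as below. This is a hedged sketch: the endpoint path is the one we tested, but the base URL (a default local install on port 9090) is an assumption you may need to adjust.

```python
# Hedged sketch of the cache-clear request tried between runs.
# BASE_URL is an assumption (default local InvokeAI port); adjust as needed.
import urllib.request

BASE_URL = "http://localhost:9090"

def empty_model_cache_url(base_url: str = BASE_URL) -> str:
    """Build the URL for InvokeAI's model-cache-clear endpoint."""
    return base_url.rstrip("/") + "/api/v2/models/empty_model_cache"

def empty_model_cache(base_url: str = BASE_URL) -> bool:
    """POST to the endpoint; returns True on a 2xx response."""
    req = urllib.request.Request(empty_model_cache_url(base_url), method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return 200 <= resp.status < 300

print(empty_model_cache_url())
```

In our tests this call returned successfully, but the second FLUX generation still hung, so the cached model objects themselves do not appear to be the whole story.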

Code comparison

flux_vae_decode.py (crashes on 2nd run):

with vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    assert isinstance(vae, AutoEncoder)
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    img = vae.decode(latents)  # <-- GPU Hang on 2nd+ run

z_image_latents_to_image.py (never crashes, same VAE class):

with seamless_context, vae_info.model_on_device(working_mem_bytes=estimated_working_memory) as (_, vae):
    vae_dtype = next(iter(vae.parameters())).dtype
    latents = latents.to(device=TorchDevice.choose_torch_device(), dtype=vae_dtype)
    TorchDevice.empty_cache()
    with torch.inference_mode():
        if isinstance(vae, FluxAutoEncoder):
            img = vae.decode(latents)  # <-- works on repeated runs

Note: We patched flux_vae_decode.py to match the Z-Image pattern (adding empty_cache() + inference_mode()), but this did not fix the issue, suggesting the root cause is deeper — likely in how the model cache manages GPU state between FLUX pipeline runs.

What you expected to happen

Multiple consecutive FLUX generations should complete without GPU hangs, as they do in ComfyUI on the same hardware.

How to reproduce the problem

Steps to reproduce

  1. Run InvokeAI 6.12.0 in Docker with ROCm 7.2.1 on AMD gfx1151 (RDNA 3.5) GPU
  2. Set HSA_OVERRIDE_GFX_VERSION=11.5.1, HSA_ENABLE_SDMA=0
  3. Generate any image using a FLUX model (e.g., flux1-dev-Q5_K_S.gguf) at any resolution
  4. First generation succeeds
  5. Without restarting InvokeAI, generate a second FLUX image
  6. Second generation completes denoising but crashes with HW Exception: GPU Hang when flux_vae_decode runs

Note: This is 100% reproducible on our hardware. If testing on a different AMD GPU, the behavior may differ since gfx1151 is a new ISA target with limited ROCm testing.
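The two-run call pattern can be sketched as a standalone script. This is an illustration of the sequence only, not a crash reproducer: a trivial stand-in decoder replaces the real FluxAutoEncoder, and all names below are ours.

```python
# Standalone sketch of the two-run call pattern (illustration only: a tiny
# stand-in decoder replaces the real FluxAutoEncoder, so this will not
# trigger the hang by itself).
import torch

def decode_once(decoder: torch.nn.Module, latents: torch.Tensor,
                device: torch.device) -> torch.Tensor:
    """Mirror the per-generation steps: move latents to the device in the
    decoder's dtype, decode under inference_mode, return the image on CPU."""
    dtype = next(decoder.parameters()).dtype
    x = latents.to(device=device, dtype=dtype)
    with torch.inference_mode():
        img = decoder(x)  # with the real VAE, the second call here hangs
    return img.cpu()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
decoder = torch.nn.Conv2d(16, 3, kernel_size=1).to(device)  # stand-in "VAE"
latents = torch.randn(1, 16, 32, 32)

for run in (1, 2):  # first run succeeds; the second is where the hang occurs
    img = decode_once(decoder, latents, device)
    print(run, tuple(img.shape))
```

Running an analogous loop against the real `AutoEncoder` inside the InvokeAI process is what reliably hangs on the second iteration.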

Additional context

  • The GPU is an integrated APU (Radeon 8060S in GMKtec EVO-X2) using unified memory (48 GB UMA carve-out from 128 GB LPDDR5X)
  • gfx1151 is only supported in ROCm 7.2+ via the gfx11-generic ISA target
  • The crash is non-deterministic on "first after restart" — sometimes the first run also crashes, suggesting GPU state from a prior InvokeAI session may persist across container restarts
  • After repeated GPU hangs, the GPU's MES (Micro Engine Scheduler) enters an unrecoverable state requiring a full server reboot
  • ComfyUI uses the same rocm/pytorch:rocm7.2.1 base image and same model files, confirming the issue is not in ROCm, PyTorch, or the model weights
  • We suspect InvokeAI's persistent model cache leaves HIP/ROCm runtime state (streams, memory mappings, or kernel dispatch state) that corrupts on gfx1151 between FLUX VAE decode calls
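One further experiment consistent with the stale-state suspicion above would be to force a full device synchronize and allocator flush immediately before each decode. This is a hedged diagnostic sketch, not a proposed fix, and the helper name is ours:

```python
# Hedged diagnostic (not a fix): flush pending GPU work and cached allocator
# blocks before the decode, to test whether stale stream/allocator state from
# the previous run is involved.
import torch

def decode_with_fresh_device_state(decoder: torch.nn.Module,
                                   latents: torch.Tensor) -> torch.Tensor:
    if torch.cuda.is_available():  # on ROCm builds, torch.cuda maps to HIP
        torch.cuda.synchronize()   # wait for all queued kernels to finish
        torch.cuda.empty_cache()   # release cached allocator blocks
    with torch.inference_mode():
        return decoder(latents)

# Usage with a stand-in module (illustrative names only):
out = decode_with_fresh_device_state(torch.nn.Identity(), torch.randn(1, 16, 8, 8))
print(tuple(out.shape))
```

We have not yet run this exact variant inside the FLUX path; we note it here in case it helps a maintainer narrow down which piece of runtime state is going stale.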

Discord username

No response

Labels

bug (Something isn't working)