Stable Diffusion with an AMD Instinct MI25
Thomas Büttner
Recently, while browsing some forums, I stumbled across a post about how usable the AMD Instinct MI25 is for Stable Diffusion, thanks to its 16 GB of HBM2 memory and an affordable price of, at the time, ~90 USD. Needless to say, I wanted one of these cards, as this seemed like a fun weekend project. Thanks to German eBay's ludicrously insane prices (899 € for used MI25 cards… 🤦), I imported two cards from the US via eBay, two just in case I managed to break one (*foreshadowing intensifies*).
And a couple of weeks later…
The Pain begins
Stable Diffusion normally leverages CUDA, which we don't have on AMD. Instead there is HIP, which should work as a drop-in replacement as long as PyTorch has been built for the correct ROCm architectures.
As I later found out, the Vega architecture of my Vega 64 and the MI25 has been deprecated in favor of CDNA… so better to stick to a known-working version of ROCm.
First try on Fedora Workstation
So after installing some of the ROCm suite from the official Fedora repos (dnf install -y rocm*), running rocm-smi looks pretty OK:
```
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
0    34.0c  16.0W   852Mhz  945Mhz  14.51%  auto  264.0W  26%    4%
1    28.0c  4.0W    300Mhz  945Mhz  9.41%   auto  110.0W  17%    0%
================================================================================
==============================End of ROCm SMI Log ==============================
```
My primary GPU, the Vega 64, and the new Instinct accelerator have both been detected, so far so good. But after cloning the webui repo and starting it, it only complains about not being able to find a suitable CUDA device…
OK, OK, I understand, that's what you get for not following the guide from the forum and simply trying to wing it.
Let's skip all of my VM pain…
After many, many hours, even days, of fiddling with virtual machines and so much pain with PCIe passthrough, PCIe ACS, PCIe atomics, VFIO, Xen and guest awfulness, I'm going to spare you all that pain, suffering and Ubuntu. During that long and painful troubleshooting process I tried many different things, including different VBIOSes: the original 110 W one, the MI25 220 W one and the WX 9100 VBIOS… with which I managed to kill one MI25!
My working Setup
What ultimately worked was installing ROCm from the Fedora repos (not AMD's) and using a version of Python for which the PyTorch ROCm wheels come precompiled (Python 3.9 at that time).
For that I used conda environments:
```shell
conda create --name sd python=3.9
conda activate sd
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -r requirements_versions.txt --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
```
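With the wheels installed, a quick way to confirm that this PyTorch build actually targets ROCm is to poke at it from Python. A small sanity-check sketch (assuming the rocm5.4.2 wheels from above; on ROCm builds, PyTorch reuses the `torch.cuda` API for HIP devices):

```python
# Sanity check: is this a ROCm build, and does it see any GPUs?
# On ROCm wheels, "cuda" in the torch API actually means HIP.
import torch

print(torch.__version__)          # ROCm wheels carry a "+rocm..." suffix
print(torch.version.hip)          # HIP version string; None on CPU/CUDA builds
print(torch.cuda.is_available())  # True once the ROCm stack can see a GPU
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

If `torch.version.hip` prints `None`, pip pulled the default CPU/CUDA wheels instead of the ROCm index.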
Checking again with rocminfo whether the GPU is detected by the ROCm stack:
```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
#### many unrelated lines later
*******
Agent 4
*******
  Name:                    gfx900
  Uuid:                    GPU-0213f25e34d631a4
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
  Chip ID:                 26720(0x6860)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1500
  BDFID:                   1792
  Internal Node ID:        3
  Compute Unit:            64
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx900:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```
The devices are numbered "Agent 1", "Agent 2" and so on, but the agents are not separated by type. So in my case the MI25, "Agent 4", is CUDA device 1, which I pass to the webui as --device-id 1:
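Since the agent numbering mixes CPUs and GPUs while PyTorch counts only GPUs, the mapping can be sketched like this (a hypothetical helper purely for illustration; the agent names and the two-CPU layout are assumptions about my machine):

```python
# rocminfo lists every HSA agent (CPUs first, then GPUs), but PyTorch's
# device ids count GPUs only. Hypothetical helper to illustrate the mapping.
def gpu_device_ids(agent_names):
    """Map 1-based rocminfo agent numbers to 0-based torch device ids."""
    mapping = {}
    next_id = 0
    for agent_no, name in enumerate(agent_names, start=1):
        if name.startswith("gfx"):  # GPU agents report their ISA, e.g. gfx900
            mapping[agent_no] = next_id
            next_id += 1
    return mapping

# Assumed layout here: two CPU agents, then Vega 64 and MI25 (both gfx900):
print(gpu_device_ids(["CPU", "CPU", "gfx900", "gfx900"]))  # {3: 0, 4: 1}
```

So "Agent 4" ends up as device 1, which is exactly what --device-id expects.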
```shell
env PYTORCH_HIP_ALLOC_CONF='max_split_size_mb:256' \
    TORCH_COMMAND='pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2' \
    python launch.py \
    --device-id 1 \
    --enable-console-prompts \
    --use-textbox-seed \
    --api --listen --port 7860 \
    --enable-insecure-extension-access \
    --disable-nan-check \
    --disable-console-progressbars
```
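Because the webui is started with --api --listen, it also exposes a REST API on port 7860. A minimal txt2img request might look like this (a sketch only; the host, prompt and parameter values are made up, and the endpoint path follows the webui's built-in API):

```python
# Sketch of a txt2img request against the webui's REST API.
# The URL and payload values are placeholders; adjust for your host.
import json
import urllib.request

payload = {
    "prompt": "a photo of an astronaut riding a horse",
    "steps": 20,
    "width": 512,
    "height": 512,
}
req = urllib.request.Request(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running webui; generated images come back base64-encoded
# in the "images" field of the JSON response.
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(len(result["images"]))
```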
Results
So now on to some fun results from Stable Diffusion.