Stable Diffusion with an AMD Instinct MI25
Thomas Büttner
Recently, while browsing some forums, I stumbled across a post about how usable the AMD Instinct MI25 is for Stable Diffusion, thanks to its 16 GB of HBM2 memory and an affordable price of, at the time, ~90 USD. Needless to say, I wanted one of these cards, as this seemed like a fun weekend project. Thanks to German eBay's ludicrously insane prices (899 € for used MI25 cards… 🤦), I imported two cards from the US via eBay, two just in case I managed to break one (*foreshadowing intensifies*).
And a couple of weeks later…
The Pain begins
Stable Diffusion normally leverages CUDA, which we don't have on AMD. Instead there is HIP, which should work as a drop-in replacement as long as PyTorch has been built for the correct ROCm architectures.
As I later found out, the Vega architecture of my Vega 64 and the MI25 has been deprecated in favor of CDNA… so better to stick to a known-working version of ROCm.
First try on Fedora Workstation
So after installing some of the ROCm suite from the official Fedora repos (dnf install -y rocm*), running rocm-smi looks pretty OK:
```
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
0    34.0c  16.0W   852Mhz  945Mhz  14.51%  auto  264.0W  26%    4%
1    28.0c  4.0W    300Mhz  945Mhz  9.41%   auto  110.0W  17%    0%
================================================================================
==============================End of ROCm SMI Log ==============================
```
My primary GPU, the Vega 64, and the new Instinct accelerator have both been detected, so far so good. But after cloning the webui repo and starting it, it only complains about not being able to find a suitable CUDA device…
OK, OK, I understand, that's what you get for not following the guide from the forum and simply trying to wing it.
Let's skip all of my VM pain…
After many, many hours, even days, of fiddling with virtual machines and so much pain with PCIe passthrough, PCIe ACS, PCIe atomics, VFIO, Xen and guest awfulness, I'm going to spare you all that pain, suffering and Ubuntu. During that long and painful troubleshooting process I tried many different things, including different VBIOSes: the original 110 W one, the MI25 220 W one and the WX 9100 VBIOS… with which I managed to kill one MI25!
My working Setup
What ultimately worked was installing ROCm from the Fedora repos (not AMD's) and using a version of Python for which the PyTorch ROCm wheels come precompiled (Python 3.9 at that time).
For that I used conda environments:
```shell
conda create --name sd python=3.9
conda activate sd
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
pip install -r requirements_versions.txt --extra-index-url https://download.pytorch.org/whl/rocm5.4.2
```
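With the wheels installed, a quick way to confirm that this PyTorch build actually targets ROCm is to poke at it from Python. A small sanity-check sketch (assuming the rocm5.4.2 wheels from above; on ROCm builds, PyTorch reuses the `torch.cuda` API for HIP devices):

```python
# Sanity check: is this a ROCm build, and does it see any GPUs?
# On ROCm wheels, "cuda" in the torch API actually means HIP.
import torch

print(torch.__version__)          # ROCm wheels carry a "+rocm..." suffix
print(torch.version.hip)          # HIP version string; None on CPU/CUDA builds
print(torch.cuda.is_available())  # True once the ROCm stack can see a GPU
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

If `torch.version.hip` prints `None`, pip pulled the default CPU/CUDA wheels instead of the ROCm index.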
Checking again with rocminfo whether the GPU is detected by the ROCm stack:
```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
#### many unrelated lines later
*******
Agent 4
*******
  Name:                    gfx900
  Uuid:                    GPU-0213f25e34d631a4
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    3
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
  Chip ID:                 26720(0x6860)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1500
  BDFID:                   1792
  Internal Node ID:        3
  Compute Unit:            64
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx900:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```
The devices are numbered "Agent 1", "Agent 2" and so on, but the agents are not separated by type. So in my case the MI25, "Agent 4", is CUDA device 1, which I pass to the webui as --device-id 1:
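Since the agent numbering mixes CPUs and GPUs while PyTorch counts only GPUs, the mapping can be sketched like this (a hypothetical helper purely for illustration; the agent names and the two-CPU layout are assumptions about my machine):

```python
# rocminfo lists every HSA agent (CPUs first, then GPUs), but PyTorch's
# device ids count GPUs only. Hypothetical helper to illustrate the mapping.
def gpu_device_ids(agent_names):
    """Map 1-based rocminfo agent numbers to 0-based torch device ids."""
    mapping = {}
    next_id = 0
    for agent_no, name in enumerate(agent_names, start=1):
        if name.startswith("gfx"):  # GPU agents report their ISA, e.g. gfx900
            mapping[agent_no] = next_id
            next_id += 1
    return mapping

# Assumed layout here: two CPU agents, then Vega 64 and MI25 (both gfx900):
print(gpu_device_ids(["CPU", "CPU", "gfx900", "gfx900"]))  # {3: 0, 4: 1}
```

So "Agent 4" ends up as device 1, which is exactly what --device-id expects.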
```shell
env PYTORCH_HIP_ALLOC_CONF='max_split_size_mb:256' \
    TORCH_COMMAND='pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2' \
    python launch.py \
    --device-id 1 \
    --enable-console-prompts \
    --use-textbox-seed \
    --api --listen --port 7860 \
    --enable-insecure-extension-access \
    --disable-nan-check \
    --disable-console-progressbars
```
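Because the webui is started with --api --listen, it also exposes a REST API on port 7860. A minimal txt2img request might look like this (a sketch only; the host, prompt and parameter values are made up, and the endpoint path follows the webui's built-in API):

```python
# Sketch of a txt2img request against the webui's REST API.
# The URL and payload values are placeholders; adjust for your host.
import json
import urllib.request

payload = {
    "prompt": "a photo of an astronaut riding a horse",
    "steps": 20,
    "width": 512,
    "height": 512,
}
req = urllib.request.Request(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running webui; generated images come back base64-encoded
# in the "images" field of the JSON response.
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(len(result["images"]))
```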
Results
So now on to some fun results from Stable Diffusion.