AMD Radeon RX 480 crashing on Linux
Back in 2017, I bought a gaming machine with a Sapphire AMD RX 480
graphics card. As far as I can tell, while running Linux, the machine
has always crashed regularly (I would say once a day or so).
Symptoms were a few graphics glitches, spreading all over the screen,
then my machine would freeze. Keyboard and mouse would not answer,
but I would still be able to connect to the machine over ssh. When
issuing a poweroff or reboot command, the ssh session would close,
but the machine would never shut down or reboot. It would be stuck
until I manually reset or powered off the machine (by a long press
on the power button). The RX 480 would run quite well on Microsoft
Windows 10 (with occasional crashes while playing demanding games,
though much less frequently). At some point, I moved the RX 480 to
another machine. The behaviour was similar. Linux would crash
roughly once a day, while Windows would be stable.
The crash
When looking at Linux kernel logs on my Ubuntu system, I would typically get a trace similar to the following:
$ journalctl -k -b -1
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x02084401 for process firefox pid 4298 thread firefox:cs0 pid 4619
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x05F00441
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C044001
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 6, pasid 32769) at page 99615809, read from 'TC5' (0x54433500) (68)
Sep 17 11:00:16 amn kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Sep 17 11:00:16 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=24625, emitted seq=24627
Sep 17 11:00:16 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4298 thread firefox:cs0 pid 4619
Sep 17 11:00:16 amn kernel: amdgpu 0000:01:00.0: GPU reset begin!
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 17 11:00:17 amn kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Sep 17 11:00:17 amn kernel: cp is busy, skip halt cp
Sep 17 11:00:17 amn kernel: rlc is busy, skip halt rlc
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: GPU pci config reset
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
Sep 17 11:00:17 amn kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Sep 17 11:00:17 amn kernel: [drm] VRAM is lost due to GPU reset!
Sep 17 11:00:18 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:19 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:20 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:21 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:22 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:23 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:24 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:25 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:26 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:27 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:27 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Sep 17 11:00:27 amn kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Sep 17 11:00:28 amn kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: GPU reset(2) failed
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: GPU reset end with ret = -110
Sep 17 11:00:38 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Sep 17 11:00:48 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
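To get a feel for how often this happens, the kernel log of a previous boot can be scanned for the messages shown above. The following is only a minimal sketch: the function names are made up for this example, and the matched strings are simply the ones from my trace, so they may differ with other kernel versions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not part of any existing tool).

# Count GPU reset attempts in kernel log text read from stdin.
count_gpu_resets() {
    # grep -c prints 0 and exits with status 1 when nothing matches;
    # keep the printed 0 and ignore the failure status.
    grep -c 'GPU reset begin!' || true
}

# Print the names of the processes blamed in "GPU fault detected" lines
# read from stdin (one name per line, without duplicates).
faulting_processes() {
    sed -n 's/.*GPU fault detected:.* for process \([^ ]*\) pid .*/\1/p' | sort -u
}

# Usage (previous boot):
#   journalctl -k -b -1 --no-pager | count_gpu_resets
#   journalctl -k -b -1 --no-pager | faulting_processes
```

Both functions read from stdin, so they work just as well on a saved log file as on live journalctl output.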
According to my notes from 2020, at some point a kernel update (with its amdgpu driver) to Linux 5.4.0 fixed the repeated crashes (or at least reduced their frequency significantly). But months later (probably following a more recent kernel update), the crashes came back. As of September 2022, I would estimate the frequency of these crashes to be roughly once a day. Suspending, then resuming the machine seems to (slightly) increase the probability of these crashes.
I ended up using my Intel integrated graphics while running Linux, and switching to the Radeon RX 480 when playing games on Windows. This setup worked for me for a while, even though swapping video cables when booting into the other OS wasn't the best experience.
The fix
This was until I found this discussion about a computer freezing with similar symptoms: screen glitching, applications becoming laggy, then the machine freezing. The author even took a picture of their crashed machine with their phone camera. This definitely looks like the crashes I see on my own hardware. Then djamayaofficial's answer hinted that this kind of bug may be related to power scaling issues, and provided a code snippet to manually disable the driver's power scaling feature and set the clocks to fixed values.
I've since spent some time trying to understand how to interact
with the amdgpu driver. There's a section dedicated to this topic in
the official Linux kernel documentation. It describes how to use
the amdgpu driver sysfs interface to update and/or read the GPU
power states. The power_dpm_force_performance_level file allows
selecting a preset for a set of power-related parameters. By default
it is set to auto. In this setting, the driver dynamically
adapts, depending on the GPU usage, the clock frequencies of different
domains: pp_dpm_sclk, pp_dpm_mclk and pp_dpm_pcie. Depending on
the graphics card model, additional parameters may be updated. The
low and high presets respectively set the clocks to the lowest and
highest possible frequency for each clock domain. With
power_dpm_force_performance_level set to low or high (I mostly
tested my GPU with the low setting), the graphics card is much more
stable. I have been running Linux for the past 6 months without any
crash.
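Trying a preset by hand only takes a write to that sysfs file. Here is a minimal sketch of how this could be wrapped in a function. The name set_perf_level is made up for this example, the list of accepted values is the one I read in the kernel documentation, and the card0 path in the usage comment is an assumption about which card is the Radeon.

```shell
#!/usr/bin/env bash
# Sketch: write a performance level to a sysfs file after checking it against
# the values listed in the kernel documentation.
# set_perf_level is a hypothetical name, not an existing tool.

set_perf_level() {
    local file=$1 level=$2
    case "$level" in
        auto|low|high|manual|profile_standard|profile_min_sclk|profile_min_mclk|profile_peak) ;;
        *) echo "unknown performance level: $level" >&2; return 1 ;;
    esac
    # The sysfs file is only writable by root.
    [ -w "$file" ] || { echo "cannot write to $file (need root?)" >&2; return 1; }
    echo "$level" > "$file"
}

# Usage (as root; card0 is an assumption):
#   set_perf_level /sys/class/drm/card0/device/power_dpm_force_performance_level low
```

A value written this way does not survive a reboot, which makes it a low-risk way to experiment before making anything permanent.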
Here's what the content of the amdgpu sysfs files looks like:
$ ./amdgpu-dump.sh
/sys/class/drm/card0/device/power_dpm_force_performance_level
low
/sys/class/drm/card0/device/pp_dpm_sclk
0: 300Mhz *
1: 608Mhz
2: 910Mhz
3: 1077Mhz
4: 1145Mhz
5: 1191Mhz
6: 1236Mhz
7: 1266Mhz
/sys/class/drm/card0/device/pp_dpm_mclk
0: 300Mhz *
1: 2000Mhz
/sys/class/drm/card0/device/pp_dpm_pcie
0: 2.5GT/s, x8 *
1: 8.0GT/s, x16
/sys/class/drm/card0/device/gpu_busy_percent
0
/sys/class/drm/card0/device/mem_busy_percent
1
/sys/class/drm/card0/device/pp_power_profile_mode
NUM MODE_NAME SCLK_UP_HYST SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL MCLK_UP_HYST MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
0 BOOTUP_DEFAULT: - - - - - -
1 3D_FULL_SCREEN *: 0 100 30 0 100 10
2 POWER_SAVING: 10 0 30 - - -
3 VIDEO: - - - 10 16 31
4 VR: 0 11 50 0 100 10
5 COMPUTE: 0 5 30 - - -
6 CUSTOM: - - - - - -
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: 750.00 mV
fan1: 782 RPM (min = 0 RPM, max = 5200 RPM)
edge: +37.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 8.20 W (cap = 110.00 W)
(The amdgpu-dump.sh script is available on my amdgpu-tools GitHub repository)
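For readers who just want the gist without fetching the repository, here is a rough sketch of the idea. This is not the actual amdgpu-dump.sh: both function names are made up for this example, and the parsing assumes the "N: value *" layout shown in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual amdgpu-dump.sh.

# Print the path of each readable file, followed by its content.
dump_files() {
    local f
    for f in "$@"; do
        [ -r "$f" ] || continue
        echo "$f"
        cat "$f"
    done
}

# From pp_dpm_* style text on stdin, print the value of the active state,
# i.e. the line marked with a trailing "*".
active_state() {
    sed -n 's/^[0-9]*: *\(.*[^ ]\) *\*[ ]*$/\1/p'
}

# Usage (card0 is an assumption):
#   dump_files /sys/class/drm/card0/device/pp_dpm_sclk
#   active_state < /sys/class/drm/card0/device/pp_dpm_sclk
```

With the low preset applied, active_state would be expected to report the first entry of each clock domain, matching the starred lines in the dump above.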
Boot time setup
Here's my setup to set the GPU's clock frequencies to fixed values at startup, and definitively fix my crashes.
A shell script to set a GPU preset:
$ cat /usr/local/bin/amdgpu-set-dpm.sh
#!/usr/bin/bash
set -eu
# Examples of values to be written in power_dpm_force_performance_level: auto, low, high
# See https://www.kernel.org/doc/html/latest/gpu/amdgpu/thermal.html#power-dpm-force-performance-level
DPM_SETUP_FILE=/sys/class/drm/card0/device/power_dpm_force_performance_level
# Preset to apply; defaults to "low" when no argument is given
PERF_LEVEL=${1:-low}
echo "Updating amdgpu device power management setting (power_dpm_force_performance_level)"
# Check that the sysfs file exists before reading or writing it
if [ ! -f "${DPM_SETUP_FILE}" ]; then
    echo "*ERROR* file \"${DPM_SETUP_FILE}\" is missing. Exiting." >&2
    exit 1
fi
echo "Initial value of \"${DPM_SETUP_FILE}\" is $(cat "${DPM_SETUP_FILE}")"
echo "Setting \"${DPM_SETUP_FILE}\" to ${PERF_LEVEL}"
echo "${PERF_LEVEL}" > "${DPM_SETUP_FILE}"
echo "New value of \"${DPM_SETUP_FILE}\" is $(cat "${DPM_SETUP_FILE}")"
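The script hard-codes card0, which is correct on my machine but may not be on a system with several GPUs (an integrated one, for instance). As a hedged alternative, the sketch below lists every card that exposes the file instead of assuming its index. The function name find_dpm_files is made up for this example, and the optional root argument exists only so the function can be tried against a directory other than /sys/class/drm.

```shell
#!/usr/bin/env bash
# Sketch: list the performance-level files of every DRM card that exposes one.
# find_dpm_files is a hypothetical helper, not part of the setup above.

find_dpm_files() {
    local root=${1:-/sys/class/drm} f
    for f in "$root"/card*/device/power_dpm_force_performance_level; do
        # When nothing matches, the glob stays literal and the -f test filters it out.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Usage:
#   find_dpm_files
```

Cards whose driver does not provide power_dpm_force_performance_level are simply skipped, so the output can be used directly as the list of files to write to.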
A systemd service file:
$ cat /etc/systemd/system/amdgpu-set-dpm.service
[Unit]
Description=amdgpu device power management tweaking
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/amdgpu-set-dpm.sh low
[Install]
WantedBy=default.target
And a one-time command to enable the service previously defined:
$ sudo systemctl enable amdgpu-set-dpm.service
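Enabling the service only takes effect at the next boot, so it is worth checking that the preset is really applied. Below is a minimal sketch of such a check. The function name check_perf_level is made up for this example, and the default path assumes the same card0 as the script above.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the sysfs file holds the expected level.
# check_perf_level is a hypothetical helper name.

check_perf_level() {
    local expected=$1
    local file=${2:-/sys/class/drm/card0/device/power_dpm_force_performance_level}
    local actual
    actual=$(cat "$file") || return 1
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file is $actual"
    else
        echo "MISMATCH: $file is $actual, expected $expected" >&2
        return 1
    fi
}

# Usage:
#   sudo systemctl start amdgpu-set-dpm.service   # apply now, without rebooting
#   systemctl status amdgpu-set-dpm.service       # shows the script's echo lines
#   check_perf_level low
```

Running the same check after a reboot confirms that the service did its job at startup.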