Mar 25 2023 AMD Radeon RX 480 crashing on Linux

Back in 2017, I bought a gaming machine with a Sapphire AMD RX 480 graphics card. As far as I can tell, the machine has always crashed regularly while running Linux (roughly once a day). The symptoms were a few graphics glitches spreading all over the screen, then the machine would freeze. Keyboard and mouse would not respond, but I could still connect to the machine over ssh. When I issued a poweroff or reboot command, the ssh session would close, but the machine would never shut down or reboot. It stayed stuck until I manually reset it or powered it off (with a long press on the power button). The RX 480 ran quite well on Microsoft Windows 10 (with occasional crashes in demanding games, though much less frequently). At some point, I moved the RX 480 to another machine. The behaviour was similar: Linux would crash roughly once a day, while Windows would be stable.

The crash

When looking at Linux kernel logs on my Ubuntu system, I would typically get a trace similar to the following:

$ journalctl -k -b -1
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x02084401 for process firefox pid 4298 thread firefox:cs0 pid 4619
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x05F00441
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C044001
Sep 17 11:00:06 amn kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 6, pasid 32769) at page 99615809, read from 'TC5' (0x54433500) (68)
Sep 17 11:00:16 amn kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Sep 17 11:00:16 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=24625, emitted seq=24627
Sep 17 11:00:16 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4298 thread firefox:cs0 pid 4619
Sep 17 11:00:16 amn kernel: amdgpu 0000:01:00.0: GPU reset begin!
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Sep 17 11:00:17 amn kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Sep 17 11:00:17 amn kernel: cp is busy, skip halt cp
Sep 17 11:00:17 amn kernel: rlc is busy, skip halt rlc
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: GPU pci config reset
Sep 17 11:00:17 amn kernel: amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
Sep 17 11:00:17 amn kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
Sep 17 11:00:17 amn kernel: [drm] VRAM is lost due to GPU reset!
Sep 17 11:00:18 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:19 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:20 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:21 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:22 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:23 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:24 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:25 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:26 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:27 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Sep 17 11:00:27 amn kernel: [drm:uvd_v6_0_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Sep 17 11:00:27 amn kernel: [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <uvd_v6_0> failed -1
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
Sep 17 11:00:28 amn kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: GPU reset(2) failed
Sep 17 11:00:28 amn kernel: amdgpu 0000:01:00.0: GPU reset end with ret = -110
Sep 17 11:00:38 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Sep 17 11:00:48 amn kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
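
The trace is long, but only a few lines mark the actual sequence of events: the initial fault, the ring timeout, and the outcome of the reset. A small filter along these lines can help spot them across boots. This is only a sketch: the function name amdgpu_key_events is made up for this example, and the patterns are taken from the trace above, so other driver versions may word these messages differently.

```shell
#!/usr/bin/env bash
# Keep only the amdgpu log lines that mark the key steps of a crash:
# the initial fault, the ring timeout, and the reset outcome.
# Reads kernel log lines on stdin. The patterns are assumptions based on
# the trace above and may need adjusting for other driver versions.
amdgpu_key_events() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'GPU fault detected|VM fault|ring [a-z0-9_.]+ timeout|GPU reset' || true
}

# Usage (kernel log of the previous boot):
#   journalctl -k -b -1 | amdgpu_key_events
```

An empty output for a given boot suggests that the boot ended without this kind of GPU fault.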

According to my notes from 2020, at some point a kernel update (with its amdgpu driver) to Linux 5.4.0 fixed the repeated crashes (or at least reduced their frequency significantly). But months later (probably following a more recent kernel update), the crashes came back. As of September 2022, I would estimate their frequency at roughly once a day. Suspending, then resuming the machine seems to (slightly) increase the probability of these crashes.

I ended up using my Intel integrated graphics while running Linux, and switched to the Radeon RX 480 when playing games on Windows. This setup worked for me for a while, even though swapping video cables when booting to the other OS wasn't the best experience.

The fix

This was until I found this discussion about a computer freezing with similar symptoms: screen glitching, applications becoming laggy, then the machine freezing. The author even took a picture of their crashed machine with their phone camera. It definitely looks like the crashes I see on my own hardware. Then djamayaofficial's answer hinted that this kind of bug may be related to power scaling issues, and provided a code snippet to manually disable the driver's power scaling feature and set it to fixed values.

I've since spent some time trying to understand how to interact with the amdgpu driver. There's a section dedicated to this topic in the official Linux kernel documentation. It describes how to use the amdgpu driver sysfs interface to read and/or update the GPU power states. The power_dpm_force_performance_level file allows selecting a preset for a set of power related parameters. By default it is set to auto. In this setting, the driver dynamically adapts the clock frequencies of several domains to the GPU usage. These domains are exposed through the pp_dpm_sclk (GPU core clock), pp_dpm_mclk (memory clock) and pp_dpm_pcie (PCIe link) files. Depending on the graphics card model, additional parameters may be updated. The low and high presets respectively set the clocks to the lowest and highest possible frequency for each clock domain. With power_dpm_force_performance_level set to low or high (I mostly tested my GPU with the low setting), the graphics card is much more stable. I have been running Linux for the past 6 months without any crash.
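
To try a preset by hand before making anything permanent, the file can be read and written directly. The helpers below are a minimal sketch. The names AMDGPU_DEV, amdgpu_get_level and amdgpu_set_level are made up for this example, and the path assumes the GPU is card0 (it may be card1 or another index on machines with several graphics devices). The kernel documentation lists more values than the three accepted here; the sketch is limited to the ones discussed in this post.

```shell
#!/usr/bin/env bash
# Assumption: the GPU is card0. Override AMDGPU_DEV if it is not.
AMDGPU_DEV=${AMDGPU_DEV:-/sys/class/drm/card0/device}

# Print the current performance level (auto, low, high, ...).
amdgpu_get_level() {
    cat "${AMDGPU_DEV}/power_dpm_force_performance_level"
}

# Write a new performance level. On a real system this needs root.
# Only the three presets discussed in this post are accepted here.
amdgpu_set_level() {
    local level=$1
    case "$level" in
        auto|low|high) ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    printf '%s\n' "$level" > "${AMDGPU_DEV}/power_dpm_force_performance_level"
}
```

From a root shell, running amdgpu_set_level low followed by amdgpu_get_level should print low. The value is not persistent: it goes back to the default at the next boot, and writing auto restores the dynamic behaviour.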

Here's what the content of the amdgpu sysfs files looks like on my machine:

$ ./amdgpu-dump.sh
/sys/class/drm/card0/device/power_dpm_force_performance_level
low

/sys/class/drm/card0/device/pp_dpm_sclk
0: 300Mhz *
1: 608Mhz
2: 910Mhz
3: 1077Mhz
4: 1145Mhz
5: 1191Mhz
6: 1236Mhz
7: 1266Mhz

/sys/class/drm/card0/device/pp_dpm_mclk
0: 300Mhz *
1: 2000Mhz

/sys/class/drm/card0/device/pp_dpm_pcie
0: 2.5GT/s, x8 *
1: 8.0GT/s, x16

/sys/class/drm/card0/device/gpu_busy_percent
0

/sys/class/drm/card0/device/mem_busy_percent
1

/sys/class/drm/card0/device/pp_power_profile_mode
NUM        MODE_NAME     SCLK_UP_HYST   SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL     MCLK_UP_HYST   MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
  0   BOOTUP_DEFAULT:        -                -                -                -                -                -
  1 3D_FULL_SCREEN *:        0              100               30                0              100               10
  2     POWER_SAVING:       10                0               30                -                -                -
  3            VIDEO:        -                -                -               10               16               31
  4               VR:        0               11               50                0              100               10
  5          COMPUTE:        0                5               30                -                -                -
  6           CUSTOM:        -                -                -                -                -                -

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      750.00 mV
fan1:         782 RPM  (min =    0 RPM, max = 5200 RPM)
edge:         +37.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        8.20 W  (cap = 110.00 W)

(The amdgpu-dump.sh script is available in my amdgpu-tools GitHub repository)
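
For readers who only want the gist, a dump of this kind can be approximated in a few lines. The following is a rough sketch, not the actual amdgpu-dump.sh script: the function name amdgpu_dump is made up, the file list is simply the one visible in the output above, and the sensor readings at the end of that output (which look like the output of the lm-sensors sensors command) are left out.

```shell
#!/usr/bin/env bash
# Print each amdgpu sysfs file path followed by its content.
# Assumption: the files live under /sys/class/drm/card0/device unless a
# different directory is passed as first argument.
amdgpu_dump() {
    local dev=${1:-/sys/class/drm/card0/device}
    local name
    for name in power_dpm_force_performance_level pp_dpm_sclk pp_dpm_mclk \
                pp_dpm_pcie gpu_busy_percent mem_busy_percent \
                pp_power_profile_mode; do
        # Not every file exists on every GPU generation: skip missing ones.
        [ -r "${dev}/${name}" ] || continue
        printf '%s\n' "${dev}/${name}"
        cat "${dev}/${name}"
        echo
    done
}
```

Reading these files does not require root, so amdgpu_dump can be called from a regular user shell.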

Boot time setup

Here's my setup to set the GPU's clock frequencies to fixed values at startup. It has put an end to my crashes so far.

A shell script to set a GPU preset:

$ cat /usr/local/bin/amdgpu-set-dpm.sh
#!/usr/bin/bash
set -eu

# Examples of values to be written in power_dpm_force_performance_level: auto, low, high
# See https://www.kernel.org/doc/html/latest/gpu/amdgpu/thermal.html#power-dpm-force-performance-level

DPM_SETUP_FILE=/sys/class/drm/card0/device/power_dpm_force_performance_level
PERF_LEVEL=${1:-low}

echo "Updating amdgpu device power management setting (power_dpm_force_performance_level)"

# Check that the file exists before trying to read it
if [ ! -f "${DPM_SETUP_FILE}" ]; then
    echo "*ERROR* file \"${DPM_SETUP_FILE}\" is missing. Exiting." >&2
    exit 1
fi

echo "Initial value of \"${DPM_SETUP_FILE}\" is $(cat "${DPM_SETUP_FILE}")"
echo "Setting \"${DPM_SETUP_FILE}\" to ${PERF_LEVEL}"

echo "${PERF_LEVEL}" > "${DPM_SETUP_FILE}"
echo "New value of \"${DPM_SETUP_FILE}\" is $(cat "${DPM_SETUP_FILE}")"

A systemd service file:

$ cat /etc/systemd/system/amdgpu-set-dpm.service
[Unit]
Description=amdgpu device power management tweaking

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/amdgpu-set-dpm.sh low

[Install]
WantedBy=default.target

And a one-time command to enable the service previously defined:

$ sudo systemctl enable amdgpu-set-dpm.service
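
Enabling the service only takes effect at the next boot. After a reboot, systemctl is-active amdgpu-set-dpm.service should report active, and the clocks should stay pinned to a single level. The helper below is a small sketch to check the latter. The function name amdgpu_active_state is made up for this example, and it relies on the trailing "*" marker seen in the pp_dpm_* files earlier in this post.

```shell
#!/usr/bin/env bash
# Print the active level of a pp_dpm_* file, i.e. the line that the
# driver marks with a trailing "*", with the marker stripped.
amdgpu_active_state() {
    local file=$1
    sed -n 's/ *\*$//p' "$file"
}

# Usage (assuming the GPU is card0):
#   amdgpu_active_state /sys/class/drm/card0/device/pp_dpm_sclk
#   amdgpu_active_state /sys/class/drm/card0/device/pp_dpm_mclk
```

With the low preset, I would expect both calls to keep reporting level 0 (300Mhz on my card), whatever the GPU load.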