
I'm running Arch Linux, kernel 5.17.3 (though this issue has been happening across many kernel versions). Every couple of days, the system freezes completely at random. The kernel logs vary, but they most commonly look like this:

...
Apr 02 05:04:20 starship kernel: BUG: scheduling while atomic: swapper/0/0/0x7fff0001
Apr 02 05:04:20 starship kernel: Modules linked in: tun uinput btrfs blake2b_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod rfcomm snd_seq_dummy snd_hrtimer snd_seq hid_logitech_hidpp xt_CHECKSUM xt_MASQUERADE nft_chain_nat nf_nat bridge stp llc cmac algif_hash algif_skcipher af_alg bnep ip6t_REJECT nf_reject_ipv6 xt_hl mousedev hid_logitech_dj ip6_tables joydev ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_comment xt_multiport nft_limit btusb btrtl btbcm xt_limit btintel xt_addrtype btmtk xt_tcpudp snd_usb_audio bluetooth xt_conntrack nf_conntrack snd_usbmidi_lib nf_defrag_ipv6 snd_rawmidi nf_defrag_ipv4 snd_seq_device usbhid ecdh_generic nft_compat nf_tables libcrc32c nfnetlink i2c_dev i2c_smbus nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) iwlmvm nvidia(POE) mac80211 intel_rapl_msr intel_rapl_common libarc4 edac_mce_amd eeepc_wmi kvm_amd iwlwifi asus_wmi sparse_keymap kvm iwlmei platform_profile irqbypass crct10dif_pclmul crc32_pclmul video wmi_bmof
Apr 02 05:04:20 starship kernel:  mxm_wmi asus_wmi_sensors ghash_clmulni_intel cfg80211 aesni_intel crypto_simd snd_hda_codec_realtek cryptd rfkill snd_hda_codec_generic vfat sp5100_tco fat rapl ledtrig_audio pcspkr snd_hda_codec_hdmi ccp i2c_piix4 k10temp igb mei e1000e tpm_crb dca tpm_tis tpm_tis_core snd_hda_intel tpm snd_intel_dspcfg gpio_amdpt rng_core snd_intel_sdw_acpi gpio_generic pinctrl_amd snd_hda_codec snd_hda_core snd_hwdep wmi mac_hid acpi_cpufreq snd_aloop snd_pcm snd_timer snd soundcore v4l2loopback_dc(OE) videodev mc crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_pci_renesas
Apr 02 05:04:20 starship kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           OE     5.17.1-arch1-1 #1 0ea933cb6bfe82a8dc16ab834a4bccdd297f98b7
Apr 02 05:04:20 starship kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 4801 03/02/2022
Apr 02 05:04:20 starship kernel: Call Trace:
Apr 02 05:04:20 starship kernel:  <TASK>
Apr 02 05:04:20 starship kernel:  dump_stack_lvl+0x48/0x5e
Apr 02 05:04:20 starship kernel:  __schedule_bug.cold+0x4c/0x58
Apr 02 05:04:20 starship kernel:  __schedule+0xd55/0x10a0
Apr 02 05:04:20 starship kernel:  ? hrtimer_start_range_ns+0x272/0x350
Apr 02 05:04:20 starship kernel:  schedule_idle+0x26/0x40
Apr 02 05:04:20 starship kernel:  do_idle+0x16d/0x260
Apr 02 05:04:20 starship kernel:  cpu_startup_entry+0x19/0x20
Apr 02 05:04:20 starship kernel:  start_kernel+0x9a2/0x9c9
Apr 02 05:04:20 starship kernel:  secondary_startup_64_no_verify+0xd5/0xdb
Apr 02 05:04:20 starship kernel:  </TASK>
Apr 02 05:04:20 starship kernel: [UFW BLOCK] IN=enp10s0 OUT= MAC=04:d4:c4:55:3e:fc:98:09:cf:93:64:22:08:00 SRC=192.168.4.7 DST=192.168.4.2 LEN=1909 TOS=0x00 PREC=0x00 TTL=64 ID=44904 PROTO=UDP SPT=40665 DPT=1716 LEN=1889
...

This sometimes appears near the end of the logs, but it is also sometimes anywhere from several to several thousand lines up, before lots of complaints from systemd. Is this likely to be the cause of my crashes, or should I be looking for something else? If it is the likely cause, how should I go about debugging it? I suspect a badly written program, driver, or kernel module on my machine, but I don't know where to start to figure out which one.
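In case it helps with comparing crashes, here is a minimal sketch of how the "scheduling while atomic" lines could be pulled out of saved kernel logs. It assumes the logs have been exported as plain text, one message per line, and that the tail of the message is the task name, the PID, and the preempt count, so that `swapper/0/0/0x7fff0001` reads as task `swapper/0`, PID 0, preempt count `0x7fff0001`. The function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the tail of a line such as:
#   kernel: BUG: scheduling while atomic: swapper/0/0/0x7fff0001
# The task name may itself contain '/', so it is matched greedily and the
# last two '/'-separated fields are taken as PID and preempt count.
BUG_RE = re.compile(
    r"BUG: scheduling while atomic: "
    r"(?P<comm>.+)/(?P<pid>\d+)/(?P<preempt>0x[0-9a-fA-F]+)"
)


def parse_atomic_bugs(lines):
    """Return (task name, pid, preempt count) for every matching log line."""
    hits = []
    for line in lines:
        match = BUG_RE.search(line)
        if match:
            hits.append(
                (
                    match.group("comm"),
                    int(match.group("pid")),
                    int(match.group("preempt"), 16),
                )
            )
    return hits


def count_by_task(hits):
    """Count how often each task name shows up in the parsed hits."""
    return Counter(comm for comm, _pid, _preempt in hits)
```

Running something like this over the output of `journalctl -k -b -1` (kernel messages from the previous boot) for each crash would show whether the same task is reported every time. It only summarizes the logs and does not by itself identify the faulty module.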

If I'm using the computer when it happens, applications usually freeze first, almost immediately followed by the desktop environment (Cinnamon), but often I can still move my mouse around for ~30 seconds before it fully hangs and I have to hard reset. If I'm not at the computer, it just won't respond to pings, or I'll get back and it'll be "running" but won't wake up from sleep/screensaver/whatever the DE does when it's idle, and I have to hard reset it.

Things I've tried (many of these from hunches/advice that it might be a hardware issue):

  • Updating the BIOS
  • Disabling CPU idle states (after finding this could be a common issue with Ryzen CPUs/chipsets)
  • Downclocking RAM (from 3600MHz advertised speed to 3200MHz which is supposedly the mobo's supported speed)
  • Stress testing the CPU (using mprime) and RAM (using Memtest86 since Memtest86+ wouldn't boot), no errors found

Could this still be a hardware issue? If not, where should I start looking for software issues?

I can provide more info if it helps. Also, if there's a better place to ask this question, please let me know that too. Thanks!
