The following kernel log output appears on my Dell XR12. I am using an Intel ACC100 acceleration card, but I don't understand the error. What is happening here? Any help is appreciated!

[Thu Sep  7 08:43:27 2023] loop10: detected capacity change from 0 to 8
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]: event severity: recoverable
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:  Error 0, type: fatal
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   section_type: PCIe error
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   port_type: 4, root port
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   version: 3.0
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   command: 0x0547, status: 0x4010
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   device_id: 0000:50:02.0
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   slot: 2
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   secondary_bus: 0x51
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x347a
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   class_code: 060400
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x01310000
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   aer_uncor_severity: 0x044ef030
[Thu Sep  7 08:44:23 2023] {1}[Hardware Error]:   TLP Header: ffffffff ffffffff ffffffff ffffffff
[Thu Sep  7 08:44:23 2023] pcieport 0000:50:02.0: AER: aer_status: 0x00000020, aer_mask: 0x01310000
[Thu Sep  7 08:44:23 2023] pcieport 0000:50:02.0:    [ 5] SDES                   (First)
[Thu Sep  7 08:44:23 2023] pcieport 0000:50:02.0: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[Thu Sep  7 08:44:23 2023] pcieport 0000:50:02.0: AER: aer_uncor_severity: 0x044ef030
[Thu Sep  7 08:44:24 2023] pcieport 0000:50:02.0: AER: Root Port link has been reset (0)
[Thu Sep  7 08:44:24 2023] pcieport 0000:50:02.0: AER: device recovery successful
[Thu Sep  7 08:44:24 2023] vfio-pci 0000:51:00.0: vfio_ecap_init: hiding ecap 0x19@0x248
[Thu Sep  7 08:44:37 2023] pci 0000:52:00.0: [8086:0d5d] type 00 class 0x120001
[Thu Sep  7 08:44:37 2023] pci 0000:52:00.0: Adding to iommu group 96
[Thu Sep  7 08:44:37 2023] vfio-pci 0000:51:00.0: Captured SR-IOV VF 0000:52:00.0 driver_override
[Thu Sep  7 08:44:37 2023] pci 0000:52:00.1: [8086:0d5d] type 00 class 0x120001
[Thu Sep  7 08:44:37 2023] pci 0000:52:00.1: Adding to iommu group 97
[Thu Sep  7 08:44:37 2023] vfio-pci 0000:51:00.0: Captured SR-IOV VF 0000:52:00.1 driver_override
[Thu Sep  7 08:44:50 2023] vfio-pci 0000:52:00.0: enabling device (0000 -> 0002)
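For readers trying to interpret the register values in the log above, here is a minimal Python sketch that decodes `aer_uncor_status`, `aer_uncor_mask`, and `aer_uncor_severity`. The bit table is an assumption based on the AER Uncorrectable Error Status register layout in the PCIe specification, with short names matching those the Linux AER driver prints (such as `SDES`), so it should be checked against the specification before being relied on. The names `AER_UNCOR_BITS` and `describe` are hypothetical helpers, not part of any tool.

```python
# Assumed bit layout of the PCIe AER Uncorrectable Error Status register.
# Short names follow what the Linux AER driver prints; verify against the spec.
AER_UNCOR_BITS = {
    4: ("DLP", "Data Link Protocol Error"),
    5: ("SDES", "Surprise Down Error"),
    12: ("TLP", "Poisoned TLP"),
    13: ("FCP", "Flow Control Protocol Error"),
    14: ("CmpltTO", "Completion Timeout"),
    15: ("CmpltAbrt", "Completer Abort"),
    16: ("UnxCmplt", "Unexpected Completion"),
    17: ("RxOF", "Receiver Overflow"),
    18: ("MalfTLP", "Malformed TLP"),
    19: ("ECRC", "ECRC Error"),
    20: ("UnsupReq", "Unsupported Request"),
    21: ("ACSViol", "ACS Violation"),
}


def describe(status, mask, severity):
    """Return (bit, short name, long name, is_fatal, is_masked) per set status bit."""
    rows = []
    for bit in range(32):
        if not status & (1 << bit):
            continue
        short, long_name = AER_UNCOR_BITS.get(bit, ("?", "unknown or reserved"))
        is_fatal = bool(severity & (1 << bit))   # severity bit set means fatal
        is_masked = bool(mask & (1 << bit))      # mask bit set means not reported
        rows.append((bit, short, long_name, is_fatal, is_masked))
    return rows


if __name__ == "__main__":
    # Values copied from the log in the question.
    for row in describe(0x00000020, 0x01310000, 0x044EF030):
        print(row)
```

With the values from the log, the only status bit set is bit 5 (`SDES`, Surprise Down Error). Its severity bit is set, which matches the `type: fatal` line, and its mask bit is clear. A Surprise Down error generally means the link below the root port (here `0000:50:02.0`) went down unexpectedly.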
  • Does this answer your question? Understanding "Hardware error from APEI Generic Hardware Error Source" error message Commented Sep 7, 2023 at 10:18
  • and this one: unix.stackexchange.com/questions/150451/… Commented Sep 7, 2023 at 10:20
  • Please run the latest version of Memtest86 from a USB drive to see whether it is a RAM issue. Commented Sep 7, 2023 at 10:26
  • I have run Memtest86 and there are no issues with the memory! I assume it's related to the PCIe slot and the ACC100 accelerator, but I am not sure how to debug this and get more information. Thanks for your help so far! Commented Sep 7, 2023 at 10:34
  • Since this is a server, we can safely assume it contains ECC RAM, but to test that you would likely need a Professional version, which is able to test ECC RAM. The normal (free) version is not capable of this. Cheers and good luck. Commented Sep 7, 2023 at 10:36
