4

/var/log/syslog contains:

Jul 31 13:45:01 ray-desktop CRON[5667]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 31 13:45:50 ray-desktop org.gnome.Shell.desktop[1689]: [2036:2054:0731/134550.778035:ERROR:socket_stream.cc(219)] Closing stream with result -2
Jul 31 13:47:51 ray-desktop rasdaemon[695]:            <...>-35    [-41071872]     0.001327: mce_record:           2019-07-31 12:27:04 -0400 bank=8, status= 8c2001000001110b, corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error, mci=Corrected_error Threshold based error status: green, mca=corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error Large number of corrected cache errors. System operating, but might leadto uncorrected errors soon, cpu_type= Intel generic architectural MCA, cpu= 0, socketid= 0, misc= 31c0, addr= 2cee80000075b7d, mcgstatus=0, mcgcap= c09, apicid= 0
Jul 31 13:47:51 ray-desktop kernel: [18114.699831] mce: [Hardware Error]: Machine check events logged
Jul 31 13:47:51 ray-desktop rasdaemon[695]: cpu 00:rasdaemon: mce_record store: 0x556ec46df398
Jul 31 13:47:51 ray-desktop rasdaemon[695]: rasdaemon: register inserted at db
Jul 31 13:48:22 ray-desktop kernel: [18145.544922] perf: interrupt took too long (5187 > 5062), lowering kernel.perf_event_max_sample_rate to 38500

immediately followed by reboot logs at 13:55:53.

I understand that "mce" logging has been replaced by "rasdaemon", both of which are mentioned in the above.

$ find /sys/kernel/debug/tracing  -type f  \! -empty

finds nothing.

There are over 22,000 files in that directory, all empty, and all created at the time of the reboot.

Is this where rasdaemon keeps its information, and if so, what use is it if it all gets zeroed by a reboot?

2 Answers

7

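Everything below /sys is typically a virtual file system provided by the kernel; in particular, /sys/kernel/debug/tracing is tracefs. Its files are generated by the kernel on demand and report a size of zero, which is why find with -empty matches all of them, and they are recreated at every boot. It is not where rasdaemon keeps its records.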

If rasdaemon is started with the -r/--record parameter, it stores events in an SQLite3 database, which on my system is at /var/lib/rasdaemon/ras-mc_event.db. This database can be examined with ras-mc-ctl --errors.
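
To check whether events are actually being recorded, a small helper along these lines may be useful. It is only a sketch: the function name ras_db_tables is made up for illustration, the default path is the one from my system and may differ elsewhere, and it assumes the sqlite3 command-line tool is installed.

```shell
# Hypothetical helper: list the tables in a rasdaemon SQLite database.
# The default path is an assumption taken from one system; pass another
# path as the first argument if your distribution uses a different one.
ras_db_tables() {
    db="${1:-/var/lib/rasdaemon/ras-mc_event.db}"
    if [ ! -r "$db" ]; then
        echo "cannot read $db (is rasdaemon running with -r/--record?)" >&2
        return 1
    fi
    if ! command -v sqlite3 >/dev/null 2>&1; then
        echo "sqlite3 command-line tool not found" >&2
        return 2
    fi
    # Open read-only so the database rasdaemon writes to is not modified.
    sqlite3 -readonly "$db" '.tables'
}
```

Calling ras_db_tables without an argument uses the default path. A return status of 1 suggests that the database does not exist or is not readable by the current user, which is what you would expect if rasdaemon was never started in record mode.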

0
  1. The logs of rasdaemon are reported via syslog/journald.

The rasdaemon program is a daemon which monitors the platform Reliability, Availability and Serviceability (RAS) reports from the Linux kernel trace events. These trace events are logged in /sys/kernel/debug/tracing, reporting them via syslog/journald.

https://github.com/mchehab/rasdaemon/blob/master/man/rasdaemon.1.in

You can retrieve the log with journalctl:

# journalctl | tail -n 100
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: <idle>-0     [-85410864]     0.000960: mc_event:             2023-07-12 20:24:45 +0800 1 Corrected error: single-symbol chipkill ECC on unknown memory (mc: 0 address: 0x400abb3a400 grain: 0 APEI location: node:0 card:5 module:0 rank:1 bank_group:0 bank_address:3 device:0 row:174 column:1280 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)

Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: cpu 19:rasdaemon: mc_event store: 0xaaaab9491ff8
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: rasdaemon: register inserted at db
  2. The RAS events are tracepoints emitted by the kernel; you can monitor them yourself via debugfs.
# ls /sys/kernel/debug/tracing/events/ras/mc_event/
enable  filter  format  hist  id  trigger

# cat /sys/kernel/debug/tracing/events/ras/mc_event/id
1188

# cd /sys/kernel/debug/tracing/events/ras/mc_event/
# echo 1 > enable

# cd /sys/kernel/debug/tracing/
# cat trace_pipe
          <idle>-0       [074] dnh.  7251.551618: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)

# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:128
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
          <idle>-0       [075] d.h.  7323.829675: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)
  3. The tracepoints are monitored by rasdaemon and, if it was started with the -r/--record parameter, stored persistently in an SQLite3 database.
# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 15:40:42 CST; 3s ago
  Process: 40597 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
 Main PID: 40596 (rasdaemon)
    Tasks: 1
   Memory: 440.0K
   CGroup: /system.slice/rasdaemon.service
           └─40596 /usr/sbin/rasdaemon -f -r

# ras-mc-ctl --errors
Memory controller events:
1 2023-07-12 15:42:21 +0800 1 Info error(s): memory read error at CPU_SrcID#0_MC#0_Chan#0_DIMM#0 location: 0:0:0:-1, xxxx
No Extlog errors.
PCIe AER events:
1 2023-07-12 17:00:56 +0800 Corrected error: Data Link Protocol
MCE events:
1 2023-07-12 15:42:21 +0800 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg xxxx
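
Since the syslog/journald messages from step 1 survive a reboot, they can be filtered after the fact. The following is a minimal sketch, assuming a syslog-style text file such as /var/log/syslog; the function name ras_log_lines is hypothetical. On systems with a persistent journal, journalctl -b -1 -u rasdaemon should show the same messages for the previous boot.

```shell
# Hypothetical helper: print only the lines written by rasdaemon from a
# syslog-style file. The default path is an assumption; on journald-only
# systems there may be no such file.
ras_log_lines() {
    # Match the "rasdaemon[PID]: " tag that syslog puts after the host name.
    grep -E ' rasdaemon\[[0-9]+\]: ' "${1:-/var/log/syslog}"
}
```

For example, ras_log_lines /var/log/syslog | tail -n 20 prints the most recent rasdaemon messages, including those logged before the last reboot, provided the log has not been rotated away.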

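The tracefs files from step 2 report a size of zero but still return content when read, so their state can be inspected directly. As a rough sketch, the helper below prints whether each event in the ras group is enabled. The function name ras_event_states is made up, the default mount point is the one used above, and reading these files normally requires root.

```shell
# Hypothetical helper: print "<event> <0|1>" for every tracepoint in the
# ras group. The tracing directory can be overridden as the first argument.
ras_event_states() {
    dir="${1:-/sys/kernel/debug/tracing}"
    for f in "$dir"/events/ras/*/enable; do
        # Skip unreadable files and the unexpanded pattern if nothing matched.
        [ -r "$f" ] || continue
        event=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$event" "$(cat "$f")"
    done
}
```

A value of 1 next to mc_event means the tracepoint is currently enabled, either by rasdaemon or by a manual echo 1 > enable as shown in step 2.
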