I want to set up a high-availability system and, to that end, have been experimenting with several recovery methods.
Assuming I have received a kernel update through automated means, which would mean a reboot, I would create a snapshot of the running system, create a bootable entry in GRUB2, update my kernel, and force a boot with the updated root.
If boot did not go through as expected, I'd want a timer which would, after a set time, abandon the boot attempt and boot normally, which would be through the pre-update snapshot.
The scenario looked ripe for a watchdog timer (WDT). I noticed a Watchdog option in my BIOS ACPI settings, with timeout values from 3 s to 30 min. I set it to 2 min and, on my next boot attempt, interrupted the boot process so it would not go ahead. Sure enough, after 2 min the machine rebooted. However, even when the system did boot correctly, it still kept rebooting every 2 min.
I read here, here and here that a daemon needs to tickle / refresh this watchdog. I noticed that there is a /dev/watchdog on my system, but I have no idea whether it interfaces with the watchdog I enabled in the BIOS. I further read that it can be tickled by systemd: setting the RuntimeWatchdogSec= option in /etc/systemd/system.conf should make systemd refresh the watchdog timer and prevent the reboot. Accordingly, I set this to 20s, but my device still keeps rebooting whenever the timeout set in the BIOS watchdog elapses.
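For reference, this is my understanding of what "tickling" the watchdog amounts to, written as a minimal Python sketch rather than a tested fix. It assumes the usual Linux watchdog character-device behaviour (any write counts as a keepalive, and writing the character V just before closing asks the driver to disarm the timer). The function name pet_watchdog and its parameters are my own, purely for illustration, and whether this device is the one wired to the BIOS watchdog is exactly what I am unsure about.

```python
import os
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, count=3):
    """Open the watchdog device, send `count` keepalives, then try to disarm it.

    Hypothetical helper for illustration only.
    Returns the number of keepalive writes performed.
    """
    # On most drivers, opening the device is what arms the timer.
    fd = os.open(device, os.O_WRONLY)
    written = 0
    try:
        for i in range(count):
            os.write(fd, b"\0")  # any write resets the countdown
            written += 1
            if i < count - 1:
                time.sleep(interval)  # must stay below the watchdog timeout
    finally:
        # "Magic close": a 'V' written right before close asks the driver to
        # stop the timer. Drivers loaded with nowayout ignore this.
        os.write(fd, b"V")
        os.close(fd)
    return written
```

On a real machine this needs root, and as far as I understand only one process can hold the device open at a time, so the open call would fail with a "device busy" error if systemd or the watchdog daemon already has it.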
(For some reason I have /dev/watchdog and /dev/watchdog0, both 0 bytes, not sure if that is normal...)
Do I need to enable something else? Is my understanding correct that /dev/watchdog interfaces with the BIOS watchdog timer I enabled, and that it can be tickled through the /etc/systemd/system.conf option? Initially I assumed it was a software watchdog that had nothing to do with the hardware watchdog I turned on in the BIOS, but it seems it should work with that.
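To find out which driver sits behind each device node, one option seems to be the sysfs entries under /sys/class/watchdog. The sketch below is a guess at how to read them. It assumes the kernel was built with watchdog sysfs support, so that each watchdogN directory exposes attributes such as identity, timeout and state. The function name list_watchdogs is mine.

```python
from pathlib import Path


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Return {device name: {attribute: value}} for each registered watchdog.

    Hypothetical helper; attributes the kernel does not expose are skipped.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no watchdog class on this system
    for entry in sorted(root.iterdir()):
        attrs = {}
        for name in ("identity", "timeout", "state"):
            try:
                attrs[name] = (entry / name).read_text().strip()
            except OSError:
                pass  # attribute missing or unreadable
        result[entry.name] = attrs
    return result


if __name__ == "__main__":
    for dev, attrs in list_watchdogs().items():
        print(dev, attrs)
```

As far as I can tell, /dev/watchdog is the legacy name for the first registered device (watchdog0), which would explain why I see both nodes.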
The board I am using is a very generic one, and the OS is CentOS 8.
EDIT:
Running lsmod | grep wdt gives me the following:
iTCO_wdt 16384 1
iTCO_vendor_support 16384 1 iTCO_wdt
mei_wdt 16384 0
mei 110592 3 mei_wdt,mei_me
Since the systemd approach did not work, I installed the watchdog daemon provided by CentOS, set values such as watchdog-device to /dev/watchdog, and tried a few other settings, but that didn't work either. The system just keeps restarting.
Running systemctl status watchdog.service reports that the daemon service is active and running, along with:
alive=/dev/watchdog heartbeat=[none] to=root no_act=no force=no
hardware watchdog identity : iTCO_wdt
cannot set scheduler (errno=1 = 'Operation not permitted') //ERROR
From what I found, this error might have something to do with systemd, but I checked and /etc/systemd/system.conf is completely commented out.
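Since checking by eye that a file is "completely commented" is easy to get wrong, here is a small sketch of how I would verify it. It only assumes the systemd config syntax where lines starting with # or ; are comments. The function name active_setting is my own, and it does not look at drop-in files under /etc/systemd/system.conf.d/, which could also set the option.

```python
def active_setting(conf_text, key="RuntimeWatchdogSec"):
    """Return the last uncommented value assigned to `key`, or None.

    Hypothetical helper: parses the text of a single systemd-style config file.
    """
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # later assignments override earlier ones
    return value


if __name__ == "__main__":
    with open("/etc/systemd/system.conf") as fh:
        print(active_setting(fh.read()))
```

A result of None would confirm that systemd is not configured to refresh the watchdog from this file.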