- Last reviewed: 05/20/2016
- HPE iLO NMI Watchdog Driver
- NMI sourcing for iLO based ProLiant Servers
- Documentation and Driver by
- Thomas Mingarelli
- The HPE iLO NMI Watchdog driver is a kernel module that provides basic
- watchdog functionality and the added benefit of NMI sourcing. Both the
- watchdog functionality and the NMI sourcing capability need to be enabled
- by the user. Remember that the two modes are not dependent on one another.
- A user can have the NMI sourcing without the watchdog timer and vice-versa.
- All references to iLO in this document imply it also works on iLO2 and all
- subsequent generations.
- Watchdog functionality is enabled like any other common watchdog driver. That
- is, an application needs to be started that kicks off the watchdog timer. A
- basic application exists in the Documentation/watchdog/src directory called
- watchdog-test.c. Simply compile the C file and kick it off. If the system
- gets into a bad state and hangs, the HPE ProLiant iLO timer register will
- not be updated in a timely fashion and a hardware system reset (also known as
- an Automatic Server Recovery (ASR)) event will occur.
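The keepalive loop that such an application runs can be sketched in shell. This is a minimal illustration, not the watchdog-test.c program itself: the function name wd_feed and its arguments are hypothetical, and it assumes the hpwdt module is loaded and exposes /dev/watchdog.

```shell
# Hypothetical helper, not part of the kernel tree.
# Usage: wd_feed DEVICE INTERVAL_SECONDS COUNT
# Opens DEVICE once (opening the watchdog device starts the timer), then
# writes one byte every INTERVAL_SECONDS, COUNT times. Each write acts as
# a keepalive ping.
wd_feed() {
    dev=$1
    interval=$2
    count=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3
            i=$((i + 1))
            if [ "$i" -lt "$count" ]; then
                sleep "$interval"
            fi
        done
    } 3>"$dev"
    # The device is closed here. The timer may keep running after the
    # close, so a real feeder would loop for as long as the system is up.
}

# Example (as root, with hpwdt loaded; assumes the default 30 second
# soft_margin, so a 10 second interval leaves headroom):
#   wd_feed /dev/watchdog 10 360
```

The device is kept open for the whole loop instead of being reopened for each ping, so the timer is started only once. If the loop stops while the timer is still armed, the ASR described above should be expected.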
- The hpwdt driver also has three (3) module parameters. They are the following:
- soft_margin - allows the user to set the watchdog timer value.
-               Default value is 30 seconds.
- allow_kdump - allows the user to save off a kernel dump image after an NMI.
-               Default value is 1/ON.
- nowayout    - basic watchdog parameter that does not allow the timer to
-               be stopped or an impending ASR to be escaped.
-               Default value is set when compiling the kernel. If it is set
-               to "Y", then there is no way of disabling the watchdog once
-               it has been started.
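As with other loadable modules, these parameters are given on the modprobe command line when the module is loaded. A minimal sketch follows. The helper name hpwdt_load_cmd is hypothetical, and it only prints the command rather than running it. The defaults it falls back to are the ones documented above, and nowayout is left out unless a value is passed, because its default is fixed at kernel build time.

```shell
# Hypothetical helper: print a modprobe command line for hpwdt.
# Usage: hpwdt_load_cmd [SOFT_MARGIN_SECONDS] [ALLOW_KDUMP] [NOWAYOUT]
hpwdt_load_cmd() {
    soft_margin=${1:-30}   # seconds; documented default
    allow_kdump=${2:-1}    # 1 = ON; documented default
    cmd="modprobe hpwdt soft_margin=$soft_margin allow_kdump=$allow_kdump"
    if [ -n "${3:-}" ]; then
        cmd="$cmd nowayout=$3"
    fi
    printf '%s\n' "$cmd"
}

# Example: a 60 second timer with kdump after NMI switched off.
#   hpwdt_load_cmd 60 0
# The parameters a given build of the module accepts can be listed with:
#   modinfo -p hpwdt
```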
- NOTE: More information about watchdog drivers in general, including the ioctl
- interface to /dev/watchdog can be found in
- Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt.
- The NMI sourcing capability is disabled by default due to the inability to
- distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the
- Linux kernel. What this means is that the hpwdt NMI handler code is called
- each time the NMI signal fires off. This could amount to several thousands of
- NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and
- confused" message in the logs or if the system gets into a hung state, then
- the hpwdt driver can be reloaded as follows:
- 1. If the kernel has not been booted with nmi_watchdog turned off, edit the
-    boot loader configuration and append nmi_watchdog=0 to the end of the
-    currently booting kernel line. Depending on your Linux distribution and
-    platform setup, the file to edit is:
-    For non-UEFI systems
-        /boot/grub/grub.conf or
-        /boot/grub/menu.lst
-    For UEFI systems
-        /boot/efi/EFI/distroname/grub.conf or
-        /boot/efi/efi/distroname/elilo.conf
- 2. Reboot the server.
- 3. Once the system comes up, perform a modprobe -r hpwdt
- 4. modprobe hpwdt
-    (or insmod /lib/modules/`uname -r`/kernel/drivers/watchdog/hpwdt.ko)
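The steps above can be sketched as a few shell helpers. All function names here are hypothetical and the code is illustrative only. It assumes the running kernel's command line can be read from /proc/cmdline, and it loads the module by name, since modprobe takes a module name, not a file path.

```shell
# Hypothetical helpers; names and structure are illustrative only.

# Succeeds if the given command line file (default /proc/cmdline)
# contains nmi_watchdog=0 as a separate argument.
nmi_watchdog_disabled() {
    cmdline_file=${1:-/proc/cmdline}
    line=
    IFS= read -r line < "$cmdline_file" || [ -n "$line" ] || return 1
    case " $line " in
        *" nmi_watchdog=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Step 1: print a kernel line with nmi_watchdog=0 appended, unless the
# argument is already present.
append_nmi_watchdog_off() {
    case " $1 " in
        *" nmi_watchdog=0 "*) printf '%s\n' "$1" ;;
        *) printf '%s nmi_watchdog=0\n' "$1" ;;
    esac
}

# Steps 3 and 4: unload and reload the driver (run as root, after the
# reboot in step 2).
hpwdt_reload() {
    if ! nmi_watchdog_disabled; then
        echo "kernel was not booted with nmi_watchdog=0" >&2
        return 1
    fi
    modprobe -r hpwdt && modprobe hpwdt
}
```

Checking the command line before reloading catches the case where the boot loader file was edited but the system has not yet been rebooted with the new setting.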
- Now, the hpwdt driver can successfully receive and source the NMI and provide a log
- message that details the reason for the NMI (as determined by the HPE BIOS).
- Below is a list of NMIs the HPE BIOS understands along with the associated
- code (reason):
- No source found                00h
- Uncorrectable Memory Error     01h
- ASR NMI                        1Bh
- PCI Parity Error               20h
- NMI Button Press               27h
- SB_BUS_NMI                     28h
- ILO Doorbell NMI               29h
- ILO IOP NMI                    2Ah
- ILO Watchdog NMI               2Bh
- Proc Throt NMI                 2Ch
- Front Side Bus NMI             2Dh
- PCI Express Error              2Fh
- DMA controller NMI             30h
- Hypertransport/CSI Error       31h
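When reading logs, the table can be turned into a small lookup. The sketch below is a hypothetical helper, not something the driver provides. The exact format of the driver's log message is not shown in this document, so the helper simply takes a code written in the table's notation (two hex digits followed by "h").

```shell
# Hypothetical helper: map an NMI reason code from the table above
# to its description. Returns non-zero for codes not in the table.
hpwdt_nmi_reason() {
    case $1 in
        00h) echo "No source found" ;;
        01h) echo "Uncorrectable Memory Error" ;;
        1Bh) echo "ASR NMI" ;;
        20h) echo "PCI Parity Error" ;;
        27h) echo "NMI Button Press" ;;
        28h) echo "SB_BUS_NMI" ;;
        29h) echo "ILO Doorbell NMI" ;;
        2Ah) echo "ILO IOP NMI" ;;
        2Bh) echo "ILO Watchdog NMI" ;;
        2Ch) echo "Proc Throt NMI" ;;
        2Dh) echo "Front Side Bus NMI" ;;
        2Fh) echo "PCI Express Error" ;;
        30h) echo "DMA controller NMI" ;;
        31h) echo "Hypertransport/CSI Error" ;;
        *)   echo "Unknown code: $1"; return 1 ;;
    esac
}

# Example:
#   hpwdt_nmi_reason 2Bh
```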
- -- Tom Mingarelli