nerves_heart
This is a replacement for Erlang's heart port process specifically for
Nerves-based devices. It is installed by default on all Nerves devices.
Heart monitors the Erlang runtime using heart beat messages. If the Erlang
runtime becomes unresponsive, heart reboots the system. This implementation of
heart is fully compatible with the default implementation and provides the
following changes:
- Supports a hardware watchdog timer (WDT) so that if both the Erlang runtime and heart become unresponsive, the hardware watchdog can reboot the system
- Simplifies the code base. Code not needed for Nerves devices has been removed to make the program easier to audit.
- Provides diagnostic information. See Nerves.Runtime.Heart for an easier interface to get this.
- Directly calls sync(2) and reboot(2) when Erlang is unresponsive. The reboot call is not configurable nor is it necessary to invoke another program.
- Triggers watchdog-protected reboots and shutdowns
- Support for an application-level initialization handshake. This lets your application guard against conditions that cause Erlang and Nerves Heart to think that everything is ok when it's not. See discussion below.
- Support for snoozing timeouts and assuring an uninterrupted amount of time at start. This makes heart friendlier to remote debug where it's nice when the watchdog doesn't get in your way of debugging. These timeouts are limited to avoid accidentally putting the device in a state where it can't recover.
- Includes many regression tests to make it easier to modify and maintain.
Timeouts and semantics
The following timeouts are important to nerves_heart:
- Erlang VM heart beat timeout - the time interval for Erlang heart beat messages
- Watchdog Timer (WDT) timeout - the time interval before the WDT resets the device
- WDT pet timeout - the time interval at which nerves_heart pets the WDT
The WDT pet timeout is always shorter than the WDT timeout. The Erlang VM heart
beat timer may be longer or shorter than the WDT timeout. Heart beat messages
from Erlang reset the heart beat timer and pet the WDT. In the case that Erlang
sends heart beat messages frequently enough to satisfy the heart beat timeout,
but at a longer interval than the WDT pet timeout, nerves_heart will
automatically pet the WDT.
In the event that Erlang is unresponsive and nerves_heart is still able to run,
it will reboot the system by calling sync(2) and then reboot(2). If even
nerves_heart is unresponsive, the WDT will reset the system. This will also
happen if sync(2) or reboot(2) hangs nerves_heart.
If you stop the Erlang VM gracefully (such as by calling :init.stop/0), Erlang
will tell nerves_heart to exit without rebooting. This, however, will stop
petting the WDT, which will also lead to a system reset if the Linux kernel has
CONFIG_WATCHDOG_NOWAYOUT=y in its configuration. All official Nerves systems
specify this option.
WDT pet decisions made by nerves_heart:
- Pet the WDT on start. This is important since it's unknown how much time has passed between when the WDT was started and when nerves_heart is started. This prevents a reboot due to a slow boot time.
- Pet the WDT on Erlang heart beat messages AND pet timeout events. This makes the Erlang VM heart beat timeout the critical one in deciding when reboots happen.
- Pet the WDT once on graceful shutdown and requested crash dumps. The intention is to give other shutdown code or crash dump preparation code the full WDT timeout interval.
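The interaction between these timers can be illustrated with a toy model. The sketch below is not taken from nerves_heart. The module name WatchdogModel, the function outcome/3 and the choice of measuring every time in seconds on one shared clock are assumptions made purely for illustration. It also assumes that heart beats and pets happen right up to the moment each party stops.

```elixir
defmodule WatchdogModel do
  @moduledoc """
  Toy model (hypothetical, not nerves_heart's code) of which mechanism
  restarts the device. All times are seconds on one shared clock.
  """

  # erlang_stops_at: when Erlang sends its last heart beat message
  # heart_stops_at:  when nerves_heart itself hangs, or :never
  def outcome(timeouts, erlang_stops_at, heart_stops_at \\ :never) do
    %{heartbeat_timeout: heartbeat_timeout, wdt_timeout: wdt_timeout} = timeouts

    # nerves_heart keeps petting the WDT on pet timeout events, so the heart
    # beat timeout decides when it gives up on Erlang and reboots.
    heart_reboot_at = erlang_stops_at + heartbeat_timeout

    if heart_stops_at == :never or heart_stops_at >= heart_reboot_at do
      {:heart_reboot, heart_reboot_at}
    else
      # Nobody pets the WDT after nerves_heart hangs, so the hardware resets
      # the device no later than one WDT timeout after that point.
      {:wdt_reset_by, heart_stops_at + wdt_timeout}
    end
  end
end
```

With a 60 second heart beat timeout and a 120 second WDT timeout, an Erlang VM that goes silent at second 100 gives {:heart_reboot, 160}. If nerves_heart also hangs at second 130, the model gives {:wdt_reset_by, 250}.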
Using
If nerves_heart is part of your Nerves system, then all that you need to do is
enable it in your rel/vm.args file by adding the line:
-heart
If nerves_heart is not part of your system, you'll need to compile it and copy
it over the Erlang-provided heart program in /usr/lib/erlang/erts-x.y/bin.
Once you have heart running, if you have a hardware watchdog on your board,
you'll see a message printed to the console on your device with the pet
interval:
heart: Kernel watchdog activated (interval 5s)
The interval at which Erlang needs to communicate with heart can be different
from the interval at which heart pets the hardware watchdog. The default for
Erlang is 60 seconds. If this is too long, update your rel/vm.args as follows:
-heart
-env HEART_BEAT_TIMEOUT 30
The heart beat timeout has to be greater than 10 seconds per the Erlang documentation.
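Because an out-of-range value is easy to overlook in vm.args, it can be worth checking the setting when the application starts. The following is a minimal sketch under stated assumptions: the module HeartConfig and its function are hypothetical, and the accepted range 10 < X <= 65535 is the one given in the Erlang heart documentation rather than anything specific to Nerves Heart.

```elixir
defmodule HeartConfig do
  # Hypothetical helper, not part of nerves_heart or nerves_runtime.
  # Variables set with -env in vm.args are visible through System.get_env/1.
  def check_beat_timeout(value \\ System.get_env("HEART_BEAT_TIMEOUT")) do
    case value && Integer.parse(value) do
      # Variable not set: Erlang falls back to its 60 second default.
      nil ->
        {:ok, :default}

      # Range documented for Erlang's heart: 10 < X <= 65535 seconds.
      {seconds, ""} when seconds > 10 and seconds <= 65_535 ->
        {:ok, seconds}

      _ ->
        {:error, {:invalid_heart_beat_timeout, value}}
    end
  end
end
```

For example, HeartConfig.check_beat_timeout("30") returns {:ok, 30}, while a value of "10" is rejected because the timeout has to be greater than 10 seconds.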
If you need to change the watchdog path, you can do this through an environment variable.
-env HEART_WATCHDOG_PATH /dev/watchdog1
The following table shows the environment variables that affect Nerves Heart:
Variable | Description |
---|---|
ERL_CRASH_DUMP_SECONDS | Timeout in seconds to wait for Erlang to exit |
HEART_BEAT_TIMEOUT | Used by Erlang to start heart. Erlang promises to pet heart before this timeout. |
HEART_INIT_TIMEOUT | If set, require an init handshake message before the timeout |
HEART_KERNEL_TIMEOUT | Set the kernel watchdog driver's timeout. Requires that the kernel watchdog driver supports WDIOF_SETTIMEOUT |
HEART_KILL_SIGNAL | Set to "SIGABRT" to send SIGABRT rather than SIGKILL |
HEART_INIT_GRACE_TIME | Grace period for Erlang at the start. E.g., if set to 120, then heart will pet the hardware watchdog for the first two minutes even if Erlang isn't responsive. |
HEART_NO_KILL | If "TRUE", don't try to kill Erlang before exiting |
HEART_VERBOSE | "0" turns off logging, "1" is error logs only, "2" is everything |
HEART_WATCHDOG_PATH | Path to hardware watchdog. Defaults to "/dev/watchdog0" |
Linux kernel configuration
All official Nerves systems have Linux configured for Nerves Heart.
Nerves Heart expects the kernel watchdog device driver to be enabled with the no way out option:
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_NOWAYOUT=y
Linux provides several hardware watchdog drivers, so select the option that matches your device. You may need to add a section to your device tree to configure the driver. For example, if using an external watchdog and petting it via a GPIO, you will need to specify which GPIO in the device tree.
Application-level initialization handshake
Erlang's heart module has a very useful feature of letting you specify a
callback to extend the heart beat protection beyond the VM. See
:heart.set_callback/2.
To use it, you specify a callback function that returns :ok if your
application seems ok. If it's not ok, then Erlang won't send a heart beat
message to the heart process and the device will eventually reboot.
The problem with this mechanism is that if something causes the code that registers the callback to not be run, the device could continue to run without knowing that it's in a bad state. As an aside, a non-Nerves way of handling this would be to exit the VM with an error message. On devices, running in a degraded state can be useful to keep important services going and to allow time for remote troubleshooting.
Here's the flow for using the application-level initialization handshake:
- Set the HEART_INIT_TIMEOUT environment variable to the number of seconds to wait for the initializing handshake.
- Call :heart.set_callback/2 in your program
- Call Nerves.Runtime.Heart.init_complete/0 or :heart.set_cmd('init_done'). If you call the latter, be sure to do it in a different process than the one your heart callback uses or you'll get a hang.
To set HEART_INIT_TIMEOUT, edit the rel/vm.args.eex and update the heart
section to look like this:
## Enable heartbeat monitoring of the Erlang runtime system
-heart
-env HEART_BEAT_TIMEOUT 30
## Require an initialization handshake within 15 minutes
-env HEART_INIT_TIMEOUT 900
With this configuration, Nerves Heart will operate as normal, petting the
hardware watchdog and expecting Erlang heart beat messages. If init_done is
sent, Nerves Heart will continue working as normal. However, if init_done is
not sent by the 15 minute mark, Nerves Heart will pet the hardware WDT one last
time and exit. This will cause the Erlang VM to exit, and if the VM does not
exit cleanly, the hardware WDT will reset the device.
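The callback registration and handshake steps of the flow above can be sketched as follows. This is only an illustration: MyApp.Health and MyApp.Worker are placeholder names, the health rule (a required process is registered) is an assumption, and finish_init/0 only works on a device where heart is running and the nerves_runtime dependency is available.

```elixir
defmodule MyApp.Health do
  @moduledoc """
  Hypothetical health check module. MyApp.Health and MyApp.Worker are
  placeholder names, not part of Nerves Heart.
  """

  # Processes that must be registered for the application to count as healthy.
  @required_processes [MyApp.Worker]

  # Heart callback: Erlang only sends a heart beat when this returns :ok.
  def check do
    if Enum.all?(@required_processes, &Process.whereis/1), do: :ok, else: :error
  end

  # Call this once application start-up has finished.
  def finish_init do
    # Register the callback with Erlang's heart module.
    :ok = :heart.set_callback(__MODULE__, :check)

    # Complete the handshake. Requires the nerves_runtime dependency.
    Nerves.Runtime.Heart.init_complete()
  end
end
```

Keeping check/0 cheap matters here, since a slow callback delays the heart beat messages that Erlang sends to heart.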
Runtime diagnostic information
Both nerves_heart and the Linux watchdog subsystem can return useful
information via :heart.get_cmd/0. According to the Erlang/OTP documentation,
this function really should return the temporary reboot command, but it's
commented as unsupported in the Erlang version of heart. Given that there
isn't a better function call for returning this information, this one is used.
Here's an example run:
iex> :heart.get_cmd
{:ok,
'program_name=nerves_heart\nprogram_version=2.0.0\nheartbeat_timeout=60\nheartbeat_time_left=53\nwdt_pet_time_left=103\nwdt_identity=OMAP Watchdog\nwdt_firmware_version=0\nwdt_options=settimeout,magicclose,keepaliveping,\nwdt_time_left=116\nwdt_pre_timeout=0\nwdt_timeout=120\nwdt_last_boot=power_on\n'}
The format is "key=value\n". The keys are either from nerves_heart or from
Linux watchdog ioctl functions.
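Since the format is plain "key=value\n" text, it is straightforward to turn into a map when Nerves.Runtime is not available. The sketch below is a hypothetical helper (the HeartInfo module is not part of any library) and it leaves every value as a string instead of converting types.

```elixir
defmodule HeartInfo do
  # Hypothetical helper that parses the charlist returned inside
  # {:ok, info} by :heart.get_cmd/0 into a map of string keys and values.
  def parse(info) do
    info
    |> to_string()
    |> String.split("\n", trim: true)
    |> Map.new(fn line ->
      # Split on the first "=" only so that values may contain "=".
      case String.split(line, "=", parts: 2) do
        [key, value] -> {key, value}
        [key] -> {key, ""}
      end
    end)
  end
end
```

On a device this could be used as: {:ok, info} = :heart.get_cmd() followed by HeartInfo.parse(info).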
If you're using Nerves, Nerves.Runtime.Heart has a friendlier interface:
iex> Nerves.Runtime.Heart.status!
%{
program_name: "nerves_heart",
program_version: %Version{major: 2, minor: 0, patch: 0},
heartbeat_timeout: 60,
heartbeat_time_left: 53,
init_handshake_happened: true,
init_handshake_timeout: 900,
init_handshake_time_left: 0,
wdt_identity: "OMAP Watchdog",
wdt_firmware_version: 0,
wdt_last_boot: :power_on,
wdt_options: [:settimeout, :magicclose, :keepaliveping],
wdt_pre_timeout: 0,
wdt_time_left: 116,
wdt_pet_time_left: 103,
wdt_timeout: 120
}
The following table describes keys and their values:
Key | Description or value |
---|---|
:program_name | "nerves_heart" |
:program_version | Nerves heart's version number |
:heartbeat_timeout | Erlang's heartbeat timeout setting. Note that the hardware watchdog timeout supersedes this since it reboots. |
:heartbeat_time_left | The amount of time left for Erlang to send a heartbeat message before heart times out. |
:init_handshake_happened | true if the initialization handshake happened or isn't enabled |
:init_handshake_timeout | The time to wait for the handshake message before timing out |
:init_handshake_time_left | If waiting for an initialization handshake, this is the number of seconds left. |
:wdt_identity | The hardware watchdog that's being used |
:wdt_firmware_version | An integer that represents the hardware watchdog's firmware revision |
:wdt_last_boot | What caused the most recent boot. Whether this is reliable depends on the watchdog. |
:wdt_options | Hardware watchdog options as reported by Linux |
:wdt_pre_timeout | How many seconds before the watchdog expires that Linux will receive a pre-timeout notification |
:wdt_time_left | How many seconds are left before the hardware watchdog triggers a reboot (depends on the kernel driver) |
:wdt_timeout | The hardware watchdog timeout. This is only changeable in the Linux configuration |
:wdt_pet_time_left | The time left before Nerves heart will pet the hardware WDT should everything remain ok |
Reboot and power off
The :heart.set_cmd/1 function can be used to trigger reboots and shutdowns.
To understand why, it's good to review the alternatives. The other ways of
rebooting with Nerves are either by exiting the Erlang VM via :init.stop/0 or
by running the reboot(8) or poweroff(8) shell commands. The latter two work by
sending a signal to PID 1 (erlinit) that kills all OS processes and does the
reboot or power off. These work well in practice, but have had failures blamed
on them that can be further reduced by using Nerves Heart. For example, one
failure was found where OS processes couldn't be started and therefore
reboot(8) wasn't being called. This ended up causing a hang that wasn't
resolved for a long time. Since using Nerves Heart to reboot does not create OS
processes, it would not have this issue, and if communication problems between
Nerves Heart and Erlang arise, they will be detected by the heart beat timers.
There's lots to say about this topic. Suffice it to say that the heart process
has advantages in simplicity and control of the hardware WDT that make this
process more reliable.
To use, run either:
iex> :heart.set_cmd("guarded_reboot")
or
iex> :heart.set_cmd("guarded_poweroff")
IMPORTANT: nerves_runtime v0.13.2 and later makes these calls for you when
running Nerves.Runtime.reboot/0 and Nerves.Runtime.poweroff/0 if it detects
that this feature is supported.
Nerves Heart will notify PID 1 of the intention to reboot or power off and pet
the WDT for the last time. You should then call either :init.stop/0 (graceful)
or :erlang.halt/0 (ungraceful) to exit the Erlang VM. Both erlinit and the WDT
ensure that the shutdown completes.
It's also possible to do a totally ungraceful reboot or shutdown if you need a
very immediate response. The "guarded_immediate_reboot" and
"guarded_immediate_poweroff" commands do this.
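For code that cannot rely on Nerves.Runtime.reboot/0, the two-step graceful sequence described above (send the guarded command, then exit the VM) might be wrapped like this. It is a sketch only: the GuardedShutdown module is hypothetical, and the two function arguments exist solely so the sequence can be exercised off-device with stand-ins.

```elixir
defmodule GuardedShutdown do
  # Hypothetical wrapper, not part of nerves_heart or nerves_runtime.
  # set_cmd and stop default to the real calls and can be replaced in tests.
  def reboot(set_cmd \\ &:heart.set_cmd/1, stop \\ &:init.stop/0) do
    # Tell Nerves Heart first so that PID 1 is notified and the WDT
    # guards the rest of the shutdown.
    case set_cmd.(~c"guarded_reboot") do
      :ok ->
        # Graceful exit of the Erlang VM. Use :erlang.halt/0 instead for an
        # ungraceful exit.
        stop.()

      {:error, reason} ->
        # The command was not accepted, so leave the VM running.
        {:error, reason}
    end
  end
end
```

Calling GuardedShutdown.reboot() on a device sends the command and then stops the VM. The same shape works for "guarded_poweroff".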
Testing
It's reassuring to know that heart does what it's supposed to do since it
should only rarely take action. If you'd like to verify that a hardware watchdog
works, you can instruct heart to stop petting the hardware watchdog by running:
iex> :heart.set_cmd("disable_hw")
If you'd like to verify how the Erlang VM handles the heart process detecting
an unresponsive VM, you can instruct heart to stop as it would if it timed out
on the VM by running:
iex> :heart.set_cmd("disable_vm")
Snoozing
If you're debugging a watchdog or Erlang heart issue, it can be really helpful
to disable reboots temporarily. This is called snoozing and it works by telling
heart to pet the hardware watchdog and keep the Erlang VM running even if it
should reboot. This is effective for the next 15 minutes every time you run it.
iex> :heart.set_cmd("snooze")
If :heart.set_cmd/1 doesn't return, that means that the Erlang heart process
is stuck. This is probably due to the health check callback (registered by
:heart.set_callback/2) taking a long time. To get past this, send the USR1
signal to Nerves Heart like this:
iex> :os.cmd('killall -USR1 heart')
The reason snoozing forever isn't supported is to avoid keeping a device on forever in a bad state. For example, it would be unfortunate to start a debug session on a remote device and accidentally mess it up in a way where you lose connectivity until the next reboot.
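The snooze command and the USR1 fallback described above can be combined so that a stuck heart process does not also block the shell. The following is a sketch under assumptions: SnoozeHelper and its option names are hypothetical, and the 5 second default for deciding that :heart.set_cmd/1 is stuck is an arbitrary choice.

```elixir
defmodule SnoozeHelper do
  # Hypothetical helper. The :set_cmd and :signal options default to the
  # real calls and can be replaced with stand-ins for off-device testing.
  def snooze(opts \\ []) do
    set_cmd = Keyword.get(opts, :set_cmd, &:heart.set_cmd/1)
    signal = Keyword.get(opts, :signal, fn -> :os.cmd(~c"killall -USR1 heart") end)
    timeout = Keyword.get(opts, :timeout, 5_000)

    # Run the command in a separate process so a stuck heart process
    # cannot hang the caller.
    task = Task.async(fn -> set_cmd.(~c"snooze") end)

    case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} ->
        result

      {:exit, reason} ->
        {:error, reason}

      nil ->
        # :heart.set_cmd/1 did not return in time, so fall back to the signal.
        signal.()
        :signaled
    end
  end
end
```

SnoozeHelper.snooze() returns whatever :heart.set_cmd/1 returns when the command goes through, and :signaled when it had to fall back to the USR1 signal.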
Debugging
Nerves Heart writes to the kernel's logger to aid debugging if something
unexpected happens. Only errors are logged by default. If you'd like
informational messages as well, set the HEART_VERBOSE environment variable in
your vm.args.eex:
-env HEART_VERBOSE 2
If you want Nerves Heart to log nothing at all for some reason, set
HEART_VERBOSE to 0:
-env HEART_VERBOSE 0
Heart set_cmd summary
The following commands can be sent to Nerves Heart via :heart.set_cmd/1:
Command | Description |
---|---|
"disable" | Same as "disable_hw" |
"disable_hw" | Stop petting the hardware watchdog |
"disable_vm" | Return a timeout to Erlang |
"init_handshake" | Stop the initialization timeout timer |
"guarded_immediate_poweroff" | Stop petting the watchdog and immediately power off |
"guarded_immediate_reboot" | Stop petting the watchdog and immediately reboot |
"guarded_halt" | Stop petting the watchdog and start a graceful halt |
"guarded_poweroff" | Stop petting the watchdog and start a graceful power off |
"guarded_reboot" | Stop petting the watchdog and start a graceful reboot |
"snooze" | Don't stop petting the watchdog for the next 15 minutes |
License
The production code in this project is Erlang/OTP's heart.c with custom
modifications for use with Nerves. This file is licensed under Apache-2.0.
Without additional modification, no other license is used.
This project does contain source with other licenses including GPL-2.0-only. Files containing these licenses are clearly marked as required by the REUSE recommendations.
Exceptions to Apache-2.0 licensing are:
- Configuration and data files are licensed under CC0-1.0
- Documentation is CC-BY-4.0
- Linux kernel headers for macOS development are GPL-2.0-only with the Linux syscall note.