Hardware Watchdog as Well as Software Watch Scripts
Provided by: watchdog_5.14-3_amd64
NAME
watchdog - a software watchdog daemon
SYNOPSIS
watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]
DESCRIPTION
The Linux kernel can reset the system if serious problems are detected. This can be implemented via special watchdog hardware, or via a slightly less reliable software-only watchdog inside the kernel. Either way, there needs to be a daemon that tells the kernel the system is working fine. If the daemon stops doing that, the system is reset.

watchdog is such a daemon. It opens /dev/watchdog and keeps writing to it often enough to keep the kernel from resetting, at least once per minute. Each write delays the reboot time by another minute. After a minute of inactivity the watchdog hardware will cause the reset. In the case of the software watchdog, the ability to reboot will depend on the state of the machine and interrupts.

The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
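The write-then-close cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the daemon's actual code. The function name keep_alive and its parameters are hypothetical, and the "V" character written before closing is an assumption taken from the kernel watchdog driver API (its "magic close" feature, which many drivers support), not from this manual page.

```python
import time

MAGIC_CLOSE = b"V"  # assumption: the kernel watchdog API's "magic close" character


def keep_alive(device, interval_seconds, rounds, sleep=time.sleep):
    """Write to the watchdog device `rounds` times, then close it cleanly.

    `device` is any binary file-like object; on a real system it would be
    open("/dev/watchdog", "wb", buffering=0).
    """
    try:
        for _ in range(rounds):
            device.write(b"\0")      # any write postpones the reset
            device.flush()
            sleep(interval_seconds)  # must stay well below the one-minute limit
        device.write(MAGIC_CLOSE)    # announce an intentional shutdown
        device.flush()
    finally:
        device.close()
```

If the loop dies with an exception, the device is closed without the magic character, so a driver that honours magic close would still reset the machine. That matches the intent of a watchdog: only a deliberate, orderly stop should disarm it.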
TESTS
The watchdog daemon does several tests to check the system status:

· Is the process table full?
· Is there enough free memory?
· Is there enough allocatable memory?
· Are some files accessible?
· Have some files changed within a given interval?
· Is the average work load too high?
· Has a file table overflow occurred?
· Is a process still running? The process is specified by a pid file.
· Do some IP addresses answer to ping?
· Do network interfaces receive traffic?
· Is the temperature too high? (Temperature data is not always available.)
· Execute a user-defined command to do arbitrary tests.
· Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.

If any of these checks fails, watchdog will cause a shutdown. Should any of these tests, except the user-defined binary, last longer than one minute, the machine will be rebooted too.
OPTIONS
Available command line options are the following:

-v, --verbose
    Set verbose mode. Only implemented if compiled with the SYSLOG feature. This mode logs several pieces of information to LOG_DAEMON with priority LOG_INFO. This is useful if you want to see exactly what happened until the watchdog rebooted the system. Currently it logs the temperature (if available), the load average, the change date of the files it checks and how often it went to sleep.

-s, --sync
    Try to synchronize the filesystem every time the process is awake. Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.

-b, --softboot
    Soft-boot the system if an error occurs during the main loop, e.g. if a given file is not accessible via the stat(2) call. Note that this does not apply to the opening of /dev/watchdog and /proc/loadavg, which are opened before the main loop starts.

-F, --foreground
    Run in foreground mode, useful for running under systemd (for example).

-f, --force
    Force the usage of the interval given or the maximal load average given in the config file. Without this option these values are sanity checked.

-c config-file, --config-file config-file
    Use config-file as the configuration file instead of the default /etc/watchdog.conf.

-q, --no-action
    Do not reboot or halt the machine. This is for testing purposes. All checks are executed and the results are logged as usual, but no action is taken. In addition, the hardware card or the kernel software watchdog driver is not enabled. Temperature checking is also disabled since this triggers the hardware watchdog on some cards.
FUNCTION
After watchdog starts, it puts itself into the background and then tries all checks specified in its configuration file in turn. Between each two tests it will write to the kernel device to prevent a reset. After finishing all tests watchdog goes to sleep for some time. The kernel driver expects a write to the watchdog device every minute. Otherwise the system will be reset. watchdog will sleep for a configured interval that defaults to 1 second to make sure it triggers the device early enough.

Under high system load watchdog might be swapped out of memory and may fail to make it back in in time. Under these circumstances the Linux kernel will reset the machine. To avoid unnecessary reboots, make sure you have the variable realtime set to yes in the configuration file watchdog.conf. This adds real-time support to watchdog: it will lock itself into memory and there should be no problem even under the highest of loads.

On a system running out of memory the kernel will try to free enough memory by killing processes. The watchdog daemon itself is exempted from this so-called out-of-memory killer.

You can also specify a maximal allowed load average. Once this load average is reached the system is rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15 minutes. The default is to disable this test. Be careful not to set this parameter too low. To set a value less than the predefined minimal value of 2, you have to use the -f option.

You can also specify a minimal amount of virtual memory you want to have available as free. As soon as more virtual memory is used, action is taken by watchdog. Note, however, that watchdog does not distinguish between different types of memory usage. It just checks for free virtual memory.
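The load-average test described above amounts to comparing three numbers against three limits. The following Python sketch illustrates the idea under stated assumptions: the function names are hypothetical, a limit of 0 is taken to mean "test disabled", and "reached" is read as greater than or equal. The daemon's own comparison may differ in detail.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("not enough data in loadavg text")
    return tuple(float(field) for field in fields[:3])


def load_exceeded(loads, limits):
    """True if any load average has reached its limit.

    Assumption: a limit of 0 disables the check for that time span.
    """
    return any(limit > 0 and load >= limit for load, limit in zip(loads, limits))
```

On a live system the input would come from reading /proc/loadavg, for example `parse_loadavg(open("/proc/loadavg").read())`.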
If you have a watchdog card with a temperature sensor you can specify the maximal allowed temperature. Once this temperature is reached the system is halted. The default value is 120. There is no unit conversion, so make sure you use the same unit as your hardware. watchdog will issue warnings once the temperature reaches 90%, 95% and 98% of this value.

When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause a reboot. For a reboot the stat call has to last at least one minute. This may happen if the file is located on an NFS mounted filesystem. If your system relies on an NFS mounted filesystem you might try this option. However, in such a case the sync option may not work if the NFS server is not answering.

watchdog can read the pid from a pid file and see whether the process still exists. If not, action is taken by watchdog. So you can, for instance, restart the server from your repair binary.

watchdog will try periodically to fork itself to see whether the process table is full. This process will leave a zombie process until watchdog wakes up again and catches it; this is harmless, don't worry about it.

In ping mode watchdog tries to ping the given IP addresses. These addresses do not have to be a single machine. It is possible to ping a broadcast address instead to see if at least one machine in a subnet is still living.

Do not use this broadcast ping unless your MIS person a) knows about it and b) has given you explicit permission to use it!

watchdog will send out three ping packets and wait up to <interval> seconds for the reply, with <interval> being the time it goes to sleep between two times triggering the watchdog device. Thus an unreachable network will not cause a hard reset but a soft reboot.
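The pid-file test mentioned above can be illustrated with a short Python sketch. It is an assumption about how such a check is commonly written on Linux (sending signal 0 with kill(2)), not a copy of the daemon's code, and the function name process_alive is hypothetical.

```python
import os


def process_alive(pidfile_path):
    """Return True if the pid stored in `pidfile_path` names a live process."""
    try:
        with open(pidfile_path) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False           # missing or malformed pid file
    if pid <= 0:
        return False           # 0 and negative values address process groups
    try:
        os.kill(pid, 0)        # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False           # no such process
    except PermissionError:
        return True            # process exists but belongs to another user
    return True
```

Treating a missing pid file as "not running" is a design choice of this sketch; a stricter check could report it as a separate error.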
You can also test passively for an unreachable network by just monitoring a given interface for traffic. If no traffic arrives the network is considered unreachable, causing a soft reboot or action from the repair binary.

watchdog can run an external command for user-defined tests. A return code not equal to 0 means an error occurred and watchdog should react. If the external command is killed by an uncaught signal this is considered an error by watchdog too. The command may take longer than the time slice defined for the kernel device without a problem. However, error messages are generated into the syslog facility. If you have enabled softboot on error the machine will be rebooted if the binary doesn't exit in half the time watchdog sleeps between two tries triggering the kernel device.

If you specify a repair binary it will be started instead of shutting down the system. If this binary is not able to fix the problem watchdog will still cause a reboot afterwards.

If the machine is halted an email is sent to notify a human that the machine is going down. Starting with version 4.4 watchdog will also notify the human in charge if the machine is rebooted.
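The three failure cases for an external command described above (non-zero return code, death by signal, not returning in time) can be told apart as in the following Python sketch. The function name run_check and the wording of the result strings are hypothetical, and killing the child on timeout is a property of Python's subprocess.run, not necessarily of watchdog.

```python
import subprocess


def run_check(command, timeout_seconds):
    """Run a check command and return (ok, detail)."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False, "did not return in time"
    if result.returncode < 0:
        # subprocess reports death by signal N as returncode -N
        return False, f"killed by signal {-result.returncode}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"
```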
SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error that is found. Since there might be no more processes available, watchdog does it all by itself. That means:

1. Kill all processes with SIGTERM.
2. After a short pause kill all remaining processes with SIGKILL.
3. Record a shutdown entry in wtmp.
4. Save the random seed from /dev/urandom. If the device does not exist or there is no filename for saving, this step is skipped.
5. Turn off accounting.
6. Turn off quota and swap.
7. Unmount all partitions except the root partition.
8. Remount the root partition read-only.
9. Shut down all network interfaces.
10. Finally reboot.
CHECK BINARY
If the return code of the check binary is not zero watchdog will assume an error and reboot the system. Be careful with this if you are using the real-time properties of watchdog since watchdog will wait for the return of this binary before proceeding. An exit code smaller than 245 is interpreted as a system error code (see errno.h for details). Values of 245 or larger are special to watchdog:

255 (based on -1 as an unsigned 8-bit number)
    Reboot the system. This is not exactly an error message but a command to watchdog. If the return code is this, watchdog will not try to run a shutdown script instead.

254
    Reset the system. This is not exactly an error message but a command to watchdog. If the return code is this, watchdog will simply refuse to write to the kernel device again.

253
    Maximum load average exceeded.

252
    The temperature inside is too high.

251
    /proc/loadavg contains no (or not enough) data.

250
    The given file was not changed in the given interval.

249
    /proc/meminfo contains invalid data.

248
    Child process was killed by a signal.

247
    Child process did not return in time.

246
    Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).

245
    Reserved for an unknown result, for example a slow background test that is still running so neither a success nor an error.
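The split between system error codes and the reserved codes listed above can be expressed as a small lookup. This Python sketch is only an illustration for authors of check binaries; the names RESERVED and describe_exit_code are hypothetical and the short descriptions paraphrase the list above.

```python
import os

# Reserved codes, paraphrased from the list above.
RESERVED = {
    255: "reboot the system",
    254: "reset the system",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "reserved for an unknown result",
}


def describe_exit_code(code):
    """Return a human-readable meaning for a check binary's exit code."""
    if code == 0:
        return "success"
    if code in RESERVED:
        return RESERVED[code]
    if 0 < code < 245:
        return os.strerror(code)   # system error code, as in errno.h
    raise ValueError(f"not a valid exit code: {code}")
```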
REPAIR BINARY
The repair binary is started with one parameter: the error number that caused watchdog to initiate the boot process. After trying to repair the system the binary should exit with 0 if the system was successfully repaired and thus there is no need to boot anymore. A return value not equal to 0 tells watchdog to reboot. The return code of the repair binary should be the error number of the error causing watchdog to reboot. Be careful with this if you are using the real-time properties since watchdog will wait for the return of this binary before proceeding.
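The contract described above (one argument in, 0 or the original error number out) can be sketched as follows. The names repair, main and fixers are hypothetical, the individual fix routines are left to the administrator, and returning 1 when the argument is missing or malformed is an assumption of this sketch.

```python
def repair(error_number, fixers):
    """Return the exit code a repair binary should use.

    `fixers` maps an error number to a callable returning True on success.
    """
    fixer = fixers.get(error_number)
    if fixer is not None and fixer():
        return 0                # repaired: no reboot needed
    return error_number         # hand the original error back to watchdog


def main(argv, fixers):
    """Entry point: argv[1] is the error number passed by watchdog."""
    try:
        error_number = int(argv[1])
    except (IndexError, ValueError):
        return 1                # no usable argument: report failure (assumption)
    return repair(error_number, fixers)
```

A real script would finish with something like `raise SystemExit(main(sys.argv, FIXERS))`, where FIXERS is the site-specific table of repair routines. Since watchdog waits for the binary to return, each routine should be quick.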
TEST DIRECTORY
Executables placed in the test directory are discovered by watchdog on startup and are automatically executed. They are bounded time-wise by the test-timeout directive in watchdog.conf. These executables are called with either "test" as the first argument (if a test is being performed) or "repair" as the first argument (if a repair for a previously failed "test" operation is being performed).

As with test binaries and repair binaries, the expected exit code for a successful test or repair operation is always zero.

If an executable's test operation fails, the same executable is automatically called with the "repair" argument as well as the return code of the previously failed test operation.

For example, if the following execution returns 42:

    /etc/watchdog.d/my-test test

The watchdog daemon will attempt to repair the problem by calling:

    /etc/watchdog.d/my-test repair 42

This enables administrators and application developers to make intelligent test/repair commands. If the "repair" operation is not required (or is not likely to succeed), it is important that the author of the command return a non-zero value so the machine will still reboot as expected.

Note that the watchdog daemon may interpret and act upon any of the reserved return codes noted in the Check Binary section prior to calling a given command in "repair" mode.
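A combined test/repair command of the kind described above might be structured like this Python sketch. Everything specific in it is hypothetical: the heartbeat file path, the idea of repairing by recreating that file, and the choice of 42 as the failure code (borrowed from the example above). Returning 1 for an unknown mode is likewise an assumption.

```python
import os

TEST_FAILED = 42  # arbitrary example code, matching the example above


def main(argv, heartbeat="/run/my-app/heartbeat"):
    """Dispatch on the "test" / "repair" argument that watchdog passes."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        return 0 if os.path.exists(heartbeat) else TEST_FAILED
    if mode == "repair":
        failed_code = int(argv[2]) if len(argv) > 2 else TEST_FAILED
        if failed_code != TEST_FAILED:
            return failed_code          # not our failure: let watchdog reboot
        try:
            with open(heartbeat, "w"):
                pass                    # hypothetical repair: recreate the file
        except OSError:
            return failed_code          # repair failed: non-zero keeps the reboot
        return 0
    return 1                            # unknown mode: report failure (assumption)
```

An installed script would end with `raise SystemExit(main(sys.argv))`. Note how the repair branch returns a non-zero value whenever it cannot help, so the machine still reboots as expected.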
BUGS
None known so far.
AUTHORS
The original code is an example written by Alan Cox <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All additions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram <johnie@netgod.net> had the idea of testing the load average. He also took over the Debian specific work. Dave Cinege <dcinege@psychosis.com> brought up some hardware watchdog issues and helped testing this stuff.
FILES
/dev/watchdog
    The watchdog device.

/var/run/watchdog.pid
    The pid file of the running watchdog.
SEE ALSO
watchdog.conf(5)