Hardware Watchdog as Well as Software Watch Scripts

Provided by: watchdog_5.14-3_amd64

        

NAME

          watchdog - a software watchdog daemon        

SYNOPSIS

       watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]

DESCRIPTION

       The Linux kernel can reset the system if serious problems are detected. This can be implemented via special watchdog hardware, or via a slightly less reliable software-only watchdog inside the kernel. Either way, there needs to be a daemon that tells the kernel the system is working fine. If the daemon stops doing that, the system is reset.

       watchdog is such a daemon. It opens /dev/watchdog and keeps writing to it often enough to keep the kernel from resetting, at least once per minute. Each write delays the reboot time by another minute. After a minute of inactivity the watchdog hardware will cause the reset. In the case of the software watchdog, the ability to reboot will depend on the state of the machine and interrupts.

       The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
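As a rough illustration of the keep-alive behaviour described above, the sketch below writes to a watchdog device at a fixed interval and then closes it. It is not the daemon's actual code. The function name `keep_alive` and its parameters are hypothetical. Writing the character `V` before closing is the "magic close" convention from the Linux kernel watchdog API; it only disarms drivers that support it and is ignored when CONFIG_WATCHDOG_NOWAYOUT is enabled.

```python
import time


def keep_alive(device_path, interval=1.0, rounds=3, sleep=time.sleep):
    """Write to the watchdog device `rounds` times, then close it cleanly.

    Returns the number of keep-alive writes performed.
    """
    writes = 0
    # buffering=0 makes every write reach the device immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(rounds):
            dev.write(b"\0")  # any write postpones the reset
            writes += 1
            sleep(interval)
        # Assumption: the driver honours the "magic close" character.
        dev.write(b"V")
    return writes
```

On a real system `device_path` would be `/dev/watchdog`. Opening that device arms the timer, so it is safer to try the sketch against an ordinary file first.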

TESTS

       The watchdog daemon does several tests to check the system status:

       ·  Is the process table full?
       ·  Is there enough free memory?
       ·  Is there enough allocatable memory?
       ·  Are some files accessible?
       ·  Have some files changed within a given interval?
       ·  Is the average work load too high?
       ·  Has a file table overflow occurred?
       ·  Is a process still running? The process is specified by a pid file.
       ·  Do some IP addresses answer to ping?
       ·  Do network interfaces receive traffic?
       ·  Is the temperature too high? (Temperature data not always available.)
       ·  Execute a user defined command to do arbitrary tests.
       ·  Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.

       If any of these checks fails, watchdog will cause a shutdown. Should any of these tests, except the user defined binary, last longer than one minute, the machine will be rebooted, too.
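One of the checks listed above, the pid file test, can be sketched as follows. This is an illustrative approximation, not the daemon's implementation. The function name `process_alive` is hypothetical, and treating an unreadable or malformed pid file as a failure is an assumption.

```python
import os


def process_alive(pid_file):
    """Return True if the process named in `pid_file` still exists."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # assumption: a bad pid file counts as a failure
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```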

OPTIONS

       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with the SYSLOG feature. This mode logs various pieces of information to LOG_DAEMON with priority LOG_INFO. This is useful if you want to see exactly what happened until the watchdog rebooted the system. Currently it logs the temperature (if available), the load average, the change date of the files it checks and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is awake. Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot the system if an error occurs during the main loop, e.g. if a given file is not accessible via the stat(2) call. Note that this does not apply to the opening of /dev/watchdog and /proc/loadavg, which are opened before the main loop starts.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for example).

       -f, --force
              Force the usage of the interval given or the maximal load average given in the config file. Without this option these values are sanity checked.

       -c config-file, --config-file config-file
              Use config-file as the configuration file instead of the default /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing purposes. All checks are executed and the results are logged as usual, but no action is taken. Also, your hardware card or the kernel software watchdog driver is not enabled. Temperature checking is also disabled since this triggers the hardware watchdog on some cards.

FUNCTION

       After watchdog starts, it puts itself into the background and then tries all checks specified in its configuration file in turn. Between each two tests it will write to the kernel device to prevent a reset. After finishing all tests watchdog goes to sleep for some time. The kernel driver expects a write to the watchdog device every minute. Otherwise the system will be reset. watchdog will sleep for a configured interval that defaults to 1 second to make sure it triggers the device early enough.

       Under high system load watchdog might be swapped out of memory and may fail to make it back in in time. Under these circumstances the Linux kernel will reset the machine. To avoid unnecessary reboots, make sure you have the variable realtime set to yes in the configuration file watchdog.conf. This adds real time support to watchdog: it will lock itself into memory and there should be no problem even under the highest of loads.

       On a system running out of memory the kernel will try to free enough memory by killing processes. The watchdog daemon itself is exempted from this so-called out-of-memory killer.

       You can also specify a maximal allowed load average. Once this load average is reached the system is rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15 minutes. The default is to disable this test. Be careful not to set this parameter too low. To set a value less than the predefined minimal value of 2, you have to use the -f option.

       You can also specify a minimal amount of virtual memory you want to have available as free. As soon as more virtual memory is used, action is taken by watchdog. Note, however, that watchdog does not distinguish between different types of memory usage. It just checks for free virtual memory.

       If you have a watchdog card with a temperature sensor you can specify the maximal allowed temperature. Once this temperature is reached the system is halted. The default value is 120. There is no unit conversion, so make sure you use the same unit as your hardware. watchdog will issue warnings once the temperature reaches 90%, 95% and 98% of this temperature.

       When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause a reboot. For a reboot the stat call has to last at least one minute. This may happen if the file is located on an NFS mounted filesystem. If your system relies on an NFS mounted filesystem you might try this option. However, in such a case the sync option may not work if the NFS server is not answering.

       watchdog can read the pid from a pid file and see whether the process still exists. If not, action is taken by watchdog. So you can, for instance, restart the server from your repair-binary.

       watchdog will try periodically to fork itself to see whether the process table is full. This process will leave a zombie process until watchdog wakes up again and catches it; this is harmless, don't worry about it.

       In ping mode watchdog tries to ping the given IP addresses. These addresses do not have to be a single machine. It is possible to ping a broadcast address instead to see if at least one machine in a subnet is still living.

       Do not use this broadcast ping unless your MIS person a) knows about it and b) has given you explicit permission to use it!

       watchdog will send out three ping packets and wait up to <interval> seconds for the reply, with <interval> being the time it goes to sleep between two times triggering the watchdog device. Thus an unreachable network will not cause a hard reset but a soft reboot.

       You can also test passively for an unreachable network by just monitoring a given interface for traffic. If no traffic arrives the network is considered unreachable, causing a soft reboot or action from the repair binary.

       watchdog can run an external command for user-defined tests. A return code not equal to 0 means an error occurred and watchdog should react. If the external command is killed by an uncaught signal this is considered an error by watchdog too. The command may take longer than the time slice defined for the kernel device without a problem. However, error messages are generated into the syslog facility. If you have enabled softboot on error the machine will be rebooted if the binary doesn't exit in half the time watchdog sleeps between two tries triggering the kernel device.

       If you specify a repair binary it will be started instead of shutting down the system. If this binary is not able to fix the problem watchdog will still cause a reboot afterwards.

       If the machine is halted an email is sent to notify a human that the machine is going down. Starting with version 4.4 watchdog will also notify the human in charge if the machine is rebooted.
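The cycle described at the start of this section (run each check, write to the device between checks, then sleep) can be sketched roughly as below. This is a simplified model under stated assumptions, not the daemon's code. The names `load_ok` and `run_cycle` and the callables `pet` and `sleep` are hypothetical, and only the 1-minute load average is checked.

```python
import os


def load_ok(max_load1, loadavg=None):
    """Return True while the 1-minute load average is below the maximum."""
    if loadavg is None:
        loadavg = os.getloadavg  # (1 min, 5 min, 15 min) averages
    # The man page says the system reboots once the maximum is reached.
    return loadavg()[0] < max_load1


def run_cycle(checks, pet, sleep, interval=1):
    """Run each check once, writing to the device after every check.

    `checks` is a list of (name, callable) pairs; a callable returns True
    when the system looks healthy. Returns the names of failed checks.
    """
    failed = []
    for name, check in checks:
        if not check():
            failed.append(name)
        pet()  # keep the kernel timer from expiring between tests
    sleep(interval)  # the man page gives a default interval of 1 second
    return failed
```

A caller would pass a `pet` function that writes to `/dev/watchdog` and would react to a non-empty result by starting the repair binary or a soft reboot.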

SOFT REBOOT

       A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error that is found. Since there might be no more processes available, watchdog does it all by itself. That means:

       1.  Kill all processes with SIGTERM.
       2.  After a short pause kill all remaining processes with SIGKILL.
       3.  Record a shutdown entry in wtmp.
       4.  Save the random seed from /dev/urandom. If the device is non-existent or there is no filename for saving, this step is skipped.
       5.  Turn off accounting.
       6.  Turn off quota and swap.
       7.  Unmount all partitions except the root partition.
       8.  Remount the root partition read-only.
       9.  Shut down all network interfaces.
       10. Finally reboot.
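Steps 1 and 2 above follow a common "ask, then force" pattern. The sketch below applies it to a single child process rather than to every process on the system, so it only illustrates the idea. The function name `stop_process` and the `grace` parameter are hypothetical.

```python
import subprocess


def stop_process(proc, grace=2.0):
    """Ask `proc` to exit with SIGTERM, then force it with SIGKILL.

    `proc` is a subprocess.Popen object. Returns its final return code,
    which on POSIX is the negative signal number if a signal ended it.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()
```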

CHECK BINARY

       If the return code of the check binary is not zero watchdog will assume an error and reboot the system. Be careful with this if you are using the real-time properties of watchdog since watchdog will wait for the return of this binary before proceeding. An exit code smaller than 245 is interpreted as a system error code (see errno.h for details). Values of 245 or larger are special to watchdog:

       255    (Based on -1 as an unsigned 8-bit number.) Reboot the system. This is not exactly an error message but a command to watchdog. If the return code is this, watchdog will not try to run a shutdown script instead.
       254    Reset the system. This is not exactly an error message but a command to watchdog. If the return code is this, watchdog will simply refuse to write to the kernel device again.
       253    Maximum load average exceeded.
       252    The temperature inside is too high.
       251    /proc/loadavg contains no (or not enough) data.
       250    The given file was not changed in the given interval.
       249    /proc/meminfo contains invalid data.
       248    Child process was killed by a signal.
       247    Child process did not return in time.
       246    Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).
       245    Reserved for an unknown result, for example a slow background test that is still running, so neither a success nor an error.
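The table above can be expressed as a small lookup, which also shows why -1 and -10 appear as 255 and 246: exit statuses are truncated to 8 bits. This is only a reading aid built from the list above. The names `WATCHDOG_CODES` and `describe_exit_code` are hypothetical and do not exist in watchdog itself.

```python
WATCHDOG_CODES = {
    255: "reboot requested",
    254: "reset requested",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "reserved for an unknown result",
}


def describe_exit_code(code):
    """Map a check binary's exit status to the meaning listed above."""
    code &= 0xFF  # exit statuses are 8-bit, so -1 becomes 255
    if code == 0:
        return "success"
    if code < 245:
        return "system error code (errno %d)" % code
    return WATCHDOG_CODES[code]
```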

REPAIR BINARY

       The repair binary is started with one parameter: the error number that caused watchdog to initiate the boot process. After trying to repair the system the binary should exit with 0 if the system was successfully repaired and thus there is no need to boot anymore. A return value not equal to 0 tells watchdog to reboot. The return code of the repair binary should be the error number of the error causing watchdog to reboot. Be careful with this if you are using the real-time properties since watchdog will wait for the return of this binary before proceeding.
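A repair binary following this contract might be structured like the sketch below. The function `repair` and the `try_fix` callable are hypothetical placeholders for site-specific logic, and returning 1 when no error number is supplied is an assumption rather than documented behaviour.

```python
def repair(argv, try_fix):
    """Return the exit status a repair binary should use.

    `argv[1]` is the error number passed by watchdog; `try_fix` is a
    hypothetical callable that returns True when the repair succeeded.
    """
    if len(argv) < 2 or not argv[1].isdigit():
        return 1  # assumption: treat a missing error number as a failure
    error = int(argv[1])
    if try_fix(error):
        return 0  # repaired, no reboot needed
    return error or 1  # hand the original error back so watchdog reboots
```

A script would typically end with `sys.exit(repair(sys.argv, my_fix))`, where `my_fix` stands for whatever recovery step applies, such as restarting a service.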

TEST DIRECTORY

       Executables placed in the test directory are discovered by watchdog on startup and are automatically executed. They are bounded time-wise by the test-timeout directive in watchdog.conf.

       These executables are called with either "test" as the first argument (if a test is being performed) or "repair" as the first argument (if a repair for a previously-failed "test" operation is being performed).

       As with test binaries and repair binaries, the expected exit code for a successful test or repair operation is always zero.

       If an executable's test operation fails, the same executable is automatically called with the "repair" argument as well as the return code of the previously-failed test operation.

       For example, if the following execution returns 42:

           /etc/watchdog.d/my-test test

       The watchdog daemon will attempt to repair the problem by calling:

           /etc/watchdog.d/my-test repair 42

       This enables administrators and application developers to make intelligent test/repair commands. If the "repair" operation is not required (or is not likely to succeed), it is important that the author of the command return a non-zero value so the machine will still reboot as expected.

       Note that the watchdog daemon may interpret and act upon any of the reserved return codes noted in the Check Binary section prior to calling a given command in "repair" mode.
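The test-then-repair calling sequence described above can be modelled from the caller's side as follows. This is a simplified sketch, not the daemon's logic: it ignores the test-timeout and the reserved return codes, and the function name `run_test_then_repair` is hypothetical.

```python
import subprocess


def run_test_then_repair(command):
    """Run `command test`; on failure run `command repair <code>`.

    `command` is a list such as ["/etc/watchdog.d/my-test"]. Returns a
    (test_code, repair_code) tuple; repair_code is None if the test passed.
    """
    test_code = subprocess.run(command + ["test"]).returncode
    if test_code == 0:
        return 0, None
    # Pass the failed test's return code on as the second argument.
    repair_args = command + ["repair", str(test_code)]
    repair_code = subprocess.run(repair_args).returncode
    return test_code, repair_code
```

With the example from the text, a test that returns 42 leads to a second call with the arguments `repair 42`, and a non-zero `repair_code` means the machine would still be rebooted.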

BUGS

       None known so far.

AUTHORS

       The original code is an example written by Alan Cox <alan@lxorguk.ukuu.org.uk>, the author of the kernel driver. All additions were written by Michael Meskes <meskes@debian.org>. Johnie Ingram <johnie@netgod.net> had the idea of testing the load average. He also took over the Debian specific work. Dave Cinege <dcinege@psychosis.com> brought up some hardware watchdog issues and helped testing this stuff.

FILES

       /dev/watchdog
              The watchdog device.

       /var/run/watchdog.pid
              The pid file of the running watchdog.

SEE ALSO

          watchdog.conf(5)        
