Nagios

Debian packages

The following Debian packages need to be installed

nagios3
nagios3-doc


Glossary

Object 
The nagios configuration is made up of objects of various types (e.g. hosts, host groups, contacts, services, etc.)
Service 
A service is something that should be monitored by nagios
Host 
A host is a machine that provides services that nagios should monitor
Host group 
An aggregation of multiple hosts (e.g. "pingable servers", "SSH servers", etc.); mainly used for larger networks
Contact 
nagios notifies contacts, using various means (e.g. e-mail, SMS, pager), when it detects a problem
Contact group 
An aggregation of multiple contacts (e.g. "admins"); mainly used for larger organisations
Time period 
Time periods are definitions used to tell nagios when it is allowed to check hosts, services and notify contacts (e.g. 24x7, work hours, etc.)
Command 
A command defines how to perform a certain check on a target host or service, or how to notify a target contact


DebConf configuration

Questions + answers:

  • apache servers to configure for nagios3 = apache2
  • enable support for nagios 1.x links = no
  • web admin password = <secret>

Notes:

  • User name (nagiosadmin) and password of the web admin are placed into
/etc/nagios3/htpasswd.users
  • The web admin user name is also referenced in
/etc/nagios3/cgi.cfg
  • See chapter "LDAP" further down for information on how to use LDAP as the user/password database.
  • dpkg automatically generates a symlink /etc/apache2/conf.d/nagios3.conf that points to /etc/nagios3/apache.conf


Apache configuration

Nagios requires PHP to be enabled in its webroot, for example:

<Directory /usr/share/nagios3/htdocs/>
  <IfModule mod_php5.c>
    php_admin_flag engine on
  </IfModule>
</Directory>


Web administration

Overview

During installation of the Debian package, the Apache web server was already configured for web access, using basic HTTP authentication. The following sections provide details about:

  • more configuration aspects of the web interface
  • how to switch from basic HTTP authentication to LDAP authentication
  • how to enable full featured control of the nagios daemon from the web interface through the so-called "external commands" feature


/etc/nagios3/cgi.cfg

This file configures all aspects of the web interface of nagios. For instance

  • whether or not authentication is required (default = yes)
  • what users are allowed to do
  • refresh rate
  • which sound files should be played (if any) in the web browser if problems are detected
  • etc.


By default, the web admin user nagiosadmin created by DebConf is referenced and has all possible permissions. This and all other default settings are OK for my purposes, so I didn't have to change anything in the beginning. Later on, when LDAP authentication is enabled, it might be necessary to change the name of the admin user.


LDAP

By default, web interface authentication uses basic HTTP authentication backed by mod_authn_file, i.e. users/passwords are stored in a file:

/etc/nagios3/htpasswd.users

If users/passwords should be taken from LDAP instead, the following changes are needed after the DebConf configuration has finished:

  • edit /etc/apache2/conf.d/nagios3.conf
    • replace this block
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /etc/nagios3/htpasswd.users
require valid-user
    • by the new block
AuthName "Nagios Access"
AuthType Basic
# LDAP connection information is inherited
AuthBasicProvider ldap
Require ldap-user admin
  • remove /etc/nagios3/htpasswd.users because it is no longer needed
  • make sure that the user(s) used for authentication are properly referenced in /etc/nagios3/cgi.cfg (in my case I changed "nagiosadmin" to the generic "admin" account)


Enabling external commands

"External commands" is a scheme to control the nagios daemon's operation without shutting it down. An external client can write control instructions ("external commands") to a pipe, which the daemon periodically checks to see if it must perform some actions. The approach is commonly used to send commands from the web interface.

What basically needs to be done is:

  • enable the "external commands" feature (in /etc/nagios3/nagios.cfg)
  • give the web server write access to the command pipe (/var/lib/nagios3/rw/nagios.cmd)
  • make sure that the permission change is permanent, i.e. make sure that the change is not reverted by the next update of the Debian package

/usr/share/doc/nagios3-common/README.Debian explains how best to do this:

  • set "check_external_commands = 1" in /etc/nagios3/nagios.cfg
  • check if /var/lib/nagios3 and /var/lib/nagios3/rw have the correct permissions
root@pelargir:~# l /var/lib/nagios3/
total 28
drwxr-x--x  4 nagios nagios    4096 Jun 29 20:52 .
drwxr-xr-x 54 root   root      4096 Jun 29 20:47 ..
-rw-r--r--  1 nagios nagios   11099 Jun 29 20:52 retention.dat
drwx--s---  2 nagios www-data  4096 Jun  4 20:25 rw
drwxr-x---  3 nagios nagios    4096 Jun 29 20:47 spool
  • if the two directories mentioned do not have the listed permissions, run these commands
/etc/init.d/nagios3 stop
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
/etc/init.d/nagios3 start
  • the permission changes are made with dpkg-statoverride so that they are permanent and survive package updates
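
Once the pipe is writable, a command can also be submitted by hand. The following is a minimal sketch, assuming the usual external command line format "[timestamp] COMMAND;argument;..." and using SCHEDULE_FORCED_SVC_CHECK as the example command. The helper name nagios_ext_cmd and the host/service pair (localhost, SMTP) are illustrative choices, not something provided by Nagios or the Debian package.

```bash
#!/bin/bash
# Path of the command pipe as given in the text above.
CMD_PIPE="${CMD_PIPE:-/var/lib/nagios3/rw/nagios.cmd}"

# Build one external command line: "[<unix time>] <COMMAND>;<arg1>;<arg2>;..."
# (nagios_ext_cmd is a hypothetical helper name, not part of Nagios.)
nagios_ext_cmd() {
    local now="$1" command="$2"
    shift 2
    local line="[$now] $command"
    local arg
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line"
}

# Example: ask the daemon to re-check the "SMTP" service on "localhost" now.
now=$(date +%s)
line=$(nagios_ext_cmd "$now" SCHEDULE_FORCED_SVC_CHECK localhost SMTP "$now")

# Only write if the pipe really exists and is writable; otherwise just show
# the line. Note that writing blocks while no daemon is reading the pipe.
if [ -p "$CMD_PIPE" ] && [ -w "$CMD_PIPE" ]; then
    printf '%s\n' "$line" > "$CMD_PIPE"
else
    echo "would write: $line"
fi
```

Running this as a user that is a member of the group owning the rw directory is a quick way to verify that the permission changes took effect.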


Monitoring configuration

Overview

The main configuration file used by the nagios daemon is

/etc/nagios3/nagios.cfg

nagios.cfg includes a couple of other files, but most importantly it defines a directory where you can drop files to extend the nagios configuration:

/etc/nagios3/conf.d

Note: Files in this directory can have any name.


To test whether the current configuration is correct:

nagios3 -v /etc/nagios3/nagios.cfg
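
The check can be combined with a restart so that a broken configuration never reaches the running daemon. This is only a sketch: the function name restart_if_valid and the three overridable variables are made up for illustration, and it assumes that the init script accepts "restart" in addition to the "stop" and "start" actions used elsewhere on this page.

```bash
#!/bin/bash
# Defaults taken from this page; all three can be overridden from outside.
NAGIOS_BIN="${NAGIOS_BIN:-nagios3}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}"
NAGIOS_INIT="${NAGIOS_INIT:-/etc/init.d/nagios3}"

# Restart the daemon only if the configuration check succeeds.
# (restart_if_valid is a hypothetical helper name.)
restart_if_valid() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null 2>&1; then
        "$NAGIOS_INIT" restart
    else
        echo "configuration check failed, daemon left untouched" >&2
        return 1
    fi
}

# Usage (as root):
#   restart_if_valid
```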


Nagios plugins

The Debian package nagios-plugins provides a lot of useful commands that can be used to check hosts and services. The command definitions live in

/etc/nagios-plugins/config

which is included by default by nagios.cfg. The actual plugin binaries are located in

/usr/lib/nagios/plugins


Check scripts

The plugins provided by the Debian package nagios-plugins may not be sufficient in all cases. Sometimes you may want to create your own tool to check something in a special way. Such a tool can be any kind of script or binary, as long as it satisfies the following interface:

  • exit status 0 = everything ok
  • exit status 1 = warning
  • exit status 2 = critical
  • exit status 3 = unknown (the check itself could not be performed)
  • output on stdout is used by nagios as status information
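
The interface above can be sketched as a small threshold check. This is an illustrative skeleton, not one of the packaged plugins: the function name check_threshold and its argument order are assumptions, and it uses exit status 3 for "unknown" as the Nagios plugin convention defines it.

```bash
#!/bin/bash
# check_threshold VALUE WARN CRIT
# Prints one status line on stdout and returns the matching status:
# 0 = ok, 1 = warning, 2 = critical, 3 = unknown.
# (check_threshold is a hypothetical helper name.)
check_threshold() {
    local value="$1" warn="$2" crit="$3"

    # Anything that is not a non-negative integer cannot be evaluated.
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN: '$value' is not a number"
            return 3
            ;;
    esac

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL: value is $value (limit $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING: value is $value (limit $warn)"
        return 1
    fi
    echo "OK: value is $value"
    return 0
}

# In a real check script the function result becomes the exit status, e.g.:
#   check_threshold "$(who | wc -l | tr -d ' ')" 20 50
#   exit $?
```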


An example check script that I wrote:

pelargir:~# cat /usr/local/lib/nagios_check_boinc.sh 
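#!/bin/bash
# Name of this script, used to exclude it from the process list
MYNAME=$(basename "$0")
# The brackets prevent the grep process itself from matching
PROCESS_NAME="[b]oinc"
PROCESS_INFORMATION="$(ps aux | grep "$PROCESS_NAME" | grep -v "$MYNAME")"
if test -z "$PROCESS_INFORMATION"; then
   # no matching process found: report CRITICAL
   echo "boinc: not running"
   exit 2
else
   echo "boinc: $PROCESS_INFORMATION"
   exit 0
fi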
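
Before wiring a check script into the Nagios configuration it helps to run it by hand and look at both the output and the exit status. The wrapper below is a sketch for that purpose; the names state_name and run_check are made up for illustration.

```bash
#!/bin/bash
# Translate an exit status into the state name shown by Nagios.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run a check command, then print "<STATE>: <stdout of the check>" and
# return the check's own exit status.
run_check() {
    local output status
    output=$("$@")
    status=$?
    printf '%s: %s\n' "$(state_name "$status")" "$output"
    return "$status"
}

# Usage, ideally as the user the daemon runs as:
#   run_check /usr/local/lib/nagios_check_boinc.sh
```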


/etc/nagios3/nagios.cfg

The default settings in this file are largely OK. I only changed the following values:

  • check_external_commands = 1 (see the section about "external commands" further up in this document)
  • date_format = euro

Note: /usr/share/doc/nagios3-common/README.Debian suggests not editing /etc/nagios3/nagios.cfg, but instead placing changes into /etc/nagios3/conf.d/nagios.cfg. This does not work, though.


Hosts

The generic host that can be used as the base for host definitions is located in

/etc/nagios3/conf.d/generic-host_nagios2.cfg

Its main characteristics are:

  • allow notifications if a problem is detected with the host
  • notify 24x7
  • contact the group "admins"
  • use the "check-host-alive" command (defined in /etc/nagios-plugins/config/ping.cfg)


Predefined hosts are:

  • localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
  • default gateway, in /etc/nagios3/conf.d/host-gateway_nagios3.cfg; the IP address of the default gateway is configured automatically, probably by DebConf


I added the following custom hosts to /etc/nagios3/conf.d/pelargir-hosts.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-hosts.cfg 
# Any host on the internet that can be used to check whether the internet is reachable
define host {
        host_name   inethost
        alias       Internet host
        address     www.google.ch
        use         generic-host
        # let nagios distinguish between "host unreachable" and "host down"
        parents     gateway
        # Google uses IPv6 since June 6 2012 (World IPv6 Launch), apparently
        # we now need to explicitly check things using IPv4
        check_command   check-host-alive_4
        }

# Host pelargir, network interface facing the Internet
define host {
        host_name   pelargir-inet
        alias       pelargir-inet
        address     192.168.178.20
        use         generic-host
        }

# Host pelargir, network interface facing the intranet LAN network
define host {
        host_name   pelargir-lan
        alias       pelargir-lan
        address     192.168.1.11
        use         generic-host
        }

# Host pelargir, network interface facing the intranet WiFi network
define host {
        host_name   pelargir-wifi
        alias       pelargir-wifi
        address     192.168.2.6
        use         generic-host
        }

# Host landroval
define host {
        host_name   landroval
        alias       landroval
        address     192.168.2.2
        use         generic-host
        }

# Host www.herzbube.ch
define host {
        host_name   www.herzbube.ch
        alias       www.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }

# Host www.moser-naef.ch
define host {
        host_name   www.moser-naef.ch
        alias       www.moser-naef.ch
        address     212.101.18.224
        use         generic-host
        }

# Host smtp.herzbube.ch
define host {
        host_name   smtp.herzbube.ch
        alias       smtp.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }

# Host mail.herzbube.ch
define host {
        host_name   mail.herzbube.ch
        alias       mail.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }


Host groups

Predefined host groups in /etc/nagios3/conf.d/hostgroups_nagios2.cfg are:

  • all servers (members = *)
  • debian servers (members = localhost)
  • HTTP servers (members = localhost)
  • SSH servers (members = localhost)
  • ping servers (members = gateway)


I added the following custom host groups to /etc/nagios3/conf.d/pelargir-hostgroups.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-hostgroups.cfg 
# List of hosts located on the Internet
define hostgroup {
        hostgroup_name  internet-hosts
        alias           Internet hosts
        members         inethost
        }

# List of hosts located on the Intranet
define hostgroup {
        hostgroup_name  intranet-hosts
        alias           Intranet hosts
        members         pelargir-inet, pelargir-lan, pelargir-wifi, gateway, landroval
        }

# List of SMTP servers
define hostgroup {
        hostgroup_name  smtp-servers
        alias           SMTP servers
        members         localhost, smtp.herzbube.ch
        }

# List of IMAP servers
define hostgroup {
        hostgroup_name  imap-servers
        alias           IMAP servers
        members         localhost, mail.herzbube.ch
        }

# List of DHCP servers
define hostgroup {
        hostgroup_name  dhcp-servers
        alias           DHCP servers
        members         gateway, pelargir-lan, pelargir-wifi
        }

# List of LDAP servers
define hostgroup {
        hostgroup_name  ldap-servers
        alias           LDAP servers
        members         localhost
        }

# List of MySQL servers
define hostgroup {
        hostgroup_name  mysql-servers
        alias           MySQL servers
        members         localhost
        }

# List of PostgreSQL servers
define hostgroup {
        hostgroup_name  postgresql-servers
        alias           PostgreSQL servers
        members         localhost
        }

# List of Samba servers
define hostgroup {
        hostgroup_name  samba-servers
        alias           Samba servers
        members         localhost
        }


Contacts & contact groups

Predefined contacts and contact groups in /etc/nagios3/conf.d/contacts_nagios2.cfg are:

  • a single contact named "root" whose email address is root@localhost
  • a single contact group named "admins" (which has the contact "root" as its single member)


The predefined contact and contact group are sufficient for my purposes, so I do not define any custom contacts or groups.


Commands

I added the following custom commands to /etc/nagios3/conf.d/pelargir-commands.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-commands.cfg 
# Connect to a host on a given port and check that the SSL certificate provided is valid
# for a given minimum number of days; if the check fails this will result in a WARNING only
# (not CRITICAL)
define command {
        command_name    check_cert
        command_line    /usr/lib/nagios/plugins/check_http -H '$HOSTNAME$' -p '$ARG1$' -C '$ARG2$'
        }

# check_dhcp must be run with root privileges. README.Debian suggests to
# achieve this by setting the setuid bit on check_dhcp, by issuing this
# command:
#   dpkg-statoverride --update --add root nagios 4750 /usr/lib/nagios/plugins/check_dhcp
# However, I prefer sudo to the setuid flag, although this involves an additional bit
# of configuration in the sudoers file.
define command{
        command_name    sudo_check_dhcp_interface
        command_line    sudo -u root /usr/lib/nagios/plugins/check_dhcp -s '$HOSTADDRESS$' -i '$ARG1$'
        }


Services

The generic service that can be used as the base for service definitions is located in

/etc/nagios3/conf.d/generic-service_nagios2.cfg

Its main characteristics are:

  • allow notifications if a problem is detected with the service
  • notify only when the problem occurs, or goes away (notification_interval = 0; if this were set to a non-zero value, notifications would be sent periodically)
  • notify 24x7
  • notify the contact group "admins"
  • allow checking 24x7
  • check every 5 minutes


Predefined services are:

  • for localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
    • disk space (warning/critical if free space on any file system falls below 20%/10%)
    • number of currently logged in users (warning/critical if more than 20/50 users)
    • number of processes (warning/critical if more than 250/400 processes)
    • machine load (warning/critical if more than (5.0, 4.0, 3.0) / (10.0, 6.0, 4.0))
  • for specific services, in /etc/nagios3/conf.d/services_nagios2.cfg
    • HTTP servers
    • SSH servers
    • ping'able servers


I added the following custom services to /etc/nagios3/conf.d/pelargir-services.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-services.cfg 
# check that an Internet connection is available
define service {
        hostgroup_name                  internet-hosts
        service_description             Internet connection
        # Google uses IPv6 since June 6 2012 (World IPv6 Launch), apparently
        # we now need to explicitly check things using IPv4
        check_command                   check_ping_4!100.0,20%!500.0,60%
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that SMTP services are running
define service {
        hostgroup_name                  smtp-servers
        service_description             SMTP
        check_command                   check_smtp
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that IMAP services are running
define service {
        hostgroup_name                  imap-servers
        service_description             IMAP
        check_command                   check_imap
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that FTP services are running
define service {
        hostgroup_name                  ftp-servers
        service_description             FTP
        check_command                   check_ftp
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that DHCP services are running
#
# Note: check_dhcp must be run with root privileges
define service {
        host_name                       gateway
        service_description             FritzBox DHCP
        check_command                   sudo_check_dhcp_interface!eth1
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

define service {
        host_name                       pelargir-lan
        service_description             Intranet LAN DHCP
        check_command                   sudo_check_dhcp_interface!eth0
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

define service {
        host_name                       pelargir-wifi
        service_description             Intranet WiFi DHCP
        check_command                   sudo_check_dhcp_interface!eth2
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that LDAP services are running
define service {
        hostgroup_name                  ldap-servers
        service_description             LDAP
        # cannot use check_ldap because our LDAP service does not allow anonymous binding
        # and I don't want to expose passwords here
        check_command                   check_tcp!389
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that MySQL services are running
#
# Note: The "nagios" database user has no password. We specify "no password"
# by adding a trailing "!" to the command. This means that the second argument
# of the command, which is the password, is an empty string.
define service {
        hostgroup_name                  mysql-servers
        service_description             MySQL
        check_command                   check_mysql_cmdlinecred!nagios!
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that PostgreSQL services are running
define service {
        hostgroup_name                  postgresql-servers
        service_description             PostgreSQL
        check_command                   check_pgsql
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that Samba services are running
define service {
        hostgroup_name                  samba-servers
        service_description             Samba
        check_command                   check_tcp!445
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check validity of the SSL certificate provided on port 443
define service {
        host_name                       www.herzbube.ch
        service_description             SSL certificate validity on port 443 www.herzbube.ch
        check_command                   check_cert!443!28
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}
define service {
        host_name                       www.moser-naef.ch
        service_description             SSL certificate validity on port 443 www.moser-naef.ch
        check_command                   check_cert!443!28
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}
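
The check_command values above pass arguments to a command by separating them with "!": the first part is the command name, the following parts become $ARG1$, $ARG2$ and so on in the command's command_line. The sketch below imitates that substitution to show what ends up being executed. It is a simplified illustration, not how Nagios implements it: the function name expand_check_command is made up, only $HOSTNAME$ and $ARG1$ to $ARG9$ are handled, and escaped "!" characters are not supported.

```bash
#!/bin/bash
# expand_check_command CHECK_COMMAND COMMAND_LINE HOST_NAME
# Splits CHECK_COMMAND at "!" and fills the pieces into COMMAND_LINE.
# (hypothetical helper, not part of Nagios)
expand_check_command() {
    local check_command="$1" command_line="$2" host_name="$3"
    local -a parts
    local i macro

    # parts[0] is the command name, parts[1] is $ARG1$, parts[2] is $ARG2$, ...
    IFS='!' read -r -a parts <<< "$check_command"

    macro='$HOSTNAME$'
    command_line=${command_line//"$macro"/$host_name}

    for i in 1 2 3 4 5 6 7 8 9; do
        macro="\$ARG${i}\$"
        # An argument that is missing or empty expands to an empty string,
        # which is what the trailing "!" in the MySQL service relies on.
        command_line=${command_line//"$macro"/${parts[i]-}}
    done

    printf '%s\n' "$command_line"
}

# The check_cert command line from pelargir-commands.cfg, as used by the
# service for www.herzbube.ch:
template="/usr/lib/nagios/plugins/check_http -H '\$HOSTNAME\$' -p '\$ARG1\$' -C '\$ARG2\$'"
expand_check_command 'check_cert!443!28' "$template" www.herzbube.ch
```

The printed line can be pasted into a shell to run the same check that Nagios would run.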


Service configuration

Overview

Some services need to be configured in a special way so that they can be monitored by Nagios. A typical example is that the service needs to allow the nagios system user minimal access to its resources.


sudo

Some of the check commands that Nagios needs to run can only be run as root. I prefer to use sudo for this purpose (instead of setting the setuid flag on the command binaries). The following snippet is the sudo configuration I use:

# cat /etc/sudoers.d/pelargir-nagios.conf
# Nagios must never be lectured when it runs commands because this
# might interfere with how it evaluates the command's output
Defaults:  nagios !lecture

# Nagios never needs to authenticate when it runs commands because this
# is, obviously, beyond the capabilities of a daemon. This could also
# be solved by specifying the NOPASSWD: option for each command.
Defaults:  nagios !authenticate

# ALL should be a "non" terminal
# nagios = the line applies only when sudo is invoked by the user "nagios"
# (root) = only allows "sudo -u root"; no groups may be specified
#          (sudo -g), and no users other than "root" may be specified
nagios pelargir = (root) /usr/lib/nagios/plugins/check_dhcp


DHCP

My DHCP server currently hands out addresses to any machine regardless of its MAC address, so there is no need to add any Nagios-specific entries to the configuration of my DHCP server.

Pseudo entry used by Nagios monitoring. The entry has a fake MAC address that must be supplied by the Nagios check command.


MySQL

Add a new user "nagios" with no password that has no privileges except the general "Usage" privilege.
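
One possible way to create such a user from the command line is sketched below. The function name nagios_mysql_user_sql is made up, and the host part "localhost" is an assumption based on the MySQL server being monitored on localhost; adjust both statements if the check connects from elsewhere.

```bash
#!/bin/bash
# Print the SQL statements that create a password-less user which has
# nothing but the USAGE privilege.
# (nagios_mysql_user_sql is a hypothetical helper name.)
nagios_mysql_user_sql() {
    local user="${1:-nagios}" host="${2:-localhost}"
    cat <<SQL
CREATE USER '$user'@'$host';
GRANT USAGE ON *.* TO '$user'@'$host';
SQL
}

# Show the statements; to apply them, pipe them into the client instead:
#   nagios_mysql_user_sql | mysql -u root -p
nagios_mysql_user_sql
```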


PostgreSQL

Add a new role:

createuser --no-superuser --no-createdb --no-createrole nagios

Add the following line to /etc/postgresql/8.4/main/pg_hba.conf. This allows the nagios role to connect to the template1 database without a password. Note: Don't forget to restart the PostgreSQL daemon after adding the line.

host    template1   nagios      127.0.0.1/32          trust

From now on the Nagios command check_pgsql defined in

/etc/nagios-plugins/config/pgsql.cfg

(which invokes /usr/lib/nagios/plugins/check_pgsql) can be used to monitor basic presence of the PostgreSQL service.


Discussion of the configuration line in pg_hba.conf:

  • Because we have used "trust" as the authentication method, any user is allowed to connect as "nagios"!
  • Normally we would like to use the "ident" authentication method because this restricts connection attempts to specific system users
  • This works very nicely as long as the connection attempt is made via Unix-domain sockets, i.e. if the line in pg_hba.conf uses the "local" keyword
  • However, as soon as the connection attempt is made via TCP/IP (i.e. the line in pg_hba.conf uses the "host" keyword), PostgreSQL will try to verify the identity of the connecting user by contacting an ident daemon on the system where the connection attempt came from
  • Because the Nagios command check_pgsql does connect via TCP/IP, I would be forced to run an ident daemon
  • Having the choice between opening access to the template1 database and installing an entire new service on my server, I clearly choose the former option!