Nagios

From HerzbubeWiki

Debian packages

The following Debian packages need to be installed

nagios3
nagios3-doc


References


Glossary

Object 
The nagios configuration is made up of objects of various types (e.g. hosts, host groups, contacts, services, etc.)
Service 
A service is something that should be monitored by nagios
Host 
A host is a machine that provides services that nagios should monitor
Host group 
An aggregation of multiple hosts (e.g. "pingable servers", "SSH servers", etc.); mainly used for larger networks
Contact 
nagios notifies contacts, using various means (e.g. e-mail, SMS, pager), when it detects a problem
Contact group 
An aggregation of multiple contacts (e.g. "admins"); mainly used for larger organisations
Time period 
Time periods are definitions used to tell nagios when it is allowed to check hosts, services and notify contacts (e.g. 24x7, work hours, etc.)
Command 
A command defines how to perform a certain check on a target host or service (e.g. ), or how to notify a target contact


DebConf configuration

Questions + answers:

  • apache servers to configure for nagios3 = apache2
  • enable support for nagios 1.x links = no
  • web admin password = <secret>

Notes:

  • User name (nagiosadmin) and password of the web admin are placed into
/etc/nagios3/htpasswd.users
  • The web admin user name is also referenced in
/etc/nagios3/cgi.cfg
  • See chapter "LDAP" further down for information on how to use LDAP as the user/password database.
  • dpkg automatically generates a symlink /etc/apache2/conf.d/nagios3.conf that points to /etc/nagios3/apache.conf


Apache configuration

Nagios requires PHP to be enabled in its webroot, for example:

<Directory /usr/share/nagios3/htdocs/>
  <IfModule mod_php5.c>
    php_admin_flag engine on
  </IfModule>
</Directory>


Web administration

Overview

During installation of the Debian package, the Apache web server was already configured for web access, using basic HTTP authentication. The following sections provide details about

  • more configuration aspects of the web interface
  • how to switch from basic HTTP authentication to LDAP authentication
  • how to enable full featured control of the nagios daemon from the web interface through the so-called "external commands" feature


/etc/nagios3/cgi.cfg

This file configures all aspects of the web interface of nagios. For instance

  • whether or not authentication is required (default = yes)
  • what users are allowed to do
  • refresh rate
  • which sound files should be played (if any) in the web browser if problems are detected
  • etc.


By default, the web admin user nagiosadmin created by DebConf is referenced and has all possible permissions. This and all other default settings are OK for my purposes, so I didn't have to change anything in the beginning. Later on, when LDAP authentication is enabled, it might be necessary to change the name of the admin user.


LDAP

By default, web interface authentication uses a flat password file (handled by Apache's mod_authn_file), i.e. users/passwords are stored in a file:

/etc/nagios3/htpasswd.users

If users/passwords should be taken from LDAP instead, the following changes are needed after the DebConf configuration has finished:

  • edit /etc/apache2/conf.d/nagios3.conf
    • replace this block
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /etc/nagios3/htpasswd.users
require valid-user
    • by the new block
Include pelargir-ldap.conf
AuthName "Nagios Access"
AuthType Basic
AuthBasicProvider ldap
Require valid-user
AuthzLDAPAuthoritative off
  • remove /etc/nagios3/htpasswd.users because it is no longer needed
  • make sure that the user(s) used for authentication are properly referenced in /etc/nagios3/cgi.cfg (in my case I changed "nagiosadmin" to the generic "admin" account)


Enabling external commands

"External commands" is a scheme to control the nagios daemon's operation without shutting it down. An external client can write control instructions ("external commands") to a pipe, which the daemon periodically checks to see if it must perform some actions. The approach is commonly used to send commands from the web interface.

What basically needs to be done is:

  • enable the "external commands" feature (in /etc/nagios3/nagios.cfg)
  • give the web server write access to the command pipe (/var/lib/nagios3/rw/nagios.cmd)
  • make sure that the permission change is permanent, i.e. make sure that the change is not reverted by the next update of the Debian package

/usr/share/doc/nagios3-common/README.Debian explains how best to do this:

  • set "check_external_commands = 1" in /etc/nagios3/nagios.cfg
  • check if /var/lib/nagios3 and /var/lib/nagios3/rw have the correct permissions
root@pelargir:~# l /var/lib/nagios3/
total 28
drwxr-x--x  4 nagios nagios    4096 Jun 29 20:52 .
drwxr-xr-x 54 root   root      4096 Jun 29 20:47 ..
-rw-r--r--  1 nagios nagios   11099 Jun 29 20:52 retention.dat
drwx--s---  2 nagios www-data  4096 Jun  4 20:25 rw
drwxr-x---  3 nagios nagios    4096 Jun 29 20:47 spool
  • if the two directories mentioned do not have the listed permissions, run these commands
/etc/init.d/nagios3 stop
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
/etc/init.d/nagios3 start
  • the permission changes are made with dpkg-statoverride so that they are permanent and survive package updates
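
To illustrate the pipe mechanism described above, here is a minimal sketch of a client that writes one external command to the command pipe. The function name send_external_command is hypothetical; the line format "[timestamp] COMMAND;arguments" and the SCHEDULE_FORCED_SVC_CHECK command are part of the Nagios external command interface. The pipe path is a parameter so that the function can first be tried against an ordinary file.

```shell
#!/bin/bash
# Hypothetical helper: write one external command to the Nagios command pipe.
# Usage: send_external_command <pipe> <COMMAND> [arg ...]
send_external_command() {
    local pipe="$1" command="$2"
    shift 2
    local now line arg
    now="$(date +%s)"
    # Nagios expects "[unix timestamp] COMMAND;arg1;arg2;..."
    line="[$now] $command"
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line" > "$pipe"
}
```

Assuming a service with the description "SSH" exists for localhost, a forced check could then be requested with: send_external_command /var/lib/nagios3/rw/nagios.cmd SCHEDULE_FORCED_SVC_CHECK localhost SSH "$(date +%s)". The calling user needs write access to the pipe, which is what the permission changes above provide for the web server.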


Monitoring configuration

Overview

The main configuration file used by the nagios daemon is

/etc/nagios3/nagios.cfg

nagios.cfg includes a couple of other files, but most importantly it defines a directory where you can drop files to extend the nagios configuration:

/etc/nagios3/conf.d

Note: Files in this directory can have any name, as long as the name ends in .cfg (nagios ignores other files in a configuration directory).


To test whether the current configuration is correct:

nagios3 -v /etc/nagios3/nagios.cfg
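
The configuration test can be combined with the restart so that the daemon is only restarted when the configuration is valid. The following is a sketch, not part of the Debian package: the function name restart_if_valid and the three variables are hypothetical, and it assumes that "nagios3 -v" returns a non-zero exit status when it finds errors.

```shell
#!/bin/bash
# Defaults can be overridden from the environment (hypothetical variable names).
NAGIOS_BIN="${NAGIOS_BIN:-nagios3}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/nagios3 restart}"

# Verify the configuration and restart the daemon only if verification succeeds.
restart_if_valid() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null; then
        # Intentionally unquoted so that the command and its argument are split
        $RESTART_CMD
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}
```

Calling restart_if_valid after editing a file in /etc/nagios3/conf.d avoids taking down a running daemon because of a typo. When the check fails, running the plain "nagios3 -v" command shows the actual error messages.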


Nagios plugins

The Debian package nagios-plugins provides a lot of useful commands that can be used to check hosts and services. The command definitions live in

/etc/nagios-plugins/config

which is included by default by nagios.cfg. The actual plugin binaries are located in

/usr/lib/nagios/plugins


Check scripts

The plugins provided by the Debian package nagios-plugins may not be sufficient in all cases. Sometimes you may want to create your own tool to check something in a special way. Such a tool can be any kind of script or binary, as long as it satisfies the following interface:

  • exit status 0 = everything ok
  • exit status 1 = warning
  • exit status 2 = critical
  • exit status 3 = unknown
  • output on stdout is used by nagios as status information


An example check script that I wrote:

pelargir:~# cat /usr/local/lib/nagios_check_boinc.sh 
#!/bin/bash
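MYNAME="$(basename "$0")"
# The brackets keep grep from matching its own command line
PROCESS_NAME="[b]oinc"
# Quote the variables so that the shell does not glob-expand the pattern
PROCESS_INFORMATION="$(ps aux | grep -- "$PROCESS_NAME" | grep -v -- "$MYNAME")"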
if test -z "$PROCESS_INFORMATION"; then
   echo "boinc: not running"
   exit 2
else
   echo "boinc: $PROCESS_INFORMATION"
   exit 0
fi
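
The script above only distinguishes between "ok" and "critical". A check that also reports "warning" could be structured as in the following sketch. The function name check_threshold and its parameters are hypothetical; the exit codes follow the interface listed above, plus 3 for "unknown".

```shell
#!/bin/bash
# Hypothetical check: compare an integer value against warning/critical
# thresholds and report the result using the Nagios plugin exit codes.
# Usage: check_threshold <label> <value> <warn> <crit>
check_threshold() {
    local label="$1" value="$2" warn="$3" crit="$4"
    # Anything that is not a non-negative integer cannot be evaluated
    case "$value" in
        ''|*[!0-9]*)
            echo "$label UNKNOWN: cannot evaluate value '$value'"
            return 3
            ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "$label CRITICAL: $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "$label WARNING: $value (threshold $warn)"
        return 1
    fi
    echo "$label OK: $value"
    return 0
}
```

In a real check script the function would be followed by something like: check_threshold users "$(who | wc -l)" 20 50; exit $?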


/etc/nagios3/nagios.cfg

The default settings in this file are largely OK. I only changed the following values:

  • check_external_commands = 1 (see the section about "external commands" further up in this document)
  • date_format = euro

Note: /usr/share/doc/nagios3-common/README.Debian suggests not to edit /etc/nagios3/nagios.cfg, but instead to place changes into /etc/nagios3/conf.d/nagios.cfg. This does not work, though.


Hosts

The generic host that can be used as the base for host definitions is located in

/etc/nagios3/conf.d/generic-host_nagios2.cfg

Its main characteristics are

  • allow notifications if a problem is detected with the host
  • notify 24x7
  • contact the group "admins"
  • use the "check-host-alive" command (defined in /etc/nagios-plugins/config/ping.cfg)


Predefined hosts are:

  • localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
  • default gateway, in /etc/nagios3/conf.d/host-gateway_nagios3.cfg; the IP address of the default gateway is configured automatically, probably by DebConf


I added the following custom hosts to /etc/nagios3/conf.d/pelargir-hosts.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-hosts.cfg 
# Any host on the internet that can be used to check whether the internet is reachable
define host {
        host_name   inethost
        alias       Internet host
        address     www.google.ch
        use         generic-host
        # let nagios distinguish between "host unreachable" and "host down"
        parents     gateway
        }

# Host pelargir
define host {
        host_name   pelargir
        alias       pelargir
        address     192.168.0.2
        use         generic-host
        }

# Host www.herzbube.ch
define host {
        host_name   www.herzbube.ch
        alias       www.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }

# Host www.moser-naef.ch
define host {
        host_name   www.moser-naef.ch
        alias       www.moser-naef.ch
        address     212.101.18.224
        use         generic-host
        }

# Host smtp.herzbube.ch
define host {
        host_name   smtp.herzbube.ch
        alias       smtp.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }

# Host mail.herzbube.ch
define host {
        host_name   mail.herzbube.ch
        alias       mail.herzbube.ch
        address     212.101.18.224
        use         generic-host
        }
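
The last four host definitions above differ only in name and address. As a sketch of how such repetitive definitions could be generated instead of typed, the following hypothetical helper emit_host prints one definition in the same layout; it assumes that the alias is identical to the host name and that the host inherits from generic-host.

```shell
#!/bin/bash
# Hypothetical helper: print a Nagios host definition that uses the
# host name as alias and inherits from the "generic-host" template.
# Usage: emit_host <host name> <address>
emit_host() {
    local name="$1" address="$2"
    cat <<EOF
# Host $name
define host {
        host_name   $name
        alias       $name
        address     $address
        use         generic-host
        }

EOF
}
```

For example, "emit_host smtp.herzbube.ch 212.101.18.224" prints the definition of smtp.herzbube.ch shown above. The output of several calls could be redirected into a file in /etc/nagios3/conf.d.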


Host groups

Predefined host groups in /etc/nagios3/conf.d/hostgroups_nagios2.cfg are:

  • all servers (members = *)
  • debian servers (members = localhost)
  • HTTP servers (members = localhost)
  • SSH servers (members = localhost)
  • ping servers (members = gateway)


I added the following custom host groups to /etc/nagios3/conf.d/pelargir-hostgroups.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-hostgroups.cfg 
# List of hosts located on the Internet
define hostgroup {
        hostgroup_name  internet-hosts
        alias           Internet hosts
        members         inethost
        }

# List of hosts located on the Intranet
define hostgroup {
        hostgroup_name  intranet-hosts
        alias           Intranet hosts
        members         pelargir
        }

# List of SMTP servers
define hostgroup {
        hostgroup_name  smtp-servers
        alias           SMTP servers
        members         localhost, smtp.herzbube.ch
        }

# List of FTP servers
define hostgroup {
        hostgroup_name  ftp-servers
        alias           FTP servers
        members         localhost
        }

# List of IMAP servers
define hostgroup {
        hostgroup_name  imap-servers
        alias           IMAP servers
        members         localhost, mail.herzbube.ch
        }

# List of DHCP servers
define hostgroup {
        hostgroup_name  dhcp-servers
        alias           DHCP servers
        members         localhost, gateway
        }

# List of LDAP servers
define hostgroup {
        hostgroup_name  ldap-servers
        alias           LDAP servers
        members         localhost
        }

# List of MySQL servers
define hostgroup {
        hostgroup_name  mysql-servers
        alias           MySQL servers
        members         localhost
        }

# List of PostgreSQL servers
define hostgroup {
        hostgroup_name  postgresql-servers
        alias           PostgreSQL servers
        members         localhost
        }

# List of Samba servers
define hostgroup {
        hostgroup_name  samba-servers
        alias           Samba servers
        members         localhost
        }

# List of Boinc servers
define hostgroup {
        hostgroup_name  boinc-servers
        alias           BOINC servers
        members         localhost
        }


Contacts & contact groups

Predefined contacts and contact groups in /etc/nagios3/conf.d/contacts_nagios2.cfg are:

  • a single contact named "root" whose email address is root@localhost
  • a single contact group named "admins" (which has the contact "root" as its single member)


This predefined contact and contact group are sufficient for my purposes, so I do not define any custom contacts or groups.


Commands

I added the following custom commands to /etc/nagios3/conf.d/pelargir-commands.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-commands.cfg 
define command {
        command_name    check_setiathome
        command_line    /usr/local/lib/nagios_check_setiathome.sh
        }

define command {
        command_name    check_boinc
        command_line    /usr/local/lib/nagios_check_boinc.sh
        }

# connect to a host on a given port and check that the SSL certificate provided is valid
# for a given minimum number of days; if the check fails this will result in a WARNING only
# (not CRITICAL)
define command {
        command_name    check_cert
        command_line    /usr/lib/nagios/plugins/check_http -H '$HOSTNAME$' -p '$ARG1$' -C '$ARG2$'
        }
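
A service refers to such a command as "check_cert!443!28": the parts after the command name, separated by "!", become $ARG1$, $ARG2$ and so on, and $HOSTNAME$ is replaced by the name of the host being checked. The following sketch mimics this substitution so that the resulting command line can be inspected. It is a simplified illustration with a hypothetical function name, not how nagios implements macro expansion.

```shell
#!/bin/bash
# Simplified illustration of macro substitution; NOT the nagios implementation.
# Usage: expand_check_command <command_line template> <host name> <check_command>
expand_check_command() {
    local line="$1" hostname="$2" check_command="$3"
    local -a parts
    local pattern i
    # "check_cert!443!28" -> parts[0]=check_cert, parts[1]=443, parts[2]=28
    IFS='!' read -r -a parts <<< "$check_command"
    pattern='$HOSTNAME$'
    line=${line//"$pattern"/"$hostname"}
    for ((i = 1; i < ${#parts[@]}; i++)); do
        # Builds the literal macro name, e.g. $ARG1$
        pattern="\$ARG${i}\$"
        line=${line//"$pattern"/"${parts[i]}"}
    done
    printf '%s\n' "$line"
}
```

Applied to the command_line of check_cert, the host www.herzbube.ch and "check_cert!443!28", the function prints a check_http call with -H 'www.herzbube.ch' -p '443' -C '28'. Running that command line manually as the nagios user is a quick way to debug a check.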


Services

The generic service that can be used as the base for service definitions is located in

/etc/nagios3/conf.d/generic-service_nagios2.cfg

Its main characteristics are

  • allow notifications if a problem is detected with the service
  • notify only when the problem occurs, or goes away (notification_interval = 0; if this were set to a non-zero value, notifications would be sent periodically)
  • notify 24x7
  • notify the contact group "admins"
  • allow checking 24x7
  • check every 5 minutes


Predefined services are:

  • for localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
    • disk space (warning/critical if free space on any file system falls below 20%/10%)
    • number of currently logged in users (warning/critical if more than 20/50 users)
    • number of processes (warning/critical if more than 250/400 processes)
    • machine load (warning/critical if more than (5.0, 4.0, 3.0) / (10.0, 6.0, 4.0))
  • for specific services, in /etc/nagios3/conf.d/services_nagios2.cfg
    • HTTP servers
    • SSH servers
    • ping'able servers


I added the following custom services to /etc/nagios3/conf.d/pelargir-services.cfg:

pelargir:/etc/nagios3/conf.d# cat pelargir-services.cfg 
# check that an Internet connection is available
define service {
        hostgroup_name                  internet-hosts
        service_description             Internet connection
        check_command                   check_ping!100.0,20%!500.0,60%
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that SMTP services are running
define service {
        hostgroup_name                  smtp-servers
        service_description             SMTP
        check_command                   check_smtp
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that IMAP services are running
define service {
        hostgroup_name                  imap-servers
        service_description             IMAP
        check_command                   check_imap
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that FTP services are running
define service {
        hostgroup_name                  ftp-servers
        service_description             FTP
        check_command                   check_ftp
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that DHCP services are running
#
# Note: check_dhcp must be run with root privileges. As suggested by README.Debian,
# this can be achieved like this:
#   dpkg-statoverride --update --add root nagios 4750 /usr/lib/nagios/plugins/check_dhcp
#
# Note: DHCP checking is currently disabled because it doesn't work out of the box,
# a quick Google session did not provide a solution, and I don't have the time to
# investigate any further
#
#define service {
#        hostgroup_name                  dhcp-servers
#        service_description             DHCP
#        check_command                   check_dhcp
#        use                             generic-service
#        notification_interval           0 ; set > 0 if you want to be renotified
#}

# check that LDAP services are running
define service {
        hostgroup_name                  ldap-servers
        service_description             LDAP
        # cannot use check_ldap because our LDAP service does not allow anonymous binding
        # and I don't want to expose passwords here
        check_command                   check_tcp!389
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that MySQL services are running
define service {
        hostgroup_name                  mysql-servers
        service_description             MySQL
        check_command                   check_mysql
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that PostgreSQL services are running
define service {
        hostgroup_name                  postgresql-servers
        service_description             PostgreSQL
        check_command                   check_pgsql
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that Samba services are running
define service {
        hostgroup_name                  samba-servers
        service_description             Samba
        check_command                   check_tcp!445
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check that BOINC services are running
define service {
        hostgroup_name                  boinc-servers
        service_description             BOINC
        check_command                   check_boinc
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}

# check validity of the SSL certificate provided on port 443
define service {
        host_name                       www.herzbube.ch
        service_description             SSL certificate validity on port 443 www.herzbube.ch
        check_command                   check_cert!443!28
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}
define service {
        host_name                       www.moser-naef.ch
        service_description             SSL certificate validity on port 443 www.moser-naef.ch
        check_command                   check_cert!443!28
        use                             generic-service
        notification_interval           0 ; set > 0 if you want to be renotified
}


Service configuration

Overview

Some services need to be configured in a special way so that they can be monitored by Nagios. A typical example is that the service needs to allow the nagios system user minimal access to its resources.


PostgreSQL

Add a new role:

createuser --no-superuser --no-createdb --no-createrole nagios

Add the following line to /etc/postgresql/8.4/main/pg_hba.conf. This allows the nagios role to connect to the template1 database without a password. Note: Don't forget to restart the PostgreSQL daemon after adding the line.

host    template1   nagios      127.0.0.1/32          trust

From now on the Nagios command check_pgsql defined in

/etc/nagios-plugins/config/pgsql.cfg

(which invokes /usr/lib/nagios/plugins/check_pgsql) can be used to monitor basic presence of the PostgreSQL service.


Discussion of the configuration line in pg_hba.conf:

  • Because we have used "trust" as the authentication method, any user who can connect from 127.0.0.1 (i.e. any local user) is allowed to connect as "nagios" without a password!
  • Normally we would like to use the "ident" authentication method because this restricts connection attempts to specific system users
  • This works very nicely as long as the connection attempt is made via Unix-domain sockets, i.e. if the line in pg_hba.conf uses the "local" keyword
  • However, as soon as the connection attempt is made via TCP/IP (i.e. the line in pg_hba.conf uses the "host" keyword), PostgreSQL will try to verify the identity of the connecting user by contacting an ident daemon on the system where the connection attempt came from
  • Because the Nagios command check_pgsql does connect via TCP/IP, I would be forced to run an ident daemon
  • Having the choice between opening access to the template1 database and installing an entire new service on my server, I clearly choose the former option!
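
The pg_hba.conf entry discussed above could be added in a repeatable way with a small helper such as the following sketch. The function name ensure_nagios_hba_rule is hypothetical; the file path is a parameter so that the function can be tried on a copy of the file first.

```shell
#!/bin/bash
# Hypothetical helper: append the trust rule for the nagios role to a
# pg_hba.conf file unless an equivalent line is already present.
# Usage: ensure_nagios_hba_rule <path to pg_hba.conf>
ensure_nagios_hba_rule() {
    local hba_file="$1"
    local rule='host    template1   nagios      127.0.0.1/32          trust'
    # Match the five fields regardless of the amount of whitespace between them
    local regex='^host[[:space:]]+template1[[:space:]]+nagios[[:space:]]+127\.0\.0\.1/32[[:space:]]+trust[[:space:]]*$'
    if grep -Eq "$regex" "$hba_file"; then
        echo "rule already present in $hba_file"
    else
        printf '%s\n' "$rule" >> "$hba_file"
        echo "rule added to $hba_file"
    fi
}
```

Note that PostgreSQL uses the first line in pg_hba.conf that matches a connection attempt, so appending at the end of the file only has the desired effect if no earlier line already matches connections from 127.0.0.1 to template1. In that case the line has to be inserted manually further up.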