Nagios
Debian packages
The following Debian packages need to be installed:
nagios3 nagios3-doc
References
- http://nagios.sourceforge.net/docs/3_0/toc.html
- /usr/share/doc/nagios3-doc (http://pelargir.herzbube.ch/doc/nagios3-doc)
Glossary
- Object
- The nagios configuration is made up of objects of various types (e.g. hosts, host groups, contacts, services, etc.)
- Service
- A service is something that should be monitored by nagios
- Host
- A host is a machine that provides services that nagios should monitor
- Host group
- An aggregation of multiple hosts (e.g. "pingable servers", "SSH servers", etc.); mainly used for larger networks
- Contact
- nagios notifies contacts, using various means (e.g. e-mail, SMS, pager), when it detects a problem
- Contact group
- An aggregation of multiple contacts (e.g. "admins"); mainly used for larger organisations
- Time period
- Time periods tell nagios when it is allowed to check hosts and services, and when it may notify contacts (e.g. 24x7, work hours, etc.)
- Command
- A command defines how to perform a certain check on a target host or service, or how to notify a target contact
DebConf configuration
Questions + answers:
- apache servers to configure for nagios3 = apache2
- enable support for nagios 1.x links = no
- web admin password = <secret>
Notes:
- User name (nagiosadmin) and password of the web admin are placed into
/etc/nagios3/htpasswd.users
- The web admin user name is also referenced in
/etc/nagios3/cgi.cfg
- See chapter "LDAP" further down for information on how to use LDAP as the user/password database.
- dpkg automatically generates a symlink /etc/apache2/conf.d/nagios3.conf that points to /etc/nagios3/apache.conf
Apache configuration
Nagios requires that PHP is turned on in its webroot, something like this:
<Directory /usr/share/nagios3/htdocs/>
  <IfModule mod_php5.c>
    php_admin_flag engine on
  </IfModule>
</Directory>
Web administration
Overview
During installation of the Debian package, the Apache web server was already configured for web access, using basic HTTP authentication. The following sections provide details about
- more configuration aspects of the web interface
- how to switch from basic HTTP authentication to LDAP authentication
- how to enable full featured control of the nagios daemon from the web interface through the so-called "external commands" feature
/etc/nagios3/cgi.cfg
This file configures all aspects of the web interface of nagios. For instance
- whether or not authentication is required (default = yes)
- what users are allowed to do
- refresh rate
- which sound files should be played (if any) in the web browser if problems are detected
- etc.
By default, the web admin user nagiosadmin created by DebConf is referenced and has all possible permissions. This and all other default settings are OK for my purposes, so I didn't have to change anything in the beginning. Later on, when LDAP authentication is enabled, it might be necessary to change the name of the admin user.
LDAP
By default, web interface authentication is file-based (mod_authn_file), i.e. users/passwords are stored in a file:
/etc/nagios3/htpasswd.users
If users/passwords should be taken from LDAP instead, the following changes need to be made after the DebConf configuration has finished:
- edit /etc/apache2/conf.d/nagios3.conf
- replace this block
AuthName "Nagios Access"
AuthType Basic
AuthUserFile /etc/nagios3/htpasswd.users
require valid-user
- by the new block
AuthName "Nagios Access"
AuthType Basic
# LDAP connection information is inherited
AuthBasicProvider ldap
Require ldap-user admin
- remove /etc/nagios3/htpasswd.users because it is no longer needed
- make sure that the user(s) used for authentication are properly referenced in /etc/nagios3/cgi.cfg (in my case I changed "nagiosadmin" to the generic "admin" account)
Enabling external commands
"External commands" is a scheme to control the nagios daemon's operation without shutting it down. An external client can write control instructions ("external commands") to a pipe, which the daemon periodically checks to see if it must perform some actions. The approach is commonly used to send commands from the web interface.
What basically needs to be done is:
- enable the "external commands" feature (in /etc/nagios3/nagios.cfg)
- give the web server write access to the command pipe (/var/lib/nagios3/rw/nagios.cmd)
- make sure that the permission change is permanent, i.e. make sure that the change is not reverted by the next update of the Debian package
/usr/share/doc/nagios3-common/README.Debian explains how best to do this:
- set "check_external_commands = 1" in /etc/nagios3/nagios.cfg
- check whether /var/lib/nagios3 and /var/lib/nagios3/rw have the correct permissions
root@pelargir:~# l /var/lib/nagios3/
total 28
drwxr-x--x  4 nagios nagios    4096 Jun 29 20:52 .
drwxr-xr-x 54 root   root      4096 Jun 29 20:47 ..
-rw-r--r--  1 nagios nagios   11099 Jun 29 20:52 retention.dat
drwx--s---  2 nagios www-data  4096 Jun  4 20:25 rw
drwxr-x---  3 nagios nagios    4096 Jun 29 20:47 spool
- if the two directories mentioned do not have the listed permissions, run these commands
/etc/init.d/nagios3 stop
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
/etc/init.d/nagios3 start
- the permission changes are made with dpkg-statoverride so that they are permanent and survive package updates
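Once the pipe is writable, a client sends an external command by writing a single line of the form "[timestamp] COMMAND;arguments" to it. The following is only a minimal sketch of that idea: the two function names are hypothetical helpers of mine, and the pipe path is the one mentioned above. SCHEDULE_FORCED_SVC_CHECK (host, service description, check time) is one of the standard external commands.

```shell
# Path of the command pipe as used on this page; override via the environment if needed.
CMD_PIPE="${CMD_PIPE:-/var/lib/nagios3/rw/nagios.cmd}"

# format_external_command TIMESTAMP COMMAND [ARG...]
# Prints one external command line: "[TIMESTAMP] COMMAND;ARG1;ARG2;..."
format_external_command() {
    now="$1"
    name="$2"
    shift 2
    line="[$now] $name"
    # Append each remaining argument, separated by semicolons
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line"
}

# submit_external_command COMMAND [ARG...]
# Writes the formatted line to the pipe, stamped with the current time.
submit_external_command() {
    if [ ! -p "$CMD_PIPE" ]; then
        echo "command pipe $CMD_PIPE not found (is nagios running?)" >&2
        return 1
    fi
    format_external_command "$(date +%s)" "$@" > "$CMD_PIPE"
}
```

For instance, something like submit_external_command SCHEDULE_FORCED_SVC_CHECK localhost SSH "$(date +%s)" should ask the daemon to re-check the SSH service on localhost right away, provided the calling user has write access to the pipe.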
Monitoring configuration
Overview
The main configuration file used by the nagios daemon is
/etc/nagios3/nagios.cfg
nagios.cfg includes a couple of other files, but most importantly it defines a directory where you can drop files to extend the nagios configuration:
/etc/nagios3/conf.d
Note: Files in this directory can have any name, as long as the name ends in .cfg (nagios only processes files with that extension).
To test whether the current configuration is correct:
nagios3 -v /etc/nagios3/nagios.cfg
Nagios plugins
The Debian package nagios-plugins provides a lot of useful commands that can be used to check hosts and services. The command definitions live in
/etc/nagios-plugins/config
which is included by default by nagios.cfg. The actual plugin binaries are located in
/usr/lib/nagios/plugins
Check scripts
The plugins provided by the Debian package nagios-plugins may not be sufficient in all cases. Sometimes you may want to create your own tool to check something in a special way. Such a tool can be any kind of script or binary, as long as it satisfies the following interface:
- exit status 0 = everything ok
- exit status 1 = warning
- exit status 2 = critical
- exit status 3 = unknown
- output on stdout is used by nagios as status information
An example check script that I wrote:
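pelargir:~# cat /usr/local/lib/nagios_check_boinc.sh
#!/bin/bash

MYNAME=$(basename "$0")
# The brackets in "[b]oinc" keep grep from matching its own command line
PROCESS_NAME="[b]oinc"
PROCESS_INFORMATION="$(ps aux | grep "$PROCESS_NAME" | grep -v "$MYNAME")"
if test -z "$PROCESS_INFORMATION"; then
  # No matching process found: report CRITICAL
  echo "boinc: not running"
  exit 2
else
  # Process found: report OK, with the ps line as status information
  echo "boinc: $PROCESS_INFORMATION"
  exit 0
fi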
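The script above only knows "ok" and "critical". As a rough sketch of how the full interface could be used, the following hypothetical helper (the name evaluate_threshold and its argument order are my own invention) maps a numeric value to a status line plus exit status, including exit status 3, which the Nagios plugin convention reserves for "unknown". It assumes that higher values are worse and that the thresholds are whole numbers.

```shell
# evaluate_threshold LABEL VALUE WARN CRIT
# Prints one status line on stdout and returns the matching check status:
#   0 = ok, 1 = warning, 2 = critical, 3 = unknown
evaluate_threshold() {
    label="$1"
    value="$2"
    warn="$3"
    crit="$4"

    # Anything that is not a non-negative integer cannot be judged
    case "$value" in
        ''|*[!0-9]*)
            echo "$label UNKNOWN: '$value' is not a non-negative integer"
            return 3
            ;;
    esac

    if [ "$value" -ge "$crit" ]; then
        echo "$label CRITICAL: $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "$label WARNING: $value"
        return 1
    fi

    echo "$label OK: $value"
    return 0
}
```

A check script would call the helper as its last step and exit with the helper's return value, e.g. a call such as evaluate_threshold users "$(who | wc -l)" 20 50 followed by exit $?.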
/etc/nagios3/nagios.cfg
The default settings in this file are largely OK. I only changed the following values:
- check_external_commands = 1 (see the section about "external commands" further up in this document)
- date_format = euro
Note: /usr/share/doc/nagios3-common/README.Debian suggests not to edit /etc/nagios3/nagios.cfg, but instead to place changes into /etc/nagios3/conf.d/nagios.cfg. This does not work, though.
Hosts
The generic host that can be used as the base for host definitions is located in
/etc/nagios3/conf.d/generic-host_nagios2.cfg
Its main characteristics are
- allow notifications if a problem is detected with the host
- notify 24x7
- contact the group "admins"
- use the "check-host-alive" command (defined in /etc/nagios-plugins/config/ping.cfg)
Predefined hosts are:
- localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
- default gateway, in /etc/nagios3/conf.d/host-gateway_nagios3.cfg; the IP address of the default gateway is configured automatically, probably by DebConf
I added the following custom hosts to /etc/nagios3/conf.d/pelargir-hosts.cfg:
pelargir:/etc/nagios3/conf.d# cat pelargir-hosts.cfg
# Any host on the internet that can be used to check whether the internet is reachable
define host {
  host_name inethost
  alias Internet host
  address www.google.ch
  use generic-host
  # let nagios distinguish between "host unreachable" and "host down"
  parents gateway
  # Google uses IPv6 since June 6 2012 (World IPv6 Launch), apparently
  # we now need to explicitly check things using IPv4
  check_command check-host-alive_4
}

# Host pelargir, network interface facing the Internet
define host {
  host_name pelargir-inet
  alias pelargir-inet
  address 192.168.178.20
  use generic-host
}

# Host pelargir, network interface facing the intranet LAN network
define host {
  host_name pelargir-lan
  alias pelargir-lan
  address 192.168.1.11
  use generic-host
}

# Host pelargir, network interface facing the intranet WiFi network
define host {
  host_name pelargir-wifi
  alias pelargir-wifi
  address 192.168.2.6
  use generic-host
}

# Host landroval
define host {
  host_name landroval
  alias landroval
  address 192.168.2.2
  use generic-host
}

# Host www.herzbube.ch
define host {
  host_name www.herzbube.ch
  alias www.herzbube.ch
  address 212.101.18.224
  use generic-host
}

# Host www.moser-naef.ch
define host {
  host_name www.moser-naef.ch
  alias www.moser-naef.ch
  address 212.101.18.224
  use generic-host
}

# Host smtp.herzbube.ch
define host {
  host_name smtp.herzbube.ch
  alias smtp.herzbube.ch
  address 212.101.18.224
  use generic-host
}

# Host mail.herzbube.ch
define host {
  host_name mail.herzbube.ch
  alias mail.herzbube.ch
  address 212.101.18.224
  use generic-host
}
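Most of the host definitions above differ only in name and address. If the list grows, the repetitive blocks could be generated instead of typed. The following is just a sketch: print_host_definition is a hypothetical helper of mine, not something that ships with nagios, and it only covers the four directives used by the simple hosts above (host_name, alias, address, use).

```shell
# print_host_definition HOST_NAME ADDRESS [ALIAS]
# Prints a host definition based on the generic-host template.
# If no alias is given, the host name doubles as the alias.
print_host_definition() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_host_definition HOST_NAME ADDRESS [ALIAS]" >&2
        return 1
    fi
    name="$1"
    address="$2"
    alias_name="${3:-$1}"

    printf 'define host {\n'
    printf '  host_name  %s\n' "$name"
    printf '  alias      %s\n' "$alias_name"
    printf '  address    %s\n' "$address"
    printf '  use        generic-host\n'
    printf '}\n'
}
```

The output of a call such as print_host_definition mail.herzbube.ch 212.101.18.224 could then be appended to a .cfg file in /etc/nagios3/conf.d and checked with nagios3 -v /etc/nagios3/nagios.cfg before the daemon is restarted.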
Host groups
Predefined host groups in /etc/nagios3/conf.d/hostgroups_nagios2.cfg are:
- all servers (members = *)
- debian servers (members = localhost)
- HTTP servers (members = localhost)
- SSH servers (members = localhost)
- ping servers (members = gateway)
I added the following custom host groups to /etc/nagios3/conf.d/pelargir-hostgroups.cfg:
pelargir:/etc/nagios3/conf.d# cat pelargir-hostgroups.cfg
# List of hosts located on the Internet
define hostgroup {
  hostgroup_name internet-hosts
  alias Internet hosts
  members inethost
}

# List of hosts located on the Intranet
define hostgroup {
  hostgroup_name intranet-hosts
  alias Intranet hosts
  members pelargir-inet, pelargir-lan, pelargir-wifi, gateway, landroval
}

# List of SMTP servers
define hostgroup {
  hostgroup_name smtp-servers
  alias SMTP servers
  members localhost, smtp.herzbube.ch
}

# List of IMAP servers
define hostgroup {
  hostgroup_name imap-servers
  alias IMAP servers
  members localhost, mail.herzbube.ch
}

# List of DHCP servers
define hostgroup {
  hostgroup_name dhcp-servers
  alias DHCP servers
  members gateway, pelargir-lan, pelargir-wifi
}

# List of LDAP servers
define hostgroup {
  hostgroup_name ldap-servers
  alias LDAP servers
  members localhost
}

# List of MySQL servers
define hostgroup {
  hostgroup_name mysql-servers
  alias MySQL servers
  members localhost
}

# List of PostgreSQL servers
define hostgroup {
  hostgroup_name postgresql-servers
  alias PostgreSQL servers
  members localhost
}

# List of Samba servers
define hostgroup {
  hostgroup_name samba-servers
  alias Samba servers
  members localhost
}
Contacts & contact groups
Predefined contacts & contact groups in /etc/nagios3/conf.d/contacts_nagios2.cfg are:
- a single contact named "root" whose email address is root@localhost
- a single contact group named "admins" (which has the contact "root" as its single member)
The predefined contact & contact group are sufficient for my purposes, so I do not define any custom contacts or groups.
Commands
I added the following custom commands to /etc/nagios3/conf.d/pelargir-commands.cfg:
pelargir:/etc/nagios3/conf.d# cat pelargir-commands.cfg
# Connect to a host on a given port and check that the SSL certificate provided is valid
# for a given minimum number of days; if the check fails this will result in a WARNING only
# (not CRITICAL)
define command {
  command_name check_cert
  command_line /usr/lib/nagios/plugins/check_http -H '$HOSTNAME$' -p '$ARG1$' -C '$ARG2$'
}

# check_dhcp must be run with root privileges. README.Debian suggests to
# achieve this by setting the setuid bit on check_dhcp, by issuing this
# command:
#   dpkg-statoverride --update --add root nagios 4750 /usr/lib/nagios/plugins/check_dhcp
# However, I prefer sudo to the setuid flag, although this involves an additional bit
# of configuration in the sudoers file.
define command{
  command_name sudo_check_dhcp_interface
  command_line sudo -u root /usr/lib/nagios/plugins/check_dhcp -s '$HOSTADDRESS$' -i '$ARG1$'
}
Services
The generic service that can be used as the base for service definitions is located in
/etc/nagios3/conf.d/generic-service_nagios2.cfg
Its main characteristics are
- allow notifications if a problem is detected with the service
- notify only when the problem occurs, or goes away (notification_interval = 0; if this were set to a non-zero value, notifications would be sent periodically)
- notify 24x7
- notify the contact group "admins"
- allow checking 24x7
- check every 5 minutes
Predefined services are:
- for localhost, in /etc/nagios3/conf.d/localhost_nagios2.cfg
- disk space (warning/critical if free space on any file system falls below 20%/10%)
- number of currently logged in users (warning/critical if more than 20/50 users)
- number of processes (warning/critical if more than 250/400 processes)
- machine load (warning/critical if more than (5.0, 4.0, 3.0) / (10.0, 6.0, 4.0))
- for specific services, in /etc/nagios3/conf.d/services_nagios2.cfg
- HTTP servers
- SSH servers
- ping'able servers
I added the following custom services to /etc/nagios3/conf.d/pelargir-services.cfg:
pelargir:/etc/nagios3/conf.d# cat pelargir-services.cfg
# check that an Internet connection is available
define service {
  hostgroup_name internet-hosts
  service_description Internet connection
  # Google uses IPv6 since June 6 2012 (World IPv6 Launch), apparently
  # we now need to explicitly check things using IPv4
  check_command check_ping_4!100.0,20%!500.0,60%
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that SMTP services are running
define service {
  hostgroup_name smtp-servers
  service_description SMTP
  check_command check_smtp
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that IMAP services are running
define service {
  hostgroup_name imap-servers
  service_description IMAP
  check_command check_imap
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that FTP services are running
define service {
  hostgroup_name ftp-servers
  service_description FTP
  check_command check_ftp
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that DHCP services are running
#
# Note: check_dhcp must be run with root privileges
define service {
  host_name gateway
  service_description FritzBox DHCP
  check_command sudo_check_dhcp_interface!eth1
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
  host_name pelargir-lan
  service_description Intranet LAN DHCP
  check_command sudo_check_dhcp_interface!eth0
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
  host_name pelargir-wifi
  service_description Intranet WiFi DHCP
  check_command sudo_check_dhcp_interface!eth2
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that LDAP services are running
define service {
  hostgroup_name ldap-servers
  service_description LDAP
  # cannot use check_ldap because our LDAP service does not allow anonymous binding
  # and I don't want to expose passwords here
  check_command check_tcp!389
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that MySQL services are running
#
# Note: The "nagios" database user has no password. We specify "no password"
# by adding a trailing "!" to the command. This means that the second argument
# of the command, which is the password, is an empty string.
define service {
  hostgroup_name mysql-servers
  service_description MySQL
  check_command check_mysql_cmdlinecred!nagios!
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that PostgreSQL services are running
define service {
  hostgroup_name postgresql-servers
  service_description PostgreSQL
  check_command check_pgsql
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check that Samba services are running
define service {
  hostgroup_name samba-servers
  service_description Samba
  check_command check_ssh
  check_command check_tcp!445
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}

# check validity of the SSL certificate provided on port 443
define service {
  host_name www.herzbube.ch
  service_description SSL certificate validity on port 443 www.herzbube.ch
  check_command check_cert!443!28
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
  host_name www.moser-naef.ch
  service_description SSL certificate validity on port 443 www.moser-naef.ch
  check_command check_cert!443!28
  use generic-service
  notification_interval 0 ; set > 0 if you want to be renotified
}
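The check_command values above use "!" to separate the command name from its arguments, which then become $ARG1$, $ARG2$, etc. in the command definition. To illustrate how such a value breaks down, here is a small sketch. The function split_check_command is a hypothetical helper that only imitates the splitting at each "!"; it is not how nagios itself is implemented, and it ignores any escaping rules.

```shell
# split_check_command VALUE
# Splits a check_command value at each "!" and prints the command name
# followed by the numbered arguments, one per line. Empty fields are kept,
# so a trailing "!" shows up as an empty last argument.
split_check_command() {
    rest="$1"
    index=0
    while :; do
        case "$rest" in
            *'!'*)
                field="${rest%%'!'*}"   # text before the first "!"
                rest="${rest#*'!'}"     # text after the first "!"
                more=1
                ;;
            *)
                field="$rest"           # last (possibly empty) field
                more=0
                ;;
        esac
        if [ "$index" -eq 0 ]; then
            printf 'command=%s\n' "$field"
        else
            printf 'ARG%d=%s\n' "$index" "$field"
        fi
        [ "$more" -eq 1 ] || break
        index=$((index + 1))
    done
}
```

Running split_check_command 'check_mysql_cmdlinecred!nagios!' prints the command name, ARG1=nagios and an empty ARG2, which corresponds to the "no password" trick used for the MySQL service.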
Service configuration
Overview
Some services need to be configured in a special way so that they can be monitored by Nagios. A typical example is that the service needs to allow the nagios system user minimal access to its resources.
sudo
Some of the check commands that Nagios needs to run can only be run as root. I prefer to use sudo for this purpose (instead of setting the setuid flag on the command binaries). The following snippet is the sudo configuration I use:
# cat /etc/sudoers.d/pelargir-nagios.conf
# Nagios must never be lectured when it runs commands because this
# might interfere with how it evaluates the command's output
Defaults: nagios !lecture

# Nagios never needs to authenticate when it runs commands because this
# is, obviously, beyond the capabilities of a daemon. This could also
# be solved by specifying the NOPASSWD: option for each command.
Defaults: nagios !authenticate

# ALL should be a "non" terminal
# nagios = the line applies only when sudo is invoked by the user "nagios"
# (root) = only allows "sudo -u root"; no groups may be specified
# (sudo -g), and no users other than "root" may be specified
nagios pelargir = (root) /usr/lib/nagios/plugins/check_dhcp
DHCP
My DHCP server currently hands out addresses to any machine regardless of its MAC address, so there is no need to add any Nagios-specific entries to the configuration of my DHCP server.
If the DHCP server were restricted to known MAC addresses, it would need a pseudo entry for Nagios monitoring: an entry with a fake MAC address, which the Nagios check command would then have to supply.
MySQL
Add a new MySQL user "nagios" that has no password and no privileges except the general "USAGE" privilege.
PostgreSQL
Add a new role:
createuser --no-superuser --no-createdb --no-createrole nagios
Add the following line to /etc/postgresql/8.4/main/pg_hba.conf. This allows the nagios role to connect to the template1 database without a password. Note: Don't forget to restart the PostgreSQL daemon after adding the line.
host template1 nagios 127.0.0.1/32 trust
From now on the Nagios command check_pgsql defined in /etc/nagios-plugins/config/pgsql.cfg (which invokes /usr/lib/nagios/plugins/check_pgsql) can be used to monitor basic presence of the PostgreSQL service.
Discussion of the configuration line in pg_hba.conf:
- Because we have used "trust" as the authentication method, any user is allowed to connect as "nagios"!
- Normally we would like to use the "ident" authentication method because this restricts connection attempts to specific system users
- This works very nicely as long as the connection attempt is made via Unix-domain sockets, i.e. if the line in pg_hba.conf uses the "local" keyword
- However, as soon as the connection attempt is made via TCP/IP (i.e. the line in pg_hba.conf uses the "host" keyword), PostgreSQL will try to verify the identity of the connecting user by contacting an ident daemon on the system where the connection attempt came from
- Because the Nagios command check_pgsql does connect via TCP/IP, I would be forced to run an ident daemon
- Having the choice between opening access to the template1 database and installing an entire new service on my server, I clearly choose the former option!