Simple Cluster Monitoring with Munin

Tags: howtos, linux


If you have more than one server to administer, chances are you will benefit from having at least a rudimentary form of monitoring in place. Having access to usage statistics, thermal sensors, and the like will make your life immensely easier in the long run. When Max and I started working on the compute cluster of our lab in earnest, we wanted a lightweight monitoring solution that would be easy to set up while leaving us room to grow. After evaluating existing solutions, I decided to go with Munin, a lightweight server–client monitoring system built around plugins. These notes provide some details about how to set it up and how to receive warnings with little effort.

Why Munin?

Munin comes with pre-packaged binaries for all major Linux distributions, making it a breeze to install. Plugins can be written in shell script or Perl. While my skills in the latter language are quite rusty, I still feel confident enough to whip something up.

Setting Up the Primary Munin Server

Munin requires one server to be the dedicated primary node. This is the node to which all other nodes will report, and it is also the node that will serve the statistics, send out warnings, and so on. Setting up this server requires you to install the main munin package and to configure /etc/munin/munin.conf. An example configuration file could look roughly like this:

# `munin` primary configuration file

# (Exactly one) directory to include all files from.
includedir /etc/munin/munin-conf.d

contact.slack.always_send warning critical
contact.slack.command /usr/local/bin/slack_notify_munin "Munin notification for ${var:host}: ${var:graph_title}: warnings: ${loop<,>:wfields ${var:label} = ${var:value}} / criticals: ${loop<,>:cfields  ${var:label} = ${var:value}}"

[alpha.example.com]
  address 10.0.0.1

[beta.example.com]
  address 10.0.0.2

Note that I left out the default values shipped in the configuration file in order to highlight that there are really not a lot of options you have to adjust. In our example, we will have two servers, alpha and beta, to monitor. The Munin documentation contains many more examples of how to build complex hierarchies, but this is not something you have to decide on right now.
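
To give a rough idea of what such a hierarchy looks like: Munin groups hosts by prefixing the host name with a group name and a semicolon. The following is only a sketch; the group name lab is made up for illustration, and the addresses are the ones from the example above.

```
# Hypothetical group "lab" containing both example servers.
[lab;alpha.example.com]
  address 10.0.0.1

[lab;beta.example.com]
  address 10.0.0.2
```

Hosts that share a group name should then show up together in the generated overview.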

Sending Warnings

The other interesting part of the configuration is the slack_notify_munin script, which makes it possible to send out warnings to a dedicated Slack channel; we will discuss how to define limits for such warnings further below. Munin makes setting up additional warning recipients quite simple; you merely have to provide a name for the recipient as well as a command that sends out the notification itself. In our case, the alert consists of a single line of text, with all warnings and errors expanded by Munin. See the official documentation for more information about this syntax; I was previously not familiar with it, but it seems to be well known within the Perl community.

An alert will look like this:

Munin notification for alpha.example.com: Memory usage: warnings: swap = 12345.00 / criticals:

If more than one item/field raises a warning, the loop will contain more values. Feel free to adjust the wording of the alert; my personal preference is to keep alerts as terse as possible so that they can be taken in at a glance. Here is the slack_notify_munin script in all its glory:

#!/usr/bin/env bash
#
# Uses Munin's notification mechanism to send a notification about the
# state of the cluster. A notification will be sent for each plugin for
# which an error or warning is detected.
#
# Parameters:
#   $1: message
#
# The message in $1 is directly passed to a Slack webhook, causing it to
# appear in the configured channel.

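if [ $# -ne 1 ]; then
  echo "usage: $0 <message>" >&2
  exit 1
fi

# Quoted so that the placeholder (and the final URL) is passed as one word.
HOOK="https://hooks.slack.com/workflows/[YOUR OWN ID HERE]"
# The message is inserted verbatim; it must not contain double quotes.
BODY="{\"message\": \"$1\"}"

curl --header "Content-Type: application/json" \
     --request POST                            \
     --data "$BODY"                            \
     "$HOOK"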

It requires you to set up a simple Slack Workflow with a ‘Webhook’ entry. See the Slack documentation on workflows for more details. You can also use any other type of script here or even an e-mail notification; it just needs to be executable by the munin user.
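
One caveat with the script above: the message is pasted into the JSON body verbatim, so a double quote or backslash in a plugin label would produce invalid JSON. The following is a minimal sketch of how the body could be built more defensively. The helper names json_escape and build_body are hypothetical and not part of Munin or Slack, and only backslashes and double quotes are handled, since the alert is a single line of text.

```shell
# Hypothetical helper: escape backslashes and double quotes so that the
# message can be embedded in a JSON string. Control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: wrap the escaped message in the JSON object that the
# webhook from the script above expects (a single "message" field).
build_body() {
  printf '{"message": "%s"}\n' "$(json_escape "$1")"
}

build_body 'Memory usage: warnings: swap = "12345.00"'
# prints {"message": "Memory usage: warnings: swap = \"12345.00\""}
```

In the notification script, BODY=$(build_body "$1") could then replace the hand-written assignment.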

Generating HTML

The simplest way to make Munin monitoring information available is to serve the HTML files generated by Munin. This results in a static monitoring display that updates itself every 5 minutes. More complicated setups are certainly possible, but this is perfectly sufficient for our needs at present.

If you use nginx as your web server, serving the HTML files boils down to a few lines in /etc/nginx/sites-available/default:

location /munin {
    alias /var/cache/munin/www/;
    expires modified +310s;
}

Presto—your monitoring results are now being served under /munin. Notice that there are no access controls here: again, this is sufficient for our server, which is only accessible from an internal network, but you might want to add additional authentication here. There are also ways to enable more interactions with the plots, but they require fiddling with the CGI setting and I do not want to add any more executables than I absolutely have to.
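
If the server is reachable from outside a trusted network, the location block can be tightened with standard nginx directives. The following is a sketch only: the 10.0.0.0/24 range and the password file path /etc/nginx/munin.htpasswd are assumptions that need to be adapted, and the password file has to be created separately (for example with htpasswd).

```
location /munin {
    alias /var/cache/munin/www/;
    expires modified +310s;

    # Assumed internal network; all other clients are rejected.
    allow 10.0.0.0/24;
    deny all;

    # HTTP basic authentication on top (hypothetical password file).
    auth_basic "Munin";
    auth_basic_user_file /etc/nginx/munin.htpasswd;
}
```

With nginx's default behaviour, a client has to pass both the address check and the password prompt; either of the two parts can also be used on its own.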

Setting Up Munin Nodes

Additional nodes such as our servers alpha and beta can be configured by installing the munin-node package. This will already set up a bunch of plugins, some of which you may want to disable later on. To grant access to the primary node, assuming its IP address is 10.0.0.128, you merely need to add a single line to /etc/munin/munin-node.conf. The allow directive expects a regular expression, which is why the address is anchored and its dots are escaped:

allow ^10\.0\.0\.128$

After restarting the munin-node service so that it picks up the change, the node will be queried every five minutes and report its results to the primary node.
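
Before waiting for the first polling cycle, it can be worth checking by hand that a node answers. munin-node speaks a simple line-based text protocol, by default on TCP port 4949, and understands commands such as list and fetch. The sketch below assumes that nc (netcat) is installed on the primary node; the function names are hypothetical.

```shell
# Build a request for munin-node: one command, followed by "quit" so that
# the node closes the connection afterwards.
munin_request() {
  printf '%s\nquit\n' "$1"
}

# Send a single command to a node on the default port 4949.
# Assumes netcat is available; -w 5 sets a 5 second timeout.
query_node() {
  munin_request "$2" | nc -w 5 "$1" 4949
}
```

Run from the primary node, query_node 10.0.0.1 list should print the plugins enabled on alpha, and query_node 10.0.0.1 "fetch memory" the current values of a single plugin. If the connection is closed right away, the allow line on the node is the first thing I would check.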

Enabling and Disabling Plugins

The default Munin configuration can be quite chatty. To disable plugins that you are not interested in, merely unlink them from /etc/munin/plugins. Here is a list of plugins that I do not consider to be super important for most configurations:

  • df_inode
  • entropy
  • forks
  • fw_packets
  • interrupts
  • irqstats
  • open_inodes
  • netstat
  • postfix_mailqueue
  • postfix_mailvolume

Your mileage may vary, of course, but I would always err on the side of caution: it is easy to drown in a sea of data. Make sure that you always have some actual insights from these data.
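
Unlinking the plugins one by one gets tedious with more than a handful of nodes, so a small helper can do it in one go. This is a sketch under the assumption that the plugins are symlinks in the plugin directory, as they are after a package installation; the function name disable_plugins is hypothetical.

```shell
# disable_plugins DIR PLUGIN...
# Removes the named plugin symlinks from DIR. Entries that are missing or
# that are not symlinks are left untouched.
disable_plugins() {
  plugin_dir=$1
  shift
  for plugin in "$@"; do
    if [ -L "$plugin_dir/$plugin" ]; then
      rm -- "$plugin_dir/$plugin"
    fi
  done
}
```

As root, disable_plugins /etc/munin/plugins df_inode entropy forks would remove the first three entries of the list above. The munin-node service needs to be restarted afterwards for the change to take effect.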

Setting Limits

Limits for individual plugins can be added to /etc/munin/plugin-conf.d/limits.conf on each secondary node. The format depends on the individual plugin that you want to configure, so be sure to check its documentation. Here is an example that sets percentage-based limits for the memory of a node:

[memory]
env.swap_warning 75%
env.swap_critical 90%
env.apps_warning 75%
env.apps_critical 90%
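
To get a feeling for what such a percentage means in absolute terms, you can do the arithmetic by hand. The sketch below uses a made-up node with 8 GiB of swap (8388608 kB); the helper percent_of is hypothetical, and the exact base that the plugin uses for each field is best checked in its documentation.

```shell
# percent_of TOTAL PERCENT
# Prints TOTAL * PERCENT / 100 using integer arithmetic.
percent_of() {
  echo $(( $1 * $2 / 100 ))
}

# Hypothetical node with 8388608 kB of swap and a 75% warning limit.
percent_of 8388608 75
# prints 6291456
```

On a Linux node, the real total can be read from the SwapTotal line in /proc/meminfo.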

Testing Everything

With everything set up, it is a good idea to check that notifications actually arrive. Much to the chagrin of my fellow admins, I used the following command multiple times to force notifications:

/usr/share/munin/munin-limits --force

You only ever have to execute this on the primary node, as the munin user. It will trigger all notifications at the same time. If you want to test whether HTML generation and all other updates are configured correctly, just run

munin-cron

as the munin user. By default, this is what will run every 5 minutes anyway, but while you are still setting up everything, manual updates might be desired.

I hope this guide points you in the right direction; may your servers run without encountering any errors. Until next time!