Simple Cluster Monitoring with Munin
Tags: howtos, linux
If you have more than one server to administer, chances are you will benefit from having at least a rudimentary form of monitoring in place. Having access to usage statistics, thermal sensors, and the like will make your life immensely easier in the long run. When Max and I started working on the compute cluster of our lab in earnest, we wanted a lightweight monitoring solution that would be easy to set up while leaving us room to grow. After evaluating existing solutions, I decided to go with Munin, a lightweight server–client set of monitoring plugins. These notes provide some details about how to set it up and easily gain access to warnings.
Why Munin?
Munin comes with pre-packaged binaries for all major Linux distributions, making it a breeze to install. Plugins can be written in shell script or Perl. While my skills in the latter language are quite rusty, I still feel sufficiently confident in whipping something up.
Setting Up the Primary Munin Server
Munin requires one server to be the dedicated primary node. This is the
node to which all other nodes will report, and it is also the node that
will serve the statistics, send out warnings, and so on. Setting up this
server requires you to install the main munin package and to configure
/etc/munin/munin.conf. An example configuration file could look roughly
like this:
# `munin` primary configuration file
# (Exactly one) directory to include all files from.
includedir /etc/munin/munin-conf.d
contact.slack.always_send warning critical
contact.slack.command /usr/local/bin/slack_notify_munin "Munin notification for ${var:host}: ${var:graph_title}: warnings: ${loop<,>:wfields ${var:label} = ${var:value}} / criticals: ${loop<,>:cfields ${var:label} = ${var:value}}"
[alpha.example.com]
address 10.0.0.1
[beta.example.com]
address 10.0.0.2
Note that I ignored the default values shipped in the configuration file
in order to highlight that there are really not a lot of options you
have to adjust. In our example, we will have two servers, alpha and
beta, to monitor. There are many more examples of how to build complex
hierarchies, but this is not something you have to decide on right now.
Sending Warnings
The other interesting thing is the slack_notify_munin script, which
makes it possible to send out warnings to a dedicated Slack channel; we
will discuss how to define limits for such warnings further below.
Munin makes setting up additional warning recipients quite simple; you
merely have to provide a name for the recipient as well as a command
that sends out the notification itself. The alert will consist of a
single line of text in our case, with all warnings and errors being
expanded correctly. See the official documentation for more information
about this syntax. I was previously not familiar with it, but it seems
to be well known within the Perl community.
An alert will look like this:
Munin notification for alpha.example.com: Memory usage: warnings: swap = 12345.00 / criticals:
If more than one item/field raises a warning, the loop will contain more
values. Feel free to adjust the wording of the alert; my personal
preference is to keep alerts as terse as possible so that they can be
parsed in one glance. Here is the slack_notify_munin script in all its
glory:
#!/usr/bin/env bash
#
# Uses Munin's notification mechanism to send a notification about the
# state of the cluster. A notification will be sent for each plugin for
# which an error or warning is detected.
#
# Parameters:
# $1: message
#
# The message in $1 is directly passed to a Slack webhook, causing it to
# appear in the configured channel.
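# Exactly one argument (the message) is expected.
if [ $# -ne 1 ]; then
  echo "Usage: $0 MESSAGE" >&2
  exit 1
fi
# Quote the URL so that the placeholder is not split into several words.
HOOK="https://hooks.slack.com/workflows/[YOUR OWN ID HERE]"
# Note: the message is inserted verbatim; quotes inside it are not escaped.
BODY="{\"message\": \"$1\"}"
curl --header "Content-Type: application/json" \
     --request POST \
     --data "$BODY" \
     "$HOOK"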
It requires you to set up a simple Slack Workflow with a ‘Webhook’
entry. See the Slack documentation on workflows for more details. You
can also use any other type of script here or even an e-mail
notification; it just needs to be executable by the munin user.
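Because the alert text ends up inside a JSON string, a double quote or backslash in a graph title or field label would produce an invalid request body. The following is a minimal sketch of how the message could be escaped before it is sent; json_escape and build_body are hypothetical helper names that are not part of Munin or Slack, and only backslashes and double quotes are handled, which should be enough for the single-line alerts used here.

```shell
#!/usr/bin/env bash
# Sketch only: `json_escape` and `build_body` are hypothetical helpers.

# Escape backslashes first, then double quotes, so that the message can be
# embedded safely in a JSON string. Control characters are NOT handled.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body expected by the webhook (a single "message" key,
# as in the script above).
build_body() {
    printf '{"message": "%s"}' "$(json_escape "$1")"
}
```

In the notification script, the BODY assignment could then read BODY=$(build_body "$1"), with the rest of the script left unchanged.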
Generating HTML
The simplest way to make Munin monitoring information available is to serve the HTML files generated by Munin. This results in a static monitoring display that updates itself every 5 minutes. More complicated setups are certainly possible, but this is perfectly sufficient for our needs at present.
If you use nginx as your main web server, serving the HTML files boils
down to a few lines in /etc/nginx/sites-available/default:
location /munin {
    alias /var/cache/munin/www/;
    expires modified +310s;
}
Presto: your monitoring results are now being served under /munin.
Notice that there are no access controls here: again, this is
sufficient for our server, which is only accessible from an internal
network, but you might want to add additional authentication here. There
are also ways to enable more interactions with the plots, but they
require fiddling with the CGI setting and I do not want to add any more
executables than I absolutely have to.
Setting Up Munin Nodes
Additional nodes such as our servers alpha and beta can be configured by
installing the munin-node package. This will already set up a bunch of
plugins, which you may want to disable later on. To grant access to the
primary node, assuming its IP address is 10.0.0.128, you merely need to
add a single line to munin-node.conf. Note that the allow directive
takes a regular expression, so the dots are escaped and the pattern is
anchored to match this one address only:
allow ^10\.0\.0\.128$
After this, the munin-node service will be queried every five minutes
and report results to the primary node.
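Before waiting for the next five-minute cycle, it can be useful to check from the primary node that a node actually accepts the connection. The following is a minimal sketch that speaks the plain-text munin-node protocol by hand; munin_node_list is a hypothetical helper name, and the sketch assumes that nc (netcat) is installed and that munin-node listens on its default port 4949.

```shell
#!/usr/bin/env bash
# Sketch only: `munin_node_list` is a hypothetical helper.
# Assumes `nc` (netcat) is available and munin-node uses its default port.

# Ask a node for the list of its enabled plugins.
#   $1: host name or address of the node
#   $2: port (optional, defaults to 4949)
munin_node_list() {
    # `list` requests the plugin names; `quit` closes the connection.
    printf 'list\nquit\n' | nc -w 5 "$1" "${2:-4949}"
}

# Usage (run on the primary node):
#   munin_node_list alpha.example.com
```

If the allow line is correct, the node should answer with a short greeting followed by the names of its plugins; a connection that closes immediately usually means that the address of the primary node was not accepted.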
Enabling and Disabling Plugins
The default Munin configuration can be quite chatty. To disable plugins
that you are not interested in, merely unlink them from
/etc/munin/plugins. Here is a list of plugins that I do not consider to
be super important for most configurations:
df_inode
entropy
forks
fw_packets
interrupts
irqstats
open_inodes
netstat
postfix_mailqueue
postfix_mailvolume
Your mileage may vary, of course, but I would always err on the side of caution: it is easy to drown in a sea of data. Make sure that you always have some actual insights from these data.
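Unlinking a handful of plugins on every node gets tedious quickly, so here is a minimal sketch of a helper that does it in one go. The name disable_plugins is hypothetical; the function only removes symbolic links, so a regular file that happens to live in the plugin directory is left alone.

```shell
#!/usr/bin/env bash
# Sketch only: `disable_plugins` is a hypothetical helper.

# Remove the symlinks of the given plugins from a plugin directory.
#   $1:   plugin directory (for example /etc/munin/plugins)
#   $2..: plugin names to disable
disable_plugins() {
    dir=$1
    shift
    for name in "$@"; do
        # Only touch symlinks, so that regular files are never deleted.
        if [ -L "$dir/$name" ]; then
            rm -- "$dir/$name"
            echo "disabled: $name"
        else
            echo "skipped (not a symlink): $name" >&2
        fi
    done
}

# Usage (as root on a node):
#   disable_plugins /etc/munin/plugins df_inode entropy forks
#   systemctl restart munin-node   # assumes a systemd-based distribution
```

The restart at the end is there because munin-node needs to be restarted before it notices the changed plugin directory.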
Setting Limits
Limits for individual plugins can be added to
/etc/munin/plugin-conf.d/limits.conf on each secondary node. The format
depends on the individual plugin that you want to configure, so be sure
to check its documentation. Here is an example that sets
percentage-based limits for the memory of a node:
[memory]
env.swap_warning 75%
env.swap_critical 90%
env.apps_warning 75%
env.apps_critical 90%
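To see whether a plugin picked up such settings, you can look at the configuration it reports on the node itself. The following is a minimal sketch; limit_lines is a hypothetical helper name, and the sketch assumes that the plugin in question prints its limits as field.warning and field.critical lines, which not every plugin does.

```shell
#!/usr/bin/env bash
# Sketch only: `limit_lines` is a hypothetical helper.

# Keep only the `<field>.warning` and `<field>.critical` lines of a
# plugin's `config` output, which is read from standard input.
limit_lines() {
    grep -E '^[A-Za-z0-9_]+\.(warning|critical) '
}

# Usage (as root on a node; `munin-run` ships with munin-node):
#   munin-run memory config | limit_lines
```

If nothing is printed, either the limits were not applied or the plugin does not support this kind of configuration.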
Testing Everything
With everything set up, it is always a good idea to check that notifications actually arrive. Much to the chagrin of my fellow admins, I used the following command multiple times to force notifications:
/usr/share/munin/munin-limits --force
You only ever have to execute this on the primary node, as the munin
user. It will trigger all notifications at the same time.
If you want to test whether HTML generation and all other updates are
configured correctly, just run munin-cron as the munin user. By default,
this is what will run every 5 minutes anyway, but while you are still
setting up everything, manual updates might be desired.
I hope this guide points you in the right direction; may your servers run without encountering any errors. Until next time!