I Have No Group, and I Must Scream

Tags: software, linux, research


One of the central tenets of software development, science, and life in general is ‘Be humble.’ In software development, this means that you should always assume that your code is wrong. It’s almost never an issue with the compiler, the frameworks, or anything else. Most of the time, it is you. This is one of those rare cases where it was not me.

Prelude

Like many research groups, we like to use compute clusters in order to execute complex computations in parallel. Among other resources, our university provides a cluster that uses IBM Spectrum LSF. Through a variety of specialised commands, this framework permits you to interact with the cluster. For instance, you can submit a new job that is to be executed on some compute node with a certain set of resource requirements. Personal preferences notwithstanding, the framework is actually pretty great—after a brief tutorial session, anyone can turn their Python scripts or other commands into jobs executed on the cluster. As a simple example, here is how to submit the command ls:

bsub ls

That’s how easy it is in many cases. More advanced calls can be made, of course, if additional resources such as memory or CPU cores are required. Depending on the configuration of the cluster itself, it is also possible to request special nodes for, say, GPU-based computations. But the main idea is that bsub and LSF should enable you to run ‘fire-and-forget’ jobs that will be executed at some point by some node. The exact choice of node does not matter—or so I thought!
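Before we get to the failure: a more fully specified submission might look like the call below, which requests four cores, a one-hour runtime limit, and additional memory. The script name is a placeholder, and the memory units of the rusage string depend on how the cluster is configured, so treat this as a sketch rather than a recipe.

bsub -n 4 -W 01:00 -R "rusage[mem=4096]" python my_script.py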

Failure

As we were working towards a deadline, I noticed intermittent job failures, but I did not think too much of them. Sometimes, compute nodes go down because of hardware issues, so there is always a chance that not all jobs will be executed properly. But a few days before some important results were due, more than 90% of all jobs started to fail. The error message was preceded by one of my own warnings (I redacted the filename for clarity):

UserWarning: File $FILE not found. This will cause an error.

A quick check showed that the file was there; I was also able to read it from one of the login nodes, so I ruled out any issues with file permissions for now. Adding some additional tracing to the code, I found that it indeed always failed when it tried to open a certain file. The file in question was not even very large—just a few kilobytes at most—but loading it failed in almost all circumstances. This was getting spooky. Did I have a Heisenbug on my hands, given that I was unable to reproduce it under controlled conditions?
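For context, the loading code boiled down to a pattern like the following sketch. The function name and the surrounding logic are placeholders rather than my actual code, but both the warning and the subsequent crash originated from a check of this form:

import os
import warnings


def load_data(path):
    # Warn if the input file appears to be missing; the job then fails
    # anyway when the file is opened below.
    if not os.path.exists(path):
        warnings.warn(f'File {path} not found. This will cause an error.')

    with open(path) as f:
        return f.read()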

The Light!

After much wailing, weeping, and gnashing of teeth, I decided to just read more about the function that caused the error. Partly, this was to distract me from impending doom, but I was also curious as to what was going on here. I was using os.path.exists() to query the existence of the file, so I thought that consulting its manual might be a smart idea. I found the following description:

Return True if path refers to an existing path or an open file descriptor. Returns False for broken symbolic links. On some platforms, this function may return False if permission is not granted to execute os.stat() on the requested file, even if the path physically exists.

This seemed to imply that the function might have returned False in order to indicate that the file could not be accessed. I had ruled out access issues before, but now they were suddenly back on the menu. Could it be that permissions changed intermittently? Was I being fooled into thinking that the permissions on the login node were the same as on the compute nodes? I certainly did not change permissions in any way, but as a last-ditch effort I decided to write a script that checks which groups my user belongs to on the compute nodes. The script looked something like this:

#!/usr/bin/env bash
#
# Submit 100 short jobs that each run 'groups' to print the group
# memberships of the current user on whatever compute node they land on.
# Each job writes its output to a file named 'debug_<jobid>.out'.

for INDEX in $(seq 0 99); do
  bsub -W 00:05 -o "debug_%J.out" -R "rusage[mem=128]" groups
done

Notice how I submitted a lot of jobs in order to gain a clearer picture of the status of the compute nodes. I was hoping that the jobs would be nicely distributed across the cluster, and I anxiously waited for them to finish running. Once the last job had finally completed, I collected my results:

tail -qn 1 debug_*.out | sort | uniq -c

A good output would have been something like this:

100 users foo bar baz 

But instead, I received an absolute hotchpotch of different groups; in some cases, my user was part of almost no groups at all. The cluster, much like my will a few minutes before, was broken. This news, grim as it might have been for the administrative staff, cheered me up quite considerably because for once, everything was right in the universe again—there was an explanation! Much like the characters in Harlan Ellison’s story whose title I butchered for this post, I had been duped and misled by the computer. I have no group, and I must scream.

Epilogue

Unlike Ellison’s story, mine had a good ending: the debug script was appreciated by the support staff, and the issue was rectified within a few days—until it cropped up again, causing considerably less confusion (the debug script was already in place), and was fixed again. This time for real.

My personal lesson from all of this: I need to study my functions more carefully. While I could argue that os.path.exists should have a different behaviour—the function is called exists, after all—it was nevertheless an assumption on my part that it would always work as expected. Know thyself and know thy functions, even if they appear to be pretty innocuous at first glance.
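If you want to tell the two cases apart, os.stat() is more informative: it raises different exceptions for a file that is genuinely missing and for one that merely cannot be accessed. The sketch below is an illustration of that idea, not the exact code I use now:

import os


def diagnose(path):
    # Unlike os.path.exists(), this distinguishes 'the file is not there'
    # from 'the file is there, but I am not allowed to look at it'.
    try:
        os.stat(path)
    except FileNotFoundError:
        return 'missing'
    except PermissionError:
        return 'exists, but access is denied'
    return 'exists and is accessible'

Had the jobs reported a permission problem instead of warning about a missing file, the broken group memberships would probably have surfaced much earlier.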

Hope you will always be in the right group—until next time!