I Have No Group, and I Must Scream
One of the central tenets of software development, science, and life in general is ‘Be humble.’ In software development, this means that you should always assume that your code is wrong. It’s almost never an issue with the compiler, the frameworks, or something else. Most of the time, it is you. This is one of those rarer cases where it is not you.
Like many research groups, we use compute clusters to execute complex
computations in parallel. Among other resources, our university provides
a cluster that uses
IBM Spectrum LSF.
Employing a variety of specialised commands, this framework permits you to
interact with a cluster. For instance, you can submit a new job that is
to be executed on some compute node with a certain set of resource
requirements. Personal preferences notwithstanding, the framework is
actually pretty great—after a brief tutorial session, anyone can turn
their Python scripts or other commands into jobs executed on the
cluster. As a simple example, here is how to submit the command
That’s how easy it is in many cases. More advanced calls can be made, of
course, if additional resources such as memory or CPU cores are required.
Depending on the configuration of the cluster itself, it is also
possible to request special nodes for, say, GPU-based computations. But
the main idea is that
bsub and LSF should enable you to run
‘fire-and-forget’ jobs that will be executed at some point by some
node. The exact choice of node does not matter—or so I thought!
As we were working towards a deadline, I noticed intermittent job failures, but I did not think too much of them. Sometimes, compute nodes go down because of hardware issues, so there is always the possibility that not all jobs can be executed properly. But a few days before some important results were due, more than 90% of all jobs started to fail. The error message was preceded by one of my own warnings (I redacted the filename for clarity):
UserWarning: File $FILE not found. This will cause an error.
A quick check showed that the file was there; I was also able to read it from one of the login nodes, so I ruled out any issues with file permissions for now. After adding some tracing to the code, I found that it indeed always failed when it tried to open a certain file. The file in question was not even very large (just a few kilobytes at most), but loading it failed in almost all circumstances. This was getting spooky. Did I have a Heisenbug on my hands, given that I was unable to reproduce it under controlled conditions?
After much wailing, weeping, and gnashing of teeth, I decided to just
read more about the function that created the error. Partially, this was
to distract me from impending doom, but I was also curious as to what
was going on here. I was using
os.path.exists() to query the existence of
the file, so I thought that consulting its manual might be a smart idea.
I found the following instructions:
Return True if path refers to an existing path or an open file descriptor. Returns
False for broken symbolic links. On some platforms, this function may return
False if permission is not granted to execute
os.stat() on the requested file, even if the path physically exists.
This seemed to imply that the function might have returned False in
order to indicate that the file cannot be accessed. I had ruled out
access issues before, but now they were suddenly back on the menu. Could
it be that permissions changed intermittently? Was I being fooled into
thinking that the permissions on the login node were the same as on the
compute node? I certainly did not change permissions in any way, but as
a last-ditch effort I decided to write a script that checks the
permissions by evaluating the groups my user is a part of. The script
looked something like this:
#!/usr/bin/env bash
for INDEX in $(seq 0 99); do
  bsub -W 00:05 -o "debug_%J.out" -R "rusage[mem=128]" groups
done
Notice how I tried to submit a lot of jobs in order to gain a clearer picture of the status of the compute nodes. I was hoping that the jobs would be nicely distributed among the cluster, and I anxiously waited for them to finish running. With the last job finally being completed, I collected my results:
tail -qn 1 debug_*.out | sort | uniq -c
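For readers who would rather stay in Python, the same tally can be sketched roughly as follows. This is only an approximation of the shell pipeline, not what I actually ran; the function name `tally_last_lines` is made up for illustration, and it assumes, like the command above, that the group list is the last line of each output file.

```python
from collections import Counter
from pathlib import Path


def tally_last_lines(directory=".", pattern="debug_*.out"):
    """Count how often each distinct last line occurs across the matching files."""
    counts = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text(errors="replace").splitlines()
        if lines:  # empty files contribute nothing, as with `tail`
            counts[lines[-1].strip()] += 1
    return counts


if __name__ == "__main__":
    # Most frequent group list first, similar in spirit to `uniq -c`.
    for line, count in tally_last_lines().most_common():
        print(f"{count:7d} {line}")
```

On a healthy cluster, the counter should contain a single entry whose count equals the number of submitted jobs.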
A good output would have been something like this:
100 users foo bar baz
But instead, I received an absolute hotchpotch of different groups; in some cases, my user was part of almost no groups at all. The cluster, much like my will a few minutes before, was broken. This news, grim as it might be for the administrative staff, cheered me up quite considerably because for once, everything was right in the universe again: there was an explanation! Much like the characters in Harlan Ellison’s story whose title I butchered for this post, I had been duped and misled by the computer. I have no group, and I must scream.
Unlike Ellison’s story, mine had a good ending: the debug script was appreciated by the support staff, and the issue was rectified within a few days—until it cropped up again, causing considerably less confusion (the debug script was already in place), and was fixed again. This time for real.
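If you want to build a similar check into a Python job, a minimal sketch could look like the one below. It is not the script I handed to the support staff; `node_report` is a hypothetical name, the code only works on Unix-like systems, and printing the hostname is my own assumption about what makes a report easier to act on.

```python
import grp
import os
import socket


def node_report():
    """Return the hostname and the group names of the current process."""
    # Include the effective GID, since os.getgroups() may omit it on some platforms.
    gids = set(os.getgroups()) | {os.getegid()}
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This node cannot resolve the GID to a name; keep the number instead.
            names.append(str(gid))
    return {"host": socket.gethostname(), "groups": sorted(names)}


if __name__ == "__main__":
    report = node_report()
    print(report["host"], *report["groups"])
```

Submitted in place of the plain groups command, this would tie each group list to the node that produced it.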
My personal lesson from all of this: I need to study my functions more
carefully. While I could argue that
os.path.exists should have
a different behaviour—the function is called
exists, after all—it
was nevertheless an assumption on my part when I thought that this would
always work as expected. Know thyself and know thy functions, even if
they appear to be pretty innocuous at first glance.
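In hindsight, a check that keeps "not there" and "not allowed" apart would have pointed me at the permissions much sooner. Here is a minimal sketch of that idea, calling os.stat() directly and looking at the exception it raises. The helper name `path_status` and its three return values are my own invention, not part of the standard library.

```python
import os


def path_status(path):
    """Classify a path as 'exists', 'missing', or 'inaccessible'.

    Unlike os.path.exists(), this does not fold a permission problem
    into a plain False. Other OSError subclasses are left to propagate.
    """
    try:
        os.stat(path)
    except FileNotFoundError:
        # Also covers broken symbolic links, because os.stat() follows links.
        return "missing"
    except PermissionError:
        return "inaccessible"
    return "exists"
```

With a helper like this, my warning could have said "file not accessible" instead of "file not found", which is a rather different message to read a few days before a deadline.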
Hope you will always be in the right group—until next time!