
Liangcheng Yu

I am interested in systems and networking. This is my homepage and blog.
Email: leoyu@seas.upenn.edu

Cheat Sheet of Using Clusters

17 Jan 2019 » tips

I summarized the notes below on using the ETH Zurich clusters (Euler, Leonhard), based on collected online tutorials, for quick reference.

Accessing the compute node

Typically, users only see and interact with the login node, which serves as the gateway to the cluster black box. However, once a job is running, one can use the bjob_connect JOBID command to connect to the compute node and monitor information relevant to the job.

Checking the user profile

The busers command prints the user's limits, e.g., the maximal number of jobs allowed to run. Below is the output for my (now expired) student account.

[liayu@lo-login-02 ~]$ busers -w
liayu                  -   2400      0      0      0      0      0      0      -      0  25000      0

Job submission

Every job has to be submitted to and handled by the batch system (Platform Load Sharing Facility, LSF) according to its scheduling policy. bsub [optional flags] < tmp.sh submits a job specified by a shell script. One can specify the resource profile for the job; the bsub flags related to resource requests are detailed here.

bsub -n 4 -W 120:00 -R "rusage[mem=10000]" "python3 x.py -y z"
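Equivalently, the same flags can live in the submission script itself as #BSUB directives, which keeps the resource profile versioned alongside the job. A sketch (the script contents and the python3 invocation are placeholders matching the example above):

```shell
#!/bin/bash
#BSUB -n 4                    # request 4 cores
#BSUB -W 120:00               # wall-clock limit of 120 hours
#BSUB -R "rusage[mem=10000]"  # 10000 MB of memory per core
#BSUB -o job_%J.out           # write stdout to a log file (%J = job ID)

python3 x.py -y z
```

Submitting it with bsub < tmp.sh then picks up the embedded directives; flags passed on the command line override those in the script.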

Interacting with a GitHub repo

Either HTTPS or SSH can be used; however, HTTPS requires typing in your username and password when cloning or making pull/push requests (one can use git config credential.helper 'cache --timeout=3600' to cache the credentials and avoid this). With SSH, one can generate an RSA key pair locally via ssh-keygen -t rsa and provide an optional passphrase for two-factor authentication (what you know + what you have). Then add the public key stored in ~/.ssh/id_rsa.pub under GitHub Settings / SSH and GPG keys. After that, one only needs to type in the passphrase to interact with GitHub repos, e.g., git clone git@github.com:{UserName}/{RepoName}.git.
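The key-generation step can be sketched as follows. The file name id_rsa_demo is a placeholder (in practice use the default ~/.ssh/id_rsa path), and -N "" creates an empty passphrase for illustration only; supply a real passphrase to keep the two-factor benefit:

```shell
# Generate a 4096-bit RSA key pair non-interactively into the current
# directory; -q suppresses the interactive prompts.
ssh-keygen -t rsa -b 4096 -f ./id_rsa_demo -N "" -q

# Print the public half; this is what goes into
# GitHub Settings / SSH and GPG keys.
cat ./id_rsa_demo.pub
```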

Handling signals

It is frustrating when a long-running job is killed without any logs saved because it exceeded the originally requested time limit. LSF tries to terminate the job gracefully by sending increasingly “unfriendly” signals: USR2 when the job is close to expiry => INT, QUIT, and TERM after a grace period => KILL, which cannot be caught or ignored. Hence, it is better to always handle the USR2 signal with a logging operation. The warning signal and grace period can be set via the -wa USR2 -wt [hh:]mm bsub arguments.
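In a shell-script job, handling the warning signal boils down to installing a trap. A minimal sketch, where the loop stands in for real work and the self-sent kill simulates LSF delivering the warning (file name and iteration counts are made up):

```shell
#!/bin/bash
# Log progress when LSF's USR2 warning arrives, then stop cleanly.
logfile="usr2_checkpoint.log"

on_usr2() {
    echo "USR2 caught at iteration $i" > "$logfile"
    stop=1                   # a real job would save its state here
}
trap on_usr2 USR2

stop=0
for i in $(seq 1 100000); do
    :                        # real work would go here
    if [ "$i" -eq 500 ]; then
        kill -USR2 $$        # simulate LSF sending the warning signal
    fi
    if [ "$stop" -eq 1 ]; then
        break                # leave the work loop once warned
    fi
done
```

Bash runs the trap handler between loop iterations, so the checkpoint records exactly how far the job got before the warning.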

Job chaining

Job chaining is useful to split a long-running job into smaller units that fit within the time limits. For example, I am only allowed to schedule jobs shorter than 24 h on the Leonhard cluster. bsub -w (wait) specifies a dependency on a previous job. E.g., the commands below (submitted together) state that job2 will be executed only if job1 finishes successfully; otherwise job2 is cleared out automatically. ended(job1) is the weaker counterpart of done(job1): it is satisfied as soon as job1 finishes, regardless of its exit status, so job2 would always be executed.

bsub -J job1 command1
bsub -J job2 -w "done(job1)" command2

The same program with a large number of dependent iterations can be split up as well; however, this requires the programmer to decouple the execution pipeline from its input/output so that each new instance can resume from the previous output. One can chain such executions in the same way (giving the jobs the same name is optional):

bsub -J long_job command
bsub -J long_job -w "done(long_job)" command
bsub -J long_job -w "done(long_job)" command

If one kills an intermediate job, its done() condition can never be satisfied, so the kill exerts a domino effect on all downstream jobs.
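The decoupling described above amounts to a checkpoint/resume pattern: each invocation reads the state left by the previous one, does a bounded chunk of work, and writes the state back. A minimal sketch (the state file name, chunk size, and the integer-summing "work" are all made up for illustration):

```shell
#!/bin/bash
# Each chained job runs this script once: resume, work, checkpoint.
state="state.txt"
chunk=4     # iterations per job submission
total=10    # overall number of iterations

# Resume from the previous job's output, if any.
if [ -f "$state" ]; then
    read i acc < "$state"
else
    i=0
    acc=0
fi

# Do a bounded chunk of "work" (here: summing integers).
end=$(( i + chunk ))
if [ "$end" -gt "$total" ]; then
    end=$total
fi
while [ "$i" -lt "$end" ]; do
    acc=$(( acc + i ))
    i=$(( i + 1 ))
done

# Persist state so the next chained job can pick it up.
echo "$i $acc" > "$state"
```

Each bsub-chained submission runs this script once; after the last chunk the state file holds the final result, and a fresh run starts over only if the file is removed.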

Status monitoring

bjobs -p lists all pending jobs, and bjobs -p JOBID lists the detailed reason for a specific one. bbjobs JOBID gives more detailed information on the requested resources and so forth. bpeek JOBID or bpeek -J JOBNAME monitors the output in real time. One can conveniently count the pending and running jobs via bjobs | wc -l (minus one for the header line). Below is the finite-state-machine figure for the job status from the IBM tutorial.


Killing jobs

bkill JOBID or bkill -J JOBNAME kills a specific job. bkill 0 kills all of your pending and running jobs. One can also “kill” a job by sending it a signal: e.g., if the program installs a USR2 handler, running bkill -s USR2 JOBID triggers its callback (e.g., logging).