7. Running Jobs on Expanse
In this Section:
Parallel Models
Parallel Models: Distributed Memory
Programs run asynchronously and pass messages to communicate and coordinate between resources.
Examples include SOA-based systems, massively multiplayer online games, and peer-to-peer apps.
The message-passing mechanism can be implemented in several ways: HTTP, RPC-like connectors, or message queues.
HPC has historically used the Message Passing Interface (MPI).
Methods for Running Jobs on Expanse
Batch Jobs
Batch jobs are submitted to Slurm as batch scripts from the login nodes. For each job you need to specify:
Partition (queue)
Time limit for the run (maximum of 48 hours)
Number of nodes, tasks per node; Memory requirements (if any)
Job name, output file location; Email info, configuration
When you run in the batch mode, you submit jobs to be run on the compute nodes using the sbatch command.
Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes.
Expanse places limits on the number of jobs queued and running on a per group (allocation) and partition basis.
Please note that submitting a large number of jobs (especially very short ones) can impact the overall scheduler response for all users.
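As a rough sketch of how the settings listed above map onto #SBATCH directives, the hypothetical helper below prints a batch script skeleton. The function name write_batch_header, the account abc123, the email address, and the program ./mycode are placeholders chosen for illustration, not Expanse defaults; the directives themselves are standard sbatch options.

```shell
#!/bin/bash
# Hypothetical helper: print a batch script skeleton for the given settings.
# Arguments: partition, time limit (HH:MM:SS), nodes, tasks per node, job name.
write_batch_header() {
    local partition=$1 time_limit=$2 nodes=$3 tasks_per_node=$4 job_name=$5
    cat <<EOF
#!/bin/bash
#SBATCH --job-name="${job_name}"
#SBATCH --output="${job_name}.%j.%N.out"
#SBATCH --partition=${partition}
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks_per_node}
#SBATCH --account=abc123
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL
#SBATCH -t ${time_limit}

# Placeholder program; replace with your own executable.
srun ./mycode
EOF
}

# Example: one full compute node for 30 minutes.
write_batch_header compute 00:30:00 1 128 mycode
```

Redirecting the output to a file (for example mycode-slurm.sb) gives a script that could then be submitted with sbatch. A memory request, if needed, would be one more directive such as --mem.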
Interactive Jobs
Interactive Jobs: Use the srun command to obtain nodes for ‘real-time, live’ command line access to a compute node.
CPU:
srun --partition=debug --qos=debug-normal --pty --account=abc123 --nodes=1 --ntasks-per-node=128 --mem=248G -t 00:30:00 --wait=0 --export=ALL /bin/bash
GPU:
srun --pty --account=abc123 --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 -p gpu-debug --gpus=1 -t 00:10:00 /bin/bash
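Once srun returns a shell prompt, it can help to confirm that the shell really is inside an allocation. The sketch below assumes the standard environment variables that Slurm sets for a job (SLURM_JOB_ID, SLURM_JOB_PARTITION, SLURM_JOB_NODELIST, SLURM_NTASKS); the function name show_allocation is hypothetical.

```shell
#!/bin/bash
# Print key details of the current Slurm allocation, if any.
# Variables that Slurm did not set are reported as "unset".
show_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a Slurm job (host: ${HOSTNAME:-unknown})"
        return 1
    fi
    echo "job id: ${SLURM_JOB_ID}"
    echo "partition: ${SLURM_JOB_PARTITION:-unset}"
    echo "nodes: ${SLURM_JOB_NODELIST:-unset}"
    echo "tasks: ${SLURM_NTASKS:-unset}"
}

show_allocation || true
```

On a login node this reports that no job is active; inside an interactive session it should list the job ID and the node that was granted.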
Command Line Jobs
The login nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. Do not run parallel or large jobs on the login nodes, even for simple tests. Even when a small test would run on a login node, its performance will be degraded by the tasks and login activity of every other user on the same node, so full tests belong on the compute nodes. For example, when this note was written, a single gzip process was consuming 98% of the CPU time on a login node of the Comet HPC system:
[user@comet-ln3 OPENMP]$ top
...
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
19937 XXXXX  20   0  4304  680  300 R 98.2  0.0 0:19.45 gzip
Commands that you type into the terminal and run on the system are considered jobs, and they consume resources. Computationally intensive jobs should be run only on the compute nodes and not the login nodes.
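One way to avoid launching heavy work on a login node by accident is a small guard in your own scripts. The sketch below is a heuristic, not an Expanse-provided tool: it assumes that SLURM_JOB_ID is set only inside a batch or interactive job, and the function name run_on_compute_only is hypothetical.

```shell
#!/bin/bash
# Run a command only when inside a Slurm allocation (batch or interactive).
# Slurm sets SLURM_JOB_ID for the processes it launches; a plain login
# shell normally does not have it.
run_on_compute_only() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Refusing to run '$*' outside a Slurm job; use sbatch or srun." >&2
        return 1
    fi
    "$@"
}

# Example: this echo stands in for a heavy command such as an OpenMP test.
run_on_compute_only echo "running on ${HOSTNAME:-unknown}" || true
```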
Slurm Resource Manager
Expanse uses the Simple Linux Utility for Resource Management (Slurm) as its resource manager. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters (https://slurm.schedmd.com/documentation.html). See also the Slurm Quick Start Guide.
| Logging on to Expanse | Slurm Architecture |
|---|---|
| User logs onto Expanse, and submits a batch script to the Slurm Controller daemon | Slurm parses the batch script for correct syntax and then queues up the job until the requested resources are available |
Slurm is the “glue” that lets a parallel computer schedule and execute jobs.
Role: Allocate resources within a cluster
Nodes (unique IP address)
Interconnect/switches
Generic resources (e.g. GPUs)
Launch and otherwise manage jobs
Slurm Partitions
About Partitions
| Partition Name | Max Walltime | Max Nodes/Job | Max Running Jobs | Max Running + Queued Jobs | Charge Factor | Comments |
|---|---|---|---|---|---|---|
| compute | 48 hrs | 32 | 32 | 64 | 1 | Used for exclusive access to regular compute nodes |
| shared | 48 hrs | 1 | 4096 | 4096 | 1 | Single-node jobs using fewer than 128 cores |
| gpu | 48 hrs | 4 | 4 | 8 (32 Tres GPU) | 1 | Used for exclusive access to the GPU nodes |
| gpu-shared | 48 hrs | 1 | 24 | 24 (24 Tres GPU) | 1 | Single-node jobs using fewer than 4 GPUs |
| large-shared | 48 hrs | 1 | 1 | 4 | 1 | Single-node jobs using large memory up to 2 TB (minimum memory required 256G) |
| debug | 30 min | 2 | 1 | 2 | 1 | Priority access to compute nodes set aside for testing of jobs with short walltime and limited resources |
| gpu-debug | 30 min | 2 | 1 | 2 | 1 | Priority access to GPU nodes set aside for testing of jobs with short walltime and limited resources |
| preempt | 7 days | 32 | | 128 | 0.8 | Discounted jobs to run on free nodes that can be pre-empted by jobs submitted to any other queue (NO REFUNDS) |
| gpu-preempt | 7 days | 1 | | 24 (24 Tres GPU) | 0.8 | Discounted jobs to run on unallocated nodes that can be pre-empted by jobs submitted to higher priority queues (NO REFUNDS) |
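The walltime and node limits in the table can be checked before submitting, which avoids a rejected or indefinitely pending job. The following is a minimal sketch with the limits transcribed by hand from the table (48 hrs = 2880 min, 7 days = 10080 min); the function names are hypothetical, and the values would need updating if the partition limits change.

```shell
#!/bin/bash
# Print "max walltime in minutes" and "max nodes per job" for a partition,
# as transcribed from the partition table.
partition_limits() {
    case "$1" in
        compute)      echo "2880 32" ;;
        shared)       echo "2880 1" ;;
        gpu)          echo "2880 4" ;;
        gpu-shared)   echo "2880 1" ;;
        large-shared) echo "2880 1" ;;
        debug)        echo "30 2" ;;
        gpu-debug)    echo "30 2" ;;
        preempt)      echo "10080 32" ;;
        gpu-preempt)  echo "10080 1" ;;
        *)            return 1 ;;
    esac
}

# Usage: check_request PARTITION MINUTES NODES
check_request() {
    local partition=$1 minutes=$2 nodes=$3 limits max_minutes max_nodes
    if ! limits=$(partition_limits "$partition"); then
        echo "unknown partition: $partition"
        return 1
    fi
    read -r max_minutes max_nodes <<< "$limits"
    if [ "$minutes" -gt "$max_minutes" ]; then
        echo "walltime exceeds ${max_minutes} min limit of $partition"
        return 1
    fi
    if [ "$nodes" -gt "$max_nodes" ]; then
        echo "node count exceeds ${max_nodes}-node limit of $partition"
        return 1
    fi
    echo "ok"
}

check_request debug 20 1            # prints: ok
check_request debug 60 1 || true    # prints: walltime exceeds 30 min limit of debug
```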
Slurm Job State Codes
Slurm has a large number of job states. The table below lists some of the most common states you will see. For full details, see the JOB STATE CODES section of the Slurm squeue man page.
| STATE | LABEL | DESCRIPTION |
|---|---|---|
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| ST | STOPPED | Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
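To make the short codes easier to read in scripts, the sketch below maps squeue's compact state codes to their labels and counts jobs per state. The function names are hypothetical. Note that squeue reports COMPLETED with the compact code CD. The squeue call is guarded so the block also runs on machines without Slurm.

```shell
#!/bin/bash
# Map squeue's compact state codes to their labels.
state_label() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        F)  echo "FAILED" ;;
        PD) echo "PENDING" ;;
        R)  echo "RUNNING" ;;
        ST) echo "STOPPED" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# Read one state code per line on stdin and print "LABEL count" per state.
summarize_states() {
    sort | uniq -c | while read -r count code; do
        echo "$(state_label "$code") $count"
    done
}

# With Slurm available, summarize your own jobs
# (-h drops the header, -o %t prints only the compact state code).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -o %t | summarize_states
fi
```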
Common Slurm Commands
Here are a few key Slurm commands. For more information, run man slurm or see the Slurm documentation.
| COMMAND | DESCRIPTION |
|---|---|
| scancel | Used to signal or cancel jobs, job arrays or job steps. |
| sbatch | Submit a batch script to Slurm. |
| sinfo | View information about Slurm nodes and partitions. |
| squeue | Used to view job and job step information for jobs managed by Slurm. |
| srun | Run a parallel job on cluster managed by Slurm. |
Submit jobs using the sbatch command:
$ sbatch mycode-slurm.sb
Submitted batch job 8718049
Check job status using the squeue command:
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8718049 compute mycode user PD 0:00 1 (Priority)
Once the job is running:
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8718049 debug mycode user R 0:02 1 expanse-14-01
Cancel a running job:
$ scancel 8718049
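The submit, check, and cancel steps above can be scripted. The following sketch extracts the job ID from the confirmation line that sbatch prints and then polls squeue until the job leaves the queue. The function names and the 30-second polling interval are assumptions made for illustration; keep polling infrequent, since frequent squeue calls add load to the scheduler.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 8718049" -> "8718049".
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Poll squeue until the job no longer appears in the queue.
# The 30-second default interval is an arbitrary choice for this sketch.
wait_for_job() {
    local job_id=$1 interval=${2:-30}
    while [ -n "$(squeue -h -j "$job_id" -o %t 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Only attempt a real submission where Slurm and the script are available.
if command -v sbatch >/dev/null 2>&1 && [ -f mycode-slurm.sb ]; then
    job_id=$(sbatch mycode-slurm.sb | parse_job_id)
    echo "submitted job $job_id"
    wait_for_job "$job_id"
    echo "job $job_id has left the queue"
fi
```

The captured ID can also be passed to scancel if the job needs to be stopped early.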
Slurm Batch Script Example
Below is an example of a batch script that prints out your environment on the compute node:
[user@login01 ENV_INFO]$ cat env-slurm.sb
#!/bin/bash
#SBATCH --job-name="envinfo"
#SBATCH --output="envinfo.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --export=ALL
#SBATCH --account=abc123
#SBATCH -t 00:01:00
# Environment
module purge
module load slurm
module load cpu
# perform some basic unix commands
echo "----------------------------------"
echo "hostname= " `hostname`
echo "date= " `date`
echo "whoami= " `whoami`
echo "pwd= " `pwd`
echo "module list= " `module list`
echo "----------------------------------"
echo "env= " `env`
echo "----------------------------------"
Portion of the output generated by this script:
[user@login01 ENV_INFO]$ cat envinfo.108867.exp-6-56.out
----------------------------------
hostname= exp-6-56
date= Wed Oct 7 23:45:43 PDT 2020
whoami= user
pwd= /home/user/DEMO/ENV_INFO
Currently Loaded Modules:
1) slurm/expanse/20.02.3 2) cpu/1.0
----------------------------------
env= SLURM_MEM_PER_CPU=1024 LD_LIBRARY_PATH=/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64 LS_COLORS=rs=0
[SNIP]
MODULESHOME=/usr/share/lmod/lmod LMOD_SETTARG_FULL_SUPPORT=no HISTSIZE=5000 LMOD_PKG=/usr/share/lmod/lmod LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod SLURM_LOCALID=0 LESSOPEN=||/usr/bin/lesspipe.sh %s LMOD_FULL_SETTARG_SUPPORT=no LMOD_DIR=/usr/share/lmod/lmod/libexec BASH_FUNC_module%%=() { eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh) } BASH_FUNC_ml%%=() { eval $($LMOD_DIR/ml_cmd "$@") } _=/usr/bin/env
----------------------------------
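The output file name above follows from the --output pattern in the script: Slurm replaces %j with the job ID and %N with the short hostname of the node. The sketch below only imitates that substitution to show where envinfo.108867.exp-6-56.out comes from; Slurm performs the expansion itself, and the function name expand_output_name is hypothetical.

```shell
#!/bin/bash
# Imitate Slurm's expansion of %j (job ID) and %N (short hostname)
# in an --output file name pattern. For illustration only.
expand_output_name() {
    local pattern=$1 job_id=$2 node=$3
    printf '%s\n' "$pattern" | sed -e "s/%j/${job_id}/g" -e "s/%N/${node}/g"
}

expand_output_name "envinfo.%j.%N.out" 108867 exp-6-56
# prints: envinfo.108867.exp-6-56.out
```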

