HPC Cluster Job Scheduler
Latest revision as of 13:03, 29 May 2024
Submitting Your First HPC Job
- Log in to the HPC cluster using one of the methods described in Accessing the Cluster via SSH on the Getting Started page.
- Add the ELSA tutorial module.
  module add elsa-tutorial
- Run the setup script to make the elsa-tutorial directory and copy the example files to your account.
  elsa-tutorial-setup.sh
- Change to the elsa-tutorial directory.
  cd elsa-tutorial
- List the names of the files that were copied to the current directory.
  ls
- Edit one of the submission scripts to modify the email address in it. This email address will receive messages when the job starts, when it ends, or if there is a failure. Use the simple text editor nano to edit the file. Press CTRL x to exit/quit nano.
  nano submit-mpi.sh
  You could alternatively use the edit feature in Open OnDemand to make the change.
- Since the tutorial doesn't require any input file, you can simply submit this job to the cluster.
  sbatch submit-mpi.sh
- Monitor the status of your running job (which should take only about 20-25 seconds to run). The system will replace $USER in the command below with your username. You can also directly specify your username instead of $USER.
  squeue --user=$USER
- When your job ends, look for the additional file that was added to your directory.
  ls
  This file will be named job.#####.out, where ##### matches the number in the JOBID column of the squeue command output. This creates unique output files, which prevents subsequent job runs from overwriting previous outputs.
- You can view the job output file by running the command below (replace ##### with the actual job ID).
  cat job.#####.out
The following video demonstrates a sample run of the tutorial steps described above: https://www.youtube.com/watch?v=yIM7NfCqUEg
NOTE: This video uses an older method for running the tutorial, but it still works.
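The submit, monitor and view steps of the tutorial can also be chained together in the shell. The following is a minimal sketch, not part of the tutorial files: submit_and_wait is a hypothetical helper name, and it assumes the submission script writes its output to job.<jobid>.out as the tutorial scripts do. It relies on the sbatch --parsable option, which prints only the job ID, and on squeue -h -j, which prints nothing once the job has left the queue.

```shell
#!/bin/bash
# Sketch only: submit a script, wait for the job to leave the queue, then show its output.
# Assumes the script writes to job.<jobid>.out, as the tutorial scripts do.
submit_and_wait() {
    script=$1
    # --parsable makes sbatch print just the job ID (optionally followed by ";cluster").
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "Submitted job $jobid"
    # -h suppresses the header; an empty result means the job is no longer pending or running.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep 5
    done
    cat "job.$jobid.out"
}

# Usage (on the cluster, from the elsa-tutorial directory):
#   submit_and_wait submit-mpi.sh
```

Polling every 5 seconds is an arbitrary choice that suits a job of 20-25 seconds; a longer interval is kinder to the scheduler for long-running jobs.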
Anatomy of a SLURM Sbatch Submit Script
Sample MPI (Parallel) Sbatch Submission Script
We'll dissect the sample SLURM sbatch submission script below.
#!/bin/bash

#SBATCH --chdir=./                     # Set the working directory
#SBATCH --mail-user=nobody@tcnj.edu    # Who to send emails to
#SBATCH --mail-type=ALL                # Send emails on start, end and failure
#SBATCH --job-name=m_pi_dart           # Name to show in the job queue
#SBATCH --output=job.%j.out            # Name of stdout output file (%j expands to jobId)
#SBATCH --ntasks=10                    # Total number of mpi tasks requested
#SBATCH --nodes=2                      # Total number of nodes requested
#SBATCH --partition=short              # Partition (a.k.a. queue) to use
#SBATCH --time=00:10:00                # Max run time (days-hh:mm:ss) ... adjust as necessary

module add elsa-tutorial

# Disable selection of Infiniband networking
export OMPI_MCA_btl=^openib

# Run MPI program
echo "Starting on "`date`
mpirun mdart 50000 10000
#            ^---- should be 500,000/ntasks to match serial version
echo "Finished on "`date`
The first line of the script must start with #! followed by the interpreter that the script will ultimately be fed to. In this case, and most commonly, it will be the /bin/bash shell.
#!/bin/bash
Normally the shell interpreter ignores # and anything that comes after it to the end of the line, but lines starting with #SBATCH will be interpreted by the sbatch command before being fed into the interpreter specified above. Note: the line must START (no leading spaces) with exactly #SBATCH for it to be recognized by sbatch.
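Because an indented #SBATCH line is silently treated as an ordinary comment, it can be worth checking a script before submitting it. The sketch below uses only grep; check_sbatch_lines is a hypothetical helper name, not a SLURM command.

```shell
# Sketch: report #SBATCH lines that would be ignored because of leading whitespace.
# check_sbatch_lines is a hypothetical helper, not part of SLURM.
check_sbatch_lines() {
    # Print (with line numbers) any #SBATCH line preceded by spaces or tabs.
    if grep -nE '^[[:space:]]+#SBATCH' "$1"; then
        echo "Warning: the lines above start with whitespace and will be treated as plain comments." >&2
        return 1
    fi
    echo "All #SBATCH lines in $1 start in the first column."
}

# Usage:
#   check_sbatch_lines submit-mpi.sh
```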
The --chdir option specifies the working directory. This will be the working directory where the job starts from (i.e., your job will cd to this directory before beginning). The ./ represents the current directory, which will be whatever directory you executed the sbatch command from. You could specify an absolute directory such as /home/hpc/ssivy/tutorial, but that would make the submission script less portable than what is used in this example.
#SBATCH --chdir=./ # Set the working directory
The next two lines specify where job status messages should be emailed. If these lines are absent from your submission script, no emails are sent. You should substitute your email address for nobody@tcnj.edu in the example. The mail-type of ALL tells the SLURM system to send emails when the job starts, when it ends, or if there is a failure.
#SBATCH --mail-user=nobody@tcnj.edu    # Who to send emails to
#SBATCH --mail-type=ALL                # Send emails on start, end and failure
The --job-name is simply the name you want to be visible in job listings, such as the output from the squeue command. It typically should not contain spaces (use - or _ instead of a space). If you insist on using spaces, the entire name must be enclosed in double quotes, e.g. #SBATCH --job-name="m pi dart".
#SBATCH --job-name=m_pi_dart # Name to show in the job queue
The --output option specifies the file where any output that would normally go to the screen/terminal will be redirected. In the example, the %j will be replaced with the job ID of the job. Each cluster job gets a unique ID, so this format allows multiple runs to create unique output files. Note that this does not affect any files that are created from within your program. To control those, consult your program's documentation.
#SBATCH --output=job.%j.out # Name of stdout output file (%j expands to jobId)
The --ntasks option specifies how many simultaneous tasks can be run by your program. This requires a program that understands MPI or a similar parallel processing method. This is basically the number of CPU processing cores you would like to allocate for your job. In this example, we are allocating 10 cores.
#SBATCH --ntasks=10 # Total number of mpi tasks requested
A node is a discrete server in the HPC cluster. The --nodes option specifies how many servers to allocate. The system will try to divide the tasks (from above) evenly among the nodes. So, for this example, each node will be assigned 5 tasks if resources allow. If your program does not use MPI or some other parallel API (e.g. it is a serial program), it is a waste to request more than 1 node.
#SBATCH --nodes=2 # Total number of nodes requested
In SLURM, a partition is what we call a group of nodes providing a similar function. Other schedulers may refer to partitions as queues. The --partition option specifies which SLURM partition or queue to assign the job to. Each partition has various settings (e.g. max job time) assigned to it. You can review the partition settings in the ELSA Job Partitions/Queues section below. If the --partition option is absent from your submission script, the SLURM default partition (short in the case of ELSA) will be used.
#SBATCH --partition=short # Partition (a.k.a. queue) to use
The --time option is where you specify the maximum amount of processing time that your job can use. This cannot exceed the maximum time allowed by the partition. If your job does not complete before the specified time is reached, it is killed/canceled. It is important to specify a time because it helps the scheduler properly organize the job queue. Your job may even get moved up in the job queue because it can be squeezed in between two larger jobs without affecting their start times. If no time option is specified, the partition maximum time limit is used. The format of the time value is #days-#hours:#minutes:#seconds, e.g. 5-10:24:32 represents 5 days, 10 hours, 24 minutes and 32 seconds. You can also just use #hours:#minutes:#seconds if your job will run for less than a day.
#SBATCH --time=00:10:00 # Max run time (days-hh:mm:ss) ... adjust as necessary
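To compare a requested run time against a partition limit, it can help to convert the time value to seconds. The sketch below handles only the two formats described above (days-hh:mm:ss and hh:mm:ss); slurm_time_to_seconds is a hypothetical helper name, and other time formats accepted by SLURM are not covered.

```shell
# Sketch: convert days-hh:mm:ss or hh:mm:ss to seconds.
# slurm_time_to_seconds is a hypothetical helper; only these two formats are handled.
slurm_time_to_seconds() {
    # Split on both "-" and ":" so a 4-field value includes days and a 3-field value does not.
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

slurm_time_to_seconds 5-10:24:32   # 5 days, 10 hours, 24 minutes, 32 seconds -> 469472
slurm_time_to_seconds 00:10:00     # 10 minutes -> 600
```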
That marks the end of the sbatch options. The rest of the submission script consists of commands that will be run by the interpreter specified on the first line (/bin/bash in our example).
Typically, your submission script will have one or more module add lines to set up the environment for your application. Without these lines, you may get errors like command not found or messages about missing libraries or other settings. The module command is part of the Lmod system used by the ELSA HPC cluster. You can find appropriate module add lines for various applications on the software page.
module add elsa-tutorial
The next lines are optional, but they help suppress warning messages that you may see in your job output file about not being able to access the Infiniband network interface. Since not all nodes have Infiniband, you may or may not get this warning message, depending on which nodes the SLURM scheduler assigns to your job and whether your program has the Infiniband libraries compiled in.
WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
These are not errors and your job will still run fine, but some find these messages annoying.
Of course, you don't want these lines if you do want to use Infiniband, since they basically tell the cluster not to use Infiniband even if it is available. See the constraints page for ways to guarantee that your job has access to Infiniband.
# Disable selection of Infiniband networking
export OMPI_MCA_btl=^openib
These lines are optional, but the echo line is nice to have since it outputs the date/time when your job started. It is up to you whether you'd like this diagnostic info in your output.
# Run MPI program
echo "Starting on "`date`
All the magic happens here. This is where you specify the program(s) that will do the work. If your program is MPI-enabled, you need to use the mpirun launcher program. This will start multiple instances of your program as specified by the --ntasks and --nodes options above. It will also pass various settings to the program so it can understand the environment it is working in. If you are running a serial program, you would not specify the mpirun launcher. For example, if you were running the serial version of the dartboard program, you would just specify sdart 500000 10000.
mpirun mdart 50000 10000
Again, these lines are optional. The comment is a reminder about the first mdart argument on the previous line, and the echo line outputs the date/time when your job finished.
#            ^---- should be 500,000/ntasks to match serial version
echo "Finished on "`date`
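The comment in the sample script notes that the first mdart argument should be 500,000 divided by the number of tasks. Rather than hard-coding 50000, the value can be derived inside the job. This is a sketch under the assumption that SLURM exports SLURM_NTASKS to the job when --ntasks is specified; per_task_count is a hypothetical helper name.

```shell
# Sketch: divide a total workload evenly across the MPI tasks of the job.
# SLURM_NTASKS is set by SLURM when --ntasks is given; fall back to 1 outside a job.
per_task_count() {
    total=$1
    ntasks=${SLURM_NTASKS:-1}
    echo $(( $total / $ntasks ))
}

# In the submit script, in place of the hard-coded value:
#   mpirun mdart "$(per_task_count 500000)" 10000
```

With --ntasks=10 this yields 50000, matching the sample script. Integer division drops any remainder, so pick a total that divides evenly by the task count.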
Sample Serial Sbatch Submission Script
#!/bin/bash

#SBATCH --chdir=./                     # Set the working directory
#SBATCH --mail-user=nobody@tcnj.edu    # Who to send emails to
#SBATCH --mail-type=ALL                # Send emails on start, end and failure
#SBATCH --job-name=s_pi_dart           # Name to show in the job queue
#SBATCH --output=job.%j.out            # Name of stdout output file (%j expands to jobId)
#SBATCH --ntasks=1                     # Total number of mpi tasks requested
#SBATCH --nodes=1                      # Total number of nodes requested
#SBATCH --partition=short              # Partition (a.k.a. queue) to use
#SBATCH --time=00:10:00                # Max run time (days-hh:mm:ss) ... adjust as necessary

module add elsa-tutorial

# Run serial program
echo "Starting on "`date`
sdart 500000 10000
echo "Finished on "`date`
Since most of the submission script remains the same as the one above, let's focus on the differences.
Serial programs typically don't make use of multiple processing cores, so it doesn't make sense to assign multiple tasks or multiple nodes. Some serial programs can do multi-threading, so you can try increasing the --ntasks option to see if it makes a performance difference.
#SBATCH --ntasks=1                     # Total number of mpi tasks requested
#SBATCH --nodes=1                      # Total number of nodes requested
As mentioned in the dissection above, when running serial programs, you do not specify the mpirun launcher (sometimes also called mpiexec). Just specify the command like you would at a normal Linux command line.
sdart 500000 10000
Sample GPU Sbatch Submission Script
#!/bin/bash

#SBATCH --chdir=./                     # Set the working directory
#SBATCH --mail-user=nobody@tcnj.edu    # Who to send emails to
#SBATCH --mail-type=ALL                # Send emails on start, end and failure
#SBATCH --job-name=g_pi_dart           # Name to show in the job queue
#SBATCH --output=job.%j.out            # Name of stdout output file (%j expands to jobId)
#SBATCH --ntasks=1                     # Total number of mpi tasks requested
#SBATCH --nodes=1                      # Total number of nodes requested
#SBATCH --partition=gpu                # Partition (a.k.a. queue) to use
#SBATCH --gres=gpu:1                   # Select GPU resource (# after : indicates how many)
#SBATCH --time=00:10:00                # Max run time (days-hh:mm:ss) ... adjust as necessary

module add elsa-tutorial

# Run GPU program
echo "Starting on "`date`
gdart 500000 10000
echo "Finished on "`date`
Since most of the submission script remains the same as the one above, let's focus on the differences.
With GPU programs, a good rule of thumb is to match the --ntasks value with the number of GPUs specified in the --gres option (see more about that below). However, this depends on how much CPU parallelism the program supports. Some GPU programs are compiled with MPI support. In these cases, increasing --ntasks and --nodes may be warranted.
#SBATCH --ntasks=1 # Total number of mpi tasks requested
Of course, to have access to the nodes that contain GPUs, you need to specify a SLURM partition/queue that contains these types of nodes. Refer to ELSA Job Partitions/Queues below for your options.
#SBATCH --partition=gpu # Partition (a.k.a. queue) to use
This line is required to allocate GPU(s) for your application. GRES stands for generic resource. In this example we want a gpu resource, and the number after the : specifies how many (1 in our example).
#SBATCH --gres=gpu:1 # Select GPU resource (# after : indicates how many)
While this line is the same as in the other examples, you may also need to include a line that adds CUDA library support. CUDA is the library that includes the code for GPU-enabled programs. In our example, the elsa-tutorial module automatically loads the CUDA module. If it didn't, you would need to add a line like module add cuda in addition to the one listed.
module add elsa-tutorial
No special launcher is needed to run GPU-based applications. In this example, our GPU dartboard program is not compiled with MPI support. If it were, we would use mpirun gdart 500000 10000 to run it.
gdart 500000 10000
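Before launching a GPU program, a job can confirm that it actually received the GPUs it asked for. The sketch below assumes that SLURM exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu, which is the usual behavior but has not been confirmed for ELSA; gpu_count is a hypothetical helper name.

```shell
# Sketch: count the GPUs visible to the job.
# Assumes SLURM sets CUDA_VISIBLE_DEVICES (e.g. "0" or "0,1") for --gres=gpu jobs.
gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        # The variable is a comma-separated list of device indices.
        echo "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
    fi
}

# In the submit script, before running the GPU program:
#   echo "GPUs visible to this job: $(gpu_count)"
```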
Advanced Submit Script Options
Constraints
The SLURM constraint option allows for further control over which nodes your job can be scheduled on in a particular partition/queue. You may require a specific processor family or network interconnect. The features that can be used with the sbatch constraint option are defined by the system administrator and thus vary among HPC sites.
One should be careful when combining multiple constraints. It is possible to specify a combination that cannot be satisfied (e.g. requiring a node to have both a skylake and a broadwell family processor).
Available ELSA HPC constraints.
Example 1 (single constraint):
#SBATCH --constraint=skylake
Example 2 (ANDing multiple constraints):
#SBATCH --constraint="skylake&ib"
Example 3 (ORing multiple constraints):
#SBATCH --constraint="skylake|broadwell"
Example 4 (complex constraints):
#SBATCH --constraint="(skylake|broadwell)&ib"
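To see which nodes would satisfy a constraint before using it, you can query the features that each node advertises. The sketch below uses the sinfo format fields %N (node names) and %f (features); nodes_with_feature is a hypothetical helper name, and the feature names you pass must be ones defined on the cluster.

```shell
# Sketch: list the node groups that advertise a given feature.
# nodes_with_feature is a hypothetical helper built on sinfo.
nodes_with_feature() {
    feature=$1
    # -h suppresses the header; %N = node names, %f = comma-separated feature list.
    sinfo -h -o "%N %f" | awk -v want="$feature" '{
        n = split($2, feats, ",")
        for (i = 1; i <= n; i++) if (feats[i] == want) { print $1; break }
    }'
}

# Usage (on the cluster):
#   nodes_with_feature ib
```

An empty result for a feature suggests that a constraint using it cannot be satisfied.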
Node Exclusivity
The --exclusive option specifies that the job allocation cannot share nodes with other running jobs.
This option should be used judiciously and sparingly. If, for example, your job requires only 2 CPU cores and is scheduled on a node with 32 cores, no other job will be able to make use of the remaining 30 cores (not even your own job). Where this may make sense is when your job is competing for network bandwidth or storage access with others running on the same node. Using this option will guarantee that the entire node is exclusive to your job.
Example:
#SBATCH --exclusive
Job Arrays
Example 1:
#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100
Example 2 (step size):
#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100:20
Example 3 (limit simultaneous tasks):
#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100%5
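In the examples above, %A expands to the job ID of the array as a whole and %a to the index of the individual array task. A step size of 20 over 1-100 runs indices 1, 21, 41, 61 and 81, and %5 allows at most 5 array tasks to run at once. Inside each task, SLURM sets SLURM_ARRAY_TASK_ID to that task's index, which is how one script can process a different input per task. The sketch below shows one way to use it; input_for_task and the input_<N>.dat naming scheme are hypothetical.

```shell
# Sketch: choose a per-task input file inside a job array.
# The input_<N>.dat naming scheme is hypothetical; adapt it to your own files.
input_for_task() {
    # Fails with a message if the script is not running as an array task.
    echo "input_${SLURM_ARRAY_TASK_ID:?not running inside a job array}.dat"
}

# In the submit script (my_program is a placeholder for your application):
#   my_program "$(input_for_task)"
```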
Example Submit Scripts
Content to be created.
ELSA Job Partitions/Queues
Partition/Queue Name | Max Time Limit | Resource Type |
---|---|---|
short | 6 hours | CPU |
normal | 24 hours | CPU |
long | 7 days | CPU |
nolimit* | none | CPU |
amd | 30 days | CPU |
shortgpu | 6 hours | GPU |
gpu | 7 days | GPU |
* - Use of the nolimit partition is restricted to approved cluster users. Faculty may request access for themselves and students by emailing ssivy@tcnj.edu.
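As a rough aid for choosing a CPU partition, the time limits in the table above can be encoded in a small helper. This is only a sketch: suggest_cpu_partition is a hypothetical name, the limits are copied from the table and may change, and the rule of picking the shortest partition that fits is a suggestion rather than site policy.

```shell
# Sketch: suggest the shortest CPU partition whose time limit covers a run time in hours.
# Limits are copied from the table above (6 h, 24 h, 7 days = 168 h, 30 days = 720 h).
suggest_cpu_partition() {
    hours=$1
    if   [ "$hours" -le 6 ];   then echo short
    elif [ "$hours" -le 24 ];  then echo normal
    elif [ "$hours" -le 168 ]; then echo long
    elif [ "$hours" -le 720 ]; then echo amd
    else echo nolimit   # restricted to approved users, per the note above
    fi
}

# Usage:
#   suggest_cpu_partition 20    # prints "normal"
```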