HPC Cluster Job Scheduler

This content is under construction. Check back often for updates.

Submitting Your First HPC Job

  1. Login to the HPC cluster using one of the methods described in Accessing the Cluster via SSH on the Getting Started page.
  2. Make a directory using the command
    mkdir tutorial
    and then change into that directory using
    cd tutorial
  3. Next, make a copy of the submit script examples using
    cp /opt/tcnjhpc/elsa-tutorial/examples/submit-* .
    (make sure to include the trailing . which represents the current directory as the target of the copy command).
  4. List the names of the files that were copied to the current directory.
    ls
  5. Edit one of the submission scripts to modify the email address in it. This email address will receive messages when the job starts and ends or if there was some kind of failure. Use the simple text editor nano to edit the file. Press CTRL+X to exit nano; it will prompt you to save your changes.
    nano submit-mpi.sh
    You could alternatively use the edit feature in Open OnDemand to make the change.
  6. Since the tutorial doesn't require any input file, you can simply submit this job to the cluster.
    sbatch submit-mpi.sh
  7. Monitor the status of your running job (which should only take about 20-25 seconds to run). The system will replace $USER in the command below with your username. You can also directly specify your username instead of $USER.
    squeue --user=$USER
  8. When your job ends, look for the additional file that was added to your directory.
    ls
    This file will be in the form of job.#####.out where the ##### matches the number in the JOBID column of the squeue command output. This creates unique output files which prevents subsequent job runs from overwriting previous outputs.
  9. You can view the job output file by running the command below (replace ##### with the actual job ID). A condensed recap of all the commands in this tutorial follows the list.
    cat job.#####.out
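
For convenience, here is a condensed recap of the commands from the steps above (substitute your own email address when editing, and the actual job ID when viewing the output):

mkdir tutorial
cd tutorial
cp /opt/tcnjhpc/elsa-tutorial/examples/submit-* .
ls
nano submit-mpi.sh               # set --mail-user to your email address
sbatch submit-mpi.sh
squeue --user=$USER              # check the job status until it finishes
ls
cat job.#####.out                # replace ##### with the job ID shown by squeue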


The video below demonstrates a sample run of the tutorial steps described above.

Anatomy of a SLURM Sbatch Submit Script

We'll use this sample SLURM sbatch submission script below in our dissection.

#!/bin/bash

#SBATCH --workdir=./                     # Set the working directory
#SBATCH --mail-user=nobody@tcnj.edu      # Who to send emails to
#SBATCH --mail-type=ALL                  # Send emails on start, end and failure
#SBATCH --job-name=m_pi_dart             # Name to show in the job queue
#SBATCH --output=job.%j.out              # Name of stdout output file (%j expands to jobId)
#SBATCH --ntasks=10                      # Total number of mpi tasks requested
#SBATCH --nodes=2                        # Total number of nodes requested
#SBATCH --partition=short                # Partition (a.k.a. queue) to use

module add elsa-tutorial

# Disable selection of Infiniband networking
export OMPI_MCA_btl=^openib

# Run MPI program
echo "Starting on "`date`
mpirun mdart 50000 10000
#              ^---- should be 500,000/ntasks to match serial version
echo "Finished on "`date`

The first line of the script must start with #! followed by the interpreter that the script will ultimately be fed to. In this case, and most commonly, it will be the /bin/bash shell.

#!/bin/bash

Normally the shell interpreter ignores # and anything that comes after it to the end of the line, but lines starting with #SBATCH will be interpreted by the sbatch command before being fed into the interpreter specified above. Note: the line must START (no leading spaces) with exactly #SBATCH for it to be recognized by sbatch.
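
As an illustration of that rule (the directives here are just placeholders), only the first line below would be seen by sbatch; the other two are treated as ordinary shell comments:

#SBATCH --ntasks=4       # recognized: the line starts with exactly "#SBATCH "
  #SBATCH --ntasks=4     # ignored: leading spaces
# SBATCH --ntasks=4      # ignored: space between # and SBATCH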

The --workdir option specifies the working directory for the job (i.e., your job will cd to this directory before it begins running).

The ./ represents the current directory which will be whatever directory you executed the sbatch command from. You could specify an absolute directory such as /home/hpc/ssivy/tutorial, but that would make the submission script less portable than what is used in this example.

#SBATCH --workdir=./                     # Set the working directory
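
For comparison, the less portable absolute-path form mentioned above would look like this:

#SBATCH --workdir=/home/hpc/ssivy/tutorial   # Works only from this specific directory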

The next two lines specify where job status messages should be emailed. If these lines are absent from your submission script, no emails are sent. You should substitute your email address for nobody@tcnj.edu in the example. The mail-type of ALL tells the SLURM system to send emails when the job starts, ends, or fails.

#SBATCH --mail-user=nobody@tcnj.edu      # Who to send emails to
#SBATCH --mail-type=ALL                  # Send emails on start, end and failure
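
If you would rather not receive an email when the job starts, SLURM also accepts a comma-separated list of event types; a sketch using the same placeholder address:

#SBATCH --mail-user=nobody@tcnj.edu      # Who to send emails to
#SBATCH --mail-type=END,FAIL             # Send emails only on completion or failure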

The --job-name option is simply the name you want to be visible in job listings, such as the output of the squeue command. It typically should not contain spaces (use - or _ instead of a space). If you insist on using spaces, the entire name must be enclosed in double quotes, e.g. #SBATCH --job-name="m pi dart".

#SBATCH --job-name=m_pi_dart             # Name to show in the job queue

The --output option specifies the file where any output that would normally go to the screen/terminal will be redirected. In the example, the %j will be replaced with the job ID of the job; each cluster job gets a unique ID, so if SLURM assigns your job the ID 12345 the output goes to job.12345.out. This format allows multiple runs to create unique output files. Note that this does not affect any files that are created from within your program; consult your program's documentation for that.

#SBATCH --output=job.%j.out              # Name of stdout output file (%j expands to jobId)

The --ntasks option specifies how many simultaneous tasks your program can run. This requires a program that understands MPI or a similar parallel processing method. It is essentially the number of CPU processing cores you would like to allocate for your job. In this example, we are allocating 10 cores.

#SBATCH --ntasks=10                      # Total number of mpi tasks requested

A node is a discrete server in the HPC cluster. The --nodes option specifies how many servers to allocate. The system will try to divide the tasks (from above) evenly across the nodes, so in this example each node will be assigned 5 tasks if resources allow. If your program does not use MPI or some other parallel API (e.g. a serial program), it is a waste to request more than 1 node.

#SBATCH --nodes=2                        # Total number of nodes requested
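
If you prefer to state the split explicitly rather than rely on the even division described above, SLURM also provides a --ntasks-per-node option; a minimal sketch equivalent to this example:

#SBATCH --nodes=2                        # Total number of nodes requested
#SBATCH --ntasks-per-node=5              # 5 tasks on each node (2 x 5 = 10 tasks total)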

In SLURM, a partition is what we call a group of nodes providing a similar function. Other schedulers may refer to partitions as queues. The --partition option specifies which SLURM partition or queue to assign the job to. Each partition has various settings (e.g. max job time) assigned to it. You can review the partition settings in the ELSA Job Partitions/Queues section below. If the --partition option is absent from your submission script, the SLURM default partition (short in the case of ELSA) will be used.

#SBATCH --partition=short                # Partition (a.k.a. queue) to use
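
For a job that needs more CPU time than the short partition allows, point it at one of the longer partitions listed in the ELSA Job Partitions/Queues table at the bottom of this page, e.g.:

#SBATCH --partition=long                 # CPU partition with a 7-day time limit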

That marks the end of the sbatch options. The rest of the submission script consists of commands that will be run by the interpreter specified on the first line (/bin/bash in our example).

module add elsa-tutorial
# Disable selection of Infiniband networking
export OMPI_MCA_btl=^openib
# Run MPI program
echo "Starting on "`date`
mpirun mdart 50000 10000
#              ^---- should be 500,000/ntasks to match serial version
echo "Finished on "`date`

Advanced Submit Script Options

Constraints

The SLURM constraint option allows for further control over which nodes your job can be scheduled on in a particular partition/queue. You may require a specific processor family or network interconnect. The features that can be used with the sbatch constraint option are defined by the system administrator and thus vary among HPC sites.

One should be careful when combining multiple constraints. It is possible to specify a combination that cannot be satisfied (e.g. requiring a node to have both a skylake and a broadwell family of processor).

Available ELSA HPC constraints.

Example 1 (single constraint):

#SBATCH --constraint=skylake

Example 2 (ANDing multiple constraints):

#SBATCH --constraint="skylake&ib"

Example 3 (ORing multiple constraints):

#SBATCH --constraint="skylake|broadwell"

Example 4 (complex constraints):

#SBATCH --constraint="(skylake|broadwell)&ib"
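
To find out which feature tags the nodes in a given partition advertise (and therefore what you can put in --constraint), you can query sinfo from a login shell; a sketch, assuming the short partition:

sinfo --partition=short --format="%N %f"   # list node names and their feature tags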

Node Exclusivity

The --exclusive option tells SLURM that the job allocation cannot share nodes with other running jobs.

This option should be used judiciously and sparingly. If, for example, your job requires only 2 CPU cores and is scheduled on a node with 32 cores, no other job will be able to make use of the remaining 30 cores (not even your own other jobs). Where this may make sense is when your job is competing for memory (RAM) with others running on the same node. The system is not yet configured to enforce memory limitations like it does for CPU cores. Using this option will guarantee that the entire node is exclusive to your job.

Example:

#SBATCH --exclusive

Job Arrays

A job array submits many instances of the same script from a single sbatch command; each instance (array task) receives its own index in the SLURM_ARRAY_TASK_ID environment variable, which it can use to select its input. In the --output pattern, %A expands to the array's master job ID and %a to the array task index, so each task writes to its own output file.

Example 1 (run 100 array tasks with indices 1-100):

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100

Example 2 (step size of 20, i.e. indices 1, 21, 41, 61, 81):

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100:20

Example 3 (limit the number of simultaneously running tasks to 5):

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100%5
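
Putting these pieces together, a minimal array submit script might look like the sketch below; the program name and the input-file naming are hypothetical placeholders, not part of the tutorial examples:

#!/bin/bash

#SBATCH --workdir=./                     # Set the working directory
#SBATCH --job-name=array_demo            # Name to show in the job queue
#SBATCH --output=job.%A_%a.out           # %A = master job ID, %a = array task index
#SBATCH --array=1-100                    # Run 100 array tasks, indices 1 through 100
#SBATCH --ntasks=1                       # Each array task here is a single-core job
#SBATCH --partition=short                # Partition (a.k.a. queue) to use

# SLURM sets SLURM_ARRAY_TASK_ID to this task's index (1..100),
# which we use to choose a different input file for each task.
echo "Array job ${SLURM_ARRAY_JOB_ID}, task ${SLURM_ARRAY_TASK_ID}"
./my_program input.${SLURM_ARRAY_TASK_ID}.dat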

Example Submit Scripts

Content to be created.

ELSA Job Partitions/Queues

Partition/Queue Name    Max Time Limit    Resource Type
short                   6 hours           CPU
normal                  24 hours          CPU
long                    7 days            CPU
nolimit*                none              CPU
shortgpu                6 hours           GPU
gpu                     7 days            GPU

* - Use of the nolimit partition is restricted to approved cluster users. Faculty may request access for themselves and students by emailing ssivy@tcnj.edu.