HPC Cluster Job Scheduler

This content is under construction. Check back often for updates.

Submitting Your First HPC Job

Content to be created.

Anatomy of a SLURM Sbatch Submit Script

Content to be updated.

#!/bin/bash

#SBATCH --workdir=./                     # Set the working directory
#SBATCH --mail-user=nobody@tcnj.edu      # Who to send emails to
#SBATCH --mail-type=ALL                  # Send emails on start, end and failure
#SBATCH --job-name=pi_dart               # Name to show in the job queue
#SBATCH --output=job.%j.out              # Name of stdout output file (%j expands to jobId)
#SBATCH --ntasks=4                       # Total number of mpi tasks requested
#SBATCH --nodes=1                        # Total number of nodes requested
#SBATCH --partition=test                 # Partition (a.k.a. queue) to use

# Disable selecting Infiniband
export OMPI_MCA_btl=self,tcp

# Run MPI program
echo "Starting on "`date`
mpirun pi_dartboard
echo "Finished on "`date`

Advanced Submit Script Options

Constraints

The SLURM constraint option allows finer control over which nodes your job can be scheduled on within a particular partition/queue. For example, you may require a specific processor family or network interconnect. The features that can be used with the sbatch constraint option are defined by the system administrator and therefore vary among HPC sites.

Be careful when combining multiple constraints: it is possible to specify a combination that cannot be satisfied (e.g. requiring both the skylake and broadwell processor families on the same node).

The available ELSA HPC constraints are listed on the HPC_SLURM_Features page.

Example 1 (single constraint):

#SBATCH --constraint=skylake

Example 2 (AND-ing multiple constraints):

#SBATCH --constraint="skylake&ib"

Example 3 (OR-ing multiple constraints):

#SBATCH --constraint="skylake|broadwell"

Example 4 (complex constraints):

#SBATCH --constraint="(skylake|broadwell)&ib"

Node Exclusivity

The --exclusive option ensures that the job allocation cannot share nodes with other running jobs.

This option should be used judiciously and sparingly. If, for example, your job requires only 2 CPU cores and is scheduled on a node with 32 cores, no other job will be able to make use of the remaining 30 cores (not even another of your own jobs). Where this may make sense is when your job is competing for memory (RAM) with other jobs running on the same node. The system is not yet configured to enforce memory limits the way it does for CPU cores, so using this option guarantees that the entire node is exclusive to your job.

Example:

#SBATCH --exclusive
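
As a sketch, --exclusive is simply added to the usual set of sbatch directives; everything else in the script stays the same (the job name, task count and program below are placeholders):

#!/bin/bash
#SBATCH --job-name=big_memory_run       # Placeholder job name
#SBATCH --ntasks=2                      # Only 2 tasks requested ...
#SBATCH --nodes=1                       # ... on a single node ...
#SBATCH --exclusive                     # ... but no other job may share that node
#SBATCH --partition=normal              # Partition (a.k.a. queue) to use

./my_program                            # Placeholder for a memory-hungry application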

Job Arrays

A job array submits many near-identical tasks from a single script. Each task runs with its own value of the SLURM_ARRAY_TASK_ID environment variable, and in output file names %A expands to the array job ID while %a expands to the task ID.

Example 1:

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100

Example 2 (step size of 20; runs task IDs 1, 21, 41, 61, 81):

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100:20

Example 3 (limit of 5 simultaneously running tasks):

#SBATCH --output=job.%A_%a.out
#SBATCH --array=1-100%5
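
Within an array job, each task can read its own index from the SLURM_ARRAY_TASK_ID environment variable and use it to pick its input. A minimal sketch (the program name and input file naming scheme are assumptions):

#!/bin/bash
#SBATCH --output=job.%A_%a.out          # %A = array job ID, %a = task ID
#SBATCH --array=1-100                   # Run tasks 1 through 100

# Each task processes a different file, e.g. input_1.dat through input_100.dat
./my_program input_${SLURM_ARRAY_TASK_ID}.dat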

Example Submit Scripts

Content to be created.

ELSA Job Partitions/Queues

Partition/Queue Name    Max Time Limit    Resource Type
short                   6 hours           CPU
normal                  24 hours          CPU
long                    7 days            CPU
nolimit*                none              CPU
shortgpu                6 hours           GPU
gpu                     7 days            GPU

* - Use of the nolimit partition is restricted to approved cluster users. Faculty may request access for themselves and students by emailing ssivy@tcnj.edu.
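
To target one of these partitions, pass its name with --partition; GPU partitions also need an explicit GPU request via --gres (the gpu:1 GRES shown here is an assumption and depends on how the GPU nodes are configured):

#SBATCH --partition=gpu                 # 7-day GPU partition from the table above
#SBATCH --gres=gpu:1                    # Request one GPU (site-specific GRES syntax)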