Slurm Tips
This blog records some tips about using Slurm.
1. How to set cpus-per-task and OMP_NUM_THREADS?
The cpus-per-task option in SLURM (--cpus-per-task or -c) specifies how many CPU cores each task in your job should be allocated. Here's how to understand it:
- A “task” in SLURM typically corresponds to one process
- cpus-per-task tells SLURM how many CPU cores that single process needs
- This is primarily used for multithreaded applications (like OpenMP programs)
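To make the distinction concrete, here is a minimal batch script sketch (the job name and ./my_program are placeholders): it asks for 2 tasks with 4 cores each, so Slurm reserves 8 cores in total and srun starts 2 copies of the program. On some Slurm versions you may also need to pass --cpus-per-task to srun explicitly.

#!/bin/bash
#SBATCH --job-name=omp_demo     # hypothetical job name
#SBATCH --ntasks=2              # 2 tasks = 2 processes
#SBATCH --cpus-per-task=4       # 4 CPU cores reserved for each task
srun ./my_program               # launches 2 copies, each with 4 cores available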
In SLURM, cpus-per-task typically refers to logical cores (hyperthreads), not physical cores. Here’s the breakdown:
- SLURM counts logical cores by default
- On a hyperthreaded system, each physical core appears as 2 logical cores
- So --cpus-per-task=8 usually means 8 logical cores (which could be 4 physical cores with hyperthreading)
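If you are not sure how your nodes count cores, you can check from inside an allocation. The commands below use standard tools (lscpu, and the SLURM_CPUS_PER_TASK variable, which is only set when you request --cpus-per-task), but the exact output depends on your system:

# Run inside a job, e.g. in the batch script or an interactive shell
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
echo "Slurm allocated $SLURM_CPUS_PER_TASK logical CPUs to this task"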
For hyperthreaded systems, the optimal settings depend on your workload type and performance goals. Here are the common approaches:
Approach 1: Use Physical Cores Only (Recommended for CPU-intensive work)
sbatch --cpus-per-task=8
export OMP_NUM_THREADS=4 # Half of cpus-per-task
This is because --cpus-per-task=8 reserves 8 logical cores, which could be 4 physical cores. CPU-intensive programs often perform better with one thread per physical core, rather than having two hyperthreads compete for the same core's execution resources.
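If you prefer not to hard-code the thread count, you can derive it from the allocation inside the job script. This sketch assumes 2 hardware threads per physical core, which you should verify for your nodes:

#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK / 2 ))  # one thread per physical core, assuming 2-way hyperthreading
./my_program                                           # placeholder program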
Approach 2: Use All Logical Cores (Good for I/O or memory-bound work)
sbatch --cpus-per-task=8
export OMP_NUM_THREADS=8 # Matches cpus-per-task
This is because memory-bound or I/O-intensive work can benefit from hyperthreading since threads may be waiting anyway.
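For this approach you can simply let the thread count follow the allocation, so the script stays correct if you change --cpus-per-task later:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}  # one thread per allocated logical core (falls back to 1 if unset)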
2. How to set OMP_PROC_BIND and OMP_PLACES?
These are OpenMP environment variables that control thread affinity: how your parallel threads are assigned to specific CPU cores or processing units.
OMP_PROC_BIND controls whether and how threads are bound to processors:
- false (default): Threads can migrate between processors freely
- true: Threads are bound to processors, but the specific binding pattern depends on the implementation
- master/primary: All threads are bound to the same processor as the primary thread
- close: Threads are bound to processors close to the primary thread
- spread: Threads are bound and are distributed as evenly as possible across available processors
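A simple way to see what each setting actually does on your machine is to ask the runtime to report it. OMP_DISPLAY_ENV (OpenMP 4.0) and OMP_DISPLAY_AFFINITY (OpenMP 5.0) are standard variables, though support depends on your compiler's OpenMP runtime; ./my_program is a placeholder:

export OMP_PROC_BIND=spread
export OMP_DISPLAY_ENV=true        # print the effective OpenMP settings at startup
export OMP_DISPLAY_AFFINITY=true   # print where each thread was placed (OpenMP 5.0+)
./my_program                       # placeholder program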
OMP_PLACES defines the set of processors available for thread placement:
- threads: Each place corresponds to a single hardware thread (logical thread)
- cores: Each place corresponds to a single core (all threads on that core)
- sockets: Each place corresponds to a single processor socket
- {0,1,2,3}: Explicit list of processor IDs
- {0:4}: Range notation (processors 0 through 3)
- {0:4:2}: Range with stride (4 processors starting at 0 with stride 2, i.e. processors 0, 2, 4, and 6)
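The explicit forms are mainly useful when you want full manual control, for example pinning a few threads to specific logical CPUs. The IDs below are purely illustrative; how logical CPUs map to physical cores differs between machines:

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES="{0},{2},{4},{6}"   # hypothetical IDs: one place per chosen logical CPU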
If you care about performance, try the following combined configurations:
# Often a good default: lets the OpenMP runtime pick the binding pattern using its knowledge of your hardware
OMP_PROC_BIND=true
OMP_PLACES=threads
# Generally good: each thread owns a full physical core, so there is less contention between hyperthreads
OMP_PROC_BIND=spread
OMP_PLACES=cores
# Good when hyperthreading is beneficial: threads are spread across all logical cores
OMP_PROC_BIND=spread
OMP_PLACES=threads
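Putting both sections together, a complete job script for a CPU-bound OpenMP program might look like this sketch (the program name, core counts, and the one-thread-per-physical-core choice are assumptions to adapt to your case):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8                              # 8 logical cores for the single task
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK / 2 ))  # one thread per physical core (2-way hyperthreading assumed)
export OMP_PROC_BIND=spread                            # distribute threads evenly over the places
export OMP_PLACES=cores                                # one place per physical core
./my_program                                           # placeholder program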