This blog post records some tips on using Slurm.

1. How to set cpus-per-task and OMP_NUM_THREADS?

The cpus-per-task option in SLURM (--cpus-per-task or -c) specifies how many CPU cores each task in your job should be allocated. Here's how to understand it (a small job-script sketch follows the list):

  • A “task” in SLURM typically corresponds to one process
  • cpus-per-task tells SLURM how many CPU cores that single process needs
  • This is primarily used for multithreaded applications (like OpenMP programs)
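
As a concrete, hypothetical illustration, a hybrid MPI + OpenMP job that wants 4 processes with 8 cores each could be submitted with a script like the one below; my_hybrid_app is a placeholder program name.

#!/bin/bash
#SBATCH --ntasks=4            # 4 tasks = 4 processes (for example, 4 MPI ranks)
#SBATCH --cpus-per-task=8     # each of those processes gets 8 CPU cores
export OMP_NUM_THREADS=8      # one OpenMP thread per allocated core
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid_app   # placeholder program; the flag is repeated for srun, which some newer SLURM versions require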

In SLURM, cpus-per-task typically refers to logical cores (hyperthreads), not physical cores, although the exact behavior depends on how the cluster's SLURM is configured. Here's the breakdown:

  • SLURM counts logical cores by default
  • On a hyperthreaded system, each physical core appears as 2 logical cores
  • So --cpus-per-task=8 usually means 8 logical cores (which could be 4 physical cores with hyperthreading); you can check a node's actual layout as shown below
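
Whether a node actually exposes 2 logical cores per physical core is easy to check; a small sketch, where node001 is a placeholder node name:

# Ask SLURM about the node layout (ThreadsPerCore=2 means hyperthreading is exposed)
scontrol show node node001 | grep -Eo 'ThreadsPerCore=[0-9]+'
# Or, from a shell on the node itself:
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'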

For hyperthreaded systems, the optimal settings depend on your workload type and performance goals. Here are the common approaches:

Approach 1: Use Physical Cores Only (Recommended for CPU-intensive work)

#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=4   # half of cpus-per-task: one thread per physical core

This is because --cpus-per-task=8 reserves 8 logical cores, which could be only 4 physical cores. CPU-intensive programs often perform better with one thread per physical core than with two threads competing for the execution resources of the same core.
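
Rather than hard-coding the thread count, you can derive it inside the job script from the SLURM_CPUS_PER_TASK variable that SLURM sets; a sketch, assuming 2 hardware threads per physical core:

# Use half of the allocated logical cores: one OpenMP thread per physical core
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK / 2 ))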

Approach 2: Use All Logical Cores (Good for I/O or memory-bound work)

#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=8   # matches cpus-per-task: one thread per logical core

This is because memory-bound or I/O-intensive work can benefit from hyperthreading, since threads spend much of their time waiting on memory or I/O anyway.
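
Putting Approach 2 into a complete job script might look like the sketch below; my_openmp_app and the resource numbers are placeholders:

#!/bin/bash
#SBATCH --job-name=omp-test
#SBATCH --ntasks=1               # a single process
#SBATCH --cpus-per-task=8        # 8 logical cores for that process
#SBATCH --time=00:30:00
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # one thread per allocated logical core
./my_openmp_app                  # placeholder: your OpenMP program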

2. How to set OMP_PROC_BIND and OMP_PLACES ?

These two are OpenMP environment variables that control thread affinity - how your parallel threads are assigned to specific CPU cores or processing units.

OMP_PROC_BIND controls whether and how threads are bound to processors:

  • false (default): Threads can migrate between processors freely
  • true: Threads are bound to processors, but the specific binding pattern depends on the implementation
  • master/primary: All threads are bound to the same processor as the primary thread
  • close: Threads are bound to processors close to the primary thread
  • spread: Threads are bound and distributed as evenly as possible across the available processors (see the sketch after this list)
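
To make close and spread concrete, here is roughly what happens on a hypothetical node with 2 sockets of 8 cores each (16 core places) when 8 threads are used; exact core numbering varies between machines:

export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=close     # packs the 8 threads onto adjacent cores, typically all on socket 0
# export OMP_PROC_BIND=spread  # would instead use every other core, covering both sockets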

OMP_PLACES defines the set of processors available for thread placement:

  • threads: Each place corresponds to a single hardware thread (logical thread)
  • cores: Each place corresponds to a single core (all threads on that core)
  • sockets: Each place corresponds to a single processor socket
  • {0,1,2,3}: Explicit list of processor IDs
  • {0:4}: Range notation (processors 0 through 3)
  • {0:4:2}: Range with stride (processors 0, 2, 4, and 6); see the example after this list
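
For example, the last three notations could be written in a job script like this (three alternatives, pick one; assumes the node has logical CPUs 0-7):

export OMP_PLACES="{0,1,2,3}"   # one place containing CPUs 0-3, listed explicitly
export OMP_PLACES="{0:4}"       # the same place in <start>:<length> notation
export OMP_PLACES="{0:4:2}"     # one place containing CPUs 0, 2, 4 and 6 (<start>:<length>:<stride>)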

If you care about performance, try the following combined configurations (a way to verify the resulting binding is sketched at the end):

# May be the best choice: lets the compiler/runtime apply its own knowledge of your hardware
export OMP_PROC_BIND=true
export OMP_PLACES=threads

# Generally a good default: each thread owns a full core, so there is less contention
export OMP_PROC_BIND=spread
export OMP_PLACES=cores

# Good when hyperthreading is beneficial: spreads threads across all hardware threads
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
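
To verify what the OpenMP runtime actually did with whichever combination you pick, you can ask it to print its settings and thread bindings at startup; a sketch, where my_openmp_app is again a placeholder:

export OMP_PROC_BIND=spread
export OMP_PLACES=cores
export OMP_DISPLAY_ENV=true        # print the OpenMP settings the runtime sees
export OMP_DISPLAY_AFFINITY=true   # OpenMP 5.0+: print each thread's binding at startup
./my_openmp_app                    # placeholder: your OpenMP program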