Scalable Training with SLURM#
This guide covers running ProtoMotions training jobs on SLURM-managed HPC clusters.
What is SLURM?#
SLURM (Simple Linux Utility for Resource Management) is a widely-used job scheduler for high-performance computing clusters. It manages compute resources, queues jobs, and handles multi-node distributed workloads. Most academic and enterprise GPU clusters use SLURM for job scheduling.
Overview#
ProtoMotions provides train_slurm.py, a launcher script that:
Syncs your code to the cluster (via rsync over SSH)
Generates a SLURM batch script with the correct job parameters
Submits the job to the cluster queue
Handles auto-resume via SLURM job arrays (jobs continue after timeouts)
The script is a template designed to be customized for your specific cluster setup.
Configuring for Your Cluster#
Before using SLURM training, edit the configuration section at the top of
protomotions/train_slurm.py:
# =============================================================================
# CLUSTER CONFIGURATION - EDIT THIS SECTION FOR YOUR CLUSTER
# =============================================================================
# Login node hostname (e.g., "login.mycluster.edu")
CLUSTER_LOGIN_NODE = "YOUR_CLUSTER_LOGIN_NODE"
# Base directory for experiments on the cluster filesystem
CLUSTER_BASE_DIR = "/path/to/your/experiments/directory"
# Container images (Singularity .sif or Enroot .sqsh format)
CONTAINER_IMAGES = {
    "isaacgym": "/path/to/containers/isaacgym.sqsh",
    "isaaclab": "/path/to/containers/isaaclab.sqsh",
    "newton": "/path/to/containers/newton.sqsh",
}
# Default SLURM account (your allocation/project)
DEFAULT_SLURM_ACCOUNT = "your_account"
# Default SLURM partitions
DEFAULT_SLURM_PARTITION = "gpu"
# Filesystem mounts for container
CONTAINER_MOUNTS = "/scratch:/scratch:rw"
Key settings to configure:
CLUSTER_LOGIN_NODE: SSH hostname for your cluster’s login node
CLUSTER_BASE_DIR: Directory where experiment code will be synced
CONTAINER_IMAGES: Paths to your container images (Singularity/Enroot)
DEFAULT_SLURM_ACCOUNT: Your SLURM allocation or project name
CONTAINER_MOUNTS: Filesystem paths to mount inside the container
Container Setup#
You’ll need containerized environments with ProtoMotions dependencies. Convert your
Docker images to Singularity (.sif) or Enroot (.sqsh) format as required
by your cluster.
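The exact workflow depends on your cluster’s container runtime. As a rough, hypothetical illustration (the Docker image tag and output filenames below are placeholders, not ProtoMotions artifacts), the conversion can be scripted by shelling out to enroot or apptainer:

```python
# Hypothetical sketch: convert a locally built Docker image into the formats
# referenced by CONTAINER_IMAGES. Image and file names are placeholders.
import subprocess

docker_image = "protomotions-isaaclab:latest"  # assumed local Docker tag

# Enroot .sqsh (for clusters using enroot/pyxis)
subprocess.run(
    ["enroot", "import", "-o", "isaaclab.sqsh", f"dockerd://{docker_image}"],
    check=True,
)

# Apptainer/Singularity .sif
subprocess.run(
    ["apptainer", "build", "isaaclab.sif", f"docker-daemon:{docker_image}"],
    check=True,
)
```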
Auto-Resume with Job Arrays#
Long training runs often exceed cluster time limits (e.g., 4-hour walltime). ProtoMotions handles this automatically using two mechanisms:
1. SLURM Job Arrays
The launcher submits each run as a job array (--array=0-5%1 with the default of 5
resume attempts): the initial task plus up to five resume tasks, run one at a time.
When a task hits its time limit, the next array task starts and resumes training from
the last checkpoint.
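To make the mechanism concrete, here is a hypothetical sketch (not the actual train_slurm.py code; the helper and its defaults are illustrative) of the kind of sbatch header the launcher writes:

```python
# Hypothetical sketch of an sbatch header with an auto-resume job array.
def sbatch_header(job_name: str, nodes: int, ngpu: int, time_limit: str, resumes: int = 5) -> str:
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={ngpu}",
        f"#SBATCH --time={time_limit}",
        # 0-{resumes}: the initial run plus `resumes` restarts; %1 runs one task at a time
        f"#SBATCH --array=0-{resumes}%1",
    ])

print(sbatch_header("g1_motion_tracker", nodes=1, ngpu=4, time_limit="4:00:00"))
```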
2. AutoResume Callback
When --use-slurm is enabled, training registers the AutoResumeCallbackSrun
callback. This callback:
Tracks elapsed training time
Saves a checkpoint before the SLURM time limit (default: after 3.5 hours)
Gracefully stops training so the next array job can resume
# From protomotions/agents/callbacks/slurm_autoresume_srun.py
class AutoResumeCallbackSrun(Callback):
    def __init__(self, autoresume_after=12600):  # 3.5 hours in seconds
        self.autoresume_after = autoresume_after

    def _check_autoresume(self, agent):
        # self.start_time is recorded when training begins
        if time.time() - self.start_time >= self.autoresume_after:
            agent.save()  # Save checkpoint
            agent._should_stop = True  # Signal graceful stop
The default autoresume_after=12600 (3.5 hours) works well with 4-hour job limits,
providing buffer time for checkpoint saving.
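If your cluster grants longer walltimes, the budget can be adjusted when the callback is constructed. A minimal sketch, assuming the module path matches the file quoted above (in practice the callback is registered automatically when --use-slurm is enabled):

```python
# Usage sketch: budget the auto-resume for an 8-hour walltime instead of 4 hours.
# The import path follows the file referenced above; adjust it to your installation.
from protomotions.agents.callbacks.slurm_autoresume_srun import AutoResumeCallbackSrun

# Stop and checkpoint after 7.5 hours, leaving a ~30-minute buffer before the limit.
autoresume = AutoResumeCallbackSrun(autoresume_after=int(7.5 * 3600))
```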
Understanding Scaling Parameters#
The --num-envs and --batch-size parameters are specified per GPU. With
multi-GPU and multi-node training, the effective totals scale accordingly:
Total GPUs = ngpu × nodes
Effective num-envs = num-envs × Total GPUs
Effective batch-size = batch-size × Total GPUs
Example:
With --ngpu=4 --nodes=2 --num-envs=4096 --batch-size=16384:
Total GPUs: 4 × 2 = 8 GPUs
Effective environments: 4,096 × 8 = 32,768 parallel environments
Effective batch size: 16,384 × 8 = 131,072 samples per update
This scaling is automatic—you specify per-GPU values and the distributed training handles aggregation across all processes.
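To make the arithmetic explicit, here is a small illustrative helper (not part of ProtoMotions) that mirrors the per-GPU launcher flags:

```python
# Illustrative helper: compute effective rollout sizes from per-GPU settings.
def effective_sizes(num_envs: int, batch_size: int, ngpu: int, nodes: int) -> dict:
    total_gpus = ngpu * nodes
    return {
        "total_gpus": total_gpus,
        "effective_num_envs": num_envs * total_gpus,
        "effective_batch_size": batch_size * total_gpus,
    }

# Matches the example above: 4 GPUs x 2 nodes
print(effective_sizes(num_envs=4096, batch_size=16384, ngpu=4, nodes=2))
# -> {'total_gpus': 8, 'effective_num_envs': 32768, 'effective_batch_size': 131072}
```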
Running a Training Job#
Once configured, launch training from your local machine:
python protomotions/train_slurm.py \
--robot-name=g1 \
--simulator=isaaclab \
--num-envs=4096 \
--batch-size=16384 \
--motion-file=/cluster/path/to/motions.pt \
--experiment-path=examples/experiments/mimic/mlp_bm.py \
--experiment-name=g1_motion_tracker \
--user=myusername \
--ngpu=4 \
--nodes=1 \
--slurm-time=4:00:00 \
--use-wandb
Key arguments:
| Argument | Description |
|---|---|
| --robot-name | Robot to train (e.g., g1, smpl) |
| --simulator | Physics backend (isaacgym, isaaclab, or newton) |
| --num-envs | Parallel environments per GPU (scale with GPU memory) |
| --batch-size | PPO batch size per GPU (typically 2-4x num-envs) |
| --motion-file | Path to motion data on the cluster |
| --experiment-path | Experiment config file (relative to repo root) |
| --experiment-name | Unique name for this experiment |
| --user | Your cluster username |
| --ngpu | GPUs per node |
| --nodes | Number of compute nodes |
| --slurm-time | Job time limit (HH:MM:SS) |
| --use-wandb | Enable Weights & Biases logging |

The launcher also takes an option setting the number of auto-resume attempts (default: 5), which determines how many tasks are queued in the SLURM job array.
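If you want to script several related submissions, the launcher can be invoked programmatically using the flags above. A hypothetical sketch; the robot/simulator pairings mirror the two examples in this guide, and all paths are placeholders:

```python
# Hypothetical sketch: submit several related runs by calling the launcher
# with the documented flags. Paths and names are placeholders.
import subprocess

RUNS = [
    # (robot, simulator, experiment config, motion file)
    ("g1", "isaaclab", "examples/experiments/mimic/mlp_bm.py", "/cluster/path/to/motions.pt"),
    ("smpl", "isaacgym", "examples/experiments/mimic/mlp.py", "/cluster/path/to/amass_train.pt"),
]

for robot, simulator, experiment, motions in RUNS:
    subprocess.run(
        [
            "python", "protomotions/train_slurm.py",
            f"--robot-name={robot}",
            f"--simulator={simulator}",
            "--num-envs=4096",
            "--batch-size=16384",
            f"--motion-file={motions}",
            f"--experiment-path={experiment}",
            f"--experiment-name={robot}_motion_tracker",
            "--user=myusername",
            "--ngpu=4",
            "--nodes=1",
            "--slurm-time=4:00:00",
            "--use-wandb",
        ],
        check=True,
    )
```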
Multi-Node Training#
For large-scale training across multiple nodes:
python protomotions/train_slurm.py \
--robot-name=smpl \
--simulator=isaacgym \
--num-envs=8192 \
--batch-size=16384 \
--motion-file=/cluster/path/to/amass_train.pt \
--experiment-path=examples/experiments/mimic/mlp.py \
--experiment-name=smpl_motion_tracker_4node \
--user=myusername \
--ngpu=8 \
--nodes=4 \
--slurm-time=4:00:00 \
--use-wandb
ProtoMotions uses Lightning Fabric (from the PyTorch Lightning ecosystem) for
distributed training. Each node runs --ngpu processes, and gradients are
synchronized across all nodes.
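ProtoMotions wires this up internally, but as a rough illustration of the underlying API (assuming the lightning package, and mirroring the 4-node, 8-GPU job above), a Fabric setup looks roughly like this:

```python
# Illustrative only: a minimal Lightning Fabric setup for 4 nodes x 8 GPUs.
# This is not ProtoMotions' actual training loop.
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=8, num_nodes=4, strategy="ddp")
fabric.launch()  # under SLURM, each srun-launched process joins the same job

# model, optimizer = fabric.setup(model, optimizer)  # wraps them for DDP gradient sync
print(f"rank {fabric.global_rank} / world size {fabric.world_size}")
```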
Monitoring Jobs#
After submission, the script prints monitoring commands:
# Monitor live output
ssh myusername@cluster 'tail -f /path/to/exp/slurm_output.log'
# Check job status
ssh myusername@cluster 'squeue -u myusername'
# Cancel a job
ssh myusername@cluster 'scancel <job_id>'
Next Steps#
Configuration System - Configuration system details
Experiments - Creating custom experiments
Domain Randomization & Sim2Sim - Domain randomization for robust policies