HPC and SLURM Guide
This guide covers running PolyzyMD simulations on HPC clusters using SLURM.
Overview
Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with daisy-chaining: breaking simulations into segments that run as dependent SLURM jobs.
User Workflow
The typical workflow for running a PolyzyMD simulation on an HPC cluster is:
Create a simulation directory with your configuration and input files
Write a config.yaml file with your simulation parameters
Generate job scripts using polyzymd submit --dry-run
Review the generated scripts
Submit the jobs for real
Step-by-Step Example
# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation
# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/
# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
# enzyme.pdb_path: "structures/enzyme.pdb"
# substrate.sdf_path: "structures/substrate.sdf"
# polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"
# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run
# 5. Review the generated scripts
cat job_scripts/initial_seg0_rep1.sh
# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1
# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu
Directory Structure
PolyzyMD supports separating:
Projects directory: Long-term storage for scripts, logs, configs
Scratch directory: High-performance storage for trajectories
/projects/$USER/polyzymd/                     # Long-term storage
├── my_simulation/                            # Your simulation directory
│   ├── config.yaml                           # Main configuration
│   ├── structures/                           # Input structure files
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── ATRP_EGPMA_SBMA_5-mer/                # Pre-built polymer SDFs (optional)
│   │   ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│   │   └── ...
│   ├── job_scripts/                          # Generated SLURM scripts
│   │   ├── initial_seg0_rep1.sh
│   │   ├── continue_seg1_rep1.sh
│   │   └── ...
│   └── slurm_logs/                           # Job output files
│       └── s0_r1_300K_LipA.out
/scratch/alpine/$USER/polyzymd_sims/          # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│   ├── system.pdb
│   ├── equilibration/
│   │   └── trajectory.dcd
│   ├── production_seg0/
│   │   ├── trajectory.dcd
│   │   ├── checkpoint.chk
│   │   └── state_data.csv
│   └── production_seg1/
│       └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/
Configuring Directories
In YAML Configuration
Environment variables ($USER, $HOME, etc.) and ~ are automatically expanded:
output:
projects_directory: "/projects/$USER/polyzymd/my_simulation"
scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
job_scripts_subdir: "job_scripts"
slurm_logs_subdir: "slurm_logs"
You can also use ~ for home directory:
output:
projects_directory: "~/polyzymd/my_simulation"
Via CLI Override
polyzymd submit -c config.yaml \
--projects-dir /projects/$USER/polyzymd \
--scratch-dir /scratch/alpine/$USER/simulations \
--replicates 1-5
SLURM Presets
PolyzyMD includes presets for common HPC configurations:
| Preset | Partition | GPUs | Time Limit | Memory | Description |
|---|---|---|---|---|---|
| aa100 | aa100 | 1x A100 | 24h | 3GB | NVIDIA A100 (recommended) |
| al40 | al40 | 1x L40 | 24h | 3GB | NVIDIA L40 |
| blanca-shirts | blanca-shirts | 1x | 24h | 3GB | Shirts lab partition |
| testing | atesting_a100 | 1x | 6min | 3GB | Quick tests |
Using Presets
# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100
# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing
Overriding Time Limit
You can override the preset’s time limit using --time-limit:
# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00
# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00
Time format options:
MM:SS - minutes and seconds (e.g., 2:00 for 2 minutes)
HH:MM:SS - hours, minutes, seconds (e.g., 0:02:00)
D-HH:MM:SS - days, hours, minutes, seconds (e.g., 1-00:00:00 for 1 day)
This is especially useful for:
Quick testing with short time limits
Adjusting for segment duration requirements
Working within specific QOS constraints
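When comparing a --time-limit value against an estimated segment runtime, it can help to convert the time string to seconds. The following is a minimal sketch in POSIX shell. slurm_time_to_seconds is a hypothetical helper, not part of PolyzyMD, and it handles only the three formats listed above.

```shell
# Hypothetical helper (not part of PolyzyMD): convert a time string in
# MM:SS, HH:MM:SS, or D-HH:MM:SS form to a number of seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        days = 0
        rest = $0
        dash = index(rest, "-")
        if (dash > 0) {                      # D-HH:MM:SS
            days = substr(rest, 1, dash - 1)
            rest = substr(rest, dash + 1)
        }
        n = split(rest, f, ":")
        if (n == 2) {                        # MM:SS
            h = 0; m = f[1]; s = f[2]
        } else if (n == 3) {                 # HH:MM:SS
            h = f[1]; m = f[2]; s = f[3]
        } else {
            exit 1                           # unsupported format
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 0:02:00      # prints 120
slurm_time_to_seconds 1-00:00:00   # prints 86400
```

The arithmetic is done in awk rather than in shell arithmetic so that fields with leading zeros such as 08 are read as decimal numbers.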
Custom SLURM Settings
For custom configurations, generate the scripts with --dry-run and edit them in job_scripts/ before submitting.
Daisy-Chain Workflow
How It Works
Initial job: Builds system, runs equilibration, runs first production segment
Continuation jobs: Load checkpoint, run next segment
Dependencies: Each job depends on the previous one completing successfully
Job 1 (initial)         Job 2 (continue)        Job 3 (continue)
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│ Build       │         │ Load chkpt  │         │ Load chkpt  │
│ Equilibrate │   -->   │ Run seg 1   │   -->   │ Run seg 2   │
│ Run seg 0   │         │ Save chkpt  │         │ Save chkpt  │
└─────────────┘         └─────────────┘         └─────────────┘
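polyzymd submit sets up these dependencies for you. To illustrate the idea, the sketch below chains scripts by hand using sbatch's --parsable and --dependency=afterok options. submit_chain and the SBATCH override variable are hypothetical names introduced here, and this is not necessarily how PolyzyMD implements the chain internally.

```shell
# Hypothetical helper, not part of PolyzyMD: submit scripts in order so that
# each job starts only if the previous one finished successfully.
# Set SBATCH to another command (for example a stub) to rehearse the chain.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(${SBATCH:-sbatch} --parsable "$script") || return 1
        else
            jobid=$(${SBATCH:-sbatch} --parsable \
                --dependency="afterok:$prev" "$script") || return 1
        fi
        jobid=${jobid%%;*}    # --parsable may print "jobid;cluster"
        echo "$script -> job $jobid"
        prev=$jobid
    done
}

# Example (requires a SLURM cluster):
# submit_chain job_scripts/initial_seg0_rep1.sh \
#              job_scripts/continue_seg1_rep1.sh \
#              job_scripts/continue_seg2_rep1.sh
```

With afterok, a job whose predecessor fails never starts, which matches the behavior described in the dependency step above.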
Configuring Segments
simulation_phases:
production:
duration: 100.0 # 100 ns total
segments: 10 # 10 segments of 10 ns each
Tip
Segment duration = total duration / segments
Choose segment duration to fit within your cluster’s time limit with margin:
24h limit → ~8-10 ns segments (2h GPU time + overhead)
48h limit → ~20 ns segments
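The arithmetic behind this tip can be sketched in shell. The helpers below are hypothetical, not PolyzyMD commands, and the throughput of 120 ns/day is a placeholder assumption. Measure the real value for your system from a short test run.

```shell
# Hypothetical helpers for sizing segments (not part of PolyzyMD).

# Segment duration in ns = total duration / number of segments.
segment_ns() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", total / n }'
}

# Estimated wall-clock hours for one segment at a given throughput (ns/day).
segment_hours() {
    awk -v seg="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", seg / rate * 24 }'
}

# Succeeds if hours * (1 + margin) fits within the limit (default margin 0.2).
fits_limit() {
    awk -v h="$1" -v limit="$2" -v margin="${3:-0.2}" \
        'BEGIN { exit !(h * (1 + margin) <= limit) }'
}

seg=$(segment_ns 100.0 10)          # 10.00 ns per segment
hours=$(segment_hours "$seg" 120)   # 2.00 h at the assumed 120 ns/day
if fits_limit "$hours" 24; then
    echo "A ${seg} ns segment (~${hours} h) fits within a 24 h limit"
fi
```

The margin leaves room for system building, equilibration, and queue-side overhead, none of which the throughput figure accounts for.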
Submitting Jobs
Dry Run (Recommended First)
Generate scripts without submitting:
polyzymd submit -c config.yaml \
--replicates 1-3 \
--preset aa100 \
--dry-run
Inspect generated scripts:
cat job_scripts/initial_seg0_rep1.sh
Submit for Real
polyzymd submit -c config.yaml \
--replicates 1-3 \
--preset aa100 \
--email your.email@university.edu
Replicate Specification
# Single replicate
--replicates 1
# Range
--replicates 1-5
# Specific replicates
--replicates 1,3,5,7
# Combined
--replicates 1-3,5,7-10
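To see which replicates a specification selects, you can expand it by hand. The following POSIX shell sketch is a hypothetical helper that mirrors the syntax shown above. It is not PolyzyMD's own parser and does no input validation.

```shell
# Hypothetical helper (not PolyzyMD's parser): expand a replicate
# specification such as "1-3,5,7-10" into a space-separated list.
expand_replicates() {
    spec=$1
    result=""
    old_ifs=$IFS
    IFS=,
    set -- $spec            # split the specification on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) start=${part%-*}; end=${part#*-} ;;   # range "a-b"
            *)   start=$part; end=$part ;;             # single value
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            result="$result $i"
            i=$(( $i + 1 ))
        done
    done
    echo "${result# }"
}

expand_replicates 1-3,5,7-10   # prints 1 2 3 5 7 8 9 10
```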
Monitoring Jobs
Check Job Status
# All your jobs
squeue -u $USER
# Specific job
scontrol show job <job_id>
# Watch jobs update
watch -n 30 'squeue -u $USER'
View Job Output
# Real-time output
tail -f slurm_logs/s0_r1_300K_LipA*.out
# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out
Check Simulation Progress
# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd
# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd
Handling Failures
Job Failed Mid-Segment
If a job fails, the dependent jobs won’t start. To restart:
Check the error:
cat slurm_logs/s2_r1_300K_*.out
Fix the issue (if possible)
Manually continue:
# Edit and resubmit the continuation script
sbatch job_scripts/continue_seg2_rep1.sh
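Before resubmitting, it helps to know which segment to resume from. The sketch below scans a run directory for production_segN/checkpoint.chk files, following the directory layout shown earlier in this guide. last_segment_with_checkpoint is a hypothetical helper, and a checkpoint file only shows that the segment wrote a checkpoint, not that it finished cleanly, so confirm against the logs.

```shell
# Hypothetical helper (not part of PolyzyMD): print the highest segment index
# whose production_segN directory contains checkpoint.chk, or -1 if none do.
last_segment_with_checkpoint() {
    run_dir=$1
    last=-1
    for chk in "$run_dir"/production_seg*/checkpoint.chk; do
        [ -e "$chk" ] || continue            # glob matched nothing
        seg_dir=${chk%/checkpoint.chk}
        idx=${seg_dir##*production_seg}      # numeric suffix of the directory
        if [ "$idx" -gt "$last" ]; then
            last=$idx
        fi
    done
    echo "$last"
}

# Example (path is illustrative):
# last=$(last_segment_with_checkpoint "/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1")
# echo "Resume with: sbatch job_scripts/continue_seg$(( last + 1 ))_rep1.sh"
```

The comparison is numeric, so production_seg10 sorts after production_seg2 even though the shell lists them in lexical order.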
Start Fresh
To restart a simulation from scratch:
# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/
# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100
Generated Script Structure
Initial Script
The initial job script (segment 0) builds the system, runs equilibration, and runs the first production segment:
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=i_s0_r1_300K_LipA
#SBATCH --output=slurm_logs/s0_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1
# Exit immediately if any command fails
set -e
module purge
module load miniforge
mamba activate polymerist-env
# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"
# Scratch directory (simulation output)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"
# Ensure scratch directory exists
mkdir -p "$SCRATCH_DIR"
# Change to projects directory where config lives
cd "$PROJECTS_DIR"
echo "Starting initial simulation segment 0"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Config: config.yaml"
echo "Replicate: 1"
echo "Timestamp: $(date)"
# Run the initial simulation using polyzymd CLI
polyzymd run -c "config.yaml" \
--replicate 1 \
--scratch-dir "$SCRATCH_DIR" \
--segment-time 10.0 \
--segment-frames 250
echo "Segment 0 completed successfully at $(date)"
Continuation Script
Continuation scripts load the checkpoint from the previous segment and continue the simulation:
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=c_s1_r1_300K_LipA
#SBATCH --output=slurm_logs/s1_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1
# Exit immediately if any command fails
set -e
module purge
module load miniforge
mamba activate polymerist-env
# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"
# Scratch directory (simulation output - where previous segment data lives)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"
# Change to projects directory
cd "$PROJECTS_DIR"
echo "Starting continuation segment 1"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Timestamp: $(date)"
# Continue simulation from previous segment using polyzymd CLI
polyzymd continue \
-w "$SCRATCH_DIR" \
-s 1 \
-t 10.0 \
-n 250
echo "Segment 1 completed successfully at $(date)"
Best Practices
1. Always Test First
# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run
# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
--preset testing \
--time-limit 0:02:00 \
--replicates 1
# Or a slightly longer test
polyzymd submit -c config.yaml \
--preset testing \
--time-limit 0:05:00 \
--replicates 1
2. Monitor Early Segments
Watch the first segment complete to catch issues early:
tail -f slurm_logs/s0_*.out
3. Back Up Important Data
Scratch is often purged. Copy completed simulations to projects:
# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
/projects/$USER/completed_simulations/
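If you want to confirm the copy before deleting anything from scratch, one option is to compare the two trees afterwards. backup_run below is a hypothetical helper sketched under that assumption. For large trajectories a tool such as rsync may be a better fit on your cluster.

```shell
# Hypothetical helper (not part of PolyzyMD): copy a run directory into a
# destination root, then verify the copy with a recursive diff.
backup_run() {
    src=${1%/}                       # drop a trailing slash, if any
    dest_root=$2
    mkdir -p "$dest_root" || return 1
    cp -r "$src" "$dest_root"/ || return 1
    # Succeeds only if source and copy have identical contents
    diff -r "$src" "$dest_root/$(basename "$src")" >/dev/null
}

# Example (paths are illustrative):
# backup_run "/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1" \
#            "/projects/$USER/completed_simulations" \
#   && echo "Backup verified"
```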
4. Use Email Notifications
polyzymd submit -c config.yaml --email you@university.edu
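SLURM sends job notifications to this address. The generated scripts shown in this guide set --mail-type=FAIL, so with those settings you are emailed only when a job fails.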
5. Segment Duration Guidelines
| Cluster Time Limit | Recommended Segment Duration |
|---|---|
| 1 hour (testing) | 0.5 - 1 ns |
| 24 hours | 8 - 12 ns |
| 48 hours | 20 - 30 ns |
| 7 days | 50 - 100 ns |
CLI Reference
polyzymd submit
Submit daisy-chain simulation jobs to SLURM.
polyzymd submit -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicates RANGE - Replicate range (e.g., “1-5”, “1,3,5”). Default: “1”
--preset PRESET - SLURM preset: aa100, al40, blanca-shirts, testing. Default: aa100
--scratch-dir PATH - Override scratch directory for simulation output
--projects-dir PATH - Override projects directory for scripts/logs
--output-dir PATH - Directory for job scripts. Default: {projects_dir}/job_scripts
--email EMAIL - Email for job notifications
--time-limit TIME - Override SLURM time limit (HH:MM:SS)
--memory SIZE - Override SLURM memory allocation (e.g., “4G”, “8G”). Default: 3G
--openff-logs - Enable verbose OpenFF logs in generated job scripts (for debugging)
--dry-run - Generate scripts without submitting
polyzymd run
Run a complete simulation (build + equilibration + first production segment).
polyzymd run -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Scratch directory for simulation output
--projects-dir PATH - Projects directory for scripts/logs
--segment-time FLOAT - Override production time per segment (ns)
--segment-frames INT - Override frames per segment
--skip-build - Skip system building (use existing)
polyzymd continue
Continue a simulation from a previous segment checkpoint.
polyzymd continue -w WORKING_DIR -s SEGMENT -t TIME [OPTIONS]
Required:
-w, --working-dir PATH - Working directory with previous segment
-s, --segment INT - Segment index to run (1-based)
-t, --segment-time FLOAT - Duration of this segment (ns)
Options:
-n, --num-samples INT - Number of frames to save. Default: 250
Troubleshooting
“Job pending forever”
squeue -u $USER
# Check REASON column
Common reasons:
Resources - Waiting for GPUs
Priority - Queue is busy
Dependency - Waiting for previous job
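With many replicates queued, a per-reason count is easier to read than the full squeue listing. The sketch below tallies reason strings read from standard input. summarize_reasons is a hypothetical helper, and the squeue invocation in the comment uses the %r (reason) format field.

```shell
# Hypothetical helper: count how many pending jobs share each reason.
# Reads one reason per line on standard input.
summarize_reasons() {
    awk 'NF { count[$0]++ } END { for (r in count) print r, count[r] }' | sort
}

# Example (requires a SLURM cluster):
# squeue -h -u "$USER" -t PENDING -o '%r' | summarize_reasons
```

A large Dependency count is expected for daisy-chained jobs, since every continuation job waits on the segment before it.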
“Module not found in job”
If you see errors like “Lmod has detected the following error: The following module(s) are unknown”, the cluster may have different module configurations on different partitions.
Check available modules:
module spider miniforge
module spider anaconda
You may need to edit the generated scripts to use a different module:
# Instead of:
module load miniforge
mamba activate polymerist-env
# Try:
module load anaconda
conda activate polymerist-env
# Or source conda directly:
source /curc/sw/anaconda3/latest/etc/profile.d/conda.sh
conda activate polymerist-env
“Out of memory”
There are two types of out-of-memory errors:
GPU Memory (CUDA OOM):
CUDA out of memory
Reduce system size:
Decrease box.padding
Use fewer polymers
Use smaller production samples (fewer frames saved)
System Memory (SLURM OOM):
slurmstepd: error: Detected 1 oom_kill event in StepId=...
The job exceeded its RAM allocation. This often occurs during energy minimization when loading large systems onto the GPU.
Solution: Increase memory with the --memory flag:
# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G
# For very large systems
polyzymd submit -c config.yaml --memory 8G
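To tell the two cases apart across many log files, you can search for the two messages quoted above. classify_oom is a hypothetical helper that matches only those strings, so other failure messages are reported as none.

```shell
# Hypothetical helper: classify a SLURM log by the kind of out-of-memory
# message it contains ("system", "gpu", or "none").
classify_oom() {
    if grep -q 'oom_kill event' "$1"; then
        echo system     # SLURM killed the job for exceeding its RAM allocation
    elif grep -qi 'CUDA out of memory' "$1"; then
        echo gpu        # the GPU ran out of memory
    else
        echo none
    fi
}

# Example:
# for f in slurm_logs/*.out; do
#     echo "$f: $(classify_oom "$f")"
# done
```

A result of system points to the --memory flag, while gpu points to the system-size reductions listed for CUDA errors.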
“GPU not detected”
Check:
nvidia-smi # In job script
Make sure #SBATCH --gres=gpu:1 is present.
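If you have edited the generated scripts by hand, a quick scan can confirm that none of them lost the GPU request. scripts_missing_gres is a hypothetical helper that assumes the directive is written as in the generated scripts shown in this guide.

```shell
# Hypothetical helper: list job scripts in a directory that do not contain
# an "#SBATCH --gres=gpu" directive.
scripts_missing_gres() {
    for script in "$1"/*.sh; do
        [ -e "$script" ] || continue     # no scripts in the directory
        grep -q '^#SBATCH --gres=gpu' "$script" || echo "$script"
    done
}

# Example:
# scripts_missing_gres job_scripts
```

No output means every script requests a GPU.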
“config.yaml not found”
Make sure you’re running polyzymd submit from the directory containing your config.yaml, or use an absolute path:
polyzymd submit -c /full/path/to/config.yaml --preset aa100