HPC and SLURM Guide

This guide covers running PolyzyMD simulations on HPC clusters using SLURM.

Overview

Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with daisy-chaining: breaking simulations into segments that run as dependent SLURM jobs.

User Workflow

The typical workflow for running a PolyzyMD simulation on an HPC cluster is:

  1. Create a simulation directory with your configuration and input files

  2. Write a config.yaml file with your simulation parameters

  3. Generate job scripts using polyzymd submit --dry-run

  4. Review the generated scripts

  5. Submit the jobs for real

Step-by-Step Example

# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation

# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/

# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
#   enzyme.pdb_path: "structures/enzyme.pdb"
#   substrate.sdf_path: "structures/substrate.sdf"
#   polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"

# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run

# 5. Review the generated scripts
cat job_scripts/initial_seg0_rep1.sh

# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1

# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu

Directory Structure

PolyzyMD can keep two kinds of data in separate locations:

  • Projects directory: Long-term storage for scripts, logs, configs

  • Scratch directory: High-performance storage for trajectories

/projects/$USER/polyzymd/           # Long-term storage
├── my_simulation/                  # Your simulation directory
│   ├── config.yaml                 # Main configuration
│   ├── structures/                 # Input structure files
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── ATRP_EGPMA_SBMA_5-mer/     # Pre-built polymer SDFs (optional)
│   │   ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│   │   └── ...
│   ├── job_scripts/                # Generated SLURM scripts
│   │   ├── initial_seg0_rep1.sh
│   │   ├── continue_seg1_rep1.sh
│   │   └── ...
│   └── slurm_logs/                 # Job output files
│       └── s0_r1_300K_LipA.out

/scratch/alpine/$USER/polyzymd_sims/  # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│   ├── system.pdb
│   ├── equilibration/
│   │   └── trajectory.dcd
│   ├── production_seg0/
│   │   ├── trajectory.dcd
│   │   ├── checkpoint.chk
│   │   └── state_data.csv
│   └── production_seg1/
│       └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/

Configuring Directories

In YAML Configuration

Environment variables ($USER, $HOME, etc.) and ~ are automatically expanded:

output:
  projects_directory: "/projects/$USER/polyzymd/my_simulation"
  scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
  job_scripts_subdir: "job_scripts"
  slurm_logs_subdir: "slurm_logs"

You can also use ~ for your home directory:

output:
  projects_directory: "~/polyzymd/my_simulation"

Via CLI Override

polyzymd submit -c config.yaml \
    --projects-dir /projects/$USER/polyzymd \
    --scratch-dir /scratch/alpine/$USER/simulations \
    --replicates 1-5

SLURM Presets

PolyzyMD includes presets for common HPC configurations:

Preset          Partition       GPUs      Time Limit   Memory   Description
aa100           aa100           1x A100   24h          3GB      NVIDIA A100 (recommended)
al40            al40            1x L40    24h          3GB      NVIDIA L40
blanca-shirts   blanca-shirts   1x        24h          3GB      Shirts lab partition
testing         atesting_a100   1x        6min         3GB      Quick tests

Using Presets

# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100

# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing

Overriding Time Limit

You can override the preset’s time limit using --time-limit:

# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00

# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00

Time format options:

  • MM:SS - minutes and seconds (e.g., 2:00 for 2 minutes)

  • HH:MM:SS - hours, minutes, seconds (e.g., 0:02:00)

  • D-HH:MM:SS - days, hours, minutes, seconds (e.g., 1-00:00:00 for 1 day)

This is especially useful for:

  • Quick testing with short time limits

  • Adjusting for segment duration requirements

  • Working within specific QOS constraints
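To compare time limits written in different formats, it can help to convert them to seconds. The helper below is only an illustration (slurm_time_to_seconds is a hypothetical name, not part of PolyzyMD or SLURM), and it handles only the three forms listed above.

```shell
#!/bin/bash
# Convert a SLURM time string to seconds.
# Handles only MM:SS, HH:MM:SS and D-HH:MM:SS (hypothetical helper).
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $0
        has_days = (index(t, "-") > 0)
        if (has_days) { split(t, dp, "-"); days = dp[1]; t = dp[2] }
        n = split(t, f, ":")
        if (n == 3)                   { h = f[1]; m = f[2]; s = f[3] }
        else if (n == 2 && !has_days) { h = 0;    m = f[1]; s = f[2] }
        else                          { exit 1 }   # form not covered by this sketch
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 2:00         # 120
slurm_time_to_seconds 0:02:00      # 120
slurm_time_to_seconds 1-00:00:00   # 86400
```

The first two calls show that 2:00 and 0:02:00 describe the same 2-minute limit.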

Custom SLURM Settings

For settings not covered by a preset or a command-line flag, generate the scripts with --dry-run and edit them in job_scripts/ before submitting.


Daisy-Chain Workflow

How It Works

  1. Initial job: Builds system, runs equilibration, runs first production segment

  2. Continuation jobs: Load checkpoint, run next segment

  3. Dependencies: Each job depends on the previous one completing successfully

Job 1 (initial)     Job 2 (continue)    Job 3 (continue)
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Build       │     │ Load chkpt  │     │ Load chkpt  │
│ Equilibrate │ --> │ Run seg 1   │ --> │ Run seg 2   │
│ Run seg 0   │     │ Save chkpt  │     │ Save chkpt  │
└─────────────┘     └─────────────┘     └─────────────┘
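The chain in the diagram can be reproduced by hand with standard SLURM options. The sketch below is an illustration and not necessarily how polyzymd submit builds the chain internally: submit_chain is a hypothetical helper that submits each script with sbatch --parsable and makes it depend on the previous job through --dependency=afterok.

```shell
#!/bin/bash
# Sketch: submit job scripts as a SLURM dependency chain (hypothetical helper).
# Each script starts only after the previous job has finished successfully.
submit_chain() (
    prev_id=""
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            # First job in the chain has no dependency
            job_id=$(sbatch --parsable "$script") || exit 1
        else
            # afterok: run only if the previous job exited with code 0
            job_id=$(sbatch --parsable --dependency="afterok:$prev_id" "$script") || exit 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the id
        prev_id=${job_id%%;*}
        echo "$script -> job $prev_id"
    done
)

# Example (not run here):
#   submit_chain job_scripts/initial_seg0_rep1.sh job_scripts/continue_seg1_rep1.sh
```

With afterok, a failed segment leaves every later job in the chain pending rather than letting it run against a missing checkpoint.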

Configuring Segments

simulation_phases:
  production:
    duration: 100.0    # 100 ns total
  segments: 10         # 10 segments of 10 ns each

Tip

Segment duration = total duration / segments

Choose segment duration to fit within your cluster’s time limit with margin:

  • 24h limit → ~8-10 ns segments (2h GPU time + overhead)

  • 48h limit → ~20 ns segments
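The arithmetic behind this tip can be sketched in a few lines of shell. Both function names are hypothetical, and the throughput of 120 ns/day is an assumed benchmark figure chosen for illustration; measure your own system before relying on it.

```shell
#!/bin/bash
# Production time per segment (ns) = total duration / number of segments
segment_ns() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.1f\n", total / n }'
}

# Wall-clock hours for one segment at a given throughput (ns/day).
# The throughput is an input you must measure; it is not known to PolyzyMD here.
segment_hours() {
    awk -v seg="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", seg / rate * 24 }'
}

segment_ns 100 10       # 10.0  (100 ns split into 10 segments)
segment_hours 10 120    # 2.0   (10 ns at an assumed 120 ns/day)
```

Compare the estimated hours, plus build and equilibration time for the initial job, against the preset's time limit before choosing the number of segments.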


Submitting Jobs

Submit for Real

polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --email your.email@university.edu

Replicate Specification

# Single replicate
--replicates 1

# Range
--replicates 1-5

# Specific replicates
--replicates 1,3,5,7

# Combined
--replicates 1-3,5,7-10
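To make the combined form concrete, the sketch below expands a specification into the individual replicate numbers. It illustrates how the syntax reads and is not PolyzyMD's own parser; expand_replicates is a hypothetical name.

```shell
#!/bin/bash
# Expand a replicate spec such as "1-3,5,7-10" into individual numbers
# (hypothetical helper, for illustration only).
expand_replicates() (
    IFS=','
    for part in $1; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # range: start-end
            *)   echo "$part" ;;                    # single replicate
        esac
    done | paste -s -d ' ' -
)

expand_replicates 1-3,5,7-10    # 1 2 3 5 7 8 9 10
```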

Monitoring Jobs

Check Job Status

# All your jobs
squeue -u $USER

# Specific job
scontrol show job <job_id>

# Watch jobs update
watch -n 30 'squeue -u $USER'

View Job Output

# Real-time output
tail -f slurm_logs/s0_r1_300K_LipA*.out

# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out

Check Simulation Progress

# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd

# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd
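A quick way to see how far a replicate has progressed is to count the production segments that contain a checkpoint. The sketch below assumes the directory layout shown earlier (production_segN/checkpoint.chk) and treats the presence of a checkpoint as a sign that the segment has written restart data; whether the file appears only at the end of a segment is an assumption. The function name is hypothetical.

```shell
#!/bin/bash
# Count production_seg*/ directories under a simulation directory
# that contain a checkpoint.chk file (hypothetical helper).
count_checkpointed_segments() (
    count=0
    for seg in "$1"/production_seg*/; do
        if [ -f "${seg}checkpoint.chk" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
)

# Example (not run here), with SIM_DIR pointing at one replicate's scratch directory:
#   count_checkpointed_segments "$SIM_DIR"
```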

Handling Failures

Job Failed Mid-Segment

If a job fails, the dependent jobs won’t start. To restart:

  1. Check the error:

    cat slurm_logs/s2_r1_300K_*.out
    
  2. Fix the issue (if possible)

  3. Manually continue:

    # Edit and resubmit the continuation script
    sbatch job_scripts/continue_seg2_rep1.sh
    

Start Fresh

To restart a simulation from scratch:

# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/

# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100
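When a job in a chain fails, its dependents typically remain in the queue as pending jobs whose dependency can never be met, so it is worth clearing them before resubmitting. The sketch below lists such jobs using standard squeue options; stuck_jobs is a hypothetical helper, and the exact reason string may vary with your SLURM version.

```shell
#!/bin/bash
# List IDs of your pending jobs whose dependency can no longer be satisfied
# (hypothetical helper; uses squeue format codes %i = job id, %r = reason).
stuck_jobs() {
    squeue -h -u "$USER" -t PENDING -o "%i %r" |
        awk '$2 == "DependencyNeverSatisfied" { print $1 }'
}

# Example (not run here): review the list, then cancel those jobs
#   stuck_jobs
#   scancel $(stuck_jobs)
```

Review the printed IDs before cancelling, since scancel acts immediately.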

Generated Script Structure

Initial Script

The initial job script (segment 0) builds the system, runs equilibration, and runs the first production segment:

#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=i_s0_r1_300K_LipA
#SBATCH --output=slurm_logs/s0_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1

# Exit immediately if any command fails
set -e

module purge
module load miniforge
mamba activate polymerist-env

# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"

# Scratch directory (simulation output)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

# Ensure scratch directory exists
mkdir -p "$SCRATCH_DIR"

# Change to projects directory where config lives
cd "$PROJECTS_DIR"

echo "Starting initial simulation segment 0"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Config: config.yaml"
echo "Replicate: 1"
echo "Timestamp: $(date)"

# Run the initial simulation using polyzymd CLI
polyzymd run -c "config.yaml" \
    --replicate 1 \
    --scratch-dir "$SCRATCH_DIR" \
    --segment-time 10.0 \
    --segment-frames 250

echo "Segment 0 completed successfully at $(date)"

Continuation Script

Continuation scripts load the checkpoint from the previous segment and continue the simulation:

#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=c_s1_r1_300K_LipA
#SBATCH --output=slurm_logs/s1_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1

# Exit immediately if any command fails
set -e

module purge
module load miniforge
mamba activate polymerist-env

# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"

# Scratch directory (simulation output - where previous segment data lives)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

# Change to projects directory
cd "$PROJECTS_DIR"

echo "Starting continuation segment 1"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Timestamp: $(date)"

# Continue simulation from previous segment using polyzymd CLI
polyzymd continue \
    -w "$SCRATCH_DIR" \
    -s 1 \
    -t 10.0 \
    -n 250

echo "Segment 1 completed successfully at $(date)"

Best Practices

1. Always Test First

# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run

# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:02:00 \
    --replicates 1

# Or a slightly longer test
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:05:00 \
    --replicates 1

2. Monitor Early Segments

Watch the first segment complete to catch issues early:

tail -f slurm_logs/s0_*.out

3. Back Up Important Data

Scratch is often purged. Copy completed simulations to projects:

# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
      /projects/$USER/completed_simulations/

4. Use Email Notifications

polyzymd submit -c config.yaml --email you@university.edu

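The generated scripts set --mail-type=FAIL, so you will receive an email when a job fails. To also be notified when jobs start or end, change that line in the scripts to --mail-type=BEGIN,END,FAIL (or ALL).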

5. Segment Duration Guidelines

Cluster Time Limit    Recommended Segment Duration
1 hour (testing)      0.5 - 1 ns
24 hours              8 - 12 ns
48 hours              20 - 30 ns
7 days                50 - 100 ns


CLI Reference

polyzymd submit

Submit daisy-chain simulation jobs to SLURM.

polyzymd submit -c CONFIG [OPTIONS]

Required:

  • -c, --config PATH - Path to YAML configuration file

Options:

  • -r, --replicates RANGE - Replicate range (e.g., “1-5”, “1,3,5”). Default: “1”

  • --preset PRESET - SLURM preset: aa100, al40, blanca-shirts, testing. Default: aa100

  • --scratch-dir PATH - Override scratch directory for simulation output

  • --projects-dir PATH - Override projects directory for scripts/logs

  • --output-dir PATH - Directory for job scripts. Default: {projects_dir}/job_scripts

  • --email EMAIL - Email for job notifications

  • --time-limit TIME - Override SLURM time limit (MM:SS, HH:MM:SS, or D-HH:MM:SS)

  • --memory SIZE - Override SLURM memory allocation (e.g., “4G”, “8G”). Default: 3G

  • --openff-logs - Enable verbose OpenFF logs in generated job scripts (for debugging)

  • --dry-run - Generate scripts without submitting

polyzymd run

Run a complete simulation (build + equilibration + first production segment).

polyzymd run -c CONFIG [OPTIONS]

Required:

  • -c, --config PATH - Path to YAML configuration file

Options:

  • -r, --replicate INT - Replicate number. Default: 1

  • --scratch-dir PATH - Scratch directory for simulation output

  • --projects-dir PATH - Projects directory for scripts/logs

  • --segment-time FLOAT - Override production time per segment (ns)

  • --segment-frames INT - Override frames per segment

  • --skip-build - Skip system building (use existing)

polyzymd continue

Continue a simulation from a previous segment checkpoint.

polyzymd continue -w WORKING_DIR -s SEGMENT -t TIME [OPTIONS]

Required:

  • -w, --working-dir PATH - Working directory with previous segment

  • -s, --segment INT - Segment index to run (segment 0 is the initial run, so continuation segments start at 1)

  • -t, --segment-time FLOAT - Duration of this segment (ns)

Options:

  • -n, --num-samples INT - Number of frames to save. Default: 250


Troubleshooting

“Job pending forever”

squeue -u $USER
# Check REASON column

Common reasons:

  • Resources - Waiting for GPUs

  • Priority - Queue is busy

  • Dependency - Waiting for previous job

“Module not found in job”

If you see errors like “Lmod has detected the following error: The following module(s) are unknown”, the cluster may have different module configurations on different partitions.

Check available modules:

module spider miniforge
module spider anaconda

You may need to edit the generated scripts to use a different module:

# Instead of:
module load miniforge
mamba activate polymerist-env

# Try:
module load anaconda
conda activate polymerist-env

# Or source conda directly:
source /curc/sw/anaconda3/latest/etc/profile.d/conda.sh
conda activate polymerist-env

“Out of memory”

There are two types of out-of-memory errors:

GPU Memory (CUDA OOM):

CUDA out of memory

Reduce system size:

  • Decrease box.padding

  • Use fewer polymers

  • Use smaller production samples (fewer frames saved)

System Memory (SLURM OOM):

slurmstepd: error: Detected 1 oom_kill event in StepId=...

The job exceeded its RAM allocation. This often occurs during energy minimization when loading large systems onto the GPU.

Solution: Increase memory with the --memory flag:

# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G

# For very large systems
polyzymd submit -c config.yaml --memory 8G
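To pick a sensible value, it helps to check how much memory a finished or failed job actually used. The sketch below reads the MaxRSS accounting field with sacct and reports the largest value across job steps in megabytes. It assumes job accounting is enabled on your cluster; peak_rss_mb is a hypothetical helper, and the exact number formatting sacct prints may differ between SLURM versions.

```shell
#!/bin/bash
# Report the peak resident memory (MB) recorded for a job
# (hypothetical helper; requires SLURM job accounting).
peak_rss_mb() {
    sacct -j "$1" --noheader --parsable2 --units=M --format=MaxRSS |
        awk '{ sub(/M$/, ""); v = $0 + 0; if (v > max) max = v }
             END { printf "%.0f\n", max }'
}

# Example (not run here):
#   peak_rss_mb <job_id>
```

If the reported peak is close to the 3G default, requesting more memory with --memory is a reasonable next step.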

“GPU not detected”

Check:

nvidia-smi  # In job script

Make sure #SBATCH --gres=gpu:1 is present.

“config.yaml not found”

Make sure you’re running polyzymd submit from the directory containing your config.yaml, or use an absolute path:

polyzymd submit -c /full/path/to/config.yaml --preset aa100