HPC and SLURM Guide

This guide covers running PolyzyMD simulations on HPC clusters using SLURM.

Overview

Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with self-resubmitting jobs: each SLURM job runs a single simulation segment, checks whether work remains, and resubmits itself if not finished. This approach is simpler and more robust than dependency chains — every job is identical, and the simulation resumes correctly after wall-time limits, preemptions, or node failures.

User Workflow

The typical workflow for running a PolyzyMD simulation on an HPC cluster is:

Create a simulation directory with your configuration and input files
Write a config.yaml file with your simulation parameters
Generate job scripts using polyzymd submit --dry-run
Review the generated scripts
Submit the jobs for real

Step-by-Step Example

# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation

# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/

# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
#   enzyme.pdb_path: "structures/enzyme.pdb"
#   substrate.sdf_path: "structures/substrate.sdf"
#   polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"

# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run

# 5. Review the generated scripts
cat job_scripts/r1_300K_LipA.sh

# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1

# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu

Directory Structure

PolyzyMD supports separating:

Projects directory: Long-term storage for scripts, logs, configs
Scratch directory: High-performance storage for trajectories

/projects/$USER/polyzymd/           # Long-term storage
├── my_simulation/                  # Your simulation directory
│   ├── config.yaml                 # Main configuration
│   ├── structures/                 # Input structure files
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── ATRP_EGPMA_SBMA_5-mer/     # Pre-built polymer SDFs (optional)
│   │   ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│   │   └── ...
│   ├── job_scripts/                # Generated SLURM scripts
│   │   ├── r1_300K_LipA.sh        # One script per replicate
│   │   ├── r2_300K_LipA.sh
│   │   └── ...
│   └── slurm_logs/                 # Job output files
│       └── r1_300K_LipA.12345.out

/scratch/alpine/$USER/polyzymd_sims/  # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│   ├── system.pdb
│   ├── progress.json               # Progress tracking file
│   ├── equilibration/
│   │   └── trajectory.dcd
│   ├── production_0/
│   │   ├── trajectory.dcd
│   │   ├── checkpoint.chk
│   │   └── state_data.csv
│   └── production_1/
│       └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/

Configuring Directories

In YAML Configuration

Environment variables ($USER, $HOME, etc.) and ~ are automatically expanded:

output:
  projects_directory: "/projects/$USER/polyzymd/my_simulation"
  scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
  job_scripts_subdir: "job_scripts"
  slurm_logs_subdir: "slurm_logs"

You can also use ~ for home directory:

output:
  projects_directory: "~/polyzymd/my_simulation"
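To preview what a configured path resolves to on a given login node, the same expansion can be reproduced with the Python standard library. This is a minimal sketch, not PolyzyMD's own implementation; expand_path is a hypothetical helper name.

```python
import os


def expand_path(path: str) -> str:
    """Expand environment variables first, then a leading ~."""
    return os.path.expanduser(os.path.expandvars(path))


if __name__ == "__main__":
    # Prints the resolved form of the two example paths from this section
    print(expand_path("/projects/$USER/polyzymd/my_simulation"))
    print(expand_path("~/polyzymd/my_simulation"))
```

Variables that are not set in the environment are left untouched by os.path.expandvars, so an unexpanded $NAME in the output usually points to a missing variable.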

Via CLI Override

polyzymd submit -c config.yaml \
    --projects-dir /projects/$USER/polyzymd \
    --scratch-dir /scratch/alpine/$USER/simulations \
    --replicates 1-5

SLURM Presets

PolyzyMD includes presets for common HPC configurations:

Preset	Partition	GPUs	Time Limit	Memory	Description
`aa100`	aa100	1x A100	24h	3GB	CU Boulder Alpine — NVIDIA A100
`al40`	al40	1x L40	24h	3GB	CU Boulder Alpine — NVIDIA L40
`blanca-shirts`	blanca-shirts	1x	24h	3GB	CU Boulder Blanca — Shirts lab partition
`testing`	atesting_a100	1x	6min	3GB	CU Boulder Alpine — quick tests
`bridges2`	GPU-shared	1x V100-32	24h	(per-GPU)	PSC Bridges2 — NVIDIA V100 32GB

Using Presets

# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100

# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing

Overriding Time Limit

You can override the preset’s time limit using --time-limit:

# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00

# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00

Time format options:

MM:SS - minutes and seconds (e.g., 2:00 for 2 minutes)
HH:MM:SS - hours, minutes, seconds (e.g., 0:02:00)
D-HH:MM:SS - days, hours, minutes, seconds (e.g., 1-00:00:00 for 1 day)

This is especially useful for:

Quick testing with short time limits
Adjusting for segment duration requirements
Working within specific QOS constraints
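When choosing a --time-limit, it can help to convert the formats listed above into seconds, for example to compare a limit against the 300-second wall-time warning described later in this guide. The sketch below covers only the three formats listed here; slurm_time_to_seconds is a hypothetical helper, not part of the polyzymd CLI.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert MM:SS, HH:MM:SS, or D-HH:MM:SS into a number of seconds."""
    days = 0
    if "-" in value:
        # The day count is separated from the clock part by a dash
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 2 and days == 0:
        hours = 0
        minutes, seconds = parts
    elif len(parts) == 3:
        hours, minutes, seconds = parts
    else:
        raise ValueError(f"unsupported time format: {value!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for example in ("2:00", "0:02:00", "12:00:00", "1-00:00:00"):
        print(example, "->", slurm_time_to_seconds(example), "s")
```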

Custom SLURM Settings

For custom configurations, edit the generated scripts in job_scripts/ before submitting.

Bridges2 (PSC)

Bridges2 is the Pittsburgh Supercomputing Center (PSC) GPU cluster. It uses slightly different SLURM conventions than CU Boulder Alpine, and polyzymd handles these differences automatically via the bridges2 preset.

Key Differences from Alpine

Feature	Alpine (`aa100`)	Bridges2 (`bridges2`)
GPU directive	`--gres=gpu:N`	`--gpus=<type>:N`
Nodes/tasks	`--nodes=1` + `--ntasks=1`	`-N 1` (single line)
QoS	`--qos=normal`	(omitted — not used)
Memory	`--mem=3G`	(omitted — per-GPU allocation)
Account	ucb-group (in preset)	(omitted — inferred from login)
Env activation	`pixi shell-hook -e cuda-12-4`	`pixi shell-hook -e cuda-12-6`
Default time limit	24h	24h
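The resource-related rows of the table can be summarized in a few lines of code. The sketch below only mirrors the directives listed above (partition, account, and time limit are left out); resource_directives is a hypothetical function written for illustration, not the generator that polyzymd uses.

```python
def resource_directives(preset: str, gpu_type: str = "v100-32", gpus: int = 1) -> list:
    """Return the #SBATCH resource lines for the two directive styles in the table."""
    if preset == "bridges2":
        # Bridges2: single nodes directive, typed GPU request, no QoS and no --mem
        lines = ["-N 1", f"--gpus={gpu_type}:{gpus}"]
    else:
        # Alpine-style presets: explicit nodes/tasks, QoS, memory, and generic GPU request
        lines = ["--nodes=1", "--ntasks=1", "--qos=normal", "--mem=3G", f"--gres=gpu:{gpus}"]
    return [f"#SBATCH {line}" for line in lines]


if __name__ == "__main__":
    for name in ("aa100", "bridges2"):
        print(name)
        print("\n".join(resource_directives(name)))
```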

Account

Bridges2 infers the billing allocation from your login session, so no --account directive is emitted by default. If you have multiple allocations and need to charge a specific one, pass --account:

polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account chm250017p \
    --replicates 1-3 \
    --email collaborator@pitt.edu

Note

Unlike Alpine presets, Bridges2 scripts omit the #SBATCH --account= line entirely when no account is specified, so the --account CLI flag is optional on Bridges2. Alpine scripts always contain the line because the preset sets a group account.

GPU Type Selection

Bridges2 has multiple GPU types available. Use --gpu-type to select:

Flag value	GPU	VRAM
`v100-32` (default)	NVIDIA V100	32 GB
`v100-16`	NVIDIA V100	16 GB
`l40s-48`	NVIDIA L40S	48 GB
`h100-80`	NVIDIA H100	80 GB

# Default (V100 32GB) — good balance of availability and memory
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu

# High-memory GPU for large systems
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --gpu-type h100-80

Full Bridges2 Workflow

# 1. Dry run — inspect scripts before submitting
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --replicates 1-3 \
    --dry-run

# 2. Inspect the generated SBATCH directives
head -20 job_scripts/r1_300K_LipA.sh
# You should see:
#   #SBATCH --partition=GPU-shared
#   #SBATCH -N 1                    ← single-line nodes directive
#   #SBATCH --gpus=v100-32:1        ← type-specific GPU directive
#   (no --qos line)
#   (no --mem line)
#   #SBATCH --account=abc123_gpu    ← present because --account was passed (omitted otherwise)

# 3. Submit for real
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --replicates 1-3 \
    --email collaborator@pitt.edu

Bridges2 Directory Structure

On Bridges2, use Ocean storage for long-term data and local scratch for active simulations:

/ocean/projects/abc123_gpu/$USER/polyzymd/   # Long-term storage
├── my_simulation/
│   ├── config.yaml
│   ├── structures/
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── job_scripts/
│   │   ├── r1_300K_LipA.sh
│   │   └── ...
│   └── slurm_logs/

/local/scratch/$USER/polyzymd_sims/          # High-performance local scratch
├── LipA_Substrate_300K_run1/
│   ├── system.pdb
│   ├── progress.json
│   ├── equilibration/
│   └── production_0/

Set these paths in your config.yaml:

output:
  projects_directory: "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation"
  scratch_directory: "/local/scratch/$USER/polyzymd_sims"

Or override on the CLI:

polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --projects-dir "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation" \
    --scratch-dir "/local/scratch/$USER/polyzymd_sims"

Self-Resubmitting Job Model

How It Works

PolyzyMD generates one identical SLURM script per replicate. Each job:

Calls polyzymd run-segment to run the next segment of work
After the segment finishes (or is interrupted), calls polyzymd check-progress to see if the simulation is complete
If work remains, resubmits itself via sbatch "$THIS_SCRIPT"
If the simulation is complete, exits cleanly

  ┌───────────────────────────┐
  │  Job submits itself       │
  │                           │◄──────────────────┐
  │  1. run-segment           │                   │
  │     (build/eq/prod OR     │                   │
  │      continue from last)  │                   │
  │                           │                   │
  │  2. check-progress        │     resubmit      │
  │     └─ complete? exit 0   │                   │
  │     └─ work remains? ─────┼───────────────────┘
  │     └─ error? exit $RC    │
  └───────────────────────────┘

This model is simpler and more fault-tolerant than dependency chains:

No afterany dependencies — each job is independent
Automatic recovery — if a job is interrupted by wall-time or preemption, it saves a checkpoint, resubmits, and the next invocation picks up where it left off
Idempotent — each job scans the filesystem to determine what work remains, so the same script can be resubmitted manually at any time

Progress Tracking

PolyzyMD tracks simulation progress in a progress.json file in the working directory. This file records:

Which segments have completed
How many steps each segment ran
Whether any segments were interrupted
The total steps requested vs. completed

On startup, run-segment validates the progress file against the filesystem (checking for production_N/ directories and their contents) to ensure consistency.
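The kind of consistency check described above can be sketched as follows. The layout of progress.json is not documented in this guide, so the "segments", "index", and "status" keys are assumptions made for illustration, as is the missing_segments helper; only the production_N/ directory and trajectory.dcd names come from the directory layout shown earlier.

```python
import json
import tempfile
from pathlib import Path


def missing_segments(workdir: Path) -> list:
    """Return indices recorded as completed whose trajectory file is absent."""
    progress = json.loads((workdir / "progress.json").read_text())
    missing = []
    for seg in progress["segments"]:  # hypothetical schema
        traj = workdir / f"production_{seg['index']}" / "trajectory.dcd"
        if seg["status"] == "completed" and not traj.is_file():
            missing.append(seg["index"])
    return missing


def demo() -> list:
    """Build a throwaway working directory where segment 1 has no files on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "production_0").mkdir()
        (workdir / "production_0" / "trajectory.dcd").write_bytes(b"")
        record = {
            "segments": [
                {"index": 0, "status": "completed"},
                {"index": 1, "status": "completed"},
            ]
        }
        (workdir / "progress.json").write_text(json.dumps(record))
        return missing_segments(workdir)


if __name__ == "__main__":
    print("Segments missing on disk:", demo())
```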

Configuring Production Duration

Specify the total simulation time in your config. PolyzyMD determines segment boundaries automatically based on wall-time:

simulation_phases:
  production:
    duration: 100.0    # 100 ns total production time

Tip

You don’t need to configure segments manually. Each job runs as much production time as it can before the wall-time limit, checkpoints, and resubmits. The segment duration is determined at runtime by the --segment-time and --segment-frames options passed to run (which submit computes automatically from your config).

Smart Restart & Fault Tolerance

When you run 60 replicates, each resubmitting multiple times, robustness is critical. PolyzyMD’s smart restart system handles interruptions automatically. The generated scripts already include everything described below — you do not need to configure anything. This section explains what happens under the hood so you can debug issues or adapt the approach to other workflows.

What Happens Automatically

Every generated SLURM script includes three pieces of fault-tolerance infrastructure:

Signal handling — Python-side handlers catch SIGUSR1 (wall-time warning) and SIGTERM (preemption), save an interrupted checkpoint, and exit with code 99.
Signal forwarding — Bash trap + background + wait pattern forwards signals from the SLURM batch shell to the Python child process.
Progress tracking — After each segment (whether completed or interrupted), the progress file is updated so the next invocation knows exactly where to resume.
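The signal-handling piece can be illustrated with a small flag-based handler. This is a sketch of the general technique under simplifying assumptions, not PolyzyMD's code: InterruptFlag, run_chunks, and the save_state callback are hypothetical names, and the real handler writes OpenMM state files rather than calling a callback.

```python
import os
import signal

INTERRUPTED_EXIT_CODE = 99  # exit code this guide uses for graceful interruption


class InterruptFlag:
    """Records the first SIGUSR1 or SIGTERM received by the process."""

    def __init__(self):
        self.signum = None

    def install(self):
        for sig in (signal.SIGUSR1, signal.SIGTERM):
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only record that a signal arrived
        if self.signum is None:
            self.signum = signum


def run_chunks(chunks, flag, save_state):
    """Run work chunk by chunk, stopping cleanly once a signal has been seen."""
    for chunk in chunks:
        chunk()
        if flag.signum is not None:
            save_state()  # stand-in for writing the interrupted checkpoint
            return INTERRUPTED_EXIT_CODE
    return 0
```

Checking the flag between chunks, instead of doing the work inside the handler, keeps the save step out of signal context and makes the shutdown path easy to test.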

The Three Scenarios

Scenario	Signal	What Happens	Outcome
Wall-time warning	`SIGUSR1` (5 min before limit)	Interrupted state saved, progress updated, exit 99	Job resubmits and resumes
Preemption	`SIGTERM` (120 s grace on Blanca)	Interrupted state saved, progress updated, exit 99	Job resubmits and resumes
Hard crash	None (OOM, segfault, node failure)	No state saved	Job aborts without resubmitting (non-zero, non-99 exit code); after a manual resubmit, `run-segment` detects the incomplete segment and handles it

Note

The wall-time signal is configured via #SBATCH --signal=B:USR1@300, which tells SLURM to send SIGUSR1 to the batch shell 300 seconds (5 minutes) before the time limit expires. This gives the simulation enough time to save a full OpenMM state (~10-30 seconds on GPU).

Interrupted State Files

When an interrupt is detected, the signal handler writes three files into the current segment’s directory (e.g. production_3/):

File	Purpose
`interrupted_state.xml`	Portable OpenMM state (positions, velocities, forces)
`interrupted_system.xml`	Serialized OpenMM System (force field parameters)
`INTERRUPTED`	Marker file with step-count metadata for recovery

The INTERRUPTED marker contains the information needed for recovery:

segment_index=3
steps_completed=1250000
total_steps=2500000
remaining_steps=1250000
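A marker in this key=value form is straightforward to read back. The sketch below parses the example shown above and checks that the step counts add up; parse_marker is a hypothetical helper for inspecting a marker by hand, not the recovery code itself.

```python
def parse_marker(text: str) -> dict:
    """Parse key=value lines into integers and verify the step bookkeeping."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        fields[key] = int(value)
    if fields["steps_completed"] + fields["remaining_steps"] != fields["total_steps"]:
        raise ValueError("marker is inconsistent: completed + remaining != total")
    return fields


EXAMPLE = """\
segment_index=3
steps_completed=1250000
total_steps=2500000
remaining_steps=1250000
"""

if __name__ == "__main__":
    marker = parse_marker(EXAMPLE)
    done = 100.0 * marker["steps_completed"] / marker["total_steps"]
    print(f"segment {marker['segment_index']} is {done:.0f}% complete")
```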

Wall-Time Restart Checkpoints

In addition to signal-triggered saves, the simulation loop periodically writes portable restart checkpoints at a configurable wall-time interval (default: 60 seconds):

File	Purpose
`restart_state.xml`	Portable OpenMM state (overwritten each checkpoint)
`restart_system.xml`	Serialized OpenMM System (for self-contained recovery)

These files serve as a safety net: if the process is killed between signal delivery and the signal handler completing (e.g., the SLURM grace period expires), the restart checkpoint from the last 60-second interval is still on disk. Recovery prefers portable state XML files over binary .chk checkpoints, which is important on heterogeneous clusters where jobs may restart on different GPU hardware.

You can tune the checkpoint interval in your YAML config:

simulation_phases:
  production:
    duration: 100.0   # ns
    samples: 2500
    checkpoint_interval: 60.0  # seconds (default)

Set to 0 to disable wall-time checkpoints (not recommended on preemptible queues).
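The interval logic can be sketched as a small timer object that is polled from the simulation loop. This is an illustration under assumptions, not the actual implementation: WallTimeCheckpointer and its write callback are hypothetical, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class WallTimeCheckpointer:
    """Calls write() whenever at least interval_s seconds of wall time have passed."""

    def __init__(self, interval_s=60.0, write=lambda: None, clock=time.monotonic):
        self.interval_s = interval_s
        self.write = write
        self.clock = clock
        self.last = clock()

    def maybe_write(self) -> bool:
        # An interval of 0 (or less) disables wall-time checkpoints entirely
        if self.interval_s <= 0:
            return False
        now = self.clock()
        if now - self.last < self.interval_s:
            return False
        self.write()  # stand-in for overwriting restart_state.xml
        self.last = now
        return True
```

Using a monotonic clock rather than the system time avoids spurious or skipped checkpoints when the node's clock is adjusted during a long run.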

Adaptive Sub-Chunking

OpenMM’s simulation.step(N) blocks Python for the entire call. With large report intervals (e.g., 200,000 steps), each call can take ~2 minutes on a slow system, leaving no opportunity to check for SIGTERM. PolyzyMD solves this with adaptive sub-chunking: after the first checkpoint interval, it measures actual simulation speed and divides the loop into smaller chunks (~15 seconds each), ensuring the signal flag is checked ~4 times per checkpoint interval. This is transparent — reporters still fire at the original interval, and sub-chunk overhead is negligible (<0.001%).
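The chunk-size arithmetic can be sketched as follows. The 15-second target comes from the paragraph above; plan_sub_chunks and its parameter names are hypothetical, and the real code measures speed at runtime rather than taking it as an argument.

```python
def plan_sub_chunks(report_interval_steps: int, steps_per_second: float,
                    target_chunk_s: float = 15.0) -> list:
    """Split one report interval into chunks of roughly target_chunk_s seconds."""
    chunk = round(steps_per_second * target_chunk_s)
    # Never exceed the report interval, and always advance by at least one step
    chunk = max(1, min(report_interval_steps, chunk))
    full, rest = divmod(report_interval_steps, chunk)
    return [chunk] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # A system running at 2000 steps/s with a 200,000-step report interval
    chunks = plan_sub_chunks(200_000, 2000.0)
    print(len(chunks), "chunks, largest", max(chunks), "steps")
```

Because the chunks always sum to the report interval, reporters still fire at the original step counts; only the number of step() calls in between changes.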

Signal Forwarding: Why trap + background + wait?

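SLURM sends signals to the batch shell process, not to child processes. Bash does not forward signals to its children: without a trap, SIGUSR1 terminates the shell, and with a trap, bash defers running it until the foreground command it is waiting on has finished. Either way, the Python simulation never sees the signal in time. The generated scripts use a standard pattern to solve this: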

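# Forward a trapped signal to the Python child, if it is still running
CHILD_PID=""
forward_signal() {
    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
        kill -"$1" "$CHILD_PID"
    fi
}

# Install the traps before launching the child so no signal is missed
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM

# Background the Python process so the shell can run traps while it waits
polyzymd run-segment -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR" &
CHILD_PID=$!

# Wait in a loop (wait returns early each time a trapped signal arrives)
wait "$CHILD_PID"
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
    wait "$CHILD_PID"
    RC=$?
done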

Warning

Do not remove the trap, backgrounding (&), or wait loop from the generated scripts. Without them, signals will not reach the Python process and graceful shutdown will not work.
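For readers adapting the approach to a Python wrapper instead of a bash script, the same forward-and-wait idea can be sketched with the standard library. This is an analogue written for illustration, not what polyzymd generates: run_with_forwarding is a hypothetical name, and the child is a stand-in process that exits with code 99 on SIGUSR1.

```python
import os
import signal
import subprocess
import sys

# Stand-in for the simulation: exits with 99 when it receives SIGUSR1.
CHILD_CODE = """
import signal, sys, time
signal.signal(signal.SIGUSR1, lambda signum, frame: sys.exit(99))
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def run_with_forwarding(trigger=None, timeout_s=30.0):
    """Launch the child, forward SIGUSR1 to it, and return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
    )

    def forward(signum, frame):
        child.send_signal(signum)

    previous = signal.signal(signal.SIGUSR1, forward)
    try:
        child.stdout.readline()  # block until the child has installed its handler
        if trigger is not None:
            trigger()  # e.g. a test hook that signals this process
        return child.wait(timeout=timeout_s)
    finally:
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()
```

Calling run_with_forwarding with a trigger that sends SIGUSR1 to the parent exercises the whole path: the parent's handler relays the signal, the child shuts down, and the parent collects the child's exit code.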

Manually Triggering an Interrupt

You can test graceful shutdown or manually stop a running simulation by sending SIGUSR1 via scancel:

# Send USR1 to a specific job
scancel --signal=USR1 <job_id>

# The job will save interrupted state, update progress, and exit with code 99
# The resubmission logic then resubmits the job to continue

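This is useful when you realize a simulation has a problem and want to stop it cleanly without losing progress. If the job shows no reaction, the signal may not have reached the batch shell; the --batch and --full options in the scancel documentation control whether the batch script is signalled.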

Cancelling a Job Permanently

Because scancel sends SIGTERM, and our scripts treat SIGTERM the same as SIGUSR1 (save state, exit 99, resubmit), a plain scancel <job_id> will not permanently stop a simulation — the job will save its state and resubmit itself.

To truly cancel a simulation so it does not restart, use --signal=KILL:

# Permanently stop a job (no state saved, no resubmission)
scancel --signal=KILL <job_id>

# Equivalent shorthand
scancel -s KILL <job_id>

SIGKILL cannot be caught or trapped by bash or Python, so the process dies immediately and the resubmission logic never runs. The last completed segment’s checkpoint is still intact — only in-progress work since the last checkpoint is lost.

Command	Saves state?	Resubmits?	Use case
`scancel <job_id>`	Yes	Yes	Don’t use this to permanently stop a simulation
`scancel --signal=USR1 <job_id>`	Yes	Yes	Graceful stop (saves progress, continues later)
`scancel --signal=KILL <job_id>`	No	No	Permanently cancel a simulation

Tip

If you need to cancel all replicates of a simulation, cancel them all at once so that none resubmit before you can cancel the others:

scancel --signal=KILL <job_id_1> <job_id_2> <job_id_3>

Or cancel all your jobs:

scancel --signal=KILL -u $USER

Manual Recovery

If a simulation is stalled (e.g., the SLURM job exited without resubmitting), use the recover command to inspect status and optionally resume:

# Check status only
polyzymd recover -c config.yaml -r 1

# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts

# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run

See the CLI Reference section below for full option details.

Submitting Jobs

Dry Run (Recommended First)

Generate scripts without submitting:

polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --dry-run

Inspect generated scripts:

cat job_scripts/r1_300K_LipA.sh

Submit for Real

polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --email your.email@university.edu

Replicate Specification

# Single replicate
--replicates 1

# Range
--replicates 1-5

# Specific replicates
--replicates 1,3,5,7

# Combined
--replicates 1-3,5,7-10
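The replicate syntax above expands to a plain list of integers. The following sketch shows one way to interpret it; parse_replicates is a hypothetical helper and may differ from how the CLI handles edge cases such as duplicates.

```python
def parse_replicates(spec: str) -> list:
    """Expand a spec such as '1-3,5,7-10' into a sorted list of replicate numbers."""
    result = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            result.update(range(start, end + 1))
        else:
            result.add(int(part))
    return sorted(result)


if __name__ == "__main__":
    for spec in ("1", "1-5", "1,3,5,7", "1-3,5,7-10"):
        print(spec, "->", parse_replicates(spec))
```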

Monitoring Jobs

Check Job Status

# All your jobs
squeue -u $USER

# Specific job
scontrol show job <job_id>

# Watch jobs update
watch -n 30 'squeue -u $USER'

View Job Output

# Real-time output
tail -f slurm_logs/r1_300K_LipA*.out

# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out

Check Simulation Progress

Use the check-progress command to query the progress file:

# Check a specific replicate
polyzymd check-progress -c config.yaml -r 1

# Example output:
# Progress: 12500000/50000000 steps (25.0%), 5 segment(s)
# Status: in_progress — 75.000 ns remaining

Or inspect files directly:

# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd

# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd

Handling Failures

Most failures are handled automatically by the smart restart system. This section covers the cases where manual intervention is needed.

Wall-Time or Preemption Interrupts

No action needed. The smart restart system saves an interrupted checkpoint, updates the progress file, and the job resubmits itself. Check the SLURM log to confirm:

# Look for the graceful shutdown message
grep -i "interrupted\|graceful\|resubmit" slurm_logs/r1_300K_*.out

Hard Crash (OOM, Segfault, Node Failure)

If a job crashes without saving state, the child Python process exits with a non-zero, non-99 code, so the bash wrapper reports a fatal error and does not resubmit (a node failure kills the wrapper as well). Once the job is resubmitted manually, run-segment detects the incomplete segment and handles it appropriately.

To diagnose issues:

Check the error:
```
cat slurm_logs/r1_300K_*.out
```
Fix the issue (e.g. increase memory with --memory 8G)

Resume:

# Option 1: Resubmit the existing script
sbatch job_scripts/r1_300K_LipA.sh

# Option 2: Use recover to inspect and resume (with more memory if needed)
polyzymd recover -c config.yaml -r 1 --submit --preset aa100 --memory 8G

Checking Progress

For a visual overview of all replicates at once:

polyzymd status -c config.yaml

Use check-progress to check a single replicate (used by SLURM scripts):

polyzymd check-progress -c config.yaml -r 1

Or for a detailed per-segment view with recovery options:

polyzymd recover -c config.yaml -r 1

Start Fresh

To restart a simulation from scratch:

# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/

# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100

Generated Script Structure

The submit command generates one script per replicate. Each script is a self-resubmitting job that handles the entire simulation lifecycle: building, equilibration, production segments, interruptions, and resubmission.

#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=r1_300K_LipA
#SBATCH --output=slurm_logs/r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1
#SBATCH --signal=B:USR1@300
#SBATCH --no-requeue

# =============================================================================
# PolyzyMD Self-Resubmitting Simulation Job
# Generated by polyzymd — do not edit manually
# =============================================================================

# Activate pixi environment
# The manifest path was resolved at submission time from `which polyzymd`.
eval "$(pixi shell-hook -e cuda-12-4 --manifest-path /projects/$USER/polyzymd/pixi.toml)"

set -e

export INTERCHANGE_EXPERIMENTAL=1

# Resolve this script's path for self-resubmission
# ($SLURM_JOB_SCRIPT is only available in SLURM >= 22.05)
THIS_SCRIPT="${SLURM_JOB_SCRIPT:-$(realpath "$0")}"

CONFIG_PATH="/projects/$USER/polyzymd/my_simulation/config.yaml"
REPLICATE=1
WORKING_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

mkdir -p "$WORKING_DIR"

echo "=================================================="
echo "PolyzyMD self-resubmitting job"
echo "Config:    $CONFIG_PATH"
echo "Replicate: $REPLICATE"
echo "Work dir:  $WORKING_DIR"
echo "Pixi env:  cuda-12-4"
echo "Job ID:    ${SLURM_JOB_ID:-local}"
echo "Timestamp: $(date)"
echo "=================================================="

# Signal forwarding (see Smart Restart docs)
CHILD_PID=""
forward_signal() {
    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
        echo "Forwarding $1 to Python process (PID $CHILD_PID)"
        kill -"$1" "$CHILD_PID"
    fi
}
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM

# Run the next segment (backgrounded for signal forwarding)
polyzymd run-segment \
    -c "$CONFIG_PATH" \
    -r "$REPLICATE" \
    --scratch-dir "$WORKING_DIR" &
CHILD_PID=$!

# Wait for the child process
set +e
wait "$CHILD_PID" 2>/dev/null
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
    wait "$CHILD_PID" 2>/dev/null
    RC=$?
done
set -e

echo "run-segment exited with code $RC at $(date)"

# --- Resubmission logic ---

if [ $RC -ne 0 ] && [ $RC -ne 99 ]; then
    echo "FATAL: run-segment failed (exit code $RC) — NOT resubmitting"
    exit $RC
fi

# Check whether more work remains
set +e
polyzymd check-progress -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR"
PROGRESS_RC=$?
set -e

if [ $PROGRESS_RC -eq 0 ]; then
    echo "Simulation complete — no resubmission needed."
    exit 0
fi

# Work remains — resubmit this same script
echo "Work remains — resubmitting job..."
sbatch "$THIS_SCRIPT"
SUBMIT_RC=$?

if [ $SUBMIT_RC -eq 0 ]; then
    echo "Resubmitted successfully."
else
    echo "WARNING: sbatch resubmission failed (exit code $SUBMIT_RC)"
    echo "You can manually resume with:"
    echo "  sbatch $THIS_SCRIPT"
    exit 1
fi

exit 0

Key Features of the Generated Script

Feature	How It Works
Signal forwarding	`trap` + background `&` + `wait` loop ensures SIGUSR1/SIGTERM reach the Python process
Unified entry point	`polyzymd run-segment` handles both initial (build + eq + seg 0) and continuation segments
Progress checking	`polyzymd check-progress` returns exit code 0 (complete) or 1 (work remains)
Self-resubmission	`sbatch "$THIS_SCRIPT"` resubmits the exact same script (path resolved at script start via `$SLURM_JOB_SCRIPT` or `realpath "$0"`)
Error handling	Non-zero, non-99 exit codes abort without resubmitting
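The resubmission decision made at the end of the script reduces to two exit codes. The sketch below restates that logic in a form that is easy to reason about; next_action is a hypothetical name, and the return values are labels chosen for illustration.

```python
def next_action(run_rc: int, progress_rc=None) -> str:
    """Mirror the script's decision: abort, finish, or resubmit.

    run_rc      exit code of run-segment (0 = segment done, 99 = graceful interrupt)
    progress_rc exit code of check-progress (0 = complete, otherwise work remains)
    """
    if run_rc not in (0, 99):
        # Any other failure stops the chain so a broken job cannot loop forever
        return "abort"
    if progress_rc == 0:
        return "done"
    return "resubmit"


if __name__ == "__main__":
    for run_rc, progress_rc in [(0, 0), (0, 1), (99, 1), (137, None)]:
        print(run_rc, progress_rc, "->", next_action(run_rc, progress_rc))
```

Treating every unexpected exit code as fatal is what keeps a misconfigured simulation from resubmitting itself indefinitely.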

Best Practices

1. Always Test First

# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run

# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:02:00 \
    --replicates 1

# Or a slightly longer test
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:05:00 \
    --replicates 1

2. Monitor Early Segments

Watch the first segment complete to catch issues early:

tail -f slurm_logs/r1_300K_*.out

3. Back Up Important Data

Scratch is often purged. Copy completed simulations to projects:

# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
      /projects/$USER/completed_simulations/

4. Use Email Notifications

polyzymd submit -c config.yaml --email you@university.edu

The generated scripts set #SBATCH --mail-type=FAIL, so you’ll receive an email when a job fails.

5. Segment Duration Guidelines

Cluster Time Limit	Approximate Production per Segment
1 hour (testing)	0.5 - 1 ns
24 hours	8 - 12 ns
48 hours	20 - 30 ns
7 days	50 - 100 ns
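These figures also give a rough idea of how many times a job will resubmit. The sketch below is simple arithmetic on the table values; jobs_needed is a hypothetical helper, and it counts production segments only, ignoring the build and equilibration time spent in the first job.

```python
import math


def jobs_needed(total_ns: float, ns_per_segment: float) -> int:
    """Number of jobs required to cover total_ns at a given throughput per job."""
    if ns_per_segment <= 0:
        raise ValueError("ns_per_segment must be positive")
    return math.ceil(total_ns / ns_per_segment)


if __name__ == "__main__":
    # 100 ns of production on a 24-hour queue at 8 to 12 ns per segment
    print("slow case:", jobs_needed(100.0, 8.0), "jobs")
    print("fast case:", jobs_needed(100.0, 12.0), "jobs")
```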

CLI Reference

`polyzymd submit`

Submit self-resubmitting simulation jobs to SLURM.

polyzymd submit -c CONFIG [OPTIONS]

Required:

-c, --config PATH - Path to YAML configuration file

Options:

-r, --replicates RANGE - Replicate range (e.g., “1-5”, “1,3,5”). Default: “1”
--preset PRESET - SLURM preset: aa100, al40, blanca-shirts, testing, bridges2. Default: aa100
--account ACCOUNT - HPC allocation account ID (optional on Bridges2, where the allocation is inferred from your login when omitted)
--gpu-type TYPE - GPU type for Bridges2: v100-16, v100-32, l40s-48, h100-80. Default: v100-32
--scratch-dir PATH - Override scratch directory for simulation output
--projects-dir PATH - Override projects directory for scripts/logs
--output-dir PATH - Directory for job scripts. Default: {projects_dir}/job_scripts
--email EMAIL - Email for job notifications
--time-limit TIME - Override SLURM time limit (HH:MM:SS)
--memory SIZE - Override SLURM memory allocation (e.g., “4G”). Bridges2 omits --mem by default (per-GPU allocation)
--openff-logs - Enable verbose OpenFF logs in generated job scripts (for debugging)
--dry-run - Generate scripts without submitting

`polyzymd run-segment`

Unified entry point for SLURM jobs. Determines what work remains and runs the next segment.

polyzymd run-segment -c CONFIG [OPTIONS]

Required:

-c, --config PATH - Path to YAML configuration file

Options:

-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory for simulation output
--skip-build - Skip system building (use existing) for initial segment

Behavior:

If no segments exist: builds system, equilibrates, runs segment 0
If segments exist but simulation is incomplete: continues from last completed segment
If simulation is complete: exits 0 immediately

Exit codes:

0 - Segment completed successfully
1 - Error
99 - Graceful interruption (wall-time warning or preemption signal)

`polyzymd check-progress`

Check whether a simulation is complete. Used by SLURM resubmission logic.

polyzymd check-progress -c CONFIG [OPTIONS]

Required:

-c, --config PATH - Path to YAML configuration file

Options:

-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory

Exit codes:

0 - Simulation complete (do NOT resubmit)
1 - Work remains (resubmit)

`polyzymd recover`

Resume a stalled or interrupted simulation. Scans the working directory, loads progress state, and reports how much work remains. With --submit, generates and submits a self-resubmitting SLURM job that will automatically continue from the last completed segment.

polyzymd recover -c CONFIG [OPTIONS]

Required:

-c, --config PATH - Path to YAML configuration file

Options:

-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory
--preset PRESET - SLURM preset for recovery job. Default: aa100
--submit / --no-submit - Submit a recovery job (default: status only)
--dry-run - Show what would be submitted without submitting
--memory SIZE - Override SLURM memory allocation (e.g. ‘4G’, ‘8G’). Default: 3G

Examples:

# Check status only
polyzymd recover -c config.yaml -r 1

# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts

# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run

Example output (status only):

Working directory: /scratch/user/sim/LipA_300K_run1
Progress: 12500000/50000000 steps (25.0%)
Status: in_progress
Segments: 5
  segment 0: completed (100%)
  segment 1: completed (100%)
  segment 2: completed (100%)
  segment 3: completed (100%)
  segment 4: interrupted (50%)

Remaining: 75.000 ns (37500000 steps)

To resume, run:
  polyzymd recover -c config.yaml -r 1 --submit --preset aa100
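The numbers in this example output can be cross-checked with a few lines of code, which is a quick way to confirm that a status report is internally consistent. The sketch assumes the output format shown above; check_recover_output is a hypothetical helper, and the timestep is inferred from the printed values rather than read from the config.

```python
import re


def check_recover_output(text: str) -> dict:
    """Extract step counts from the status text and derive a few sanity checks."""
    progress = re.search(r"Progress: (\d+)/(\d+) steps", text)
    remaining = re.search(r"Remaining: ([\d.]+) ns \((\d+) steps\)", text)
    done, total = int(progress.group(1)), int(progress.group(2))
    remaining_ns = float(remaining.group(1))
    remaining_steps = int(remaining.group(2))
    return {
        "percent": 100.0 * done / total,
        "steps_consistent": done + remaining_steps == total,
        # 1 ns = 1e6 fs, so this is the timestep implied by the report
        "timestep_fs": remaining_ns * 1e6 / remaining_steps,
    }


EXAMPLE_OUTPUT = """\
Progress: 12500000/50000000 steps (25.0%)
Status: in_progress
Remaining: 75.000 ns (37500000 steps)
"""

if __name__ == "__main__":
    print(check_recover_output(EXAMPLE_OUTPUT))
```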

Troubleshooting

“Job pending forever”

squeue -u $USER
# Check REASON column

Common reasons:

Resources - Waiting for GPUs
Priority - Queue is busy

“pixi: command not found” in job

If you see errors like “pixi: command not found” in your job output, pixi is not available in the non-interactive SLURM shell.

Ensure pixi is installed and on PATH for non-interactive shells:

# Check pixi is installed
which pixi

# If not installed, install it:
curl -fsSL https://pixi.sh/install.sh | sh
source ~/.bashrc

If pixi is installed but only available in interactive shells (e.g., its PATH line in ~/.bashrc sits below an interactive-only guard such as [ -z "$PS1" ] && return), move the PATH addition above the guard so that SLURM jobs can find it.

“Out of memory”

There are two types of out-of-memory errors:

GPU Memory (CUDA OOM):

CUDA out of memory

Reduce system size:

Decrease box.padding
Use fewer polymers
Use smaller production samples (fewer frames saved)

System Memory (SLURM OOM):

slurmstepd: error: Detected 1 oom_kill event in StepId=...

The job exceeded its RAM allocation. This often occurs during energy minimization when loading large systems onto the GPU.

Solution: Increase memory with the --memory flag:

# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G

# For very large systems
polyzymd submit -c config.yaml --memory 8G

# Also works with recover --submit
polyzymd recover -c config.yaml -r 1 --submit --memory 8G

“GPU not detected”

Check:

nvidia-smi  # In job script

Make sure the GPU directive is present in the generated script:

Alpine presets: #SBATCH --gres=gpu:1
Bridges2 preset: #SBATCH --gpus=v100-32:1 (or your selected GPU type)

“config.yaml not found”

The self-resubmitting script stores an absolute path to the config file. Make sure the config file has not been moved or deleted since job submission:

# Check the path in the generated script
grep CONFIG_PATH job_scripts/r1_300K_LipA.sh

If the config was moved, either move it back or regenerate and resubmit:

polyzymd submit -c /new/path/to/config.yaml --preset aa100 --replicates 1
