HPC and SLURM Guide
This guide covers running PolyzyMD simulations on HPC clusters using SLURM.
Overview
Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with self-resubmitting jobs: each SLURM job runs a single simulation segment, checks whether work remains, and resubmits itself if not finished. This approach is simpler and more robust than dependency chains — every job is identical, and the simulation resumes correctly after wall-time limits, preemptions, or node failures.
User Workflow
The typical workflow for running a PolyzyMD simulation on an HPC cluster is:
1. Create a simulation directory with your configuration and input files
2. Write a config.yaml file with your simulation parameters
3. Generate job scripts using polyzymd submit --dry-run
4. Review the generated scripts
5. Submit the jobs for real
Step-by-Step Example
# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation
# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/
# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
# enzyme.pdb_path: "structures/enzyme.pdb"
# substrate.sdf_path: "structures/substrate.sdf"
# polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"
# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run
# 5. Review the generated scripts
cat job_scripts/r1_300K_LipA.sh
# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1
# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu
Directory Structure
PolyzyMD supports separating:
Projects directory: Long-term storage for scripts, logs, configs
Scratch directory: High-performance storage for trajectories
/projects/$USER/polyzymd/ # Long-term storage
├── my_simulation/ # Your simulation directory
│ ├── config.yaml # Main configuration
│ ├── structures/ # Input structure files
│ │ ├── enzyme.pdb
│ │ └── substrate.sdf
│ ├── ATRP_EGPMA_SBMA_5-mer/ # Pre-built polymer SDFs (optional)
│ │ ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│ │ └── ...
│ ├── job_scripts/ # Generated SLURM scripts
│ │ ├── r1_300K_LipA.sh # One script per replicate
│ │ ├── r2_300K_LipA.sh
│ │ └── ...
│ └── slurm_logs/ # Job output files
│ └── r1_300K_LipA.12345.out
/scratch/alpine/$USER/polyzymd_sims/ # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│ ├── system.pdb
│ ├── progress.json # Progress tracking file
│ ├── equilibration/
│ │ └── trajectory.dcd
│ ├── production_0/
│ │ ├── trajectory.dcd
│ │ ├── checkpoint.chk
│ │ └── state_data.csv
│ └── production_1/
│ └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/
Configuring Directories
In YAML Configuration
Environment variables ($USER, $HOME, etc.) and ~ are automatically expanded:
output:
projects_directory: "/projects/$USER/polyzymd/my_simulation"
scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
job_scripts_subdir: "job_scripts"
slurm_logs_subdir: "slurm_logs"
You can also use ~ for home directory:
output:
projects_directory: "~/polyzymd/my_simulation"
Via CLI Override
polyzymd submit -c config.yaml \
--projects-dir /projects/$USER/polyzymd \
--scratch-dir /scratch/alpine/$USER/simulations \
--replicates 1-5
SLURM Presets
PolyzyMD includes presets for common HPC configurations:
| Preset | Partition | GPUs | Time Limit | Memory | Description |
|---|---|---|---|---|---|
| aa100 | aa100 | 1x A100 | 24h | 3GB | CU Boulder Alpine, NVIDIA A100 |
| al40 | al40 | 1x L40 | 24h | 3GB | CU Boulder Alpine, NVIDIA L40 |
| blanca-shirts | blanca-shirts | 1x | 24h | 3GB | CU Boulder Blanca, Shirts lab partition |
| testing | atesting_a100 | 1x | 6min | 3GB | CU Boulder Alpine, quick tests |
| bridges2 | GPU-shared | 1x V100-32 | 24h | (per-GPU) | PSC Bridges2, NVIDIA V100 32GB |
Using Presets
# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100
# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing
Overriding Time Limit
You can override the preset’s time limit using --time-limit:
# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00
# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00
Time format options:
MM:SS - minutes and seconds (e.g., 2:00 for 2 minutes)
HH:MM:SS - hours, minutes, seconds (e.g., 0:02:00)
D-HH:MM:SS - days, hours, minutes, seconds (e.g., 1-00:00:00 for 1 day)
This is especially useful for:
Quick testing with short time limits
Adjusting for segment duration requirements
Working within specific QOS constraints
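When comparing a --time-limit value against a preset or a QOS cap, it can help to normalize it to seconds. The sketch below is a hypothetical helper (slurm_time_to_seconds is not a PolyzyMD command). It covers only the three formats listed above and assumes awk is available:

```bash
# Hypothetical helper, not part of PolyzyMD: convert MM:SS, HH:MM:SS,
# or D-HH:MM:SS to a number of seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        days = 0; clock = $0; has_days = 0
        # Split off an optional "D-" day prefix
        if (index(clock, "-") > 0) {
            split(clock, parts, "-")
            days = parts[1]; clock = parts[2]; has_days = 1
        }
        n = split(clock, f, ":")
        if (n == 3)                   { h = f[1]; m = f[2]; s = f[3] }
        else if (n == 2 && !has_days) { h = 0;    m = f[1]; s = f[2] }
        else                          { exit 1 }   # format not covered here
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 0:02:00      # prints 120
slurm_time_to_seconds 1-00:00:00   # prints 86400
```

SLURM itself accepts a few more spellings (for example a bare number of minutes); the helper deliberately rejects anything outside the three forms documented here.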
Custom SLURM Settings
For custom configurations, edit the generated scripts in job_scripts/ before submitting.
Bridges2 (PSC)
Bridges2 is the Pittsburgh Supercomputing Center (PSC) GPU cluster. It uses slightly different SLURM conventions than CU Boulder Alpine, and polyzymd handles these differences automatically via the bridges2 preset.
Key Differences from Alpine
| Feature | Alpine | Bridges2 |
|---|---|---|
| GPU directive | --gres=gpu:1 | --gpus=v100-32:1 |
| Nodes/tasks | --nodes=1 and --ntasks=1 | -N 1 |
| QoS | --qos=normal | (omitted, not used) |
| Memory | --mem=3G | (omitted, per-GPU allocation) |
| Account | ucb-group (in preset) | (omitted, inferred from login) |
| Env activation | | |
| Default time limit | 24h | 24h |
Account
Bridges2 infers the billing allocation from your login session, so no --account directive is emitted by default. If you have multiple allocations and need to charge a specific one, pass --account:
polyzymd submit -c config.yaml \
--preset bridges2 \
--account chm250017p \
--replicates 1-3 \
--email collaborator@pitt.edu
Note
Unlike Alpine presets, Bridges2 scripts omit the #SBATCH --account= line
entirely when no account is specified, so the --account CLI flag is optional
on Bridges2. Alpine scripts always include the line because the preset sets
a group account.
GPU Type Selection
Bridges2 has multiple GPU types available. Use --gpu-type to select:
| Flag value | GPU | VRAM |
|---|---|---|
| v100-32 | NVIDIA V100 | 32 GB |
| v100-16 | NVIDIA V100 | 16 GB |
| l40s-48 | NVIDIA L40S | 48 GB |
| h100-80 | NVIDIA H100 | 80 GB |
# Default (V100 32GB) — good balance of availability and memory
polyzymd submit -c config.yaml \
--preset bridges2 \
--account abc123_gpu
# High-memory GPU for large systems
polyzymd submit -c config.yaml \
--preset bridges2 \
--account abc123_gpu \
--gpu-type h100-80
Full Bridges2 Workflow
# 1. Dry run — inspect scripts before submitting
polyzymd submit -c config.yaml \
--preset bridges2 \
--account abc123_gpu \
--replicates 1-3 \
--dry-run
# 2. Inspect the generated SBATCH directives
head -20 job_scripts/r1_300K_LipA.sh
# You should see:
# #SBATCH --partition=GPU-shared
# #SBATCH -N 1 ← single-line nodes directive
# #SBATCH --gpus=v100-32:1 ← type-specific GPU directive
# (no --qos line)
# (no --mem line)
# #SBATCH --account=abc123_gpu ← present only because --account was passed
# 3. Submit for real
polyzymd submit -c config.yaml \
--preset bridges2 \
--account abc123_gpu \
--replicates 1-3 \
--email collaborator@pitt.edu
Bridges2 Directory Structure
On Bridges2, use Ocean storage for long-term data and local scratch for active simulations:
/ocean/projects/abc123_gpu/$USER/polyzymd/ # Long-term storage
├── my_simulation/
│ ├── config.yaml
│ ├── structures/
│ │ ├── enzyme.pdb
│ │ └── substrate.sdf
│ ├── job_scripts/
│ │ ├── r1_300K_LipA.sh
│ │ └── ...
│ └── slurm_logs/
/local/scratch/$USER/polyzymd_sims/ # High-performance local scratch
├── LipA_Substrate_300K_run1/
│ ├── system.pdb
│ ├── progress.json
│ ├── equilibration/
│ └── production_0/
Set these paths in your config.yaml:
output:
projects_directory: "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation"
scratch_directory: "/local/scratch/$USER/polyzymd_sims"
Or override on the CLI:
polyzymd submit -c config.yaml \
--preset bridges2 \
--account abc123_gpu \
--projects-dir "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation" \
--scratch-dir "/local/scratch/$USER/polyzymd_sims"
Self-Resubmitting Job Model
How It Works
PolyzyMD generates one SLURM script per replicate, and every job for that replicate runs the same script. Each job:
1. Calls polyzymd run-segment to run the next segment of work
2. After the segment finishes (or is interrupted), calls polyzymd check-progress to see if the simulation is complete
3. If work remains, resubmits itself via sbatch "$THIS_SCRIPT"
4. If the simulation is complete, exits cleanly
┌───────────────────────────┐
│ Job submits itself │
│ │◄──────────────────┐
│ 1. run-segment │ │
│ (build/eq/prod OR │ │
│ continue from last) │ │
│ │ │
│ 2. check-progress │ resubmit │
│ └─ complete? exit 0 │ │
│ └─ work remains? ─────┼───────────────────┘
│ └─ error? exit $RC │
└───────────────────────────┘
This model is simpler and more fault-tolerant than dependency chains:
No afterany dependencies: each job is independent
Automatic recovery: if a job is interrupted by wall-time or preemption, it saves a checkpoint, resubmits, and the next invocation picks up where it left off
Idempotent: each job scans the filesystem to determine what work remains, so the same script can be resubmitted manually at any time
Progress Tracking
PolyzyMD tracks simulation progress in a progress.json file in the working directory. This file records:
Which segments have completed
How many steps each segment ran
Whether any segments were interrupted
The total steps requested vs. completed
On startup, run-segment validates the progress file against the filesystem (checking for production_N/ directories and their contents) to ensure consistency.
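As a rough illustration of this kind of filesystem check, the sketch below counts production_N/ directories that contain a trajectory.dcd. It is a hypothetical helper (count_segments is not a PolyzyMD command), and the real validation inside run-segment may check more than this:

```bash
# Hypothetical helper: count production_N/ directories under a working
# directory that contain a trajectory file.
count_segments() {
    n=0
    for d in "$1"/production_*/; do
        if [ -f "${d}trajectory.dcd" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Demo on a throwaway directory that mimics the scratch layout
demo=$(mktemp -d)
mkdir -p "$demo/production_0" "$demo/production_1" "$demo/production_2"
touch "$demo/production_0/trajectory.dcd" "$demo/production_1/trajectory.dcd"
count_segments "$demo"   # prints 2: production_2 has no trajectory yet
rm -rf "$demo"
```

Comparing this count with the segment count reported by polyzymd check-progress is a quick manual sanity check when a replicate looks stalled.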
Configuring Production Duration
Specify the total simulation time in your config. PolyzyMD determines segment boundaries automatically based on wall-time:
simulation_phases:
production:
duration: 100.0 # 100 ns total production time
Tip
You don’t need to configure segments manually. Each job runs as much
production time as it can before the wall-time limit, checkpoints, and
resubmits. The segment duration is determined at runtime by the
--segment-time and --segment-frames options passed to run
(which submit computes automatically from your config).
Smart Restart & Fault Tolerance
When you run 60 replicates, each resubmitting multiple times, robustness is critical. PolyzyMD’s smart restart system handles interruptions automatically. The generated scripts already include everything described below — you do not need to configure anything. This section explains what happens under the hood so you can debug issues or adapt the approach to other workflows.
What Happens Automatically
Every generated SLURM script includes three pieces of fault-tolerance infrastructure:
Signal handling — Python-side handlers catch SIGUSR1 (wall-time warning) and SIGTERM (preemption), save an interrupted checkpoint, and exit with code 99.
Signal forwarding — Bash trap + background + wait pattern forwards signals from the SLURM batch shell to the Python child process.
Progress tracking — After each segment (whether completed or interrupted), the progress file is updated so the next invocation knows exactly where to resume.
The Three Scenarios
| Scenario | Signal | What Happens | Outcome |
|---|---|---|---|
| Wall-time warning | SIGUSR1 | Interrupted state saved, progress updated, exit 99 | Job resubmits and resumes |
| Preemption | SIGTERM | Interrupted state saved, progress updated, exit 99 | Job resubmits and resumes |
| Hard crash | None (OOM, segfault, node failure) | No state saved | Job resubmits; run-segment detects the incomplete segment |
Note
The wall-time signal is configured via #SBATCH --signal=B:USR1@300, which
tells SLURM to send SIGUSR1 to the batch shell 300 seconds (5 minutes)
before the time limit expires. This gives the simulation enough time to
save a full OpenMM state (~10-30 seconds on GPU).
Interrupted State Files
When an interrupt is detected, the signal handler writes three files into
the current segment’s directory (e.g. production_3/):
| File | Purpose |
|---|---|
| | Portable OpenMM state (positions, velocities, forces) |
| | Serialized OpenMM System (force field parameters) |
| INTERRUPTED | Marker file with step-count metadata for recovery |
The INTERRUPTED marker contains the information needed for recovery:
segment_index=3
steps_completed=1250000
total_steps=2500000
remaining_steps=1250000
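When debugging a stalled replicate, the marker can be read directly from the shell. The sketch below assumes the file contains exactly the key=value lines shown above; marker_field is a hypothetical helper, not a PolyzyMD command:

```bash
# Hypothetical helper: print one field from an INTERRUPTED marker file.
# $1 = path to the marker, $2 = key (for example steps_completed)
marker_field() {
    sed -n "s/^$2=//p" "$1"
}

# Demo using the example marker contents, written to a temp directory
demo_dir=$(mktemp -d)
cat > "$demo_dir/INTERRUPTED" <<'EOF'
segment_index=3
steps_completed=1250000
total_steps=2500000
remaining_steps=1250000
EOF

done_steps=$(marker_field "$demo_dir/INTERRUPTED" steps_completed)
total=$(marker_field "$demo_dir/INTERRUPTED" total_steps)
segment=$(marker_field "$demo_dir/INTERRUPTED" segment_index)
echo "segment $segment: $((100 * done_steps / total))% complete"
# prints: segment 3: 50% complete
rm -rf "$demo_dir"
```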
Wall-Time Restart Checkpoints
In addition to signal-triggered saves, the simulation loop periodically writes portable restart checkpoints at a configurable wall-time interval (default: 60 seconds):
| File | Purpose |
|---|---|
| | Portable OpenMM state (overwritten each checkpoint) |
| | Serialized OpenMM System (for self-contained recovery) |
These files serve as a safety net: if the process is killed between signal
delivery and the signal handler completing (e.g., the SLURM grace period
expires), the restart checkpoint from the last 60-second interval is still
on disk. Recovery prefers portable state XML files over binary .chk
checkpoints, which is important on heterogeneous clusters where jobs may
restart on different GPU hardware.
You can tune the checkpoint interval in your YAML config:
simulation_phases:
production:
duration: 100.0 # ns
samples: 2500
checkpoint_interval: 60.0 # seconds (default)
Set to 0 to disable wall-time checkpoints (not recommended on preemptible
queues).
Adaptive Sub-Chunking
OpenMM’s simulation.step(N) blocks Python for the entire call. With
large report intervals (e.g., 200,000 steps), each call can take ~2 minutes
on a slow system, leaving no opportunity to check for SIGTERM. PolyzyMD
solves this with adaptive sub-chunking: after the first checkpoint
interval, it measures actual simulation speed and divides the loop into
smaller chunks (~15 seconds each), ensuring the signal flag is checked ~4
times per checkpoint interval. This is transparent — reporters still fire
at the original interval, and sub-chunk overhead is negligible (<0.001%).
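The chunk-size arithmetic can be sketched in a few lines of shell. This is only an illustration of the calculation described above: the function name, the integer rounding, and the clamping rules are assumptions, and PolyzyMD's actual implementation lives in its Python simulation loop.

```bash
# Hypothetical sketch of the chunk-size arithmetic (integer math only).
# $1 = steps completed during the measurement window
# $2 = wall-clock seconds the window took
# $3 = target seconds per sub-chunk
# $4 = original steps per simulation.step() call
subchunk_steps() {
    chunk=$(( $1 * $3 / $2 ))
    if [ "$chunk" -lt 1 ]; then chunk=1; fi        # always make progress
    if [ "$chunk" -gt "$4" ]; then chunk=$4; fi    # never exceed the original call
    echo "$chunk"
}

# 100,000 steps measured in 60 s, 15 s target, 200,000-step report interval
subchunk_steps 100000 60 15 200000   # prints 25000
```

With these numbers a 200,000-step call becomes eight 25,000-step calls, so the signal flag is checked roughly every 15 seconds instead of every 2 minutes.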
Signal Forwarding: Why trap + background + wait?
SLURM sends signals to the batch shell process, not to child processes.
Bash ignores SIGUSR1 by default, so without explicit forwarding, the
Python simulation never sees the signal. The generated scripts use a
standard pattern to solve this:
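# Forward a signal to the child if it is still running
CHILD_PID=""
forward_signal() {
    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
        kill -"$1" "$CHILD_PID"
    fi
}
# Install the traps before starting the child so no signal is missed
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM
# Background the Python process
polyzymd run-segment -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR" &
CHILD_PID=$!
# Wait in a loop (wait returns early when a trapped signal arrives)
wait "$CHILD_PID"
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
    wait "$CHILD_PID"
    RC=$?
done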
Warning
Do not remove the trap, backgrounding (&), or wait loop from the
generated scripts. Without them, signals will not reach the Python process
and graceful shutdown will not work.
Manually Triggering an Interrupt
You can test graceful shutdown or manually stop a running simulation by
sending SIGUSR1 via scancel:
# Send USR1 to a specific job
scancel --signal=USR1 <job_id>
# The job will save interrupted state, update progress, and exit with code 99
# The resubmission logic then resubmits the job to continue
This is useful when you realize a simulation has a problem and want to stop it cleanly without losing progress.
Cancelling a Job Permanently
Because scancel sends SIGTERM, and our scripts treat SIGTERM the same as
SIGUSR1 (save state, exit 99, resubmit), a plain scancel <job_id> will
not permanently stop a simulation — the job will save its state and
resubmit itself.
To truly cancel a simulation so it does not restart, use --signal=KILL:
# Permanently stop a job (no state saved, no resubmission)
scancel --signal=KILL <job_id>
# Equivalent shorthand
scancel -s KILL <job_id>
SIGKILL cannot be caught or trapped by bash or Python, so the process
dies immediately and the resubmission logic never runs. The last
completed segment’s checkpoint is still intact — only in-progress work
since the last checkpoint is lost.
| Command | Saves state? | Resubmits? | Use case |
|---|---|---|---|
| scancel <job_id> | Yes | Yes | Don’t use this to permanently stop a simulation |
| scancel --signal=USR1 <job_id> | Yes | Yes | Graceful stop (saves progress, continues later) |
| scancel --signal=KILL <job_id> | No | No | Permanently cancel a simulation |
Tip
If you need to cancel all replicates of a simulation, cancel them all at once so that none resubmit before you can cancel the others:
scancel --signal=KILL <job_id_1> <job_id_2> <job_id_3>
Or cancel all your jobs:
scancel --signal=KILL -u $USER
Manual Recovery
If a simulation is stalled (e.g., the SLURM job exited without resubmitting),
use the recover command to inspect status and optionally resume:
# Check status only
polyzymd recover -c config.yaml -r 1
# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts
# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run
See the CLI Reference section below for full option details.
Submitting Jobs
Dry Run (Recommended First)
Generate scripts without submitting:
polyzymd submit -c config.yaml \
--replicates 1-3 \
--preset aa100 \
--dry-run
Inspect generated scripts:
cat job_scripts/r1_300K_LipA.sh
Submit for Real
polyzymd submit -c config.yaml \
--replicates 1-3 \
--preset aa100 \
--email your.email@university.edu
Replicate Specification
# Single replicate
--replicates 1
# Range
--replicates 1-5
# Specific replicates
--replicates 1,3,5,7
# Combined
--replicates 1-3,5,7-10
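The same replicate syntax is handy in your own monitoring loops. The sketch below is a hypothetical helper (expand_replicates is not a PolyzyMD command) that expands a specification into one replicate number per line; how PolyzyMD parses the option internally may differ.

```bash
# Hypothetical helper: expand "1-3,5,7-10" into one replicate per line.
expand_replicates() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r part; do
        case "$part" in
            *-*)
                # Range such as 7-10: count from the first to the last number
                i=${part%-*}; last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$part"
                ;;
        esac
    done
}

expand_replicates "1-3,5"   # prints 1, 2, 3, 5 on separate lines

# Example use (requires polyzymd on PATH):
# for r in $(expand_replicates "1-5"); do
#     polyzymd check-progress -c config.yaml -r "$r"
# done
```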
Monitoring Jobs
Check Job Status
# All your jobs
squeue -u $USER
# Specific job
scontrol show job <job_id>
# Watch jobs update
watch -n 30 'squeue -u $USER'
View Job Output
# Real-time output
tail -f slurm_logs/r1_300K_LipA*.out
# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out
Check Simulation Progress
Use the check-progress command to query the progress file:
# Check a specific replicate
polyzymd check-progress -c config.yaml -r 1
# Example output:
# Progress: 12500000/50000000 steps (25.0%), 5 segment(s)
# Status: in_progress — 75.000 ns remaining
Or inspect files directly:
# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd
# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd
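For scripted monitoring, the Progress line printed by check-progress can be reduced to a single number. The sketch below assumes the line format matches the example output above (Progress: DONE/TOTAL steps ...); remaining_steps is a hypothetical helper, and the format is not guaranteed to stay stable across PolyzyMD versions.

```bash
# Hypothetical helper: read check-progress output on stdin and print the
# number of steps still to run. Exits non-zero if no Progress line is found.
remaining_steps() {
    awk '/^Progress: / {
        split($2, counts, "/")      # $2 looks like 12500000/50000000
        print counts[2] - counts[1]
        found = 1
    }
    END { exit !found }'
}

echo "Progress: 12500000/50000000 steps (25.0%), 5 segment(s)" | remaining_steps
# prints 37500000
```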
Handling Failures
Most failures are handled automatically by the smart restart system. This section covers the cases where manual intervention is needed.
Wall-Time or Preemption Interrupts
No action needed. The smart restart system saves an interrupted checkpoint, updates the progress file, and the job resubmits itself. Check the SLURM log to confirm:
# Look for the graceful shutdown message
grep -i "interrupted\|graceful\|resubmit" slurm_logs/r1_300K_*.out
Hard Crash (OOM, Segfault, Node Failure)
If a job crashes without saving state, the resubmission logic still runs
(since the crash only kills the child Python process, not the bash wrapper).
The resubmitted job’s run-segment will detect the incomplete segment and
handle it appropriately.
To diagnose issues:
1. Check the error: cat slurm_logs/r1_300K_*.out
2. Fix the issue (e.g. increase memory with --memory 8G)
3. Resume:
# Option 1: Resubmit the existing script
sbatch job_scripts/r1_300K_LipA.sh
# Option 2: Use recover to inspect and resume (with more memory if needed)
polyzymd recover -c config.yaml -r 1 --submit --preset aa100 --memory 8G
Checking Progress
For a visual overview of all replicates at once:
polyzymd status -c config.yaml
Use check-progress to check a single replicate (used by SLURM scripts):
polyzymd check-progress -c config.yaml -r 1
Or for a detailed per-segment view with recovery options:
polyzymd recover -c config.yaml -r 1
Start Fresh
To restart a simulation from scratch:
# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/
# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100
Generated Script Structure
The submit command generates one script per replicate. Each script is a self-resubmitting job that handles the entire simulation lifecycle: building, equilibration, production segments, interruptions, and resubmission.
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=r1_300K_LipA
#SBATCH --output=slurm_logs/r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1
#SBATCH --signal=B:USR1@300
#SBATCH --no-requeue
# =============================================================================
# PolyzyMD Self-Resubmitting Simulation Job
# Generated by polyzymd — do not edit manually
# =============================================================================
# Activate pixi environment
# The manifest path was resolved at submission time from `which polyzymd`.
eval "$(pixi shell-hook -e cuda-12-4 --manifest-path /projects/$USER/polyzymd/pixi.toml)"
set -e
export INTERCHANGE_EXPERIMENTAL=1
# Resolve this script's path for self-resubmission
# ($SLURM_JOB_SCRIPT is only available in SLURM >= 22.05)
THIS_SCRIPT="${SLURM_JOB_SCRIPT:-$(realpath "$0")}"
CONFIG_PATH="/projects/$USER/polyzymd/my_simulation/config.yaml"
REPLICATE=1
WORKING_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"
mkdir -p "$WORKING_DIR"
echo "=================================================="
echo "PolyzyMD self-resubmitting job"
echo "Config: $CONFIG_PATH"
echo "Replicate: $REPLICATE"
echo "Work dir: $WORKING_DIR"
echo "Pixi env: cuda-12-4"
echo "Job ID: ${SLURM_JOB_ID:-local}"
echo "Timestamp: $(date)"
echo "=================================================="
# Signal forwarding (see Smart Restart docs)
CHILD_PID=""
forward_signal() {
if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
echo "Forwarding $1 to Python process (PID $CHILD_PID)"
kill -"$1" "$CHILD_PID"
fi
}
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM
# Run the next segment (backgrounded for signal forwarding)
polyzymd run-segment \
-c "$CONFIG_PATH" \
-r "$REPLICATE" \
--scratch-dir "$WORKING_DIR" &
CHILD_PID=$!
# Wait for the child process
set +e
wait "$CHILD_PID" 2>/dev/null
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
wait "$CHILD_PID" 2>/dev/null
RC=$?
done
set -e
echo "run-segment exited with code $RC at $(date)"
# --- Resubmission logic ---
if [ $RC -ne 0 ] && [ $RC -ne 99 ]; then
echo "FATAL: run-segment failed (exit code $RC) — NOT resubmitting"
exit $RC
fi
# Check whether more work remains
set +e
polyzymd check-progress -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR"
PROGRESS_RC=$?
set -e
if [ $PROGRESS_RC -eq 0 ]; then
echo "Simulation complete — no resubmission needed."
exit 0
fi
# Work remains — resubmit this same script
echo "Work remains — resubmitting job..."
sbatch "$THIS_SCRIPT"
SUBMIT_RC=$?
if [ $SUBMIT_RC -eq 0 ]; then
echo "Resubmitted successfully."
else
echo "WARNING: sbatch resubmission failed (exit code $SUBMIT_RC)"
echo "You can manually resume with:"
echo " sbatch $THIS_SCRIPT"
exit 1
fi
exit 0
Key Features of the Generated Script
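| Feature | How It Works |
|---|---|
| Signal forwarding | trap + background + wait forwards USR1 and TERM to the Python process |
| Unified entry point | polyzymd run-segment decides what work remains and runs the next segment |
| Progress checking | polyzymd check-progress exits 0 when complete, 1 when work remains |
| Self-resubmission | sbatch "$THIS_SCRIPT" resubmits the same script |
| Error handling | Non-zero, non-99 exit codes abort without resubmitting |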
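The branching at the end of the generated script can be restated as a small pure function, which makes the three outcomes easy to reason about. This is a hypothetical restatement for illustration (next_action does not appear in the generated script); it mirrors the exit-code checks shown in the listing above.

```bash
# Hypothetical restatement of the script's resubmission decision.
# $1 = exit code of run-segment, $2 = exit code of check-progress
next_action() {
    if [ "$1" -ne 0 ] && [ "$1" -ne 99 ]; then
        echo "abort"        # real failure: do not resubmit
    elif [ "$2" -eq 0 ]; then
        echo "finished"     # simulation complete
    else
        echo "resubmit"     # work remains
    fi
}

next_action 0 0     # prints finished
next_action 99 1    # prints resubmit (interrupted by wall-time or preemption)
next_action 137 1   # prints abort (for example, a process killed by SIGKILL)
```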
Best Practices
1. Always Test First
# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run
# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
--preset testing \
--time-limit 0:02:00 \
--replicates 1
# Or a slightly longer test
polyzymd submit -c config.yaml \
--preset testing \
--time-limit 0:05:00 \
--replicates 1
2. Monitor Early Segments
Watch the first segment complete to catch issues early:
tail -f slurm_logs/r1_300K_*.out
3. Back Up Important Data
Scratch is often purged. Copy completed simulations to projects:
# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
/projects/$USER/completed_simulations/
4. Use Email Notifications
polyzymd submit -c config.yaml --email you@university.edu
The generated scripts set --mail-type=FAIL, so SLURM emails you when a job fails.
5. Segment Duration Guidelines
| Cluster Time Limit | Approximate Production per Segment |
|---|---|
| 1 hour (testing) | 0.5 - 1 ns |
| 24 hours | 8 - 12 ns |
| 48 hours | 20 - 30 ns |
| 7 days | 50 - 100 ns |
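These figures give a quick estimate of how many times a replicate will resubmit. The sketch below is a hypothetical helper (jobs_needed is not a PolyzyMD command) that uses integer ceiling division, so it only accepts whole nanoseconds:

```bash
# Hypothetical helper: number of jobs needed, rounding up (integer ns only).
# $1 = total production time in ns, $2 = ns completed per job
jobs_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 100 ns of production on a 24-hour queue (8 to 12 ns per segment)
jobs_needed 100 8    # prints 13
jobs_needed 100 12   # prints 9
```

Treat the result as a lower bound: the first job also builds and equilibrates the system, so it completes less production than later jobs.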
CLI Reference
polyzymd submit
Submit self-resubmitting simulation jobs to SLURM.
polyzymd submit -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicates RANGE - Replicate range (e.g., “1-5”, “1,3,5”). Default: “1”
--preset PRESET - SLURM preset: aa100, al40, blanca-shirts, testing, bridges2. Default: aa100
--account ACCOUNT - HPC allocation account ID (optional on Bridges2, where the allocation is inferred from login)
--gpu-type TYPE - GPU type for Bridges2: v100-16, v100-32, l40s-48, h100-80. Default: v100-32
--scratch-dir PATH - Override scratch directory for simulation output
--projects-dir PATH - Override projects directory for scripts/logs
--output-dir PATH - Directory for job scripts. Default: {projects_dir}/job_scripts
--email EMAIL - Email for job notifications
--time-limit TIME - Override SLURM time limit (e.g., HH:MM:SS)
--memory SIZE - Override SLURM memory allocation (e.g., “4G”). Bridges2 omits --mem by default (per-GPU allocation)
--openff-logs - Enable verbose OpenFF logs in generated job scripts (for debugging)
--dry-run - Generate scripts without submitting
polyzymd run-segment
Unified entry point for SLURM jobs. Determines what work remains and runs the next segment.
polyzymd run-segment -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory for simulation output
--skip-build - Skip system building (use existing) for initial segment
Behavior:
If no segments exist: builds system, equilibrates, runs segment 0
If segments exist but simulation is incomplete: continues from last completed segment
If simulation is complete: exits 0 immediately
Exit codes:
0 - Segment completed successfully
1 - Error
99 - Graceful interruption (wall-time signal or preemption)
polyzymd check-progress
Check whether a simulation is complete. Used by SLURM resubmission logic.
polyzymd check-progress -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory
Exit codes:
0 - Simulation complete (do NOT resubmit)
1 - Work remains (resubmit)
polyzymd recover
Resume a stalled or interrupted simulation. Scans the working directory,
loads progress state, and reports how much work remains. With --submit,
generates and submits a self-resubmitting SLURM job that will automatically
continue from the last completed segment.
polyzymd recover -c CONFIG [OPTIONS]
Required:
-c, --config PATH - Path to YAML configuration file
Options:
-r, --replicate INT - Replicate number. Default: 1
--scratch-dir PATH - Override scratch directory
--preset PRESET - SLURM preset for recovery job. Default: aa100
--submit / --no-submit - Submit a recovery job (default: status only)
--dry-run - Show what would be submitted without submitting
--memory SIZE - Override SLURM memory allocation (e.g. ‘4G’, ‘8G’). Default: 3G
Examples:
# Check status only
polyzymd recover -c config.yaml -r 1
# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts
# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run
Example output (status only):
Working directory: /scratch/user/sim/LipA_300K_run1
Progress: 12500000/50000000 steps (25.0%)
Status: in_progress
Segments: 5
segment 0: completed (100%)
segment 1: completed (100%)
segment 2: completed (100%)
segment 3: completed (100%)
segment 4: interrupted (50%)
Remaining: 75.000 ns (37500000 steps)
To resume, run:
polyzymd recover -c config.yaml -r 1 --submit --preset aa100
Troubleshooting
“Job pending forever”
squeue -u $USER
# Check REASON column
Common reasons:
Resources - Waiting for GPUs
Priority - Queue is busy
“pixi: command not found” in job
If you see errors like “pixi: command not found” in your job output, pixi is not available in the non-interactive SLURM shell.
Ensure pixi is installed and on PATH for non-interactive shells:
# Check pixi is installed
which pixi
# If not installed, install it:
curl -fsSL https://pixi.sh/install.sh | sh
source ~/.bashrc
If pixi is installed but only available in interactive shells (e.g., its
PATH line sits in ~/.bashrc below an interactive-only guard such as
[ -z "$PS1" ] && return), move the PATH addition above the guard so that
SLURM jobs can find it.
“Out of memory”
There are two types of out-of-memory errors:
GPU Memory (CUDA OOM):
CUDA out of memory
Reduce system size:
Decrease box.padding
Use fewer polymers
Use smaller production samples (fewer frames saved)
System Memory (SLURM OOM):
slurmstepd: error: Detected 1 oom_kill event in StepId=...
The job exceeded its RAM allocation. This often occurs during energy minimization when loading large systems onto the GPU.
Solution: Increase memory with the --memory flag:
# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G
# For very large systems
polyzymd submit -c config.yaml --memory 8G
# Also works with recover --submit
polyzymd recover -c config.yaml -r 1 --submit --memory 8G
“GPU not detected”
Check:
nvidia-smi # In job script
Make sure the GPU directive is present in the generated script:
Alpine presets: #SBATCH --gres=gpu:1
Bridges2 preset: #SBATCH --gpus=v100-32:1 (or your selected GPU type)
“config.yaml not found”
The self-resubmitting script stores an absolute path to the config file. Make sure the config file has not been moved or deleted since job submission:
# Check the path in the generated script
grep CONFIG_PATH job_scripts/r1_300K_LipA.sh
If the config was moved, either move it back or regenerate and resubmit:
polyzymd submit -c /new/path/to/config.yaml --preset aa100 --replicates 1