# HPC and SLURM Guide

This guide covers running PolyzyMD simulations on HPC clusters using SLURM.

## Overview

Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with **self-resubmitting jobs**: each SLURM job runs a single simulation segment, checks whether work remains, and resubmits itself if not finished.

This approach is simpler and more robust than dependency chains: every job is identical, and the simulation resumes correctly after wall-time limits, preemptions, or node failures.

## User Workflow

The typical workflow for running a PolyzyMD simulation on an HPC cluster is:

1. **Create a simulation directory** with your configuration and input files
2. **Write a `config.yaml`** file with your simulation parameters
3. **Generate job scripts** using `polyzymd submit --dry-run`
4. **Review** the generated scripts
5. **Submit the jobs** for real

### Step-by-Step Example

```bash
# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation

# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/

# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
#   enzyme.pdb_path: "structures/enzyme.pdb"
#   substrate.sdf_path: "structures/substrate.sdf"
#   polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"

# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run

# 5. Review the generated scripts
cat job_scripts/r1_300K_LipA.sh

# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1

# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu
```

---

## Directory Structure

PolyzyMD supports separating:

- **Projects directory**: Long-term storage for scripts, logs, configs
- **Scratch directory**: High-performance storage for trajectories

```
/projects/$USER/polyzymd/                        # Long-term storage
├── my_simulation/                               # Your simulation directory
│   ├── config.yaml                              # Main configuration
│   ├── structures/                              # Input structure files
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── ATRP_EGPMA_SBMA_5-mer/                   # Pre-built polymer SDFs (optional)
│   │   ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│   │   └── ...
│   ├── job_scripts/                             # Generated SLURM scripts
│   │   ├── r1_300K_LipA.sh                      # One script per replicate
│   │   ├── r2_300K_LipA.sh
│   │   └── ...
│   └── slurm_logs/                              # Job output files
│       └── r1_300K_LipA.12345.out

/scratch/alpine/$USER/polyzymd_sims/             # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│   ├── system.pdb
│   ├── progress.json                            # Progress tracking file
│   ├── equilibration/
│   │   └── trajectory.dcd
│   ├── production_0/
│   │   ├── trajectory.dcd
│   │   ├── checkpoint.chk
│   │   └── state_data.csv
│   └── production_1/
│       └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/
```

## Configuring Directories

### In YAML Configuration

Environment variables (`$USER`, `$HOME`, etc.)
and `~` are automatically expanded:

```yaml
output:
  projects_directory: "/projects/$USER/polyzymd/my_simulation"
  scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
  job_scripts_subdir: "job_scripts"
  slurm_logs_subdir: "slurm_logs"
```

You can also use `~` for home directory:

```yaml
output:
  projects_directory: "~/polyzymd/my_simulation"
```

### Via CLI Override

```bash
polyzymd submit -c config.yaml \
    --projects-dir /projects/$USER/polyzymd \
    --scratch-dir /scratch/alpine/$USER/simulations \
    --replicates 1-5
```

---

## SLURM Presets

PolyzyMD includes presets for common HPC configurations:

| Preset | Partition | GPUs | Time Limit | Memory | Description |
|--------|-----------|------|------------|--------|-------------|
| `aa100` | aa100 | 1x A100 | 24h | 3GB | CU Boulder Alpine, NVIDIA A100 |
| `al40` | al40 | 1x L40 | 24h | 3GB | CU Boulder Alpine, NVIDIA L40 |
| `blanca-shirts` | blanca-shirts | 1x | 24h | 3GB | CU Boulder Blanca, Shirts lab partition |
| `testing` | atesting_a100 | 1x | 6min | 3GB | CU Boulder Alpine, quick tests |
| `bridges2` | GPU-shared | 1x V100-32 | 24h | (per-GPU) | PSC Bridges2, NVIDIA V100 32GB |

### Using Presets

```bash
# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100

# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing
```

### Overriding Time Limit

You can override the preset's time limit using `--time-limit`:

```bash
# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00

# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00
```

**Time format options:**

- `MM:SS` - minutes and seconds (e.g., `2:00` for 2 minutes)
- `HH:MM:SS` - hours, minutes, seconds (e.g., `0:02:00`)
- `D-HH:MM:SS` - days, hours, minutes, seconds (e.g., `1-00:00:00` for 1 day)

This is especially useful for:

- Quick testing with short time limits
- Adjusting for segment duration
  requirements
- Working within specific QOS constraints

### Custom SLURM Settings

For custom configurations, edit the generated scripts in `job_scripts/` before submitting.

---

## Bridges2 (PSC)

[Bridges2](https://www.psc.edu/resources/bridges-2/) is the Pittsburgh Supercomputing Center (PSC) GPU cluster. It uses slightly different SLURM conventions than CU Boulder Alpine, and polyzymd handles these differences automatically via the `bridges2` preset.

### Key Differences from Alpine

| Feature | Alpine (`aa100`) | Bridges2 (`bridges2`) |
|---------|-----------------|----------------------|
| GPU directive | `--gres=gpu:N` | `--gpus=<type>:N` |
| Nodes/tasks | `--nodes=1` + `--ntasks=1` | `-N 1` (single line) |
| QoS | `--qos=normal` | *(omitted: not used)* |
| Memory | `--mem=3G` | *(omitted: per-GPU allocation)* |
| Account | ucb-group (in preset) | *(omitted: inferred from login)* |
| Env activation | `pixi shell-hook -e cuda-12-4` | `pixi shell-hook -e cuda-12-6` |
| Default time limit | 24h | 24h |

### Account

Bridges2 infers the billing allocation from your login session, so **no `--account` directive is emitted by default**. If you have multiple allocations and need to charge a specific one, pass `--account`:

```bash
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account chm250017p \
    --replicates 1-3 \
    --email collaborator@pitt.edu
```

```{note}
Unlike Alpine presets, Bridges2 scripts omit the `#SBATCH --account=` line entirely when no account is specified. The `--account` CLI flag is optional for Bridges2 (it is required on Alpine where the preset always sets a group account).
```

### GPU Type Selection

Bridges2 has multiple GPU types available.
Use `--gpu-type` to select:

| Flag value | GPU | VRAM |
|------------|-----|------|
| `v100-32` *(default)* | NVIDIA V100 | 32 GB |
| `v100-16` | NVIDIA V100 | 16 GB |
| `l40s-48` | NVIDIA L40S | 48 GB |
| `h100-80` | NVIDIA H100 | 80 GB |

```bash
# Default (V100 32GB): good balance of availability and memory
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu

# High-memory GPU for large systems
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --gpu-type h100-80
```

### Full Bridges2 Workflow

```bash
# 1. Dry run: inspect scripts before submitting
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --replicates 1-3 \
    --dry-run

# 2. Inspect the generated SBATCH directives
head -20 job_scripts/r1_300K_LipA.sh
# You should see:
#   #SBATCH --partition=GPU-shared
#   #SBATCH -N 1                 ← single-line nodes directive
#   #SBATCH --gpus=v100-32:1     ← type-specific GPU directive
#   (no --qos line)
#   (no --mem line)
#   (no --account line unless --account is passed; otherwise inferred from login)

# 3. Submit for real
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --replicates 1-3 \
    --email collaborator@pitt.edu
```

### Bridges2 Directory Structure

On Bridges2, use Ocean storage for long-term data and local scratch for active simulations:

```
/ocean/projects/abc123_gpu/$USER/polyzymd/    # Long-term storage
├── my_simulation/
│   ├── config.yaml
│   ├── structures/
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── job_scripts/
│   │   ├── r1_300K_LipA.sh
│   │   └── ...
│   └── slurm_logs/

/local/scratch/$USER/polyzymd_sims/           # High-performance local scratch
├── LipA_Substrate_300K_run1/
│   ├── system.pdb
│   ├── progress.json
│   ├── equilibration/
│   └── production_0/
```

Set these paths in your `config.yaml`:

```yaml
output:
  projects_directory: "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation"
  scratch_directory: "/local/scratch/$USER/polyzymd_sims"
```

Or override on the CLI:

```bash
polyzymd submit -c config.yaml \
    --preset bridges2 \
    --account abc123_gpu \
    --projects-dir "/ocean/projects/abc123_gpu/$USER/polyzymd/my_simulation" \
    --scratch-dir "/local/scratch/$USER/polyzymd_sims"
```

---

## Self-Resubmitting Job Model

### How It Works

PolyzyMD generates one identical SLURM script per replicate. Each job:

1. Calls `polyzymd run-segment` to run the next segment of work
2. After the segment finishes (or is interrupted), calls `polyzymd check-progress` to see if the simulation is complete
3. If work remains, resubmits itself via `sbatch "$THIS_SCRIPT"`
4. If the simulation is complete, exits cleanly

```
┌───────────────────────────┐
│  Job submits itself       │
│                           │◄──────────────────┐
│  1. run-segment           │                   │
│     (build/eq/prod OR     │                   │
│      continue from last)  │                   │
│                           │                   │
│  2. check-progress        │     resubmit      │
│     └─ complete? exit 0   │                   │
│     └─ work remains? ─────┼───────────────────┘
│     └─ error? exit $RC    │
└───────────────────────────┘
```

This model is simpler and more fault-tolerant than dependency chains:

- **No `afterany` dependencies**: each job is independent
- **Automatic recovery**: if a job is interrupted by wall-time or preemption, it saves a checkpoint, resubmits, and the next invocation picks up where it left off
- **Idempotent**: each job scans the filesystem to determine what work remains, so the same script can be resubmitted manually at any time

### Progress Tracking

PolyzyMD tracks simulation progress in a `progress.json` file in the working directory.
This file records:

- Which segments have completed
- How many steps each segment ran
- Whether any segments were interrupted
- The total steps requested vs. completed

On startup, `run-segment` validates the progress file against the filesystem (checking for `production_N/` directories and their contents) to ensure consistency.

### Configuring Production Duration

Specify the total simulation time in your config. PolyzyMD determines segment boundaries automatically based on wall-time:

```yaml
simulation_phases:
  production:
    duration: 100.0  # 100 ns total production time
```

```{tip}
You don't need to configure segments manually. Each job runs as much production time as it can before the wall-time limit, checkpoints, and resubmits. The segment duration is determined at runtime by the ``--segment-time`` and ``--segment-frames`` options passed to ``run`` (which ``submit`` computes automatically from your config).
```

---

(smart-restart)=
## Smart Restart & Fault Tolerance

When you run 60 replicates, each resubmitting multiple times, robustness is critical. PolyzyMD's **smart restart** system handles interruptions automatically.

The generated scripts already include everything described below; you do not need to configure anything. This section explains what happens under the hood so you can debug issues or adapt the approach to other workflows.

### What Happens Automatically

Every generated SLURM script includes three pieces of fault-tolerance infrastructure:

1. **Signal handling**: Python-side handlers catch SIGUSR1 (wall-time warning) and SIGTERM (preemption), save an interrupted checkpoint, and exit with code 99.
2. **Signal forwarding**: Bash trap + background + wait pattern forwards signals from the SLURM batch shell to the Python child process.
3. **Progress tracking**: After each segment (whether completed or interrupted), the progress file is updated so the next invocation knows exactly where to resume.
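The exit-code contract behind this infrastructure fits in a few lines of shell. The sketch below is illustrative only: `decide_next_action` is a hypothetical helper that does not exist in PolyzyMD. It assumes the exit codes documented in this guide: 0 or 99 from `run-segment` for a completed or gracefully interrupted segment, and 0 or 1 from `check-progress` for "complete" or "work remains".

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of PolyzyMD: maps two exit codes to an action.
#   $1 = exit code of `polyzymd run-segment`
#   $2 = exit code of `polyzymd check-progress`
decide_next_action() {
    local segment_rc=$1 progress_rc=$2
    if [ "$segment_rc" -ne 0 ] && [ "$segment_rc" -ne 99 ]; then
        echo "abort"        # real failure: do not resubmit
    elif [ "$progress_rc" -eq 0 ]; then
        echo "done"         # all requested steps have finished
    else
        echo "resubmit"     # work remains: queue the same script again
    fi
}
```

For example, `decide_next_action 99 1` prints `resubmit`, which corresponds to a job that was interrupted by the wall-time warning while production steps were still outstanding.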
### The Three Scenarios

| Scenario | Signal | What Happens | Outcome |
|----------|--------|--------------|---------|
| **Wall-time warning** | `SIGUSR1` (5 min before limit) | Interrupted state saved, progress updated, exit 99 | Job resubmits and resumes |
| **Preemption** | `SIGTERM` (120 s grace on Blanca) | Interrupted state saved, progress updated, exit 99 | Job resubmits and resumes |
| **Hard crash** | None (OOM, segfault, node failure) | No state saved | Generated script resubmits only after exit code 0 or 99; once the job is resubmitted, `run-segment` detects the incomplete segment and handles it |

```{note}
The wall-time signal is configured via `#SBATCH --signal=B:USR1@300`, which tells SLURM to send `SIGUSR1` to the batch shell 300 seconds (5 minutes) before the time limit expires. This gives the simulation enough time to save a full OpenMM state (~10-30 seconds on GPU).
```

### Interrupted State Files

When an interrupt is detected, the signal handler writes three files into the current segment's directory (e.g. `production_3/`):

| File | Purpose |
|------|---------|
| `interrupted_state.xml` | Portable OpenMM state (positions, velocities, forces) |
| `interrupted_system.xml` | Serialized OpenMM System (force field parameters) |
| `INTERRUPTED` | Marker file with step-count metadata for recovery |

The `INTERRUPTED` marker contains the information needed for recovery:

```
segment_index=3
steps_completed=1250000
total_steps=2500000
remaining_steps=1250000
```

### Wall-Time Restart Checkpoints

In addition to signal-triggered saves, the simulation loop periodically writes portable restart checkpoints at a configurable wall-time interval (default: 60 seconds):

| File | Purpose |
|------|---------|
| `restart_state.xml` | Portable OpenMM state (overwritten each checkpoint) |
| `restart_system.xml` | Serialized OpenMM System (for self-contained recovery) |

These files serve as a safety net: if the process is killed between signal delivery and the signal handler completing (e.g., the SLURM grace period expires), the restart checkpoint
from the last 60-second interval is still on disk. Recovery prefers portable state XML files over binary `.chk` checkpoints, which is important on heterogeneous clusters where jobs may restart on different GPU hardware.

You can tune the checkpoint interval in your YAML config:

```yaml
simulation_phases:
  production:
    duration: 100.0              # ns
    samples: 2500
    checkpoint_interval: 60.0    # seconds (default)
```

Set to `0` to disable wall-time checkpoints (not recommended on preemptible queues).

### Adaptive Sub-Chunking

OpenMM's `simulation.step(N)` blocks Python for the entire call. With large report intervals (e.g., 200,000 steps), each call can take ~2 minutes on a slow system, leaving no opportunity to check for SIGTERM.

PolyzyMD solves this with **adaptive sub-chunking**: after the first checkpoint interval, it measures actual simulation speed and divides the loop into smaller chunks (~15 seconds each), ensuring the signal flag is checked ~4 times per checkpoint interval. This is transparent: reporters still fire at the original interval, and sub-chunk overhead is negligible (<0.001%).

### Signal Forwarding: Why trap + background + wait?

SLURM sends signals to the **batch shell process**, not to child processes. Bash ignores `SIGUSR1` by default, so without explicit forwarding, the Python simulation never sees the signal.

The generated scripts use a standard pattern to solve this:

```bash
# Background the Python process
polyzymd run-segment -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR" &
CHILD_PID=$!

# Trap signals and forward them to the child
# (forward_signal is defined in the full generated script)
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM

# Wait in a loop (wait is interrupted by trapped signals)
wait "$CHILD_PID"
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
    wait "$CHILD_PID"
    RC=$?
done
```

```{warning}
Do not remove the `trap`, backgrounding (`&`), or `wait` loop from the generated scripts. Without them, signals will not reach the Python process and graceful shutdown will not work.
```

### Manually Triggering an Interrupt

You can test graceful shutdown or manually stop a running simulation by sending `SIGUSR1` via `scancel`:

```bash
# Send USR1 to a specific job
scancel --signal=USR1 <job_id>

# The job will save interrupted state, update progress, and exit with code 99
# The resubmission logic then resubmits the job to continue
```

This is useful when you realize a simulation has a problem and want to stop it cleanly without losing progress.

### Cancelling a Job Permanently

Because `scancel` sends `SIGTERM`, and our scripts treat SIGTERM the same as SIGUSR1 (save state, exit 99, resubmit), a plain `scancel <job_id>` will **not** permanently stop a simulation; the job will save its state and resubmit itself.

To truly cancel a simulation so it does not restart, use `--signal=KILL`:

```bash
# Permanently stop a job (no state saved, no resubmission)
scancel --signal=KILL <job_id>

# Equivalent shorthand
scancel -s KILL <job_id>
```

`SIGKILL` cannot be caught or trapped by bash or Python, so the process dies immediately and the resubmission logic never runs. The last *completed* segment's checkpoint is still intact; only in-progress work since the last checkpoint is lost.

| Command | Saves state? | Resubmits? | Use case |
|---------|-------------|-----------|----------|
| `scancel <job_id>` | Yes | **Yes** | Don't use this to permanently stop a simulation |
| `scancel --signal=USR1 <job_id>` | Yes | **Yes** | Graceful stop (saves progress, continues later) |
| `scancel --signal=KILL <job_id>` | No | **No** | Permanently cancel a simulation |

```{tip}
If you need to cancel **all replicates** of a simulation, cancel them all at once so that none resubmit before you can cancel the others:

    scancel --signal=KILL <job_id_1> <job_id_2> ...

Or cancel all your jobs:

    scancel --signal=KILL -u $USER
```

### Manual Recovery

If a simulation is stalled (e.g., the SLURM job exited without resubmitting), use the `recover` command to inspect status and optionally resume:

```bash
# Check status only
polyzymd recover -c config.yaml -r 1

# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts

# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run
```

See the {ref}`CLI Reference <hpc-recover>` section below for full option details.
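Returning to the advice on cancelling all replicates at once: because each replicate resubmits under the same job name, the job IDs can be collected by name rather than by hand. The sketch below is one way to do that, assuming job names follow the `r1_300K_LipA` style used in this guide's examples. `select_job_ids` is a hypothetical helper, not a PolyzyMD command; it reads `JOBID JOBNAME` pairs on standard input, which is what `squeue -h -o '%i %j'` prints.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the job IDs whose job name matches a glob pattern.
# Input: one "JOBID JOBNAME" pair per line on stdin.
select_job_ids() {
    local pattern=$1 job_id job_name
    while read -r job_id job_name; do
        # Inside [[ ]], an unquoted right-hand side of == is matched as a glob.
        if [[ $job_name == $pattern ]]; then
            echo "$job_id"
        fi
    done
}

# Example (commented out so the snippet has no side effects; xargs -r is a
# GNU extension that skips scancel when no job matches):
#   squeue -u "$USER" -h -o '%i %j' | select_job_ids 'r*_300K_LipA' \
#       | xargs -r scancel --signal=KILL
```

Filtering on standard input keeps the matching logic separate from `squeue` and `scancel`, so the pattern can be checked against a saved job list before anything is cancelled.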
---

## Submitting Jobs

### Dry Run (Recommended First)

Generate scripts without submitting:

```bash
polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --dry-run
```

Inspect generated scripts:

```bash
cat job_scripts/r1_300K_LipA.sh
```

### Submit for Real

```bash
polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --email your.email@university.edu
```

### Replicate Specification

```bash
# Single replicate
--replicates 1

# Range
--replicates 1-5

# Specific replicates
--replicates 1,3,5,7

# Combined
--replicates 1-3,5,7-10
```

---

## Monitoring Jobs

### Check Job Status

```bash
# All your jobs
squeue -u $USER

# Specific job
scontrol show job <job_id>

# Watch jobs update
watch -n 30 'squeue -u $USER'
```

### View Job Output

```bash
# Real-time output
tail -f slurm_logs/r1_300K_LipA*.out

# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out
```

### Check Simulation Progress

Use the `check-progress` command to query the progress file:

```bash
# Check a specific replicate
polyzymd check-progress -c config.yaml -r 1

# Example output:
#   Progress: 12500000/50000000 steps (25.0%), 5 segment(s)
#   Status: in_progress — 75.000 ns remaining
```

Or inspect files directly:

```bash
# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd

# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd
```

---

## Handling Failures

Most failures are handled automatically by the {ref}`smart restart <smart-restart>` system. This section covers the cases where manual intervention is needed.

### Wall-Time or Preemption Interrupts

**No action needed.** The smart restart system saves an interrupted checkpoint, updates the progress file, and the job resubmits itself.
Check the SLURM log to confirm:

```bash
# Look for the graceful shutdown message
grep -i "interrupted\|graceful\|resubmit" slurm_logs/r1_300K_*.out
```

### Hard Crash (OOM, Segfault, Node Failure)

If a job crashes without saving state, the resubmission logic usually still runs, because the crash kills only the child Python process and not the bash wrapper. However, the generated script resubmits only after exit code 0 or 99; any other exit code stops the chain, so a crashed job has to be resumed manually as described below. Once the job is resubmitted, `run-segment` will detect the incomplete segment and handle it appropriately.

To diagnose issues:

1. **Check the error**:

   ```bash
   cat slurm_logs/r1_300K_*.out
   ```

2. **Fix the issue** (e.g. increase memory with `--memory 8G`)

3. **Resume**:

   ```bash
   # Option 1: Resubmit the existing script
   sbatch job_scripts/r1_300K_LipA.sh

   # Option 2: Use recover to inspect and resume (with more memory if needed)
   polyzymd recover -c config.yaml -r 1 --submit --preset aa100 --memory 8G
   ```

### Checking Progress

For a visual overview of **all replicates** at once:

```bash
polyzymd status -c config.yaml
```

Use `check-progress` to check a single replicate (used by SLURM scripts):

```bash
polyzymd check-progress -c config.yaml -r 1
```

Or for a detailed per-segment view with recovery options:

```bash
polyzymd recover -c config.yaml -r 1
```

### Start Fresh

To restart a simulation from scratch:

```bash
# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/

# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100
```

---

## Generated Script Structure

The submit command generates **one script per replicate**. Each script is a self-resubmitting job that handles the entire simulation lifecycle: building, equilibration, production segments, interruptions, and resubmission.
```bash
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=r1_300K_LipA
#SBATCH --output=slurm_logs/r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1
#SBATCH --signal=B:USR1@300
#SBATCH --no-requeue

# =============================================================================
# PolyzyMD Self-Resubmitting Simulation Job
# Generated by polyzymd — do not edit manually
# =============================================================================

# Activate pixi environment
# The manifest path was resolved at submission time from `which polyzymd`.
eval "$(pixi shell-hook -e cuda-12-4 --manifest-path /projects/$USER/polyzymd/pixi.toml)"

set -e

export INTERCHANGE_EXPERIMENTAL=1

# Resolve this script's path for self-resubmission
# ($SLURM_JOB_SCRIPT is only available in SLURM >= 22.05)
THIS_SCRIPT="${SLURM_JOB_SCRIPT:-$(realpath "$0")}"

CONFIG_PATH="/projects/$USER/polyzymd/my_simulation/config.yaml"
REPLICATE=1
WORKING_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

mkdir -p "$WORKING_DIR"

echo "=================================================="
echo "PolyzyMD self-resubmitting job"
echo "Config: $CONFIG_PATH"
echo "Replicate: $REPLICATE"
echo "Work dir: $WORKING_DIR"
echo "Pixi env: cuda-12-4"
echo "Job ID: ${SLURM_JOB_ID:-local}"
echo "Timestamp: $(date)"
echo "=================================================="

# Signal forwarding (see Smart Restart docs)
CHILD_PID=""
forward_signal() {
    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
        echo "Forwarding $1 to Python process (PID $CHILD_PID)"
        kill -"$1" "$CHILD_PID"
    fi
}
trap 'forward_signal USR1' USR1
trap 'forward_signal TERM' TERM

# Run the next segment (backgrounded for signal forwarding)
polyzymd run-segment \
    -c "$CONFIG_PATH" \
    -r "$REPLICATE" \
    --scratch-dir "$WORKING_DIR" &
CHILD_PID=$!

# Wait for the child process
set +e
wait "$CHILD_PID" 2>/dev/null
RC=$?
while kill -0 "$CHILD_PID" 2>/dev/null; do
    wait "$CHILD_PID" 2>/dev/null
    RC=$?
done
set -e

echo "run-segment exited with code $RC at $(date)"

# --- Resubmission logic ---
if [ $RC -ne 0 ] && [ $RC -ne 99 ]; then
    echo "FATAL: run-segment failed (exit code $RC) — NOT resubmitting"
    exit $RC
fi

# Check whether more work remains
set +e
polyzymd check-progress -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR"
PROGRESS_RC=$?
set -e

if [ $PROGRESS_RC -eq 0 ]; then
    echo "Simulation complete — no resubmission needed."
    exit 0
fi

# Work remains — resubmit this same script
echo "Work remains — resubmitting job..."
sbatch "$THIS_SCRIPT"
SUBMIT_RC=$?
if [ $SUBMIT_RC -eq 0 ]; then
    echo "Resubmitted successfully."
else
    echo "WARNING: sbatch resubmission failed (exit code $SUBMIT_RC)"
    echo "You can manually resume with:"
    echo " sbatch $THIS_SCRIPT"
    exit 1
fi

exit 0
```

### Key Features of the Generated Script

| Feature | How It Works |
|---------|-------------|
| **Signal forwarding** | `trap` + background `&` + `wait` loop ensures SIGUSR1/SIGTERM reach the Python process |
| **Unified entry point** | `polyzymd run-segment` handles both initial (build + eq + seg 0) and continuation segments |
| **Progress checking** | `polyzymd check-progress` returns exit code 0 (complete) or 1 (work remains) |
| **Self-resubmission** | `sbatch "$THIS_SCRIPT"` resubmits the exact same script (path resolved at script start via `$SLURM_JOB_SCRIPT` or `realpath "$0"`) |
| **Error handling** | Non-zero, non-99 exit codes abort without resubmitting |

---

## Best Practices

### 1. Always Test First

```bash
# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run

# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:02:00 \
    --replicates 1

# Or a slightly longer test
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:05:00 \
    --replicates 1
```

### 2. Monitor Early Segments

Watch the first segment complete to catch issues early:

```bash
tail -f slurm_logs/r1_300K_*.out
```

### 3. Back Up Important Data

Scratch is often purged. Copy completed simulations to projects:

```bash
# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
    /projects/$USER/completed_simulations/
```

### 4. Use Email Notifications

```bash
polyzymd submit -c config.yaml --email you@university.edu
```

Which events trigger an email depends on the `--mail-type` directive in the generated script; the example script above uses `FAIL`, so it sends mail only when a job fails.

### 5. Segment Duration Guidelines

| Cluster Time Limit | Approximate Production per Segment |
|--------------------|-------------------------------------|
| 1 hour (testing) | 0.5 - 1 ns |
| 24 hours | 8 - 12 ns |
| 48 hours | 20 - 30 ns |
| 7 days | 50 - 100 ns |

---

## CLI Reference

### `polyzymd submit`

Submit self-resubmitting simulation jobs to SLURM.

```bash
polyzymd submit -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicates RANGE` - Replicate range (e.g., "1-5", "1,3,5"). Default: "1"
- `--preset PRESET` - SLURM preset: aa100, al40, blanca-shirts, testing, bridges2. Default: aa100
- `--account ACCOUNT` - HPC allocation account ID (optional on Bridges2, where the allocation is otherwise inferred from your login)
- `--gpu-type TYPE` - GPU type for Bridges2: v100-16, v100-32, l40s-48, h100-80. Default: v100-32
- `--scratch-dir PATH` - Override scratch directory for simulation output
- `--projects-dir PATH` - Override projects directory for scripts/logs
- `--output-dir PATH` - Directory for job scripts.
  Default: `{projects_dir}/job_scripts`
- `--email EMAIL` - Email for job notifications
- `--time-limit TIME` - Override SLURM time limit (HH:MM:SS)
- `--memory SIZE` - Override SLURM memory allocation (e.g., "4G"). Bridges2 omits --mem by default (per-GPU allocation)
- `--openff-logs` - Enable verbose OpenFF logs in generated job scripts (for debugging)
- `--dry-run` - Generate scripts without submitting

### `polyzymd run-segment`

Unified entry point for SLURM jobs. Determines what work remains and runs the next segment.

```bash
polyzymd run-segment -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicate INT` - Replicate number. Default: 1
- `--scratch-dir PATH` - Override scratch directory for simulation output
- `--skip-build` - Skip system building (use existing) for initial segment

**Behavior:**

- If no segments exist: builds system, equilibrates, runs segment 0
- If segments exist but simulation is incomplete: continues from last completed segment
- If simulation is complete: exits 0 immediately

**Exit codes:**

- `0` - Segment completed successfully
- `1` - Error
- `99` - Graceful interruption (wall-time warning or preemption signal)

### `polyzymd check-progress`

Check whether a simulation is complete. Used by SLURM resubmission logic.

```bash
polyzymd check-progress -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicate INT` - Replicate number. Default: 1
- `--scratch-dir PATH` - Override scratch directory

**Exit codes:**

- `0` - Simulation complete (do NOT resubmit)
- `1` - Work remains (resubmit)

(hpc-recover)=
### `polyzymd recover`

Resume a stalled or interrupted simulation. Scans the working directory, loads progress state, and reports how much work remains. With `--submit`, generates and submits a self-resubmitting SLURM job that will automatically continue from the last completed segment.
```bash
polyzymd recover -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicate INT` - Replicate number. Default: 1
- `--scratch-dir PATH` - Override scratch directory
- `--preset PRESET` - SLURM preset for recovery job. Default: aa100
- `--submit / --no-submit` - Submit a recovery job (default: status only)
- `--dry-run` - Show what would be submitted without submitting
- `--memory SIZE` - Override SLURM memory allocation (e.g. '4G', '8G'). Default: 3G

**Examples:**

```bash
# Check status only
polyzymd recover -c config.yaml -r 1

# Submit a recovery job
polyzymd recover -c config.yaml -r 1 --submit --preset blanca-shirts

# Dry-run (show what would be submitted)
polyzymd recover -c config.yaml -r 1 --submit --dry-run
```

**Example output (status only):**

```
Working directory: /scratch/user/sim/LipA_300K_run1
Progress: 12500000/50000000 steps (25.0%)
Status: in_progress
Segments: 5
  segment 0: completed (100%)
  segment 1: completed (100%)
  segment 2: completed (100%)
  segment 3: completed (100%)
  segment 4: interrupted (50%)
Remaining: 75.000 ns (37500000 steps)

To resume, run:
  polyzymd recover -c config.yaml -r 1 --submit --preset aa100
```

---

## Troubleshooting

### "Job pending forever"

```bash
squeue -u $USER  # Check REASON column
```

Common reasons:

- `Resources` - Waiting for GPUs
- `Priority` - Queue is busy

### "pixi: command not found" in job

If you see errors like "pixi: command not found" in your job output, pixi is not available in the non-interactive SLURM shell. Ensure pixi is installed and on PATH for non-interactive shells:

```bash
# Check pixi is installed
which pixi

# If not installed, install it:
curl -fsSL https://pixi.sh/install.sh | sh
source ~/.bashrc
```

If pixi is installed but only available in interactive shells (e.g., it was added to `~/.bashrc` inside an `if [ -z "$PS1" ]` guard), move the PATH addition outside the guard so that SLURM jobs can find it.
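One way to surface this failure early is a preflight check placed near the top of a job script, before the `pixi shell-hook` line runs. This is a sketch only, not something the generated scripts contain, and `require_command` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail with a readable message if a command is missing.
require_command() {
    local cmd=$1
    if ! command -v "$cmd" >/dev/null 2>&1; then
        # Print the PATH too, since a missing PATH entry is the usual cause.
        echo "ERROR: '$cmd' not found on PATH ($PATH)" >&2
        return 1
    fi
}

# Example (commented out): stop the job before any simulation work starts.
#   require_command pixi || exit 1
```

Printing `$PATH` in the error message makes it easier to compare the job's environment with the interactive shell where `which pixi` succeeds.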
### "Out of memory"

There are two types of out-of-memory errors:

**GPU Memory (CUDA OOM):**

```
CUDA out of memory
```

Reduce system size:

- Decrease `box.padding`
- Use fewer polymers
- Use smaller production `samples` (fewer frames saved)

**System Memory (SLURM OOM):**

```
slurmstepd: error: Detected 1 oom_kill event in StepId=...
```

The job exceeded its RAM allocation. This often occurs during energy minimization when loading large systems onto the GPU.

**Solution:** Increase memory with the `--memory` flag:

```bash
# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G

# For very large systems
polyzymd submit -c config.yaml --memory 8G

# Also works with recover --submit
polyzymd recover -c config.yaml -r 1 --submit --memory 8G
```

### "GPU not detected"

Check:

```bash
nvidia-smi  # In job script
```

Make sure the GPU directive is present in the generated script:

- Alpine presets: `#SBATCH --gres=gpu:1`
- Bridges2 preset: `#SBATCH --gpus=v100-32:1` (or your selected GPU type)

### "config.yaml not found"

The self-resubmitting script stores an absolute path to the config file. Make sure the config file has not been moved or deleted since job submission:

```bash
# Check the path in the generated script
grep CONFIG_PATH job_scripts/r1_300K_LipA.sh
```

If the config was moved, either move it back or regenerate and resubmit:

```bash
polyzymd submit -c /new/path/to/config.yaml --preset aa100 --replicates 1
```
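The same check can be scripted across many replicates. The sketch below assumes each generated script contains a line of the form `CONFIG_PATH="..."`, as in the example script in this guide. `config_path_from_script` is a hypothetical helper, and it does not expand variables such as `$USER` that may appear in the stored path.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of the first CONFIG_PATH="..."
# assignment found in the given job script.
config_path_from_script() {
    sed -n 's/^CONFIG_PATH="\(.*\)"$/\1/p' "$1" | head -n 1
}

# Example (commented out): report job scripts whose config file is missing.
#   for script in job_scripts/*.sh; do
#       path=$(config_path_from_script "$script")
#       [ -f "$path" ] || echo "missing config for $script: $path"
#   done
```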