# HPC and SLURM Guide

This guide covers running PolyzyMD simulations on HPC clusters using SLURM.

## Overview

Long MD simulations often exceed HPC time limits (typically 24-48 hours). PolyzyMD solves this with **daisy-chaining**: breaking simulations into segments that run as dependent SLURM jobs.

## User Workflow

The typical workflow for running a PolyzyMD simulation on an HPC cluster is:

1. **Create a simulation directory** with your configuration and input files
2. **Write a `config.yaml`** file with your simulation parameters
3. **Generate job scripts** using `polyzymd submit --dry-run`
4. **Review** the generated scripts
5. **Submit the jobs** for real

### Step-by-Step Example

```bash
# 1. Create your simulation directory
mkdir -p my_simulation/structures
cd my_simulation

# 2. Copy your input files
cp /path/to/enzyme.pdb structures/
cp /path/to/substrate.sdf structures/
# If using pre-built polymers:
cp -r /path/to/polymer_sdfs ./ATRP_EGPMA_SBMA_5-mer/

# 3. Create your config.yaml (see Configuration Guide)
# The config should reference paths relative to this directory:
#   enzyme.pdb_path: "structures/enzyme.pdb"
#   substrate.sdf_path: "structures/substrate.sdf"
#   polymers.sdf_directory: "ATRP_EGPMA_SBMA_5-mer"

# 4. Test with a dry run first
polyzymd submit -c config.yaml --preset testing --dry-run

# 5. Review the generated scripts
cat job_scripts/initial_seg0_rep1.sh

# 6. Submit for real (quick test first)
polyzymd submit -c config.yaml --preset testing --time-limit 0:05:00 --replicates 1

# 7. Once testing passes, submit production jobs
polyzymd submit -c config.yaml --preset aa100 --replicates 1-5 --email your@email.edu
```

---

## Directory Structure

PolyzyMD supports separating:

- **Projects directory**: Long-term storage for scripts, logs, configs
- **Scratch directory**: High-performance storage for trajectories

```
/projects/$USER/polyzymd/                    # Long-term storage
├── my_simulation/                           # Your simulation directory
│   ├── config.yaml                          # Main configuration
│   ├── structures/                          # Input structure files
│   │   ├── enzyme.pdb
│   │   └── substrate.sdf
│   ├── ATRP_EGPMA_SBMA_5-mer/               # Pre-built polymer SDFs (optional)
│   │   ├── EGPMA-SBMA_AAAAA_5-mer_charged.sdf
│   │   └── ...
│   ├── job_scripts/                         # Generated SLURM scripts
│   │   ├── initial_seg0_rep1.sh
│   │   ├── continue_seg1_rep1.sh
│   │   └── ...
│   └── slurm_logs/                          # Job output files
│       └── s0_r1_300K_LipA.out

/scratch/alpine/$USER/polyzymd_sims/         # High-performance storage
├── LipA_Substrate_EGPMA-SBMA_10pct_300K_run1/
│   ├── system.pdb
│   ├── equilibration/
│   │   └── trajectory.dcd
│   ├── production_seg0/
│   │   ├── trajectory.dcd
│   │   ├── checkpoint.chk
│   │   └── state_data.csv
│   └── production_seg1/
│       └── ...
└── LipA_Substrate_EGPMA-SBMA_10pct_300K_run2/
```

## Configuring Directories

### In YAML Configuration

Environment variables (`$USER`, `$HOME`, etc.)
and `~` are automatically expanded:

```yaml
output:
  projects_directory: "/projects/$USER/polyzymd/my_simulation"
  scratch_directory: "/scratch/alpine/$USER/polyzymd_sims"
  job_scripts_subdir: "job_scripts"
  slurm_logs_subdir: "slurm_logs"
```

You can also use `~` for the home directory:

```yaml
output:
  projects_directory: "~/polyzymd/my_simulation"
```

### Via CLI Override

```bash
polyzymd submit -c config.yaml \
    --projects-dir /projects/$USER/polyzymd \
    --scratch-dir /scratch/alpine/$USER/simulations \
    --replicates 1-5
```

---

## SLURM Presets

PolyzyMD includes presets for common HPC configurations:

| Preset | Partition | GPUs | Time Limit | Memory | Description |
|--------|-----------|------|------------|--------|-------------|
| `aa100` | aa100 | 1x A100 | 24h | 3GB | NVIDIA A100 (recommended) |
| `al40` | al40 | 1x L40 | 24h | 3GB | NVIDIA L40 |
| `blanca-shirts` | blanca-shirts | 1x | 24h | 3GB | Shirts lab partition |
| `testing` | atesting_a100 | 1x | 6min | 3GB | Quick tests |

### Using Presets

```bash
# Use A100 GPUs
polyzymd submit -c config.yaml --preset aa100

# Use testing partition for quick tests
polyzymd submit -c config.yaml --preset testing
```

### Overriding Time Limit

You can override the preset's time limit using `--time-limit`:

```bash
# Use testing preset with a 2-minute time limit
polyzymd submit -c config.yaml --preset testing --time-limit 0:02:00

# Use A100 with a 12-hour limit instead of 24h
polyzymd submit -c config.yaml --preset aa100 --time-limit 12:00:00
```

**Time format options:**

- `MM:SS` - minutes and seconds (e.g., `2:00` for 2 minutes)
- `HH:MM:SS` - hours, minutes, seconds (e.g., `0:02:00`)
- `D-HH:MM:SS` - days, hours, minutes, seconds (e.g., `1-00:00:00` for 1 day)

This is especially useful for:

- Quick testing with short time limits
- Adjusting for segment duration requirements
- Working within specific QOS constraints

### Custom SLURM Settings

For custom configurations, edit the generated scripts in `job_scripts/` before
submitting.

---

## Daisy-Chain Workflow

### How It Works

1. **Initial job**: Builds system, runs equilibration, runs first production segment
2. **Continuation jobs**: Load checkpoint, run next segment
3. **Dependencies**: Each job depends on the previous one completing successfully

```
Job 1 (initial)          Job 2 (continue)         Job 3 (continue)
┌─────────────┐          ┌─────────────┐          ┌─────────────┐
│ Build       │          │ Load chkpt  │          │ Load chkpt  │
│ Equilibrate │   -->    │ Run seg 1   │   -->    │ Run seg 2   │
│ Run seg 0   │          │ Save chkpt  │          │ Save chkpt  │
└─────────────┘          └─────────────┘          └─────────────┘
```

### Configuring Segments

```yaml
simulation_phases:
  production:
    duration: 100.0  # 100 ns total
    segments: 10     # 10 segments of 10 ns each
```

```{tip}
**Segment duration** = total duration / segments

Choose segment duration to fit within your cluster's time limit with margin:
- 24h limit → ~8-10 ns segments (2h GPU time + overhead)
- 48h limit → ~20 ns segments
```

---

## Submitting Jobs

### Dry Run (Recommended First)

Generate scripts without submitting:

```bash
polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --dry-run
```

Inspect generated scripts:

```bash
cat job_scripts/initial_seg0_rep1.sh
```

### Submit for Real

```bash
polyzymd submit -c config.yaml \
    --replicates 1-3 \
    --preset aa100 \
    --email your.email@university.edu
```

### Replicate Specification

```bash
# Single replicate
--replicates 1

# Range
--replicates 1-5

# Specific replicates
--replicates 1,3,5,7

# Combined
--replicates 1-3,5,7-10
```

---

## Monitoring Jobs

### Check Job Status

```bash
# All your jobs
squeue -u $USER

# Specific job
scontrol show job <job_id>

# Watch jobs update
watch -n 30 'squeue -u $USER'
```

### View Job Output

```bash
# Real-time output
tail -f slurm_logs/s0_r1_300K_LipA*.out

# Check for errors
grep -i error slurm_logs/*.out
grep -i fail slurm_logs/*.out
```

### Check Simulation Progress

```bash
# List trajectory files
ls -la /scratch/alpine/$USER/polyzymd_sims/*/production_*/trajectory.dcd

# Check trajectory sizes
du -h /scratch/alpine/$USER/polyzymd_sims/*/production_*/*.dcd
```

---

## Handling Failures

### Job Failed Mid-Segment

If a job fails, the dependent jobs won't start. To restart:

1. **Check the error**:
   ```bash
   cat slurm_logs/s2_r1_300K_*.out
   ```

2. **Fix the issue** (if possible)

3. **Manually continue**:
   ```bash
   # Edit and resubmit the continuation script
   sbatch job_scripts/continue_seg2_rep1.sh
   ```

### Start Fresh

To restart a simulation from scratch:

```bash
# Remove old output
rm -rf /scratch/alpine/$USER/polyzymd_sims/LipA_*_run1/

# Resubmit
polyzymd submit -c config.yaml --replicates 1 --preset aa100
```

---

## Generated Script Structure

### Initial Script

The initial job script (segment 0) builds the system, runs equilibration, and runs the first production segment:

```bash
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=i_s0_r1_300K_LipA
#SBATCH --output=slurm_logs/s0_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1

# Exit immediately if any command fails
set -e

module purge
module load miniforge
mamba activate polymerist-env

# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"

# Scratch directory (simulation output)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

# Ensure scratch directory exists
mkdir -p "$SCRATCH_DIR"

# Change to projects directory where config lives
cd "$PROJECTS_DIR"

echo "Starting initial simulation segment 0"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Config: config.yaml"
echo "Replicate: 1"
echo "Timestamp: $(date)"

# Run the initial simulation using polyzymd CLI
polyzymd run -c "config.yaml" \
    --replicate 1 \
    --scratch-dir "$SCRATCH_DIR" \
    --segment-time 10.0 \
    --segment-frames 250

echo "Segment 0 completed successfully at $(date)"
```

### Continuation Script

Continuation scripts load the checkpoint from the previous segment and continue the simulation:

```bash
#!/bin/bash
#SBATCH --partition=aa100
#SBATCH --job-name=c_s1_r1_300K_LipA
#SBATCH --output=slurm_logs/s1_r1_300K_LipA.%A_%a.out
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=3G
#SBATCH --time=23:59:59
#SBATCH --gres=gpu:1
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=your@email.edu
#SBATCH --account=ucb625_asc1

# Exit immediately if any command fails
set -e

module purge
module load miniforge
mamba activate polymerist-env

# Projects directory (scripts, configs, logs)
PROJECTS_DIR="/projects/$USER/polyzymd/my_simulation"

# Scratch directory (simulation output - where previous segment data lives)
SCRATCH_DIR="/scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1"

# Change to projects directory
cd "$PROJECTS_DIR"

echo "Starting continuation segment 1"
echo "Projects dir: $PROJECTS_DIR"
echo "Scratch dir: $SCRATCH_DIR"
echo "Timestamp: $(date)"

# Continue simulation from previous segment using polyzymd CLI
polyzymd continue \
    -w "$SCRATCH_DIR" \
    -s 1 \
    -t 10.0 \
    -n 250

echo "Segment 1 completed successfully at $(date)"
```

---

## Best Practices

### 1. Always Test First

```bash
# Generate scripts without submitting (dry run)
polyzymd submit -c config.yaml --preset testing --dry-run

# Quick test with 2-minute time limit
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:02:00 \
    --replicates 1

# Or a slightly longer test
polyzymd submit -c config.yaml \
    --preset testing \
    --time-limit 0:05:00 \
    --replicates 1
```

### 2. Monitor Early Segments

Watch the first segment complete to catch issues early:

```bash
tail -f slurm_logs/s0_*.out
```

### 3. Back Up Important Data

Scratch is often purged. Copy completed simulations to projects:

```bash
# After simulation completes
cp -r /scratch/alpine/$USER/polyzymd_sims/LipA_300K_run1 \
    /projects/$USER/completed_simulations/
```

### 4. Use Email Notifications

```bash
polyzymd submit -c config.yaml --email you@university.edu
```

With the `--mail-type=FAIL` setting shown in the generated scripts above, you'll receive an email when a job fails.

### 5. Segment Duration Guidelines

| Cluster Time Limit | Recommended Segment Duration |
|--------------------|------------------------------|
| 1 hour (testing) | 0.5 - 1 ns |
| 24 hours | 8 - 12 ns |
| 48 hours | 20 - 30 ns |
| 7 days | 50 - 100 ns |

---

## CLI Reference

### `polyzymd submit`

Submit daisy-chain simulation jobs to SLURM.

```bash
polyzymd submit -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicates RANGE` - Replicate range (e.g., "1-5", "1,3,5"). Default: "1"
- `--preset PRESET` - SLURM preset: aa100, al40, blanca-shirts, testing. Default: aa100
- `--scratch-dir PATH` - Override scratch directory for simulation output
- `--projects-dir PATH` - Override projects directory for scripts/logs
- `--output-dir PATH` - Directory for job scripts. Default: {projects_dir}/job_scripts
- `--email EMAIL` - Email for job notifications
- `--time-limit TIME` - Override SLURM time limit (`MM:SS`, `HH:MM:SS`, or `D-HH:MM:SS`)
- `--memory SIZE` - Override SLURM memory allocation (e.g., "4G", "8G"). Default: 3G
- `--openff-logs` - Enable verbose OpenFF logs in generated job scripts (for debugging)
- `--dry-run` - Generate scripts without submitting

### `polyzymd run`

Run a complete simulation (build + equilibration + first production segment).

```bash
polyzymd run -c CONFIG [OPTIONS]
```

**Required:**

- `-c, --config PATH` - Path to YAML configuration file

**Options:**

- `-r, --replicate INT` - Replicate number.
  Default: 1
- `--scratch-dir PATH` - Scratch directory for simulation output
- `--projects-dir PATH` - Projects directory for scripts/logs
- `--segment-time FLOAT` - Override production time per segment (ns)
- `--segment-frames INT` - Override frames per segment
- `--skip-build` - Skip system building (use existing)

### `polyzymd continue`

Continue a simulation from a previous segment checkpoint.

```bash
polyzymd continue -w WORKING_DIR -s SEGMENT -t TIME [OPTIONS]
```

**Required:**

- `-w, --working-dir PATH` - Working directory with previous segment
- `-s, --segment INT` - Segment index to run (segment 0 is the initial run, so continuation starts at 1)
- `-t, --segment-time FLOAT` - Duration of this segment (ns)

**Options:**

- `-n, --num-samples INT` - Number of frames to save. Default: 250

---

## Troubleshooting

### "Job pending forever"

```bash
squeue -u $USER
# Check REASON column
```

Common reasons:

- `Resources` - Waiting for GPUs
- `Priority` - Queue is busy
- `Dependency` - Waiting for previous job

### "Module not found in job"

If you see errors like "Lmod has detected the following error: The following module(s) are unknown", the cluster may have different module configurations on different partitions.

Check available modules:

```bash
module spider miniforge
module spider anaconda
```

You may need to edit the generated scripts to use a different module:

```bash
# Instead of:
module load miniforge
mamba activate polymerist-env

# Try:
module load anaconda
conda activate polymerist-env

# Or source conda directly:
source /curc/sw/anaconda3/latest/etc/profile.d/conda.sh
conda activate polymerist-env
```

### "Out of memory"

There are two types of out-of-memory errors:

**GPU Memory (CUDA OOM):**

```
CUDA out of memory
```

Reduce system size:

- Decrease `box.padding`
- Use fewer polymers
- Use smaller production `samples` (fewer frames saved)

**System Memory (SLURM OOM):**

```
slurmstepd: error: Detected 1 oom_kill event in StepId=...
```

The job exceeded its RAM allocation.
This often occurs during energy minimization when loading large systems onto the GPU.

**Solution:** Increase memory with the `--memory` flag:

```bash
# Default is 3G, increase for larger systems
polyzymd submit -c config.yaml --memory 4G

# For very large systems
polyzymd submit -c config.yaml --memory 8G
```

### "GPU not detected"

Check:

```bash
nvidia-smi  # In job script
```

Make sure `#SBATCH --gres=gpu:1` is present.

### "config.yaml not found"

Make sure you're running `polyzymd submit` from the directory containing your `config.yaml`, or use an absolute path:

```bash
polyzymd submit -c /full/path/to/config.yaml --preset aa100
```
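
A small pre-flight check can catch a missing configuration file before anything is submitted. The sketch below is a hypothetical helper: `resolve_config` is not part of PolyzyMD, and the name is an assumption made for illustration. It turns a config path into an absolute path and fails early if the file does not exist.

```bash
# Hypothetical helper (not part of PolyzyMD): print the absolute path of a
# config file, or return non-zero with a message if the file is missing.
resolve_config() {
    local cfg="$1" dir
    if [ ! -f "$cfg" ]; then
        echo "error: config file not found: $cfg" >&2
        return 1
    fi
    # Build the absolute path from the file's directory and base name.
    dir=$(cd "$(dirname "$cfg")" && pwd) || return 1
    printf '%s/%s\n' "$dir" "$(basename "$cfg")"
}
```

One way to use it is `CONFIG=$(resolve_config config.yaml) && polyzymd submit -c "$CONFIG" --preset aa100 --dry-run`, which only reaches `polyzymd submit` when the file exists.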