Workflow Module
Job Submitter
Job submission for HPC SLURM scheduler.
This module provides utilities for submitting self-resubmitting MD
simulation jobs to SLURM. Each replicate gets a single job script
that calls polyzymd run-segment, checks progress, and resubmits
itself until the simulation is complete.
Changed in version 1.1.0: Replaced the legacy daisy-chain (dependency-chain) model with
self-resubmitting jobs. The public API (submit_daisy_chain,
DaisyChainConfig, DaisyChainSubmitter) is preserved for
backward compatibility but the internal behaviour is simplified.
- polyzymd.workflow.daisy_chain.check_existing_slurm_jobs(job_name)[source]
Query SLURM for RUNNING or PENDING jobs that match job_name.
This is a best-effort check: if squeue is unavailable (e.g. in a non-SLURM environment or CI), a warning is logged and an empty list is returned so that submission proceeds unimpeded.
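The best-effort behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name find_active_jobs and its squeue_cmd parameter are hypothetical, and the choice of squeue options is an assumption about how such a query could be written.

```python
import logging
import subprocess
from typing import List

logger = logging.getLogger(__name__)


def find_active_jobs(job_name: str, squeue_cmd: str = "squeue") -> List[str]:
    """Return IDs of RUNNING or PENDING jobs named ``job_name`` (best effort).

    Hypothetical stand-in for check_existing_slurm_jobs; the real function
    may build its query differently.
    """
    cmd = [
        squeue_cmd,
        f"--name={job_name}",        # filter by job name
        "--states=RUNNING,PENDING",  # only active jobs
        "--noheader",                # no column header in the output
        "--format=%i",               # print the job ID only
    ]
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # squeue missing, failing, or timing out: warn and let submission proceed.
        logger.warning("squeue unavailable (%s); skipping duplicate-job check", exc)
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]
```

Returning an empty list on failure means a missing scheduler never blocks a dry run or a CI job, at the cost of not detecting duplicates in that situation.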
- polyzymd.workflow.daisy_chain.create_job_name(sim_config, replicate)[source]
Create a descriptive SLURM job name for a replicate.
Produces names like r1_310K_Fibronectin_SBMA-OEGMA_A75_B25, matching the directory naming convention.
- Parameters:
sim_config (SimulationConfig) – Validated simulation configuration.
replicate (int) – Replicate number.
- Returns:
Formatted job name.
- Return type:
- class polyzymd.workflow.daisy_chain.DaisyChainConfig(slurm_config, total_production_time_ns, total_samples=2500, equilibration_time_ns=0.5, replicates=<factory>, dry_run=False, force=False, output_script_dir=PosixPath('daisy_chain_scripts'), config_path='config.yaml')[source]
Bases: object
Configuration for job submission.
Despite the legacy name, this now configures single self-resubmitting jobs (one per replicate) rather than dependency chains.
- Variables:
slurm_config (SlurmConfig) – SLURM job configuration.
total_production_time_ns (float) – Total production time in nanoseconds.
total_samples (int) – Total trajectory frames across the entire production run.
equilibration_time_ns (float) – Equilibration time (informational only).
dry_run (bool) – If True, create scripts but don’t submit.
force (bool) – If True, skip the squeue duplicate-job check and submit even if a RUNNING/PENDING job already exists for the same replicate.
output_script_dir (Path) – Directory for generated job scripts.
config_path (str) – Path to the YAML configuration file.
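As a worked example of how total_production_time_ns and total_samples relate, the sketch below computes the spacing between saved frames. It assumes frames are spread evenly over the production run, which the documentation above does not state explicitly; the helper name frame_interval_ps and the 100 ns run length are illustrative only.

```python
def frame_interval_ps(total_production_time_ns: float, total_samples: int) -> float:
    """Time between saved frames in picoseconds, assuming uniform spacing."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    # 1 ns = 1000 ps
    return total_production_time_ns * 1000.0 / total_samples


# An illustrative 100 ns run with the default 2500 samples
# saves one frame every 40 ps under this assumption.
interval = frame_interval_ps(100.0, 2500)
```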
- slurm_config: SlurmConfig
- classmethod from_simulation_config(sim_config, slurm_config, replicates='1', dry_run=False, force=False, output_script_dir='daisy_chain_scripts', config_path='config.yaml')[source]
Create DaisyChainConfig from a SimulationConfig.
- Parameters:
sim_config (SimulationConfig) – Simulation configuration.
slurm_config (SlurmConfig) – SLURM configuration.
replicates (str or list of int) – Replicate range string (e.g. "1-5") or list of ints.
dry_run (bool) – If True, don’t submit jobs.
force (bool) – If True, skip duplicate-job check.
output_script_dir (str or Path) – Directory for job scripts.
config_path (str) – Path to the YAML configuration file.
- Returns:
Configured instance.
- Return type:
- __init__(slurm_config, total_production_time_ns, total_samples=2500, equilibration_time_ns=0.5, replicates=<factory>, dry_run=False, force=False, output_script_dir=PosixPath('daisy_chain_scripts'), config_path='config.yaml')
- class polyzymd.workflow.daisy_chain.SubmissionResult(job_id, script_path, segment_index, replicate, is_dry_run=False)[source]
Bases: object
Result of job submission.
- Variables:
- __init__(job_id, script_path, segment_index, replicate, is_dry_run=False)
- class polyzymd.workflow.daisy_chain.DaisyChainSubmitter(sim_config, dc_config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)[source]
Bases: object
Handles job submission for MD simulations.
In the self-resubmitting model, each replicate gets a single job script. The script calls polyzymd run-segment, checks progress, and resubmits itself until the simulation is complete.
Example
>>> sim_config = SimulationConfig.from_yaml("config.yaml")
>>> slurm_config = SlurmConfig.from_preset("aa100", email="user@example.com")
>>> dc_config = DaisyChainConfig.from_simulation_config(
...     sim_config, slurm_config, replicates="1-3"
... )
>>> submitter = DaisyChainSubmitter(sim_config, dc_config)
>>> results = submitter.submit_all()
- __init__(sim_config, dc_config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)[source]
Initialize the submitter.
- Parameters:
sim_config (SimulationConfig) – Simulation configuration.
dc_config (DaisyChainConfig) – Submission configuration.
pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").
openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.
skip_build (bool) – Skip system building in generated scripts.
- property sim_config: SimulationConfig
Get the simulation configuration.
- property dc_config: DaisyChainConfig
Get the submission configuration.
- property job_chains: Dict[int, List[SubmissionResult]]
Get the submission results for all replicates.
- submit_replicate(replicate)[source]
Generate and submit the job for a single replicate.
Before submitting, checks squeue for existing RUNNING/PENDING jobs with the same job name. If duplicates are found and force is not set, raises RuntimeError.
- Parameters:
replicate (int) – Replicate number.
- Returns:
Submission result.
- Return type:
- Raises:
RuntimeError – If a SLURM job is already RUNNING or PENDING for this replicate and force is False.
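The duplicate-job guard described for submit_replicate can be sketched as below. The helper ensure_no_duplicate is hypothetical and only mirrors the documented rule (raise RuntimeError when an active job exists and force is False); the error message wording is an assumption.

```python
from typing import Sequence


def ensure_no_duplicate(
    job_name: str, active_job_ids: Sequence[str], force: bool = False
) -> None:
    """Raise RuntimeError if the replicate already has an active job.

    ``active_job_ids`` stands for whatever the squeue check returned;
    an empty sequence means no RUNNING/PENDING job was found.
    """
    if active_job_ids and not force:
        raise RuntimeError(
            f"Job '{job_name}' is already RUNNING or PENDING "
            f"(job IDs: {', '.join(active_job_ids)}); "
            "set force=True to submit anyway."
        )
```

Because the squeue check returns an empty list when the scheduler cannot be queried, this guard never blocks submission in that case.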
- polyzymd.workflow.daisy_chain.submit_daisy_chain(config_path, slurm_preset='aa100', replicates='1', email='', dry_run=False, force=False, pixi_env='cuda-12-4', output_dir=None, scratch_dir=None, projects_dir=None, time_limit=None, memory=None, account=None, gpu_type=None, openff_logs=False, skip_build=False)[source]
Submit self-resubmitting simulation jobs from a YAML config.
This is the main entry point called by polyzymd submit. Despite the legacy function name, it now submits one self-resubmitting job per replicate rather than a chain of dependent jobs.
- Parameters:
config_path (str or Path) – Path to simulation YAML config.
slurm_preset (str) – SLURM preset name (aa100, al40, blanca-shirts, bridges2, testing).
replicates (str) – Replicate range string (e.g. "1-5", "1,3,5").
email (str) – Email for job notifications.
dry_run (bool) – If True, don’t submit jobs.
force (bool) – If True, skip the squeue duplicate-job check.
pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").
output_dir (str or Path or None) – Directory for job scripts.
scratch_dir (str or Path or None) – Override scratch directory for simulation output.
projects_dir (str or Path or None) – Override projects directory for scripts/logs.
time_limit (str or None) – Override SLURM time limit (format: HH:MM:SS).
memory (str or None) – Override SLURM memory allocation (e.g. "4G").
account (str or None) – Override SLURM account / allocation ID.
gpu_type (str or None) – Override GPU type for presets that use the --gpus directive.
openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.
skip_build (bool) – Skip system building in generated scripts.
- Returns:
Mapping of replicate numbers to submission results.
- Return type:
- Raises:
ValueError – If the SLURM account is empty on a preset that requires one and dry_run is False.
SLURM Configuration
SLURM job script generation for HPC cluster submission.
This module provides templates and utilities for generating SLURM batch scripts for self-resubmitting MD simulation jobs.
Changed in version 1.1.0: Replaced conda/module-load environment activation with pixi.
The module_load and conda_command fields on SlurmConfig
have been removed. Environment activation is now handled by
pixi shell-hook using the pixi_env parameter on
SlurmScriptGenerator.
- class polyzymd.workflow.slurm.SlurmConfig(partition='aa100', qos='normal', account='ucb625_asc1', time_limit='23:59:59', email='', nodes=1, ntasks=1, memory='3G', gpus=1, exclude=None, gpu_type=None, gpu_directive_style='gres')[source]
Bases: object
Configuration for SLURM job submission.
- Variables:
partition (str) – SLURM partition(s) to use.
qos (str) – Quality of service. Set to "" to omit the --qos directive entirely (required for clusters such as Bridges2 that do not use QoS).
account (str) – Account / allocation ID for resource allocation. Set to "" to omit the --account directive entirely (e.g. Bridges2, which infers the allocation from the submitting user’s login).
time_limit (str) – Wall time limit (HH:MM:SS).
email (str) – Email address for SLURM failure notifications. Set to "" to omit both --mail-type and --mail-user directives.
nodes (int) – Number of nodes.
ntasks (int) – Number of tasks. Ignored when gpu_directive_style == "gpus" (Bridges2-style); those scripts emit #SBATCH -N {nodes} only.
memory (str | None) – Memory allocation (e.g. "3G"). Set to None to omit the --mem directive entirely (some clusters allocate memory per GPU and reject an explicit --mem request).
gpus (int) – Number of GPUs.
exclude (str | None) – Nodes to exclude (omitted when None).
gpu_type (str | None) – Optional GPU type string used with the --gpus directive (e.g. "v100-32" for Bridges2). When None the classic --gres=gpu:<N> directive is emitted instead.
gpu_directive_style (str) – "gres" (default, Alpine-style) or "gpus" (Bridges2-style). Controls which SBATCH GPU directive is written. Also governs which nodes/ntasks format is emitted.
- classmethod from_preset(preset, email='')[source]
Create a SlurmConfig from a named preset.
- Parameters:
- Returns:
SlurmConfig with preset values.
- Return type:
- __init__(partition='aa100', qos='normal', account='ucb625_asc1', time_limit='23:59:59', email='', nodes=1, ntasks=1, memory='3G', gpus=1, exclude=None, gpu_type=None, gpu_directive_style='gres')
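The omission rules documented for the SlurmConfig fields (empty string or None means "do not emit this directive") can be illustrated with the sketch below. DirectiveOptions and render_directives are hypothetical stand-ins, not part of polyzymd, and the exact text of each emitted line (for example --mail-type=FAIL and the --nodes/--ntasks form for the "gres" style) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirectiveOptions:
    """Hypothetical stand-in mirroring a subset of the SlurmConfig fields."""

    qos: str = "normal"
    account: str = ""
    email: str = ""
    nodes: int = 1
    ntasks: int = 1
    memory: Optional[str] = "3G"
    gpus: int = 1
    gpu_type: Optional[str] = None
    gpu_directive_style: str = "gres"


def render_directives(opts: DirectiveOptions) -> List[str]:
    """Build #SBATCH lines, skipping any directive whose value is empty/None."""
    lines: List[str] = []
    if opts.qos:
        lines.append(f"#SBATCH --qos={opts.qos}")
    if opts.account:
        lines.append(f"#SBATCH --account={opts.account}")
    if opts.email:
        # Assumed spelling of the failure-notification directives.
        lines.append("#SBATCH --mail-type=FAIL")
        lines.append(f"#SBATCH --mail-user={opts.email}")
    if opts.memory is not None:
        lines.append(f"#SBATCH --mem={opts.memory}")
    if opts.gpu_directive_style == "gpus":
        # Bridges2-style: nodes only, ntasks ignored, typed --gpus request.
        lines.append(f"#SBATCH -N {opts.nodes}")
        spec = f"{opts.gpu_type}:{opts.gpus}" if opts.gpu_type else str(opts.gpus)
        lines.append(f"#SBATCH --gpus={spec}")
    else:
        # Alpine-style: assumed --nodes/--ntasks form plus classic --gres.
        lines.append(f"#SBATCH --nodes={opts.nodes}")
        lines.append(f"#SBATCH --ntasks={opts.ntasks}")
        lines.append(f"#SBATCH --gres=gpu:{opts.gpus}")
    return lines


bridges2_like = DirectiveOptions(
    qos="", account="", memory=None, gpu_type="v100-32", gpu_directive_style="gpus"
)
print("\n".join(render_directives(bridges2_like)))
```

Treating "empty" as "omit" keeps one configuration class usable across clusters that reject directives they do not support.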
- class polyzymd.workflow.slurm.JobContext(job_name, output_file, scratch_dir, projects_dir='.', segment_index=0, replicate_num=1, extra_vars=<factory>)[source]
Bases: object
Context for job script template rendering.
- Variables:
job_name (str) – SLURM job name.
output_file (str) – Output file pattern (for SLURM logs).
scratch_dir (str) – Directory for simulation output (trajectories, checkpoints).
projects_dir (str) – Directory for scripts and logs.
segment_index (int) – Current segment index.
replicate_num (int) – Replicate number.
extra_vars (Dict) – Additional template variables.
- __init__(job_name, output_file, scratch_dir, projects_dir='.', segment_index=0, replicate_num=1, extra_vars=<factory>)
- class polyzymd.workflow.slurm.SlurmScriptGenerator(config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)[source]
Bases: object
Generator for SLURM batch scripts.
Supports separate directories for:
- projects_dir: Where scripts live and jobs are submitted from
- scratch_dir: Where simulation output goes (trajectories, checkpoints)
Example
>>> config = SlurmConfig.from_preset("aa100", email="user@example.com")
>>> generator = SlurmScriptGenerator(config)
>>> script = generator.generate_job_script(
...     config_path="/projects/user/config.yaml",
...     replicate=1,
...     working_dir="/scratch/user/sim_output",
... )
- JOB_TEMPLATE = '#!/bin/bash\n#SBATCH --partition={partition}\n#SBATCH --job-name={job_name}\n#SBATCH --output={output_file}\n{qos_line}\n{nodes_line}\n{mem_line}\n#SBATCH --time={time_limit}\n{gpu_line}\n{mail_line}\n{account_line}\n{exclude_line}\n#SBATCH --signal=B:USR1@300\n#SBATCH --no-requeue\n\n# =============================================================================\n# PolyzyMD Self-Resubmitting Simulation Job\n# {FULL_CREDIT_LINE}\n# Generated by polyzymd — do not edit manually\n# =============================================================================\n\n# Activate pixi environment\n# The manifest path was resolved at submission time from `which polyzymd`.\neval "$(pixi shell-hook -e {pixi_env} --manifest-path {manifest_path})"\n\n# Enable strict error handling after environment setup\nset -e\n\n# Required for OpenFF Interchange.combine() functionality\nexport INTERCHANGE_EXPERIMENTAL=1\n\n# Resolve this script\'s path for self-resubmission.\n# $SLURM_JOB_SCRIPT is only available in SLURM >= 22.05; fall back to $0.\nTHIS_SCRIPT="${{SLURM_JOB_SCRIPT:-$(realpath "$0")}}"\n\n# Configuration\nCONFIG_PATH="{config_path}"\nREPLICATE={replicate}\nWORKING_DIR="{working_dir}"\n\n# Ensure working directory exists\nmkdir -p "$WORKING_DIR"\n\necho "=================================================="\necho "PolyzyMD self-resubmitting job"\necho "{FULL_CREDIT_LINE}"\necho "Config: $CONFIG_PATH"\necho "Replicate: $REPLICATE"\necho "Work dir: $WORKING_DIR"\necho "Pixi env: {pixi_env}"\necho "Job ID: ${{SLURM_JOB_ID:-local}}"\necho "Timestamp: $(date)"\necho "=================================================="\n\n# =========================================================================\n# Signal forwarding: SLURM sends signals to the batch shell, not to child\n# processes. 
We trap SIGUSR1 (wall-time warning) and SIGTERM (preemption)\n# and forward them to the Python process running in the background.\n# =========================================================================\nCHILD_PID=""\nforward_signal() {{\n if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then\n echo "Forwarding $1 to Python process (PID $CHILD_PID)"\n kill -"$1" "$CHILD_PID"\n fi\n}}\ntrap \'forward_signal USR1\' USR1\ntrap \'forward_signal TERM\' TERM\n\n# Run the next segment (backgrounded for signal forwarding)\npolyzymd{openff_logs_flag} run-segment \\\n -c "$CONFIG_PATH" \\\n -r "$REPLICATE" \\\n --scratch-dir "$WORKING_DIR"{skip_build_flag} &\nCHILD_PID=$!\n\n# Wait for the child; \'wait\' is interrupted by trapped signals, so loop\n# until the child actually exits. Temporarily disable \'set -e\' so we can\n# capture non-zero exit codes (e.g. 99 for graceful shutdown) without the\n# shell exiting prematurely.\nset +e\nwait "$CHILD_PID" 2>/dev/null\nRC=$?\nwhile kill -0 "$CHILD_PID" 2>/dev/null; do\n wait "$CHILD_PID" 2>/dev/null\n RC=$?\ndone\nset -e\n\necho "run-segment exited with code $RC at $(date)"\n\n# =========================================================================\n# Resubmission logic\n# =========================================================================\nif [ $RC -eq 2 ]; then\n echo "CONCURRENT: Another job is already running this replicate — NOT resubmitting."\n echo "This duplicate job chain will now terminate cleanly."\n exit 0\nfi\n\nif [ $RC -ne 0 ] && [ $RC -ne 99 ]; then\n echo "FATAL: run-segment failed (exit code $RC) — NOT resubmitting"\n exit $RC\nfi\n\n# Check whether more work remains\nset +e\npolyzymd check-progress -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR"\nPROGRESS_RC=$?\nset -e\n\nif [ $PROGRESS_RC -eq 0 ]; then\n echo "Simulation complete — no resubmission needed."\n exit 0\nfi\n\nif [ $PROGRESS_RC -ne 1 ]; then\n echo "FATAL: check-progress failed (exit code $PROGRESS_RC) — NOT 
resubmitting"\n exit $PROGRESS_RC\nfi\n\n# Work remains (exit code 1) — resubmit this same script\necho "Work remains — resubmitting job..."\nsbatch "$THIS_SCRIPT"\nSUBMIT_RC=$?\n\nif [ $SUBMIT_RC -eq 0 ]; then\n echo "Resubmitted successfully."\nelse\n echo "WARNING: sbatch resubmission failed (exit code $SUBMIT_RC)"\n echo "You can manually resume with:"\n echo " sbatch $THIS_SCRIPT"\n exit 1\nfi\n\nexit 0\n'
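The resubmission logic at the end of JOB_TEMPLATE depends on two exit codes: the one from run-segment (0 success, 99 graceful shutdown, 2 concurrent run, anything else fatal) and the one from check-progress (0 complete, 1 work remains, anything else fatal). The sketch below restates that decision table in Python for readability. The Action enum and next_action function are illustrative names, not part of polyzymd.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    STOP_DUPLICATE = "stop_duplicate"  # another job owns this replicate; exit 0
    FAIL = "fail"                      # fatal error; do not resubmit
    COMPLETE = "complete"              # simulation finished; exit 0
    RESUBMIT = "resubmit"              # work remains; sbatch the same script


def next_action(run_rc: int, check_progress: Callable[[], int]) -> Action:
    """Mirror the template's branching on exit codes.

    ``check_progress`` is called only when run-segment ended with 0 or 99,
    matching the order of checks in the bash template.
    """
    if run_rc == 2:
        return Action.STOP_DUPLICATE
    if run_rc not in (0, 99):
        return Action.FAIL
    progress_rc = check_progress()
    if progress_rc == 0:
        return Action.COMPLETE
    if progress_rc == 1:
        return Action.RESUBMIT
    return Action.FAIL


# A segment stopped gracefully at the wall-time warning and work remains:
print(next_action(99, lambda: 1))
```

In the template itself, a failed sbatch resubmission additionally prints a manual-resume hint and exits with code 1; that branch is outside this sketch.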
- __init__(config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)[source]
Initialize the generator.
- Parameters:
config (SlurmConfig) – SLURM configuration.
pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").
openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.
skip_build (bool) – Skip system building in generated scripts (use pre-built system).
- property config: SlurmConfig
Get the SLURM configuration.
- generate_job_script(config_path, replicate, working_dir, job_name=None, output_file=None)[source]
Generate a self-resubmitting SLURM job script.
This produces a single script that handles the entire simulation lifecycle. Each invocation calls polyzymd run-segment, which determines what work remains, runs the next segment, and exits. The bash wrapper then checks progress and resubmits itself if more work is needed.
- Parameters:
config_path (str) – Absolute path to the YAML configuration file.
replicate (int) – Replicate number.
working_dir (str) – Directory for simulation output (trajectories, checkpoints).
job_name (str or None, optional) – SLURM job name. Callers should use create_job_name() to produce descriptive names (e.g. r1_310K_Fibronectin_...). Falls back to pzmd_r{replicate} if not provided.
output_file (str or None, optional) – SLURM log file pattern. Falls back to slurm_logs/{job_name}.%j.out relative to the directory where sbatch is invoked.
- Returns:
Complete SLURM batch script content.
- Return type:
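The documented fallbacks for job_name and output_file can be sketched as follows. resolve_names is a hypothetical helper, and treating only None as "not provided" is an assumption. The %j in the log pattern is SLURM's filename placeholder for the job ID.

```python
from typing import Optional, Tuple


def resolve_names(
    replicate: int,
    job_name: Optional[str] = None,
    output_file: Optional[str] = None,
) -> Tuple[str, str]:
    """Apply the documented defaults for the job name and log file pattern."""
    name = job_name if job_name is not None else f"pzmd_r{replicate}"
    # The log pattern is derived from the resolved name, not the raw argument.
    log = output_file if output_file is not None else f"slurm_logs/{name}.%j.out"
    return name, log


print(resolve_names(3))
```

Because the pattern is relative, the slurm_logs directory is resolved against the directory where sbatch is invoked, as noted in the parameter description.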
- polyzymd.workflow.slurm.parse_replicate_range(replicate_range)[source]
Parse a SLURM array range into a list of replicate numbers.
- Parameters:
replicate_range (str) – SLURM array format (e.g., “1-5”, “1,3,5”, “1-10:2”).
- Returns:
List of replicate numbers.
- Return type:
Example
>>> parse_replicate_range("1-5")
[1, 2, 3, 4, 5]
>>> parse_replicate_range("1,3,5")
[1, 3, 5]
>>> parse_replicate_range("1-10:2")
[1, 3, 5, 7, 9]
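One way such a parser could work is sketched below. This is an illustrative reimplementation under the name parse_range_sketch, not the library's source. It handles the three forms shown in the example: plain values, start-end ranges, and start-end:step ranges.

```python
from typing import List


def parse_range_sketch(spec: str) -> List[int]:
    """Expand a SLURM-array-style range string into a list of integers."""
    values: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # "1-10:2" -> bounds "1-10", step "2"; "1-5" -> step defaults to 1
            bounds, _, step_text = part.partition(":")
            start_text, end_text = bounds.split("-", 1)
            step = int(step_text) if step_text else 1
            if step < 1:
                raise ValueError(f"step must be >= 1 in {part!r}")
            # The end bound is inclusive, as in SLURM array syntax.
            values.extend(range(int(start_text), int(end_text) + 1, step))
        else:
            values.append(int(part))
    return values
```

Malformed pieces such as an empty element or non-numeric text raise ValueError from int(), so the sketch fails loudly rather than guessing.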
- polyzymd.workflow.slurm.validate_replicate_range(replicate_range)[source]
Validate that a replicate range is in proper SLURM array format.
- Parameters:
replicate_range (str) – Range string to validate.
- Returns:
True if valid.
- Raises:
ValueError – If the format is invalid.
- Return type:
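A format check of this kind can be expressed with a single regular expression, as in the sketch below. validate_range_sketch and the pattern are assumptions for illustration; the real function may accept or reject edge cases differently (for instance, this sketch does not reject a step of 0 or a range whose end is below its start).

```python
import re

# One element is "N", "N-M", or "N-M:S"; elements are comma-separated.
_ELEMENT = r"\d+(?:-\d+(?::\d+)?)?"
_RANGE_RE = re.compile(rf"{_ELEMENT}(?:,{_ELEMENT})*")


def validate_range_sketch(spec: str) -> bool:
    """Return True for a well-formed range string, else raise ValueError."""
    if not _RANGE_RE.fullmatch(spec.strip()):
        raise ValueError(f"Invalid replicate range: {spec!r}")
    return True
```

Raising on failure while returning True on success matches the documented contract, so callers can use the function either as a guard statement or inside a conditional.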