Workflow Module

Job Submitter

Job submission for HPC SLURM scheduler.

This module provides utilities for submitting self-resubmitting MD simulation jobs to SLURM. Each replicate gets a single job script that calls polyzymd run-segment, checks progress, and resubmits itself until the simulation is complete.

Changed in version 1.1.0: Replaced the legacy daisy-chain (dependency-chain) model with self-resubmitting jobs. The public API (submit_daisy_chain, DaisyChainConfig, DaisyChainSubmitter) is preserved for backward compatibility, but the internal behaviour is simplified.

polyzymd.workflow.daisy_chain.check_existing_slurm_jobs(job_name)

Query SLURM for RUNNING or PENDING jobs that match job_name.

This is a best-effort check: if squeue is unavailable (e.g. in a non-SLURM environment or CI), a warning is logged and an empty list is returned so that submission proceeds unimpeded.

Parameters:

job_name (str) – The SLURM --job-name to search for (exact match).

Returns:

SLURM job IDs that are RUNNING or PENDING with the given name. Empty if squeue is unavailable or returns no matches.

Return type:

list of str
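
The best-effort behaviour described above can be sketched as follows. This is an illustration, not the module's actual implementation: the function name find_active_jobs, the squeue_cmd parameter, and the exact squeue flags chosen here are assumptions. The real function may also restrict the query to the current user.

```python
import logging
import subprocess
from typing import List

logger = logging.getLogger(__name__)


def find_active_jobs(job_name: str, squeue_cmd: str = "squeue") -> List[str]:
    """Return IDs of RUNNING/PENDING jobs named job_name, or [] on any failure.

    Hypothetical stand-in for check_existing_slurm_jobs.
    """
    cmd = [
        squeue_cmd,
        "--noheader",                 # suppress the column header line
        f"--name={job_name}",         # filter by job name
        "--states=RUNNING,PENDING",   # only active or queued jobs
        "--format=%i",                # print only the job ID
    ]
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30, check=True
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # squeue is missing, timed out, or returned non-zero: warn and carry on
        # so that submission is not blocked outside a SLURM environment.
        logger.warning("squeue check skipped: %s", exc)
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]
```

Returning an empty list on failure keeps the check advisory: a missing squeue binary never prevents a submission or a dry run.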

polyzymd.workflow.daisy_chain.create_job_name(sim_config, replicate)

Create a descriptive SLURM job name for a replicate.

Produces names such as r1_310K_Fibronectin_SBMA-OEGMA_A75_B25, matching the directory naming convention.

Parameters:
  • sim_config (SimulationConfig) – Validated simulation configuration.

  • replicate (int) – Replicate number.

Returns:

Formatted job name.

Return type:

str
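
One way to assemble a name of the shape shown above is sketched below. The decomposition into replicate, temperature, enzyme, monomer names, and percentages is inferred from the single example name. The function format_job_name and all of its parameters are hypothetical; the real function reads these values from SimulationConfig.

```python
from typing import Sequence


def format_job_name(
    replicate: int,
    temperature_k: float,
    enzyme: str,
    monomers: Sequence[str],
    percents: Sequence[int],
) -> str:
    """Build a name like r1_310K_Fibronectin_SBMA-OEGMA_A75_B25 (assumed layout)."""
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # One "<letter><percent>" token per monomer, e.g. A75_B25.
    composition = "_".join(f"{labels[i]}{p}" for i, p in enumerate(percents))
    monomer_part = "-".join(monomers)
    return f"r{replicate}_{temperature_k:.0f}K_{enzyme}_{monomer_part}_{composition}"
```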

class polyzymd.workflow.daisy_chain.DaisyChainConfig(slurm_config, total_production_time_ns, total_samples=2500, equilibration_time_ns=0.5, replicates=<factory>, dry_run=False, force=False, output_script_dir=PosixPath('daisy_chain_scripts'), config_path='config.yaml')

Bases: object

Configuration for job submission.

Despite the legacy name, this now configures single self-resubmitting jobs (one per replicate) rather than dependency chains.

Variables:
  • slurm_config (SlurmConfig) – SLURM job configuration.

  • total_production_time_ns (float) – Total production time in nanoseconds.

  • total_samples (int) – Total trajectory frames across the entire production run.

  • equilibration_time_ns (float) – Equilibration time in nanoseconds (informational only).

  • replicates (list of int) – Replicate numbers to run.

  • dry_run (bool) – If True, create scripts but don’t submit.

  • force (bool) – If True, skip the squeue duplicate-job check and submit even if a RUNNING/PENDING job already exists for the same replicate.

  • output_script_dir (Path) – Directory for generated job scripts.

  • config_path (str) – Path to the YAML configuration file.
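
Because total_samples counts frames over the whole production run, the spacing between saved frames follows from simple division. The helper below is a hypothetical illustration of that arithmetic. Whether PolyzyMD computes its reporting interval exactly this way is an assumption.

```python
def frame_spacing_ps(total_production_time_ns: float, total_samples: int) -> float:
    """Return the time between saved frames in picoseconds."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    # 1 ns = 1000 ps
    return total_production_time_ns * 1000.0 / total_samples
```

For example, a 100 ns production run with the default of 2500 samples corresponds to one frame every 40 ps.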

slurm_config: SlurmConfig
total_production_time_ns: float
total_samples: int = 2500
equilibration_time_ns: float = 0.5
replicates: List[int]
dry_run: bool = False
force: bool = False
output_script_dir: Path = PosixPath('daisy_chain_scripts')
config_path: str = 'config.yaml'
classmethod from_simulation_config(sim_config, slurm_config, replicates='1', dry_run=False, force=False, output_script_dir='daisy_chain_scripts', config_path='config.yaml')

Create DaisyChainConfig from a SimulationConfig.

Parameters:
  • sim_config (SimulationConfig) – Simulation configuration.

  • slurm_config (SlurmConfig) – SLURM configuration.

  • replicates (str or list of int) – Replicate range string (e.g. "1-5") or list of ints.

  • dry_run (bool) – If True, don’t submit jobs.

  • force (bool) – If True, skip duplicate-job check.

  • output_script_dir (str or Path) – Directory for job scripts.

  • config_path (str) – Path to the YAML configuration file.

Returns:

Configured instance.

Return type:

DaisyChainConfig

__init__(slurm_config, total_production_time_ns, total_samples=2500, equilibration_time_ns=0.5, replicates=<factory>, dry_run=False, force=False, output_script_dir=PosixPath('daisy_chain_scripts'), config_path='config.yaml')
class polyzymd.workflow.daisy_chain.SubmissionResult(job_id, script_path, segment_index, replicate, is_dry_run=False)

Bases: object

Result of job submission.

Variables:
  • job_id (str) – SLURM job ID (or dummy ID for dry run).

  • script_path (Path) – Path to the generated script.

  • segment_index (int) – Always 0 in the self-resubmitting model (kept for compatibility).

  • replicate (int) – Replicate number.

  • is_dry_run (bool) – Whether this was a dry run.

job_id: str
script_path: Path
segment_index: int
replicate: int
is_dry_run: bool = False
__init__(job_id, script_path, segment_index, replicate, is_dry_run=False)
class polyzymd.workflow.daisy_chain.DaisyChainSubmitter(sim_config, dc_config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)

Bases: object

Handles job submission for MD simulations.

In the self-resubmitting model, each replicate gets a single job script. The script calls polyzymd run-segment, checks progress, and resubmits itself until the simulation is complete.

Example

>>> sim_config = SimulationConfig.from_yaml("config.yaml")
>>> slurm_config = SlurmConfig.from_preset("aa100", email="user@example.com")
>>> dc_config = DaisyChainConfig.from_simulation_config(
...     sim_config, slurm_config, replicates="1-3"
... )
>>> submitter = DaisyChainSubmitter(sim_config, dc_config)
>>> results = submitter.submit_all()
__init__(sim_config, dc_config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)

Initialize the submitter.

Parameters:
  • sim_config (SimulationConfig) – Simulation configuration.

  • dc_config (DaisyChainConfig) – Submission configuration.

  • pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").

  • openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.

  • skip_build (bool) – Skip system building in generated scripts.

property sim_config: SimulationConfig

Get the simulation configuration.

property dc_config: DaisyChainConfig

Get the submission configuration.

property job_chains: Dict[int, List[SubmissionResult]]

Get the submission results for all replicates.

generate_job_script(replicate)

Generate a self-resubmitting job script for a replicate.

Parameters:

replicate (int) – Replicate number.

Returns:

Complete SLURM batch script content.

Return type:

str

submit_replicate(replicate)

Generate and submit the job for a single replicate.

Before submitting, checks squeue for existing RUNNING/PENDING jobs with the same job name. If duplicates are found and force is not set, raises RuntimeError.

Parameters:

replicate (int) – Replicate number.

Returns:

Submission result.

Return type:

SubmissionResult

Raises:

RuntimeError – If a SLURM job is already RUNNING or PENDING for this replicate and force is False.
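
The duplicate-job rule can be expressed as a small guard. This is a sketch under stated assumptions: guard_against_duplicates is a hypothetical helper, and the error message wording is invented for illustration.

```python
from typing import Sequence


def guard_against_duplicates(
    job_name: str, existing_job_ids: Sequence[str], force: bool = False
) -> None:
    """Raise RuntimeError if active jobs exist and force is not set."""
    if existing_job_ids and not force:
        ids = ", ".join(existing_job_ids)
        raise RuntimeError(
            f"Job(s) {ids} already RUNNING or PENDING for {job_name!r}; "
            "pass force=True to submit anyway."
        )
```

With force=True, or when the squeue check returns an empty list, the guard is a no-op and submission proceeds.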

submit_all()

Submit jobs for all replicates.

Returns:

Mapping of replicate numbers to their submission results (each value is a single-element list for compatibility).

Return type:

dict

polyzymd.workflow.daisy_chain.submit_daisy_chain(config_path, slurm_preset='aa100', replicates='1', email='', dry_run=False, force=False, pixi_env='cuda-12-4', output_dir=None, scratch_dir=None, projects_dir=None, time_limit=None, memory=None, account=None, gpu_type=None, openff_logs=False, skip_build=False)

Submit self-resubmitting simulation jobs from a YAML config.

This is the main entry point called by polyzymd submit. Despite the legacy function name, it now submits one self-resubmitting job per replicate rather than a chain of dependent jobs.

Parameters:
  • config_path (str or Path) – Path to simulation YAML config.

  • slurm_preset (str) – SLURM preset name (aa100, al40, blanca-shirts, bridges2, testing).

  • replicates (str) – Replicate range string (e.g. "1-5", "1,3,5").

  • email (str) – Email for job notifications.

  • dry_run (bool) – If True, don’t submit jobs.

  • force (bool) – If True, skip the squeue duplicate-job check.

  • pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").

  • output_dir (str or Path or None) – Directory for job scripts.

  • scratch_dir (str or Path or None) – Override scratch directory for simulation output.

  • projects_dir (str or Path or None) – Override projects directory for scripts/logs.

  • time_limit (str or None) – Override SLURM time limit (format: HH:MM:SS).

  • memory (str or None) – Override SLURM memory allocation (e.g. "4G").

  • account (str or None) – Override SLURM account / allocation ID.

  • gpu_type (str or None) – Override GPU type for presets that use the --gpus directive.

  • openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.

  • skip_build (bool) – Skip system building in generated scripts.

Returns:

Mapping of replicate numbers to submission results.

Return type:

dict

Raises:

ValueError – If the SLURM account is empty on a preset that requires one and dry_run is False.
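
Several parameters above are "or None" overrides, where None means "keep the preset value". A minimal sketch of that pattern is shown below. It assumes the preset is a dataclass; PresetStandIn and apply_overrides are hypothetical names that are not part of the documented API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class PresetStandIn:
    """Hypothetical stand-in carrying three SlurmConfig-like fields."""

    time_limit: str = "23:59:59"
    memory: Optional[str] = "3G"
    account: str = "ucb625_asc1"


def apply_overrides(config: PresetStandIn, **overrides: Optional[str]) -> PresetStandIn:
    """Return a copy of config with every non-None override applied."""
    chosen = {key: value for key, value in overrides.items() if value is not None}
    return replace(config, **chosen)
```

Note that with this scheme an override of None cannot be used to clear a field, so omitting the --mem directive would need a different mechanism.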

SLURM Configuration

SLURM job script generation for HPC cluster submission.

This module provides templates and utilities for generating SLURM batch scripts for self-resubmitting MD simulation jobs.

Changed in version 1.1.0: Replaced conda/module-load environment activation with pixi. The module_load and conda_command fields on SlurmConfig have been removed. Environment activation is now handled by pixi shell-hook using the pixi_env parameter on SlurmScriptGenerator.

class polyzymd.workflow.slurm.SlurmConfig(partition='aa100', qos='normal', account='ucb625_asc1', time_limit='23:59:59', email='', nodes=1, ntasks=1, memory='3G', gpus=1, exclude=None, gpu_type=None, gpu_directive_style='gres')

Bases: object

Configuration for SLURM job submission.

Variables:
  • partition (str) – SLURM partition(s) to use.

  • qos (str) – Quality of service. Set to "" to omit the --qos directive entirely (required for clusters such as Bridges2 that do not use QoS).

  • account (str) – Account / allocation ID for resource allocation. Set to "" to omit the --account directive entirely (e.g. Bridges2, which infers the allocation from the submitting user’s login).

  • time_limit (str) – Wall time limit (HH:MM:SS).

  • email (str) – Email address for SLURM failure notifications. Set to "" to omit both --mail-type and --mail-user directives.

  • nodes (int) – Number of nodes.

  • ntasks (int) – Number of tasks. Ignored when gpu_directive_style == "gpus" (Bridges2-style); those scripts emit #SBATCH -N {nodes} only.

  • memory (str | None) – Memory allocation (e.g. "3G"). Set to None to omit the --mem directive entirely (some clusters allocate memory per GPU and reject an explicit --mem request).

  • gpus (int) – Number of GPUs.

  • exclude (str | None) – Nodes to exclude (omitted when None).

  • gpu_type (str | None) – Optional GPU type string used with the --gpus directive (e.g. "v100-32" for Bridges2). When None the classic --gres=gpu:<N> directive is emitted instead.

  • gpu_directive_style (str) – "gres" (default, Alpine-style) or "gpus" (Bridges2-style). Controls which SBATCH GPU directive is written. Also governs which nodes/ntasks format is emitted.
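
The omission rules in the list above can be sketched as a function that emits only the directives that apply. This is an illustration under assumptions: optional_directives is a hypothetical helper, the use of --mail-type=FAIL is inferred from the phrase "failure notifications", and the real generator may order or format its lines differently.

```python
from typing import List, Optional


def optional_directives(
    qos: str = "normal",
    account: str = "",
    email: str = "",
    memory: Optional[str] = "3G",
    gpus: int = 1,
    gpu_type: Optional[str] = None,
    gpu_directive_style: str = "gres",
) -> List[str]:
    """Return the #SBATCH lines for fields that may be omitted."""
    lines: List[str] = []
    if qos:  # "" omits --qos
        lines.append(f"#SBATCH --qos={qos}")
    if account:  # "" omits --account
        lines.append(f"#SBATCH --account={account}")
    if memory is not None:  # None omits --mem
        lines.append(f"#SBATCH --mem={memory}")
    if email:  # "" omits both mail directives
        lines.append("#SBATCH --mail-type=FAIL")
        lines.append(f"#SBATCH --mail-user={email}")
    if gpu_directive_style == "gpus":
        # Bridges2-style: --gpus=[type:]count
        spec = f"{gpu_type}:{gpus}" if gpu_type else str(gpus)
        lines.append(f"#SBATCH --gpus={spec}")
    else:
        # Alpine-style generic resource request
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    return lines
```

Treating the empty string and None as "omit" keeps the preset definitions declarative: a cluster that rejects a directive simply sets the field to the sentinel value.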

partition: str = 'aa100'
qos: str = 'normal'
account: str = 'ucb625_asc1'
time_limit: str = '23:59:59'
email: str = ''
nodes: int = 1
ntasks: int = 1
memory: str | None = '3G'
gpus: int = 1
exclude: str | None = None
gpu_type: str | None = None
gpu_directive_style: str = 'gres'
classmethod from_preset(preset, email='')

Create a SlurmConfig from a named preset.

Parameters:
  • preset (Literal['aa100', 'al40', 'blanca-shirts', 'bridges2', 'testing']) – Preset name.

  • email (str) – Email for notifications.

Returns:

SlurmConfig with preset values.

Return type:

SlurmConfig

__init__(partition='aa100', qos='normal', account='ucb625_asc1', time_limit='23:59:59', email='', nodes=1, ntasks=1, memory='3G', gpus=1, exclude=None, gpu_type=None, gpu_directive_style='gres')
class polyzymd.workflow.slurm.JobContext(job_name, output_file, scratch_dir, projects_dir='.', segment_index=0, replicate_num=1, extra_vars=<factory>)

Bases: object

Context for job script template rendering.

Variables:
  • job_name (str) – SLURM job name.

  • output_file (str) – Output file pattern (for SLURM logs).

  • scratch_dir (str) – Directory for simulation output (trajectories, checkpoints).

  • projects_dir (str) – Directory for scripts and logs.

  • segment_index (int) – Current segment index.

  • replicate_num (int) – Replicate number.

  • extra_vars (Dict) – Additional template variables.

job_name: str
output_file: str
scratch_dir: str
projects_dir: str = '.'
segment_index: int = 0
replicate_num: int = 1
extra_vars: Dict
property working_dir: str

Alias for scratch_dir, kept for backward compatibility.

__init__(job_name, output_file, scratch_dir, projects_dir='.', segment_index=0, replicate_num=1, extra_vars=<factory>)
class polyzymd.workflow.slurm.SlurmScriptGenerator(config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)

Bases: object

Generator for SLURM batch scripts.

Supports separate directories for:
  • projects_dir – Where scripts live and jobs are submitted from.

  • scratch_dir – Where simulation output goes (trajectories, checkpoints).

Example

>>> config = SlurmConfig.from_preset("aa100", email="user@example.com")
>>> generator = SlurmScriptGenerator(config)
>>> script = generator.generate_job_script(
...     config_path="/projects/user/config.yaml",
...     replicate=1,
...     working_dir="/scratch/user/sim_output",
... )
JOB_TEMPLATE = '#!/bin/bash\n#SBATCH --partition={partition}\n#SBATCH --job-name={job_name}\n#SBATCH --output={output_file}\n{qos_line}\n{nodes_line}\n{mem_line}\n#SBATCH --time={time_limit}\n{gpu_line}\n{mail_line}\n{account_line}\n{exclude_line}\n#SBATCH --signal=B:USR1@300\n#SBATCH --no-requeue\n\n# =============================================================================\n# PolyzyMD Self-Resubmitting Simulation Job\n# {FULL_CREDIT_LINE}\n# Generated by polyzymd do not edit manually\n# =============================================================================\n\n# Activate pixi environment\n# The manifest path was resolved at submission time from `which polyzymd`.\neval "$(pixi shell-hook -e {pixi_env} --manifest-path {manifest_path})"\n\n# Enable strict error handling after environment setup\nset -e\n\n# Required for OpenFF Interchange.combine() functionality\nexport INTERCHANGE_EXPERIMENTAL=1\n\n# Resolve this script\'s path for self-resubmission.\n# $SLURM_JOB_SCRIPT is only available in SLURM >= 22.05; fall back to $0.\nTHIS_SCRIPT="${{SLURM_JOB_SCRIPT:-$(realpath "$0")}}"\n\n# Configuration\nCONFIG_PATH="{config_path}"\nREPLICATE={replicate}\nWORKING_DIR="{working_dir}"\n\n# Ensure working directory exists\nmkdir -p "$WORKING_DIR"\n\necho "=================================================="\necho "PolyzyMD self-resubmitting job"\necho "{FULL_CREDIT_LINE}"\necho "Config:    $CONFIG_PATH"\necho "Replicate: $REPLICATE"\necho "Work dir:  $WORKING_DIR"\necho "Pixi env:  {pixi_env}"\necho "Job ID:    ${{SLURM_JOB_ID:-local}}"\necho "Timestamp: $(date)"\necho "=================================================="\n\n# =========================================================================\n# Signal forwarding: SLURM sends signals to the batch shell, not to child\n# processes.  We trap SIGUSR1 (wall-time warning) and SIGTERM (preemption)\n# and forward them to the Python process running in the background.\n# =========================================================================\nCHILD_PID=""\nforward_signal() {{\n    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then\n        echo "Forwarding $1 to Python process (PID $CHILD_PID)"\n        kill -"$1" "$CHILD_PID"\n    fi\n}}\ntrap \'forward_signal USR1\' USR1\ntrap \'forward_signal TERM\' TERM\n\n# Run the next segment (backgrounded for signal forwarding)\npolyzymd{openff_logs_flag} run-segment \\\n    -c "$CONFIG_PATH" \\\n    -r "$REPLICATE" \\\n    --scratch-dir "$WORKING_DIR"{skip_build_flag} &\nCHILD_PID=$!\n\n# Wait for the child; \'wait\' is interrupted by trapped signals, so loop\n# until the child actually exits.  Temporarily disable \'set -e\' so we can\n# capture non-zero exit codes (e.g. 99 for graceful shutdown) without the\n# shell exiting prematurely.\nset +e\nwait "$CHILD_PID" 2>/dev/null\nRC=$?\nwhile kill -0 "$CHILD_PID" 2>/dev/null; do\n    wait "$CHILD_PID" 2>/dev/null\n    RC=$?\ndone\nset -e\n\necho "run-segment exited with code $RC at $(date)"\n\n# =========================================================================\n# Resubmission logic\n# =========================================================================\nif [ $RC -eq 2 ]; then\n    echo "CONCURRENT: Another job is already running this replicate NOT resubmitting."\n    echo "This duplicate job chain will now terminate cleanly."\n    exit 0\nfi\n\nif [ $RC -ne 0 ] && [ $RC -ne 99 ]; then\n    echo "FATAL: run-segment failed (exit code $RC) NOT resubmitting"\n    exit $RC\nfi\n\n# Check whether more work remains\nset +e\npolyzymd check-progress -c "$CONFIG_PATH" -r "$REPLICATE" --scratch-dir "$WORKING_DIR"\nPROGRESS_RC=$?\nset -e\n\nif [ $PROGRESS_RC -eq 0 ]; then\n    echo "Simulation complete no resubmission needed."\n    exit 0\nfi\n\nif [ $PROGRESS_RC -ne 1 ]; then\n    echo "FATAL: check-progress failed (exit code $PROGRESS_RC) NOT resubmitting"\n    exit $PROGRESS_RC\nfi\n\n# Work remains (exit code 1) resubmit this same script\necho "Work remains resubmitting job..."\nsbatch "$THIS_SCRIPT"\nSUBMIT_RC=$?\n\nif [ $SUBMIT_RC -eq 0 ]; then\n    echo "Resubmitted successfully."\nelse\n    echo "WARNING: sbatch resubmission failed (exit code $SUBMIT_RC)"\n    echo "You can manually resume with:"\n    echo "  sbatch $THIS_SCRIPT"\n    exit 1\nfi\n\nexit 0\n'
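
The resubmission branches in the bash wrapper of JOB_TEMPLATE can be summarised as a small decision function. The exit codes (2, 99, 0, 1) are taken from the template text. The function name next_action, the constant names, and the returned labels are hypothetical and exist only for this sketch.

```python
from typing import Optional

CONCURRENT_RUN = 2       # another job is already running this replicate
GRACEFUL_SHUTDOWN = 99   # graceful shutdown, per the template comment


def next_action(run_rc: int, progress_rc: Optional[int] = None) -> str:
    """Mirror the branch order of the wrapper's resubmission logic."""
    if run_rc == CONCURRENT_RUN:
        return "stop"        # wrapper exits 0 without resubmitting
    if run_rc not in (0, GRACEFUL_SHUTDOWN):
        return "fatal"       # wrapper exits with run_rc
    if progress_rc == 0:
        return "complete"    # check-progress reports nothing left to do
    if progress_rc == 1:
        return "resubmit"    # wrapper calls: sbatch "$THIS_SCRIPT"
    return "fatal"           # any other check-progress exit code
```

Only a clean exit (0) or a graceful shutdown (99) from run-segment leads to the progress check, so a crashing simulation can never resubmit itself in a loop.
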
__init__(config, pixi_env='cuda-12-4', openff_logs=False, skip_build=False)

Initialize the generator.

Parameters:
  • config (SlurmConfig) – SLURM configuration.

  • pixi_env (str) – Pixi environment name (e.g. "cuda-12-4", "cuda-12-6").

  • openff_logs (bool) – Enable verbose OpenFF logs in generated scripts.

  • skip_build (bool) – Skip system building in generated scripts (use pre-built system).

property config: SlurmConfig

Get the SLURM configuration.

generate_job_script(config_path, replicate, working_dir, job_name=None, output_file=None)

Generate a self-resubmitting SLURM job script.

This produces a single script that handles the entire simulation lifecycle. Each invocation calls polyzymd run-segment, which determines what work remains, runs the next segment, and exits. The bash wrapper then checks progress and resubmits itself if more work is needed.

Parameters:
  • config_path (str) – Absolute path to the YAML configuration file.

  • replicate (int) – Replicate number.

  • working_dir (str) – Directory for simulation output (trajectories, checkpoints).

  • job_name (str or None, optional) – SLURM job name. Callers should use create_job_name() to produce descriptive names (e.g. r1_310K_Fibronectin_...). Falls back to pzmd_r{replicate} if not provided.

  • output_file (str or None, optional) – SLURM log file pattern. Falls back to slurm_logs/{job_name}.%j.out relative to the directory where sbatch is invoked.

Returns:

Complete SLURM batch script content.

Return type:

str
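
The fallback names and the brace escaping used in the template can be illustrated with a much smaller template. This sketch assumes the generator renders with str.format, which the doubled braces in JOB_TEMPLATE suggest. MINI_TEMPLATE and render are hypothetical names.

```python
from typing import Optional

# Doubled braces survive str.format as literal braces, so the shell still
# sees ${SLURM_JOB_ID:-local} after rendering.
MINI_TEMPLATE = (
    "#!/bin/bash\n"
    "#SBATCH --job-name={job_name}\n"
    "#SBATCH --output={output_file}\n"
    'echo "Job ID: ${{SLURM_JOB_ID:-local}}"\n'
)


def render(
    replicate: int, job_name: Optional[str] = None, output_file: Optional[str] = None
) -> str:
    """Fill in the documented fallbacks, then format the template."""
    job_name = job_name or f"pzmd_r{replicate}"
    output_file = output_file or f"slurm_logs/{job_name}.%j.out"
    return MINI_TEMPLATE.format(job_name=job_name, output_file=output_file)
```

The %j in the log pattern is left untouched by str.format and is expanded by SLURM to the job ID at run time.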

save_script(script_content, output_path, make_executable=True)

Save a script to a file.

Parameters:
  • script_content (str) – Script content.

  • output_path (str | Path) – Output file path.

  • make_executable (bool) – Whether to make the script executable.

Returns:

Path to the saved script.

Return type:

Path
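
A minimal sketch of writing a script and marking it executable is given below. The name write_script is hypothetical, and creating missing parent directories is an assumption about convenient behaviour rather than something the documentation states.

```python
import stat
from pathlib import Path
from typing import Union


def write_script(
    content: str, output_path: Union[str, Path], make_executable: bool = True
) -> Path:
    """Write content to output_path and optionally add execute permission."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # assumed convenience
    path.write_text(content)
    if make_executable:
        mode = path.stat().st_mode
        # Add the execute bit for user, group, and others.
        path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```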

polyzymd.workflow.slurm.parse_replicate_range(replicate_range)

Parse a SLURM array range into a list of replicate numbers.

Parameters:

replicate_range (str) – SLURM array format (e.g. "1-5", "1,3,5", "1-10:2").

Returns:

List of replicate numbers.

Return type:

List[int]

Example

>>> parse_replicate_range("1-5")
[1, 2, 3, 4, 5]
>>> parse_replicate_range("1,3,5")
[1, 3, 5]
>>> parse_replicate_range("1-10:2")
[1, 3, 5, 7, 9]
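
A parser that reproduces the three documented examples might look like the sketch below. It is not the library's implementation: parse_range_sketch is a hypothetical name, and error handling is left to int() and range().

```python
from typing import List


def parse_range_sketch(spec: str) -> List[int]:
    """Expand "1-5", "1,3,5", or "1-10:2" into a list of integers."""
    values: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # "start-end" with an optional ":step" suffix
            bounds, _, step_text = part.partition(":")
            start_text, end_text = bounds.split("-", 1)
            step = int(step_text) if step_text else 1
            values.extend(range(int(start_text), int(end_text) + 1, step))
        else:
            values.append(int(part))
    return values
```
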
polyzymd.workflow.slurm.validate_replicate_range(replicate_range)

Validate that a replicate range is in proper SLURM array format.

Parameters:

replicate_range (str) – Range string to validate.

Returns:

True if valid.

Raises:

ValueError – If the format is invalid.

Return type:

bool
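
The validation contract (return True or raise ValueError) can be sketched with a regular expression. The pattern below covers only the forms shown in this document (single values, start-end ranges, and an optional :step). The name validate_range_sketch is hypothetical, and the real function may accept or reject other SLURM array syntax.

```python
import re

# One item: "N", "N-M", or "N-M:S"
_ITEM = r"\d+(?:-\d+(?::\d+)?)?"
_RANGE_RE = re.compile(rf"{_ITEM}(?:,{_ITEM})*")


def validate_range_sketch(spec: str) -> bool:
    """Return True for a well-formed range string, else raise ValueError."""
    if not _RANGE_RE.fullmatch(spec.strip()):
        raise ValueError(f"Invalid replicate range: {spec!r}")
    return True
```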