Data Requirements & Directory Layout

This page documents the directory structures, file formats, and naming conventions that PolyzyMD uses for simulations and analysis. Use it as a lookup reference when setting up new projects or troubleshooting missing-file errors.


The Two-Project Pattern

PolyzyMD separates simulation execution from cross-condition analysis into two distinct project types, each with its own directory scaffold and configuration file:

| Project Type | Created By | Config File | Purpose |
|---|---|---|---|
| Simulation project | polyzymd init -n <name> | config.yaml | Build, run, and store one simulation condition |
| Comparison project | polyzymd compare init -n <name> | comparison.yaml | Analyze and compare results across conditions |

A comparison project does not contain trajectory data. Instead, its comparison.yaml points to one or more simulation project config.yaml files, which in turn resolve to the trajectory directories on disk.


Simulation Project Layout

Running polyzymd init -n my_simulation creates:

my_simulation/
├── config.yaml              # Simulation configuration (edit this)
├── structures/              # Input PDB/SDF files
├── job_scripts/             # Generated SLURM submission scripts
└── slurm_logs/              # SLURM stdout/stderr logs

After you build and run a simulation, each replicate's output directory (under scratch_directory if set, otherwise projects_directory) contains:

{scratch_dir}/{naming_template}/       # One directory per replicate
├── solvated_system.pdb                # Topology (created by polyzymd build)
├── equilibration_heating/             # Equilibration stage output
│   └── ...
├── production_0/                      # First production segment
│   ├── production_0_trajectory.dcd    # Trajectory
│   └── production_0_topology.pdb      # Topology snapshot
├── production_1/                      # Daisy-chain continuation segment
│   ├── production_1_trajectory.dcd
│   └── production_1_topology.pdb
└── ...                                # Additional segments if daisy-chained

Each replicate gets its own complete directory containing a topology file and one or more trajectory segments.


Directory Naming Template

The naming_template field in the output section of config.yaml controls how per-replicate directories are named.

Default template:

{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}

Available placeholders:

| Placeholder | Source | Example Value |
|---|---|---|
| {enzyme} | enzyme.name | LipA |
| {substrate} | substrate.name (hyphens removed), or apo if null | ResorufinButyrate |
| {polymer_type} | Derived from polymer config, or none if disabled | SBMA-EGPMA_A70_B30 |
| {temperature} | thermodynamics.temperature (integer) | 300 |
| {replicate} | Replicate number (1-indexed) | 1 |
| {duration} | simulation_phases.production.duration (integer ns) | 100 |

Example resolved name:

LipA_ResorufinButyrate_SBMA-EGPMA_A70_B30_100ns_300K_run1
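The placeholder substitution above can be illustrated with plain Python string formatting. This is only a sketch: resolve_replicate_name and substrate_label are hypothetical helper names, not PolyzyMD functions, and the input substrate name "Resorufin-Butyrate" is an assumed example chosen to show the hyphen removal described in the table.

```python
from typing import Optional

DEFAULT_TEMPLATE = (
    "{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}"
)


def substrate_label(name: Optional[str]) -> str:
    """Mirror the table above: strip hyphens, or use 'apo' when there is no substrate."""
    return "apo" if name is None else name.replace("-", "")


def resolve_replicate_name(template: str, **values: object) -> str:
    """Fill a naming template with placeholder values (hypothetical helper)."""
    return template.format(**values)


name = resolve_replicate_name(
    DEFAULT_TEMPLATE,
    enzyme="LipA",
    substrate=substrate_label("Resorufin-Butyrate"),  # assumed input name
    polymer_type="SBMA-EGPMA_A70_B30",
    duration=100,
    temperature=300,
    replicate=1,
)
print(name)  # LipA_ResorufinButyrate_SBMA-EGPMA_A70_B30_100ns_300K_run1
```

A quick check like this is useful when a custom naming_template is in play, because it shows exactly which directory name the analysis will look for.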

Scratch vs Projects Directories

PolyzyMD supports separating lightweight project files (scripts, logs) from large simulation output (trajectories, checkpoints). This is common on HPC systems where long-term storage and high-performance scratch are different filesystems.

| Field | Purpose | Example |
|---|---|---|
| projects_directory | Scripts, configs, SLURM logs | /projects/user/polyzymd |
| scratch_directory | Trajectories, checkpoints, state data | /scratch/alpine/user/simulations |

If scratch_directory is null or omitted, all output goes to projects_directory.

Example config.yaml snippet:

output:
  projects_directory: "/projects/$USER/polyzymd"
  scratch_directory: "/scratch/alpine/$USER/simulations"
  naming_template: "{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}"

Environment variables ($USER, $HOME, ${VAR}) and ~ are expanded automatically in both path fields.
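The expansion and fallback rules can be approximated with the standard library. The sketch below is an assumption about equivalent behavior, not PolyzyMD's own code: expand_path and output_root are hypothetical names, and USER is set explicitly only to make the example reproducible.

```python
import os
from pathlib import Path
from typing import Optional


def expand_path(raw: str) -> Path:
    """Expand $VAR / ${VAR} references and a leading ~ in a configured path."""
    return Path(os.path.expanduser(os.path.expandvars(raw)))


def output_root(projects_directory: str, scratch_directory: Optional[str]) -> Path:
    """Pick the simulation output root: scratch if set, otherwise projects."""
    return expand_path(scratch_directory or projects_directory)


os.environ["USER"] = "alice"  # set here only so the example is reproducible

# With a scratch directory configured, output goes to scratch.
print(output_root("/projects/$USER/polyzymd", "/scratch/alpine/$USER/simulations"))

# With scratch_directory null or omitted, output falls back to projects_directory.
print(output_root("/projects/$USER/polyzymd", None))
```

Printing the expanded root before submitting jobs is a cheap way to confirm that the variables resolve to the filesystem you expect on a given cluster.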


What the Analysis Framework Expects

The TrajectoryLoader class resolves trajectory paths from a simulation config.yaml. It uses the config’s scratch_directory (or projects_directory as fallback) combined with the naming_template to locate each replicate’s working directory.

Topology and trajectory layout

Current OpenMM runs write solvated_system.pdb in the replicate working directory and production trajectories as indexed daisy-chain segments: production_N/production_N_trajectory.dcd.

When multiple daisy-chain segments exist (e.g., production_0/, production_1/, production_2/), they are automatically stitched together in segment-index order using the MDAnalysis ChainReader. The resulting Universe presents all segments as a single continuous trajectory.
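To make the segment ordering concrete, here is a minimal sketch of how segments could be discovered and chained outside PolyzyMD. The helper names find_production_segments and load_universe are hypothetical; the only library behavior relied on is that MDAnalysis reads a list of trajectory files as one continuous trajectory. Note that the sort is numeric, so production_10 comes after production_2.

```python
import re
from pathlib import Path
from typing import List

SEGMENT_RE = re.compile(r"^production_(\d+)$")


def find_production_segments(replicate_dir: Path) -> List[Path]:
    """Return production_N/production_N_trajectory.dcd paths sorted by integer N."""
    indexed = []
    for child in replicate_dir.iterdir():
        match = SEGMENT_RE.match(child.name)
        if match and child.is_dir():
            dcd = child / f"{child.name}_trajectory.dcd"
            if dcd.is_file():  # skip segments that never wrote a trajectory
                indexed.append((int(match.group(1)), dcd))
    return [path for _, path in sorted(indexed, key=lambda item: item[0])]


def load_universe(replicate_dir: Path):
    """Build one continuous Universe; requires MDAnalysis to be installed."""
    import MDAnalysis as mda  # imported lazily so the finder works without it

    segments = find_production_segments(replicate_dir)
    topology = replicate_dir / "solvated_system.pdb"
    # Passing a list of trajectory files makes MDAnalysis chain them in order.
    return mda.Universe(str(topology), [str(p) for p in segments])
```

Calling find_production_segments on a replicate directory is also a quick way to see which segments the loader would pick up when a trajectory looks shorter than expected.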


Input File Requirements

These are the input files placed in the simulation project’s structures/ directory and referenced from config.yaml.

| File | Format | Config Field | Requirements |
|---|---|---|---|
| Protein structure | PDB (.pdb) | enzyme.pdb_path | Standard residue names, protonated at simulation pH, no missing heavy atoms in regions of interest |
| Substrate | SDF (.sdf) | substrate.sdf_path | 3D coordinates with docked pose, explicit hydrogens preferred |
| Polymer (if pre-built) | SDF (.sdf) | polymers.sdf_directory | One SDF per chain, or use dynamic generation from SMILES |
| Reaction templates | RXN or "default" | polymers.reactions.* | The string "default" loads bundled ATRP templates; a file path loads a custom template |

Note

The sentinel value "default" for reaction templates is not a file path. Do not prepend a directory to it. PolyzyMD resolves "default" to bundled reaction files at runtime.


Comparison Project Layout

Running polyzymd compare init -n my_study creates:

my_study/
├── comparison.yaml          # Analysis configuration (edit this)
├── comparison/              # Cross-condition comparison results
├── figures/                 # Generated plots
└── structures/              # (Optional) shared structure files (e.g., enzyme PDB for SASA)

Analysis runs also create and populate analysis/ with canonical ReplicateArtifact and ConditionArtifact outputs for per-replicate and per-condition results. The comparison/ directory is reserved for cross-condition comparison results.

comparison.yaml structure

The comparison config references simulation projects by pointing to their config.yaml files:

name: "polymer_study"
description: "Comparison of polymer conjugation effects"

control: "No Polymer"       # Label of the control condition, or null

conditions:
  - label: "No Polymer"
    config: "../no_polymer/config.yaml"
    replicates: [1, 2, 3]

  - label: "PEG 10k"
    config: "../peg_10k/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"
  # ... additional analysis plugins

Relative paths in conditions[].config are resolved relative to the directory containing comparison.yaml, not the current working directory.
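The resolution rule can be sketched with pathlib. The function name resolve_condition_config is hypothetical and the directory names are illustrative; the point is that the base directory is the one holding comparison.yaml, whatever the shell's working directory is.

```python
from pathlib import Path


def resolve_condition_config(comparison_yaml: Path, config_entry: str) -> Path:
    """Resolve a conditions[].config entry against comparison.yaml's directory."""
    base = comparison_yaml.resolve().parent
    # Joining with an absolute entry yields that absolute path unchanged.
    return (base / config_entry).resolve()


# Illustrative layout: my_study/comparison.yaml next to no_polymer/config.yaml
print(resolve_condition_config(Path("my_study/comparison.yaml"), "../no_polymer/config.yaml"))
```

Because the base is derived from the location of comparison.yaml, the same file gives the same result whether the command is launched from inside my_study/ or from a parent directory.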


Connecting It All Together

polyzymd init       -->  config.yaml  -->  polyzymd build  -->  polyzymd run  -->  trajectories/
                                                                                       |
polyzymd compare init  -->  comparison.yaml  -->  polyzymd compare run  -->  results + figures
                                  |
                    (points to config.yaml files)

The comparison framework reads each condition’s config.yaml, resolves the scratch directory and naming template, then uses TrajectoryLoader to find topology and trajectory files for each replicate.


Common Pitfalls

Warning

Path resolution is config-relative, not CWD-relative. Relative paths in config.yaml (e.g., enzyme.pdb_path: "structures/enzyme.pdb") are resolved relative to the directory containing config.yaml, not your shell’s current working directory. The same applies to conditions[].config paths in comparison.yaml.

  • Mismatched scratch directory. If you built and ran simulations with one scratch_directory value but later changed it in config.yaml, the analysis framework will look in the wrong location. The scratch_directory in config.yaml must match where the trajectory files actually reside.

  • The "default" sentinel for reactions. Setting polymers.reactions.initiation: "default" tells PolyzyMD to use a bundled reaction template. Any other value is treated as a file path, so writing "structures/default" will fail because no such file exists.

  • Missing replicate directories. Each replicate number listed in comparison.yaml must have a corresponding directory on disk. If replicate 3 was never simulated, the analysis will fail with a FileNotFoundError showing the expected path.

  • Incomplete replicate directories. Every replicate directory must contain at least a topology file (solvated_system.pdb) and one or more production trajectory files. Partially completed simulations that crashed before writing a trajectory will cause load failures.
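The last two pitfalls can be caught before launching an analysis with a small pre-flight check. This is a sketch based only on the layout documented on this page; check_replicate_dir is a hypothetical helper, not part of PolyzyMD.

```python
from pathlib import Path
from typing import List


def check_replicate_dir(replicate_dir: Path) -> List[str]:
    """Return a list of problems; an empty list means the directory looks loadable."""
    if not replicate_dir.is_dir():
        return [f"missing replicate directory: {replicate_dir}"]

    problems = []
    topology = replicate_dir / "solvated_system.pdb"
    if not topology.is_file():
        problems.append(f"missing topology: {topology}")
    # At least one production_N/production_N_trajectory.dcd must exist.
    if not any(replicate_dir.glob("production_*/production_*_trajectory.dcd")):
        problems.append(f"no production trajectory under: {replicate_dir}")
    return problems


# Example: report problems for one (illustrative) replicate directory.
for issue in check_replicate_dir(Path("LipA_apo_none_100ns_300K_run3")):
    print(issue)
```

Running a check like this over every replicate listed in comparison.yaml reports all missing or incomplete directories at once, instead of stopping at the first load failure.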


See Also