# Data Requirements & Directory Layout

This page documents the directory structures, file formats, and naming
conventions that PolyzyMD uses for simulations and analysis. Use it as a
lookup reference when setting up new projects or troubleshooting missing-file
errors.

---

## The Two-Project Pattern

PolyzyMD separates simulation execution from cross-condition analysis into two
distinct project types, each with its own directory scaffold and configuration
file:

| Project Type | Created By | Config File | Purpose |
|---|---|---|---|
| Simulation project | `polyzymd init -n <name>` | `config.yaml` | Build, run, and store one simulation condition |
| Comparison project | `polyzymd compare init -n <name>` | `comparison.yaml` | Analyze and compare results across conditions |

A comparison project does not contain trajectory data. Instead, its
`comparison.yaml` points to one or more simulation project `config.yaml` files,
which in turn resolve to the trajectory directories on disk.

---

## Simulation Project Layout

Running `polyzymd init -n my_simulation` creates:

```
my_simulation/
├── config.yaml              # Simulation configuration (edit this)
├── structures/              # Input PDB/SDF files
├── job_scripts/             # Generated SLURM submission scripts
└── slurm_logs/              # SLURM stdout/stderr logs
```

After you build and run a simulation, the **output directory** (under
`scratch_directory`, or `projects_directory` if no scratch directory is set)
contains:

```
{scratch_directory}/{naming_template}/ # One directory per replicate
├── solvated_system.pdb                # Topology (created by polyzymd build)
├── equilibration_heating/             # Equilibration stage output
│   └── ...
├── production_0/                      # First production segment
│   ├── production_0_trajectory.dcd    # Trajectory
│   └── production_0_topology.pdb      # Topology snapshot
├── production_1/                      # Daisy-chain continuation segment
│   ├── production_1_trajectory.dcd
│   └── production_1_topology.pdb
└── ...                                # Additional segments if daisy-chained
```

Each replicate gets its own complete directory containing a topology file and
one or more trajectory segments.

---

## Directory Naming Template

The `naming_template` field in the `output` section of `config.yaml` controls
how per-replicate directories are named.

**Default template:**

```
{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}
```

**Available placeholders:**

| Placeholder | Source | Example Value |
|---|---|---|
| `{enzyme}` | `enzyme.name` | `LipA` |
| `{substrate}` | `substrate.name` (hyphens removed), or `apo` if null | `ResorufinButyrate` |
| `{polymer_type}` | Derived from polymer config, or `none` if disabled | `SBMA-EGPMA_A70_B30` |
| `{temperature}` | `thermodynamics.temperature` (integer) | `300` |
| `{replicate}` | Replicate number (1-indexed) | `1` |
| `{duration}` | `simulation_phases.production.duration` (integer ns) | `100` |

**Example resolved name:**

```
LipA_ResorufinButyrate_SBMA-EGPMA_A70_B30_100ns_300K_run1
```
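
The substitution can be reproduced with Python's built-in `str.format`, which is handy for predicting a directory name before a run exists. This is a minimal sketch, not PolyzyMD's own implementation: the helper names `resolve_directory_name` and `substrate_label` are hypothetical, and the raw substrate name `Resorufin-Butyrate` is an assumed input chosen to show hyphen removal.

```python
DEFAULT_TEMPLATE = (
    "{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}"
)


def substrate_label(name):
    """Return the {substrate} value: hyphens removed, or 'apo' when name is None."""
    return "apo" if name is None else name.replace("-", "")


def resolve_directory_name(template, **fields):
    """Fill a naming template with str.format (hypothetical helper)."""
    return template.format(**fields)


name = resolve_directory_name(
    DEFAULT_TEMPLATE,
    enzyme="LipA",
    substrate=substrate_label("Resorufin-Butyrate"),  # assumed raw name
    polymer_type="SBMA-EGPMA_A70_B30",
    duration=100,
    temperature=300,
    replicate=1,
)
print(name)  # LipA_ResorufinButyrate_SBMA-EGPMA_A70_B30_100ns_300K_run1
```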

---

## Scratch vs Projects Directories

PolyzyMD supports separating lightweight project files (scripts, logs) from
large simulation output (trajectories, checkpoints). This is common on HPC
systems where long-term storage and high-performance scratch are different
filesystems.

| Field | Purpose | Example |
|---|---|---|
| `projects_directory` | Scripts, configs, SLURM logs | `/projects/user/polyzymd` |
| `scratch_directory` | Trajectories, checkpoints, state data | `/scratch/alpine/user/simulations` |

If `scratch_directory` is `null` or omitted, all output goes to
`projects_directory`.

**Example `config.yaml` snippet:**

```yaml
output:
  projects_directory: "/projects/$USER/polyzymd"
  scratch_directory: "/scratch/alpine/$USER/simulations"
  naming_template: "{enzyme}_{substrate}_{polymer_type}_{duration}ns_{temperature}K_run{replicate}"
```

Environment variables (`$USER`, `$HOME`, `${VAR}`) and `~` are expanded
automatically in both path fields.
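
To check where output will land for a given `output` section, the expansion and fallback rules can be approximated with the standard library. This is a sketch under stated assumptions: `expand_path` and `output_root` are hypothetical helpers, and PolyzyMD's internal expansion may differ in edge cases such as undefined variables.

```python
import os
from pathlib import Path


def expand_path(raw):
    """Expand $VAR / ${VAR} references and a leading ~ in a path string."""
    return Path(os.path.expanduser(os.path.expandvars(raw)))


def output_root(output_cfg):
    """Return the scratch directory if set, otherwise the projects directory.

    A missing key, None, or an empty string all count as "not set" here.
    """
    raw = output_cfg.get("scratch_directory") or output_cfg["projects_directory"]
    return expand_path(raw)
```

For example, `output_root({"projects_directory": "/projects/$USER/polyzymd", "scratch_directory": None})` returns the expanded projects path, because no scratch directory is configured.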

---

## What the Analysis Framework Expects

The `TrajectoryLoader` class resolves trajectory paths from a simulation
`config.yaml`. It uses the config's `scratch_directory` (or
`projects_directory` as fallback) combined with the `naming_template` to
locate each replicate's working directory.

### Topology search order

Within each replicate's working directory, `TrajectoryLoader` searches for a
topology file in this order:

| Priority | Path | When It Exists |
|---|---|---|
| 1 | `solvated_system.pdb` | Always present after `polyzymd build` |
| 2 | `production_0/production_0_topology.pdb` | Daisy-chain structure |
| 3 | `production/production_topology.pdb` | Legacy (pre-daisy-chain) |
| 4 | Glob fallback: `production_*/*_topology.pdb`, then `*.pdb` | Last resort |

### Trajectory search order

Trajectory files are searched for in this order:

| Priority | Pattern | Description |
|---|---|---|
| 1 | `production_N/production_N_trajectory.dcd` | Daisy-chain segments (current) |
| 2 | `production/production_trajectory.dcd` | Legacy single file (deprecated) |
| 3 | Glob fallback: `**/production*trajectory.dcd` | Any production DCD |

When multiple daisy-chain segments exist (e.g., `production_0/`,
`production_1/`, `production_2/`), they are automatically stitched together in
segment-index order using the MDAnalysis `ChainReader`. The resulting
`Universe` presents all segments as a single continuous trajectory.
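
The search order above can be mirrored in a few lines of standard-library Python, which is useful for debugging a replicate directory without loading any trajectory data. This is an illustrative sketch, not the `TrajectoryLoader` source: `find_topology` and `find_segments` are hypothetical names, and only the current daisy-chain trajectory layout is handled.

```python
import re
from pathlib import Path

SEGMENT_RE = re.compile(r"^production_(\d+)$")


def find_topology(workdir):
    """Return the first existing topology candidate in the documented order, or None."""
    candidates = [
        workdir / "solvated_system.pdb",
        workdir / "production_0" / "production_0_topology.pdb",
        workdir / "production" / "production_topology.pdb",
    ]
    for path in candidates:
        if path.is_file():
            return path
    # Glob fallbacks, tried in the documented order.
    for pattern in ("production_*/*_topology.pdb", "*.pdb"):
        matches = sorted(workdir.glob(pattern))
        if matches:
            return matches[0]
    return None


def find_segments(workdir):
    """Return daisy-chain DCD files sorted by numeric segment index."""
    indexed = []
    for child in workdir.iterdir():
        match = SEGMENT_RE.match(child.name)
        if match and child.is_dir():
            dcd = child / f"{child.name}_trajectory.dcd"
            if dcd.is_file():
                indexed.append((int(match.group(1)), dcd))
    # Sort on the integer index so production_10 follows production_2.
    indexed.sort(key=lambda item: item[0])
    return [dcd for _, dcd in indexed]
```

Sorting on the parsed integer rather than the directory name matters once a run exceeds ten segments, since a plain string sort would place `production_10` before `production_2`. Passing the resulting list of DCD paths to `MDAnalysis.Universe` together with the topology is the usual way to have MDAnalysis read them through its `ChainReader`.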

---

## Input File Requirements

These are the input files placed in the simulation project's `structures/`
directory and referenced from `config.yaml`.

| File | Format | Config Field | Requirements |
|---|---|---|---|
| Protein structure | PDB (`.pdb`) | `enzyme.pdb_path` | Standard residue names, protonated at simulation pH, no missing heavy atoms in regions of interest |
| Substrate | SDF (`.sdf`) | `substrate.sdf_path` | 3D coordinates with docked pose, explicit hydrogens preferred |
| Polymer (if pre-built) | SDF (`.sdf`) | `polymers.sdf_directory` | One SDF per chain, or use dynamic generation from SMILES |
| Reaction templates | RXN or `"default"` | `polymers.reactions.*` | The string `"default"` loads bundled ATRP templates; a file path loads a custom template |

```{note}
The sentinel value `"default"` for reaction templates is **not** a file path.
Do not prepend a directory to it. PolyzyMD resolves `"default"` to bundled
reaction files at runtime.
```

---

## Comparison Project Layout

Running `polyzymd compare init -n my_study` creates:

```
my_study/
├── comparison.yaml          # Analysis configuration (edit this)
├── comparison/              # Analysis result JSON files
├── figures/                 # Generated plots
└── structures/              # (Optional) shared structure files (e.g., enzyme PDB for SASA)
```

### comparison.yaml structure

The comparison config references simulation projects by pointing to their
`config.yaml` files:

```yaml
name: "polymer_study"
description: "Comparison of polymer conjugation effects"

control: "No Polymer"       # Label of the control condition, or null

conditions:
  - label: "No Polymer"
    config: "../no_polymer/config.yaml"
    replicates: [1, 2, 3]

  - label: "PEG 10k"
    config: "../peg_10k/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"
  # ... additional analysis plugins
```

Relative paths in `conditions[].config` are resolved relative to the directory
containing `comparison.yaml`, not the current working directory.
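
The resolution rule can be illustrated with a small helper. This is a sketch with a hypothetical function name, `resolve_condition_config`. It collapses `..` lexically and does not follow symlinks, which may differ from what PolyzyMD does internally.

```python
import os
from pathlib import Path


def resolve_condition_config(comparison_yaml, config_entry):
    """Resolve a conditions[].config entry against the comparison.yaml directory."""
    entry = Path(config_entry)
    if entry.is_absolute():
        return entry
    # Join onto the directory holding comparison.yaml, then collapse ".." parts.
    return Path(os.path.normpath(comparison_yaml.parent / entry))
```

With `comparison.yaml` at `/work/my_study/comparison.yaml`, the entry `../no_polymer/config.yaml` resolves to `/work/no_polymer/config.yaml` regardless of the directory the command is launched from.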

---

## Connecting It All Together

```
polyzymd init       -->  config.yaml  -->  polyzymd build  -->  polyzymd run  -->  trajectories/
                                                                                       |
polyzymd compare init  -->  comparison.yaml  -->  polyzymd compare run  -->  results + figures
                                  |
                    (points to config.yaml files)
```

The comparison framework reads each condition's `config.yaml`, resolves the
scratch directory and naming template, then uses `TrajectoryLoader` to find
topology and trajectory files for each replicate.

---

## Common Pitfalls

```{warning}
**Path resolution is config-relative, not CWD-relative.**
Relative paths in `config.yaml` (e.g., `enzyme.pdb_path: "structures/enzyme.pdb"`)
are resolved relative to the directory containing `config.yaml`, not your
shell's current working directory. The same applies to `conditions[].config`
paths in `comparison.yaml`.
```

- **Mismatched scratch directory.** If you built and ran simulations with one
  `scratch_directory` value but later changed it in `config.yaml`, the analysis
  framework will look in the wrong location. The `scratch_directory` in
  `config.yaml` must match where the trajectory files actually reside.

- **The `"default"` sentinel for reactions.** Setting
  `polymers.reactions.initiation: "default"` tells PolyzyMD to use a bundled
  reaction template. Any other value, such as `"structures/default"`, is
  treated as a file path and will fail unless a file exists at that location.

- **Missing replicate directories.** Each replicate number listed in
  `comparison.yaml` must have a corresponding directory on disk. If replicate 3
  was never simulated, the analysis will fail with a `FileNotFoundError`
  showing the expected path.

- **Incomplete replicate directories.** Every replicate directory must contain
  at least a topology file (`solvated_system.pdb`) and one or more production
  trajectory files. Partially completed simulations that crashed before writing
  a trajectory will cause load failures.
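
The last two pitfalls can be caught before launching an analysis with a short pre-flight check. The sketch below is an assumption-laden illustration rather than a PolyzyMD feature: `check_replicate` is a hypothetical helper, and it tests only for `solvated_system.pdb` and the broadest trajectory glob listed on this page.

```python
from pathlib import Path


def check_replicate(workdir):
    """Return a list of human-readable problems found in one replicate directory."""
    if not workdir.is_dir():
        return [f"missing directory: {workdir}"]
    problems = []
    # The primary topology; fallback topology locations are not checked here.
    if not (workdir / "solvated_system.pdb").is_file():
        problems.append(f"no solvated_system.pdb in {workdir}")
    # Any production DCD at any depth counts as a trajectory.
    if not any(workdir.glob("**/production*trajectory.dcd")):
        problems.append(f"no production trajectory in {workdir}")
    return problems
```

Calling `check_replicate` on each expected replicate directory and printing any non-empty result gives a quick summary of which replicates still need to be run or re-run.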

---

## See Also

- {doc}`configuration` -- Full configuration field reference
- {doc}`cli_reference` -- CLI command reference including `init` and `compare init`
- {doc}`../how_to/analysis_compare_conditions` -- How to set up and run a comparison
- {doc}`../tutorials/quickstart` -- Run your first simulation end-to-end