# Configuration Reference

This document describes all configuration options for PolyzyMD YAML files.

## Configuration Structure

A complete configuration file has these sections:

```yaml
name: "simulation_name"
description: "optional description"

enzyme: { ... }           # Required
substrate: { ... }        # Optional (null for apo)
polymers: { ... }         # Optional (null to disable)
solvent: { ... }          # Required
restraints: [ ... ]       # Optional
thermodynamics: { ... }   # Required
simulation_phases: { ... } # Required
output: { ... }           # Required
force_field: { ... }      # Optional (has defaults)
```
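
As a rough illustration of the required/optional split above, the following sketch checks a parsed configuration for missing required sections. It assumes the YAML has already been loaded into a plain dictionary; `missing_sections` is a hypothetical helper, not part of the PolyzyMD API, and PolyzyMD's own validation may behave differently.

```python
# Required top-level sections, as listed in the structure overview.
REQUIRED_SECTIONS = ("enzyme", "solvent", "thermodynamics", "simulation_phases", "output")


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections that are absent or null."""
    return [key for key in REQUIRED_SECTIONS if config.get(key) is None]


# A parsed config is a plain dict; optional sections may be null (None).
config = {
    "name": "simulation_name",
    "enzyme": {"name": "LipA", "pdb_path": "structures/enzyme.pdb"},
    "substrate": None,  # apo simulation
    "solvent": {"primary": {"type": "water", "model": "tip3p"}},
    "thermodynamics": {"temperature": 300.0, "pressure": 1.0},
}

print(missing_sections(config))  # ['simulation_phases', 'output']
```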

---

## Enzyme Configuration

```yaml
enzyme:
  name: "LipA"                           # Identifier (required)
  pdb_path: "structures/enzyme.pdb"      # Path to PDB file (required)
  description: "Bacillus subtilis Lipase A"  # Optional description
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | Yes | Short identifier for the enzyme |
| `pdb_path` | path | Yes | Path to prepared PDB file |
| `description` | string | No | Human-readable description |

---

## Substrate Configuration

```yaml
substrate:
  name: "Resorufin-Butyrate"             # Identifier (required)
  sdf_path: "structures/substrate.sdf"   # Path to SDF file (required)
  conformer_index: 0                     # Which conformer to use (default: 0)
  charge_method: "nagl"                  # Charge assignment method
  residue_name: "LIG"                    # 3-letter residue name
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | Yes | - | Substrate identifier |
| `sdf_path` | path | Yes | - | Path to SDF with docked conformers |
| `conformer_index` | int | No | 0 | Index of conformer to use (0-indexed) |
| `charge_method` | string | No | "nagl" | Options: `nagl`, `espaloma`, `am1bcc` |
| `residue_name` | string | No | "LIG" | 3-letter code for topology |

### Charge Methods

| Method | Description | Speed |
|--------|-------------|-------|
| `nagl` | Graph neural network charges | Fast |
| `espaloma` | Machine learning charges | Medium |
| `am1bcc` | Semi-empirical QM charges | Slow |

### No Substrate (Apo Simulation)

```yaml
substrate: null
```

---

## Polymer Configuration

PolyzyMD supports two modes for polymer generation: **cached** (load from pre-built SDF files) and **dynamic** (generate on-the-fly from SMILES).

```{tip}
For a complete guide on dynamic polymer generation, see {doc}`dynamic_polymers`.
```

### Basic Configuration (Cached Mode)

```yaml
polymers:
  enabled: true                          # Enable/disable polymers
  type_prefix: "SBMA-EGPMA"              # Polymer type identifier
  
  monomers:                              # Monomer definitions
    - label: "A"                         # Single character label
      probability: 0.98                  # Selection probability (0-1)
      name: "SBMA"                       # Full name (optional)
    - label: "B"
      probability: 0.02
      name: "EGPMA"
  
  length: 5                              # Monomers per chain
  count: 2                               # Number of polymer chains
  
  sdf_directory: null                    # Pre-built polymer SDFs (optional)
  cache_directory: ".polymer_cache"      # Cache for generated polymers
```

### Dynamic Generation Configuration

To generate polymers on-the-fly from monomer SMILES (without pre-built SDF files):

```yaml
polymers:
  enabled: true
  generation_mode: "dynamic"             # Enable dynamic generation
  type_prefix: "SBMA-EGPMA"
  
  # ATRP reaction templates (use bundled defaults or custom paths)
  reactions:
    initiation: "default"                # or "/path/to/custom.rxn"
    polymerization: "default"
    termination: "default"
  
  monomers:
    - label: "A"
      probability: 0.7
      name: "SBMA"
      smiles: "[H]C([H])=C(C(=O)OC...)..."  # Required for dynamic mode
      residue_name: "SBM"                   # Optional 3-letter residue name
    - label: "B"
      probability: 0.3
      name: "EGPMA"
      smiles: "[H]C([H])=C(C(=O)OC...)..."
      residue_name: "EGM"
  
  length: 5
  count: 2
  charger: "nagl"                        # Charge method: nagl, espaloma, am1bcc
  max_retries: 10                        # Retries for ring-piercing detection
  cache_directory: ".polymer_cache"
```

### All Polymer Options

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `enabled` | bool | No | true | Enable polymer addition |
| `generation_mode` | string | No | "cached" | `cached` or `dynamic` |
| `type_prefix` | string | Yes | - | Identifier for polymer type |
| `monomers` | list | Yes | - | Monomer specifications |
| `length` | int | Yes | - | Chain length (number of monomers) |
| `count` | int | Yes | - | Number of chains to add |
| `sdf_directory` | path | No | null | Directory with pre-built polymer SDFs |
| `cache_directory` | path | No | ".polymer_cache" | Cache directory |
| `reactions` | object | No | all "default" | ATRP reaction templates (dynamic mode) |
| `charger` | string | No | "nagl" | Charge method for dynamic generation |
| `max_retries` | int | No | 10 | Max attempts for ring-piercing avoidance |

### Monomer Specification

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `label` | string | Yes | Single character (A, B, C...) |
| `probability` | float | Yes | Selection probability (0-1); the probabilities of all monomers must sum to 1.0 |
| `name` | string | No | Full monomer name |
| `smiles` | string | Dynamic only | Raw monomer SMILES (with C=C double bond) |
| `residue_name` | string | No | 3-letter residue code for topology |
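
To make the role of `label` and `probability` concrete, here is a minimal sketch that validates a monomer list and draws one chain sequence. It assumes each position in the chain is sampled independently from the listed probabilities; `sample_sequence` is a hypothetical helper, and the way PolyzyMD actually builds sequences may differ.

```python
import math
import random


def sample_sequence(monomers: list[dict], length: int, seed: int = 0) -> str:
    """Draw a chain sequence such as 'AABAA' from per-monomer probabilities."""
    labels = [m["label"] for m in monomers]
    weights = [m["probability"] for m in monomers]
    if any(len(label) != 1 for label in labels):
        raise ValueError("each monomer label must be a single character")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {sum(weights)}, expected 1.0")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return "".join(rng.choices(labels, weights=weights, k=length))


monomers = [
    {"label": "A", "probability": 0.98, "name": "SBMA"},
    {"label": "B", "probability": 0.02, "name": "EGPMA"},
]
print(sample_sequence(monomers, length=5))
```

With a 0.98/0.02 split and `length: 5`, most sampled chains will contain only `A`, so a small `count` may produce no `B` monomers at all.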

### Charge Methods for Dynamic Generation

| Method | Description | Speed | Accuracy |
|--------|-------------|-------|----------|
| `nagl` | Graph neural network charges | Fast | Good |
| `espaloma` | Machine learning charges | Medium | Good |
| `am1bcc` | Semi-empirical QM charges | Slow | Best |

### No Polymers

```yaml
polymers: null
# or
polymers:
  enabled: false
```

---

## Solvent Configuration

```yaml
solvent:
  primary:
    type: "water"
    model: "tip3p"                       # Water model
  
  co_solvents: []                        # List of co-solvents (optional)
  
  ions:
    neutralize: true                     # Add counter-ions
    nacl_concentration: 0.15             # NaCl concentration (M)
  
  box:
    padding: 1.2                         # nm from solute to box edge
    shape: "rhombic_dodecahedron"        # Box shape
    target_density: 1.0                  # g/mL
    tolerance: 2.0                       # PACKMOL tolerance (Angstrom)
```

### Water Models

| Model | Description |
|-------|-------------|
| `tip3p` | TIP3P (default, fast) |
| `spce` | SPC/E |
| `tip4pew` | TIP4P-Ew |
| `opc` | OPC (accurate, slower) |

### Box Shapes

| Shape | Description |
|-------|-------------|
| `cube` | Cubic box |
| `rhombic_dodecahedron` | Space-efficient (default) |
| `truncated_octahedron` | Alternative space-efficient |

### Co-solvents

PolyzyMD supports adding co-solvents to your simulation system. You can specify co-solvents using either **volume fraction** (v/v) or **molar concentration**.

#### Specification Methods

| Method | Field | Description | Effect on Water |
|--------|-------|-------------|-----------------|
| Volume Fraction | `volume_fraction` | Fraction of box volume (0-1) | Reduces water proportionally |
| Concentration | `concentration` | Molar concentration (mol/L) | Additive (water unchanged) |

**Important:** Use exactly ONE method per co-solvent. Do not specify both `volume_fraction` and `concentration` for the same co-solvent.

#### Volume Fraction Method

Use this when you want a specific percentage of the solvent to be the co-solvent (e.g., "30% DMSO").

```yaml
co_solvents:
  - name: "dmso"
    volume_fraction: 0.30    # 30% v/v DMSO
```

**Formula:**

```
n = (V_box × phi × rho × N_A) / M

Where:
  n     = number of co-solvent molecules
  V_box = simulation box volume (mL, matching the density units)
  phi   = volume fraction (e.g., 0.30 for 30%)
  rho   = co-solvent density (g/mL)
  M     = molar mass (g/mol)
  N_A   = Avogadro's number (implicit in OpenMM)
```

**Source:** [`src/polyzymd/builders/solvent.py:267-287`](https://github.com/joelaforet/polyzymd/blob/main/src/polyzymd/builders/solvent.py#L267-L287)

The water count is reduced proportionally: if you specify 30% DMSO, water fills the remaining 70% of the box.
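
The arithmetic can be sketched as follows, with the unit conversion and Avogadro's number written out so that the result is a molecule count. The 10 nm cubic box, the water density of 0.997 g/mL, and the rounding are assumptions made for illustration; `molecules_from_volume_fraction` is a hypothetical helper, not the function in `solvent.py`.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
ML_PER_NM3 = 1e-21         # 1 nm^3 = 1e-24 L = 1e-21 mL


def molecules_from_volume_fraction(box_nm3, phi, density_g_per_ml, molar_mass):
    """Molecule count for a species occupying fraction `phi` of the box."""
    mass_g = box_nm3 * ML_PER_NM3 * phi * density_g_per_ml
    return round(mass_g / molar_mass * AVOGADRO)


box_nm3 = 1000.0  # hypothetical 10 nm x 10 nm x 10 nm box

# 30% v/v DMSO (density 1.10 g/mL, molar mass 78.13 g/mol)
n_dmso = molecules_from_volume_fraction(box_nm3, 0.30, 1.10, 78.13)
# Water fills the remaining 70% (assumed 0.997 g/mL, 18.015 g/mol)
n_water = molecules_from_volume_fraction(box_nm3, 0.70, 0.997, 18.015)

print(n_dmso, n_water)
```

For this box the sketch gives roughly 2,500 DMSO molecules and roughly 23,000 water molecules.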

#### Concentration Method

Use this when you want a specific molar concentration (e.g., "2 M urea for protein denaturation studies").

```yaml
co_solvents:
  - name: "urea"
    concentration: 2.0       # 2 M urea
```

**Formula:**

```
n = C × V_box × N_A

Where:
  n     = number of co-solvent molecules
  C     = concentration (mol/L)
  V_box = simulation box volume (L)
  N_A   = Avogadro's number (implicit in OpenMM)
```

**Source:** [`src/polyzymd/builders/solvent.py:295-312`](https://github.com/joelaforet/polyzymd/blob/main/src/polyzymd/builders/solvent.py#L295-L312)

The water count is NOT reduced when using concentration. The co-solvent molecules are added to the existing water, which may slightly increase the effective density.
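
A worked instance of the concentration formula, as a sketch: the 10 nm cubic box and the rounding to the nearest integer are assumptions, and `molecules_from_concentration` is a hypothetical helper rather than PolyzyMD's implementation.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
LITERS_PER_NM3 = 1e-24     # 1 nm^3 = 1e-24 L


def molecules_from_concentration(box_nm3: float, concentration_molar: float) -> int:
    """n = C x V_box x N_A, with the box volume converted from nm^3 to litres."""
    volume_l = box_nm3 * LITERS_PER_NM3
    return round(concentration_molar * volume_l * AVOGADRO)


# 2 M urea in a hypothetical 10 nm cubic box (1000 nm^3)
n_urea = molecules_from_concentration(1000.0, 2.0)
print(n_urea)  # 1204
```

Because the count depends only on the box volume, it does not change with the co-solvent's density or molar mass.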

#### Built-in Co-solvent Library

PolyzyMD includes a library of common co-solvents with pre-defined SMILES and densities. Density values are sourced from [PubChem](https://pubchem.ncbi.nlm.nih.gov/), a public database of chemical compounds. The Compound Identification Number (CID) in the Reference column links to each compound's record, which includes density, structure, and safety data.

| Name | SMILES | Density (g/mL) | Reference |
|------|--------|----------------|-----------|
| `dmso` | `CS(=O)C` | 1.10 | [CID 679](https://pubchem.ncbi.nlm.nih.gov/compound/679) |
| `dmf` | `CN(C)C=O` | 0.95 | [CID 6228](https://pubchem.ncbi.nlm.nih.gov/compound/6228) |
| `acetonitrile` | `CC#N` | 0.786 | [CID 6342](https://pubchem.ncbi.nlm.nih.gov/compound/6342) |
| `urea` | `C(=O)(N)N` | 1.32 | [CID 1176](https://pubchem.ncbi.nlm.nih.gov/compound/1176) |
| `ethanol` | `CCO` | 0.789 | [CID 702](https://pubchem.ncbi.nlm.nih.gov/compound/702) |
| `methanol` | `CO` | 0.792 | [CID 887](https://pubchem.ncbi.nlm.nih.gov/compound/887) |
| `glycerol` | `C(C(CO)O)O` | 1.261 | [CID 753](https://pubchem.ncbi.nlm.nih.gov/compound/753) |
| `isopropanol` | `CC(C)O` | 0.786 | [CID 3776](https://pubchem.ncbi.nlm.nih.gov/compound/3776) |
| `acetone` | `CC(=O)C` | 0.784 | [CID 180](https://pubchem.ncbi.nlm.nih.gov/compound/180) |
| `thf` | `C1CCOC1` | 0.883 | [CID 8028](https://pubchem.ncbi.nlm.nih.gov/compound/8028) |
| `dioxane` | `C1COCCO1` | 1.033 | [CID 31275](https://pubchem.ncbi.nlm.nih.gov/compound/31275) |
| `ethylene_glycol` | `C(CO)O` | 1.114 | [CID 174](https://pubchem.ncbi.nlm.nih.gov/compound/174) |

For library co-solvents, you only need to specify the `name` and either `volume_fraction` or `concentration`:

```yaml
co_solvents:
  - name: "dmso"
    volume_fraction: 0.10    # 10% v/v DMSO - smiles and density auto-populated
```

#### Custom Co-solvents

For molecules not in the library, you must provide the SMILES string. Density is required only when using `volume_fraction`:

```yaml
co_solvents:
  # Custom co-solvent with volume fraction (density required)
  - name: "ethyl_acetate"
    smiles: "CCOC(=O)C"
    density: 0.902           # g/mL - required for volume_fraction
    volume_fraction: 0.15

  # Custom co-solvent with concentration (density not needed)
  - name: "my_additive"
    smiles: "CC(=O)NC"
    concentration: 0.5       # 0.5 M
```

#### Multiple Co-solvents

You can combine multiple co-solvents. Each can use either specification method independently:

```yaml
co_solvents:
  - name: "dmso"
    volume_fraction: 0.20    # 20% v/v DMSO
  - name: "urea"
    concentration: 1.0       # Plus 1 M urea
```

**Warning:** When using multiple co-solvents with `volume_fraction`, ensure the total does not exceed 1.0 (100%). The remaining fraction is filled with water.
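
The co-solvent rules in this section (exactly one specification method, SMILES for custom molecules, density with `volume_fraction`, total volume fraction at most 1.0) can be collected into a small pre-flight check. This is a sketch over an already-parsed list of dictionaries; `validate_co_solvents` is a hypothetical helper, and PolyzyMD's own validation may report these problems differently.

```python
# Names from the built-in co-solvent library table.
LIBRARY = {
    "dmso", "dmf", "acetonitrile", "urea", "ethanol", "methanol",
    "glycerol", "isopropanol", "acetone", "thf", "dioxane", "ethylene_glycol",
}


def validate_co_solvents(co_solvents: list[dict]) -> list[str]:
    """Return readable problems; an empty list means the entries look consistent."""
    problems = []
    total_fraction = 0.0
    for item in co_solvents:
        name = item.get("name", "<unnamed>")
        has_fraction = item.get("volume_fraction") is not None
        has_conc = item.get("concentration") is not None
        if has_fraction == has_conc:  # both set, or neither set
            problems.append(f"{name}: set exactly one of volume_fraction or concentration")
        if name not in LIBRARY:
            if not item.get("smiles"):
                problems.append(f"{name}: custom co-solvents need a smiles string")
            if has_fraction and item.get("density") is None:
                problems.append(f"{name}: density is required with volume_fraction")
        if has_fraction:
            total_fraction += item["volume_fraction"]
    if total_fraction > 1.0:
        problems.append(f"volume fractions sum to {total_fraction:.2f}, which exceeds 1.0")
    return problems


ok = [{"name": "dmso", "volume_fraction": 0.20}, {"name": "urea", "concentration": 1.0}]
bad = [
    {"name": "dmso", "volume_fraction": 0.70, "concentration": 1.0},
    {"name": "ethyl_acetate", "smiles": "CCOC(=O)C", "volume_fraction": 0.40},
]
print(validate_co_solvents(ok))  # []
for problem in validate_co_solvents(bad):
    print(problem)
```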

```{warning}
**YAML List Syntax**

A common mistake is placing each field on a separate line with its own `-`, which creates multiple list items instead of one object with multiple fields.

**Incorrect** (creates 3 separate incomplete items):
~~~yaml
co_solvents:
  - name: "dmso"
  - volume_fraction: 0.30
  - residue_name: "DMS"
~~~

**Correct** (one item with 3 fields):
~~~yaml
co_solvents:
  - name: "dmso"
    volume_fraction: 0.30
    residue_name: "DMS"
~~~

The `-` character starts a **new list item**. The remaining fields of the same item must be aligned with the item's first key and must *not* have a leading `-`.
```
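
For readers who think in terms of parsed data, the two YAML snippets in the warning correspond to the Python structures below. The literals follow standard YAML list and mapping semantics; `items_without_name` is a hypothetical helper used only to show why the incorrect form fails.

```python
# The "incorrect" YAML parses to three separate one-key items.
incorrect = {
    "co_solvents": [
        {"name": "dmso"},
        {"volume_fraction": 0.30},
        {"residue_name": "DMS"},
    ]
}

# The "correct" YAML parses to a single item with three fields.
correct = {
    "co_solvents": [
        {"name": "dmso", "volume_fraction": 0.30, "residue_name": "DMS"},
    ]
}


def items_without_name(co_solvents: list[dict]) -> list[dict]:
    """Items lacking a `name` cannot be resolved to a co-solvent."""
    return [item for item in co_solvents if "name" not in item]


print(len(incorrect["co_solvents"]), len(correct["co_solvents"]))  # 3 1
print(items_without_name(incorrect["co_solvents"]))
```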

#### Assumptions and Limitations

- **Ideal mixing:** Volume fractions assume ideal mixing (volumes are additive). Real solutions may deviate.
- **Room temperature densities:** Library densities are approximate values at ~25 °C.
- **PACKMOL placement:** Co-solvent molecules are placed randomly by PACKMOL and may require equilibration to achieve uniform distribution.

### Solvent Parameterization

PolyzyMD uses **pre-computed partial charges** for all solvent molecules to ensure consistency and performance.

#### Why Pre-computed Charges?

When adding many copies of the same solvent molecule (e.g., 1000 DMSO molecules), each molecule should have **identical partial charges**. However, charge calculation methods like AM1BCC have numerical variability: running the calculation twice on the same molecule can produce slightly different charges.

If charges were computed independently for each solvent molecule:
1. **Inconsistency**: Identical molecules would have different parameters (physically incorrect)
2. **Performance**: AM1BCC is expensive; computing it 1000 times for identical molecules is wasteful
3. **Force field issues**: Parameter variability can cause OpenFF Interchange errors

#### How It Works

PolyzyMD solves this by computing charges **once** and reusing them:

1. **Built-in solvents**: Pre-computed SDF files are bundled with the package (in `src/polyzymd/data/solvents/`)
2. **User cache**: Custom solvents are cached in `~/.polyzymd/solvent_cache/` after first use
3. **Lookup order**: Memory cache → Bundled SDFs → User cache → Generate and cache

```
# Lookup order for get_solvent_molecule("dmso")
1. Check in-memory cache (fastest)
2. Check bundled library: src/polyzymd/data/solvents/dmso.sdf
3. Check user cache: ~/.polyzymd/solvent_cache/dmso.sdf
4. Generate from SMILES + AM1BCC, save to user cache
```
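
The lookup order can be sketched as a tiered cache. This is only an illustration of the pattern: `lookup_solvent`, its arguments, and the text placeholder standing in for a charged molecule are hypothetical, and the real `get_solvent_molecule` may be structured differently.

```python
import tempfile
from pathlib import Path


def lookup_solvent(name, memory, bundled_dir, user_cache_dir, generate):
    """Return (record, source), trying the cheapest tier first."""
    if name in memory:                                   # 1. in-memory cache
        return memory[name], "memory"
    for source, directory in (("bundled", bundled_dir),  # 2. bundled library
                              ("user_cache", user_cache_dir)):  # 3. user cache
        path = directory / f"{name}.sdf"
        if path.exists():
            memory[name] = path.read_text()
            return memory[name], source
    record = generate(name)                              # 4. generate, then cache
    user_cache_dir.mkdir(parents=True, exist_ok=True)
    (user_cache_dir / f"{name}.sdf").write_text(record)
    memory[name] = record
    return record, "generated"


with tempfile.TemporaryDirectory() as tmp:
    bundled, cache = Path(tmp) / "bundled", Path(tmp) / "cache"
    bundled.mkdir()
    memory = {}
    fake_generate = lambda name: f"charged molecule for {name}"  # stands in for SMILES + AM1BCC

    sources = [lookup_solvent("custom", memory, bundled, cache, fake_generate)[1] for _ in range(2)]
    memory.clear()  # simulate a new Python session
    sources.append(lookup_solvent("custom", memory, bundled, cache, fake_generate)[1])

print(sources)  # ['generated', 'memory', 'user_cache']
```

The expensive generation step runs once; later sessions read the file written to the user cache.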

#### Available Pre-computed Solvents

All 12 library co-solvents plus water models have pre-computed charges:

| Solvent | File | Charge Method |
|---------|------|---------------|
| TIP3P Water | `tip3p.sdf` | Literature values |
| DMSO | `dmso.sdf` | AM1BCC |
| DMF | `dmf.sdf` | AM1BCC |
| Acetonitrile | `acetonitrile.sdf` | AM1BCC |
| Urea | `urea.sdf` | AM1BCC |
| Ethanol | `ethanol.sdf` | AM1BCC |
| Methanol | `methanol.sdf` | AM1BCC |
| Glycerol | `glycerol.sdf` | AM1BCC |
| Isopropanol | `isopropanol.sdf` | AM1BCC |
| Acetone | `acetone.sdf` | AM1BCC |
| THF | `thf.sdf` | AM1BCC |
| Dioxane | `dioxane.sdf` | AM1BCC |
| Ethylene Glycol | `ethylene_glycol.sdf` | AM1BCC |

#### Custom Solvents

When you use a custom co-solvent (not in the library), PolyzyMD will:

1. Generate the molecule from your SMILES string
2. Compute AM1BCC partial charges (this may take a few seconds)
3. Cache the parameterized molecule to `~/.polyzymd/solvent_cache/`
4. Reuse the cached version for all future simulations

```yaml
co_solvents:
  - name: "my_custom_solvent"
    smiles: "CC(=O)OCC"        # First use: computes and caches charges
    concentration: 0.5         # Future uses: loads from cache instantly
```

#### Managing the Cache

You can inspect and manage the solvent cache programmatically:

```python
from polyzymd.data import list_available_solvents, clear_cache

# List all available solvents (bundled + cached)
solvents = list_available_solvents()
print(solvents)
# {'bundled': ['dmso', 'ethanol', ...], 'cached': ['my_custom_solvent']}

# Clear the user cache (does not affect bundled solvents)
clear_cache()
```

The user cache location is `~/.polyzymd/solvent_cache/`. You can safely delete this directory to force re-computation of custom solvents.

---

## Restraints Configuration

```yaml
restraints:
  - type: "flat_bottom"                  # Restraint type
    name: "substrate_active_site"        # Identifier
    atom1:
      selection: "resid 77 and name OG"  # First atom selection
      description: "Catalytic serine"    # Optional description
    atom2:
      selection: "resname LIG and name C1"
      description: "Substrate carbon"
    distance: 3.3                        # Angstroms
    force_constant: 10000.0              # kJ/mol/nm²
    enabled: true                        # Enable/disable
```

See {doc}`restraints` for detailed selection syntax.

### Restraint Types

| Type | Description |
|------|-------------|
| `flat_bottom` | No force within threshold, harmonic beyond |
| `harmonic` | Harmonic potential at target distance |
| `upper_wall` | Prevent distance exceeding threshold |
| `lower_wall` | Prevent distance below threshold |
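
As an illustration of the difference between `flat_bottom` and `harmonic`, the sketch below evaluates both energies for the example restraint. The 0.5 prefactor and the conversion of `distance` from Angstroms to nanometers (to match a force constant in kJ/mol/nm²) are assumptions; the exact functional form PolyzyMD passes to the simulation engine may differ.

```python
ANGSTROM_TO_NM = 0.1


def flat_bottom_energy(distance_nm: float, threshold_nm: float, k: float) -> float:
    """Zero inside the threshold, harmonic in the excess distance beyond it."""
    excess = max(distance_nm - threshold_nm, 0.0)
    return 0.5 * k * excess**2


def harmonic_energy(distance_nm: float, target_nm: float, k: float) -> float:
    """Harmonic penalty for any deviation from the target distance."""
    return 0.5 * k * (distance_nm - target_nm) ** 2


threshold = 3.3 * ANGSTROM_TO_NM  # distance: 3.3 Angstrom -> 0.33 nm
k = 10000.0                       # force_constant in kJ/mol/nm^2

for r in (0.30, 0.43):
    print(r, flat_bottom_energy(r, threshold, k), harmonic_energy(r, threshold, k))
```

At 0.30 nm the flat-bottom restraint contributes nothing while the harmonic one already penalizes the pair (about 4.5 kJ/mol); at 0.43 nm, 0.1 nm beyond the threshold, both give about 50 kJ/mol.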

---

## Thermodynamics Configuration

```yaml
thermodynamics:
  temperature: 300.0                     # Kelvin
  pressure: 1.0                          # atmospheres
```

---

## Simulation Phases Configuration

```yaml
simulation_phases:
  equilibration_stages:
    - name: "heating"
      duration: 0.2                      # nanoseconds
      samples: 20                        # frames to save
      ensemble: "NVT"
      temperature_start: 60.0            # starting temperature (K)
      temperature_end: 300.0             # final temperature (K)
      temperature_increment: 1.0         # step size (K)
      temperature_interval: 1200.0       # time between steps (fs)
      position_restraints:
        - group: "protein_heavy"
          force_constant: 4184.0
    - name: "free_equilibration"
      duration: 0.8                      # nanoseconds
      samples: 80
      ensemble: "NPT"
      temperature: 300.0

  production:
    ensemble: "NPT"
    duration: 100.0                      # nanoseconds total
    samples: 2500                        # total frames
    time_step: 2.0
    thermostat: "LangevinMiddle"
    thermostat_timescale: 1.0
    barostat: "MC"                       # Monte Carlo barostat
    barostat_frequency: 25               # steps between barostat moves
  
```

PolyzyMD requires staged equilibration. Use one or more entries in
`equilibration_stages` even for minimal workflows.
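
The relationship between `duration`, `samples`, and `time_step` can be checked with a few lines of arithmetic. The sketch assumes `time_step` is given in femtoseconds and that frames are saved at evenly spaced intervals; neither is stated explicitly above, and `steps_per_frame` is a hypothetical helper.

```python
def steps_per_frame(duration_ns: float, samples: int, time_step_fs: float) -> int:
    """Integration steps between saved frames, assuming evenly spaced frames."""
    total_steps = round(duration_ns * 1e6 / time_step_fs)  # 1 ns = 1e6 fs
    return total_steps // samples


# Production settings from the example: 100 ns, 2500 frames, 2 fs time step
interval = steps_per_frame(duration_ns=100.0, samples=2500, time_step_fs=2.0)
print(interval)                 # 20000 steps between frames
print(interval * 2.0 / 1000.0)  # 40.0 ps between frames
```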

### Ensembles

| Ensemble | Description |
|----------|-------------|
| `NVT` | Constant particle number, volume, and temperature |
| `NPT` | Constant particle number, pressure, and temperature |
| `NVE` | Microcanonical: constant particle number, volume, and energy (no thermostat) |

### Thermostats

| Thermostat | Description |
|------------|-------------|
| `LangevinMiddle` | Langevin integrator (recommended) |
| `Langevin` | Standard Langevin |
| `Andersen` | Andersen thermostat |
| `NoseHoover` | Nosé-Hoover chain |

### Barostats

| Barostat | Description |
|----------|-------------|
| `MC` | Monte Carlo barostat (recommended) |
| `MCA` | Anisotropic Monte Carlo barostat |

---

## Output Configuration

Environment variables (`$USER`, `$HOME`, `${VAR}`) and `~` are automatically expanded in path fields.

```yaml
output:
  # Directory structure - environment variables are expanded automatically
  projects_directory: "/projects/$USER/polyzymd"   # Scripts, logs
  scratch_directory: "/scratch/alpine/$USER/simulations"  # Trajectories
  
  # You can also use ~ for home directory
  # projects_directory: "~/polyzymd"
  
  # Subdirectories within projects_directory
  job_scripts_subdir: "job_scripts"
  slurm_logs_subdir: "slurm_logs"
  
  # Naming
  naming_template: "{enzyme}_{substrate}_{polymer_type}_{temperature}K_run{replicate}"
  
  # Output options
  save_checkpoint: true                  # Save restart files
  save_state_data: true                  # Save energy/temperature CSV
  trajectory_format: "dcd"               # dcd or xtc
```

### Naming Template Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `{enzyme}` | Enzyme name | "LipA" |
| `{substrate}` | Substrate name | "ResorufinButyrate" |
| `{polymer_type}` | Polymer type | "SBMA-EGPMA" |
| `{temperature}` | Temperature in K | "300" |
| `{replicate}` | Replicate number | "1" |
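
The path expansion and naming template behave much like Python's standard tools, as the sketch below shows. It assumes the template is filled with `str.format`-style substitution and that paths are expanded in the manner of `os.path.expandvars` and `os.path.expanduser`; `expand_path` and the `DEMO_USER` variable are hypothetical, and PolyzyMD's internals may differ.

```python
import os


def expand_path(raw: str) -> str:
    """Expand $VAR / ${VAR} references and a leading ~ in a path string."""
    return os.path.expanduser(os.path.expandvars(raw))


template = "{enzyme}_{substrate}_{polymer_type}_{temperature}K_run{replicate}"
run_name = template.format(
    enzyme="LipA",
    substrate="ResorufinButyrate",
    polymer_type="SBMA-EGPMA",
    temperature=300,
    replicate=1,
)
print(run_name)  # LipA_ResorufinButyrate_SBMA-EGPMA_300K_run1

os.environ["DEMO_USER"] = "alice"  # demo variable so the output is predictable
print(expand_path("/scratch/${DEMO_USER}/simulations"))  # /scratch/alice/simulations
```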

---

## Force Field Configuration

```yaml
force_field:
  protein: "ff14sb_off_impropers_0.0.4.offxml"  # Protein force field
  small_molecule: "openff-2.0.0.offxml"          # Ligand/polymer force field
```

### Available Force Fields

**Protein:**
- `ff14sb_off_impropers_0.0.4.offxml` - Amber ff14SB (recommended)

**Small Molecule:**
- `openff-2.0.0.offxml` - OpenFF Sage 2.0 (recommended)
- `openff-2.1.0.offxml` - OpenFF Sage 2.1

### Key Collision Warnings

When building systems with both proteins and small molecules, you may see warnings like:

```
Key collision with different parameters, fixing. Key is [#6X4:1]-[#1:2]
```

**This is expected behavior and does not indicate a problem.**

#### Why This Happens

PolyzyMD uses different force fields for different molecule types:
- **Proteins**: ff14SB (Amber force field ported to OpenFF format)
- **Small molecules**: OpenFF Sage 2.0 (general small molecule force field)

When these force fields are combined, the same SMIRKS pattern (e.g., `[#6X4:1]-[#1:2]` for sp³ carbon-hydrogen bonds) may appear in both, but with **different parameter values**. This is expected because:

1. ff14SB was optimized for protein behavior
2. OpenFF Sage was optimized for general organic molecules
3. Both are valid parameterizations for their respective domains

#### How OpenFF Handles This

OpenFF Interchange detects these collisions and resolves them by appending `_DUPLICATE` to the key, allowing both parameter sets to coexist:

```python
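# Simplified illustration of the OpenFF behavior (not the actual source code)
def add_parameter(existing: dict, key: str, params: dict) -> None:
    if key in existing and existing[key] != params:
        key += "_DUPLICATE"  # Different values: keep both parameter sets
    existing[key] = params   # Identical values: nothing changes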
```

This ensures that:
- Protein atoms use ff14SB parameters
- Small molecule atoms use OpenFF Sage parameters
- The simulation runs correctly with appropriate parameters for each molecule type

#### What You'll See in Logs

With PolyzyMD's logging, you can identify which molecule combinations trigger collisions:

```
Combining 7 component Interchange(s)
  Components: LipA, ResorufinButyrate, EGPMA-SBMA_AAABA, ..., dmso, water/ions
[DEBUG] Combining 'LipA' with 'ResorufinButyrate'...
Key collision with different parameters, fixing. Key is [#6X4:1]-[#1:2]
...
```

Collisions typically occur when combining protein Interchanges (using ff14SB) with small molecule Interchanges (using OpenFF Sage).

#### Further Reading

For more details on this behavior, see the OpenFF Interchange documentation:
- [Sharp Edges: Combining Interchanges](https://docs.openforcefield.org/projects/interchange/en/stable/using/edges.html)

---

## Complete Example

See the example configurations in `src/polyzymd/configs/examples/`:

- `enzyme_only.yaml` - Enzyme + substrate, no polymers
- `enzyme_polymer.yaml` - Full enzyme + polymer simulation
- `enzyme_cosolvent.yaml` - Enzyme with DMSO co-solvent

---

## See Also

- {doc}`dynamic_polymers` - Dynamic polymer generation from SMILES
- {doc}`gromacs_export` - Running simulations with GROMACS
- {doc}`polymers` - Polymer setup guide
- {doc}`restraints` - Atom selection and restraints
- {doc}`cli_reference` - CLI documentation