# Prepare a PDB for OpenFF and PolyzyMD

Raw Protein Data Bank files often describe crystallographic assemblies, waters,
alternate locations, missing-coordinate records, and historical chain labels that
are not the same thing as a simulation-ready PolyzyMD enzyme input. This tutorial
shows a conservative preparation workflow for the raw 4CHA alpha-chymotrypsin PDB
using PyMOL for biological-system inspection and a small PDBFixer script for
mechanical cleanup.

The workflow deliberately adds **missing hydrogens only**. It does not add
missing amino-acid residues or missing heavy atoms. If your structure is missing
protein residues or heavy atoms that are required for your science, model and
validate them before using PolyzyMD. Other structure-preparation tools can be
appropriate, but this tutorial prescribes PyMOL plus PDBFixer so the exact
selection and cleanup steps are visible.

## What you will produce

Starting from a raw RCSB PDB download:

```text
structures/4CHA.pdb
```

The intended final enzyme file has:

- one alpha-chymotrypsin enzyme copy
- all protein atom records using chain ID `A`
- TER records preserved between the disconnected peptide fragments
- no crystallographic waters
- no second enzyme copy
- explicit hydrogens
- simple structural checks that pass before OpenFF chemistry assignment
- successful `openff.toolkit.Topology.from_pdb()` validation after any required
  upstream residue, atom, protonation, or connectivity curation is complete

```{important}
PolyzyMD-prepared structures must use chain ID `A` for all protein residues.
Substrates use chain `B`, polymers use chain `C`, and solvent chains use `D` and
later. It is acceptable for one enzyme to contain multiple disconnected peptide
fragments after OpenFF ingestion. MD depends on chemical connectivity, not on
whether every protein fragment has a unique chain label.
```

## Prerequisites

- PolyzyMD installed in a pixi environment that includes PDBFixer, OpenMM, and
  OpenFF
- PyMOL for interactive inspection
- the raw 4CHA PDB file downloaded into `structures/4CHA.pdb`
- a scientific decision about whether any missing coordinates must be modeled
  before simulation

Create a clean working directory and download the raw PDB file:

```bash
mkdir -p 4cha_pdb_prep/structures
cd 4cha_pdb_prep
curl -L https://files.rcsb.org/download/4CHA.pdb -o structures/4CHA.pdb
```

## Inspect the raw 4CHA biological contents

Open the raw file in PyMOL:

```text
load structures/4CHA.pdb, raw4cha
hide everything, raw4cha
show cartoon, polymer.protein
show sticks, polymer.protein and resn CYS
show spheres, solvent

select first_copy, raw4cha and polymer.protein and chain A+B+C
select second_copy, raw4cha and polymer.protein and chain E+F+G
select crystal_waters, raw4cha and solvent

color marine, first_copy
color orange, second_copy
color cyan, crystal_waters
zoom first_copy
```
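The PyMOL selections above can be cross-checked outside the viewer. The following is a minimal standard-library sketch, not part of PolyzyMD, that counts distinct residues and waters per chain ID from the fixed-column atom records. The names `atom_line` and `chain_inventory` are hypothetical, and the demo records are synthetic stand-ins rather than real 4CHA coordinates.

```python
from collections import defaultdict


def atom_line(serial, name, resname, chain, resseq, element, record="ATOM"):
    """Hypothetical demo helper: build one fixed-column PDB atom record."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{element:>12}"
    )


def chain_inventory(pdb_lines):
    """Map each chain ID to counts of distinct non-water residues and waters."""
    seen = defaultdict(lambda: {"residues": set(), "waters": set()})
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resname = line[17:20].strip()
        chain_id = line[21]
        # Residue number plus insertion code identifies a residue within a chain.
        residue_key = (resname, line[22:27])
        bucket = "waters" if resname == "HOH" else "residues"
        seen[chain_id][bucket].add(residue_key)
    return {
        chain_id: {kind: len(keys) for kind, keys in buckets.items()}
        for chain_id, buckets in sorted(seen.items())
    }


# Synthetic records for illustration only; these are not 4CHA data.
demo = [
    atom_line(1, "N", "CYS", "A", 1, "N"),
    atom_line(2, "CA", "CYS", "A", 1, "C"),
    atom_line(3, "N", "ILE", "B", 16, "N"),
    atom_line(4, "N", "CYS", "E", 1, "N"),
    atom_line(5, "O", "HOH", "A", 301, "O", record="HETATM"),
]
for chain_id, counts in chain_inventory(demo).items():
    print(chain_id, counts)
```

To inspect the real download, pass `Path("structures/4CHA.pdb").read_text().splitlines()` instead of `demo`. The counts only summarize what the file contains; they do not decide which copy to keep.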

Read the PDB header before deleting anything. The relevant 4CHA records say:

- COMPND assigns one alpha-chymotrypsin enzyme copy across chains `A+B+C`
- COMPND assigns a second enzyme copy across chains `E+F+G`
- mature alpha-chymotrypsin is a cleaved enzyme with multiple peptide fragments
- residues 14-15 and 147-148 are biologically excised activation peptides, so
  TER records between the mature fragments are expected
- REMARK 465 reports missing-coordinate residues in the deposited model, including
  `GLY A 12` and `LEU A 13` in the first copy

Those REMARK 465 records are not a PDBFixer to-do list for this workflow. They
are a signal that the raw crystal model is not automatically simulation-ready. If
those missing coordinates matter for your study, provide a curated PDB where they
have already been modeled and checked. Do not rely on PolyzyMD or the helper
script below to make that biological modeling decision.
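To see which missing-coordinate residues fall inside the copy you plan to keep, the REMARK 465 rows can be listed with a few lines of standard-library Python. This is a sketch under stated assumptions: `parse_remark_465` is a hypothetical helper, it assumes the usual whitespace-separated row layout (optional model number, residue name, chain ID, residue number), and the sample below is abridged to the two rows named in this tutorial.

```python
def parse_remark_465(pdb_lines):
    """Return (resname, chain_id, residue_number) for each REMARK 465 residue row."""
    missing = []
    for line in pdb_lines:
        if not line.startswith("REMARK 465"):
            continue
        fields = line[10:].split()
        # Residue rows look like "GLY A 12", optionally preceded by a model number.
        if len(fields) == 4 and fields[0].isdigit():
            fields = fields[1:]
        if len(fields) != 3:
            continue  # free-text header rows
        resname, chain_id, resseq = fields
        # Tolerate a trailing insertion code such as "12A".
        number = resseq.rstrip("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        if len(chain_id) != 1 or not number.lstrip("-").isdigit():
            continue  # column-legend row such as "M RES C SSSEQI"
        missing.append((resname, chain_id, int(number)))
    return missing


# Abridged sample: only the two first-copy rows mentioned in the text above.
sample = [
    "REMARK 465",
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     GLY A    12",
    "REMARK 465     LEU A    13",
]
keep_chains = {"A", "B", "C"}
in_first_copy = [row for row in parse_remark_465(sample) if row[1] in keep_chains]
print(in_first_copy)
```

The output is a reading aid for the scientific decision described above. It does not model anything and should not be fed to an automatic residue builder.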

## Preview the system selection in PyMOL

Use PyMOL to confirm that the first copy is the enzyme you intend to simulate:

```text
hide everything, second_copy
show cartoon, first_copy
show sticks, first_copy and resn CYS
distance disulfides, first_copy and name SG, first_copy and name SG, 2.2
zoom first_copy
```

If you want to preview the trimmed system without changing the final preparation
path, create a temporary PyMOL object:

```text
create 4cha_first_copy_preview, raw4cha and polymer.protein and chain A+B+C
disable raw4cha
enable 4cha_first_copy_preview
```

Use this preview for visual inspection only. The final file is written by the
script below, which keeps the three fragments as separate chains in the OpenMM
topology, so the writer emits a TER record after each one, while relabeling every
atom record to chain `A` for PolyzyMD.

## Add hydrogens and write the PolyzyMD PDB

Save this script as `prepare_4cha_for_polyzymd.py` and run it from a PolyzyMD
pixi shell. It removes the second enzyme copy and waters, keeps the first copy's
three protein fragments, relabels all remaining protein chains to `A`, and adds
hydrogens only.

```python
from pathlib import Path

from openmm.app import PDBFile
from pdbfixer import PDBFixer


RAW_PDB = Path("structures/4CHA.pdb")
OUTPUT_PDB = Path("structures/4cha_chymotrypsin_chain_a_openff.pdb")
KEEP_CHAINS = {"A", "B", "C"}
PH = 7.0


def main() -> None:
    """Prepare one 4CHA enzyme copy for PolyzyMD."""
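    fixer = PDBFixer(filename=str(RAW_PDB))

    # Drop every chain outside the first enzyme copy. Indices are used because
    # a chain ID is not guaranteed to be unique within the parsed topology.
    chains_to_remove = [
        chain_index
        for chain_index, chain in enumerate(fixer.topology.chains())
        if chain.id not in KEEP_CHAINS
    ]
    fixer.removeChains(chains_to_remove)

    # Remove waters and any other heterogens, then add hydrogens only.
    # No missing residues or heavy atoms are built.
    fixer.removeHeterogens(keepWater=False)
    fixer.addMissingHydrogens(PH)

    # Relabel every remaining chain. The chains stay separate topology objects,
    # so the writer still emits a TER record after each fragment.
    for chain in fixer.topology.chains():
        chain.id = "A"

    # keepIds=True writes the chain IDs set above and the original residue
    # numbers instead of renumbering from 1.
    OUTPUT_PDB.parent.mkdir(parents=True, exist_ok=True)
    with OUTPUT_PDB.open("w") as handle:
        PDBFile.writeFile(fixer.topology, fixer.positions, handle, keepIds=True)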

    print(f"Wrote {OUTPUT_PDB}")


if __name__ == "__main__":
    main()
```

Run it:

```bash
python prepare_4cha_for_polyzymd.py
```

```{warning}
This script intentionally does not call `findMissingResidues()`,
`findMissingAtoms()`, `addMissingAtoms()`, or any missing-residue filling method.
If OpenFF later reports missing chemistry, stop and curate the input structure
upstream instead of turning on automatic residue or heavy-atom construction here.
```

## Validate the prepared file with OpenFF

Run this validation snippet before referencing the PDB from `config.yaml`:

```python
from pathlib import Path

from openff.toolkit import Topology


PDB_PATH = Path("structures/4cha_chymotrypsin_chain_a_openff.pdb")


def main() -> None:
    """Validate the prepared PDB for PolyzyMD and OpenFF."""
    lines = PDB_PATH.read_text().splitlines()
    atom_records = [line for line in lines if line.startswith(("ATOM", "HETATM"))]
    ter_count = sum(line.startswith("TER") for line in lines)
    # Fixed PDB columns as 0-based slices: chain ID at 21, residue name at
    # 17-19, element symbol at 76-77.
    chains = {line[21] for line in atom_records}
    residue_names = {line[17:20].strip() for line in atom_records}
    elements = {line[76:78].strip() for line in atom_records if len(line) >= 78}

    if chains != {"A"}:
        raise SystemExit(f"Expected only chain A, found {sorted(chains)}")
    if ter_count != 3:
        raise SystemExit(f"Expected three TER records, found {ter_count}")
    if "HOH" in residue_names:
        raise SystemExit("Expected no crystallographic waters")
    if "H" not in elements:
        raise SystemExit("Expected explicit hydrogens")

    print("Structural PDB checks passed")
    print(f"  Atom records use chains: {sorted(chains)}")
    print(f"  TER records: {ter_count}")
    print("  Crystallographic waters: absent")
    print("  Explicit hydrogens: present")

    print("Running OpenFF chemistry validation")
    Topology.from_pdb(str(PDB_PATH))
    print("OpenFF chemistry assignment succeeded")


if __name__ == "__main__":
    main()
```

Expected success for a fully curated input is:

```text
Structural PDB checks passed
  Atom records use chains: ['A']
  TER records: 3
  Crystallographic waters: absent
  Explicit hydrogens: present
Running OpenFF chemistry validation
OpenFF chemistry assignment succeeded
```

When this validation script is run on the PDB that the preparation script writes
from the uncurated raw 4CHA file, the chain, TER, water, and hydrogen checks
pass, but OpenFF can still reject the file. In a
collaborator reproduction, the mechanically prepared PDB had all atom records on
chain `A`, three TER records, waters removed, the second enzyme copy removed, and
hydrogens present, while OpenFF still reported chemistry errors around terminal
CYS#0001 hydrogen and disulfide assignment and SER#0011. That failure is useful:
it separates successful structural preparation from unsuccessful chemistry
assignment and tells you to stop and provide a curated protein model before using
PolyzyMD.

If the chain, TER, water, or hydrogen checks fail, fix the preparation script or
your file paths. If `Topology.from_pdb()` fails after those simple checks pass,
OpenFF is telling you that the file still has unresolved chemistry such as
missing residues, missing heavy atoms, unsupported atom naming, or ambiguous
connectivity. Resolve those issues in a curated input PDB before running
PolyzyMD.
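The hydrogen check in the validation snippet only confirms that at least one hydrogen exists in the file. When narrowing down a failure, a per-residue view can help. The sketch below is a standard-library illustration with hypothetical names (`atom_line`, `residues_without_hydrogens`) and synthetic demo records; it lists residues that carry no hydrogen atoms at all, using the same element columns as the validation snippet.

```python
def atom_line(serial, name, resname, resseq, element):
    """Hypothetical demo helper: one fixed-column ATOM record on chain A."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{element:>12}"
    )


def residues_without_hydrogens(pdb_lines):
    """Return (chain_id, residue_number, resname) for residues with no H atoms."""
    has_hydrogen = {}
    for line in pdb_lines:
        # Records shorter than 78 characters carry no element column; skip them.
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 78:
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        is_h = line[76:78].strip() == "H"
        has_hydrogen[key] = has_hydrogen.get(key, False) or is_h
    return [key for key, found in has_hydrogen.items() if not found]


# Synthetic records for illustration only.
demo = [
    atom_line(1, "N", "CYS", 1, "N"),
    atom_line(2, "H", "CYS", 1, "H"),
    atom_line(3, "N", "GLY", 2, "N"),
]
print(residues_without_hydrogens(demo))
```

An empty list does not prove the protonation is chemically right. It only rules out residues that were skipped entirely during hydrogen addition.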

## What "curate upstream" means

OpenFF failures after the simple structural checks pass are chemistry-modeling
problems in the enzyme input, not PolyzyMD configuration problems. Curating
upstream means producing an externally inspected PDB whose residue templates,
atom names, protonation states, coordinates, and connectivity are chemically
consistent before PolyzyMD loads it.

Common issues to inspect include:

- terminal atom naming and terminal hydrogen counts
- disulfide cysteine protonation and SG-SG connectivity
- missing-coordinate fragments reported in the PDB header
- missing heavy atoms in otherwise retained residues
- biologically excised residues that should remain absent, with TER records kept
  between the mature fragments
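
For the disulfide item in the list above, a quick geometric screen can point at residues worth inspecting. This is a minimal sketch, assuming the prepared file uses chain `A` throughout with unique residue numbers and no insertion codes; `atom_line` and `disulfide_report` are hypothetical names, the 2.2 Å cutoff mirrors the PyMOL `distance` command used earlier, and the demo coordinates are synthetic.

```python
import math
from itertools import combinations


def atom_line(serial, name, resname, resseq, xyz, element):
    """Hypothetical demo helper: one fixed-column ATOM record on chain A."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{element:>12}"
    )


def disulfide_report(pdb_lines, cutoff=2.2):
    """List cysteine SG-SG pairs within `cutoff` angstroms and any HG left on them."""
    sg_coords = {}
    has_hg = set()
    for line in pdb_lines:
        if not line.startswith("ATOM") or line[17:20] not in ("CYS", "CYX"):
            continue
        residue = int(line[22:26])
        atom_name = line[12:16].strip()
        if atom_name == "SG":
            sg_coords[residue] = tuple(float(line[i:i + 8]) for i in (30, 38, 46))
        elif atom_name == "HG":
            has_hg.add(residue)
    report = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        distance = math.dist(xyz_a, xyz_b)
        if distance <= cutoff:
            # A thiol hydrogen on a bonded-looking cysteine is a red flag.
            protonated = sorted({res_a, res_b} & has_hg)
            report.append((res_a, res_b, round(distance, 2), protonated))
    return report


# Synthetic coordinates for illustration only.
demo = [
    atom_line(1, "SG", "CYS", 1, (0.0, 0.0, 0.0), "S"),
    atom_line(2, "SG", "CYS", 122, (2.04, 0.0, 0.0), "S"),
    atom_line(3, "HG", "CYS", 122, (2.5, 1.2, 0.0), "H"),
    atom_line(4, "SG", "CYS", 58, (10.0, 0.0, 0.0), "S"),
]
for res_a, res_b, distance, protonated in disulfide_report(demo):
    print(res_a, res_b, distance, "HG still present on:", protonated)
```

A pair with a non-empty final field suggests that hydrogen placement and disulfide connectivity disagree for those residues. Treat it as a prompt for manual inspection, not as a diagnosis of what OpenFF will report.
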

Possible curation paths include manual editing and inspection in PyMOL or
ChimeraX, deliberate PDBFixer use with careful review of every modeled atom,
Amber-oriented cleanup with `pdb4amber`, or rebuilding missing structural regions
with tools such as MODELLER, AlphaFold, or SWISS-MODEL when that is scientifically
appropriate. These are external modeling decisions. PolyzyMD should receive the
curated result; do not ask PolyzyMD or this preparation script to infer missing
heavy atoms or missing residues automatically.

After curation, preserve the PolyzyMD conventions from this tutorial: all protein
atom records use chain `A`, and TER records remain between disconnected protein
fragments.
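
One way to confirm that curation kept the fragment boundaries is to list the residue range of each TER-delimited fragment and the residue numbers skipped between them. The following standard-library sketch uses hypothetical helpers (`atom_line`, `fragment_ranges`, `gaps_between`) and a synthetic demo whose breaks sit at the excised residues 14-15 and 147-148 named earlier; the final residue number in the demo is arbitrary.

```python
def atom_line(serial, resseq):
    """Hypothetical demo helper: one fixed-column CA record on chain A."""
    return (
        f"ATOM  {serial:>5}  CA  GLY A{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{'C':>12}"
    )


def fragment_ranges(pdb_lines):
    """Return (chain_id, first_residue, last_residue) per TER-delimited fragment."""
    fragments = []
    current = None
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            chain_id, resseq = line[21], int(line[22:26])
            if current is None:
                current = [chain_id, resseq, resseq]
            else:
                current[2] = resseq
        elif line.startswith(("TER", "END")) and current is not None:
            fragments.append(tuple(current))
            current = None
    if current is not None:
        fragments.append(tuple(current))  # file ended without TER or END
    return fragments


def gaps_between(fragments):
    """List the residue numbers skipped between consecutive fragments."""
    return [
        list(range(prev[2] + 1, nxt[1]))
        for prev, nxt in zip(fragments, fragments[1:])
    ]


# Synthetic three-fragment layout for illustration only.
demo = [
    atom_line(1, 1), atom_line(2, 13), "TER",
    atom_line(3, 16), atom_line(4, 146), "TER",
    atom_line(5, 149), atom_line(6, 150), "TER",
    "END",
]
fragments = fragment_ranges(demo)
print(fragments)
print(gaps_between(fragments))
```

If a curated file modeled residues that were missing from the crystal structure, the fragment ranges will change accordingly, but the gaps at the excised activation peptides should remain.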

## Use the validated PDB in PolyzyMD

After the OpenFF validation snippet succeeds, point the enzyme block in your
PolyzyMD config at the prepared PDB:

```yaml
enzyme:
  name: "alpha_chymotrypsin_4cha"
  pdb_path: "structures/4cha_chymotrypsin_chain_a_openff.pdb"
```

Then run the normal PolyzyMD config and build checks:

```bash
pixi run -e build polyzymd validate -c config.yaml
pixi run -e build polyzymd build -c config.yaml --dry-run
```

```{warning}
These commands check PolyzyMD configuration and planned build inputs. They do not
prove that OpenFF can assign enzyme chemistry from the PDB. A config validation
and dry-run build can pass even when a real build later fails while loading the
enzyme. Use the OpenFF validation snippet above, or a real PolyzyMD build/load,
to confirm OpenFF chemistry validity.
```

## Why not just run `polyzymd clean-pdb`?

`polyzymd clean-pdb` is a convenience helper for simple PDB cleanup. It is not a
substitute for selecting the biological system, deciding which crystal copy to
simulate, removing unwanted molecules, preserving fragment boundaries, assigning
PolyzyMD chain IDs, or modeling missing residues and heavy atoms. Use it only
when those scientific decisions have already been made and the input is already
close to simulation-ready.