Prepare a PDB for OpenFF and PolyzyMD
Raw Protein Data Bank files often contain crystallographic assemblies, waters, alternate locations, missing-coordinate records, and historical chain labels, none of which amounts to a simulation-ready PolyzyMD enzyme input. This tutorial shows a conservative preparation workflow for the raw 4CHA alpha-chymotrypsin PDB using PyMOL for biological-system inspection and a small PDBFixer script for mechanical cleanup.
The workflow deliberately adds missing hydrogens only. It does not add missing amino-acid residues or missing heavy atoms. If your structure is missing protein residues or heavy atoms that are required for your science, model and validate them before using PolyzyMD. Other structure-preparation tools can be appropriate, but this tutorial prescribes PyMOL plus PDBFixer so the exact selection and cleanup steps are visible.
What you will produce
Starting from a raw RCSB PDB download:
structures/4CHA.pdb
The intended final enzyme file has:
one alpha-chymotrypsin enzyme copy
all protein atom records using chain ID A
TER records preserved between the disconnected peptide fragments
no crystallographic waters
no second enzyme copy
explicit hydrogens
simple structural checks that pass before OpenFF chemistry assignment
successful openff.toolkit.Topology.from_pdb() validation after any required upstream residue, atom, protonation, or connectivity curation is complete
Important
PolyzyMD-prepared structures must use chain ID A for all protein residues.
Substrates use chain B, polymers use chain C, and solvent chains use D and
later. It is acceptable for one enzyme to contain multiple disconnected peptide
fragments after OpenFF ingestion. MD depends on chemical connectivity, not on
whether every protein fragment has a unique chain label.
Prerequisites
PolyzyMD installed in a pixi environment that includes PDBFixer, OpenMM, and OpenFF
PyMOL for interactive inspection
the raw 4CHA PDB file downloaded into structures/4CHA.pdb
a scientific decision about whether any missing coordinates must be modeled before simulation
Create a clean working directory and download the raw PDB file:
mkdir -p 4cha_pdb_prep/structures
cd 4cha_pdb_prep
curl -L https://files.rcsb.org/download/4CHA.pdb -o structures/4CHA.pdb
Inspect the raw 4CHA biological contents
Open the raw file in PyMOL:
load structures/4CHA.pdb, raw4cha
hide everything, raw4cha
show cartoon, polymer.protein
show sticks, polymer.protein and resn CYS
show spheres, solvent
select first_copy, raw4cha and polymer.protein and chain A+B+C
select second_copy, raw4cha and polymer.protein and chain E+F+G
select crystal_waters, raw4cha and solvent
color marine, first_copy
color orange, second_copy
color cyan, crystal_waters
zoom first_copy
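The visual inspection above can be cross-checked with a small script. The following is a minimal sketch that uses only the Python standard library; the function name chain_inventory is illustrative and not part of PolyzyMD, and the code assumes fixed-column PDB records with waters named HOH.

```python
from collections import defaultdict


def chain_inventory(pdb_lines):
    """Count distinct residues per chain ID, split into non-water and water."""
    non_water = defaultdict(set)
    water = defaultdict(set)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resname = line[17:20].strip()   # PDB columns 18-20
        chain = line[21]                # PDB column 22
        residue_key = line[22:27]       # residue number plus insertion code
        target = water if resname == "HOH" else non_water
        target[chain].add(residue_key)
    return (
        {chain: len(keys) for chain, keys in sorted(non_water.items())},
        {chain: len(keys) for chain, keys in sorted(water.items())},
    )
```

Calling chain_inventory(Path("structures/4CHA.pdb").read_text().splitlines()) returns two dictionaries. The first should list the chain IDs of both enzyme copies, and the second shows which chain IDs carry crystallographic waters.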
Read the PDB header before deleting anything. The relevant 4CHA records say:
COMPND assigns one alpha-chymotrypsin enzyme copy across chains A+B+C
COMPND assigns a second enzyme copy across chains E+F+G
mature alpha-chymotrypsin is a cleaved enzyme with multiple peptide fragments
residues 14-15 and 147-148 are biologically excised activation peptides, so TER records between the mature fragments are expected
REMARK 465 reports missing-coordinate residues in the deposited model, including GLY A 12 and LEU A 13 in the first copy
Those REMARK 465 records are not a PDBFixer to-do list for this workflow. They are a signal that the raw crystal model is not automatically simulation-ready. If those missing coordinates matter for your study, provide a curated PDB where they have already been modeled and checked. Do not rely on PolyzyMD or the helper script below to make that biological modeling decision.
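To list the REMARK 465 entries without reading the header by eye, a short parser is enough. This is a sketch under stated assumptions: it uses only the standard library, the function name missing_residues is hypothetical, and it expects the usual REMARK 465 layout in which the residue table follows a column header line beginning "M RES C". It reports what the header says and models nothing.

```python
import re

_RESSEQ = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def missing_residues(pdb_lines):
    """Return (resname, chain, resseq, icode) tuples listed under REMARK 465."""
    found = []
    in_table = False
    for line in pdb_lines:
        if not line.startswith("REMARK 465"):
            continue
        fields = line[10:].split()
        if not in_table:
            # The residue table starts after the "M RES C SSSEQI" column header.
            in_table = fields[:3] == ["M", "RES", "C"]
            continue
        if len(fields) == 4:
            fields = fields[1:]  # drop the optional leading model number
        if len(fields) != 3:
            continue
        match = _RESSEQ.match(fields[2])
        if match:
            found.append((fields[0], fields[1], int(match.group(1)), match.group(2)))
    return found
```

Run on the raw 4CHA lines, the result should include the GLY A 12 and LEU A 13 entries named above. Whether any listed residue must be modeled remains a scientific decision made outside this script.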
Preview the system selection in PyMOL
Use PyMOL to confirm that the first copy is the enzyme you intend to simulate:
disable second_copy
enable first_copy
show cartoon, first_copy
show sticks, first_copy and resn CYS
distance disulfides, first_copy and name SG, first_copy and name SG, 2.2
zoom first_copy
If you want to preview the trimmed system without changing the final preparation path, create a temporary PyMOL object:
create 4cha_first_copy_preview, raw4cha and polymer.protein and chain A+B+C
disable raw4cha
enable 4cha_first_copy_preview
Use this preview for visual inspection only. The final file is written by the
script below so the TER boundaries between fragments are preserved as separate
PDB topology chains while every atom record is relabeled to chain A for
PolyzyMD.
Add hydrogens and write the PolyzyMD PDB
Save this script as prepare_4cha_for_polyzymd.py and run it from a PolyzyMD
pixi shell. It removes the second enzyme copy and waters, keeps the first copy’s
three protein fragments, relabels all remaining protein chains to A, and adds
hydrogens only.
from pathlib import Path

from openmm.app import PDBFile
from pdbfixer import PDBFixer

RAW_PDB = Path("structures/4CHA.pdb")
OUTPUT_PDB = Path("structures/4cha_chymotrypsin_chain_a_openff.pdb")
KEEP_CHAINS = {"A", "B", "C"}  # chain IDs of the first enzyme copy
PH = 7.0


def main() -> None:
    """Prepare one 4CHA enzyme copy for PolyzyMD."""
    fixer = PDBFixer(filename=str(RAW_PDB))

    # Drop every chain outside the first enzyme copy, including chains E, F, and G.
    chains_to_remove = [
        chain_index
        for chain_index, chain in enumerate(fixer.topology.chains())
        if chain.id not in KEEP_CHAINS
    ]
    fixer.removeChains(chains_to_remove)

    # Remove waters and other heterogens, then add hydrogens only.
    fixer.removeHeterogens(keepWater=False)
    fixer.addMissingHydrogens(PH)

    # Relabel the retained fragments to chain A. They remain separate topology
    # chains, so the writer still emits a TER record after each fragment.
    for chain in fixer.topology.chains():
        chain.id = "A"

    OUTPUT_PDB.parent.mkdir(parents=True, exist_ok=True)
    with OUTPUT_PDB.open("w") as handle:
        PDBFile.writeFile(fixer.topology, fixer.positions, handle, keepIds=True)
    print(f"Wrote {OUTPUT_PDB}")


if __name__ == "__main__":
    main()
Run it:
python prepare_4cha_for_polyzymd.py
Warning
This script intentionally does not call findMissingResidues(),
findMissingAtoms(), addMissingAtoms(), or any missing-residue filling method.
If OpenFF later reports missing chemistry, stop and curate the input structure
upstream instead of turning on automatic residue or heavy-atom construction here.
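Once the prepared file exists, it can help to confirm that the fragment boundaries survived the relabeling to chain A. The sketch below uses only the standard library and a hypothetical helper name, fragment_ranges. It reports the first and last residue number of each TER-delimited fragment and assumes the original residue numbers were kept, as the keepIds=True argument in the script requests.

```python
def fragment_ranges(pdb_lines):
    """Return (chain, first_resseq, last_resseq) for each TER-delimited fragment."""
    fragments = []
    current = None
    for line in pdb_lines:
        if line.startswith("TER"):
            if current is not None:
                fragments.append(tuple(current))
                current = None
        elif line.startswith("ATOM"):
            chain = line[21]
            resseq = int(line[22:26])
            if current is None:
                current = [chain, resseq, resseq]
            else:
                current[2] = resseq
    if current is not None:
        # Atoms that follow the last TER record, if any.
        fragments.append(tuple(current))
    return fragments
```

For the file written by the preparation script, the expectation is three fragments that all report chain A, with the numbering gaps between them falling where the excised activation peptides were. Anything else suggests that a TER record or a chain selection was lost along the way.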
Validate the prepared file with OpenFF
Run this validation snippet before referencing the PDB from config.yaml:
from pathlib import Path

from openff.toolkit import Topology

PDB_PATH = Path("structures/4cha_chymotrypsin_chain_a_openff.pdb")


def main() -> None:
    """Validate the prepared PDB for PolyzyMD and OpenFF."""
    lines = PDB_PATH.read_text().splitlines()
    atom_records = [line for line in lines if line.startswith(("ATOM", "HETATM"))]
    ter_count = sum(line.startswith("TER") for line in lines)

    # Fixed-column PDB fields: chain ID, residue name, and element symbol.
    chains = {line[21] for line in atom_records}
    residue_names = {line[17:20].strip() for line in atom_records}
    elements = {line[76:78].strip() for line in atom_records if len(line) >= 78}

    # Run the cheap structural checks first so file problems surface before OpenFF.
    if chains != {"A"}:
        raise SystemExit(f"Expected only chain A, found {sorted(chains)}")
    if ter_count != 3:
        raise SystemExit(f"Expected three TER records, found {ter_count}")
    if "HOH" in residue_names:
        raise SystemExit("Expected no crystallographic waters")
    if "H" not in elements:
        raise SystemExit("Expected explicit hydrogens")

    print("Structural PDB checks passed")
    print(f"  Atom records use chains: {sorted(chains)}")
    print(f"  TER records: {ter_count}")
    print("  Crystallographic waters: absent")
    print("  Explicit hydrogens: present")

    # Chemistry assignment raises an exception if OpenFF cannot interpret the file.
    print("Running OpenFF chemistry validation")
    Topology.from_pdb(str(PDB_PATH))
    print("OpenFF chemistry assignment succeeded")


if __name__ == "__main__":
    main()
Expected success for a fully curated input is:
Structural PDB checks passed
  Atom records use chains: ['A']
  TER records: 3
  Crystallographic waters: absent
  Explicit hydrogens: present
Running OpenFF chemistry validation
OpenFF chemistry assignment succeeded
When this exact script is run on the PDB prepared mechanically from the uncurated
raw 4CHA file, the chain, TER, water, and hydrogen checks pass, but OpenFF can still reject the file. In a
collaborator reproduction, the mechanically prepared PDB had all atom records on
chain A, three TER records, waters removed, the second enzyme copy removed, and
hydrogens present, while OpenFF still reported chemistry errors around terminal
CYS#0001 hydrogen and disulfide assignment and SER#0011. That failure is useful:
it separates successful structural preparation from unsuccessful chemistry
assignment and tells you to stop and provide a curated protein model before using
PolyzyMD.
If the chain, TER, water, or hydrogen checks fail, fix the preparation script or
your file paths. If Topology.from_pdb() fails after those simple checks pass,
OpenFF is telling you that the file still has unresolved chemistry such as
missing residues, missing heavy atoms, unsupported atom naming, or ambiguous
connectivity. Resolve those issues in a curated input PDB before running
PolyzyMD.
What "curate upstream" means
OpenFF failures after the simple structural checks pass are chemistry-modeling problems in the enzyme input, not PolyzyMD configuration problems. Curating upstream means producing an externally inspected PDB whose residue templates, atom names, protonation states, coordinates, and connectivity are chemically consistent before PolyzyMD loads it.
Common issues to inspect include:
terminal atom naming and terminal hydrogen counts
disulfide cysteine protonation and SG-SG connectivity
missing-coordinate fragments reported in the PDB header
missing heavy atoms in otherwise retained residues
biologically excised residues that should remain absent, with TER records kept between the mature fragments
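The disulfide item in the list above lends itself to a quick scripted check. The following is a rough sketch, not a PolyzyMD or OpenFF feature: it uses only the standard library, the helper name disulfide_report is hypothetical, and the default 2.2 angstrom cutoff simply mirrors the PyMOL distance command used earlier. It pairs cysteine SG atoms that sit within bonding distance and flags any paired cysteine that still carries a thiol hydrogen named HG.

```python
import math


def disulfide_report(pdb_lines, cutoff=2.2):
    """Pair CYS SG atoms within cutoff (angstroms) and flag leftover HG atoms."""
    sg = {}          # (chain, resseq) -> (x, y, z) of the SG atom
    has_hg = set()   # cysteines that still carry a thiol hydrogen
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if line[17:20].strip() not in {"CYS", "CYX"}:
            continue
        key = (line[21], int(line[22:26]))
        name = line[12:16].strip()
        if name == "SG":
            sg[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        elif name == "HG":
            has_hg.add(key)
    keys = sorted(sg)
    report = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if math.dist(sg[first], sg[second]) <= cutoff:
                protonated = [k for k in (first, second) if k in has_hg]
                report.append((first, second, protonated))
    return report
```

Each entry in the result is a pair of cysteine keys plus the subset of that pair that still has HG. A non-empty subset is one possible source of a disulfide assignment error, though it is not the only one, and the remedy is still to curate the input rather than to patch the prepared file by hand.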
Possible curation paths include manual editing and inspection in PyMOL or
ChimeraX, deliberate PDBFixer use with careful review of every modeled atom,
Amber-oriented cleanup with pdb4amber, or rebuilding missing structural regions
with tools such as MODELLER, AlphaFold, or SWISS-MODEL when that is scientifically
appropriate. These are external modeling decisions. PolyzyMD should receive the
curated result; do not ask PolyzyMD or this preparation script to infer missing
heavy atoms or missing residues automatically.
After curation, preserve the PolyzyMD conventions from this tutorial: all protein
atom records use chain A, and TER records remain between disconnected protein
fragments.
Use the validated PDB in PolyzyMD
After the OpenFF validation snippet succeeds, point the enzyme block in your PolyzyMD config at the prepared PDB:
enzyme:
  name: "alpha_chymotrypsin_4cha"
  pdb_path: "structures/4cha_chymotrypsin_chain_a_openff.pdb"
Then run the normal PolyzyMD config and build checks:
pixi run -e build polyzymd validate -c config.yaml
pixi run -e build polyzymd build -c config.yaml --dry-run
Warning
These commands check PolyzyMD configuration and planned build inputs. They do not prove that OpenFF can assign enzyme chemistry from the PDB. A config validation and dry-run build can pass even when a real build later fails while loading the enzyme. Use the OpenFF validation snippet above, or a real PolyzyMD build/load, to confirm OpenFF chemistry validity.
Why not just run polyzymd clean-pdb?
polyzymd clean-pdb is a convenience helper for simple PDB cleanup. It is not a
substitute for selecting the biological system, deciding which crystal copy to
simulate, removing unwanted molecules, preserving fragment boundaries, assigning
PolyzyMD chain IDs, or modeling missing residues and heavy atoms. Use it only
when those scientific decisions have already been made and the input is already
close to simulation-ready.