Troubleshoot OpenFF PDB ingestion
Use this guide when a PolyzyMD build fails while OpenFF loads an enzyme PDB, or
when direct openff.toolkit.Topology.from_pdb() validation fails.
Important
Do not bypass a charge mismatch or monkeypatch OpenFF to continue. OpenFF is reporting that the PDB chemistry it inferred does not match a supported residue graph. Fix or curate the structure first.
Quick triage
Reproduce the failure outside PolyzyMD.
Run simple structural checks on the PDB.
Read the OpenFF error dump for the residue names, atom names, and charges it expected versus found.
Fix the PDB upstream, or use a documented narrow proof of concept only when the user accepts the caveat.
Add newly diagnosed error signatures to the durable error catalog.
Check the structure first
Use a small script or text inspection to answer these questions. Some checks are OpenFF chemistry checks; chain-ID checks are PolyzyMD input conventions that keep protein, substrate, polymer, and solvent roles unambiguous later in the build.
Do protein atom records use PolyzyMD chain A?
Are substrate, polymer, and solvent chains kept out of the enzyme PDB or placed on the expected chains B, C, and D+ when relevant?
Are TER records present between disconnected protein fragments?
Are crystallographic waters and unrelated heterogens removed from the enzyme PDB unless intentionally retained elsewhere?
Are hydrogens explicit?
Do PDB header records report missing residues or missing heavy atoms?
Do disulfide cysteines have SG-SG connectivity and no SG-bound HG proton?
Example quick check:
import argparse
from pathlib import Path


def summarize_pdb(path: str) -> None:
    lines = Path(path).read_text().splitlines()
    # Coordinate records only; header and connectivity records are counted below.
    atoms = [line for line in lines if line.startswith(("ATOM", "HETATM"))]
    chains = sorted({line[21] for line in atoms})  # chain ID is column 22
    residues = sorted({line[17:20].strip() for line in atoms})
    # Element symbols live in columns 77-78; lines without them are skipped,
    # so a PDB lacking element columns reports no hydrogens.
    elements = {line[76:78].strip() for line in atoms if len(line) >= 78}
    print(f"chains: {chains}")
    print(f"TER records: {sum(line.startswith('TER') for line in lines)}")
    print(f"hydrogens present: {'H' in elements}")
    print(f"residue names include CYX: {'CYX' in residues}")
    print(f"SSBOND records: {sum(line.startswith('SSBOND') for line in lines)}")
    print(f"CONECT records: {sum(line.startswith('CONECT') for line in lines)}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize a PDB before OpenFF validation")
    parser.add_argument("pdb", help="Prepared enzyme PDB to inspect")
    args = parser.parse_args()
    summarize_pdb(args.pdb)


if __name__ == "__main__":
    main()
Validate directly with OpenFF
Run direct validation before editing PolyzyMD configuration:
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Validate PDB ingestion with OpenFF")
    parser.add_argument("pdb", help="Prepared enzyme PDB to validate")
    args = parser.parse_args()

    # Imported after argument parsing so --help works without loading OpenFF.
    from openff.toolkit import Topology

    # Raises if OpenFF cannot assign chemistry to the PDB contents.
    Topology.from_pdb(args.pdb)
    print("OpenFF PDB ingestion succeeded")


if __name__ == "__main__":
    main()
With pixi:
pixi run -e build python validate_openff.py prepared_enzyme.pdb
If direct validation fails, the problem is in the PDB chemistry OpenFF sees. A
passing polyzymd validate or polyzymd build --dry-run does not prove OpenFF
can ingest the enzyme PDB.
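To keep the full error dump for later reading, the validation call can be wrapped so the message is written to a file. This is a minimal sketch: `capture_ingestion_error`, its `loader` parameter, and the default `openff_error.txt` path are hypothetical names, not part of OpenFF or PolyzyMD. The loader is passed in as an argument so the helper can be exercised without OpenFF installed.

```python
from pathlib import Path
from typing import Callable, Optional


def capture_ingestion_error(
    pdb_path: str,
    loader: Callable[[str], object],
    log_path: str = "openff_error.txt",  # hypothetical default location
) -> Optional[str]:
    """Call loader(pdb_path); on failure, save the full message and return it."""
    try:
        loader(pdb_path)
    except Exception as exc:  # broad on purpose: the goal is to keep the dump
        message = f"{type(exc).__name__}: {exc}"
        Path(log_path).write_text(message)
        return message
    return None  # ingestion succeeded, nothing to record
```

Passing `Topology.from_pdb` as `loader` reproduces the direct validation while preserving the message. The helper only records the failure; it does not suppress or work around it.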
Interpret OpenFF error dumps
OpenFF PDB errors often include a residue-level description of atoms, bonds, and charges. Focus on:
the first residue name and number mentioned in the mismatch
unexpected hydrogens on terminal atoms or cysteine SG atoms
residues OpenFF matched as a different protonation or connectivity state
formal-charge totals that differ between the PDB graph and the template
nearby TER records, SSBOND records, and CONECT records
Do not delete atoms just to make the message disappear. Make a chemically consistent model and validate it again.
Common signatures
| Error signature | Likely cause | Diagnostic | Acceptable fix | Caveats |
|---|---|---|---|---|
|  | Residue graph, hydrogens, termini, or disulfide state differs from OpenFF’s matched template | Inspect the named residue’s hydrogens, bonds, and neighboring TER/SSBOND/CONECT records | Correct protonation/connectivity or use a reviewed custom substructure proof of concept | Never ignore the mismatch |
| Failure around | N-terminal cysteine/cystine has terminal hydrogens and disulfide state OpenFF does not match cleanly | Check N-terminal atom names, SG-HG absence, and SG-SG bond | Curate the terminal cystine or use a narrow | Private OpenFF API; not universal |
| Residue names include |  | Compare residue atoms and SG-SG connectivity | Add/verify disulfide connectivity and hydrogens; consider upstream issue/PR | Do not assume renaming to CYX is sufficient |
| Failure near residues reported in | Missing-coordinate residues or missing heavy atoms affect chemistry or termini | Read PDB header and visualize gaps | Model missing regions with an external tool when scientifically appropriate | PolyzyMD should receive a curated result |
Disulfides
For each disulfide:
Confirm the paired cysteine SG atoms are close and intentionally bonded.
Remove inappropriate SG-bound HG protons from disulfide cysteines.
Preserve or add reliable connectivity records. SSBOND is useful metadata; CONECT can make the actual SG-SG bond explicit for parser paths that use it.
Validate again with Topology.from_pdb().
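The first two checks can be sketched with plain fixed-column PDB parsing. The function name `find_disulfide_candidates` and the 2.5 Å cutoff are assumptions chosen for illustration (a typical S-S bond is roughly 2.05 Å). The sketch only looks at residues named CYS or CYX, ignores alternate locations, and does not read SSBOND or CONECT records.

```python
import math


def find_disulfide_candidates(pdb_text: str, cutoff: float = 2.5):
    """Return (close SG-SG pairs, paired cysteines that still carry HG)."""
    sg = {}          # (chain, residue number) -> SG coordinates
    has_hg = set()   # cysteines with an SG-bound HG proton present
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        if line[17:20].strip() not in {"CYS", "CYX"}:
            continue
        key = (line[21], line[22:27].strip())
        name = line[12:16].strip()
        if name == "SG":
            sg[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        elif name == "HG":
            has_hg.add(key)

    keys = sorted(sg)
    pairs = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            distance = math.dist(sg[a], sg[b])
            if distance <= cutoff:  # assumed cutoff, see lead-in
                pairs.append((a, b, round(distance, 2)))

    # Cysteines that look disulfide-bonded but still have a thiol proton.
    flagged = sorted({k for a, b, _ in pairs for k in (a, b) if k in has_hg})
    return pairs, flagged
```

Any residue in the second return value is a candidate for HG removal, but confirm that the pairing is chemically intended before editing the structure.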
Termini and hydrogen naming
Terminal residues combine residue chemistry with chain-fragment state. A mature protein can have multiple TER-separated fragments, each with termini. Verify that terminal hydrogens and atom names match the intended protonation state. This is especially important for N-terminal cystines.
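A rough way to list what each fragment's termini look like is to split ATOM records at TER lines and report the first and last residue of each fragment. In this sketch, `summarize_termini` and `residue_id` are hypothetical helper names. It assumes N-terminal hydrogens are named H, H1, H2, or H3 and that a C-terminal carboxylate carries an OXT atom; files from tools with other naming schemes need a different name set. HETATM records are ignored.

```python
N_TERM_H_NAMES = {"H", "H1", "H2", "H3"}  # assumed naming convention


def residue_id(line: str) -> tuple:
    """Return (residue name, chain ID, residue number plus insertion code)."""
    return (line[17:20].strip(), line[21], line[22:27].strip())


def summarize_termini(pdb_text: str) -> list:
    """Describe the first and last residue of each TER-separated fragment."""
    fragments, current = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            current.append(line)
        elif line.startswith("TER") and current:
            fragments.append(current)
            current = []
    if current:  # final fragment without a trailing TER
        fragments.append(current)

    report = []
    for atoms in fragments:
        first, last = residue_id(atoms[0]), residue_id(atoms[-1])
        first_names = {a[12:16].strip() for a in atoms if residue_id(a) == first}
        last_names = {a[12:16].strip() for a in atoms if residue_id(a) == last}
        report.append({
            "n_term": first,
            "n_term_h": sorted(first_names & N_TERM_H_NAMES),
            "c_term": last,
            "has_oxt": "OXT" in last_names,
        })
    return report
```

Three hydrogens on the N-terminal nitrogen suggest a charged amine, and a missing OXT suggests an incomplete or capped C-terminus. Compare both against the protonation state you intend before validating again.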
Missing residues and heavy atoms
REMARK 465 and related header records are not instructions to auto-fill a PDB.
They are warnings that the deposited model is incomplete. Decide whether to model
missing residues or heavy atoms with external tools such as PDBFixer, MODELLER,
SWISS-MODEL, AlphaFold-derived models, ChimeraX, or PyMOL workflows, then review
the result before passing it to PolyzyMD.
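Before deciding what to model, it can help to list what the header actually reports. The sketch below reads REMARK 465 (missing residues) and REMARK 470 (missing atoms) data lines. The helper name `list_missing_records` and the regular expressions are illustrative assumptions about the usual column layout, so spot-check the output against the raw header.

```python
import re

# Data lines look like "REMARK 465     SER A    11" (optional model number first).
MISSING_RESIDUE = re.compile(
    r"^REMARK 465\s+(?:\d+\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+[A-Z]?)\s*$"
)
# Data lines look like "REMARK 470     LYS A  36    CG   CD   CE   NZ".
MISSING_ATOMS = re.compile(
    r"^REMARK 470\s+(?:\d+\s+)?([A-Z0-9]{1,3})\s+(\S)\s*(-?\d+[A-Z]?)\s+(\S.*)$"
)


def list_missing_records(pdb_text: str) -> dict:
    """Collect missing residues and missing atoms reported in the header."""
    result = {"missing_residues": [], "missing_atoms": []}
    for line in pdb_text.splitlines():
        match = MISSING_RESIDUE.match(line)
        if match:
            result["missing_residues"].append(match.groups())
            continue
        match = MISSING_ATOMS.match(line)
        if match:
            name, chain, number, atoms = match.groups()
            result["missing_atoms"].append((name, chain, number, atoms.split()))
    return result
```

The explanatory text lines inside each REMARK block do not match the patterns and are skipped. An empty result means the header reports nothing, not that the model is complete.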
Charge mismatch
Treat charge mismatch as a blocker. It means OpenFF’s inferred molecule and the matched substructure disagree. The fix is to make the residue graph, atom names, bonds, hydrogens, and protonation state consistent, not to suppress the error.
4CHA case study
The 4CHA alpha-chymotrypsin preparation can pass structural checks after selecting
one enzyme copy, relabeling protein atoms to chain A, preserving TER records,
removing waters, and adding hydrogens. Direct OpenFF validation may still fail
around the N-terminal cystine and nearby residues. That failure separates PDB
cleanup from chemistry assignment.
See examples/pdb_preparation/4cha/ for proof-of-concept scripts, including an
NCYX custom substructure example. The _custom_substructures argument is a
private OpenFF API, so this example is evidence for a targeted workaround or
upstream contribution, not a production-ready universal preparation method.
Keep the error catalog current
When you diagnose a new OpenFF PDB ingestion error, update this page and the OpenFF PDB ingestion reference before marking the task complete. Add:
exact error text or shortest unique traceback excerpt
likely cause
diagnostic snippet or command
acceptable fix
caveats, especially private APIs or structure-specific assumptions
If the user explicitly defers the update, record that deferral in your final response.