Troubleshoot OpenFF PDB ingestion

Use this guide when a PolyzyMD build fails while OpenFF loads an enzyme PDB, or when calling openff.toolkit.Topology.from_pdb() directly on that PDB fails.

Important

Do not bypass a charge mismatch or monkeypatch OpenFF to continue. OpenFF is reporting that the PDB chemistry it inferred does not match a supported residue graph. Fix or curate the structure first.

Quick triage

Reproduce the failure outside PolyzyMD.
Run simple structural checks on the PDB.
Read the OpenFF error dump for the residue names, atom names, and charges it expected versus found.
Fix the PDB upstream, or use a documented, narrowly scoped proof of concept only when the user accepts its caveats.
Add newly diagnosed error signatures to the durable error catalog.

Check the structure first

Use a small script or text inspection to answer these questions. Some checks are OpenFF chemistry checks; chain-ID checks are PolyzyMD input conventions that keep protein, substrate, polymer, and solvent roles unambiguous later in the build.

Do protein atom records use PolyzyMD chain A?
Are substrate, polymer, and solvent chains kept out of the enzyme PDB or placed on the expected chains B, C, and D+ when relevant?
Are TER records present between disconnected protein fragments?
Are crystallographic waters and unrelated heterogens removed from the enzyme PDB unless intentionally retained elsewhere?
Are hydrogens explicit?
Do PDB header records report missing residues or missing heavy atoms?
Do disulfide cysteines have SG-SG connectivity and no SG-bound HG proton?

Example quick check:

import argparse
from pathlib import Path


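def summarize_pdb(path: str) -> None:
    """Print a quick structural summary of a PDB file."""
    lines = Path(path).read_text().splitlines()
    # Keep coordinate records long enough to carry a chain ID (column 22).
    atoms = [
        line for line in lines
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21
    ]
    chains = sorted({line[21] for line in atoms})
    residues = sorted({line[17:20].strip() for line in atoms})
    # Element symbols sit in columns 77-78. Lines without that field are skipped,
    # so "hydrogens present" can read False for files that omit element symbols.
    elements = {line[76:78].strip().upper() for line in atoms if len(line) >= 77}
    print(f"chains: {chains}")
    print(f"TER records: {sum(line.startswith('TER') for line in lines)}")
    print(f"hydrogens present: {'H' in elements}")
    print(f"residue names include CYX: {'CYX' in residues}")
    # SSBOND is disulfide metadata; CONECT lists explicit bonds by atom serial.
    print(f"SSBOND records: {sum(line.startswith('SSBOND') for line in lines)}")
    print(f"CONECT records: {sum(line.startswith('CONECT') for line in lines)}")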


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize a PDB before OpenFF validation")
    parser.add_argument("pdb", help="Prepared enzyme PDB to inspect")
    args = parser.parse_args()
    summarize_pdb(args.pdb)


if __name__ == "__main__":
    main()

Validate directly with OpenFF

Run direct validation before editing PolyzyMD configuration:

import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Validate PDB ingestion with OpenFF")
    parser.add_argument("pdb", help="Prepared enzyme PDB to validate")
    args = parser.parse_args()

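    # Import inside main() so argument parsing and --help work even when
    # OpenFF is not importable in the current environment.
    from openff.toolkit import Topology

    # Any exception raised here propagates, so the success message prints
    # only if ingestion completes.
    Topology.from_pdb(args.pdb)
    print("OpenFF PDB ingestion succeeded")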


if __name__ == "__main__":
    main()

With pixi:

pixi run -e build python validate_openff.py prepared_enzyme.pdb

If direct validation fails, the problem is in the PDB chemistry OpenFF sees. A passing polyzymd validate or polyzymd build --dry-run does not prove OpenFF can ingest the enzyme PDB.
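To keep the exact failure text for later triage, the direct validation can be wrapped so the traceback is captured as a string. The sketch below is a minimal illustration and not part of PolyzyMD or OpenFF: `try_ingest` and `traceback_excerpt` are hypothetical helper names, the loader is passed in as a callable, and the broad `except Exception` is a deliberate assumption because the specific exception classes OpenFF raises are not listed in this guide.

```python
import traceback
from typing import Callable, Optional


def try_ingest(pdb_path: str, loader: Callable[[str], object]) -> Optional[str]:
    """Return None if loader(pdb_path) succeeds, else the full traceback text."""
    try:
        loader(pdb_path)
    except Exception:  # assumption: catch broadly; exception types are not enumerated here
        return traceback.format_exc()
    return None


def traceback_excerpt(report: str, max_lines: int = 5) -> str:
    """Keep the last few non-empty lines; a traceback ends with the exception message."""
    lines = [line for line in report.splitlines() if line.strip()]
    return "\n".join(lines[-max_lines:])
```

With OpenFF installed, `try_ingest("prepared_enzyme.pdb", Topology.from_pdb)` would return `None` on success. On failure, save the full report and read it from the top, because long residue-level dumps may not fit in a short excerpt.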

Interpret OpenFF error dumps

OpenFF PDB errors often include a residue-level description of atoms, bonds, and charges. Focus on:

the first residue name and number mentioned in the mismatch
unexpected hydrogens on terminal atoms or cysteine SG atoms
residues OpenFF matched as a different protonation or connectivity state
formal-charge totals that differ between the PDB graph and the template
nearby TER records, SSBOND records, and CONECT records
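Once the error names a residue, it can help to pull that residue's atom names and any TER, SSBOND, or CONECT records that touch it into one view. The following is a rough sketch that assumes standard fixed-width PDB columns; `residue_context` is a hypothetical helper, and bare `TER` lines without residue fields are not matched.

```python
from typing import Optional


def _to_int(field: str) -> Optional[int]:
    """Parse a fixed-width integer field; return None when blank or non-numeric."""
    try:
        return int(field)
    except ValueError:
        return None


def residue_context(pdb_text: str, chain: str, resseq: int) -> dict:
    """Collect atom names for one residue plus TER/SSBOND/CONECT lines that touch it."""
    lines = pdb_text.splitlines()
    atom_names, serials, related = [], set(), []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
            if line[21] == chain and _to_int(line[22:26]) == resseq:
                atom_names.append(line[12:16].strip())
                serial = _to_int(line[6:11])
                if serial is not None:
                    serials.add(serial)
    for line in lines:
        if line.startswith("SSBOND") and len(line) >= 35:
            # Partner residues: chain in columns 16 and 30, number in 18-21 and 32-35.
            partners = {(line[15], _to_int(line[17:21])), (line[29], _to_int(line[31:35]))}
            if (chain, resseq) in partners:
                related.append(line.rstrip())
        elif line.startswith("CONECT"):
            # Atom serials are 5-character fields starting at column 7.
            fields = [line[i:i + 5] for i in range(6, len(line), 5)]
            if serials & {_to_int(f) for f in fields}:
                related.append(line.rstrip())
        elif line.startswith("TER") and len(line) >= 26:
            if line[21] == chain and _to_int(line[22:26]) == resseq:
                related.append(line.rstrip())
    return {"atom_names": atom_names, "related_records": related}
```

For example, `residue_context(Path("prepared_enzyme.pdb").read_text(), "A", 1)` would list the atoms of residue 1 on chain A, which can then be compared against the hydrogens and bonds named in the OpenFF message.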

Do not delete atoms just to make the message disappear. Make a chemically consistent model and validate it again.

Common signatures

Error signature	Likely cause	Diagnostic	Acceptable fix	Caveats
`Molecule has more/fewer total formal charges than the matched substructure`	Residue graph, hydrogens, termini, or disulfide state differs from OpenFF’s matched template	Inspect the named residue’s hydrogens, bonds, and neighboring TER/SSBOND/CONECT records	Correct protonation/connectivity or use a reviewed custom substructure proof of concept	Never ignore the mismatch
Failure around `CYS#0001`, terminal `H`, or N-terminal cysteine	N-terminal cysteine/cystine has terminal hydrogens and disulfide state OpenFF does not match cleanly	Check N-terminal atom names, SG-HG absence, and SG-SG bond	Curate the terminal cystine or use a narrow `NCYX` custom substructure proof of concept	Private OpenFF API; not universal
Residue names include `CYX`, but OpenFF still fails	`CYX` may be treated as a cysteine alias, not a complete public template solution	Compare residue atoms and SG-SG connectivity	Add/verify disulfide connectivity and hydrogens; consider upstream issue/PR	Do not assume renaming to CYX is sufficient
Failure near residues reported in `REMARK 465`	Missing-coordinate residues or missing heavy atoms affect chemistry or termini	Read PDB header and visualize gaps	Model missing regions with an external tool when scientifically appropriate	PolyzyMD should receive a curated result

Disulfides

For each disulfide:

Confirm the paired cysteine SG atoms are close and intentionally bonded.
Remove inappropriate SG-bound HG protons from disulfide cysteines.
Preserve or add reliable connectivity records. SSBOND is useful metadata; CONECT can make the actual SG-SG bond explicit for parser paths that use it.
Validate again with Topology.from_pdb().
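The first two checks above can be approximated with a short script before re-running OpenFF. This is a minimal sketch under stated assumptions: `disulfide_report` is a hypothetical helper, only residues named `CYS` or `CYX` are considered, and the 2.5 Å cutoff is an illustrative threshold (S-S bonds are typically about 2.0 to 2.1 Å), not a value taken from OpenFF or PolyzyMD. Alternate locations are not handled.

```python
import math
from itertools import combinations


def disulfide_report(pdb_text: str, max_ss_distance: float = 2.5) -> list:
    """List close SG-SG pairs and flag partners that still carry an HG atom."""
    sg_coords = {}   # (chain, residue number) -> (x, y, z) of SG
    has_hg = set()   # residues that still have an HG atom
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        if line[17:20].strip() not in {"CYS", "CYX"}:  # assumption: these two names only
            continue
        key = (line[21], line[22:27].strip())
        name = line[12:16].strip()
        if name == "SG":
            sg_coords[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        elif name == "HG":
            has_hg.add(key)
    pairs = []
    for (a, xyz_a), (b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        distance = math.dist(xyz_a, xyz_b)
        if distance <= max_ss_distance:
            pairs.append({
                "pair": (a, b),
                "distance": round(distance, 2),
                "stray_hg": sorted(k for k in (a, b) if k in has_hg),
            })
    return pairs
```

A non-empty `stray_hg` list marks a cysteine that sits at disulfide distance but still has an SG-bound proton. The script only reports geometry; whether a close pair is an intended disulfide remains a judgment for the person curating the structure.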

Termini and hydrogen naming

Terminal residues combine residue chemistry with chain-fragment state. A mature protein can have multiple TER-separated fragments, each with its own N- and C-terminus. Verify that the terminal hydrogens and atom names of every fragment match the intended protonation state. This is especially important for N-terminal cystines.
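To see which residues actually act as termini, the ATOM records can be split at TER lines and the first and last residue of each fragment listed with their atom names. The sketch below is illustrative only: `fragment_termini` is a hypothetical helper, HETATM records are ignored, and hydrogens are detected by atom name (a name starting with `H` after any leading digits), which is a heuristic rather than a rule from OpenFF.

```python
def fragment_termini(pdb_text: str) -> list:
    """For each TER-separated run of ATOM records, describe its first and last residue."""
    fragments, current = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) >= 27:
            current.append(line)
        elif line.startswith("TER") and current:
            fragments.append(current)
            current = []
    if current:  # file without a trailing TER record
        fragments.append(current)

    def describe(atoms: list, residue_id: str) -> dict:
        # residue_id is columns 18-27: residue name, chain, number, insertion code.
        names = [a[12:16].strip() for a in atoms if a[17:27] == residue_id]
        return {
            "residue": (residue_id[0:3].strip(), residue_id[4], residue_id[5:10].strip()),
            "atom_names": names,
            "hydrogen_names": [n for n in names if n.lstrip("0123456789").startswith("H")],
        }

    return [
        {
            "n_terminal": describe(atoms, atoms[0][17:27]),
            "c_terminal": describe(atoms, atoms[-1][17:27]),
        }
        for atoms in fragments
    ]
```

Compare the reported hydrogen and heavy-atom names for each terminus with the protonation state you intend, then confirm the result with Topology.from_pdb().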

Missing residues and heavy atoms

REMARK 465 and related header records are not instructions to auto-fill a PDB. They are warnings that the deposited model is incomplete. Decide whether to model missing residues or heavy atoms with external tools such as PDBFixer, MODELLER, SWISS-MODEL, AlphaFold-derived models, ChimeraX, or PyMOL workflows, then review the result before passing it to PolyzyMD.
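Listing the residues reported in REMARK 465 makes it easier to see whether a failure sits next to a gap. The sketch below is a heuristic reader, not a full PDB header parser: `missing_residues` is a hypothetical helper that treats any REMARK 465 line ending in "residue name, one-character chain ID, sequence number" as a data row and skips everything else as explanatory text.

```python
def missing_residues(pdb_text: str) -> list:
    """Return (resname, chain, number, insertion_code) tuples from REMARK 465 rows."""
    found = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 465"):
            continue
        tokens = line[10:].split()
        if len(tokens) < 3:
            continue
        resname, chain, seq = tokens[-3:]
        # A data row ends with a sequence number, optionally followed by an
        # insertion-code letter; header text fails the integer conversion.
        digits = seq.rstrip("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        try:
            number = int(digits)
        except ValueError:
            continue
        if len(chain) == 1 and len(resname) <= 3:
            found.append((resname, chain, number, seq[len(digits):]))
    return found
```

The output only tells you where the deposited model is incomplete. Deciding whether and how to model those residues remains a scientific choice made with the external tools named above.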

Charge mismatch

Treat charge mismatch as a blocker. It means OpenFF’s inferred molecule and the matched substructure disagree. The fix is to make the residue graph, atom names, bonds, hydrogens, and protonation state consistent, not to suppress the error.

4CHA case study

The 4CHA alpha-chymotrypsin preparation can pass structural checks after selecting one enzyme copy, relabeling protein atoms to chain A, preserving TER records, removing waters, and adding hydrogens. Direct OpenFF validation may still fail around the N-terminal cystine and nearby residues. That failure shows that PDB cleanup and chemistry assignment are separate problems: a structurally clean file can still describe chemistry that OpenFF cannot match.

See examples/pdb_preparation/4cha/ for proof-of-concept scripts, including an NCYX custom substructure example. The _custom_substructures argument is a private OpenFF API, so this example is evidence for a targeted workaround or upstream contribution, not a production-ready universal preparation method.

Keep the error catalog current

When you diagnose a new OpenFF PDB ingestion error, update this page and the OpenFF PDB ingestion reference before marking the task complete. Add:

exact error text or shortest unique traceback excerpt
likely cause
diagnostic snippet or command
acceptable fix
caveats, especially private APIs or structure-specific assumptions

If the user explicitly defers the update, record that deferral in your final response.
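One way to keep new entries complete is to collect the five items listed above in a small record before writing them into the catalog. This is only a sketch: `CatalogEntry` and its field names are hypothetical, and the tab-separated output simply mirrors the column order of the Common signatures table on this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CatalogEntry:
    """One error-catalog row; class and field names are illustrative."""
    error_signature: str
    likely_cause: str
    diagnostic: str
    acceptable_fix: str
    caveats: str

    def to_row(self) -> str:
        """Return a tab-separated row, refusing entries with empty fields."""
        values = [getattr(self, f.name).strip() for f in fields(self)]
        missing = [f.name for f, v in zip(fields(self), values) if not v]
        if missing:
            raise ValueError(f"incomplete catalog entry, missing: {', '.join(missing)}")
        return "\t".join(values)
```

Rejecting empty fields is a design choice that forces the caveats column to be filled in, which matters most for entries that rely on private APIs or structure-specific assumptions.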