Troubleshoot OpenFF PDB ingestion

Use this guide when a PolyzyMD build fails while OpenFF loads an enzyme PDB, or when direct openff.toolkit.Topology.from_pdb() validation fails.

Important

Do not bypass a charge mismatch or monkeypatch OpenFF to continue. OpenFF is reporting that the PDB chemistry it inferred does not match a supported residue graph. Fix or curate the structure first.

Quick triage

  1. Reproduce the failure outside PolyzyMD.

  2. Run simple structural checks on the PDB.

  3. Read the OpenFF error dump for the residue names, atom names, and charges it expected versus found.

  4. Fix the PDB upstream, or use a documented narrow proof of concept only when the user accepts its caveats: it relies on a private OpenFF API and is not a universal fix.

  5. Add newly diagnosed error signatures to the durable error catalog.

Check the structure first

Use a small script or text inspection to answer these questions. Some checks are OpenFF chemistry checks; chain-ID checks are PolyzyMD input conventions that keep protein, substrate, polymer, and solvent roles unambiguous later in the build.

  • Do protein atom records use chain A, the chain PolyzyMD expects for the enzyme?

  • Are substrate, polymer, and solvent chains kept out of the enzyme PDB or placed on the expected chains B, C, and D+ when relevant?

  • Are TER records present between disconnected protein fragments?

  • Are crystallographic waters and unrelated heterogens removed from the enzyme PDB unless intentionally retained elsewhere?

  • Are hydrogens explicit?

  • Do PDB header records report missing residues or missing heavy atoms?

  • Do disulfide cysteines have SG-SG connectivity and no SG-bound HG proton?

Example quick check:

import argparse
from pathlib import Path


def summarize_pdb(path: str) -> None:
    """Print quick structural facts about a PDB file."""
    lines = Path(path).read_text().splitlines()
    # Skip truncated records so the fixed-column slicing below cannot fail.
    atoms = [
        line for line in lines
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 22
    ]
    chains = sorted({line[21] for line in atoms})
    residues = sorted({line[17:20].strip() for line in atoms})
    # Element symbols occupy columns 77-78; some files leave them blank.
    elements = {line[76:78].strip().upper() for line in atoms if len(line) >= 78}
    elements.discard("")
    hydrogens = "H" in elements if elements else "unknown (no element column)"
    print(f"chains: {chains}")
    print(f"TER records: {sum(line.startswith('TER') for line in lines)}")
    print(f"hydrogens present: {hydrogens}")
    print(f"residue names include CYX: {'CYX' in residues}")
    print(f"SSBOND records: {sum(line.startswith('SSBOND') for line in lines)}")
    print(f"CONECT records: {sum(line.startswith('CONECT') for line in lines)}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize a PDB before OpenFF validation")
    parser.add_argument("pdb", help="Prepared enzyme PDB to inspect")
    args = parser.parse_args()
    summarize_pdb(args.pdb)


if __name__ == "__main__":
    main()

Validate directly with OpenFF

Run direct validation before editing PolyzyMD configuration:

import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Validate PDB ingestion with OpenFF")
    parser.add_argument("pdb", help="Prepared enzyme PDB to validate")
    args = parser.parse_args()

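    # Import lazily so --help works even when OpenFF is not installed.
    from openff.toolkit import Topology

    # from_pdb raises an exception when ingestion fails; let it propagate
    # so the full OpenFF error dump is visible.
    Topology.from_pdb(args.pdb)
    print("OpenFF PDB ingestion succeeded")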


if __name__ == "__main__":
    main()

With pixi:

pixi run -e build python validate_openff.py prepared_enzyme.pdb

If direct validation fails, the problem is in the PDB chemistry OpenFF sees. A passing polyzymd validate or polyzymd build --dry-run does not prove OpenFF can ingest the enzyme PDB.

Interpret OpenFF error dumps

OpenFF PDB errors often include a residue-level description of atoms, bonds, and charges. Focus on:

  • the first residue name and number mentioned in the mismatch

  • unexpected hydrogens on terminal atoms or cysteine SG atoms

  • residues OpenFF matched as a different protonation or connectivity state

  • formal-charge totals that differ between the PDB graph and the template

  • nearby TER records, SSBOND records, and CONECT records

Do not delete atoms just to make the message disappear. Make a chemically consistent model and validate it again.
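
To find which residues an error dump mentions, a small helper can pull residue tokens out of the saved text. This is a minimal sketch that assumes residues appear as NAME#NUMBER tokens such as CYS#0001, the form quoted in the signatures on this page. residues_in_error is a hypothetical helper, not part of OpenFF or PolyzyMD, and the pattern may need adjusting for your dump.

```python
import re
from collections import Counter

# Assumption: residues appear as NAME#NUMBER tokens (for example CYS#0001).
RESIDUE_TOKEN = re.compile(r"\b([A-Z0-9]{1,4})#(\d+)\b")


def residues_in_error(error_text: str) -> list:
    """Return (residue name, residue number, mention count) in first-seen order."""
    counts = Counter(
        (name, int(number)) for name, number in RESIDUE_TOKEN.findall(error_text)
    )
    return [(name, number, n) for (name, number), n in counts.items()]


if __name__ == "__main__":
    # Sample text is the signature wording from this page, not real OpenFF output.
    sample = "Failure around CYS#0001, terminal H, or N-terminal cysteine"
    for name, number, n in residues_in_error(sample):
        print(f"{name} {number}: mentioned {n} time(s)")
```

The first residue in the returned list is usually the best place to start inspecting hydrogens, bonds, and nearby TER, SSBOND, and CONECT records.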

Common signatures

Each entry lists the error signature, its likely cause, a diagnostic, an acceptable fix, and caveats.

Error signature: Molecule has more/fewer total formal charges than the matched substructure

  • Likely cause: Residue graph, hydrogens, termini, or disulfide state differs from OpenFF’s matched template

  • Diagnostic: Inspect the named residue’s hydrogens, bonds, and neighboring TER/SSBOND/CONECT records

  • Acceptable fix: Correct protonation/connectivity or use a reviewed custom substructure proof of concept

  • Caveats: Never ignore the mismatch

Error signature: Failure around CYS#0001, terminal H, or N-terminal cysteine

  • Likely cause: N-terminal cysteine/cystine has terminal hydrogens and disulfide state OpenFF does not match cleanly

  • Diagnostic: Check N-terminal atom names, SG-HG absence, and SG-SG bond

  • Acceptable fix: Curate the terminal cystine or use a narrow NCYX custom substructure proof of concept

  • Caveats: Private OpenFF API; not universal

Error signature: Residue names include CYX, but OpenFF still fails

  • Likely cause: CYX may be treated as a cysteine alias, not a complete public template solution

  • Diagnostic: Compare residue atoms and SG-SG connectivity

  • Acceptable fix: Add/verify disulfide connectivity and hydrogens; consider upstream issue/PR

  • Caveats: Do not assume renaming to CYX is sufficient

Error signature: Failure near residues reported in REMARK 465

  • Likely cause: Missing-coordinate residues or missing heavy atoms affect chemistry or termini

  • Diagnostic: Read PDB header and visualize gaps

  • Acceptable fix: Model missing regions with an external tool when scientifically appropriate

  • Caveats: PolyzyMD should receive a curated result

Disulfides

For each disulfide:

  1. Confirm the paired cysteine SG atoms are close and intentionally bonded.

  2. Remove inappropriate SG-bound HG protons from disulfide cysteines.

  3. Preserve or add reliable connectivity records. SSBOND is useful metadata; CONECT can make the actual SG-SG bond explicit for parser paths that use it.

  4. Validate again with Topology.from_pdb().
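
Steps 1 and 2 above can be sketched with the standard library alone. This is an illustrative check, not a PolyzyMD or OpenFF utility: disulfide_report is a hypothetical helper, the 2.5 Å cutoff is an assumed threshold (a typical S-S bond is about 2.05 Å), and insertion codes are ignored.

```python
import math
from itertools import combinations

CYS_NAMES = {"CYS", "CYX"}  # assumption: disulfide cysteines use one of these names
MAX_SS_BOND = 2.5  # angstrom; assumed cutoff for a bonded SG-SG pair


def cysteine_atoms(pdb_text: str) -> dict:
    """Map (chain, residue number) -> {atom name: (x, y, z)} for cysteines."""
    residues = {}
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        if line[17:20].strip() not in CYS_NAMES:
            continue
        key = (line[21], int(line[22:26]))
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues.setdefault(key, {})[line[12:16].strip()] = xyz
    return residues


def disulfide_report(pdb_text: str) -> list:
    """Return (residue_a, residue_b, distance, residues_with_HG) per close SG pair."""
    cys = {k: v for k, v in cysteine_atoms(pdb_text).items() if "SG" in v}
    report = []
    for a, b in combinations(sorted(cys), 2):
        dist = math.dist(cys[a]["SG"], cys[b]["SG"])
        if dist <= MAX_SS_BOND:
            # An HG proton on either partner contradicts the disulfide bond.
            with_hg = [res for res in (a, b) if "HG" in cys[res]]
            report.append((a, b, round(dist, 2), with_hg))
    return report
```

Call disulfide_report(Path("prepared_enzyme.pdb").read_text()) and review any pair whose last element is not empty. A close SG pair is only a candidate; confirm that the bond is intended before editing the structure.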

Termini and hydrogen naming

Terminal residues combine residue chemistry with chain-fragment state. A mature protein can have multiple TER-separated fragments, each with termini. Verify that terminal hydrogens and atom names match the intended protonation state. This is especially important for N-terminal cystines.
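
To see which residues act as termini, the sketch below lists the first and last residue of each TER-separated fragment together with their atom names. fragment_termini is a hypothetical helper written for illustration; it reads only ATOM and TER records and ignores insertion codes.

```python
def fragment_termini(pdb_text: str) -> list:
    """Return (first_residue, last_residue) for each TER-separated fragment.

    Each residue is (residue name, chain ID, residue number, sorted atom names).
    """
    fragments, current = [], []
    for line in pdb_text.splitlines():
        if line.startswith("TER"):
            # A TER record closes the current fragment.
            if current:
                fragments.append(current)
            current = []
        elif line.startswith("ATOM") and len(line) >= 26:
            key = (line[17:20].strip(), line[21], int(line[22:26]))
            if not current or current[-1][0] != key:
                current.append((key, []))
            current[-1][1].append(line[12:16].strip())
    if current:
        fragments.append(current)

    def describe(residue):
        (name, chain, number), atoms = residue
        return (name, chain, number, sorted(atoms))

    return [(describe(frag[0]), describe(frag[-1])) for frag in fragments]
```

Compare the hydrogen names on each first residue with the N-terminal protonation state you intend, and check each last residue for the expected C-terminal atoms.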

Missing residues and heavy atoms

REMARK 465 and related header records are not instructions to auto-fill a PDB. They are warnings that the deposited model is incomplete. Decide whether to model missing residues or heavy atoms with external tools such as PDBFixer, MODELLER, SWISS-MODEL, AlphaFold-derived models, ChimeraX, or PyMOL workflows, then review the result before passing it to PolyzyMD.
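
Before choosing a modeling tool, it helps to list exactly which residues the header reports as missing. The sketch below parses REMARK 465 data lines with a regular expression; missing_residues is a hypothetical helper, and the pattern assumes the usual layout of an optional model number, residue name, chain ID, and residue number. Missing atoms are reported separately in REMARK 470, which this sketch does not parse.

```python
import re

# Matches REMARK 465 data lines only; the explanatory header lines do not fit
# the "name, one-character chain, integer" shape and are skipped.
MISSING_RESIDUE = re.compile(
    r"^REMARK 465\s+(?:\d+\s+)?([A-Z0-9]{1,3})\s+(\S)\s+(-?\d+)([A-Z]?)\s*$"
)


def missing_residues(pdb_text: str) -> list:
    """Return (residue name, chain ID, residue number, insertion code) tuples."""
    found = []
    for line in pdb_text.splitlines():
        match = MISSING_RESIDUE.match(line)
        if match:
            name, chain, number, icode = match.groups()
            found.append((name, chain, int(number), icode))
    return found
```

An empty list means the file has no REMARK 465 entries, which can also happen when a preparation tool has stripped the header; check the original deposited file in that case.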

Charge mismatch

Treat charge mismatch as a blocker. It means OpenFF’s inferred molecule and the matched substructure disagree. The fix is to make the residue graph, atom names, bonds, hydrogens, and protonation state consistent, not to suppress the error.

4CHA case study

The 4CHA alpha-chymotrypsin preparation can pass structural checks after selecting one enzyme copy, relabeling protein atoms to chain A, preserving TER records, removing waters, and adding hydrogens. Direct OpenFF validation may still fail around the N-terminal cystine and nearby residues. That failure shows that PDB cleanup and chemistry assignment are separate problems: a structurally clean file can still describe chemistry that OpenFF cannot match.

See examples/pdb_preparation/4cha/ for proof-of-concept scripts, including an NCYX custom substructure example. The _custom_substructures argument is a private OpenFF API, so this example is evidence for a targeted workaround or upstream contribution, not a production-ready universal preparation method.

Keep the error catalog current

When you diagnose a new OpenFF PDB ingestion error, update this page and the OpenFF PDB ingestion reference before marking the task complete. Add:

  • exact error text or shortest unique traceback excerpt

  • likely cause

  • diagnostic snippet or command

  • acceptable fix

  • caveats, especially private APIs or structure-specific assumptions

If the user explicitly defers the update, record that deferral in your final response.
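
The entry fields listed above can be captured in a small record so that an incomplete entry is easy to spot before it is added. This is only a sketch: CatalogEntry and its field names are illustrative and do not describe an existing PolyzyMD schema.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One error-catalog entry; field names are illustrative only."""

    signature: str
    likely_cause: str
    diagnostic: str
    acceptable_fix: str
    caveats: str

    def missing_fields(self) -> list:
        """Return the names of fields that are empty or whitespace only."""
        return [name for name, value in vars(self).items() if not value.strip()]
```

An entry is ready to add when missing_fields() returns an empty list.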