# Troubleshoot OpenFF PDB ingestion

Use this guide when a PolyzyMD build fails while OpenFF loads an enzyme PDB, or when direct `openff.toolkit.Topology.from_pdb()` validation fails.

```{important}
Do not bypass a charge mismatch or monkeypatch OpenFF to continue. OpenFF is reporting that the PDB chemistry it inferred does not match a supported residue graph. Fix or curate the structure first.
```

## Quick triage

1. Reproduce the failure outside PolyzyMD.
2. Run simple structural checks on the PDB.
3. Read the OpenFF error dump for the residue names, atom names, and charges it expected versus found.
4. Fix the PDB upstream, or use a documented narrow proof of concept only when the user accepts the caveat.
5. Add newly diagnosed error signatures to the durable error catalog.

## Check the structure first

Use a small script or text inspection to answer these questions. Some checks are OpenFF chemistry checks; chain-ID checks are PolyzyMD input conventions that keep protein, substrate, polymer, and solvent roles unambiguous later in the build.

- Do protein atom records use PolyzyMD chain `A`?
- Are substrate, polymer, and solvent chains kept out of the enzyme PDB or placed on the expected chains `B`, `C`, and `D+` when relevant?
- Are TER records present between disconnected protein fragments?
- Are crystallographic waters and unrelated heterogens removed from the enzyme PDB unless intentionally retained elsewhere?
- Are hydrogens explicit?
- Do PDB header records report missing residues or missing heavy atoms?
- Do disulfide cysteines have SG-SG connectivity and no SG-bound HG proton?

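Two of the questions above, header-reported gaps and stray `HG` protons on disulfide cysteines, need coordinates or header lines rather than simple record counts. The following is a minimal sketch under stated assumptions: the function names `parse_atoms` and `disulfide_report` are hypothetical, the 2.5 Å SG-SG cutoff is an assumed heuristic (not a PolyzyMD or OpenFF setting), and the parser relies on standard fixed-column PDB records.

```python
import math
from itertools import combinations

# Assumed heuristic: an S-S bond is roughly 2.0-2.1 Å long, so 2.5 Å leaves slack.
SG_BOND_CUTOFF = 2.5
CYS_NAMES = {"CYS", "CYX"}


def parse_atoms(pdb_text: str) -> list:
    """Parse fixed-column ATOM/HETATM records into small dictionaries."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 54:
            atoms.append({
                "name": line[12:16].strip(),
                "resname": line[17:20].strip(),
                "residue": (line[21], line[22:26].strip()),  # (chain, number)
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


def disulfide_report(pdb_text: str) -> dict:
    """Summarize close SG-SG pairs, HG protons left on them, and REMARK 465 lines."""
    cys_atoms = [a for a in parse_atoms(pdb_text) if a["resname"] in CYS_NAMES]
    sg_atoms = [a for a in cys_atoms if a["name"] == "SG"]
    # Treat any two SG atoms within the cutoff as an intended disulfide candidate.
    pairs = [
        (a["residue"], b["residue"])
        for a, b in combinations(sg_atoms, 2)
        if math.dist(a["xyz"], b["xyz"]) <= SG_BOND_CUTOFF
    ]
    bonded = {residue for pair in pairs for residue in pair}
    # An HG proton on a residue whose SG is in a close pair is suspicious.
    stray_hg = sorted(
        a["residue"] for a in cys_atoms
        if a["name"] == "HG" and a["residue"] in bonded
    )
    remark_465 = sum(line.startswith("REMARK 465") for line in pdb_text.splitlines())
    return {"sg_pairs": pairs, "hg_on_bonded_sg": stray_hg, "remark_465_lines": remark_465}
```

Call it on the file contents, for example `disulfide_report(Path("prepared_enzyme.pdb").read_text())`. Any residue listed under `hg_on_bonded_sg` is a candidate for curation before rerunning `Topology.from_pdb()`; the distance test only suggests a bond and does not replace visual inspection.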

Example quick check:

```python
import argparse
from pathlib import Path


def summarize_pdb(path: str) -> None:
    lines = Path(path).read_text().splitlines()
    atoms = [line for line in lines if line.startswith(("ATOM", "HETATM"))]
    chains = sorted({line[21] for line in atoms})
    residues = sorted({line[17:20].strip() for line in atoms})
    elements = {line[76:78].strip() for line in atoms if len(line) >= 78}

    print(f"chains: {chains}")
    print(f"TER records: {sum(line.startswith('TER') for line in lines)}")
    print(f"hydrogens present: {'H' in elements}")
    print(f"residue names include CYX: {'CYX' in residues}")
    print(f"SSBOND records: {sum(line.startswith('SSBOND') for line in lines)}")
    print(f"CONECT records: {sum(line.startswith('CONECT') for line in lines)}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize a PDB before OpenFF validation")
    parser.add_argument("pdb", help="Prepared enzyme PDB to inspect")
    args = parser.parse_args()
    summarize_pdb(args.pdb)


if __name__ == "__main__":
    main()
```

## Validate directly with OpenFF

Run direct validation before editing PolyzyMD configuration:

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Validate PDB ingestion with OpenFF")
    parser.add_argument("pdb", help="Prepared enzyme PDB to validate")
    args = parser.parse_args()

    from openff.toolkit import Topology

    Topology.from_pdb(args.pdb)
    print("OpenFF PDB ingestion succeeded")


if __name__ == "__main__":
    main()
```

With pixi:

```bash
pixi run -e build python validate_openff.py prepared_enzyme.pdb
```

If direct validation fails, the problem is in the PDB chemistry OpenFF sees. A passing `polyzymd validate` or `polyzymd build --dry-run` does not prove OpenFF can ingest the enzyme PDB.

## Interpret OpenFF error dumps

OpenFF PDB errors often include a residue-level description of atoms, bonds, and charges.
Focus on:

- the first residue name and number mentioned in the mismatch
- unexpected hydrogens on terminal atoms or cysteine SG atoms
- residues OpenFF matched as a different protonation or connectivity state
- formal-charge totals that differ between the PDB graph and the template
- nearby TER records, SSBOND records, and CONECT records

Do not delete atoms just to make the message disappear. Make a chemically consistent model and validate it again.

## Common signatures

| Error signature | Likely cause | Diagnostic | Acceptable fix | Caveats |
|---|---|---|---|---|
| `Molecule has more/fewer total formal charges than the matched substructure` | Residue graph, hydrogens, termini, or disulfide state differs from OpenFF's matched template | Inspect the named residue's hydrogens, bonds, and neighboring TER/SSBOND/CONECT records | Correct protonation/connectivity or use a reviewed custom substructure proof of concept | Never ignore the mismatch |
| Failure around `CYS#0001`, terminal `H`, or N-terminal cysteine | N-terminal cysteine/cystine has terminal hydrogens and disulfide state OpenFF does not match cleanly | Check N-terminal atom names, SG-HG absence, and SG-SG bond | Curate the terminal cystine or use a narrow `NCYX` custom substructure proof of concept | Private OpenFF API; not universal |
| Residue names include `CYX`, but OpenFF still fails | `CYX` may be treated as a cysteine alias, not a complete public template solution | Compare residue atoms and SG-SG connectivity | Add/verify disulfide connectivity and hydrogens; consider upstream issue/PR | Do not assume renaming to CYX is sufficient |
| Failure near residues reported in `REMARK 465` | Missing-coordinate residues or missing heavy atoms affect chemistry or termini | Read PDB header and visualize gaps | Model missing regions with an external tool when scientifically appropriate | PolyzyMD should receive a curated result |

## Disulfides

For each disulfide:

1. Confirm the paired cysteine SG atoms are close and intentionally bonded.
2. Remove inappropriate SG-bound `HG` protons from disulfide cysteines.
3. Preserve or add reliable connectivity records. `SSBOND` is useful metadata; `CONECT` can make the actual SG-SG bond explicit for parser paths that use it.
4. Validate again with `Topology.from_pdb()`.

## Termini and hydrogen naming

Terminal residues combine residue chemistry with chain-fragment state. A mature protein can have multiple TER-separated fragments, each with termini. Verify that terminal hydrogens and atom names match the intended protonation state. This is especially important for N-terminal cystines.

## Missing residues and heavy atoms

`REMARK 465` and related header records are not instructions to auto-fill a PDB. They are warnings that the deposited model is incomplete. Decide whether to model missing residues or heavy atoms with external tools such as PDBFixer, MODELLER, SWISS-MODEL, AlphaFold-derived models, ChimeraX, or PyMOL workflows, then review the result before passing it to PolyzyMD.

## Charge mismatch

Treat charge mismatch as a blocker. It means OpenFF's inferred molecule and the matched substructure disagree. The fix is to make the residue graph, atom names, bonds, hydrogens, and protonation state consistent, not to suppress the error.

## 4CHA case study

The 4CHA alpha-chymotrypsin preparation can pass structural checks after selecting one enzyme copy, relabeling protein atoms to chain `A`, preserving TER records, removing waters, and adding hydrogens. Direct OpenFF validation may still fail around the N-terminal cystine and nearby residues. That failure separates PDB cleanup from chemistry assignment.

See `examples/pdb_preparation/4cha/` for proof-of-concept scripts, including an `NCYX` custom substructure example.
The `_custom_substructures` argument is a private OpenFF API, so this example is evidence for a targeted workaround or upstream contribution, not a production-ready universal preparation method.

## Keep the error catalog current

When you diagnose a new OpenFF PDB ingestion error, update this page and {doc}`../reference/openff_pdb_ingestion` before marking the task complete. Add:

- exact error text or shortest unique traceback excerpt
- likely cause
- diagnostic snippet or command
- acceptable fix
- caveats, especially private APIs or structure-specific assumptions

If the user explicitly defers the update, record that deferral in your final response.
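
The five catalog fields listed in this section map naturally onto a small record type. The sketch below is illustrative only: `CatalogEntry`, `CATALOG`, and `match_entries` are hypothetical names that do not exist in PolyzyMD or OpenFF, and the signature strings are excerpts copied from the table in this guide rather than verified OpenFF output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row; the field names are hypothetical."""
    signature: str       # exact error text or shortest unique traceback excerpt
    likely_cause: str
    diagnostic: str      # snippet or command
    acceptable_fix: str
    caveats: str         # private APIs or structure-specific assumptions


# Example rows condensed from the "Common signatures" table in this guide.
CATALOG = [
    CatalogEntry(
        signature="total formal charges than the matched substructure",
        likely_cause="Residue graph, hydrogens, termini, or disulfide state differs from the template",
        diagnostic="Inspect the named residue and neighboring TER/SSBOND/CONECT records",
        acceptable_fix="Correct protonation/connectivity",
        caveats="Never ignore the mismatch",
    ),
    CatalogEntry(
        signature="CYS#0001",
        likely_cause="N-terminal cysteine/cystine state OpenFF does not match cleanly",
        diagnostic="Check N-terminal atom names, SG-HG absence, and SG-SG bond",
        acceptable_fix="Curate the terminal cystine",
        caveats="Custom substructure workaround uses a private OpenFF API",
    ),
]


def match_entries(error_text: str, catalog=CATALOG) -> list:
    """Return entries whose signature occurs (case-insensitively) in the error text."""
    lowered = error_text.lower()
    return [entry for entry in catalog if entry.signature.lower() in lowered]
```

An empty result from `match_entries` on a new traceback is a reasonable prompt to write a new entry with all five fields filled in.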