OpenFF PDB ingestion reference

This reference separates OpenFF chemistry requirements from PolyzyMD enzyme-input expectations and lists known error signatures for protein PDB ingestion through openff.toolkit.Topology.from_pdb().

OpenFF chemistry requirements

OpenFF does not require PolyzyMD’s chain IDs. It requires a PDB whose inferred chemical graph can be matched to supported residue chemistry.

| Requirement | Expected state | Notes |
| --- | --- | --- |
| Hydrogens | Explicit | OpenFF protein PDB ingestion expects chemically complete hydrogens |
| TER records | Present where fragments are disconnected | Mature cleaved proteins may have multiple fragments |
| Missing residues | Curated intentionally | Header records such as REMARK 465 require scientific review |
| Disulfides | SG-SG connectivity clear; no SG-HG proton | Verify SSBOND and, when needed, CONECT records |
| Direct validation | Topology.from_pdb() succeeds | Run this before relying on PolyzyMD build steps |
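The direct-validation requirement above can be sketched as a small wrapper that reports the failure text instead of hiding it. This is a minimal sketch, not PolyzyMD's implementation: check_ingestion and its loader parameter are hypothetical names introduced here, and the default loader assumes the OpenFF Toolkit is installed. The import is done lazily so the wrapper can also be exercised with a stand-in loader.

```python
from typing import Callable, Optional, Tuple


def _openff_loader(pdb_path: str):
    # Imported lazily so this helper can be tested without OpenFF installed.
    from openff.toolkit import Topology

    return Topology.from_pdb(pdb_path)


def check_ingestion(
    pdb_path: str, loader: Callable[[str], object] = _openff_loader
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) on success, or (False, error text) on failure."""
    try:
        loader(pdb_path)
    except Exception as exc:  # report the error text; never suppress it
        return False, f"{type(exc).__name__}: {exc}"
    return True, None
```

A call such as check_ingestion("enzyme.pdb") (the file name is a placeholder) returns the full error text on failure, which can be compared against the error catalog in this reference.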

PolyzyMD enzyme-input expectations

PolyzyMD uses chain IDs to assign biological roles during system building and analysis. These are project conventions, not OpenFF parser requirements.

| Role | PolyzyMD chain convention | Notes |
| --- | --- | --- |
| Protein/enzyme | A | The enzyme PDB passed to OpenFF is usually protein-only on chain A |
| Substrate | B | Usually kept separate from the enzyme PDB and configured as substrate input |
| Polymer | C | Used for conjugates and polymer-specific selections |
| Solvent/ions/other | D and later | Usually generated or handled outside the enzyme PDB |

An enzyme PDB can satisfy PolyzyMD chain conventions and still fail OpenFF ingestion if the residue graph, hydrogens, termini, or disulfide connectivity do not match supported chemistry.
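A quick census of chain IDs can confirm the chain convention before any chemistry is checked. The sketch below is illustrative only: chain_roles and ROLE_BY_CHAIN are hypothetical names, the role labels are copied from the convention described above, and the chain ID is read from column 22 of the fixed-width PDB format.

```python
from collections import Counter
from typing import Dict, Iterable

# Role labels follow the PolyzyMD chain convention in this reference.
ROLE_BY_CHAIN = {"A": "protein/enzyme", "B": "substrate", "C": "polymer"}


def _role(chain: str) -> str:
    if chain.strip() == "":
        return "unassigned (blank chain ID)"
    return ROLE_BY_CHAIN.get(chain, "solvent/ions/other")


def chain_roles(pdb_lines: Iterable[str]) -> Dict[str, dict]:
    """Count ATOM/HETATM records per chain ID and label each chain's role."""
    counts: Counter = Counter()
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            counts[line[21]] += 1  # chain ID is column 22 (index 21)
    return {c: {"atoms": n, "role": _role(c)} for c, n in counts.items()}
```

An enzyme PDB that reports only chain A here satisfies the convention, but that result says nothing about whether OpenFF ingestion will succeed.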

OpenFF disulfide behavior

  • CYX may be accepted as a cysteine-like residue alias during parsing, but it is not a stable public OpenFF residue template for every disulfide case.

  • Disulfide cysteine SG atoms should be bonded to each other and should not have an attached HG proton.

  • SSBOND records identify intended disulfides. CONECT records can make the SG-SG bond explicit for parser paths that depend on connectivity.

  • N-terminal cystines combine terminal hydrogens with disulfide chemistry and can expose template/charge mismatches.
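The SG-HG rule in the list above can be pre-checked from the PDB text alone. The following is a minimal sketch under stated assumptions: the function names are hypothetical, records are read by fixed PDB columns, insertion codes are ignored, and the thiol proton is assumed to be named HG. It does not verify that an SG-SG bond is actually present in CONECT records.

```python
from typing import Iterable, List, Set, Tuple

ResidueKey = Tuple[str, int]  # (chain ID, residue sequence number)


def ssbond_residues(pdb_lines: Iterable[str]) -> Set[ResidueKey]:
    """Collect both partners of every SSBOND record (fixed PDB columns)."""
    found: Set[ResidueKey] = set()
    for line in pdb_lines:
        if line.startswith("SSBOND") and len(line) >= 35:
            found.add((line[15], int(line[17:21])))  # chain col 16, seq cols 18-21
            found.add((line[29], int(line[31:35])))  # chain col 30, seq cols 32-35
    return found


def disulfide_hg_conflicts(pdb_lines: Iterable[str]) -> List[ResidueKey]:
    """Return SSBOND cysteines that still carry an HG atom."""
    lines = list(pdb_lines)
    bonded = ssbond_residues(lines)
    conflicts: Set[ResidueKey] = set()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
            key = (line[21], int(line[22:26]))
            if key in bonded and line[12:16].strip() == "HG":
                conflicts.add(key)
    return sorted(conflicts)
```

An empty result means no listed disulfide cysteine carries HG; it is a screening step, not a substitute for direct OpenFF validation.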

Custom substructures JSON

PolyzyMD’s enzyme.custom_substructures_path loads JSON and passes it to Topology.from_pdb(..., _custom_substructures=...).

Warning: _custom_substructures is a private/experimental OpenFF API. Treat examples as proofs of concept or upstream-PR candidates, not as stable public OpenFF support.

Shape:

{
  "RESNAME": {
    "[SMARTS:1]": ["ATOM1"]
  }
}

Each residue name maps to SMARTS patterns, and each SMARTS pattern maps to the corresponding PDB atom names for that residue.
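Because the JSON is handed to a private API, it can help to check its shape before OpenFF sees it. The sketch below validates only the nesting described above (residue name, then SMARTS pattern, then a list of atom names); substructure_shape_errors is a hypothetical helper, and it does not check that the SMARTS is valid or that it matches the PDB atom graph.

```python
import json
from typing import List


def substructure_shape_errors(text: str) -> List[str]:
    """Return shape problems found in a custom-substructures JSON document."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return ["top level must be an object keyed by residue name"]
    errors: List[str] = []
    for resname, patterns in data.items():
        if not isinstance(patterns, dict) or not patterns:
            errors.append(f"{resname}: expected a non-empty object of SMARTS patterns")
            continue
        for smarts, names in patterns.items():
            names_ok = (
                isinstance(names, list)
                and len(names) > 0
                and all(isinstance(n, str) for n in names)
            )
            if not names_ok:
                errors.append(f"{resname}: {smarts!r} must map to a non-empty list of atom names")
    return errors
```

An empty list means the file has the expected nesting; any chemistry mismatch will still surface only when Topology.from_pdb() runs.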

Charge diagnostics

Charge mismatch messages are blockers. They usually mean one of these is wrong:

  • protonation state or terminal hydrogen count

  • disulfide SG-HG or SG-SG bonding

  • residue atom naming

  • missing heavy atoms or missing residues

  • ambiguous TER, SSBOND, or CONECT records

  • a custom substructure that does not match the PDB atom graph

Acceptable fixes are chemically explicit: curate the PDB, correct hydrogens and connectivity, model missing atoms when scientifically justified, or document a narrow custom-substructure proof of concept. Do not suppress the error.
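Missing residues are one of the causes listed above, and they can be enumerated from the header before deciding whether to model them. This is a minimal sketch assuming the standard fixed-column REMARK 465 layout (residue name in columns 16-18, chain in column 20, sequence number in columns 22-26); missing_residues is a hypothetical helper, and header text rows are skipped because their sequence field is not numeric.

```python
from typing import Iterable, List, Tuple


def missing_residues(pdb_lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (chain, residue name, sequence number) for REMARK 465 rows."""
    rows: List[Tuple[str, str, int]] = []
    for line in pdb_lines:
        if not line.startswith("REMARK 465") or len(line) < 26:
            continue
        try:
            seq = int(line[21:26])  # columns 22-26
        except ValueError:
            continue  # explanatory header text, not a residue row
        rows.append((line[19], line[15:18].strip(), seq))
    return rows
```

The output is a list for scientific review, not an instruction to fill the gaps automatically.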

Running error catalog

| Exact signature | Likely cause | Diagnostic | Acceptable fix | Caveats |
| --- | --- | --- | --- | --- |
| Molecule has more/fewer total formal charges than the matched substructure | OpenFF matched a residue graph whose formal charge differs from the PDB graph | Inspect the named residue’s atom list, bonds, hydrogens, TER records, and disulfide records | Correct residue chemistry or use a reviewed custom substructure proof of concept | Do not ignore; private custom substructures are not stable API |
| Error dump names CYS#0001, terminal H, or N-terminal cysteine/cystine | N-terminal cysteine has terminal hydrogens plus disulfide chemistry that does not match OpenFF’s template | Check SG-HG absence, SG-SG bond, N-terminal hydrogens, and residue naming | Curate the cystine or test a structure-specific NCYX custom substructure | Seen in 4CHA proof of concept; not universal |
| Renaming disulfide cysteine to CYX does not resolve ingestion | CYX aliasing is not equivalent to a complete public template for all contexts | Validate direct OpenFF ingestion and inspect charge mismatch | Fix connectivity/hydrogens or prepare an upstream OpenFF issue/PR | Avoid relying on residue rename alone |
| Failure adjacent to residues listed in REMARK 465 | Missing-coordinate residues or missing heavy atoms alter termini or local chemistry | Read PDB header and visualize gaps | Model missing regions externally if required for the study | Automatic filling is a modeling decision |
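Captured error text can be triaged against the catalog with a few patterns. The sketch below is illustrative: CATALOG and classify are hypothetical names, the patterns paraphrase the first two catalog rows rather than quoting OpenFF source, and the wording of real messages may differ between OpenFF releases.

```python
import re
from typing import List

# Patterns paraphrase rows of the catalog in this reference (assumption:
# real OpenFF messages contain these fragments; verify against your version).
CATALOG = [
    ("formal-charge mismatch", re.compile(r"total formal charges", re.I)),
    ("N-terminal cysteine/cystine", re.compile(r"CYS#0*1\b|N-terminal cyst", re.I)),
]


def classify(error_text: str) -> List[str]:
    """Return the catalog labels whose pattern occurs in the error text."""
    return [label for label, pattern in CATALOG if pattern.search(error_text)]
```

A message that matches no pattern is a candidate for a new catalog row.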

Catalog maintenance rule

When a new OpenFF PDB ingestion error is diagnosed, update both this table and "Troubleshoot OpenFF PDB ingestion" before closing the task. Record the exact error text, likely cause, diagnostic command, acceptable fix, and caveats. The only exception is when the user explicitly defers the durable documentation update.