Polymer Affinity Score Analysis

Quantify the total polymer-protein interaction strength by combining per-contact free energies with contact counts across all residue groups.

This analysis answers the experimentalist’s question: “Which polymer composition has the strongest overall adhesion to the enzyme surface?”

Note

The polymer affinity score is a comparative scoring metric for ranking polymer compositions. It is derived entirely from cached binding preference data - no additional trajectory access is needed. Run contacts analysis with compute_binding_preference: true first.

Warning

Polymer affinity is currently labeled experimental in the presentation release. The CLI command and plots remain available, but PolyzyMD now adds explicit experimental warnings because the score definition and interpretation may still change.

Scientific Motivation

Beyond Per-Contact Free Energy

Binding free energy analysis reports the thermodynamic preference of each polymer type for each residue group (e.g., “EGMA prefers aromatics by -0.52 kT”). However, this per-contact value misses the polyvalent binding effect: many contacts happen simultaneously.

A polymer with a modest per-contact affinity but many simultaneous contacts may have a stronger total interaction than one with strong per-contact affinity but few contacts. The polymer affinity score captures this by summing:

\[\text{Score} = N \times \Delta G_{\mathrm{sel}}\]

where:

N = mean number of simultaneous contacts with a residue group
ΔG\(_{\mathrm{sel}}\) = per-contact selectivity free energy from Boltzmann inversion

The Formula

Per (polymer_type, protein_group):

\[N_g = \text{mean\_contact\_fraction} \times n_\text{exposed\_in\_group}\]

\[\Delta G_{\mathrm{sel},g} = -\ln\!\left(\frac{\text{contact\_share}}{\text{expected\_share}}\right) \quad [k_\mathrm{b}T]\]

\[S_g = N_g \times \Delta G_{\mathrm{sel},g}\]

Per polymer type:

\[S_\text{polymer} = \sum_g S_g\]

Per condition (headline score):

\[S_\text{total} = \sum_\text{polymer\_types} S_\text{polymer}\]

How N_contacts Is Computed

N_contacts is the mean number of exposed residues in a protein group that are simultaneously in contact with a given polymer type, averaged across all trajectory frames after equilibration. It is not a total count of contact events over the simulation, and it is not a binary flag for whether any contact occurred.

From Raw Frames to N

The computation proceeds in four stages:

Stage 1 — Per residue, per frame (binary). For each protein residue and each trajectory frame, a contact is recorded if any heavy atom of a polymer residue of a given type is within the cutoff distance (default 4.5 A) of any heavy atom of the protein residue. Multiple polymer segments of the same type contacting the same protein residue in the same frame count as one contact (unique frame counting via set operations).

Stage 2 — Per residue, per polymer type (fraction). For each exposed protein residue, the number of frames with at least one contact from polymer type P is divided by the total number of analyzed frames. This gives a per-residue contact fraction in [0, 1].

Stage 3 — Per (polymer_type, protein_group) (mean fraction). The per-residue contact fractions are averaged across all surface-exposed residues in the protein group:

\[\text{mean\_contact\_fraction} = \frac{\sum_{r \in \text{group}} \text{frames\_contacted}_r}{n_\text{frames} \times n_\text{exposed\_in\_group}}\]

This is mathematically equivalent to the arithmetic mean of the per-residue fractions, computed as a single division for efficiency. The value answers: “on average, what fraction of frames is each exposed residue in this group contacted by this polymer type?”

Stage 4 — N_contacts. Multiplying by the group size recovers the mean number of simultaneously contacted residues per frame:

\[N = \text{mean\_contact\_fraction} \times n_\text{exposed\_in\_group} = \frac{\sum_{r \in \text{group}} \text{frames\_contacted}_r}{n_\text{frames}}\]

The algebra cancels n_exposed_in_group from numerator and denominator, revealing that N is simply the total contact-frames divided by the number of frames — i.e., the average number of contacted residues per frame.

Worked Example

Consider a polymer type SBM contacting the “aromatic” protein group (5 exposed residues) over 1000 analyzed frames:

Residue	Frames contacted	Per-residue fraction
Trp23	800	0.80
Phe45	200	0.20
Tyr67	150	0.15
Phe89	50	0.05
Trp112	0	0.00

total_contact_frames = 800 + 200 + 150 + 50 + 0 = 1200
mean_contact_fraction = 1200 / (1000 × 5) = 0.24
N = 0.24 × 5 = 1.2

Interpretation: on average, 1.2 aromatic residues are simultaneously in contact with SBM in any given frame. Some frames have 0 contacts, some have 3 or 4 — but the time-averaged expectation is 1.2 simultaneous contacts.

If this group’s ΔG\(_{\mathrm{sel}}\) is -0.22 kT, the group contribution to the affinity score is 1.2 × (-0.22) = -0.26 kT.

Why This Quantity (and Not Alternatives)

Alternative definition of N	Why it would be wrong
Total contact count over the simulation	Conflates simulation length with interaction strength. A 1 μs trajectory would score 10× higher than a 100 ns trajectory for identical binding.
Fraction of frames with any contact in the group	Loses polyvalency information. Cannot distinguish 1 residue contacted vs 15 residues contacted simultaneously — the entire point of this analysis.
Number of unique residues ever contacted	Loses temporal information. A residue contacted once in 10,000 frames contributes equally to one contacted every frame.
Peak (maximum) simultaneous contacts	Sensitive to outlier frames. Not representative of the sustained interaction.

Note

What N does not capture. Two polymers can have identical N = 10 but differ in binding mode: one contacts the same 10 residues every frame (stable binding), while the other contacts different residues each frame (transient surface scanning). For ranking total adhesion strength, this distinction does not matter — both have 10 simultaneous contacts contributing to the score. To distinguish binding modes, use residence time analysis.

Cross-Replicate Aggregation

When multiple replicates are available, mean_contact_fraction is arithmetically averaged across replicates with standard error of the mean (SEM). n_exposed_in_group is determined from the SASA calculation on the crystal structure PDB and is therefore identical across replicates — it is a property of the protein, not the trajectory. The resulting N inherits the cross-replicate mean and uncertainty from mean_contact_fraction alone.

Sign Convention

Score	Meaning
< 0	Net attractive polymer-protein interaction
= 0	Neutral (contacts match random expectation)
> 0	Net repulsive (polymer avoids protein surface)

More negative = stronger overall polymer-protein adhesion.

Important

Independence assumption disclaimer. The affinity score sums individual contact free energies assuming thermodynamic independence of contacts. Conformational coupling, surface exclusion effects, and reference state shifts violate this assumption. The absolute values are therefore not rigorous thermodynamic ΔG. However, the relative differences between polymer compositions are meaningful for ranking purposes, as systematic biases cancel when comparing compositions analyzed under identical conditions.

Interpretation for Experimentalists

“Among compositions with similar RMSF (structural stability), the one with the most negative affinity score has the strongest polymer-protein adhesion — good for thermal resistance, but check that it doesn’t block the active site.”

The affinity score should always be interpreted alongside:

RMSF — does strong adhesion stabilize or rigidify the enzyme?
Triad geometry — does polymer contact disrupt the catalytic triad?
Substrate distance — does the polymer occlude the active site?

Quick Start

Step 1: Ensure binding preference data exists

Polymer affinity analysis reads cached binding preference files. These are produced automatically by polyzymd compare contacts:

# comparison.yaml
analysis_settings:
  contacts:
    compute_binding_preference: true
    surface_exposure_threshold: 0.2
    enzyme_pdb_for_sasa: "structures/enzyme.pdb"
    include_default_aa_groups: true

# Generate binding preference cache (if not already done)
polyzymd compare contacts -f comparison.yaml

Step 2: Add polymer affinity settings (optional)

The polymer affinity score works with zero configuration — defaults are suitable for most use cases. To customize:

# comparison.yaml
analysis_settings:
  polymer_affinity:
    surface_exposure_threshold: 0.2

comparison_settings:
  polymer_affinity:
    fdr_alpha: 0.05

Step 3: Run polymer affinity analysis

polyzymd compare polymer-affinity -f comparison.yaml

Example Output

Polymer Affinity Score: LipA 363K Comparison
================================================================================
Formula : Score = N_contacts × ΔG_sel, where ΔG_sel = -ln(contact_share / expected_share)
Units   : kT (dimensionless, in units of k_bT)
Sign    : Negative = net attractive interaction
Note    : Assumes thermodynamic independence of contacts (see docs for caveats)

Per-Condition Summary (sign: more negative = stronger adhesion)
--------------------------------------------------------------------------------

  100% SBMA  (T = 363.0 K, n = 5)
  Polymer  AA Group            N_contacts  ΔG_sel/kT    Score/kT
  ----------------------------------------------------------------
  SBM      aromatic                12.3      -0.19       -2.34
  SBM      charged_negative         8.7      +0.08       +0.70
  SBM      charged_positive         6.1      -0.01       -0.06
  SBM      nonpolar                15.2      +0.04       +0.61
  SBM      polar                    9.4      -0.00       -0.00
  ----------------------------------------------------------------
  SBM      TOTAL                                         -1.09

  SBMA-EGPMA 5%  (T = 363.0 K, n = 5)
  Polymer  AA Group            N_contacts  ΔG_sel/kT    Score/kT
  ----------------------------------------------------------------
  SBM      aromatic                14.1      -0.22       -3.10
  SBM      nonpolar                17.3      +0.02       +0.35
  EGP      aromatic                 1.2      -0.45       -0.54
  ...
  ----------------------------------------------------------------
  ALL      TOTAL                                         -4.21

Pairwise Score Differences (Score_B - Score_A)
--------------------------------------------------------------------------------
  100% SBMA → SBMA-EGPMA 5%  :  ΔScore = -3.12 kT  (p = 0.003 **)

Interpreting the Numbers

From the example:

SBMA-EGPMA 5% total score: -4.21 kT — stronger overall adhesion than 100% SBMA (-1.09 kT). The 5% EGPMA comonomer increases total polymer-protein binding through aromatic contacts.
SBM aromatic N_contacts=14.1, ΔG\(_{\mathrm{sel}}\)=-0.22 kT — each of ~14 simultaneous aromatic contacts contributes -0.22 kT, giving a group score of -3.10 kT.
Pairwise ΔScore = -3.12 kT, p = 0.003 — the difference is statistically significant.

Tip

Scores are in kT units. At 363 K, 1 kT = 0.72 kcal/mol = 3.02 kJ/mol. A total score difference of |ΔScore| > 2 kT is generally considered meaningful.

Configuration Reference

`analysis_settings.polymer_affinity`

Field	Type	Default	Description
`surface_exposure_threshold`	float	`0.2`	Minimum relative SASA for exposed residues
`polymer_selection`	str	`"resname SBM EGM EGP"`	MDAnalysis polymer selection
`protein_selection`	str	`"protein"`	MDAnalysis protein selection
`cutoff`	float	`4.5`	Contact distance cutoff (Angstroms)
`compute_binding_preference`	bool	`true`	Must be true for affinity analysis

`comparison_settings.polymer_affinity`

Field	Type	Default	Description
`fdr_alpha`	float	`0.05`	Benjamini-Hochberg FDR threshold

CLI Reference

polyzymd compare polymer-affinity [OPTIONS]

Option	Description
`-f, --file PATH`	Path to `comparison.yaml` (default: `comparison.yaml`)
`--recompute`	Force recompute
`--format [table\|markdown\|json]`	Output format (default: `table`)
`-o, --output PATH`	Save output to file
`-q, --quiet`	Suppress INFO messages
`--debug`	Enable DEBUG logging
`--eq-time TEXT`	Override equilibration time

Usage Examples

# Default: console table
polyzymd compare polymer-affinity -f comparison.yaml

# Save markdown report
polyzymd compare polymer-affinity --format markdown -o affinity_report.md

# JSON for downstream processing
polyzymd compare polymer-affinity --format json -o affinity_result.json

Per-Replicate Computation

When per-replicate binding preference files (binding_preference_rep{N}.json) are available, the polymer affinity score computes exact per-replicate scores:

\[N_{\text{rep}} = \text{contact\_fraction}_{\text{rep}} \times n_\text{exposed}\]

\[\Delta G_{\mathrm{sel,rep}} = -\ln(\text{enrichment}_{\text{rep}} + 1)\]

\[S_{\text{rep}} = N_{\text{rep}} \times \Delta G_{\mathrm{sel,rep}}\]

Mean and SEM are computed across replicates. Pairwise t-tests use these per-replicate score distributions.

When per-replicate files are unavailable, the comparator falls back to analytical error propagation:

\[\sigma(S) = \sqrt{(N \cdot \sigma_{\Delta G_{\mathrm{sel}}})^2 + (\Delta G_{\mathrm{sel}} \cdot \sigma_N)^2}\]

Multi-Temperature Studies

Like binding free energy analysis, the polymer affinity score uses kT units that already account for temperature. However, the underlying contact distributions may differ between temperatures, so pairwise statistics are suppressed for cross-temperature pairs.

Per-condition scores are always shown — they are valid within each temperature.

Relation to Other Analyses

Analysis	What It Measures	Relation to Affinity Score
Contacts	Which residues are touched	Provides the raw contact data
Binding Preference	Enrichment per residue group	Provides enrichment → ΔG_sel
Binding Free Energy	Per-contact ΔG_sel	Component of the score (without N)
Polymer Affinity Score	Total interaction strength	N × ΔG_sel summed over all groups
RMSF	Structural flexibility	Complementary: stability metric
Triad	Active site geometry	Complementary: function metric

Output File Location

Results are saved to the comparison’s results/ directory:

comparison_output/
└── results/
    └── polymer_affinity_score_YYYYMMDD_HHMMSS.json

Reload for downstream processing:

from polyzymd.compare.results.polymer_affinity import PolymerAffinityScoreResult

result = PolymerAffinityScoreResult.load("polymer_affinity_score_20260228_120000.json")

# Access per-condition scores
for summary in result.conditions:
    print(f"\n{summary.label}: total_score = {summary.total_score:+.2f} kT")
    for entry in summary.entries:
        print(
            f"  {entry.polymer_type} × {entry.protein_group}: "
            f"N={entry.n_contacts:.1f}, ΔG_sel={entry.delta_G:+.3f}, "
            f"Score={entry.score:+.3f} kT"
        )

# Access pairwise comparisons
for pair in result.pairwise_comparisons:
    if pair.p_value is not None:
        sig = "**" if pair.p_value < 0.01 else "*" if pair.p_value < 0.05 else ""
        print(
            f"{pair.condition_a} → {pair.condition_b}: "
            f"ΔScore = {pair.delta_score:+.2f} kT, p = {pair.p_value:.4f} {sig}"
        )

Troubleshooting

“No binding preference data found”

Run contacts analysis with binding preference enabled first:

polyzymd compare contacts -f comparison.yaml

Ensure compute_binding_preference: true is set in your contacts analysis settings.

Very large scores (|S| > 50 kT)

Large scores usually indicate many contacts with strong enrichment/depletion. Check whether your polymer selection is too broad (including solvent) or your contact cutoff is too large. Also verify that n_exposed_in_group values are reasonable — very large groups amplify the score.

Score of exactly 0.0

This occurs when either mean_contact_fraction = 0 (no contacts) or enrichment = 0 (contacts exactly match random expectation). Check the binding preference data for the affected condition.

Different scores from BFE analysis

The affinity score and BFE analysis use the same ΔG\(_{\mathrm{sel}}\) formula but different aggregation: BFE reports per-contact ΔG\(_{\mathrm{sel}}\), while the affinity score multiplies by N_contacts. The per-contact ΔG\(_{\mathrm{sel}}\) values should match exactly; differences indicate a cache or configuration mismatch.