# Polymer Affinity Score Analysis Quantify the **total polymer-protein interaction strength** by combining per-contact free energies with contact counts across all residue groups. This analysis answers the experimentalist's question: *"Which polymer composition has the strongest overall adhesion to the enzyme surface?"* ```{note} The polymer affinity score is a **comparative scoring metric for ranking polymer compositions**. It is derived entirely from cached binding preference data - no additional trajectory access is needed. Run contacts analysis with `compute_binding_preference: true` first. ``` ```{warning} Polymer affinity is currently labeled **experimental** in the presentation release. The CLI command and plots remain available, but PolyzyMD now adds explicit experimental warnings because the score definition and interpretation may still change. ``` ## Scientific Motivation ### Beyond Per-Contact Free Energy [Binding free energy analysis](analysis_binding_free_energy.md) reports the thermodynamic preference of each polymer type for each residue group (e.g., "EGMA prefers aromatics by -0.52 kT"). However, this per-contact value misses the **polyvalent binding effect**: many contacts happen simultaneously. A polymer with a modest per-contact affinity but many simultaneous contacts may have a stronger total interaction than one with strong per-contact affinity but few contacts. The polymer affinity score captures this by summing: ```{math} \text{Score} = N \times \Delta G_{\mathrm{sel}} ``` where: - **N** = mean number of simultaneous contacts with a residue group - **ΔG$_{\mathrm{sel}}$** = per-contact selectivity free energy from Boltzmann inversion ### The Formula Per (polymer_type, protein_group): ```{math} N_g = \text{mean\_contact\_fraction} \times n_\text{exposed\_in\_group} ``` ```{math} \Delta G_{\mathrm{sel},g} = -\ln\!\left(\frac{\text{contact\_share}}{\text{expected\_share}}\right) \quad [k_\mathrm{b}T] ``` ```{math} S_g = N_g \times \Delta G_{\mathrm{sel},g} ``` Per polymer type: ```{math} S_\text{polymer} = \sum_g S_g ``` Per condition (headline score): ```{math} S_\text{total} = \sum_\text{polymer\_types} S_\text{polymer} ``` ### How N_contacts Is Computed N_contacts is the **mean number of exposed residues in a protein group that are simultaneously in contact with a given polymer type, averaged across all trajectory frames after equilibration**. It is not a total count of contact events over the simulation, and it is not a binary flag for whether any contact occurred. #### From Raw Frames to N The computation proceeds in four stages: **Stage 1 — Per residue, per frame (binary).** For each protein residue and each trajectory frame, a contact is recorded if any heavy atom of a polymer residue of a given type is within the cutoff distance (default 4.5 A) of any heavy atom of the protein residue. Multiple polymer segments of the same type contacting the same protein residue in the same frame count as **one** contact (unique frame counting via set operations). **Stage 2 — Per residue, per polymer type (fraction).** For each exposed protein residue, the number of frames with at least one contact from polymer type *P* is divided by the total number of analyzed frames. This gives a per-residue contact fraction in [0, 1]. **Stage 3 — Per (polymer_type, protein_group) (mean fraction).** The per-residue contact fractions are averaged across all surface-exposed residues in the protein group: ```{math} \text{mean\_contact\_fraction} = \frac{\sum_{r \in \text{group}} \text{frames\_contacted}_r}{n_\text{frames} \times n_\text{exposed\_in\_group}} ``` This is mathematically equivalent to the arithmetic mean of the per-residue fractions, computed as a single division for efficiency. The value answers: "on average, what fraction of frames is each exposed residue in this group contacted by this polymer type?" **Stage 4 — N_contacts.** Multiplying by the group size recovers the mean number of simultaneously contacted residues per frame: ```{math} N = \text{mean\_contact\_fraction} \times n_\text{exposed\_in\_group} = \frac{\sum_{r \in \text{group}} \text{frames\_contacted}_r}{n_\text{frames}} ``` The algebra cancels `n_exposed_in_group` from numerator and denominator, revealing that **N is simply the total contact-frames divided by the number of frames** — i.e., the average number of contacted residues per frame. #### Worked Example Consider a polymer type SBM contacting the "aromatic" protein group (5 exposed residues) over 1000 analyzed frames: | Residue | Frames contacted | Per-residue fraction | |---------|-----------------|---------------------| | Trp23 | 800 | 0.80 | | Phe45 | 200 | 0.20 | | Tyr67 | 150 | 0.15 | | Phe89 | 50 | 0.05 | | Trp112 | 0 | 0.00 | - `total_contact_frames = 800 + 200 + 150 + 50 + 0 = 1200` - `mean_contact_fraction = 1200 / (1000 × 5) = 0.24` - **`N = 0.24 × 5 = 1.2`** Interpretation: on average, **1.2 aromatic residues are simultaneously in contact with SBM in any given frame**. Some frames have 0 contacts, some have 3 or 4 — but the time-averaged expectation is 1.2 simultaneous contacts. If this group's ΔG$_{\mathrm{sel}}$ is -0.22 kT, the group contribution to the affinity score is `1.2 × (-0.22) = -0.26 kT`. #### Why This Quantity (and Not Alternatives) | Alternative definition of N | Why it would be wrong | |-----------------------------|-----------------------| | **Total contact count over the simulation** | Conflates simulation length with interaction strength. A 1 μs trajectory would score 10× higher than a 100 ns trajectory for identical binding. | | **Fraction of frames with any contact in the group** | Loses polyvalency information. Cannot distinguish 1 residue contacted vs 15 residues contacted simultaneously — the entire point of this analysis. | | **Number of unique residues ever contacted** | Loses temporal information. A residue contacted once in 10,000 frames contributes equally to one contacted every frame. | | **Peak (maximum) simultaneous contacts** | Sensitive to outlier frames. Not representative of the sustained interaction. | ```{note} **What N does not capture.** Two polymers can have identical N = 10 but differ in binding mode: one contacts the same 10 residues every frame (stable binding), while the other contacts different residues each frame (transient surface scanning). For ranking total adhesion strength, this distinction does not matter — both have 10 simultaneous contacts contributing to the score. To distinguish binding modes, use [residence time analysis](analysis_contacts_quickstart.md#residence-time-analysis). ``` #### Cross-Replicate Aggregation When multiple replicates are available, `mean_contact_fraction` is **arithmetically averaged** across replicates with standard error of the mean (SEM). `n_exposed_in_group` is determined from the SASA calculation on the crystal structure PDB and is therefore **identical across replicates** — it is a property of the protein, not the trajectory. The resulting N inherits the cross-replicate mean and uncertainty from `mean_contact_fraction` alone. ### Sign Convention | Score | Meaning | |-------|---------| | **< 0** | Net attractive polymer-protein interaction | | **= 0** | Neutral (contacts match random expectation) | | **> 0** | Net repulsive (polymer avoids protein surface) | More negative = stronger overall polymer-protein adhesion. ```{important} **Independence assumption disclaimer.** The affinity score sums individual contact free energies assuming thermodynamic independence of contacts. Conformational coupling, surface exclusion effects, and reference state shifts violate this assumption. The absolute values are therefore not rigorous thermodynamic ΔG. However, the **relative differences** between polymer compositions are meaningful for ranking purposes, as systematic biases cancel when comparing compositions analyzed under identical conditions. ``` ### Interpretation for Experimentalists "Among compositions with similar RMSF (structural stability), the one with the most negative affinity score has the strongest polymer-protein adhesion — good for thermal resistance, but check that it doesn't block the active site." The affinity score should always be interpreted alongside: - **RMSF** — does strong adhesion stabilize or rigidify the enzyme? - **Triad geometry** — does polymer contact disrupt the catalytic triad? - **Substrate distance** — does the polymer occlude the active site? ## Quick Start ### Step 1: Ensure binding preference data exists Polymer affinity analysis reads cached binding preference files. These are produced automatically by `polyzymd compare contacts`: ```yaml # comparison.yaml analysis_settings: contacts: compute_binding_preference: true surface_exposure_threshold: 0.2 enzyme_pdb_for_sasa: "structures/enzyme.pdb" include_default_aa_groups: true ``` ```bash # Generate binding preference cache (if not already done) polyzymd compare contacts -f comparison.yaml ``` ### Step 2: Add polymer affinity settings (optional) The polymer affinity score works with zero configuration — defaults are suitable for most use cases. To customize: ```yaml # comparison.yaml analysis_settings: polymer_affinity: surface_exposure_threshold: 0.2 comparison_settings: polymer_affinity: fdr_alpha: 0.05 ``` ### Step 3: Run polymer affinity analysis ```bash polyzymd compare polymer-affinity -f comparison.yaml ``` ## Example Output ``` Polymer Affinity Score: LipA 363K Comparison ================================================================================ Formula : Score = N_contacts × ΔG_sel, where ΔG_sel = -ln(contact_share / expected_share) Units : kT (dimensionless, in units of k_bT) Sign : Negative = net attractive interaction Note : Assumes thermodynamic independence of contacts (see docs for caveats) Per-Condition Summary (sign: more negative = stronger adhesion) -------------------------------------------------------------------------------- 100% SBMA (T = 363.0 K, n = 5) Polymer AA Group N_contacts ΔG_sel/kT Score/kT ---------------------------------------------------------------- SBM aromatic 12.3 -0.19 -2.34 SBM charged_negative 8.7 +0.08 +0.70 SBM charged_positive 6.1 -0.01 -0.06 SBM nonpolar 15.2 +0.04 +0.61 SBM polar 9.4 -0.00 -0.00 ---------------------------------------------------------------- SBM TOTAL -1.09 SBMA-EGPMA 5% (T = 363.0 K, n = 5) Polymer AA Group N_contacts ΔG_sel/kT Score/kT ---------------------------------------------------------------- SBM aromatic 14.1 -0.22 -3.10 SBM nonpolar 17.3 +0.02 +0.35 EGP aromatic 1.2 -0.45 -0.54 ... ---------------------------------------------------------------- ALL TOTAL -4.21 Pairwise Score Differences (Score_B - Score_A) -------------------------------------------------------------------------------- 100% SBMA → SBMA-EGPMA 5% : ΔScore = -3.12 kT (p = 0.003 **) ``` ### Interpreting the Numbers From the example: - **SBMA-EGPMA 5% total score: -4.21 kT** — stronger overall adhesion than 100% SBMA (-1.09 kT). The 5% EGPMA comonomer increases total polymer-protein binding through aromatic contacts. - **SBM aromatic N_contacts=14.1, ΔG$_{\mathrm{sel}}$=-0.22 kT** — each of ~14 simultaneous aromatic contacts contributes -0.22 kT, giving a group score of -3.10 kT. - **Pairwise ΔScore = -3.12 kT, p = 0.003** — the difference is statistically significant. ```{tip} Scores are in kT units. At 363 K, 1 kT = 0.72 kcal/mol = 3.02 kJ/mol. A total score difference of |ΔScore| > 2 kT is generally considered meaningful. ``` ## Configuration Reference ### `analysis_settings.polymer_affinity` | Field | Type | Default | Description | |-------|------|---------|-------------| | `surface_exposure_threshold` | float | `0.2` | Minimum relative SASA for exposed residues | | `polymer_selection` | str | `"resname SBM EGM EGP"` | MDAnalysis polymer selection | | `protein_selection` | str | `"protein"` | MDAnalysis protein selection | | `cutoff` | float | `4.5` | Contact distance cutoff (Angstroms) | | `compute_binding_preference` | bool | `true` | Must be true for affinity analysis | ### `comparison_settings.polymer_affinity` | Field | Type | Default | Description | |-------|------|---------|-------------| | `fdr_alpha` | float | `0.05` | Benjamini-Hochberg FDR threshold | ## CLI Reference ``` polyzymd compare polymer-affinity [OPTIONS] ``` | Option | Description | |--------|-------------| | `-f, --file PATH` | Path to `comparison.yaml` (default: `comparison.yaml`) | | `--recompute` | Force recompute | | `--format [table\|markdown\|json]` | Output format (default: `table`) | | `-o, --output PATH` | Save output to file | | `-q, --quiet` | Suppress INFO messages | | `--debug` | Enable DEBUG logging | | `--eq-time TEXT` | Override equilibration time | ### Usage Examples ```bash # Default: console table polyzymd compare polymer-affinity -f comparison.yaml # Save markdown report polyzymd compare polymer-affinity --format markdown -o affinity_report.md # JSON for downstream processing polyzymd compare polymer-affinity --format json -o affinity_result.json ``` ## Per-Replicate Computation When per-replicate binding preference files (`binding_preference_rep{N}.json`) are available, the polymer affinity score computes exact per-replicate scores: ```{math} N_{\text{rep}} = \text{contact\_fraction}_{\text{rep}} \times n_\text{exposed} ``` ```{math} \Delta G_{\mathrm{sel,rep}} = -\ln(\text{enrichment}_{\text{rep}} + 1) ``` ```{math} S_{\text{rep}} = N_{\text{rep}} \times \Delta G_{\mathrm{sel,rep}} ``` Mean and SEM are computed across replicates. Pairwise t-tests use these per-replicate score distributions. When per-replicate files are unavailable, the comparator falls back to analytical error propagation: ```{math} \sigma(S) = \sqrt{(N \cdot \sigma_{\Delta G_{\mathrm{sel}}})^2 + (\Delta G_{\mathrm{sel}} \cdot \sigma_N)^2} ``` ## Multi-Temperature Studies Like binding free energy analysis, the polymer affinity score uses kT units that already account for temperature. However, the underlying contact distributions may differ between temperatures, so **pairwise statistics are suppressed for cross-temperature pairs**. Per-condition scores are always shown — they are valid within each temperature. ## Relation to Other Analyses | Analysis | What It Measures | Relation to Affinity Score | |----------|-----------------|--------------------------| | **Contacts** | Which residues are touched | Provides the raw contact data | | **Binding Preference** | Enrichment per residue group | Provides enrichment → ΔG_sel | | **Binding Free Energy** | Per-contact ΔG_sel | Component of the score (without N) | | **Polymer Affinity Score** | Total interaction strength | N × ΔG_sel summed over all groups | | **RMSF** | Structural flexibility | Complementary: stability metric | | **Triad** | Active site geometry | Complementary: function metric | ## Output File Location Results are saved to the comparison's `results/` directory: ``` comparison_output/ └── results/ └── polymer_affinity_score_YYYYMMDD_HHMMSS.json ``` Reload for downstream processing: ```python from polyzymd.compare.results.polymer_affinity import PolymerAffinityScoreResult result = PolymerAffinityScoreResult.load("polymer_affinity_score_20260228_120000.json") # Access per-condition scores for summary in result.conditions: print(f"\n{summary.label}: total_score = {summary.total_score:+.2f} kT") for entry in summary.entries: print( f" {entry.polymer_type} × {entry.protein_group}: " f"N={entry.n_contacts:.1f}, ΔG_sel={entry.delta_G:+.3f}, " f"Score={entry.score:+.3f} kT" ) # Access pairwise comparisons for pair in result.pairwise_comparisons: if pair.p_value is not None: sig = "**" if pair.p_value < 0.01 else "*" if pair.p_value < 0.05 else "" print( f"{pair.condition_a} → {pair.condition_b}: " f"ΔScore = {pair.delta_score:+.2f} kT, p = {pair.p_value:.4f} {sig}" ) ``` ## Troubleshooting ### "No binding preference data found" Run contacts analysis with binding preference enabled first: ```bash polyzymd compare contacts -f comparison.yaml ``` Ensure `compute_binding_preference: true` is set in your contacts analysis settings. ### Very large scores (|S| > 50 kT) Large scores usually indicate many contacts with strong enrichment/depletion. Check whether your polymer selection is too broad (including solvent) or your contact cutoff is too large. Also verify that `n_exposed_in_group` values are reasonable — very large groups amplify the score. ### Score of exactly 0.0 This occurs when either `mean_contact_fraction = 0` (no contacts) or `enrichment = 0` (contacts exactly match random expectation). Check the binding preference data for the affected condition. ### Different scores from BFE analysis The affinity score and BFE analysis use the same ΔG$_{\mathrm{sel}}$ formula but different aggregation: BFE reports per-contact ΔG$_{\mathrm{sel}}$, while the affinity score multiplies by N_contacts. The per-contact ΔG$_{\mathrm{sel}}$ values should match exactly; differences indicate a cache or configuration mismatch. ## See Also - [Binding Free Energy Analysis](analysis_binding_free_energy.md) — per-contact ΔG_sel (the component metric) - [Binding Preference Analysis](analysis_binding_preference.md) — the prerequisite enrichment analysis - [Contacts Analysis Quick Start](analysis_contacts_quickstart.md) — basic contact computation - [Statistics Best Practices](analysis_statistics_best_practices.md) — replicate planning - [Comparing Conditions](analysis_compare_conditions.md) — multi-condition workflows - [Extending Comparators](extending_comparators.md) — add custom comparators