# Binding Preference Analysis Understand **which amino acid classes your polymer preferentially binds** using enrichment analysis. This module answers the fundamental question: *"Does my polymer type show preferential affinity for certain types of protein residues?"* ```{note} Binding preference analysis builds on [contacts analysis](analysis_contacts_quickstart.md). Run contacts analysis first, then enable binding preference in your comparison configuration. ``` ```{warning} Binding preference is currently labeled **experimental** in the presentation release. The workflow remains available from `polyzymd compare run contacts`, and its plots are still generated, but CLI output and figures are explicitly marked experimental because definitions and interpretation may still change. ``` :::{admonition} Environment Setup :class: tip All commands below assume you have activated the PolyzyMD pixi environment: ```bash pixi shell -e build ``` Alternatively, prefix each command with `pixi run -e build`. ::: ## Scientific Motivation When designing enzyme-polymer conjugates, understanding polymer binding preferences is crucial for several reasons: 1. **Active site protection**: Does the polymer avoid or shield the catalytic residues? 2. **Electrostatic complementarity**: Do zwitterionic polymers preferentially bind charged surface residues? 3. **Hydrophobic interactions**: Do PEG-like polymers show affinity for nonpolar patches? 4. **Rational design**: Which polymer chemistry best matches your enzyme's surface properties? Raw contact counts are misleading because protein surfaces have unequal distributions of amino acid types. A polymer might contact 50% of aromatic residues and 50% of polar residues, but if the surface has 5 aromatic residues and 50 polar residues, the polymer clearly *prefers* aromatic residues. **Binding preference analysis normalizes by surface availability** to reveal true preferences. ## The Enrichment Formula For each (polymer type, protein group) pair, we compute a **zero-centered enrichment** value: ```{math} \text{Enrichment} = \frac{\text{Contact Share}}{\text{Expected Share}} - 1 ``` where: ```{math} \text{Contact Share} = \frac{\sum_{\text{residues in group}} \text{contact frames}}{\sum_{\text{all residues}} \text{contact frames}} ``` and: ```{math} \text{Expected Share} = \frac{\text{exposed residues in group}}{\text{total exposed residues}} ``` This **surface-availability normalization** asks: *"Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally?"* ### Interpretation (Zero-Centered Scale) | Enrichment | Meaning | |------------|---------| | **> 0** | **Preferential binding** — polymer contacts this group more than expected | | **= 0** | **Neutral** — contact frequency matches random chance | | **< 0** | **Avoidance** — polymer contacts this group less than expected | | **= -1** | **Complete avoidance** — no contacts observed | **Examples:** - `+0.45` means "45% more contacts than expected" - `-0.30` means "30% fewer contacts than expected" - `+1.00` means "2× as many contacts as expected" (100% more) ### Why Surface-Availability Normalization? The correct baseline for "expected contacts" is **protein surface availability**, not polymer composition. Here's why: **The question we're asking**: *"Does this polymer type preferentially bind aromatic residues?"* **The correct comparison**: If aromatic residues comprise 10% of the protein's surface, and this polymer contacts aromatics 15% of the time, then enrichment = (0.15 / 0.10) - 1 = +0.50 (50% preference). **What NOT to compare**: The fraction of the polymer that is SBMA vs EGMA is irrelevant to this question — it tells us about polymer composition, not about which protein groups the polymer prefers. ```{note} Polymer composition (residue counts, heavy atom counts) is stored as **metadata** in the results for secondary analysis, but is not used in the enrichment formula. ``` ### Why Contact Frames, Not Binary Counts? The formula uses **contact frames** (duration-weighted) rather than binary "contacted yes/no" counts. This ensures that: - A residue contacted for 60% of the simulation contributes 60× more than one contacted for 1 frame - Transient, non-specific contacts don't dominate the signal - Stable, functionally relevant interactions are emphasized ### Surface Exposure Filtering Only **surface-exposed residues** are considered in the enrichment calculation. Buried residues are excluded because they're physically inaccessible to polymer. Surface exposure is determined using **Solvent Accessible Surface Area (SASA)**: - Residues with relative SASA > threshold (default 20%) are considered exposed - This is computed from a static enzyme PDB structure - The threshold can be adjusted in configuration ## Quick Start ### Enable Binding Preference in comparison.yaml `````{tab-set} ````{tab-item} YAML (Recommended) Add `compute_binding_preference: true` to your contacts analysis settings: ```yaml # comparison.yaml name: "Polymer Binding Preference Study" control: "No Polymer" conditions: - label: "No Polymer" config: "../no_polymer/config.yaml" replicates: [1, 2, 3] - label: "100% SBMA" config: "../sbma_100/config.yaml" replicates: [1, 2, 3] - label: "100% EGMA" config: "../egma_100/config.yaml" replicates: [1, 2, 3] plugins: contacts: polymer_selection: "resname SBM EGM" protein_selection: "protein" cutoff: 4.5 # Binding preference settings compute_binding_preference: true surface_exposure_threshold: 0.2 # 20% relative SASA enzyme_pdb_for_sasa: "structures/enzyme.pdb" include_default_aa_groups: true ``` Then run the comparison: ```bash polyzymd compare run contacts -f comparison.yaml ``` ```` ````{tab-item} CLI The CLI automatically uses binding preference settings from `comparison.yaml`. You cannot enable binding preference via CLI flags alone—it must be configured in YAML. ```bash # Run comparison (binding preference enabled in YAML) polyzymd compare run contacts -f comparison.yaml # View results in markdown format polyzymd compare run contacts -f comparison.yaml --format markdown ``` ```` ````{tab-item} Python ```python from polyzymd.analyses.shared.binding_preference import BindingPreferenceResult # Binding preference is computed automatically when: # plugins.contacts.compute_binding_preference: true # # After running: # polyzymd compare run contacts -f comparison.yaml result = BindingPreferenceResult.load( # Filenames include a settings fingerprint and replicate range, e.g.: # binding_preference_s_rep1.json (per-replicate) # binding_preference_aggregated_s_reps1-3.json (aggregated) # Use glob to find the aggregated file: # next(Path("analysis//contacts/").glob("binding_preference_aggregated_*.json")) "analysis//contacts/binding_preference_aggregated_s_reps1-3.json" ) # Access enrichment values for polymer_type, partition in result.binding_preference.aa_class_binding.items(): print(f"\n{polymer_type}:") for entry in partition.entries: print(f" {entry.partition_element}: {entry.enrichment:+.2f}") ``` ```` ````` ### Example Output When you run `polyzymd compare run contacts` with binding preference enabled, you'll see output like this after an experimental warning banner: ``` Binding Preference - Enrichment by Amino Acid Class ----------------------------------------------------------------------------------------------- Surface exposure threshold: 20% relative SASA Enrichment normalized by: protein surface availability Polymer: EGM Protein Group 0% SBMA / 100% 25% SBMA / 75% 50% SBMA / 50% ---------------------------------------------------------------------------------- aromatic +0.90±0.06 + +0.65±0.16 + +0.50±0.06 + charged_negative -0.55±0.04 - -0.44±0.10 - -0.39±0.08 - charged_positive -0.21±0.10 - -0.11±0.07 - -0.06±0.04 - nonpolar +0.31±0.04 + +0.14±0.01 + +0.13±0.06 + polar -0.29±0.02 - -0.13±0.05 - -0.12±0.03 - Polymer: SBM Protein Group 100% SBMA / 0% 25% SBMA / 75% 50% SBMA / 50% ---------------------------------------------------------------------------------- aromatic +0.29±0.03 + +0.83±0.09 + +0.54±0.07 + charged_negative -0.10±0.05 - -0.24±0.18 - -0.29±0.11 - charged_positive +0.02±0.03 + +0.22±0.10 + +0.10±0.02 + nonpolar -0.05±0.03 - -0.09±0.12 - -0.12±0.05 - polar +0.00±0.00 = -0.17±0.04 - +0.01±0.06 + + = enriched (>0), - = depleted (<0) ``` ### Interpreting These Results From the example above, we can draw several conclusions: 1. **EGMA (EGM)** shows: - Strong preference for **aromatic** residues (+0.90 = 90% more contacts than expected) - Strong preference for **nonpolar** residues (+0.31 = 31% more contacts) - Avoidance of **charged** residues (-0.55 for negative, -0.21 for positive) - This is consistent with EGMA's hydrophobic PEG-like character 2. **SBMA (SBM)** shows: - Moderate preference for **aromatic** residues (+0.29) - Slight preference for **charged positive** residues (+0.02) - Near-neutral behavior for most groups - This reflects SBMA's zwitterionic nature (balanced charges) 3. **Composition effects**: As SBMA ratio increases (25% → 50% → 100%): - EGMA's aromatic preference decreases (+0.90 → +0.65 → +0.50) - This suggests competition for aromatic binding sites ## Configuration Options ### ContactsAnalysisSettings | Field | Type | Default | Description | |-------|------|---------|-------------| | `compute_binding_preference` | bool | `false` | Enable binding preference analysis | | `surface_exposure_threshold` | float | `0.2` | Relative SASA threshold (0.0-1.0) | | `enzyme_pdb_for_sasa` | str | `null` | Path to enzyme PDB for SASA calculation | | `include_default_aa_groups` | bool | `true` | Use standard AA class groupings | | `protein_groups` | dict | `null` | Custom protein groups `{name: [resids]}` | | `polymer_type_selections` | dict | `null` | Custom polymer type definitions (see below) | ### Polymer Type Selections By default, polymer types are auto-detected from unique residue names in the polymer selection. You can explicitly define polymer types with custom selections: ```yaml plugins: contacts: polymer_selection: "chainID C" compute_binding_preference: true # Explicit polymer type definitions polymer_type_selections: SBMA: "chainID C and resname SBM" EGMA: "chainID C and resname EGM" OEGMA: "chainID C and resname OEG" ``` This is useful when: - Residue names don't match your preferred labels - You want to group multiple residue types together - You need fine-grained control over what counts as each polymer type ### Custom Protein Groups You can define custom protein groups to analyze specific regions of interest: ```yaml plugins: contacts: compute_binding_preference: true surface_exposure_threshold: 0.2 enzyme_pdb_for_sasa: "structures/enzyme.pdb" # Include default AA class groups (aromatic, polar, etc.) include_default_aa_groups: true # Add custom groups (can override defaults if names conflict) protein_groups: active_site: [77, 133, 156] # Catalytic triad lid_domain: [140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150] binding_pocket: [50, 51, 52, 80, 81, 82, 83] ``` ```{tip} If a custom group name matches a default AA class name (e.g., `aromatic`), your custom definition takes precedence. ``` ### Surface Exposure Threshold The `surface_exposure_threshold` controls which residues are considered "surface accessible": | Threshold | Effect | |-----------|--------| | `0.1` (10%) | More permissive — includes partially buried residues | | `0.2` (20%) | **Default** — standard definition of "exposed" | | `0.3` (30%) | More stringent — only highly exposed residues | | `0.5` (50%) | Very stringent — only fully exposed loops/turns | ```{note} Lower thresholds include more residues in the analysis, which may dilute the signal. Higher thresholds focus on the most accessible residues but may have smaller sample sizes. ``` ## Default Amino Acid Classifications When `include_default_aa_groups: true`, the following standard groupings are used: | Group | Amino Acids | Chemical Property | |-------|-------------|-------------------| | `aromatic` | TRP, PHE, TYR | π-stacking, hydrophobic | | `charged_positive` | LYS, ARG | Cationic at pH 7 | | `charged_negative` | ASP, GLU | Anionic at pH 7 | | `nonpolar` | ALA, VAL, LEU, ILE, MET, PRO, GLY | Hydrophobic/aliphatic | | `polar` | SER, THR, ASN, GLN, HIS, CYS | H-bond donors/acceptors | ```{note} Histidine (HIS) is classified as **polar** by default because its protonation state is pH-dependent. At low pH, it would be positively charged. ``` ## Output Files Binding preference results are saved alongside contact analysis results: ``` project/ └── condition_name/ └── analysis/ └── contacts/ ├── contacts_eq10ns_cut4.5_s_rep1.json ├── contacts_eq10ns_cut4.5_s_rep2.json ├── contacts_eq10ns_cut4.5_s_rep3.json ├── binding_preference_s_rep1.json # Per-replicate ├── binding_preference_s_rep2.json ├── binding_preference_s_rep3.json └── binding_preference_aggregated_s_reps1-3.json # Aggregated ``` ```{note} Filenames include a settings fingerprint (`s`) derived from the analysis settings. Use glob patterns (e.g., `binding_preference_aggregated_*.json`) to find result files programmatically. ``` ### JSON Structure (Schema v5) Schema version 5 introduced **partition-based binding preference**, where each polymer type has its own partition-based enrichment results. This ensures that `contact_share` sums to exactly 1.0 within each partition (no overlapping group bugs). ```json { "entries": [/* Legacy entries for backwards compatibility */], "polymer_composition": { "residue_counts": {"SBM": 50, "EGM": 50}, "heavy_atom_counts": {"SBM": 750, "EGM": 400} }, "binding_preference": { "aa_class_binding": { "SBM": { "partition_name": "aa_class", "partition_type": "aa_class", "polymer_type": "SBM", "entries": [ { "partition_element": "aromatic", "polymer_type": "SBM", "total_contact_frames": 12543, "contact_share": 0.164, "expected_share": 0.086, "enrichment": 0.907, "n_exposed_in_element": 7, "n_residues_in_element": 20, "n_residues_contacted": 6 }, {"partition_element": "polar", /* ... */}, {"partition_element": "nonpolar", /* ... */}, {"partition_element": "charged_positive", /* ... */}, {"partition_element": "charged_negative", /* ... */} ], "total_contact_share": 1.0, "total_expected_share": 1.0 }, "EGM": {/* same structure for EGM */} }, "user_defined_partitions": { "lid_helices": { "SBM": {/* PartitionBindingResult */}, "EGM": {/* PartitionBindingResult */} } }, "n_frames": 10000, "total_exposed_residues": 81, "surface_exposure_threshold": 0.2, "polymer_types": ["SBM", "EGM"], "schema_version": 5 }, "system_coverage": {/* SystemCoverageResult */}, "n_frames": 10000, "total_exposed_residues": 81, "surface_exposure_threshold": 0.2, "schema_version": 5 } ``` ```{important} **Schema v5 guarantees**: Within each partition (e.g., `aa_class`), the `contact_share` values sum to exactly 1.0 for each polymer type. This eliminates the overlapping-groups bug present in earlier versions where residues could be counted in multiple groups. ``` ```{note} For programmatic access, prefer the `binding_preference` field with its partition-based structure. ``` ## Statistical Treatment ### Aggregation Across Replicates When multiple replicates are available, binding preference is computed **per-replicate** and then aggregated: 1. Compute enrichment for each replicate independently 2. Report mean enrichment ± SEM across replicates 3. Store per-replicate values for downstream analysis This approach: - Preserves replicate independence for statistical testing - Provides uncertainty estimates (SEM) - Enables detection of outlier replicates ### Uncertainty in Enrichment The uncertainty in enrichment comes from: - Variation in contact patterns across replicates - Different initial configurations - Thermal fluctuations in the simulation The SEM (Standard Error of the Mean) quantifies confidence in the mean enrichment across replicates. ## Worked Example: Comparing Polymer Preferences ### Scientific Question *"Do zwitterionic (SBMA) and PEG-like (EGMA) polymers show different binding preferences? Which polymer type is better for protecting the enzyme's active site?"* ### Analysis Workflow ```yaml # comparison.yaml name: "SBMA vs EGMA Binding Preferences" control: "No Polymer" conditions: - label: "No Polymer" config: "../no_polymer/config.yaml" replicates: [1, 2, 3] - label: "100% SBMA" config: "../sbma_100/config.yaml" replicates: [1, 2, 3] - label: "100% EGMA" config: "../egma_100/config.yaml" replicates: [1, 2, 3] - label: "50% SBMA / 50% EGMA" config: "../copolymer_50_50/config.yaml" replicates: [1, 2, 3] plugins: contacts: polymer_selection: "resname SBM EGM" cutoff: 4.5 compute_binding_preference: true surface_exposure_threshold: 0.2 enzyme_pdb_for_sasa: "structures/lipase.pdb" include_default_aa_groups: true # Custom groups for mechanistic insight protein_groups: catalytic_triad: [77, 133, 156] oxyanion_hole: [10, 77] lid_helix: [140, 141, 142, 143, 144, 145, 146, 147, 148, 149] ``` ### Run Analysis ```bash polyzymd compare run contacts -f comparison.yaml ``` ### Interpretation Framework | Observation | Interpretation | Design Implication | |-------------|----------------|-------------------| | EGMA aromatic enrichment > +1.0 | Strong π-stacking with surface Trp/Phe/Tyr | May bind near aromatic-rich active sites | | SBMA charged_positive ≈ 0 | Neutral towards Lys/Arg despite zwitterionic nature | Balanced electrostatics | | SBMA catalytic_triad < -0.5 | Avoids active site | Good for maintaining catalytic activity | | EGMA lid_helix > +0.5 | Preferentially binds lid domain | May affect lid dynamics | ## API Reference ### compute_binding_preference ```python from polyzymd.analyses.shared.binding_preference import ( compute_binding_preference, extract_polymer_composition, ) # First, extract polymer composition from trajectory polymer_composition = extract_polymer_composition( universe, # MDAnalysis Universe with trajectory polymer_type_selections=None, # Optional: custom polymer type definitions ) # Then compute binding preference result = compute_binding_preference( contact_result, # ContactResult from trajectory analysis surface_exposure, # SurfaceExposureResult from SASA calculation protein_groups, # dict[str, set[int]] of group name -> resids polymer_composition, # PolymerComposition from extract_polymer_composition() polymer_types=None, # Optional filter for polymer types ) ``` ### BindingPreferenceResult Methods | Method | Returns | Description | |--------|---------|-------------| | `enrichment_matrix()` | `dict[str, dict[str, float]]` | Enrichment values (zero-centered) | | `contact_fraction_matrix()` | `dict[str, dict[str, float]]` | Mean contact fractions | | `contact_share_matrix()` | `dict[str, dict[str, float]]` | Contact shares | | `get_enrichment(polymer, group)` | `float` | Single enrichment value | | `get_entry(polymer, group)` | `BindingPreferenceEntry` | Full entry with all metrics | | `to_dataframe()` | `pd.DataFrame` | Convert to pandas for analysis | | `save(path)` | `None` | Save to JSON | | `load(path)` | `BindingPreferenceResult` | Load from JSON | ### BindingPreferenceEntry Fields | Field | Type | Description | |-------|------|-------------| | `polymer_type` | str | Polymer residue type (e.g., "SBM") | | `protein_group` | str | Protein group label (e.g., "aromatic") | | `total_contact_frames` | int | Sum of contact frames for group | | `mean_contact_fraction` | float | Average per-residue contact fraction | | `n_residues_in_group` | int | Total residues in group (all) | | `n_exposed_in_group` | int | Surface-exposed residues in group | | `n_residues_contacted` | int | Exposed residues with any contact | | `contact_share` | float | Fraction of polymer's contacts to group | | `expected_share` | float | Expected share (surface availability) | | `enrichment` | float | Zero-centered enrichment | | `polymer_residue_count` | int | Residues of this polymer type (metadata) | | `total_polymer_residues` | int | Total polymer residues (metadata) | | `polymer_heavy_atom_count` | int | Heavy atoms of this polymer type (metadata) | | `total_polymer_heavy_atoms` | int | Total polymer heavy atoms (metadata) | ### PolymerComposition ```python from polyzymd.analyses.shared.binding_preference import ( PolymerComposition, extract_polymer_composition, ) # Auto-extract from trajectory composition = extract_polymer_composition(universe) # Or create manually composition = PolymerComposition( residue_counts={"SBM": 50, "EGM": 50}, heavy_atom_counts={"SBM": 750, "EGM": 400}, ) # Access properties print(composition.total_residues) # 100 print(composition.residue_fraction("SBM")) # 0.5 print(composition.heavy_atom_fraction("SBM")) # 0.652 (larger monomer) ``` ## System Coverage Analysis While binding preference answers *"What does each polymer type prefer?"*, **system coverage** answers a complementary question: *"What does this polymer mixture collectively cover?"* ### When to Use System Coverage System coverage is useful when comparing **copolymer compositions** (conditions) rather than individual polymer types: - *"Does a 70:30 SBMA:EGMA mixture cover aromatic residues differently than a 30:70 mixture?"* - *"Which copolymer ratio provides better coverage of the active site?"* - *"How does aggregate polymer coverage change across my experimental conditions?"* ### The Coverage Formula For each protein group, system coverage computes: ```{math} \text{Coverage Share} = \frac{\sum_{\text{all polymers}} \text{contacts to group}}{\sum_{\text{all polymers}} \text{total contacts}} ``` ```{math} \text{Coverage Enrichment} = \frac{\text{Coverage Share}}{\text{Expected Share}} - 1 ``` where Expected Share is based on protein surface availability (same as binding preference). ### How It Differs from Binding Preference | Aspect | Binding Preference | System Coverage | |--------|-------------------|-----------------| | **Question** | What does SBMA prefer? | What does this mixture cover? | | **Scope** | Per polymer type | Collapsed across all polymers | | **Use case** | Compare polymer chemistries | Compare copolymer ratios | | **Comparable across** | Conditions (abundance cancels) | Conditions (aggregate behavior) | ```{important} Both metrics are **automatically computed** when you enable binding preference. System coverage is stored in `result.system_coverage`. ``` ### Partition-Based Architecture Both **binding preference** (schema v5) and **system coverage** (schema v2) use a **partition-based architecture** that ensures mathematically correct enrichment calculations. Each partition is a collection of **mutually exclusive** protein groups that together cover the protein surface. ```{important} **Why Partitions Must Be Mutually Exclusive** The enrichment formula requires that both `contact_share` and `expected_share` sum to exactly 1.0 across all groups in a partition: $$\sum_{\text{groups}} \text{contact\_share}_i = 1.0$$ $$\sum_{\text{groups}} \text{expected\_share}_i = 1.0$$ If groups **overlap** (share residues), the `expected_share` sum exceeds 1.0, causing systematically negative enrichments—a subtle but serious bug that was fixed in schema v5. ``` #### Built-in Partitions System coverage automatically computes several partition types: | Partition | Groups | Description | |-----------|--------|-------------| | **AA Class** | aromatic, polar, nonpolar, charged_positive, charged_negative | 5-way classification by amino acid chemistry | | **Custom Binary** | {custom_group, rest_of_protein} | One partition per custom group in `protein_groups` | | **Combined Custom** | {all custom groups, rest_of_protein} | Only if custom groups don't overlap | | **User-Defined** | User-specified groups from `protein_partitions` | Configurable via YAML | ### User-Defined Partitions For specialized analyses, you can define custom partitions that organize protein regions into mutually exclusive groups: `````{tab-set} ````{tab-item} YAML ```yaml # comparison.yaml plugins: contacts: # First, define individual protein groups as residue ID lists protein_groups: lid_helix_5: [141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155] lid_helix_10: [281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295] catalytic_triad: [131, 163, 194] active_site_loop: [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85] # Then, define partitions (must be mutually exclusive within each partition) protein_partitions: lid_helices: - lid_helix_5 - lid_helix_10 catalytic_regions: - catalytic_triad - active_site_loop ``` ```` ````{tab-item} Python ```python from polyzymd.analyses.contacts import ContactsSettings settings = ContactsSettings( protein_groups={ "lid_helix_5": [141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155], "lid_helix_10": [281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295], "catalytic_triad": [131, 163, 194], "active_site_loop": [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85], }, protein_partitions={ "lid_helices": ["lid_helix_5", "lid_helix_10"], "catalytic_regions": ["catalytic_triad", "active_site_loop"], }, ) ``` ```` ````` ```{note} **Automatic `rest_of_protein`**: If your partition groups don't cover all exposed protein residues, a `rest_of_protein` element is automatically added to ensure the partition sums to 1.0. ``` #### Validation at Config Load Time PolyzyMD validates partition definitions when the config is loaded, **failing fast** with a clear error if groups overlap: ```python # This will raise ValidationError at load time: protein_partitions: bad_partition: - lid_helix_5 # resid 141:155 - lid_domain_combined # resid 130:160 — overlaps with lid_helix_5! ``` Error message: ``` ValidationError: Partition 'bad_partition' has overlapping groups: lid_helix_5 and lid_domain_combined share residues {141, 142, ..., 155}. Partitions must contain mutually exclusive groups. ``` ### Accessing System Coverage `````{tab-set} ````{tab-item} Python ```python from polyzymd.analyses.shared.binding_preference import BindingPreferenceResult # Load binding preference result (includes system coverage) result = BindingPreferenceResult.load("binding_preference_rep1.json") # Access system coverage if result.system_coverage: coverage = result.system_coverage # Get AA class coverage enrichment print("AA Class Coverage Enrichment:") for entry in coverage.aa_class_coverage.entries: print(f" {entry.partition_element}: {entry.coverage_enrichment:+.2f}") # Get custom group enrichment (vs rest_of_protein) print("\nCustom Group Coverage:") for group_name in coverage.custom_group_names(): enrichment = coverage.get_custom_group_enrichment(group_name) print(f" {group_name}: {enrichment:+.2f}") # Access user-defined partitions print("\nUser-Defined Partitions:") for partition_name in coverage.user_partition_names(): partition = coverage.get_user_partition(partition_name) print(f" {partition_name}:") for entry in partition.entries: print(f" {entry.partition_element}: {entry.coverage_enrichment:+.2f}") ``` ```` ````{tab-item} JSON Structure (Schema v5) The full result uses schema v5 with both `binding_preference` (per-polymer) and `system_coverage` (aggregate) fields: ```javascript // Note: ... indicates additional fields omitted for brevity { "entries": [/* Legacy binding preference entries */], "polymer_composition": {/* polymer counts */}, "binding_preference": { "aa_class_binding": { "SBM": { "partition_name": "aa_class", "polymer_type": "SBM", "entries": [ { "partition_element": "aromatic", "polymer_type": "SBM", "total_contact_frames": 12543, "contact_share": 0.164, "expected_share": 0.086, "enrichment": 0.907, "n_exposed_in_element": 7, "n_residues_in_element": 20, "n_residues_contacted": 6 }, /* ... other AA classes ... */ ], "total_contact_share": 1.0, "total_expected_share": 1.0 }, "EGM": {/* same structure */} }, "user_defined_partitions": {/* optional user partitions per polymer */}, "polymer_types": ["SBM", "EGM"], "schema_version": 5 }, "system_coverage": { "aa_class_coverage": { "partition_name": "aa_class", "entries": [ { "partition_element": "aromatic", "total_contact_frames": 25432, "coverage_share": 0.164, "expected_share": 0.086, "coverage_enrichment": 0.907, "n_exposed_in_group": 7, "n_residues_in_group": 20, "polymer_contributions": {"SBM": 0.45, "EGM": 0.55} }, {"partition_element": "polar", /* ... */}, {"partition_element": "nonpolar", /* ... */}, {"partition_element": "charged_positive", /* ... */}, {"partition_element": "charged_negative", /* ... */} ] }, "custom_group_coverages": { "lid_helix_5": { "partition_name": "lid_helix_5_vs_rest", "entries": [ {"partition_element": "lid_helix_5", /* ... */}, {"partition_element": "rest_of_protein", /* ... */} ] } }, "user_defined_partitions": { "lid_helices": { "partition_name": "lid_helices", "entries": [ {"partition_element": "lid_helix_5", /* ... */}, {"partition_element": "lid_helix_10", /* ... */}, {"partition_element": "rest_of_protein", /* ... */} ] } }, "n_frames": 10000, "total_contact_frames": 154892, "total_exposed_residues": 81, "polymer_types_included": ["SBM", "EGM"], "schema_version": 2 }, "schema_version": 5 } ``` ```{note} The outer `schema_version: 5` indicates the BindingPreferenceResult format. The inner `system_coverage.schema_version: 2` indicates the SystemCoverageResult format (unchanged from previous versions). ``` ```` ````` ### Interpreting Polymer Contributions The `polymer_contributions` field shows how much each polymer type contributed to the coverage of a specific protein group. For example: ```python # For aromatic coverage: polymer_contributions = {"SBM": 0.45, "EGM": 0.55} # 45% of contacts to aromatics came from SBMA # 55% of contacts to aromatics came from EGMA ``` This helps understand: - Which polymer is "doing the work" for each protein group - Whether coverage is balanced across polymer types - How polymer ratio affects coverage patterns ### Aggregation Across Replicates System coverage is automatically aggregated when you aggregate binding preference: ```python from polyzymd.analyses.shared.binding_preference import ( aggregate_binding_preference, AggregatedBindingPreferenceResult, ) # Aggregate multiple replicates aggregated = aggregate_binding_preference([result1, result2, result3]) # Access aggregated system coverage if aggregated.system_coverage: # AA class coverage with statistics for entry in aggregated.system_coverage.aa_class_coverage.entries: print(f"{entry.partition_element}:") print(f" Mean enrichment: {entry.mean_coverage_enrichment:+.2f}") print(f" SEM: ±{entry.sem_coverage_enrichment:.2f}") print(f" Replicates: {entry.n_replicates}") # User-defined partitions with statistics for partition_name in aggregated.system_coverage.user_partition_names(): partition = aggregated.system_coverage.get_user_partition(partition_name) print(f"\n{partition_name}:") for entry in partition.entries: print(f" {entry.partition_element}: {entry.mean_coverage_enrichment:+.2f} ± {entry.sem_coverage_enrichment:.2f}") ``` ### SystemCoverageResult Methods | Method | Returns | Description | |--------|---------|-------------| | `aa_class_enrichment_dict()` | `dict[str, float]` | {aa_class: enrichment} mapping | | `get_aa_class_enrichment(cls)` | `float` | Enrichment for specific AA class | | `get_custom_group_enrichment(name)` | `float` | Enrichment for custom group | | `custom_group_names()` | `list[str]` | List of custom group names | | `user_partition_names()` | `list[str]` | List of user-defined partition names | | `get_user_partition(name)` | `PartitionCoverageResult` | Get user-defined partition | | `save(path)` / `load(path)` | | JSON serialization | ### PartitionCoverageEntry Fields | Field | Type | Description | |-------|------|-------------| | `partition_element` | str | Group name within the partition | | `total_contact_frames` | int | Sum of ALL polymer contacts to group | | `coverage_share` | float | Fraction of all contacts to this group | | `expected_share` | float | Expected share (surface availability) | | `coverage_enrichment` | float | Zero-centered enrichment | | `n_exposed_in_group` | int | Surface-exposed residues in group | | `n_residues_in_group` | int | Total residues in group | | `polymer_contributions` | dict[str, float] | Fraction from each polymer type | ### Why Overlapping Groups Cause Bugs Understanding why partitions must be mutually exclusive helps avoid subtle errors: ```{admonition} Worked Example: The Overlap Bug :class: warning **Setup**: Consider two groups that overlap: - `lid_helix_5`: residues {141, 142, 143, 144, 145} - `lid_domain`: residues {140, 141, 142, 143, 144, 145, 146, 147} **Expected shares** (if treated as a "partition"): - `lid_helix_5`: 5 exposed residues → expected_share = 5/100 = 0.05 - `lid_domain`: 8 exposed residues → expected_share = 8/100 = 0.08 - **Sum: 0.13** (should be 1.0!) The 5 overlapping residues are **double-counted** in expected_share. **Coverage shares** (actual contacts): - `lid_helix_5`: 1000 contact-frames → coverage_share = 1000/20000 = 0.05 - `lid_domain`: 1600 contact-frames → coverage_share = 1600/20000 = 0.08 - **Sum: 0.13** (contacts aren't double-counted if we sum over groups) Wait—but contacts **are** double-counted! Those 5 overlapping residues contribute to **both** groups. So coverage_share also sums to >1.0? No—the denominator (total contacts) is shared, making coverage_shares additive only if groups don't overlap. **Result**: enrichment = (0.05 / 0.05) - 1 = 0, but the "expected" baseline is wrong because the partition doesn't sum to 1.0. Comparisons between groups become meaningless. **Solution**: Use mutually exclusive partitions. PolyzyMD validates this at config load time. ``` ## Visualization Both binding preference and system coverage generate publication-ready plots when you run `polyzymd compare run contacts`. These plots are automatically enabled by default but can be controlled via `plot_settings`. ### Available Plots | Plot Type | Description | Settings Field | |-----------|-------------|----------------| | **Binding Preference Heatmap** | Enrichment by (polymer type × protein group) across conditions | `generate_enrichment_heatmap` | | **Binding Preference Bars** | Grouped bars per polymer type, comparing conditions | `generate_enrichment_bars` | | **System Coverage Heatmap** | Aggregate coverage by protein group across conditions | `generate_system_coverage_heatmap` | | **System Coverage Bars** | Grouped bars showing aggregate coverage across conditions | `generate_system_coverage_bars` | | **User Partition Bars** | One grouped bar chart per user-defined partition | `generate_user_partition_bars` | ### Configuring Plot Settings `````{tab-set} ````{tab-item} YAML ```yaml # comparison.yaml plot_settings: output_dir: "figures/" format: "png" dpi: 300 contacts: # Binding preference plots generate_enrichment_heatmap: true generate_enrichment_bars: true figsize_enrichment_heatmap: [12, 8] # Auto-calculated if null figsize_enrichment_bars: [10, 6] enrichment_colormap: "RdBu_r" # Diverging colormap show_enrichment_error: true # System coverage plots (AA class partition) generate_system_coverage_heatmap: true generate_system_coverage_bars: true figsize_system_coverage_heatmap: [10, 6] # Auto-calculated if null figsize_system_coverage_bars: [10, 6] show_system_coverage_error: true # User-defined partition plots generate_user_partition_bars: true figsize_user_partition_bars: [10, 6] show_user_partition_error: true ``` ```` ````{tab-item} Python ```python from polyzymd.analyses.contacts._plot_settings import ContactsPlotSettings from polyzymd.config.comparison import PlotSettings plot_settings = PlotSettings( output_dir="figures/", format="png", dpi=300, contacts=ContactsPlotSettings( generate_enrichment_heatmap=True, generate_system_coverage_heatmap=True, enrichment_colormap="RdBu_r", show_enrichment_error=True, show_system_coverage_error=True, # User partition plots generate_user_partition_bars=True, show_user_partition_error=True, ), ) ``` ```` ````` ### Understanding the Heatmaps **Binding Preference Heatmap:** - One subplot per condition - Rows: Protein groups (aromatic, polar, etc.) - Columns: Polymer types (SBMA, EGMA, etc.) - Color: Zero-centered diverging colormap - **Red (positive)**: Preferential binding - **White (zero)**: Neutral - **Blue (negative)**: Avoidance **System Coverage Heatmap:** - Single heatmap comparing conditions - Rows: Protein groups - Columns: Conditions (100% SBMA, 50/50 copolymer, etc.) - Shows aggregate coverage across all polymer types ### Understanding the Bar Charts **Binding Preference Bars:** - One plot per polymer type - Groups: Protein groups - Bars within group: One per condition - Error bars: SEM across replicates **System Coverage Bars:** - Single plot comparing all conditions - Groups: AA class groups (aromatic, polar, etc.) - Bars within group: One per condition - Shows how different copolymer compositions collectively cover each protein group **User Partition Bars:** - One plot per user-defined partition - Groups: Partition elements (e.g., lid_helix_5, lid_helix_10, rest_of_protein) - Bars within group: One per condition - Useful for comparing coverage of specific structural regions across conditions ### Disabling Specific Plots To disable specific plot types, set the corresponding field to `false`: ```yaml plot_settings: contacts: # Only generate bar charts, not heatmaps generate_enrichment_heatmap: false generate_system_coverage_heatmap: false generate_enrichment_bars: true generate_system_coverage_bars: true generate_user_partition_bars: true # Enable user partition plots ``` ## Troubleshooting ### "No exposed residues in group" This warning appears when a protein group has no surface-exposed residues (all are buried). Solutions: 1. Lower `surface_exposure_threshold` to include more residues 2. Verify the residue IDs in your custom `protein_groups` are correct 3. Check that your enzyme PDB has correct atom coordinates ### "Enzyme PDB not found" The SASA calculation requires a PDB file. Ensure: 1. `enzyme_pdb_for_sasa` path is relative to `comparison.yaml` location 2. The file exists and has correct format 3. Alternatively, place a symlink in the `structures/` directory ### Enrichment values of -1.0 An enrichment of exactly -1.0 means **complete avoidance** (no contacts observed for this polymer-group pair). This is expected for: - Buried residue groups with no surface exposure - Polymer types that physically cannot reach certain regions Check the `n_exposed_in_group` and `n_residues_contacted` fields to diagnose. ### High SEM values Large uncertainty (SEM) indicates high variability across replicates. This could mean: 1. Insufficient sampling — consider longer simulations 2. True variability in binding — the polymer explores multiple binding modes 3. Rare events dominating one replicate — check per-replicate values ## See Also - [Contacts Analysis Quick Start](analysis_contacts_quickstart.md) — basic contacts analysis - [Contacts Analysis Cookbook](analysis_contacts_cookbook.md) — worked examples - [Surface Exposure Calculation](analysis_contacts_cookbook.md#recipe-4-active-site-contacts-vs-surface-contacts) — SASA filtering - [Statistics Best Practices](../explanation/analysis_statistics_best_practices.md) — uncertainty quantification - [Comparing Conditions](analysis_compare_conditions.md) — multi-condition workflows