Binding Preference Analysis
Understand which amino acid classes your polymer preferentially binds using enrichment analysis. This module answers the fundamental question: “Does my polymer type show preferential affinity for certain types of protein residues?”
Note
Binding preference analysis builds on contacts analysis. Run contacts analysis first, then enable binding preference in your comparison configuration.
Warning
Binding preference is currently labeled experimental in the presentation
release. The workflow remains available from polyzymd compare contacts, and
its plots are still generated, but CLI output and figures are explicitly marked
experimental because definitions and interpretation may still change.
Scientific Motivation
When designing enzyme-polymer conjugates, understanding polymer binding preferences is crucial for several reasons:
Active site protection: Does the polymer avoid or shield the catalytic residues?
Electrostatic complementarity: Do zwitterionic polymers preferentially bind charged surface residues?
Hydrophobic interactions: Do PEG-like polymers show affinity for nonpolar patches?
Rational design: Which polymer chemistry best matches your enzyme’s surface properties?
Raw contact counts are misleading because protein surfaces have unequal distributions of amino acid types. A polymer might contact 50% of aromatic residues and 50% of polar residues, but if the surface has 5 aromatic residues and 50 polar residues, the polymer clearly prefers aromatic residues.
Binding preference analysis normalizes by surface availability to reveal true preferences.
The Enrichment Formula
For each (polymer type, protein group) pair, we compute a zero-centered enrichment value:
where:
and:
This surface-availability normalization asks: “Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally?”
Interpretation (Zero-Centered Scale)
Enrichment |
Meaning |
|---|---|
> 0 |
Preferential binding — polymer contacts this group more than expected |
= 0 |
Neutral — contact frequency matches random chance |
< 0 |
Avoidance — polymer contacts this group less than expected |
= -1 |
Complete avoidance — no contacts observed |
Examples:
+0.45means “45% more contacts than expected”-0.30means “30% fewer contacts than expected”+1.00means “2× as many contacts as expected” (100% more)
Why Surface-Availability Normalization?
The correct baseline for “expected contacts” is protein surface availability, not polymer composition. Here’s why:
The question we’re asking: “Does this polymer type preferentially bind aromatic residues?”
The correct comparison: If aromatic residues comprise 10% of the protein’s surface, and this polymer contacts aromatics 15% of the time, then enrichment = (0.15 / 0.10) - 1 = +0.50 (50% preference).
What NOT to compare: The fraction of the polymer that is SBMA vs EGMA is irrelevant to this question — it tells us about polymer composition, not about which protein groups the polymer prefers.
Note
Polymer composition (residue counts, heavy atom counts) is stored as metadata in the results for secondary analysis, but is not used in the enrichment formula.
Why Contact Frames, Not Binary Counts?
The formula uses contact frames (duration-weighted) rather than binary “contacted yes/no” counts. This ensures that:
A residue contacted for 60% of the simulation contributes 60× more than one contacted for 1 frame
Transient, non-specific contacts don’t dominate the signal
Stable, functionally relevant interactions are emphasized
Surface Exposure Filtering
Only surface-exposed residues are considered in the enrichment calculation. Buried residues are excluded because they’re physically inaccessible to polymer.
Surface exposure is determined using Solvent Accessible Surface Area (SASA):
Residues with relative SASA > threshold (default 20%) are considered exposed
This is computed from a static enzyme PDB structure
The threshold can be adjusted in configuration
Quick Start
Enable Binding Preference in comparison.yaml
Add compute_binding_preference: true to your contacts analysis settings:
# comparison.yaml
name: "Polymer Binding Preference Study"
control: "No Polymer"
structures:
enzyme_pdb: "structures/enzyme.pdb" # For SASA calculation
conditions:
- label: "No Polymer"
config: "../no_polymer/config.yaml"
replicates: [1, 2, 3]
- label: "100% SBMA"
config: "../sbma_100/config.yaml"
replicates: [1, 2, 3]
- label: "100% EGMA"
config: "../egma_100/config.yaml"
replicates: [1, 2, 3]
analysis_settings:
contacts:
name: "polymer_contacts"
polymer_selection: "resname SBM EGM"
protein_selection: "protein"
cutoff: 4.5
# Binding preference settings
compute_binding_preference: true
surface_exposure_threshold: 0.2 # 20% relative SASA
enzyme_pdb_for_sasa: "structures/enzyme.pdb"
include_default_aa_groups: true
Then run the comparison:
polyzymd compare contacts -f comparison.yaml
The CLI automatically uses binding preference settings from comparison.yaml.
You cannot enable binding preference via CLI flags alone—it must be configured
in YAML.
# Run comparison (binding preference enabled in YAML)
polyzymd compare contacts -f comparison.yaml
# View results in markdown format
polyzymd compare contacts -f comparison.yaml --format markdown
from pathlib import Path
from polyzymd.analysis.contacts import ContactResult
from polyzymd.analysis.contacts.binding_preference import compute_binding_preference
from polyzymd.analysis.contacts.surface_exposure import SurfaceExposureFilter
# Load contact results
contacts = ContactResult.load("analysis/contacts/contacts_rep1.json")
# Compute surface exposure
exposure_filter = SurfaceExposureFilter(threshold=0.2)
surface_exposure = exposure_filter.calculate("structures/enzyme.pdb")
# Define protein groups (resolved to residue IDs)
# You can use the default AA class groups or define custom ones
from polyzymd.analysis.common.aa_classification import AA_CLASS_RESIDUES
protein_groups = {
"aromatic": {12, 45, 67, 89, 102}, # resids of aromatic residues
"charged_positive": {23, 34, 56, 78}, # LYS, ARG resids
"charged_negative": {15, 28, 91}, # ASP, GLU resids
"nonpolar": {5, 10, 20, 30, 40, 50}, # hydrophobic resids
"polar": {8, 18, 25, 35, 42, 55}, # SER, THR, etc.
}
# Extract polymer composition from trajectory (stored as metadata)
from polyzymd.analysis.contacts.binding_preference import extract_polymer_composition
polymer_composition = extract_polymer_composition(universe) # universe from trajectory
# Compute binding preference
result = compute_binding_preference(
contact_result=contacts,
surface_exposure=surface_exposure,
protein_groups=protein_groups,
polymer_composition=polymer_composition,
)
# Access enrichment values (normalized by protein surface availability)
print("Enrichment matrix:")
for polymer_type, groups in result.enrichment_matrix().items():
print(f"\n{polymer_type}:")
for group, enrichment in groups.items():
marker = "+" if enrichment > 0 else "-" if enrichment < 0 else "="
print(f" {group}: {enrichment:+.2f} {marker}")
# Get detailed entry for a specific pair
entry = result.get_entry("SBM", "aromatic")
print(f"\nSBM → aromatic:")
print(f" Contact share: {entry.contact_share:.3f}")
print(f" Expected share (surface): {entry.expected_share:.3f}")
print(f" Enrichment: {entry.enrichment:+.2f}")
Example Output
When you run polyzymd compare contacts with binding preference enabled, you’ll
see output like this after an experimental warning banner:
Binding Preference - Enrichment by Amino Acid Class
-----------------------------------------------------------------------------------------------
Surface exposure threshold: 20% relative SASA
Enrichment normalized by: protein surface availability
Polymer: EGM
Protein Group 0% SBMA / 100% 25% SBMA / 75% 50% SBMA / 50%
----------------------------------------------------------------------------------
aromatic +0.90±0.06 + +0.65±0.16 + +0.50±0.06 +
charged_negative -0.55±0.04 - -0.44±0.10 - -0.39±0.08 -
charged_positive -0.21±0.10 - -0.11±0.07 - -0.06±0.04 -
nonpolar +0.31±0.04 + +0.14±0.01 + +0.13±0.06 +
polar -0.29±0.02 - -0.13±0.05 - -0.12±0.03 -
Polymer: SBM
Protein Group 100% SBMA / 0% 25% SBMA / 75% 50% SBMA / 50%
----------------------------------------------------------------------------------
aromatic +0.29±0.03 + +0.83±0.09 + +0.54±0.07 +
charged_negative -0.10±0.05 - -0.24±0.18 - -0.29±0.11 -
charged_positive +0.02±0.03 + +0.22±0.10 + +0.10±0.02 +
nonpolar -0.05±0.03 - -0.09±0.12 - -0.12±0.05 -
polar +0.00±0.00 = -0.17±0.04 - +0.01±0.06 +
+ = enriched (>0), - = depleted (<0)
Interpreting These Results
From the example above, we can draw several conclusions:
EGMA (EGM) shows:
Strong preference for aromatic residues (+0.90 = 90% more contacts than expected)
Strong preference for nonpolar residues (+0.31 = 31% more contacts)
Avoidance of charged residues (-0.55 for negative, -0.21 for positive)
This is consistent with EGMA’s hydrophobic PEG-like character
SBMA (SBM) shows:
Moderate preference for aromatic residues (+0.29)
Slight preference for charged positive residues (+0.02)
Near-neutral behavior for most groups
This reflects SBMA’s zwitterionic nature (balanced charges)
Composition effects: As SBMA ratio increases (25% → 50% → 100%):
EGMA’s aromatic preference decreases (+0.90 → +0.65 → +0.50)
This suggests competition for aromatic binding sites
Configuration Options
ContactsAnalysisSettings
Field |
Type |
Default |
Description |
|---|---|---|---|
|
bool |
|
Enable binding preference analysis |
|
float |
|
Relative SASA threshold (0.0-1.0) |
|
str |
|
Path to enzyme PDB for SASA calculation |
|
bool |
|
Use standard AA class groupings |
|
dict |
|
Custom protein groups |
|
dict |
|
Custom polymer type definitions (see below) |
Polymer Type Selections
By default, polymer types are auto-detected from unique residue names in the polymer selection. You can explicitly define polymer types with custom selections:
analysis_settings:
contacts:
polymer_selection: "chainID C"
compute_binding_preference: true
# Explicit polymer type definitions
polymer_type_selections:
SBMA: "chainID C and resname SBM"
EGMA: "chainID C and resname EGM"
OEGMA: "chainID C and resname OEG"
This is useful when:
Residue names don’t match your preferred labels
You want to group multiple residue types together
You need fine-grained control over what counts as each polymer type
Custom Protein Groups
You can define custom protein groups to analyze specific regions of interest:
analysis_settings:
contacts:
compute_binding_preference: true
surface_exposure_threshold: 0.2
enzyme_pdb_for_sasa: "structures/enzyme.pdb"
# Include default AA class groups (aromatic, polar, etc.)
include_default_aa_groups: true
# Add custom groups (can override defaults if names conflict)
protein_groups:
active_site: [77, 133, 156] # Catalytic triad
lid_domain: [140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]
binding_pocket: [50, 51, 52, 80, 81, 82, 83]
Tip
If a custom group name matches a default AA class name (e.g., aromatic),
your custom definition takes precedence.
Surface Exposure Threshold
The surface_exposure_threshold controls which residues are considered
“surface accessible”:
Threshold |
Effect |
|---|---|
|
More permissive — includes partially buried residues |
|
Default — standard definition of “exposed” |
|
More stringent — only highly exposed residues |
|
Very stringent — only fully exposed loops/turns |
Note
Lower thresholds include more residues in the analysis, which may dilute the signal. Higher thresholds focus on the most accessible residues but may have smaller sample sizes.
Default Amino Acid Classifications
When include_default_aa_groups: true, the following standard groupings are used:
Group |
Amino Acids |
Chemical Property |
|---|---|---|
|
TRP, PHE, TYR |
π-stacking, hydrophobic |
|
LYS, ARG |
Cationic at pH 7 |
|
ASP, GLU |
Anionic at pH 7 |
|
ALA, VAL, LEU, ILE, MET, PRO, GLY |
Hydrophobic/aliphatic |
|
SER, THR, ASN, GLN, HIS, CYS |
H-bond donors/acceptors |
Note
Histidine (HIS) is classified as polar by default because its protonation state is pH-dependent. At low pH, it would be positively charged.
Output Files
Binding preference results are saved alongside contact analysis results:
project/
└── condition_name/
└── analysis/
└── contacts/
├── contacts_rep1.json
├── contacts_rep2.json
├── contacts_rep3.json
├── binding_preference_rep1.json # Per-replicate
├── binding_preference_rep2.json
├── binding_preference_rep3.json
└── binding_preference_aggregated_reps1-3.json # Aggregated
JSON Structure (Schema v5)
Schema version 5 introduced partition-based binding preference, where each
polymer type has its own partition-based enrichment results. This ensures that
contact_share sums to exactly 1.0 within each partition (no overlapping group
bugs).
{
"entries": [/* Legacy entries for backwards compatibility */],
"polymer_composition": {
"residue_counts": {"SBM": 50, "EGM": 50},
"heavy_atom_counts": {"SBM": 750, "EGM": 400}
},
"binding_preference": {
"aa_class_binding": {
"SBM": {
"partition_name": "aa_class",
"partition_type": "aa_class",
"polymer_type": "SBM",
"entries": [
{
"partition_element": "aromatic",
"polymer_type": "SBM",
"total_contact_frames": 12543,
"contact_share": 0.164,
"expected_share": 0.086,
"enrichment": 0.907,
"n_exposed_in_element": 7,
"n_residues_in_element": 20,
"n_residues_contacted": 6
},
{"partition_element": "polar", /* ... */},
{"partition_element": "nonpolar", /* ... */},
{"partition_element": "charged_positive", /* ... */},
{"partition_element": "charged_negative", /* ... */}
],
"total_contact_share": 1.0,
"total_expected_share": 1.0
},
"EGM": {/* same structure for EGM */}
},
"user_defined_partitions": {
"lid_helices": {
"SBM": {/* PartitionBindingResult */},
"EGM": {/* PartitionBindingResult */}
}
},
"n_frames": 10000,
"total_exposed_residues": 81,
"surface_exposure_threshold": 0.2,
"polymer_types": ["SBM", "EGM"],
"schema_version": 5
},
"system_coverage": {/* SystemCoverageResult */},
"n_frames": 10000,
"total_exposed_residues": 81,
"surface_exposure_threshold": 0.2,
"schema_version": 5
}
Important
Schema v5 guarantees: Within each partition (e.g., aa_class), the
contact_share values sum to exactly 1.0 for each polymer type. This
eliminates the overlapping-groups bug present in earlier versions where
residues could be counted in multiple groups.
Note
The legacy entries field is retained for backwards compatibility with
visualization tools. For programmatic access, use the binding_preference
field with its partition-based structure.
Statistical Treatment
Aggregation Across Replicates
When multiple replicates are available, binding preference is computed per-replicate and then aggregated:
Compute enrichment for each replicate independently
Report mean enrichment ± SEM across replicates
Store per-replicate values for downstream analysis
This approach:
Preserves replicate independence for statistical testing
Provides uncertainty estimates (SEM)
Enables detection of outlier replicates
Uncertainty in Enrichment
The uncertainty in enrichment comes from:
Variation in contact patterns across replicates
Different initial configurations
Thermal fluctuations in the simulation
The SEM (Standard Error of the Mean) quantifies confidence in the mean enrichment across replicates.
Worked Example: Comparing Polymer Preferences
Scientific Question
“Do zwitterionic (SBMA) and PEG-like (EGMA) polymers show different binding preferences? Which polymer type is better for protecting the enzyme’s active site?”
Analysis Workflow
# comparison.yaml
name: "SBMA vs EGMA Binding Preferences"
control: "No Polymer"
structures:
enzyme_pdb: "structures/lipase.pdb"
conditions:
- label: "No Polymer"
config: "../no_polymer/config.yaml"
replicates: [1, 2, 3]
- label: "100% SBMA"
config: "../sbma_100/config.yaml"
replicates: [1, 2, 3]
- label: "100% EGMA"
config: "../egma_100/config.yaml"
replicates: [1, 2, 3]
- label: "50% SBMA / 50% EGMA"
config: "../copolymer_50_50/config.yaml"
replicates: [1, 2, 3]
analysis_settings:
contacts:
polymer_selection: "resname SBM EGM"
cutoff: 4.5
compute_binding_preference: true
surface_exposure_threshold: 0.2
enzyme_pdb_for_sasa: "structures/lipase.pdb"
include_default_aa_groups: true
# Custom groups for mechanistic insight
protein_groups:
catalytic_triad: [77, 133, 156]
oxyanion_hole: [10, 77]
lid_helix: [140, 141, 142, 143, 144, 145, 146, 147, 148, 149]
Run Analysis
polyzymd compare contacts -f comparison.yaml
Interpretation Framework
Observation |
Interpretation |
Design Implication |
|---|---|---|
EGMA aromatic enrichment > +1.0 |
Strong π-stacking with surface Trp/Phe/Tyr |
May bind near aromatic-rich active sites |
SBMA charged_positive ≈ 0 |
Neutral towards Lys/Arg despite zwitterionic nature |
Balanced electrostatics |
SBMA catalytic_triad < -0.5 |
Avoids active site |
Good for maintaining catalytic activity |
EGMA lid_helix > +0.5 |
Preferentially binds lid domain |
May affect lid dynamics |
API Reference
compute_binding_preference
from polyzymd.analysis.contacts.binding_preference import (
compute_binding_preference,
extract_polymer_composition,
)
# First, extract polymer composition from trajectory
polymer_composition = extract_polymer_composition(
universe, # MDAnalysis Universe with trajectory
polymer_type_selections=None, # Optional: custom polymer type definitions
)
# Then compute binding preference
result = compute_binding_preference(
contact_result, # ContactResult from trajectory analysis
surface_exposure, # SurfaceExposureResult from SASA calculation
protein_groups, # dict[str, set[int]] of group name -> resids
polymer_composition, # PolymerComposition from extract_polymer_composition()
polymer_types=None, # Optional filter for polymer types
)
BindingPreferenceResult Methods
Method |
Returns |
Description |
|---|---|---|
|
|
Enrichment values (zero-centered) |
|
|
Mean contact fractions |
|
|
Contact shares |
|
|
Single enrichment value |
|
|
Full entry with all metrics |
|
|
Convert to pandas for analysis |
|
|
Save to JSON |
|
|
Load from JSON |
BindingPreferenceEntry Fields
Field |
Type |
Description |
|---|---|---|
|
str |
Polymer residue type (e.g., “SBM”) |
|
str |
Protein group label (e.g., “aromatic”) |
|
int |
Sum of contact frames for group |
|
float |
Average per-residue contact fraction |
|
int |
Total residues in group (all) |
|
int |
Surface-exposed residues in group |
|
int |
Exposed residues with any contact |
|
float |
Fraction of polymer’s contacts to group |
|
float |
Expected share (surface availability) |
|
float |
Zero-centered enrichment |
|
int |
Residues of this polymer type (metadata) |
|
int |
Total polymer residues (metadata) |
|
int |
Heavy atoms of this polymer type (metadata) |
|
int |
Total polymer heavy atoms (metadata) |
PolymerComposition
from polyzymd.analysis.contacts.binding_preference import (
PolymerComposition,
extract_polymer_composition,
)
# Auto-extract from trajectory
composition = extract_polymer_composition(universe)
# Or create manually
composition = PolymerComposition(
residue_counts={"SBM": 50, "EGM": 50},
heavy_atom_counts={"SBM": 750, "EGM": 400},
)
# Access properties
print(composition.total_residues) # 100
print(composition.residue_fraction("SBM")) # 0.5
print(composition.heavy_atom_fraction("SBM")) # 0.652 (larger monomer)
System Coverage Analysis
While binding preference answers “What does each polymer type prefer?”, system coverage answers a complementary question: “What does this polymer mixture collectively cover?”
When to Use System Coverage
System coverage is useful when comparing copolymer compositions (conditions) rather than individual polymer types:
“Does a 70:30 SBMA:EGMA mixture cover aromatic residues differently than a 30:70 mixture?”
“Which copolymer ratio provides better coverage of the active site?”
“How does aggregate polymer coverage change across my experimental conditions?”
The Coverage Formula
For each protein group, system coverage computes:
where Expected Share is based on protein surface availability (same as binding preference).
How It Differs from Binding Preference
Aspect |
Binding Preference |
System Coverage |
|---|---|---|
Question |
What does SBMA prefer? |
What does this mixture cover? |
Scope |
Per polymer type |
Collapsed across all polymers |
Use case |
Compare polymer chemistries |
Compare copolymer ratios |
Comparable across |
Conditions (abundance cancels) |
Conditions (aggregate behavior) |
Important
Both metrics are automatically computed when you enable binding preference.
System coverage is stored in result.system_coverage.
Partition-Based Architecture
Both binding preference (schema v5) and system coverage (schema v2) use a partition-based architecture that ensures mathematically correct enrichment calculations. Each partition is a collection of mutually exclusive protein groups that together cover the protein surface.
Important
Why Partitions Must Be Mutually Exclusive
The enrichment formula requires that both contact_share and expected_share
sum to exactly 1.0 across all groups in a partition:
If groups overlap (share residues), the expected_share sum exceeds 1.0,
causing systematically negative enrichments—a subtle but serious bug that was
fixed in schema v5.
Built-in Partitions
System coverage automatically computes several partition types:
Partition |
Groups |
Description |
|---|---|---|
AA Class |
aromatic, polar, nonpolar, charged_positive, charged_negative |
5-way classification by amino acid chemistry |
Custom Binary |
{custom_group, rest_of_protein} |
One partition per custom group in |
Combined Custom |
{all custom groups, rest_of_protein} |
Only if custom groups don’t overlap |
User-Defined |
User-specified groups from |
Configurable via YAML |
User-Defined Partitions
For specialized analyses, you can define custom partitions that organize protein regions into mutually exclusive groups:
# comparison.yaml
analysis_settings:
contacts:
# First, define individual protein groups (can overlap)
protein_groups:
lid_helix_5: "resid 141:155"
lid_helix_10: "resid 281:295"
catalytic_triad: "resid 131 163 194"
active_site_loop: "resid 75:85"
# Then, define partitions (must be mutually exclusive within each partition)
protein_partitions:
lid_helices:
- lid_helix_5
- lid_helix_10
catalytic_regions:
- catalytic_triad
- active_site_loop
from polyzymd.compare.settings import ContactsAnalysisSettings
settings = ContactsAnalysisSettings(
protein_groups={
"lid_helix_5": "resid 141:155",
"lid_helix_10": "resid 281:295",
"catalytic_triad": "resid 131 163 194",
"active_site_loop": "resid 75:85",
},
protein_partitions={
"lid_helices": ["lid_helix_5", "lid_helix_10"],
"catalytic_regions": ["catalytic_triad", "active_site_loop"],
},
)
Note
Automatic rest_of_protein: If your partition groups don’t cover all
exposed protein residues, a rest_of_protein element is automatically added
to ensure the partition sums to 1.0.
Validation at Config Load Time
PolyzyMD validates partition definitions when the config is loaded, failing fast with a clear error if groups overlap:
# This will raise ValidationError at load time:
protein_partitions:
bad_partition:
- lid_helix_5 # resid 141:155
- lid_domain_combined # resid 130:160 — overlaps with lid_helix_5!
Error message:
ValidationError: Partition 'bad_partition' has overlapping groups:
lid_helix_5 and lid_domain_combined share residues {141, 142, ..., 155}.
Partitions must contain mutually exclusive groups.
Accessing System Coverage
from polyzymd.analysis.contacts import BindingPreferenceResult
# Load binding preference result (includes system coverage)
result = BindingPreferenceResult.load("binding_preference_rep1.json")
# Access system coverage
if result.system_coverage:
coverage = result.system_coverage
# Get AA class coverage enrichment
print("AA Class Coverage Enrichment:")
for entry in coverage.aa_class_coverage.entries:
print(f" {entry.partition_element}: {entry.coverage_enrichment:+.2f}")
# Get custom group enrichment (vs rest_of_protein)
print("\nCustom Group Coverage:")
for group_name in coverage.custom_group_names():
enrichment = coverage.get_custom_group_enrichment(group_name)
print(f" {group_name}: {enrichment:+.2f}")
# Access user-defined partitions
print("\nUser-Defined Partitions:")
for partition_name in coverage.user_partition_names():
partition = coverage.get_user_partition(partition_name)
print(f" {partition_name}:")
for entry in partition.entries:
print(f" {entry.partition_element}: {entry.coverage_enrichment:+.2f}")
The full result uses schema v5 with both binding_preference (per-polymer)
and system_coverage (aggregate) fields:
// Note: ... indicates additional fields omitted for brevity
{
"entries": [/* Legacy binding preference entries */],
"polymer_composition": {/* polymer counts */},
"binding_preference": {
"aa_class_binding": {
"SBM": {
"partition_name": "aa_class",
"polymer_type": "SBM",
"entries": [
{
"partition_element": "aromatic",
"polymer_type": "SBM",
"total_contact_frames": 12543,
"contact_share": 0.164,
"expected_share": 0.086,
"enrichment": 0.907,
"n_exposed_in_element": 7,
"n_residues_in_element": 20,
"n_residues_contacted": 6
},
/* ... other AA classes ... */
],
"total_contact_share": 1.0,
"total_expected_share": 1.0
},
"EGM": {/* same structure */}
},
"user_defined_partitions": {/* optional user partitions per polymer */},
"polymer_types": ["SBM", "EGM"],
"schema_version": 5
},
"system_coverage": {
"aa_class_coverage": {
"partition_name": "aa_class",
"entries": [
{
"partition_element": "aromatic",
"total_contact_frames": 25432,
"coverage_share": 0.164,
"expected_share": 0.086,
"coverage_enrichment": 0.907,
"n_exposed_in_group": 7,
"n_residues_in_group": 20,
"polymer_contributions": {"SBM": 0.45, "EGM": 0.55}
},
{"partition_element": "polar", /* ... */},
{"partition_element": "nonpolar", /* ... */},
{"partition_element": "charged_positive", /* ... */},
{"partition_element": "charged_negative", /* ... */}
]
},
"custom_group_coverages": {
"lid_helix_5": {
"partition_name": "lid_helix_5_vs_rest",
"entries": [
{"partition_element": "lid_helix_5", /* ... */},
{"partition_element": "rest_of_protein", /* ... */}
]
}
},
"user_defined_partitions": {
"lid_helices": {
"partition_name": "lid_helices",
"entries": [
{"partition_element": "lid_helix_5", /* ... */},
{"partition_element": "lid_helix_10", /* ... */},
{"partition_element": "rest_of_protein", /* ... */}
]
}
},
"n_frames": 10000,
"total_contact_frames": 154892,
"total_exposed_residues": 81,
"polymer_types_included": ["SBM", "EGM"],
"schema_version": 2
},
"schema_version": 5
}
Note
The outer schema_version: 5 indicates the BindingPreferenceResult format.
The inner system_coverage.schema_version: 2 indicates the SystemCoverageResult
format (unchanged from previous versions).
Interpreting Polymer Contributions
The polymer_contributions field shows how much each polymer type contributed
to the coverage of a specific protein group. For example:
# For aromatic coverage:
polymer_contributions = {"SBM": 0.45, "EGM": 0.55}
# 45% of contacts to aromatics came from SBMA
# 55% of contacts to aromatics came from EGMA
This helps understand:
Which polymer is “doing the work” for each protein group
Whether coverage is balanced across polymer types
How polymer ratio affects coverage patterns
Aggregation Across Replicates
System coverage is automatically aggregated when you aggregate binding preference:
from polyzymd.analysis.contacts import (
aggregate_binding_preference,
AggregatedBindingPreferenceResult,
)
# Aggregate multiple replicates
aggregated = aggregate_binding_preference([result1, result2, result3])
# Access aggregated system coverage
if aggregated.system_coverage:
# AA class coverage with statistics
for entry in aggregated.system_coverage.aa_class_coverage.entries:
print(f"{entry.partition_element}:")
print(f" Mean enrichment: {entry.mean_coverage_enrichment:+.2f}")
print(f" SEM: ±{entry.sem_coverage_enrichment:.2f}")
print(f" Replicates: {entry.n_replicates}")
# User-defined partitions with statistics
for partition_name in aggregated.system_coverage.user_partition_names():
partition = aggregated.system_coverage.get_user_partition(partition_name)
print(f"\n{partition_name}:")
for entry in partition.entries:
print(f" {entry.partition_element}: {entry.mean_coverage_enrichment:+.2f} ± {entry.sem_coverage_enrichment:.2f}")
SystemCoverageResult Methods
Method |
Returns |
Description |
|---|---|---|
|
|
{aa_class: enrichment} mapping |
|
|
Enrichment for specific AA class |
|
|
Enrichment for custom group |
|
|
List of custom group names |
|
|
List of user-defined partition names |
|
|
Get user-defined partition |
|
JSON serialization |
PartitionCoverageEntry Fields
Field |
Type |
Description |
|---|---|---|
|
str |
Group name within the partition |
|
int |
Sum of ALL polymer contacts to group |
|
float |
Fraction of all contacts to this group |
|
float |
Expected share (surface availability) |
|
float |
Zero-centered enrichment |
|
int |
Surface-exposed residues in group |
|
int |
Total residues in group |
|
dict[str, float] |
Fraction from each polymer type |
Why Overlapping Groups Cause Bugs
Understanding why partitions must be mutually exclusive helps avoid subtle errors:
Worked Example: The Overlap Bug
Setup: Consider two groups that overlap:
lid_helix_5: residues {141, 142, 143, 144, 145}lid_domain: residues {140, 141, 142, 143, 144, 145, 146, 147}
Expected shares (if treated as a “partition”):
lid_helix_5: 5 exposed residues → expected_share = 5/100 = 0.05lid_domain: 8 exposed residues → expected_share = 8/100 = 0.08Sum: 0.13 (should be 1.0!)
The 5 overlapping residues are double-counted in expected_share.
Coverage shares (actual contacts):
lid_helix_5: 1000 contact-frames → coverage_share = 1000/20000 = 0.05lid_domain: 1600 contact-frames → coverage_share = 1600/20000 = 0.08Sum: 0.13 (contacts aren’t double-counted if we sum over groups)
Wait—but contacts are double-counted! Those 5 overlapping residues contribute to both groups. So coverage_share also sums to >1.0? No—the denominator (total contacts) is shared, making coverage_shares additive only if groups don’t overlap.
Result: enrichment = (0.05 / 0.05) - 1 = 0, but the “expected” baseline is wrong because the partition doesn’t sum to 1.0. Comparisons between groups become meaningless.
Solution: Use mutually exclusive partitions. PolyzyMD validates this at config load time.
Visualization
Both binding preference and system coverage generate publication-ready plots
when you run polyzymd compare contacts. These plots are automatically
enabled by default but can be controlled via plot_settings.
Available Plots
Plot Type |
Description |
Settings Field |
|---|---|---|
Binding Preference Heatmap |
Enrichment by (polymer type × protein group) across conditions |
|
Binding Preference Bars |
Grouped bars per polymer type, comparing conditions |
|
System Coverage Heatmap |
Aggregate coverage by protein group across conditions |
|
System Coverage Bars |
Grouped bars showing aggregate coverage across conditions |
|
User Partition Bars |
One grouped bar chart per user-defined partition |
|
Configuring Plot Settings
# comparison.yaml
plot_settings:
output_dir: "figures/"
format: "png"
dpi: 300
contacts:
# Binding preference plots
generate_enrichment_heatmap: true
generate_enrichment_bars: true
figsize_enrichment_heatmap: [12, 8] # Auto-calculated if null
figsize_enrichment_bars: [10, 6]
enrichment_colormap: "RdBu_r" # Diverging colormap
show_enrichment_error: true
# System coverage plots (AA class partition)
generate_system_coverage_heatmap: true
generate_system_coverage_bars: true
figsize_system_coverage_heatmap: [10, 6] # Auto-calculated if null
figsize_system_coverage_bars: [10, 6]
show_system_coverage_error: true
# User-defined partition plots
generate_user_partition_bars: true
figsize_user_partition_bars: [10, 6]
show_user_partition_error: true
from polyzymd.compare.config import PlotSettings, ContactsPlotSettings
plot_settings = PlotSettings(
output_dir="figures/",
format="png",
dpi=300,
contacts=ContactsPlotSettings(
generate_enrichment_heatmap=True,
generate_system_coverage_heatmap=True,
enrichment_colormap="RdBu_r",
show_enrichment_error=True,
show_system_coverage_error=True,
# User partition plots
generate_user_partition_bars=True,
show_user_partition_error=True,
),
)
Understanding the Heatmaps
Binding Preference Heatmap:
One subplot per condition
Rows: Protein groups (aromatic, polar, etc.)
Columns: Polymer types (SBMA, EGMA, etc.)
Color: Zero-centered diverging colormap
Red (positive): Preferential binding
White (zero): Neutral
Blue (negative): Avoidance
System Coverage Heatmap:
Single heatmap comparing conditions
Rows: Protein groups
Columns: Conditions (100% SBMA, 50/50 copolymer, etc.)
Shows aggregate coverage across all polymer types
Understanding the Bar Charts
Binding Preference Bars:
One plot per polymer type
Groups: Protein groups
Bars within group: One per condition
Error bars: SEM across replicates
System Coverage Bars:
Single plot comparing all conditions
Groups: AA class groups (aromatic, polar, etc.)
Bars within group: One per condition
Shows how different copolymer compositions collectively cover each protein group
User Partition Bars:
One plot per user-defined partition
Groups: Partition elements (e.g., lid_helix_5, lid_helix_10, rest_of_protein)
Bars within group: One per condition
Useful for comparing coverage of specific structural regions across conditions
Disabling Specific Plots
To disable specific plot types, set the corresponding field to false:
plot_settings:
contacts:
# Only generate bar charts, not heatmaps
generate_enrichment_heatmap: false
generate_system_coverage_heatmap: false
generate_enrichment_bars: true
generate_system_coverage_bars: true
generate_user_partition_bars: true # Enable user partition plots
Troubleshooting
“No exposed residues in group”
This warning appears when a protein group has no surface-exposed residues (all are buried). Solutions:
Lower
surface_exposure_thresholdto include more residuesVerify the residue IDs in your custom
protein_groupsare correctCheck that your enzyme PDB has correct atom coordinates
“Enzyme PDB not found”
The SASA calculation requires a PDB file. Ensure:
enzyme_pdb_for_sasapath is relative tocomparison.yamllocationThe file exists and has correct format
Alternatively, place a symlink in the
structures/directory
Enrichment values of -1.0
An enrichment of exactly -1.0 means complete avoidance (no contacts observed for this polymer-group pair). This is expected for:
Buried residue groups with no surface exposure
Polymer types that physically cannot reach certain regions
Check the n_exposed_in_group and n_residues_contacted fields to diagnose.
High SEM values
Large uncertainty (SEM) indicates high variability across replicates. This could mean:
Insufficient sampling — consider longer simulations
True variability in binding — the polymer explores multiple binding modes
Rare events dominating one replicate — check per-replicate values
See Also
Contacts Analysis Quick Start — basic contacts analysis
Contacts Analysis Cookbook — worked examples
Surface Exposure Calculation — SASA filtering
Statistics Best Practices — uncertainty quantification
Comparing Conditions — multi-condition workflows