# Polymer-Protein Contacts Analysis: Quick Start Analyze polymer-protein contact frequencies and coverage in your MD simulations. ```{note} This command analyzes contacts between polymer chains (Chain C) and protein residues (Chain A), following PolyzyMD's chain convention. It computes per-residue contact fractions with proper statistical treatment. ``` ## TL;DR ```bash # Single replicate analysis polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns # Custom cutoff distance (default: 4.0 Å) polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns --cutoff 5.0 # Force recompute (ignore cache) polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns --recompute ``` ## Prerequisites Before running contacts analysis, you need: 1. **Completed production simulation** - with polymer chains present 2. **A config.yaml file** - pointing to your simulation output 3. **Trajectory files** - in the scratch directory specified in your config 4. **solvated_system.pdb** - topology with correct chain assignments ## PolyzyMD Chain Convention The contacts analyzer uses PolyzyMD's standard chain assignment: | Chain | Contents | |-------|----------| | A | Protein/Enzyme | | B | Substrate/Ligand | | C | Polymers | | D+ | Solvent (water, ions, co-solvents) | By default, the command analyzes contacts between `segid C` (polymers) and `protein` (chain A). ## Basic Usage ### Single Replicate Analysis `````{tab-set} ````{tab-item} YAML (Recommended) Create an `analysis.yaml` file alongside your `config.yaml`: ```yaml # analysis.yaml replicates: [1] defaults: equilibration_time: "10ns" contacts: enabled: true cutoff: 4.0 ``` Then run all enabled analyses: ```bash polyzymd analyze run ``` ```` ````{tab-item} CLI ```bash cd /path/to/your/project polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns ``` ```` ````{tab-item} Python ```python from pathlib import Path from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer # Load simulation config config = SimulationConfig.from_yaml("config.yaml") # Create analyzer with parameters analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", cutoff=4.0, polymer_selection="segid C", protein_selection="protein", ) # Run analysis result = analyzer.analyze() # Access results print(f"Coverage: {result.coverage:.1%}") print(f"Mean contact fraction: {result.mean_contact_fraction:.1%}") print(f"Contacted residues: {result.n_contacted}/{result.n_residues}") ``` ```` ````` **Output:** ``` Loading configuration from: config.yaml Contact Analysis: MyEnzyme_Polymer_Study Replicates: 1 Equilibration: 10ns Cutoff: 4.0 Å Polymer selection: segid C Protein selection: protein Grouping: residue Processing replicate 1... done (134/181 residues contacted, 74.0% coverage, 18.0% mean contact) Contact Analysis Complete Contacted residues: 134/181 Coverage: 74.0% Mean contact fraction: 18.0% Results saved: /path/to/project/analysis/contacts/contacts_rep1.json ``` ### Key Metrics - **Coverage**: Fraction of protein residues that had at least one contact with polymer during the trajectory - **Mean contact fraction**: Average fraction of frames where each residue was in contact with polymer (averaged across all residues) - **Contacted residues**: Count of residues with any polymer contact ## Using analysis.yaml For reproducible, version-controlled analysis configuration, use `analysis.yaml` instead of CLI flags. Place this file alongside your `config.yaml`. ### Multi-Replicate Analysis with Full Options `````{tab-set} ````{tab-item} YAML (Recommended) Create an `analysis.yaml` file with all configuration options: ```yaml # analysis.yaml replicates: [1, 2, 3] defaults: equilibration_time: "10ns" contacts: enabled: true polymer_selection: "chainID C" # MDAnalysis selection for polymer protein_selection: "protein" # MDAnalysis selection for protein cutoff: 4.5 # Contact distance in Angstroms polymer_types: ["SBM", "EGM"] # Optional: filter by monomer type grouping: "aa_class" # aa_class | secondary_structure | none compute_residence_times: true # Enable residence time statistics ``` Then run all enabled analyses: ```bash # Initialize a template (if starting fresh) polyzymd analyze init # Run all analyses defined in analysis.yaml polyzymd analyze run # Force recompute polyzymd analyze run --recompute ``` ```` ````{tab-item} CLI ```bash # Run contacts analysis across multiple replicates with all options polyzymd analyze contacts -c config.yaml -r 1-3 --eq-time 10ns \ --cutoff 4.5 \ --polymer-selection "chainID C" \ --protein-selection "protein" \ --residence-times # Force recompute polyzymd analyze contacts -c config.yaml -r 1-3 --eq-time 10ns \ --cutoff 4.5 --residence-times --recompute ``` ```` ````{tab-item} Python ```python from pathlib import Path from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer from polyzymd.analysis.contacts.aggregation import aggregate_contact_results config = SimulationConfig.from_yaml("config.yaml") # Analyze multiple replicates results = [] for rep in [1, 2, 3]: analyzer = ContactAnalyzer( config=config, replicate=rep, equilibration_time="10ns", cutoff=4.5, polymer_selection="chainID C", protein_selection="protein", compute_residence_times=True, ) results.append(analyzer.analyze()) # Aggregate results across replicates aggregated = aggregate_contact_results(results) print(f"Contact fraction: {aggregated.mean_contact_fraction:.1%} ± {aggregated.std_contact_fraction:.1%}") ``` ```` ````` ### Configuration Options | Field | Type | Default | Description | |-------|------|---------|-------------| | `enabled` | bool | `false` | Whether to run contact analysis | | `polymer_selection` | str | `"chainID C"` | MDAnalysis selection for polymer atoms | | `protein_selection` | str | `"protein"` | MDAnalysis selection for protein atoms | | `cutoff` | float | `4.5` | Distance cutoff in Angstroms | | `polymer_types` | list | `null` | Filter by polymer residue names | | `grouping` | str | `"aa_class"` | How to group protein residues | | `compute_residence_times` | bool | `true` | Compute residence time statistics | ```{tip} **When to use analysis.yaml vs CLI:** Use `analysis.yaml` for standard, reproducible workflows. Use CLI flags (`polyzymd analyze contacts ...`) for one-off analyses or when exploring different parameters. ``` ## Residence Time Analysis Residence time measures how long polymer segments remain in contact with protein residues. This is crucial for understanding binding kinetics and comparing different polymer types. ### Enable Residence Time Statistics `````{tab-set} ````{tab-item} YAML (Recommended) ```yaml # analysis.yaml replicates: [1] defaults: equilibration_time: "10ns" contacts: enabled: true compute_residence_times: true # Enable residence time statistics ``` ```bash polyzymd analyze run ``` ```` ````{tab-item} CLI ```bash polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns --residence-times ``` ```` ````{tab-item} Python ```python from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer config = SimulationConfig.from_yaml("config.yaml") analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", compute_residence_times=True, # Enable residence time statistics ) result = analyzer.analyze() # Access residence time data for polymer_type, stats in result.residence_times.items(): print(f"{polymer_type}: mean={stats.mean:.2f} frames, max={stats.max} frames ({stats.n_events} events)") ``` ```` ````` **Output:** ``` Contact Analysis Complete Contacted residues: 138/181 Coverage: 76.2% Mean contact fraction: 18.3% Residence time by polymer type: EGM: mean=9.12 frames, max=1406 frames (6066 events) SBM: mean=8.93 frames, max=842 frames (5401 events) ``` ### Multi-Replicate Aggregation When analyzing multiple replicates, residence times are aggregated with proper statistical uncertainty: `````{tab-set} ````{tab-item} YAML (Recommended) ```yaml # analysis.yaml replicates: [1, 2, 3] defaults: equilibration_time: "10ns" contacts: enabled: true compute_residence_times: true ``` ```bash polyzymd analyze run ``` ```` ````{tab-item} CLI ```bash polyzymd analyze contacts -c config.yaml -r 1-3 --eq-time 10ns --residence-times ``` ```` ````{tab-item} Python ```python from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer from polyzymd.analysis.contacts.aggregation import aggregate_contact_results config = SimulationConfig.from_yaml("config.yaml") results = [] for rep in [1, 2, 3]: analyzer = ContactAnalyzer( config=config, replicate=rep, equilibration_time="10ns", compute_residence_times=True, ) results.append(analyzer.analyze()) # Aggregate with proper uncertainty quantification aggregated = aggregate_contact_results(results) print(f"Contact fraction: {aggregated.mean_contact_fraction:.1%} ± {aggregated.std_contact_fraction:.1%}") for polymer_type, stats in aggregated.residence_times.items(): print(f"{polymer_type}: {stats.mean:.2f} ± {stats.std:.2f} frames") ``` ```` ````` **Output:** ``` Aggregated Contact Analysis Complete Contact fraction: 19.8% ± 1.9% Residence time by polymer type: EGM: 8.14 ± 0.56 frames SBM: 9.60 ± 0.53 frames ``` ### Interpreting Residence Times - **Mean residence time**: Average duration of individual contact events (in frames) - **Max residence time**: Longest continuous contact observed - **Total events**: Number of separate contact events for this polymer type Longer residence times suggest stronger or more stable polymer-protein interactions. Comparing residence times across polymer types reveals differences in binding behavior - for example, zwitterionic polymers (SBMA) often show different residence times than PEG-like polymers (EGMA). ### Converting to Physical Time Residence times are reported in frames by default. To convert to picoseconds, multiply by your trajectory's timestep: ```python # If your trajectory saves every 100 ps: residence_time_ps = mean_frames * 100 # ps residence_time_ns = mean_frames * 0.1 # ns ``` The JSON output also includes `mean_ps` and `max_ps` fields computed using the timestep from your configuration. ## Command Options | Option | Default | Description | |--------|---------|-------------| | `-c, --config` | Required | Path to config.yaml | | `-r, --replicates` | `"1"` | Replicate specification: `'1-5'`, `'1,3,5'`, or `'1'` | | `--eq-time` | `"0ns"` | Equilibration time to skip | | `--cutoff` | `4.0` | Contact distance cutoff in Angstroms | | `--polymer-selection` | `"segid C"` | MDAnalysis selection for polymers | | `--protein-selection` | `"protein"` | MDAnalysis selection for protein | | `--residence-times` | False | Compute and display residence time statistics by polymer type | | `--recompute` | False | Force recompute even if cached | | `-o, --output-dir` | Auto | Custom output directory | ## Output Files Results are saved as JSON in the project's analysis directory: ``` project/ └── analysis/ └── contacts/ └── contacts_rep1.json ``` The JSON file contains per-residue contact data including: - Contact events (start frame, duration) - Contact fractions - Amino acid classifications - Polymer chain interactions ## Custom Selections ### Analyze specific polymer types `````{tab-set} ````{tab-item} YAML (Recommended) ```yaml # analysis.yaml - Only SBMA monomers replicates: [1] defaults: equilibration_time: "10ns" contacts: enabled: true polymer_selection: "segid C and resname SBM" # Only SBMA protein_selection: "protein" ``` ```bash polyzymd analyze run ``` To analyze EGMA instead, change the selection: ```yaml contacts: polymer_selection: "segid C and resname EGM" # Only EGMA ``` ```` ````{tab-item} CLI ```bash # Only SBMA monomers polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns \ --polymer-selection "segid C and resname SBM" # Only EGMA monomers polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns \ --polymer-selection "segid C and resname EGM" ``` ```` ````{tab-item} Python ```python from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer config = SimulationConfig.from_yaml("config.yaml") # Analyze only SBMA monomers sbma_analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", polymer_selection="segid C and resname SBM", ) sbma_result = sbma_analyzer.analyze() # Analyze only EGMA monomers egma_analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", polymer_selection="segid C and resname EGM", ) egma_result = egma_analyzer.analyze() print(f"SBMA coverage: {sbma_result.coverage:.1%}") print(f"EGMA coverage: {egma_result.coverage:.1%}") ``` ```` ````` ### Analyze specific protein regions `````{tab-set} ````{tab-item} YAML (Recommended) ```yaml # analysis.yaml - Only aromatic residues replicates: [1] defaults: equilibration_time: "10ns" contacts: enabled: true polymer_selection: "segid C" protein_selection: "protein and (resname TRP PHE TYR)" # Aromatics only ``` ```bash polyzymd analyze run ``` For active site analysis: ```yaml contacts: protein_selection: "protein and (resid 75-80 or resid 130-140)" # Active site ``` ```` ````{tab-item} CLI ```bash # Only aromatic residues polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns \ --protein-selection "protein and (resname TRP PHE TYR)" # Active site region polyzymd analyze contacts -c config.yaml -r 1 --eq-time 10ns \ --protein-selection "protein and (resid 75-80 or resid 130-140)" ``` ```` ````{tab-item} Python ```python from polyzymd.config import SimulationConfig from polyzymd.analysis.contacts import ContactAnalyzer config = SimulationConfig.from_yaml("config.yaml") # Analyze contacts with aromatic residues only aromatic_analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", protein_selection="protein and (resname TRP PHE TYR)", ) aromatic_result = aromatic_analyzer.analyze() # Analyze contacts with active site region active_site_analyzer = ContactAnalyzer( config=config, replicate=1, equilibration_time="10ns", protein_selection="protein and (resid 75-80 or resid 130-140)", ) active_site_result = active_site_analyzer.analyze() print(f"Aromatic contact fraction: {aromatic_result.mean_contact_fraction:.1%}") print(f"Active site contact fraction: {active_site_result.mean_contact_fraction:.1%}") ``` ```` ````` ## Answering Scientific Questions The contacts analysis module is designed to answer questions like: | Question | Approach | |----------|----------| | Do zwitterionic polymers preferentially contact aromatic residues? | `interaction_matrix()` with custom polymer grouping | | Which protein surface regions does my polymer bind? | `coverage_by_group()` | | Does SBMA have longer residence times than EGMA? | `residence_time_summary()` | ### Quick Example: Interaction Matrix The `interaction_matrix()` method computes contact metrics for each combination of polymer type and protein amino acid class: ```python from polyzymd.analysis.contacts.results import ContactResult result = ContactResult.load("analysis/contacts/contacts_rep1.json") # Get contact fraction by (polymer_type, protein_AA_class) matrix = result.interaction_matrix(metric="contact_fraction") # Compare polymer types contacting aromatic residues print(f"SBMA-aromatic: {matrix['SBM']['aromatic']:.1%}") print(f"EGMA-aromatic: {matrix['EGM']['aromatic']:.1%}") ``` **Example output:** ``` SBMA-aromatic: 45.4% EGMA-aromatic: 37.0% ``` ### Quick Example: Coverage by AA Group See what fraction of each amino acid class your polymer contacts: ```python coverage = result.coverage_by_group() for group, frac in sorted(coverage.items(), key=lambda x: -x[1]): print(f"{group}: {frac:.1%}") ``` **Example output:** ``` charged_negative: 100.0% aromatic: 100.0% charged_positive: 100.0% polar: 93.5% nonpolar: 86.2% ``` ```{tip} For complete worked examples including custom polymer groupings, residence time comparisons, and complex queries, see the [Contacts Analysis Cookbook](analysis_contacts_cookbook.md). ``` ## Technical Details ### Contact Definition A contact is defined when any heavy atom of a polymer residue is within the cutoff distance of any heavy atom of a protein residue. This uses MDAnalysis's `capped_distance` function with KDTree-based neighbor searching for O(N) performance. ### Contact Events Contacts are stored as compressed events `(start_frame, duration)` rather than frame-by-frame booleans. This provides efficient storage and enables residence time analysis. ### Statistical Treatment Per-residue contact fractions include proper uncertainty quantification: - Statistical inefficiency (g) is computed per-residue following Chodera et al. (2007) - Effective sample sizes account for temporal autocorrelation - When aggregating across replicates, uncertainties are propagated correctly ```{seealso} For a detailed explanation of autocorrelation, statistical inefficiency, and the LiveCoMS methodology, see the [Statistics Best Practices Guide](analysis_statistics_best_practices.md). ``` ## Troubleshooting ### "solvated_system.pdb not found" The contacts analyzer requires `solvated_system.pdb` in the run directory (scratch). This file is created during `polyzymd build` and contains the correct chain assignments. Do not use `production_N_topology.pdb` as it may have lost chain information. ### "No polymer atoms selected" Check that your polymer selection is correct. Use MDAnalysis syntax: - `segid C` - all atoms in segment C (default polymer chain) - `resname SBM EGM` - atoms with these residue names - Verify chain assignment with: `polyzymd validate -c config.yaml` ### Slow performance For very large systems or long trajectories: - Use `--eq-time` to skip equilibration frames - Results are cached - subsequent runs load from JSON - Consider using frame striding in custom scripts ## See Also - [Binding Preference Analysis](analysis_binding_preference.md) - enrichment-based AA class preferences - [Contacts Analysis Cookbook](analysis_contacts_cookbook.md) - worked examples for scientific questions - [RMSF Analysis Quick Start](analysis_rmsf_quickstart.md) - complementary stability analysis - [Catalytic Triad Analysis](analysis_triad_quickstart.md) - active site geometry - [Comparing Conditions](analysis_compare_conditions.md) - statistical comparison across polymers