# Distance Analysis: Quick Start

Compute inter-atomic distances with proper statistical handling in under 5 minutes.

```{note}
**Want to understand the statistics?** This guide focuses on getting results quickly.
For proper uncertainty quantification (autocorrelation correction, SEM vs. SD), see the
[Statistical Best Practices Guide](analysis_statistics_best_practices.md).
```

## TL;DR

```bash
# Single distance pair, single replicate
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

# Multiple pairs with contact threshold
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1" \
    --threshold 3.5

# With plots
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" --plot
```

## Prerequisites

Before running distance analysis, you need:

1. **Completed production simulation(s)** - at least one replicate
2. **Config file** - the same `config.yaml` used for the simulation
3. **Trajectory files** - in the scratch directory specified in config

Verify your setup:

```bash
# Check that trajectories exist
ls $(polyzymd info -c config.yaml --scratch-dir)/production_*/
```

## What Distance Analysis Provides

The distance analysis module computes:

| Feature | Description |
|---------|-------------|
| **Mean distance** | Average distance over trajectory (equilibrated portion) |
| **SEM** | Autocorrelation-corrected standard error of the mean |
| **Mode (KDE peak)** | Most probable distance from kernel density estimation |
| **Contact fraction** | \% of frames below a distance threshold |
| **Distribution** | Full histogram and KDE for visualization |

```{tip}
**When to use distances vs. contacts vs.
triad:**

- **Distances**: Specific atom pairs with continuous distance values
- **Contacts**: All residue-residue contacts at an interface (binary count)
- **Triad**: Pre-defined catalytic geometry with simultaneous contact analysis
```

## Basic Usage

`````{tab-set}

````{tab-item} YAML (Recommended)

For reproducible analysis, define distance pairs in `analysis.yaml`:

```yaml
# analysis.yaml (alongside config.yaml)
replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

distances:
  enabled: true
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
    - label: "His156-Asp133"
      selection_a: "resid 156 and name ND1"
      selection_b: "midpoint(resid 133 and name OD1 OD2)"
```

Then run:

```bash
# Initialize template (if starting fresh)
polyzymd analyze init

# Run all enabled analyses (uses analysis.yaml)
polyzymd analyze run

# Force recompute
polyzymd analyze run --recompute
```

**Benefits:**

- Version-controlled, reproducible
- Self-documenting experiment setup
- Easy to re-run with different parameters

````

````{tab-item} CLI

### Single Distance Pair

```bash
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"
```

**Expected output:**

```
Loading configuration from: config.yaml
Distance Analysis: MySimulation
Replicates: 1
Equilibration: 10ns
Distance pairs: 1
  1. resid 77 and name OG <-> resid 133 and name NE2

Distance Analysis Complete
resid77_OG-resid133_NE2:
  Mean: 3.42 ± 0.15 Å
  Min: 2.61 Å
  Max: 5.87 Å
```

### Multiple Pairs

Specify `--pair` multiple times:

```bash
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1"
```

### Multiple Replicates

Aggregates results with SEM across replicates:

```bash
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"
```

**Output:**

```
Distance Analysis Complete (Aggregated)
Replicates: 1-3
resid77_OG-resid133_NE2:
  Mean: 3.38 ± 0.08 Å (SEM across 3 replicates)
```

````

````{tab-item} Python

Use `DistanceCalculator` for programmatic analysis:

```python
from polyzymd.config.schema import SimulationConfig
from polyzymd.analysis import DistanceCalculator

# Load configuration
config = SimulationConfig.from_yaml("config.yaml")

# Define distance pairs
pairs = [
    ("resid 77 and name OG", "resid 156 and name NE2"),
    ("resid 156 and name ND1", "midpoint(resid 133 and name OD1 OD2)"),
]

# Create calculator
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Optional: contact analysis
)

# Single replicate
result = calc.compute(replicate=1)
for pr in result.pair_results:
    print(f"{pr.pair_label}: {pr.mean_distance:.2f} ± {pr.sem_distance:.2f} Å")
    if pr.fraction_below_threshold is not None:
        print(f"  Contact fraction: {pr.fraction_below_threshold:.1%}")

# Multiple replicates (aggregated with SEM)
agg_result = calc.compute_aggregated(replicates=[1, 2, 3])
for pr in agg_result.pair_results:
    print(f"{pr.pair_label}: {pr.overall_mean:.2f} ± {pr.overall_sem:.2f} Å")
```

````

`````

## Contact Threshold Analysis

Add `--threshold` to compute the fraction of frames where the distance is below a cutoff (useful for hydrogen bond analysis, active site proximity, etc.):

`````{tab-set}
````{tab-item} YAML (Recommended)

```yaml
# analysis.yaml
distances:
  enabled: true
  threshold: 3.5  # Angstroms (H-bond cutoff)
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
```

```bash
polyzymd analyze run
```

````

````{tab-item} CLI

```bash
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --threshold 3.5
```

**Output:**

```
Distance Analysis Complete
resid77_OG-resid133_NE2:
  Mean: 3.42 ± 0.15 Å
  Min: 2.61 Å
  Max: 5.87 Å
  Contact fraction (<3.5Å): 62.4%
```

````

````{tab-item} Python

```python
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Contact cutoff in Angstroms
)

result = calc.compute(replicate=1)
for pr in result.pair_results:
    if pr.fraction_below_threshold is not None:
        print(f"{pr.pair_label}: {pr.fraction_below_threshold:.1%} below {pr.threshold} Å")
```

````

`````

## Special Selection Syntax

PolyzyMD extends MDAnalysis selections with special position modes and keywords:

:::{warning}
**Chain-Aware Selections Required**

Residue numbers restart at 1 for each chain in PolyzyMD systems. A selection like `resid 141-148` will match residues from **all chains** (protein, polymer, and water). For protein residues, always use `protein and resid X`:

```yaml
# INCORRECT - selects from all chains, causing wrong distances
selection_a: "com(resid 141-148)"

# CORRECT - restricts to protein chain only
selection_a: "com(protein and resid 141-148)"
```

PolyzyMD will emit a runtime warning if your selection spans multiple chains, but it's best to write correct selections from the start.
:::

### Position Modes

| Syntax | Description | Use Case |
|--------|-------------|----------|
| `midpoint(selection)` | Geometric midpoint of selected atoms | Carboxylate groups (Asp, Glu) |
| `com(selection)` | Center of mass of selected atoms | Entire residues, aromatic rings |

### PolyzyMD Keywords

| Keyword | Description | Example |
|---------|-------------|---------|
| `pdbindex N` | Atom by PDB serial number (1-indexed) | `pdbindex 2740 and name CA` |

The `pdbindex` keyword lets you reference atoms by their PDB ATOM serial number (the number displayed in PyMOL as "id"). This is especially useful when copying atom selections from restraint definitions in `config.yaml`.

```{tip}
**Consistency with restraints:** You can use the same `pdbindex` selections in both your restraint configuration (config.yaml) and analysis commands. This makes it easy to verify that restrained distances match observed distances.
```

### Examples

```yaml
# Midpoint of Asp carboxylate oxygens (protein residue)
selection_a: "midpoint(protein and resid 133 and name OD1 OD2)"

# Center of mass of entire ligand (non-protein, no chain restriction needed)
selection_b: "com(resname LIG)"

# Standard single atom (protein residue)
selection_a: "protein and resid 77 and name OG"

# Atom by PDB serial number (unique, no chain restriction needed)
selection_a: "pdbindex 2740"
```

```{tip}
Use `midpoint()` for carboxylate groups (Asp, Glu) where either oxygen can accept a hydrogen bond. This gives a single representative point instead of choosing arbitrarily between OD1/OD2 or OE1/OE2.
```

### CLI Syntax

On the command line, use quotes to protect the special syntax:

```bash
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "protein and resid 156 and name ND1 : midpoint(protein and resid 133 and name OD1 OD2)"
```

## Output Files

Results are saved in your project's analysis directory:

```
/
└── analysis/
    └── distances/
        ├── run_1/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_2/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_3/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        └── aggregated/
            └── distances_reps1-3_eq10ns.json
```

### JSON Result Structure

```python
{
    "pair_results": [
        {
            "pair_label": "resid77_OG-resid133_NE2",
            "selection1": "resid 77 and name OG",
            "selection2": "resid 133 and name NE2",
            "mean_distance": 3.42,
            "std_distance": 0.87,
            "sem_distance": 0.15,  # Autocorrelation-corrected
            "median_distance": 3.31,
            "min_distance": 2.61,
            "max_distance": 5.87,
            "kde_peak": 3.18,  # Mode from KDE
            "threshold": 3.5,
            "fraction_below_threshold": 0.624,
            "correlation_time": 245.3,  # ps
            "n_independent_frames": 34,
            "histogram_edges": [...],
            "histogram_counts": [...],
            "kde_x": [...],
            "kde_y": [...]
        }
    ],
    "n_frames_total": 10000,
    "n_frames_used": 9000,
    "equilibration_time": 10.0,
    "equilibration_unit": "ns",
    # ... additional metadata
}
```

## Visualization

`````{tab-set}

````{tab-item} CLI

Generate plots automatically with `--plot`:

```bash
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --plot
```

Plots are saved to `/plots/distances/`.
````

````{tab-item} Python

Use the plotting functions for custom figures:

```python
from polyzymd.analysis.distances import (
    plot_distance_histogram,
    plot_distance_timeseries,
    plot_distance_comparison,
    plot_contact_fraction_bar,
)

# Single distribution (calc is a DistanceCalculator, as in Basic Usage)
result = calc.compute(replicate=1)
fig, ax = plot_distance_histogram(result.pair_results[0])
fig.savefig("distance_histogram.png")

# Time series (requires store_distributions=True)
fig, ax = plot_distance_timeseries(result.pair_results[0])
fig.savefig("distance_timeseries.png")

# Compare multiple conditions
# (calc_no_poly and calc_with_poly are DistanceCalculator instances, one per condition)
results_no_poly = calc_no_poly.compute(replicate=1)
results_with_poly = calc_with_poly.compute(replicate=1)
fig, ax = plot_distance_comparison(
    [results_no_poly.pair_results[0], results_with_poly.pair_results[0]],
    labels=["No Polymer", "With Polymer"],
)
fig.savefig("distance_comparison.png")
```

````

`````

### Available Plot Types

The CLI `--plot` flag generates histograms automatically. For other plot types, use the Python API:

| Function | Description | CLI | Python |
|----------|-------------|:---:|:------:|
| `plot_distance_histogram` | Distribution with optional threshold line | ✓ | ✓ |
| `plot_distance_timeseries` | Distance over frame number | | ✓ |
| `plot_distance_comparison` | Overlay multiple conditions | | ✓ |
| `plot_contact_fraction_bar` | Bar chart of contact fractions | ✓* | ✓ |

*\*Only generated when `--threshold` is specified with multiple replicates.*

```{note}
**Want more CLI plot options?** See [Issue #27](https://github.com/joelaforet/polyzymd/issues/27) and [Issue #28](https://github.com/joelaforet/polyzymd/issues/28) for planned enhancements to automatic plot generation.
```

## Common Options

| Option | Default | Description |
|--------|---------|-------------|
| `-c, --config` | (required) | Path to config.yaml |
| `-r, --replicates` | `1` | Which replicates to analyze |
| `--eq-time` | `0ns` | Equilibration time to skip |
| `--pair` | (required) | Distance pair as `selection1 : selection2` |
| `--threshold` | (none) | Contact cutoff in Angstroms |
| `--plot` | off | Generate matplotlib figures |
| `--recompute` | off | Ignore cached results |
| `-o, --output-dir` | (auto) | Custom output location |

### Replicate Specification

| Format | Meaning |
|--------|---------|
| `-r 1` | Single replicate |
| `-r 1-5` | Replicates 1 through 5 |
| `-r 1,3,5` | Specific replicates |

## PBC-Aware Distances and Trajectory Alignment

```{versionadded} 0.3.0
Distance calculations now include PBC-aware distances and trajectory alignment by default.
```

### Periodic Boundary Conditions (PBC)

By default, distances are computed using the **minimum image convention**, which correctly handles molecules near periodic boundaries. This prevents artificially large distances (60-70Å) when atoms are actually close but on opposite sides of the simulation box.

```{note}
**When does PBC matter?** PBC correction is critical when:

- Molecules diffuse across box boundaries
- Long polymers span the periodic box
- Active sites are near the box edge

For well-centered proteins in large boxes, PBC usually has minimal effect, but it's always safer to keep it enabled (the default).
```

**Supported box types:**

- ✅ Orthorhombic boxes (cubic, rectangular): Fully supported
- ⚠️ Triclinic boxes: Warning issued, falls back to Euclidean distance

`````{tab-set}

````{tab-item} YAML

```yaml
# analysis.yaml
distances:
  use_pbc: true  # Default, can be omitted
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
```

````

````{tab-item} Python

```python
from polyzymd.analysis import DistanceCalculator

# PBC enabled by default
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=True,  # Default, can be omitted
)

# Disable PBC (not recommended)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=False,
)
```

````

`````

### Trajectory Alignment

By default, trajectories are **aligned to a reference structure** before computing distances. This removes rotational drift and center-of-mass motion that can add noise to distance measurements.

**Why alignment matters:** MD simulations allow the entire system to rotate and translate. Without alignment, even a rigid protein will show larger fluctuations in inter-atomic distances due to this global motion.

**Reference modes:**

| Mode | Description | Best for |
|------|-------------|----------|
| `centroid` (default) | Align to most populated conformational cluster (K-Means) | General use |
| `average` | Align to mathematical average structure | Pure thermal fluctuation analysis |
| `frame` | Align to a specific frame number | Comparing to known functional conformation |

```{note}
When alignment is performed, an INFO-level log message notifies you. This ensures you're aware that trajectory coordinates have been modified in-memory.
```

`````{tab-set}

````{tab-item} YAML

```yaml
# analysis.yaml
distances:
  align_trajectory: true  # Default
  alignment_mode: centroid  # Default
  alignment_selection: "protein and name CA"  # Default
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
```

````

````{tab-item} Python

```python
from polyzymd.analysis import DistanceCalculator
from polyzymd.analysis.core.alignment import AlignmentConfig

# Default: align to centroid using CA atoms
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
)

# Custom alignment: align to frame 500
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(
        reference_mode="frame",
        reference_frame=500,
        selection="protein and backbone",
    ),
)

# Disable alignment (not recommended for most analyses)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(enabled=False),
)
```

````

`````

### Cache Invalidation

The result filename includes PBC and alignment settings, so changing these parameters automatically invalidates the cache:

```
distances_resid77_OG-resid133_NE2_eq10ns_pbc_align-centroid.json
distances_resid77_OG-resid133_NE2_eq10ns_nopbc_noalign.json
```

This means you can safely experiment with different settings without manually clearing cached results.

## Troubleshooting

### "Selection matched no atoms"

**Cause:** The MDAnalysis selection doesn't match any atoms in your topology.

**Fix:**

- Check residue numbering in your PDB vs. MDAnalysis (0-indexed vs 1-indexed)
- Verify that atom names match your topology: select `protein and resid 77` to see the available atoms
- Use `polyzymd --debug analyze distances ...` for detailed selection diagnostics

### Very wide distance distribution

**Cause:** The selected atoms may be flexible or the selection is too broad.
**Fix:**

- Ensure selections resolve to single atoms (or use `midpoint()`/`com()`)
- Check that `selection1` and `selection2` are correctly specified
- Visualize the selections in a molecular viewer

### "Low statistical reliability" warning

**Cause:** Long correlation time relative to trajectory length.

**This is informational, not an error.** Results are still valid but uncertainties may be underestimated.

**Mitigation:**

- Use multiple replicates (aggregated SEM is more reliable)
- Run longer simulations
- Results are still useful for qualitative comparisons

### Missing replicate data

**Message:** `Skipping replicate N: trajectory data not found`

**Cause:** The requested replicate hasn't completed or the path is incorrect.

**Fix:** This is informational; analysis continues with the available replicates. Check simulation status if unexpected.

## Comparison with Catalytic Triad Analysis

Distance analysis and [Catalytic Triad Analysis](analysis_triad_quickstart.md) both measure atom-pair distances, but serve different purposes:

| Feature | Distances | Catalytic Triad |
|---------|-----------|-----------------|
| **Focus** | Any atom pairs | Pre-defined catalytic geometry |
| **Configuration** | `analysis.yaml` or CLI | `comparison.yaml` with conditions |
| **Multi-condition** | Via `compare run distances` | Built-in condition comparison |
| **Simultaneous contacts** | Not computed | Key metric (all pairs < threshold) |
| **Use case** | Ad-hoc distance measurements | Structured enzyme comparisons |

```{tip}
Use **distances** for exploratory analysis of specific interactions. Use **catalytic triad** when comparing enzyme integrity across conditions.
```

## Comparing Distances Across Conditions

To statistically compare distances across multiple simulation conditions (e.g., different polymer compositions), use the `compare run distances` command:

```bash
# Add distances section to comparison.yaml, then:
polyzymd compare run distances -f comparison.yaml
```

This provides:

- **Dual-metric ranking**: By mean distance (primary) and fraction below threshold (secondary)
- **Statistical tests**: t-tests, Cohen's d effect sizes, ANOVA
- **Per-pair summaries**: Distance statistics for each defined pair

See [Comparing Distances Across Conditions](analysis_compare_conditions.md#comparing-distances-across-conditions) for full documentation.

## Next Steps

- **Compare distances across conditions**: [Comparing Conditions Guide](analysis_compare_conditions.md#comparing-distances-across-conditions)
- **Catalytic triad analysis**: [Triad Quick Start](analysis_triad_quickstart.md)
- **Understand statistics**: [Statistical Best Practices](analysis_statistics_best_practices.md)
- **Contact analysis**: [Contacts Quick Start](analysis_contacts_quickstart.md)
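
## Appendix: Minimum Image Sketch

The minimum image convention described under Periodic Boundary Conditions can be illustrated in a few lines of standard-library Python. This is a minimal sketch for orthorhombic boxes only, not PolyzyMD's actual implementation: the function name `minimum_image_distance`, the 70 Å box, and the coordinates are assumptions chosen for illustration.

```python
import math


def minimum_image_distance(pos_a, pos_b, box_lengths):
    """Distance between two points in an orthorhombic periodic box.

    Hypothetical helper for illustration; all values are in Angstroms.
    """
    squared = 0.0
    for a, b, length in zip(pos_a, pos_b, box_lengths):
        delta = b - a
        # Subtract whole box lengths so the component lies within [-L/2, L/2]
        delta -= length * round(delta / length)
        squared += delta * delta
    return math.sqrt(squared)


box = (70.0, 70.0, 70.0)     # assumed cubic box
atom_a = (1.0, 35.0, 35.0)   # near the lower x face
atom_b = (69.0, 35.0, 35.0)  # near the upper x face

naive = math.dist(atom_a, atom_b)                      # ignores periodicity
wrapped = minimum_image_distance(atom_a, atom_b, box)  # nearest periodic image

print(f"naive: {naive:.1f} Å, minimum image: {wrapped:.1f} Å")
```

With these assumed inputs the naive distance is 68 Å while the minimum-image distance is 2 Å, which is the kind of artifact that PBC-aware distances avoid. Triclinic boxes need a more general treatment that this sketch does not attempt.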