Distance Analysis: Quick Start
Compute inter-atomic distances with proper statistical handling in under 5 minutes.
Note
Want to understand the statistics? This guide focuses on getting results quickly. For proper uncertainty quantification (autocorrelation correction, SEM vs. SD), see the Statistical Best Practices Guide.
TL;DR
# Single distance pair, single replicate
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2"
# Multiple pairs with contact threshold
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2" \
--pair "resid 133 and name NE2 : resid 156 and name OD1" \
--threshold 3.5
# With plots
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2" --plot
Prerequisites
Before running distance analysis, you need:
Completed production simulation(s) - at least one replicate
Config file - the same
config.yamlused for the simulationTrajectory files - in the scratch directory specified in config
Verify your setup:
# Check that trajectories exist
ls $(polyzymd info -c config.yaml --scratch-dir)/production_*/
What Distance Analysis Provides
The distance analysis module computes:
Feature |
Description |
|---|---|
Mean distance |
Average distance over trajectory (equilibrated portion) |
SEM |
Autocorrelation-corrected standard error of the mean |
Mode (KDE peak) |
Most probable distance from kernel density estimation |
Contact fraction |
% of frames below a distance threshold |
Distribution |
Full histogram and KDE for visualization |
Tip
When to use distances vs. contacts vs. triad:
Distances: Specific atom pairs with continuous distance values
Contacts: All residue-residue contacts at an interface (binary count)
Triad: Pre-defined catalytic geometry with simultaneous contact analysis
Basic Usage
For reproducible analysis, define distance pairs in analysis.yaml:
# analysis.yaml (alongside config.yaml)
replicates: [1, 2, 3]
defaults:
equilibration_time: "10ns"
distances:
enabled: true
pairs:
- label: "Ser77-His156"
selection_a: "resid 77 and name OG"
selection_b: "resid 156 and name NE2"
- label: "His156-Asp133"
selection_a: "resid 156 and name ND1"
selection_b: "midpoint(resid 133 and name OD1 OD2)"
Then run:
# Initialize template (if starting fresh)
polyzymd analyze init
# Run all enabled analyses (uses analysis.yaml)
polyzymd analyze run
# Force recompute
polyzymd analyze run --recompute
Benefits:
Version-controlled, reproducible
Self-documenting experiment setup
Easy to re-run with different parameters
Single Distance Pair
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2"
Expected output:
Loading configuration from: config.yaml
Distance Analysis: MySimulation
Replicates: 1
Equilibration: 10ns
Distance pairs: 1
1. resid 77 and name OG <-> resid 133 and name NE2
Distance Analysis Complete
resid77_OG-resid133_NE2:
Mean: 3.42 ± 0.15 Å
Min: 2.61 Å
Max: 5.87 Å
Multiple Pairs
Specify --pair multiple times:
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2" \
--pair "resid 133 and name NE2 : resid 156 and name OD1"
Multiple Replicates
Aggregates results with SEM across replicates:
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2"
Output:
Distance Analysis Complete (Aggregated)
Replicates: 1-3
resid77_OG-resid133_NE2:
Mean: 3.38 ± 0.08 Å (SEM across 3 replicates)
Use DistanceCalculator for programmatic analysis:
from polyzymd.config.schema import SimulationConfig
from polyzymd.analysis import DistanceCalculator
# Load configuration
config = SimulationConfig.from_yaml("config.yaml")
# Define distance pairs
pairs = [
("resid 77 and name OG", "resid 156 and name NE2"),
("resid 156 and name ND1", "midpoint(resid 133 and name OD1 OD2)"),
]
# Create calculator
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
threshold=3.5, # Optional: contact analysis
)
# Single replicate
result = calc.compute(replicate=1)
for pr in result.pair_results:
print(f"{pr.pair_label}: {pr.mean_distance:.2f} ± {pr.sem_distance:.2f} Å")
if pr.fraction_below_threshold is not None:
print(f" Contact fraction: {pr.fraction_below_threshold:.1%}")
# Multiple replicates (aggregated with SEM)
agg_result = calc.compute_aggregated(replicates=[1, 2, 3])
for pr in agg_result.pair_results:
print(f"{pr.pair_label}: {pr.overall_mean:.2f} ± {pr.overall_sem:.2f} Å")
Contact Threshold Analysis
Add --threshold to compute the fraction of frames where the distance is below
a cutoff (useful for hydrogen bond analysis, active site proximity, etc.):
# analysis.yaml
distances:
enabled: true
threshold: 3.5 # Angstroms (H-bond cutoff)
pairs:
- label: "Ser77-His156"
selection_a: "resid 77 and name OG"
selection_b: "resid 156 and name NE2"
polyzymd analyze run
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2" \
--threshold 3.5
Output:
Distance Analysis Complete
resid77_OG-resid133_NE2:
Mean: 3.42 ± 0.15 Å
Min: 2.61 Å
Max: 5.87 Å
Contact fraction (<3.5Å): 62.4%
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
threshold=3.5, # Contact cutoff in Angstroms
)
result = calc.compute(replicate=1)
for pr in result.pair_results:
if pr.fraction_below_threshold is not None:
print(f"{pr.pair_label}: {pr.fraction_below_threshold:.1%} below {pr.threshold} Å")
Special Selection Syntax
PolyzyMD extends MDAnalysis selections with special position modes and keywords:
Warning
Chain-Aware Selections Required
Residue numbers restart at 1 for each chain in PolyzyMD systems. A selection like
resid 141-148 will match residues from all chains (protein, polymer, and water).
For protein residues, always use protein and resid X:
# INCORRECT - selects from all chains, causing wrong distances
selection_a: "com(resid 141-148)"
# CORRECT - restricts to protein chain only
selection_a: "com(protein and resid 141-148)"
PolyzyMD will emit a runtime warning if your selection spans multiple chains, but it’s best to write correct selections from the start.
Position Modes
Syntax |
Description |
Use Case |
|---|---|---|
|
Geometric midpoint of selected atoms |
Carboxylate groups (Asp, Glu) |
|
Center of mass of selected atoms |
Entire residues, aromatic rings |
PolyzyMD Keywords
Keyword |
Description |
Example |
|---|---|---|
|
Atom by PDB serial number (1-indexed) |
|
The pdbindex keyword lets you reference atoms by their PDB ATOM serial number
(the number displayed in PyMOL as “id”). This is especially useful when copying
atom selections from restraint definitions in config.yaml.
Tip
Consistency with restraints: You can use the same pdbindex selections in
both your restraint configuration (config.yaml) and analysis commands. This
makes it easy to verify that restrained distances match observed distances.
Examples
# Midpoint of Asp carboxylate oxygens (protein residue)
selection_a: "midpoint(protein and resid 133 and name OD1 OD2)"
# Center of mass of entire ligand (non-protein, no chain restriction needed)
selection_b: "com(resname LIG)"
# Standard single atom (protein residue)
selection_a: "protein and resid 77 and name OG"
# Atom by PDB serial number (unique, no chain restriction needed)
selection_a: "pdbindex 2740"
Tip
Use midpoint() for carboxylate groups (Asp, Glu) where either oxygen can
accept a hydrogen bond. This gives a single representative point instead of
choosing arbitrarily between OD1/OD2 or OE1/OE2.
CLI Syntax
On the command line, use quotes to protect the special syntax:
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "protein and resid 156 and name ND1 : midpoint(protein and resid 133 and name OD1 OD2)"
Output Files
Results are saved in your project’s analysis directory:
<projects_directory>/
└── analysis/
└── distances/
├── run_1/
│ └── distances_resid77_OG-resid133_NE2_eq10ns.json
├── run_2/
│ └── distances_resid77_OG-resid133_NE2_eq10ns.json
├── run_3/
│ └── distances_resid77_OG-resid133_NE2_eq10ns.json
└── aggregated/
└── distances_reps1-3_eq10ns.json
JSON Result Structure
{
"pair_results": [
{
"pair_label": "resid77_OG-resid133_NE2",
"selection1": "resid 77 and name OG",
"selection2": "resid 133 and name NE2",
"mean_distance": 3.42,
"std_distance": 0.87,
"sem_distance": 0.15, # Autocorrelation-corrected
"median_distance": 3.31,
"min_distance": 2.61,
"max_distance": 5.87,
"kde_peak": 3.18, # Mode from KDE
"threshold": 3.5,
"fraction_below_threshold": 0.624,
"correlation_time": 245.3, # ps
"n_independent_frames": 34,
"histogram_edges": [...],
"histogram_counts": [...],
"kde_x": [...],
"kde_y": [...]
}
],
"n_frames_total": 10000,
"n_frames_used": 9000,
"equilibration_time": 10.0,
"equilibration_unit": "ns",
# ... additional metadata
}
Visualization
Generate plots automatically with --plot:
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
--pair "resid 77 and name OG : resid 133 and name NE2" \
--plot
Plots are saved to <projects_directory>/plots/distances/.
Use the plotting functions for custom figures:
from polyzymd.analysis.distances import (
plot_distance_histogram,
plot_distance_timeseries,
plot_distance_comparison,
plot_contact_fraction_bar,
)
# Single distribution
result = calc.compute(replicate=1)
fig, ax = plot_distance_histogram(result.pair_results[0])
fig.savefig("distance_histogram.png")
# Time series (requires store_distributions=True)
fig, ax = plot_distance_timeseries(result.pair_results[0])
fig.savefig("distance_timeseries.png")
# Compare multiple conditions
results_no_poly = calc_no_poly.compute(replicate=1)
results_with_poly = calc_with_poly.compute(replicate=1)
fig, ax = plot_distance_comparison(
[results_no_poly.pair_results[0], results_with_poly.pair_results[0]],
labels=["No Polymer", "With Polymer"],
)
fig.savefig("distance_comparison.png")
Available Plot Types
The CLI --plot flag generates histograms automatically. For other plot types,
use the Python API:
Function |
Description |
CLI |
Python |
|---|---|---|---|
|
Distribution with optional threshold line |
✓ |
✓ |
|
Distance over frame number |
✓ |
|
|
Overlay multiple conditions |
✓ |
|
|
Bar chart of contact fractions |
✓* |
✓ |
*Only generated when --threshold is specified with multiple replicates.
Common Options
Option |
Default |
Description |
|---|---|---|
|
(required) |
Path to config.yaml |
|
|
Which replicates to analyze |
|
|
Equilibration time to skip |
|
(required) |
Distance pair as |
|
(none) |
Contact cutoff in Angstroms |
|
off |
Generate matplotlib figures |
|
off |
Ignore cached results |
|
(auto) |
Custom output location |
Replicate Specification
Format |
Meaning |
|---|---|
|
Single replicate |
|
Replicates 1 through 5 |
|
Specific replicates |
PBC-Aware Distances and Trajectory Alignment
Added in version 0.3.0: Distance calculations now include PBC-aware distances and trajectory alignment by default.
Periodic Boundary Conditions (PBC)
By default, distances are computed using the minimum image convention, which correctly handles molecules near periodic boundaries. This prevents artificially large distances (60-70Å) when atoms are actually close but on opposite sides of the simulation box.
Note
When does PBC matter? PBC correction is critical when:
Molecules diffuse across box boundaries
Long polymers span the periodic box
Active sites are near the box edge
For well-centered proteins in large boxes, PBC usually has minimal effect, but it’s always safer to keep it enabled (the default).
Supported box types:
✅ Orthorhombic boxes (cubic, rectangular): Fully supported
⚠️ Triclinic boxes: Warning issued, falls back to Euclidean distance
# analysis.yaml
distances:
use_pbc: true # Default, can be omitted
pairs:
- label: "Ser77-His156"
selection_a: "resid 77 and name OG"
selection_b: "resid 156 and name NE2"
from polyzymd.analysis import DistanceCalculator
# PBC enabled by default
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
use_pbc=True, # Default, can be omitted
)
# Disable PBC (not recommended)
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
use_pbc=False,
)
Trajectory Alignment
By default, trajectories are aligned to a reference structure before computing distances. This removes rotational drift and center-of-mass motion that can add noise to distance measurements.
Why alignment matters: MD simulations allow the entire system to rotate and translate. Without alignment, even a rigid protein will show larger fluctuations in inter-atomic distances due to this global motion.
Reference modes:
Mode |
Description |
Best for |
|---|---|---|
|
Align to most populated conformational cluster (K-Means) |
General use |
|
Align to mathematical average structure |
Pure thermal fluctuation analysis |
|
Align to a specific frame number |
Comparing to known functional conformation |
Note
When alignment is performed, an INFO-level log message notifies you. This ensures you’re aware that trajectory coordinates have been modified in-memory.
# analysis.yaml
distances:
align_trajectory: true # Default
alignment_mode: centroid # Default
alignment_selection: "protein and name CA" # Default
pairs:
- label: "Ser77-His156"
selection_a: "resid 77 and name OG"
selection_b: "resid 156 and name NE2"
from polyzymd.analysis import DistanceCalculator
from polyzymd.analysis.core.alignment import AlignmentConfig
# Default: align to centroid using CA atoms
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
)
# Custom alignment: align to frame 500
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
alignment=AlignmentConfig(
reference_mode="frame",
reference_frame=500,
selection="protein and backbone",
),
)
# Disable alignment (not recommended for most analyses)
calc = DistanceCalculator(
config=config,
pairs=pairs,
equilibration="10ns",
alignment=AlignmentConfig(enabled=False),
)
Cache Invalidation
The result filename includes PBC and alignment settings, so changing these parameters automatically invalidates the cache:
distances_resid77_OG-resid133_NE2_eq10ns_pbc_align-centroid.json
distances_resid77_OG-resid133_NE2_eq10ns_nopbc_noalign.json
This means you can safely experiment with different settings without manually clearing cached results.
Troubleshooting
“Selection matched no atoms”
Cause: MDAnalysis selection doesn’t match any atoms in your topology.
Fix:
Check residue numbering in your PDB vs. MDAnalysis (0-indexed vs 1-indexed)
Verify atom names match your topology:
protein and resid 77to see available atomsUse
polyzymd --debug analyze distances ...for detailed selection diagnostics
Very wide distance distribution
Cause: The selected atoms may be flexible or the selection is too broad.
Fix:
Ensure selections resolve to single atoms (or use
midpoint()/com())Check that
selection1andselection2are correctly specifiedVisualize the selections in a molecular viewer
“Low statistical reliability” warning
Cause: Long correlation time relative to trajectory length.
This is informational, not an error. Results are still valid but uncertainties may be underestimated.
Mitigation:
Use multiple replicates (aggregated SEM is more reliable)
Run longer simulations
Results are still useful for qualitative comparisons
Missing replicate data
Message: Skipping replicate N: trajectory data not found
Cause: The requested replicate hasn’t completed or path is incorrect.
Fix: This is informational—analysis continues with available replicates. Check simulation status if unexpected.
Comparison with Catalytic Triad Analysis
Distance analysis and Catalytic Triad Analysis both measure atom-pair distances, but serve different purposes:
Feature |
Distances |
Catalytic Triad |
|---|---|---|
Focus |
Any atom pairs |
Pre-defined catalytic geometry |
Configuration |
|
|
Multi-condition |
Via |
Built-in condition comparison |
Simultaneous contacts |
Not computed |
Key metric (all pairs < threshold) |
Use case |
Ad-hoc distance measurements |
Structured enzyme comparisons |
Tip
Use distances for exploratory analysis of specific interactions. Use catalytic triad when comparing enzyme integrity across conditions.
Comparing Distances Across Conditions
To statistically compare distances across multiple simulation conditions (e.g.,
different polymer compositions), use the compare run distances command:
# Add distances section to comparison.yaml, then:
polyzymd compare run distances -f comparison.yaml
This provides:
Dual-metric ranking: By mean distance (primary) and fraction below threshold (secondary)
Statistical tests: t-tests, Cohen’s d effect sizes, ANOVA
Per-pair summaries: Distance statistics for each defined pair
See Comparing Distances Across Conditions for full documentation.
Next Steps
Compare distances across conditions: Comparing Conditions Guide
Catalytic triad analysis: Triad Quick Start
Understand statistics: Statistical Best Practices
Contact analysis: Contacts Quick Start