Distance Analysis: Quick Start

Compute inter-atomic distances with proper statistical handling in under 5 minutes.

Note

Want to understand the statistics? This guide focuses on getting results quickly. For proper uncertainty quantification (autocorrelation correction, SEM vs. SD), see the Statistical Best Practices Guide.

TL;DR

# Single distance pair, single replicate
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

# Multiple pairs with contact threshold
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1" \
    --threshold 3.5

# With plots
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" --plot

Prerequisites

Before running distance analysis, you need:

Completed production simulation(s) - at least one replicate
Config file - the same config.yaml used for the simulation
Trajectory files - in the scratch directory specified in config

Verify your setup:

# Check that trajectories exist
ls $(polyzymd info -c config.yaml --scratch-dir)/production_*/

What Distance Analysis Provides

The distance analysis module computes:

Feature	Description
Mean distance	Average distance over trajectory (equilibrated portion)
SEM	Autocorrelation-corrected standard error of the mean
Mode (KDE peak)	Most probable distance from kernel density estimation
Contact fraction	% of frames below a distance threshold
Distribution	Full histogram and KDE for visualization

Tip

When to use distances vs. contacts vs. triad:

Distances: Specific atom pairs with continuous distance values
Contacts: All residue-residue contacts at an interface (binary count)
Triad: Pre-defined catalytic geometry with simultaneous contact analysis

Basic Usage

YAML (Recommended)

For reproducible analysis, define distance pairs in analysis.yaml:

# analysis.yaml (alongside config.yaml)
replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

distances:
  enabled: true
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
    - label: "His156-Asp133"
      selection_a: "resid 156 and name ND1"
      selection_b: "midpoint(resid 133 and name OD1 OD2)"

Then run:

# Initialize template (if starting fresh)
polyzymd analyze init

# Run all enabled analyses (uses analysis.yaml)
polyzymd analyze run

# Force recompute
polyzymd analyze run --recompute

Benefits:

Version-controlled, reproducible
Self-documenting experiment setup
Easy to re-run with different parameters

CLI

Single Distance Pair

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

Expected output:

Loading configuration from: config.yaml
Distance Analysis: MySimulation
  Replicates: 1
  Equilibration: 10ns
  Distance pairs: 1
    1. resid 77 and name OG <-> resid 133 and name NE2

Distance Analysis Complete
  resid77_OG-resid133_NE2:
    Mean: 3.42 ± 0.15 Å
    Min:  2.61 Å
    Max:  5.87 Å

Multiple Pairs

Specify --pair multiple times:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1"

Multiple Replicates

Aggregates results with SEM across replicates:

polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

Output:

Distance Analysis Complete (Aggregated)
  Replicates: 1-3
  resid77_OG-resid133_NE2:
    Mean: 3.38 ± 0.08 Å (SEM across 3 replicates)

Python

Use DistanceCalculator for programmatic analysis:

from polyzymd.config.schema import SimulationConfig
from polyzymd.analysis import DistanceCalculator

# Load configuration
config = SimulationConfig.from_yaml("config.yaml")

# Define distance pairs
pairs = [
    ("resid 77 and name OG", "resid 156 and name NE2"),
    ("resid 156 and name ND1", "midpoint(resid 133 and name OD1 OD2)"),
]

# Create calculator
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Optional: contact analysis
)

# Single replicate
result = calc.compute(replicate=1)

for pr in result.pair_results:
    print(f"{pr.pair_label}: {pr.mean_distance:.2f} ± {pr.sem_distance:.2f} Å")
    if pr.fraction_below_threshold is not None:
        print(f"  Contact fraction: {pr.fraction_below_threshold:.1%}")

# Multiple replicates (aggregated with SEM)
agg_result = calc.compute_aggregated(replicates=[1, 2, 3])

for pr in agg_result.pair_results:
    print(f"{pr.pair_label}: {pr.overall_mean:.2f} ± {pr.overall_sem:.2f} Å")

Contact Threshold Analysis

Add --threshold to compute the fraction of frames where the distance is below a cutoff (useful for hydrogen bond analysis, active site proximity, etc.):

YAML (Recommended)

# analysis.yaml
distances:
  enabled: true
  threshold: 3.5  # Angstroms (H-bond cutoff)
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"

polyzymd analyze run

CLI

polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --threshold 3.5

Output:

Distance Analysis Complete
  resid77_OG-resid133_NE2:
    Mean: 3.42 ± 0.15 Å
    Min:  2.61 Å
    Max:  5.87 Å
    Contact fraction (<3.5Å): 62.4%

Python

calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Contact cutoff in Angstroms
)

result = calc.compute(replicate=1)
for pr in result.pair_results:
    if pr.fraction_below_threshold is not None:
        print(f"{pr.pair_label}: {pr.fraction_below_threshold:.1%} below {pr.threshold} Å")

Special Selection Syntax

PolyzyMD extends MDAnalysis selections with special position modes and keywords:

Warning

Chain-Aware Selections Required

Residue numbers restart at 1 for each chain in PolyzyMD systems. A selection like resid 141-148 will match residues from all chains (protein, polymer, and water).

For protein residues, always use protein and resid X:

# INCORRECT - selects from all chains, causing wrong distances
selection_a: "com(resid 141-148)"

# CORRECT - restricts to protein chain only
selection_a: "com(protein and resid 141-148)"

PolyzyMD will emit a runtime warning if your selection spans multiple chains, but it’s best to write correct selections from the start.

Position Modes

Syntax	Description	Use Case
`midpoint(selection)`	Geometric midpoint of selected atoms	Carboxylate groups (Asp, Glu)
`com(selection)`	Center of mass of selected atoms	Entire residues, aromatic rings

PolyzyMD Keywords

Keyword	Description	Example
`pdbindex N`	Atom by PDB serial number (1-indexed)	`pdbindex 2740 and name CA`

The pdbindex keyword lets you reference atoms by their PDB ATOM serial number (the number displayed in PyMOL as “id”). This is especially useful when copying atom selections from restraint definitions in config.yaml.

Tip

Consistency with restraints: You can use the same pdbindex selections in both your restraint configuration (config.yaml) and analysis commands. This makes it easy to verify that restrained distances match observed distances.

Examples

# Midpoint of Asp carboxylate oxygens (protein residue)
selection_a: "midpoint(protein and resid 133 and name OD1 OD2)"

# Center of mass of entire ligand (non-protein, no chain restriction needed)
selection_b: "com(resname LIG)"

# Standard single atom (protein residue)
selection_a: "protein and resid 77 and name OG"

# Atom by PDB serial number (unique, no chain restriction needed)
selection_a: "pdbindex 2740"

Tip

Use midpoint() for carboxylate groups (Asp, Glu) where either oxygen can accept a hydrogen bond. This gives a single representative point instead of choosing arbitrarily between OD1/OD2 or OE1/OE2.

CLI Syntax

On the command line, use quotes to protect the special syntax:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "protein and resid 156 and name ND1 : midpoint(protein and resid 133 and name OD1 OD2)"

Output Files

Results are saved in your project’s analysis directory:

<projects_directory>/
└── analysis/
    └── distances/
        ├── run_1/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_2/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_3/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        └── aggregated/
            └── distances_reps1-3_eq10ns.json

JSON Result Structure

{
    "pair_results": [
        {
            "pair_label": "resid77_OG-resid133_NE2",
            "selection1": "resid 77 and name OG",
            "selection2": "resid 133 and name NE2",
            "mean_distance": 3.42,
            "std_distance": 0.87,
            "sem_distance": 0.15,  # Autocorrelation-corrected
            "median_distance": 3.31,
            "min_distance": 2.61,
            "max_distance": 5.87,
            "kde_peak": 3.18,  # Mode from KDE
            "threshold": 3.5,
            "fraction_below_threshold": 0.624,
            "correlation_time": 245.3,  # ps
            "n_independent_frames": 34,
            "histogram_edges": [...],
            "histogram_counts": [...],
            "kde_x": [...],
            "kde_y": [...]
        }
    ],
    "n_frames_total": 10000,
    "n_frames_used": 9000,
    "equilibration_time": 10.0,
    "equilibration_unit": "ns",
    # ... additional metadata
}

Visualization

CLI

Generate plots automatically with --plot:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --plot

Plots are saved to <projects_directory>/plots/distances/.

Python

Use the plotting functions for custom figures:

from polyzymd.analysis.distances import (
    plot_distance_histogram,
    plot_distance_timeseries,
    plot_distance_comparison,
    plot_contact_fraction_bar,
)

# Single distribution
result = calc.compute(replicate=1)
fig, ax = plot_distance_histogram(result.pair_results[0])
fig.savefig("distance_histogram.png")

# Time series (requires store_distributions=True)
fig, ax = plot_distance_timeseries(result.pair_results[0])
fig.savefig("distance_timeseries.png")

# Compare multiple conditions
results_no_poly = calc_no_poly.compute(replicate=1)
results_with_poly = calc_with_poly.compute(replicate=1)

fig, ax = plot_distance_comparison(
    [results_no_poly.pair_results[0], results_with_poly.pair_results[0]],
    labels=["No Polymer", "With Polymer"],
)
fig.savefig("distance_comparison.png")

Available Plot Types

The CLI --plot flag generates histograms automatically. For other plot types, use the Python API:

Function	Description	CLI	Python
`plot_distance_histogram`	Distribution with optional threshold line	✓	✓
`plot_distance_timeseries`	Distance over frame number		✓
`plot_distance_comparison`	Overlay multiple conditions		✓
`plot_contact_fraction_bar`	Bar chart of contact fractions	✓*	✓

*Only generated when --threshold is specified with multiple replicates.

Note

Want more CLI plot options? See Issue #27 and Issue #28 for planned enhancements to automatic plot generation.

Common Options

Option	Default	Description
`-c, --config`	(required)	Path to config.yaml
`-r, --replicates`	`1`	Which replicates to analyze
`--eq-time`	`0ns`	Equilibration time to skip
`--pair`	(required)	Distance pair as `selection1 : selection2`
`--threshold`	(none)	Contact cutoff in Angstroms
`--plot`	off	Generate matplotlib figures
`--recompute`	off	Ignore cached results
`-o, --output-dir`	(auto)	Custom output location

Replicate Specification

Format	Meaning
`-r 1`	Single replicate
`-r 1-5`	Replicates 1 through 5
`-r 1,3,5`	Specific replicates

PBC-Aware Distances and Trajectory Alignment

Added in version 0.3.0: Distance calculations now include PBC-aware distances and trajectory alignment by default.

Periodic Boundary Conditions (PBC)

By default, distances are computed using the minimum image convention, which correctly handles molecules near periodic boundaries. This prevents artificially large distances (60-70Å) when atoms are actually close but on opposite sides of the simulation box.

Note

When does PBC matter? PBC correction is critical when:

Molecules diffuse across box boundaries
Long polymers span the periodic box
Active sites are near the box edge

For well-centered proteins in large boxes, PBC usually has minimal effect, but it’s always safer to keep it enabled (the default).

Supported box types:

✅ Orthorhombic boxes (cubic, rectangular): Fully supported
⚠️ Triclinic boxes: Warning issued, falls back to Euclidean distance

YAML

# analysis.yaml
distances:
  use_pbc: true  # Default, can be omitted
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"

Python

from polyzymd.analysis import DistanceCalculator

# PBC enabled by default
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=True,  # Default, can be omitted
)

# Disable PBC (not recommended)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=False,
)

Trajectory Alignment

By default, trajectories are aligned to a reference structure before computing distances. This removes rotational drift and center-of-mass motion that can add noise to distance measurements.

Why alignment matters: MD simulations allow the entire system to rotate and translate. Without alignment, even a rigid protein will show larger fluctuations in inter-atomic distances due to this global motion.

Reference modes:

Mode	Description	Best for
`centroid` (default)	Align to most populated conformational cluster (K-Means)	General use
`average`	Align to mathematical average structure	Pure thermal fluctuation analysis
`frame`	Align to a specific frame number	Comparing to known functional conformation

Note

When alignment is performed, an INFO-level log message notifies you. This ensures you’re aware that trajectory coordinates have been modified in-memory.

YAML

# analysis.yaml
distances:
  align_trajectory: true  # Default
  alignment_mode: centroid  # Default
  alignment_selection: "protein and name CA"  # Default
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"

Python

from polyzymd.analysis import DistanceCalculator
from polyzymd.analysis.core.alignment import AlignmentConfig

# Default: align to centroid using CA atoms
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
)

# Custom alignment: align to frame 500
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(
        reference_mode="frame",
        reference_frame=500,
        selection="protein and backbone",
    ),
)

# Disable alignment (not recommended for most analyses)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(enabled=False),
)

Cache Invalidation

The result filename includes PBC and alignment settings, so changing these parameters automatically invalidates the cache:

distances_resid77_OG-resid133_NE2_eq10ns_pbc_align-centroid.json
distances_resid77_OG-resid133_NE2_eq10ns_nopbc_noalign.json

This means you can safely experiment with different settings without manually clearing cached results.

Troubleshooting

“Selection matched no atoms”

Cause: MDAnalysis selection doesn’t match any atoms in your topology.

Fix:

Check residue numbering in your PDB vs. MDAnalysis (0-indexed vs 1-indexed)
Verify atom names match your topology: protein and resid 77 to see available atoms
Use polyzymd --debug analyze distances ... for detailed selection diagnostics

Very wide distance distribution

Cause: The selected atoms may be flexible or the selection is too broad.

Fix:

Ensure selections resolve to single atoms (or use midpoint()/com())
Check that selection1 and selection2 are correctly specified
Visualize the selections in a molecular viewer

“Low statistical reliability” warning

Cause: Long correlation time relative to trajectory length.

This is informational, not an error. Results are still valid but uncertainties may be underestimated.

Mitigation:

Use multiple replicates (aggregated SEM is more reliable)
Run longer simulations
Results are still useful for qualitative comparisons

Missing replicate data

Message: Skipping replicate N: trajectory data not found

Cause: The requested replicate hasn’t completed or path is incorrect.

Fix: This is informational—analysis continues with available replicates. Check simulation status if unexpected.

Comparison with Catalytic Triad Analysis

Distance analysis and Catalytic Triad Analysis both measure atom-pair distances, but serve different purposes:

Feature	Distances	Catalytic Triad
Focus	Any atom pairs	Pre-defined catalytic geometry
Configuration	`analysis.yaml` or CLI	`comparison.yaml` with conditions
Multi-condition	Via `compare run distances`	Built-in condition comparison
Simultaneous contacts	Not computed	Key metric (all pairs < threshold)
Use case	Ad-hoc distance measurements	Structured enzyme comparisons

Tip

Use distances for exploratory analysis of specific interactions. Use catalytic triad when comparing enzyme integrity across conditions.

Comparing Distances Across Conditions

To statistically compare distances across multiple simulation conditions (e.g., different polymer compositions), use the compare run distances command:

# Add distances section to comparison.yaml, then:
polyzymd compare run distances -f comparison.yaml

This provides:

Dual-metric ranking: By mean distance (primary) and fraction below threshold (secondary)
Statistical tests: t-tests, Cohen’s d effect sizes, ANOVA
Per-pair summaries: Distance statistics for each defined pair

See Comparing Distances Across Conditions for full documentation.

Next Steps

Compare distances across conditions: Comparing Conditions Guide
Catalytic triad analysis: Triad Quick Start
Understand statistics: Statistical Best Practices
Contact analysis: Contacts Quick Start