Distance Analysis: Quick Start

Compute inter-atomic distances with proper statistical handling in under 5 minutes.

Note

Want to understand the statistics? This guide focuses on getting results quickly. For proper uncertainty quantification (autocorrelation correction, SEM vs. SD), see the Statistical Best Practices Guide.

TL;DR

# Single distance pair, single replicate
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

# Multiple pairs with contact threshold
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1" \
    --threshold 3.5

# With plots
polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" --plot

Prerequisites

Before running distance analysis, you need:

  1. Completed production simulation(s) - at least one replicate

  2. Config file - the same config.yaml used for the simulation

  3. Trajectory files - in the scratch directory specified in config

Verify your setup:

# Check that trajectories exist
ls $(polyzymd info -c config.yaml --scratch-dir)/production_*/

What Distance Analysis Provides

The distance analysis module computes:

Feature

Description

Mean distance

Average distance over trajectory (equilibrated portion)

SEM

Autocorrelation-corrected standard error of the mean

Mode (KDE peak)

Most probable distance from kernel density estimation

Contact fraction

% of frames below a distance threshold

Distribution

Full histogram and KDE for visualization

Tip

When to use distances vs. contacts vs. triad:

  • Distances: Specific atom pairs with continuous distance values

  • Contacts: All residue-residue contacts at an interface (binary count)

  • Triad: Pre-defined catalytic geometry with simultaneous contact analysis

Basic Usage

For reproducible analysis, define distance pairs in analysis.yaml:

# analysis.yaml (alongside config.yaml)
replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

distances:
  enabled: true
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
    - label: "His156-Asp133"
      selection_a: "resid 156 and name ND1"
      selection_b: "midpoint(resid 133 and name OD1 OD2)"

Then run:

# Initialize template (if starting fresh)
polyzymd analyze init

# Run all enabled analyses (uses analysis.yaml)
polyzymd analyze run

# Force recompute
polyzymd analyze run --recompute

Benefits:

  • Version-controlled, reproducible

  • Self-documenting experiment setup

  • Easy to re-run with different parameters

Single Distance Pair

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

Expected output:

Loading configuration from: config.yaml
Distance Analysis: MySimulation
  Replicates: 1
  Equilibration: 10ns
  Distance pairs: 1
    1. resid 77 and name OG <-> resid 133 and name NE2

Distance Analysis Complete
  resid77_OG-resid133_NE2:
    Mean: 3.42 ± 0.15 Å
    Min:  2.61 Å
    Max:  5.87 Å

Multiple Pairs

Specify --pair multiple times:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --pair "resid 133 and name NE2 : resid 156 and name OD1"

Multiple Replicates

Aggregates results with SEM across replicates:

polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2"

Output:

Distance Analysis Complete (Aggregated)
  Replicates: 1-3
  resid77_OG-resid133_NE2:
    Mean: 3.38 ± 0.08 Å (SEM across 3 replicates)

Use DistanceCalculator for programmatic analysis:

from polyzymd.config.schema import SimulationConfig
from polyzymd.analysis import DistanceCalculator

# Load configuration
config = SimulationConfig.from_yaml("config.yaml")

# Define distance pairs
pairs = [
    ("resid 77 and name OG", "resid 156 and name NE2"),
    ("resid 156 and name ND1", "midpoint(resid 133 and name OD1 OD2)"),
]

# Create calculator
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Optional: contact analysis
)

# Single replicate
result = calc.compute(replicate=1)

for pr in result.pair_results:
    print(f"{pr.pair_label}: {pr.mean_distance:.2f} ± {pr.sem_distance:.2f} Å")
    if pr.fraction_below_threshold is not None:
        print(f"  Contact fraction: {pr.fraction_below_threshold:.1%}")

# Multiple replicates (aggregated with SEM)
agg_result = calc.compute_aggregated(replicates=[1, 2, 3])

for pr in agg_result.pair_results:
    print(f"{pr.pair_label}: {pr.overall_mean:.2f} ± {pr.overall_sem:.2f} Å")

Contact Threshold Analysis

Add --threshold to compute the fraction of frames where the distance is below a cutoff (useful for hydrogen bond analysis, active site proximity, etc.):

# analysis.yaml
distances:
  enabled: true
  threshold: 3.5  # Angstroms (H-bond cutoff)
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
polyzymd analyze run
polyzymd analyze distances -c config.yaml -r 1-3 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --threshold 3.5

Output:

Distance Analysis Complete
  resid77_OG-resid133_NE2:
    Mean: 3.42 ± 0.15 Å
    Min:  2.61 Å
    Max:  5.87 Å
    Contact fraction (<3.5Å): 62.4%
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    threshold=3.5,  # Contact cutoff in Angstroms
)

result = calc.compute(replicate=1)
for pr in result.pair_results:
    if pr.fraction_below_threshold is not None:
        print(f"{pr.pair_label}: {pr.fraction_below_threshold:.1%} below {pr.threshold} Å")

Special Selection Syntax

PolyzyMD extends MDAnalysis selections with special position modes and keywords:

Warning

Chain-Aware Selections Required

Residue numbers restart at 1 for each chain in PolyzyMD systems. A selection like resid 141-148 will match residues from all chains (protein, polymer, and water).

For protein residues, always use protein and resid X:

# INCORRECT - selects from all chains, causing wrong distances
selection_a: "com(resid 141-148)"

# CORRECT - restricts to protein chain only
selection_a: "com(protein and resid 141-148)"

PolyzyMD will emit a runtime warning if your selection spans multiple chains, but it’s best to write correct selections from the start.

Position Modes

Syntax

Description

Use Case

midpoint(selection)

Geometric midpoint of selected atoms

Carboxylate groups (Asp, Glu)

com(selection)

Center of mass of selected atoms

Entire residues, aromatic rings

PolyzyMD Keywords

Keyword

Description

Example

pdbindex N

Atom by PDB serial number (1-indexed)

pdbindex 2740 and name CA

The pdbindex keyword lets you reference atoms by their PDB ATOM serial number (the number displayed in PyMOL as “id”). This is especially useful when copying atom selections from restraint definitions in config.yaml.

Tip

Consistency with restraints: You can use the same pdbindex selections in both your restraint configuration (config.yaml) and analysis commands. This makes it easy to verify that restrained distances match observed distances.

Examples

# Midpoint of Asp carboxylate oxygens (protein residue)
selection_a: "midpoint(protein and resid 133 and name OD1 OD2)"

# Center of mass of entire ligand (non-protein, no chain restriction needed)
selection_b: "com(resname LIG)"

# Standard single atom (protein residue)
selection_a: "protein and resid 77 and name OG"

# Atom by PDB serial number (unique, no chain restriction needed)
selection_a: "pdbindex 2740"

Tip

Use midpoint() for carboxylate groups (Asp, Glu) where either oxygen can accept a hydrogen bond. This gives a single representative point instead of choosing arbitrarily between OD1/OD2 or OE1/OE2.

CLI Syntax

On the command line, use quotes to protect the special syntax:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "protein and resid 156 and name ND1 : midpoint(protein and resid 133 and name OD1 OD2)"

Output Files

Results are saved in your project’s analysis directory:

<projects_directory>/
└── analysis/
    └── distances/
        ├── run_1/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_2/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        ├── run_3/
        │   └── distances_resid77_OG-resid133_NE2_eq10ns.json
        └── aggregated/
            └── distances_reps1-3_eq10ns.json

JSON Result Structure

{
    "pair_results": [
        {
            "pair_label": "resid77_OG-resid133_NE2",
            "selection1": "resid 77 and name OG",
            "selection2": "resid 133 and name NE2",
            "mean_distance": 3.42,
            "std_distance": 0.87,
            "sem_distance": 0.15,  # Autocorrelation-corrected
            "median_distance": 3.31,
            "min_distance": 2.61,
            "max_distance": 5.87,
            "kde_peak": 3.18,  # Mode from KDE
            "threshold": 3.5,
            "fraction_below_threshold": 0.624,
            "correlation_time": 245.3,  # ps
            "n_independent_frames": 34,
            "histogram_edges": [...],
            "histogram_counts": [...],
            "kde_x": [...],
            "kde_y": [...]
        }
    ],
    "n_frames_total": 10000,
    "n_frames_used": 9000,
    "equilibration_time": 10.0,
    "equilibration_unit": "ns",
    # ... additional metadata
}

Visualization

Generate plots automatically with --plot:

polyzymd analyze distances -c config.yaml -r 1 --eq-time 10ns \
    --pair "resid 77 and name OG : resid 133 and name NE2" \
    --plot

Plots are saved to <projects_directory>/plots/distances/.

Use the plotting functions for custom figures:

from polyzymd.analysis.distances import (
    plot_distance_histogram,
    plot_distance_timeseries,
    plot_distance_comparison,
    plot_contact_fraction_bar,
)

# Single distribution
result = calc.compute(replicate=1)
fig, ax = plot_distance_histogram(result.pair_results[0])
fig.savefig("distance_histogram.png")

# Time series (requires store_distributions=True)
fig, ax = plot_distance_timeseries(result.pair_results[0])
fig.savefig("distance_timeseries.png")

# Compare multiple conditions
results_no_poly = calc_no_poly.compute(replicate=1)
results_with_poly = calc_with_poly.compute(replicate=1)

fig, ax = plot_distance_comparison(
    [results_no_poly.pair_results[0], results_with_poly.pair_results[0]],
    labels=["No Polymer", "With Polymer"],
)
fig.savefig("distance_comparison.png")

Available Plot Types

The CLI --plot flag generates histograms automatically. For other plot types, use the Python API:

Function

Description

CLI

Python

plot_distance_histogram

Distribution with optional threshold line

plot_distance_timeseries

Distance over frame number

plot_distance_comparison

Overlay multiple conditions

plot_contact_fraction_bar

Bar chart of contact fractions

✓*

*Only generated when --threshold is specified with multiple replicates.

Note

Want more CLI plot options? See Issue #27 and Issue #28 for planned enhancements to automatic plot generation.

Common Options

Option

Default

Description

-c, --config

(required)

Path to config.yaml

-r, --replicates

1

Which replicates to analyze

--eq-time

0ns

Equilibration time to skip

--pair

(required)

Distance pair as selection1 : selection2

--threshold

(none)

Contact cutoff in Angstroms

--plot

off

Generate matplotlib figures

--recompute

off

Ignore cached results

-o, --output-dir

(auto)

Custom output location

Replicate Specification

Format

Meaning

-r 1

Single replicate

-r 1-5

Replicates 1 through 5

-r 1,3,5

Specific replicates

PBC-Aware Distances and Trajectory Alignment

Added in version 0.3.0: Distance calculations now include PBC-aware distances and trajectory alignment by default.

Periodic Boundary Conditions (PBC)

By default, distances are computed using the minimum image convention, which correctly handles molecules near periodic boundaries. This prevents artificially large distances (60-70Å) when atoms are actually close but on opposite sides of the simulation box.

Note

When does PBC matter? PBC correction is critical when:

  • Molecules diffuse across box boundaries

  • Long polymers span the periodic box

  • Active sites are near the box edge

For well-centered proteins in large boxes, PBC usually has minimal effect, but it’s always safer to keep it enabled (the default).

Supported box types:

  • ✅ Orthorhombic boxes (cubic, rectangular): Fully supported

  • ⚠️ Triclinic boxes: Warning issued, falls back to Euclidean distance

# analysis.yaml
distances:
  use_pbc: true  # Default, can be omitted
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
from polyzymd.analysis import DistanceCalculator

# PBC enabled by default
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=True,  # Default, can be omitted
)

# Disable PBC (not recommended)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    use_pbc=False,
)

Trajectory Alignment

By default, trajectories are aligned to a reference structure before computing distances. This removes rotational drift and center-of-mass motion that can add noise to distance measurements.

Why alignment matters: MD simulations allow the entire system to rotate and translate. Without alignment, even a rigid protein will show larger fluctuations in inter-atomic distances due to this global motion.

Reference modes:

Mode

Description

Best for

centroid (default)

Align to most populated conformational cluster (K-Means)

General use

average

Align to mathematical average structure

Pure thermal fluctuation analysis

frame

Align to a specific frame number

Comparing to known functional conformation

Note

When alignment is performed, an INFO-level log message notifies you. This ensures you’re aware that trajectory coordinates have been modified in-memory.

# analysis.yaml
distances:
  align_trajectory: true  # Default
  alignment_mode: centroid  # Default
  alignment_selection: "protein and name CA"  # Default
  pairs:
    - label: "Ser77-His156"
      selection_a: "resid 77 and name OG"
      selection_b: "resid 156 and name NE2"
from polyzymd.analysis import DistanceCalculator
from polyzymd.analysis.core.alignment import AlignmentConfig

# Default: align to centroid using CA atoms
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
)

# Custom alignment: align to frame 500
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(
        reference_mode="frame",
        reference_frame=500,
        selection="protein and backbone",
    ),
)

# Disable alignment (not recommended for most analyses)
calc = DistanceCalculator(
    config=config,
    pairs=pairs,
    equilibration="10ns",
    alignment=AlignmentConfig(enabled=False),
)

Cache Invalidation

The result filename includes PBC and alignment settings, so changing these parameters automatically invalidates the cache:

distances_resid77_OG-resid133_NE2_eq10ns_pbc_align-centroid.json
distances_resid77_OG-resid133_NE2_eq10ns_nopbc_noalign.json

This means you can safely experiment with different settings without manually clearing cached results.

Troubleshooting

“Selection matched no atoms”

Cause: MDAnalysis selection doesn’t match any atoms in your topology.

Fix:

  • Check residue numbering in your PDB vs. MDAnalysis (0-indexed vs 1-indexed)

  • Verify atom names match your topology: protein and resid 77 to see available atoms

  • Use polyzymd --debug analyze distances ... for detailed selection diagnostics

Very wide distance distribution

Cause: The selected atoms may be flexible or the selection is too broad.

Fix:

  • Ensure selections resolve to single atoms (or use midpoint()/com())

  • Check that selection1 and selection2 are correctly specified

  • Visualize the selections in a molecular viewer

“Low statistical reliability” warning

Cause: Long correlation time relative to trajectory length.

This is informational, not an error. Results are still valid but uncertainties may be underestimated.

Mitigation:

  • Use multiple replicates (aggregated SEM is more reliable)

  • Run longer simulations

  • Results are still useful for qualitative comparisons

Missing replicate data

Message: Skipping replicate N: trajectory data not found

Cause: The requested replicate hasn’t completed or path is incorrect.

Fix: This is informational—analysis continues with available replicates. Check simulation status if unexpected.

Comparison with Catalytic Triad Analysis

Distance analysis and Catalytic Triad Analysis both measure atom-pair distances, but serve different purposes:

Feature

Distances

Catalytic Triad

Focus

Any atom pairs

Pre-defined catalytic geometry

Configuration

analysis.yaml or CLI

comparison.yaml with conditions

Multi-condition

Via compare run distances

Built-in condition comparison

Simultaneous contacts

Not computed

Key metric (all pairs < threshold)

Use case

Ad-hoc distance measurements

Structured enzyme comparisons

Tip

Use distances for exploratory analysis of specific interactions. Use catalytic triad when comparing enzyme integrity across conditions.

Comparing Distances Across Conditions

To statistically compare distances across multiple simulation conditions (e.g., different polymer compositions), use the compare run distances command:

# Add distances section to comparison.yaml, then:
polyzymd compare run distances -f comparison.yaml

This provides:

  • Dual-metric ranking: By mean distance (primary) and fraction below threshold (secondary)

  • Statistical tests: t-tests, Cohen’s d effect sizes, ANOVA

  • Per-pair summaries: Distance statistics for each defined pair

See Comparing Distances Across Conditions for full documentation.

Next Steps