Statistics Best Practices for MD Analysis

This guide explains the statistical methods used in PolyzyMD’s analysis module, following the LiveCoMS Best Practices for uncertainty quantification in molecular simulations (Grossfield et al., 2018).

Why Statistics Matter in MD

Molecular dynamics trajectories are highly correlated in time. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results.

Warning

Ignoring correlation in MD analysis leads to:

  • Overconfident uncertainties (SEM too small by factors of 10-100×)

  • Biased variance estimates (RMSF, heat capacity systematically wrong)

  • Irreproducible conclusions (apparent significance that doesn’t replicate)

PolyzyMD automatically handles these issues using methods from the polyzymd.analysis.core.autocorrelation module.


Key Concepts

Autocorrelation Function (ACF)

The autocorrelation function measures how correlated a signal is with itself at different time lags:

\[C(\tau) = \frac{\langle (x(t) - \mu)(x(t+\tau) - \mu) \rangle}{\sigma^2}\]
  • \(C(0) = 1\) (perfectly correlated with itself)

  • \(C(\tau) \to 0\) as \(\tau \to \infty\) (decorrelation at long times)

from polyzymd.analysis.core.autocorrelation import compute_acf

# Compute ACF from a timeseries (e.g., RMSD, distance)
acf_result = compute_acf(
    timeseries,
    timestep=10.0,      # ps between frames
    timestep_unit="ps",
)

# acf_result.acf contains the autocorrelation values
# acf_result.lags contains the time lags

Correlation Time (τ)

The correlation time τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent.

PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data):

\[\tau = \int_0^{\infty} C(t) \, dt\]
from polyzymd.analysis.core.autocorrelation import estimate_correlation_time

tau_result = estimate_correlation_time(
    acf_result,
    method="integration",  # Most robust for MD data
)

print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
print(f"Independent samples: {tau_result.n_independent}")

Statistical Inefficiency (g)

The statistical inefficiency quantifies how much correlation inflates variance estimates:

\[g = 1 + 2\sum_{t=1}^{N} C(t) \left(1 - \frac{t}{N}\right)\]

The effective number of independent samples is:

\[N_{\text{eff}} = \frac{N}{g}\]

Tip

A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics.


The Critical Distinction: Means vs. Variances

This is the most important concept in MD statistics.

Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances).

First Moments: Use All Data, Correct Uncertainty

For quantities that are averages (first moments):

  • Mean distance

  • Mean RMSD

  • Contact fraction

  • Average energy

Correlation affects precision but NOT accuracy.

The sample mean \(\bar{x}\) is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don’t provide as much independent information.

Correct approach: Use all frames, but correct the standard error:

\[\text{SEM} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} = \frac{\sigma}{\sqrt{N/g}}\]

This is exactly what PolyzyMD’s distance analysis does:

# From distances/calculator.py (lines 638-641)
# SEM = std / sqrt(n_independent)
if n_independent_frames > 0:
    sem_distance = float(std_dist / np.sqrt(n_independent_frames))

Second Moments: Subsample to Independent Frames

For quantities that measure fluctuations (second moments):

  • RMSF (root mean square fluctuation)

  • Heat capacity (energy variance)

  • Compressibility (volume variance)

  • Order parameters from variance

Correlation introduces systematic BIAS, not just imprecision.

Correlated samples systematically underestimate the true variance because consecutive frames show the system in similar configurations.

Physical Intuition: The Pendulum Analogy

Imagine measuring the position variance of a swinging pendulum:

Sampling Rate

What You See

Apparent Variance

1000 Hz

Smooth continuous motion

Small (consecutive samples similar)

0.1 Hz

Independent snapshots at random phases

Correct (true swing amplitude)

The fast sampling doesn’t capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations.

Correct approach: Use only independent frames (spaced by ≥2τ):

# From rmsf/calculator.py (lines 322-327)
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

How PolyzyMD Implements This

Distance Analysis

The DistanceCalculator computes distances for all frames and uses autocorrelation to correct the uncertainty:

# Compute distances for every frame
for i, ts in enumerate(u.trajectory):
    dist = np.linalg.norm(pos2 - pos1)
    distances.append(dist)

# Compute autocorrelation
tau_result = estimate_correlation_time(distances_arr, ...)

# Correct SEM using effective sample size
sem_distance = std_dist / np.sqrt(n_independent_frames)

Why this works: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty.

RMSF Analysis

The RMSFCalculator subsamples to independent frames before computing fluctuations:

# Compute RMSD timeseries for ACF
rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame)

# Estimate correlation time
tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq)

# Select only independent frames (spaced by 2τ)
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

# Compute RMSF using ONLY independent frames
rmsf_values = self._compute_rmsf(u, atoms, frame_indices)

Why this is necessary: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility.


Summary Table

Quantity

Statistical Type

Effect of Correlation

PolyzyMD Approach

Mean distance

1st moment

Reduces precision (no bias)

All frames + corrected SEM

Contact fraction

1st moment

Reduces precision (no bias)

All frames + corrected SEM

RMSF

2nd moment

Introduces bias

Subsample to independent frames

Distance variance

2nd moment

Introduces bias

Subsample to independent frames


Practical Recommendations

1. Check Your Correlation Time

Always examine the correlation time relative to your trajectory length:

from polyzymd.analysis.core.autocorrelation import (
    compute_acf,
    estimate_correlation_time,
    check_statistical_reliability,
)

acf = compute_acf(observable, timestep=10.0, timestep_unit="ps")
tau = estimate_correlation_time(acf, n_frames=len(observable))

print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}")
print(f"Independent samples: {tau.n_independent}")
print(f"Statistically reliable: {tau.is_reliable}")

Warning

If n_independent < 10, your results may not be statistically reliable. Consider:

  1. Running longer simulations

  2. Using multiple independent replicates

  3. Reporting results with appropriate caveats

2. Use Multiple Replicates

Independent replicates provide truly uncorrelated samples. PolyzyMD’s aggregation functions properly combine replicates:

# Aggregate across 5 independent replicates
polyzymd analyze distances config.yaml \
    --replicates 1-5 \
    --equilibration 100ns

The aggregated SEM is computed from the variance across replicate means, which is statistically robust regardless of within-trajectory correlation.

3. Report Uncertainties Correctly

Always report uncertainties that account for correlation:

Good Practice

“The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each).”

Poor Practice

“The mean distance was 3.42 ± 0.02 Å” (using naive SEM with N = 10,000 frames)

4. Understand What You’re Measuring

Before running analysis, ask yourself:

  1. Am I computing a mean or a variance?

  2. What timescale does this process occur on?

  3. Is my trajectory long enough to sample this process multiple times?

For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples.


LiveCoMS References

This implementation follows the guidelines from:

Grossfield, A., et al. (2018). “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science, 1(1), 5067. DOI: 10.33011/livecoms.1.1.5067

Key sections:

  • Section 3.1: Uncertainty in averages (first moments)

  • Section 3.2: Uncertainty in fluctuations (second moments)

  • Section 4: Practical recommendations for MD analysis

Additional references:

Chodera, J. D., et al. (2007). “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations.” Journal of Chemical Theory and Computation, 3(1), 26-41.

Flyvbjerg, H., & Petersen, H. G. (1989). “Error estimates on averages of correlated data.” Journal of Chemical Physics, 91(1), 461-466.


API Reference

The statistical functions are available in polyzymd.analysis.core.autocorrelation:

Function

Description

compute_acf()

Compute autocorrelation function via FFT

estimate_correlation_time()

Estimate τ from ACF or timeseries

get_independent_indices()

Get frame indices spaced by 2τ

statistical_inefficiency()

Compute g directly from timeseries

statistical_inefficiency_multiple()

Compute g from multiple timeseries

n_effective()

Compute N_eff = N / g

check_statistical_reliability()

Warn if N_eff < 10

See the Core Module documentation for full API details.