Statistics Best Practices for MD Analysis

This guide explains the statistical methods used in PolyzyMD’s analysis module, following the LiveCoMS Best Practices for uncertainty quantification in molecular simulations (Grossfield et al., 2018).

Why Statistics Matter in MD 

Molecular dynamics trajectories are highly correlated in time. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results.

Warning

Ignoring correlation in MD analysis leads to:

Overconfident uncertainties (SEM too small by factors of 10-100×)
Biased variance estimates (RMSF, heat capacity systematically wrong)
Irreproducible conclusions (apparent significance that doesn’t replicate)

PolyzyMD automatically handles these issues using methods from the polyzymd.analysis.core.autocorrelation module.

Key Concepts 

Autocorrelation Function (ACF)

The autocorrelation function measures how correlated a signal is with itself at different time lags:

\[C(\tau) = \frac{\langle (x(t) - \mu)(x(t+\tau) - \mu) \rangle}{\sigma^2}\]

\(C(0) = 1\) (perfectly correlated with itself)
\(C(\tau) \to 0\) as \(\tau \to \infty\) (decorrelation at long times)

from polyzymd.analysis.core.autocorrelation import compute_acf

# Compute ACF from a timeseries (e.g., RMSD, distance)
acf_result = compute_acf(
    timeseries,
    timestep=10.0,      # ps between frames
    timestep_unit="ps",
)

# acf_result.acf contains the autocorrelation values
# acf_result.lags contains the time lags

Correlation Time (τ)

The correlation time τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent.

PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data):

\[\tau = \int_0^{\infty} C(t) \, dt\]

from polyzymd.analysis.core.autocorrelation import estimate_correlation_time

tau_result = estimate_correlation_time(
    acf_result,
    method="integration",  # Most robust for MD data
)

print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
print(f"Independent samples: {tau_result.n_independent}")

Statistical Inefficiency (g)

The statistical inefficiency quantifies how much correlation inflates variance estimates:

\[g = 1 + 2\sum_{t=1}^{N} C(t) \left(1 - \frac{t}{N}\right)\]

The effective number of independent samples is:

\[N_{\text{eff}} = \frac{N}{g}\]

Tip

A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics.

The Critical Distinction: Means vs. Variances 

This is the most important concept in MD statistics.

Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances).

First Moments: Use All Data, Correct Uncertainty 

For quantities that are averages (first moments):

Mean distance
Mean RMSD
Contact fraction
Average energy

Correlation affects precision but NOT accuracy.

The sample mean \(\bar{x}\) is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don’t provide as much independent information.

Correct approach: Use all frames, but correct the standard error:

\[\text{SEM} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} = \frac{\sigma}{\sqrt{N/g}}\]

This is exactly what PolyzyMD’s distance analysis does:

# From distances/calculator.py (lines 638-641)
# SEM = std / sqrt(n_independent)
if n_independent_frames > 0:
    sem_distance = float(std_dist / np.sqrt(n_independent_frames))

Second Moments: Subsample to Independent Frames 

For quantities that measure fluctuations (second moments):

RMSF (root mean square fluctuation)
Heat capacity (energy variance)
Compressibility (volume variance)
Order parameters from variance

Correlation introduces systematic BIAS, not just imprecision.

Correlated samples systematically underestimate the true variance because consecutive frames show the system in similar configurations.

Physical Intuition: The Pendulum Analogy

Imagine measuring the position variance of a swinging pendulum:

Sampling Rate	What You See	Apparent Variance
1000 Hz	Smooth continuous motion	Small (consecutive samples similar)
0.1 Hz	Independent snapshots at random phases	Correct (true swing amplitude)

The fast sampling doesn’t capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations.

Correct approach: Use only independent frames (spaced by ≥2τ):

# From rmsf/calculator.py (lines 322-327)
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

How PolyzyMD Implements This 

Distance Analysis 

The DistanceCalculator computes distances for all frames and uses autocorrelation to correct the uncertainty:

# Compute distances for every frame
for i, ts in enumerate(u.trajectory):
    dist = np.linalg.norm(pos2 - pos1)
    distances.append(dist)

# Compute autocorrelation
tau_result = estimate_correlation_time(distances_arr, ...)

# Correct SEM using effective sample size
sem_distance = std_dist / np.sqrt(n_independent_frames)

Why this works: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty.

RMSF Analysis 

The RMSFCalculator subsamples to independent frames before computing fluctuations:

# Compute RMSD timeseries for ACF
rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame)

# Estimate correlation time
tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq)

# Select only independent frames (spaced by 2τ)
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

# Compute RMSF using ONLY independent frames
rmsf_values = self._compute_rmsf(u, atoms, frame_indices)

Why this is necessary: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility.

Summary Table 

Quantity	Statistical Type	Effect of Correlation	PolyzyMD Approach
Mean distance	1st moment	Reduces precision (no bias)	All frames + corrected SEM
Contact fraction	1st moment	Reduces precision (no bias)	All frames + corrected SEM
RMSF	2nd moment	Introduces bias	Subsample to independent frames
Distance variance	2nd moment	Introduces bias	Subsample to independent frames

Practical Recommendations 

1. Check Your Correlation Time 

Always examine the correlation time relative to your trajectory length:

from polyzymd.analysis.core.autocorrelation import (
    compute_acf,
    estimate_correlation_time,
    check_statistical_reliability,
)

acf = compute_acf(observable, timestep=10.0, timestep_unit="ps")
tau = estimate_correlation_time(acf, n_frames=len(observable))

print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}")
print(f"Independent samples: {tau.n_independent}")
print(f"Statistically reliable: {tau.is_reliable}")

Warning

If n_independent < 10, your results may not be statistically reliable. Consider:

Running longer simulations
Using multiple independent replicates
Reporting results with appropriate caveats

2. Use Multiple Replicates 

Independent replicates provide truly uncorrelated samples. PolyzyMD’s aggregation functions properly combine replicates:

# Aggregate across 5 independent replicates
polyzymd analyze distances config.yaml \
    --replicates 1-5 \
    --equilibration 100ns

The aggregated SEM is computed from the variance across replicate means, which is statistically robust regardless of within-trajectory correlation.

3. Report Uncertainties Correctly 

Always report uncertainties that account for correlation:

Good Practice

“The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each).”

Poor Practice

“The mean distance was 3.42 ± 0.02 Å” (using naive SEM with N = 10,000 frames)

4. Understand What You’re Measuring 

Before running analysis, ask yourself:

Am I computing a mean or a variance?
What timescale does this process occur on?
Is my trajectory long enough to sample this process multiple times?

For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples.

LiveCoMS References 

This implementation follows the guidelines from:

Grossfield, A., et al. (2018). “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science, 1(1), 5067. DOI: 10.33011/livecoms.1.1.5067

Key sections:

Section 3.1: Uncertainty in averages (first moments)
Section 3.2: Uncertainty in fluctuations (second moments)
Section 4: Practical recommendations for MD analysis

Additional references:

Chodera, J. D., et al. (2007). “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations.” Journal of Chemical Theory and Computation, 3(1), 26-41.

Flyvbjerg, H., & Petersen, H. G. (1989). “Error estimates on averages of correlated data.” Journal of Chemical Physics, 91(1), 461-466.

API Reference 

The statistical functions are available in polyzymd.analysis.core.autocorrelation:

Function	Description
`compute_acf()`	Compute autocorrelation function via FFT
`estimate_correlation_time()`	Estimate τ from ACF or timeseries
`get_independent_indices()`	Get frame indices spaced by 2τ
`statistical_inefficiency()`	Compute g directly from timeseries
`statistical_inefficiency_multiple()`	Compute g from multiple timeseries
`n_effective()`	Compute N_eff = N / g
`check_statistical_reliability()`	Warn if N_eff < 10

See the Core Module documentation for full API details.