Statistics Best Practices for MD Analysis

This guide explains the statistical methods used in PolyzyMD’s analysis module, following the LiveCoMS Best Practices for uncertainty quantification in molecular simulations (Grossfield et al., 2018).

Why Statistics Matter in MD 

Molecular dynamics trajectories are highly correlated in time. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results.

Warning

Ignoring correlation in MD analysis leads to:

Overconfident uncertainties (SEM too small by factors of 10-100×)
Biased variance estimates (RMSF, heat capacity systematically wrong)
Irreproducible conclusions (apparent significance that doesn’t replicate)

PolyzyMD automatically handles these issues using methods from the polyzymd.analyses.shared.autocorrelation module.

Key Concepts 

Autocorrelation Function (ACF)

The autocorrelation function measures how correlated a signal is with itself at different time lags:

\[C(\tau) = \frac{\langle (x(t) - \mu)(x(t+\tau) - \mu) \rangle}{\sigma^2}\]

\(C(0) = 1\) (perfectly correlated with itself)
\(C(\tau) \to 0\) as \(\tau \to \infty\) (decorrelation at long times)

from polyzymd.analyses.shared.autocorrelation import compute_acf

# Compute ACF from a timeseries (e.g., RMSD, distance)
acf_result = compute_acf(
    timeseries,
    timestep=10.0,      # ps between frames
    timestep_unit="ps",
)

# acf_result.acf contains the autocorrelation values
# acf_result.lags contains the time lags

Correlation Time (τ)

The correlation time τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent.

PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data):

\[\tau = \int_0^{\infty} C(t) \, dt\]

from polyzymd.analyses.shared.autocorrelation import estimate_correlation_time

tau_result = estimate_correlation_time(
    acf_result,
    method="integration",  # Most robust for MD data
)

print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
print(f"Independent samples: {tau_result.n_independent}")

Statistical Inefficiency (g)

The statistical inefficiency quantifies how much correlation inflates variance estimates:

\[g = 1 + 2\sum_{t=1}^{N} C(t) \left(1 - \frac{t}{N}\right)\]

The effective number of independent samples is:

\[N_{\text{eff}} = \frac{N}{g}\]

Tip

A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics.

The Critical Distinction: Means vs. Variances 

This is the most important concept in MD statistics.

Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances).

First Moments: Use All Data, Correct Uncertainty 

For quantities that are averages (first moments):

Mean distance
Mean RMSD
Contact fraction
Average energy

Correlation affects precision but NOT accuracy.

The sample mean \(\bar{x}\) is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don’t provide as much independent information.

Correct approach: Use all frames, but correct the standard error:

\[\text{SEM} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} = \frac{\sigma}{\sqrt{N/g}}\]

This is exactly what PolyzyMD’s distances plugin does internally in compute_replicate():

# From analyses/distances/__init__.py (DistanceCalculator internals)
# SEM = std / sqrt(n_independent)
if n_independent_frames > 0:
    sem_distance = float(std_dist / np.sqrt(n_independent_frames))

Second Moments: Subsample to Independent Frames 

For quantities that measure fluctuations (second moments):

RMSF (root mean square fluctuation)
Heat capacity (energy variance)
Compressibility (volume variance)
Order parameters from variance

Correlation introduces systematic BIAS, not just imprecision.

Correlated samples systematically underestimate the true variance because consecutive frames show the system in similar configurations.

Physical Intuition: The Pendulum Analogy

Imagine measuring the position variance of a swinging pendulum:

Sampling Rate	What You See	Apparent Variance
1000 Hz	Smooth continuous motion	Small (consecutive samples similar)
0.1 Hz	Independent snapshots at random phases	Correct (true swing amplitude)

The fast sampling doesn’t capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations.

Correct approach: Use only independent frames (spaced by ≥2τ):

# From analyses/rmsf plugin internals
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

How PolyzyMD Implements This 

Distance analysis 

The distances plugin computes distances for all frames in compute_replicate() and uses autocorrelation to correct the uncertainty:

# Compute distances for every frame
for i, ts in enumerate(u.trajectory):
    dist = np.linalg.norm(pos2 - pos1)
    distances.append(dist)

# Compute autocorrelation
tau_result = estimate_correlation_time(distances_arr, ...)

# Correct SEM using effective sample size
sem_distance = std_dist / np.sqrt(n_independent_frames)

Why this works: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty.

RMSF analysis 

The RMSF plugin subsamples to independent frames in compute_replicate() before computing fluctuations:

# Compute RMSD timeseries for ACF
rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame)

# Estimate correlation time
tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq)

# Select only independent frames (spaced by 2τ)
frame_indices = get_independent_indices(
    n_frames=n_frames_total,
    correlation_time=correlation_time,
    timestep=timestep,
    start_frame=start_frame,
)

# Compute RMSF using ONLY independent frames
rmsf_values = self._compute_rmsf(u, atoms, frame_indices)

Why this is necessary: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility.

Summary Table 

Quantity	Statistical Type	Effect of Correlation	PolyzyMD Approach
Mean distance	1st moment	Reduces precision (no bias)	All frames + corrected SEM
Contact fraction	1st moment	Reduces precision (no bias)	All frames + corrected SEM
RMSF	2nd moment	Introduces bias	Subsample to independent frames
Distance variance	2nd moment	Introduces bias	Subsample to independent frames

Practical Recommendations 

1. Check Your Correlation Time 

Always examine the correlation time relative to your trajectory length:

from polyzymd.analyses.shared.autocorrelation import (
    compute_acf,
    estimate_correlation_time,
    check_statistical_reliability,
)

acf = compute_acf(observable, timestep=10.0, timestep_unit="ps")
tau = estimate_correlation_time(acf, n_frames=len(observable))

print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}")
print(f"Independent samples: {tau.n_independent}")
print(f"Statistically reliable: {tau.is_reliable}")

Warning

If n_independent < 10, your results may not be statistically reliable. Consider:

Running longer simulations
Using multiple independent replicates
Reporting results with appropriate caveats

2. Use Multiple Replicates 

Independent replicates provide truly uncorrelated samples. PolyzyMD’s aggregation functions properly combine replicates:

# Aggregate across configured independent replicates
polyzymd compare run distances -f comparison.yaml --eq-time 100ns

Replicates are configured per condition in comparison.yaml.

The aggregated SEM is computed from the variance across replicate means, which is statistically robust regardless of within-trajectory correlation.

3. Report Uncertainties Correctly 

Always report uncertainties that account for correlation:

Good Practice

“The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each).”

Poor Practice

“The mean distance was 3.42 ± 0.02 Å” (using naive SEM with N = 10,000 frames)

4. Understand What You’re Measuring 

Before running analysis, ask yourself:

Am I computing a mean or a variance?
What timescale does this process occur on?
Is my trajectory long enough to sample this process multiple times?

For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples.

Multiple Comparison Correction 

Why It Matters 

When comparing N conditions pairwise, the number of hypothesis tests grows as N(N−1)/2. With 5 conditions this is already 10 tests; with 10 conditions, 45 tests. Without correction, the probability of at least one false positive (the familywise error rate) increases rapidly — even when every null hypothesis is true.

Benjamini-Hochberg Procedure 

When posthoc_method is "ttest_bh" (the default), PolyzyMD uses the Benjamini-Hochberg (BH) step-up procedure to control the false discovery rate (FDR) — the expected proportion of false positives among rejected hypotheses. The algorithm:

Rank all m p-values in ascending order: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
For each rank k, compare \(p_{(k)}\) to the threshold \((k / m) \times \alpha\).
Find the largest k where \(p_{(k)} \le (k / m) \times \alpha\).
Reject all hypotheses with rank ≤ k.

Adjusted p-values are computed with step-down monotonicity enforcement so that \(p_{\text{adj},(k)} \le p_{\text{adj},(k+1)}\) always holds.

How PolyzyMD Applies BH Correction 

Each plugin defines its own hypothesis family:

Contacts — All pairwise t-test p-values (both coverage and contact_fraction metrics) form one hypothesis family. ANOVA p-values form a separate small family (typically 2 tests).
Binding free energy — Same-temperature pairwise entries form one family per temperature group.
Polymer affinity — Same pattern as binding free energy: one family per temperature group.

Tukey’s HSD Alternative 

When posthoc_method is "tukey_hsd", PolyzyMD uses Tukey’s Honestly Significant Difference test instead of pairwise t-tests. Tukey HSD tests all pairs simultaneously using a single studentized range distribution, controlling the family-wise error rate (FWER) rather than FDR.

Key differences from BH-corrected t-tests:

Tukey HSD is best suited for balanced designs (equal replicates per condition).
It assumes equal variance across conditions.
No separate FDR correction is needed — Tukey p-values are already family-wise corrected.
The p_value_adjusted field mirrors p_value for Tukey results.
The t_statistic field is NaN for Tukey results (the test uses the studentized range distribution, not a t-distribution).

Choose Tukey HSD when all conditions are equally important and sample sizes are similar. Choose BH-corrected t-tests when you have specific pairs of interest or heterogeneous sample sizes.

The `fdr_alpha` Parameter 

fdr_alpha is configured in the defaults: block of comparison.yaml (applies to all plugins using the default comparison pipeline) or per plugin in the plugins: block for plugins with their own fdr_alpha setting. The default is 0.05.

When posthoc_method is "ttest_bh", fdr_alpha controls the BH false-discovery-rate threshold. When posthoc_method is "tukey_hsd", fdr_alpha is used as the family-wise alpha threshold for determining significance. It is also used as the ANOVA significance threshold regardless of post-hoc method.

Lowering the value (e.g., 0.01) makes the correction more conservative; raising it allows more discoveries at the cost of a higher expected false positive rate.

defaults:
  fdr_alpha: 0.01  # More conservative than default (applies to all plugins)
  posthoc_method: "ttest_bh"  # or "tukey_hsd"

plugins:
  contacts:
    fdr_alpha: 0.01  # Per-plugin override (takes precedence for this plugin)

Raw vs Adjusted p-Values 

Both raw and BH-adjusted p-values are shown in formatted output. The significant field in saved JSON and the significance markers (*, **, ***) in CLI tables are based on the adjusted value. This ensures that reported significance accounts for the number of comparisons performed.

Relationship to Within-Trajectory Statistics 

FDR correction and autocorrelation handling address different sources of statistical error:

Problem	Scope	PolyzyMD Solution
Autocorrelation	Within a single trajectory — consecutive frames are not independent	Correct SEM via \(N_{\text{eff}}\) or subsample by 2τ (see earlier sections)
Multiple comparisons	Across conditions — many hypothesis tests inflate false positives	BH correction on pairwise p-values (`ttest_bh`) or Tukey HSD FWER control (`tukey_hsd`)

Both corrections are important. Autocorrelation handling ensures that the per-condition uncertainty is not artificially small; FDR correction ensures that the cross-condition significance claims are not inflated by the number of tests performed.

For the full field-level reference of post-hoc output (methods, configuration keys, output JSON fields, and edge cases), see Post-Hoc Testing Reference.

LiveCoMS References 

This implementation follows the guidelines from:

Grossfield, A., et al. (2018). “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science, 1(1), 5067. DOI: 10.33011/livecoms.1.1.5067

Key sections:

Section 3.1: Uncertainty in averages (first moments)
Section 3.2: Uncertainty in fluctuations (second moments)
Section 4: Practical recommendations for MD analysis

Additional references:

Chodera, J. D., et al. (2007). “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations.” Journal of Chemical Theory and Computation, 3(1), 26-41.

Flyvbjerg, H., & Petersen, H. G. (1989). “Error estimates on averages of correlated data.” Journal of Chemical Physics, 91(1), 461-466.

API Reference 

The statistical functions are available in polyzymd.analyses.shared.autocorrelation:

Function	Description
`compute_acf()`	Compute autocorrelation function via FFT
`estimate_correlation_time()`	Estimate τ from ACF or timeseries
`get_independent_indices()`	Get frame indices spaced by 2τ
`statistical_inefficiency()`	Compute g directly from timeseries
`statistical_inefficiency_multiple()`	Compute g from multiple timeseries
`n_effective()`	Compute N_eff = N / g
`check_statistical_reliability()`	Warn if N_eff < 10

See the Core Module documentation for full API details.