Statistics Best Practices for MD Analysis
This guide explains the statistical methods used in PolyzyMD’s analysis module, following the LiveCoMS Best Practices for uncertainty quantification in molecular simulations (Grossfield et al., 2018).
Why Statistics Matter in MD
Molecular dynamics trajectories are highly correlated in time. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results.
Warning
Ignoring correlation in MD analysis leads to:
Overconfident uncertainties (SEM too small by factors of 10-100×)
Biased variance estimates (RMSF, heat capacity systematically wrong)
Irreproducible conclusions (apparent significance that doesn’t replicate)
PolyzyMD automatically handles these issues using methods from the
polyzymd.analyses.shared.autocorrelation module.
Key Concepts
Autocorrelation Function (ACF)
The autocorrelation function measures how correlated a signal is with itself at different time lags:
\(C(0) = 1\) (perfectly correlated with itself)
\(C(\tau) \to 0\) as \(\tau \to \infty\) (decorrelation at long times)
from polyzymd.analyses.shared.autocorrelation import compute_acf
# Compute ACF from a timeseries (e.g., RMSD, distance)
acf_result = compute_acf(
timeseries,
timestep=10.0, # ps between frames
timestep_unit="ps",
)
# acf_result.acf contains the autocorrelation values
# acf_result.lags contains the time lags
Correlation Time (τ)
The correlation time τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent.
PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data):
from polyzymd.analyses.shared.autocorrelation import estimate_correlation_time
tau_result = estimate_correlation_time(
acf_result,
method="integration", # Most robust for MD data
)
print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
print(f"Independent samples: {tau_result.n_independent}")
Statistical Inefficiency (g)
The statistical inefficiency quantifies how much correlation inflates variance estimates:
The effective number of independent samples is:
Tip
A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics.
The Critical Distinction: Means vs. Variances
This is the most important concept in MD statistics.
Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances).
First Moments: Use All Data, Correct Uncertainty
For quantities that are averages (first moments):
Mean distance
Mean RMSD
Contact fraction
Average energy
Correlation affects precision but NOT accuracy.
The sample mean \(\bar{x}\) is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don’t provide as much independent information.
Correct approach: Use all frames, but correct the standard error:
This is exactly what PolyzyMD’s distances plugin does internally in
compute_replicate():
# From analyses/distances/__init__.py (DistanceCalculator internals)
# SEM = std / sqrt(n_independent)
if n_independent_frames > 0:
sem_distance = float(std_dist / np.sqrt(n_independent_frames))
Second Moments: Subsample to Independent Frames
For quantities that measure fluctuations (second moments):
RMSF (root mean square fluctuation)
Heat capacity (energy variance)
Compressibility (volume variance)
Order parameters from variance
Correlation introduces systematic BIAS, not just imprecision.
Correlated samples systematically underestimate the true variance because consecutive frames show the system in similar configurations.
Physical Intuition: The Pendulum Analogy
Imagine measuring the position variance of a swinging pendulum:
Sampling Rate |
What You See |
Apparent Variance |
|---|---|---|
1000 Hz |
Smooth continuous motion |
Small (consecutive samples similar) |
0.1 Hz |
Independent snapshots at random phases |
Correct (true swing amplitude) |
The fast sampling doesn’t capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations.
Correct approach: Use only independent frames (spaced by ≥2τ):
# From analyses/rmsf plugin internals
frame_indices = get_independent_indices(
n_frames=n_frames_total,
correlation_time=correlation_time,
timestep=timestep,
start_frame=start_frame,
)
How PolyzyMD Implements This
Distance analysis
The distances plugin computes distances for all frames in
compute_replicate() and uses autocorrelation to correct the uncertainty:
# Compute distances for every frame
for i, ts in enumerate(u.trajectory):
dist = np.linalg.norm(pos2 - pos1)
distances.append(dist)
# Compute autocorrelation
tau_result = estimate_correlation_time(distances_arr, ...)
# Correct SEM using effective sample size
sem_distance = std_dist / np.sqrt(n_independent_frames)
Why this works: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty.
RMSF analysis
The RMSF plugin subsamples to independent frames in compute_replicate()
before computing fluctuations:
# Compute RMSD timeseries for ACF
rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame)
# Estimate correlation time
tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq)
# Select only independent frames (spaced by 2τ)
frame_indices = get_independent_indices(
n_frames=n_frames_total,
correlation_time=correlation_time,
timestep=timestep,
start_frame=start_frame,
)
# Compute RMSF using ONLY independent frames
rmsf_values = self._compute_rmsf(u, atoms, frame_indices)
Why this is necessary: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility.
Summary Table
Quantity |
Statistical Type |
Effect of Correlation |
PolyzyMD Approach |
|---|---|---|---|
Mean distance |
1st moment |
Reduces precision (no bias) |
All frames + corrected SEM |
Contact fraction |
1st moment |
Reduces precision (no bias) |
All frames + corrected SEM |
RMSF |
2nd moment |
Introduces bias |
Subsample to independent frames |
Distance variance |
2nd moment |
Introduces bias |
Subsample to independent frames |
Practical Recommendations
1. Check Your Correlation Time
Always examine the correlation time relative to your trajectory length:
from polyzymd.analyses.shared.autocorrelation import (
compute_acf,
estimate_correlation_time,
check_statistical_reliability,
)
acf = compute_acf(observable, timestep=10.0, timestep_unit="ps")
tau = estimate_correlation_time(acf, n_frames=len(observable))
print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}")
print(f"Independent samples: {tau.n_independent}")
print(f"Statistically reliable: {tau.is_reliable}")
Warning
If n_independent < 10, your results may not be statistically reliable.
Consider:
Running longer simulations
Using multiple independent replicates
Reporting results with appropriate caveats
2. Use Multiple Replicates
Independent replicates provide truly uncorrelated samples. PolyzyMD’s aggregation functions properly combine replicates:
# Aggregate across configured independent replicates
polyzymd compare run distances -f comparison.yaml --eq-time 100ns
Replicates are configured per condition in comparison.yaml.
The aggregated SEM is computed from the variance across replicate means, which is statistically robust regardless of within-trajectory correlation.
3. Report Uncertainties Correctly
Always report uncertainties that account for correlation:
Good Practice
“The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each).”
Poor Practice
“The mean distance was 3.42 ± 0.02 Å” (using naive SEM with N = 10,000 frames)
4. Understand What You’re Measuring
Before running analysis, ask yourself:
Am I computing a mean or a variance?
What timescale does this process occur on?
Is my trajectory long enough to sample this process multiple times?
For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples.
Multiple Comparison Correction
Why It Matters
When comparing N conditions pairwise, the number of hypothesis tests grows as N(N−1)/2. With 5 conditions this is already 10 tests; with 10 conditions, 45 tests. Without correction, the probability of at least one false positive (the familywise error rate) increases rapidly — even when every null hypothesis is true.
Benjamini-Hochberg Procedure
When posthoc_method is "ttest_bh" (the default), PolyzyMD uses the
Benjamini-Hochberg (BH) step-up procedure to control the false discovery
rate (FDR) — the expected proportion of false positives among rejected
hypotheses. The algorithm:
Rank all m p-values in ascending order: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
For each rank k, compare \(p_{(k)}\) to the threshold \((k / m) \times \alpha\).
Find the largest k where \(p_{(k)} \le (k / m) \times \alpha\).
Reject all hypotheses with rank ≤ k.
Adjusted p-values are computed with step-down monotonicity enforcement so that \(p_{\text{adj},(k)} \le p_{\text{adj},(k+1)}\) always holds.
How PolyzyMD Applies BH Correction
Each plugin defines its own hypothesis family:
Contacts — All pairwise t-test p-values (both coverage and contact_fraction metrics) form one hypothesis family. ANOVA p-values form a separate small family (typically 2 tests).
Binding free energy — Same-temperature pairwise entries form one family per temperature group.
Polymer affinity — Same pattern as binding free energy: one family per temperature group.
Tukey’s HSD Alternative
When posthoc_method is "tukey_hsd", PolyzyMD uses Tukey’s Honestly
Significant Difference test instead of pairwise t-tests. Tukey HSD tests all
pairs simultaneously using a single studentized range distribution, controlling
the family-wise error rate (FWER) rather than FDR.
Key differences from BH-corrected t-tests:
Tukey HSD is best suited for balanced designs (equal replicates per condition).
It assumes equal variance across conditions.
No separate FDR correction is needed — Tukey p-values are already family-wise corrected.
The
p_value_adjustedfield mirrorsp_valuefor Tukey results.The
t_statisticfield isNaNfor Tukey results (the test uses the studentized range distribution, not a t-distribution).
Choose Tukey HSD when all conditions are equally important and sample sizes are similar. Choose BH-corrected t-tests when you have specific pairs of interest or heterogeneous sample sizes.
The fdr_alpha Parameter
fdr_alpha is configured in the defaults: block of comparison.yaml
(applies to all plugins using the default comparison pipeline) or per plugin
in the plugins: block for plugins with their own fdr_alpha setting. The
default is 0.05.
When posthoc_method is "ttest_bh", fdr_alpha controls the BH
false-discovery-rate threshold. When posthoc_method is "tukey_hsd",
fdr_alpha is used as the family-wise alpha threshold for determining
significance. It is also used as the ANOVA significance threshold regardless
of post-hoc method.
Lowering the value (e.g., 0.01) makes the correction more conservative; raising it allows more discoveries at the cost of a higher expected false positive rate.
defaults:
fdr_alpha: 0.01 # More conservative than default (applies to all plugins)
posthoc_method: "ttest_bh" # or "tukey_hsd"
plugins:
contacts:
fdr_alpha: 0.01 # Per-plugin override (takes precedence for this plugin)
Raw vs Adjusted p-Values
Both raw and BH-adjusted p-values are shown in formatted output. The
significant field in saved JSON and the significance markers (*, **,
***) in CLI tables are based on the adjusted value. This ensures that
reported significance accounts for the number of comparisons performed.
Relationship to Within-Trajectory Statistics
FDR correction and autocorrelation handling address different sources of statistical error:
Problem |
Scope |
PolyzyMD Solution |
|---|---|---|
Autocorrelation |
Within a single trajectory — consecutive frames are not independent |
Correct SEM via \(N_{\text{eff}}\) or subsample by 2τ (see earlier sections) |
Multiple comparisons |
Across conditions — many hypothesis tests inflate false positives |
BH correction on pairwise p-values ( |
Both corrections are important. Autocorrelation handling ensures that the per-condition uncertainty is not artificially small; FDR correction ensures that the cross-condition significance claims are not inflated by the number of tests performed.
For the full field-level reference of post-hoc output (methods, configuration keys, output JSON fields, and edge cases), see Post-Hoc Testing Reference.
LiveCoMS References
This implementation follows the guidelines from:
Grossfield, A., et al. (2018). “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science, 1(1), 5067. DOI: 10.33011/livecoms.1.1.5067
Key sections:
Section 3.1: Uncertainty in averages (first moments)
Section 3.2: Uncertainty in fluctuations (second moments)
Section 4: Practical recommendations for MD analysis
Additional references:
Chodera, J. D., et al. (2007). “Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations.” Journal of Chemical Theory and Computation, 3(1), 26-41.
Flyvbjerg, H., & Petersen, H. G. (1989). “Error estimates on averages of correlated data.” Journal of Chemical Physics, 91(1), 461-466.
API Reference
The statistical functions are available in polyzymd.analyses.shared.autocorrelation:
Function |
Description |
|---|---|
|
Compute autocorrelation function via FFT |
|
Estimate τ from ACF or timeseries |
|
Get frame indices spaced by 2τ |
|
Compute g directly from timeseries |
|
Compute g from multiple timeseries |
|
Compute N_eff = N / g |
|
Warn if N_eff < 10 |
See the Core Module documentation for full API details.