# Statistics Best Practices for MD Analysis This guide explains the statistical methods used in PolyzyMD's analysis module, following the **LiveCoMS Best Practices** for uncertainty quantification in molecular simulations (Grossfield et al., 2018). ```{contents} On This Page :local: :depth: 2 ``` ## Why Statistics Matter in MD Molecular dynamics trajectories are **highly correlated in time**. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results. ```{warning} Ignoring correlation in MD analysis leads to: - **Overconfident uncertainties** (SEM too small by factors of 10-100×) - **Biased variance estimates** (RMSF, heat capacity systematically wrong) - **Irreproducible conclusions** (apparent significance that doesn't replicate) ``` PolyzyMD automatically handles these issues using methods from the `polyzymd.analysis.core.autocorrelation` module. --- ## Key Concepts ### Autocorrelation Function (ACF) The autocorrelation function measures how correlated a signal is with itself at different time lags: $$C(\tau) = \frac{\langle (x(t) - \mu)(x(t+\tau) - \mu) \rangle}{\sigma^2}$$ - $C(0) = 1$ (perfectly correlated with itself) - $C(\tau) \to 0$ as $\tau \to \infty$ (decorrelation at long times) ```python from polyzymd.analysis.core.autocorrelation import compute_acf # Compute ACF from a timeseries (e.g., RMSD, distance) acf_result = compute_acf( timeseries, timestep=10.0, # ps between frames timestep_unit="ps", ) # acf_result.acf contains the autocorrelation values # acf_result.lags contains the time lags ``` ### Correlation Time (τ) The **correlation time** τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent. PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data): $$\tau = \int_0^{\infty} C(t) \, dt$$ ```python from polyzymd.analysis.core.autocorrelation import estimate_correlation_time tau_result = estimate_correlation_time( acf_result, method="integration", # Most robust for MD data ) print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}") print(f"Independent samples: {tau_result.n_independent}") ``` ### Statistical Inefficiency (g) The **statistical inefficiency** quantifies how much correlation inflates variance estimates: $$g = 1 + 2\sum_{t=1}^{N} C(t) \left(1 - \frac{t}{N}\right)$$ The effective number of independent samples is: $$N_{\text{eff}} = \frac{N}{g}$$ ```{tip} A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics. ``` --- ## The Critical Distinction: Means vs. Variances **This is the most important concept in MD statistics.** Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances). ### First Moments: Use All Data, Correct Uncertainty For quantities that are **averages** (first moments): - Mean distance - Mean RMSD - Contact fraction - Average energy **Correlation affects precision but NOT accuracy.** The sample mean $\bar{x}$ is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don't provide as much *independent* information. **Correct approach**: Use all frames, but correct the standard error: $$\text{SEM} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} = \frac{\sigma}{\sqrt{N/g}}$$ This is exactly what PolyzyMD's distance analysis does: ```python # From distances/calculator.py (lines 638-641) # SEM = std / sqrt(n_independent) if n_independent_frames > 0: sem_distance = float(std_dist / np.sqrt(n_independent_frames)) ``` ### Second Moments: Subsample to Independent Frames For quantities that measure **fluctuations** (second moments): - RMSF (root mean square fluctuation) - Heat capacity (energy variance) - Compressibility (volume variance) - Order parameters from variance **Correlation introduces systematic BIAS, not just imprecision.** Correlated samples systematically **underestimate** the true variance because consecutive frames show the system in similar configurations. #### Physical Intuition: The Pendulum Analogy Imagine measuring the position variance of a swinging pendulum: | Sampling Rate | What You See | Apparent Variance | |--------------|--------------|-------------------| | 1000 Hz | Smooth continuous motion | **Small** (consecutive samples similar) | | 0.1 Hz | Independent snapshots at random phases | **Correct** (true swing amplitude) | The fast sampling doesn't capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations. **Correct approach**: Use only independent frames (spaced by ≥2τ): ```python # From rmsf/calculator.py (lines 322-327) frame_indices = get_independent_indices( n_frames=n_frames_total, correlation_time=correlation_time, timestep=timestep, start_frame=start_frame, ) ``` --- ## How PolyzyMD Implements This ### Distance Analysis The `DistanceCalculator` computes distances for **all frames** and uses autocorrelation to correct the uncertainty: ```python # Compute distances for every frame for i, ts in enumerate(u.trajectory): dist = np.linalg.norm(pos2 - pos1) distances.append(dist) # Compute autocorrelation tau_result = estimate_correlation_time(distances_arr, ...) # Correct SEM using effective sample size sem_distance = std_dist / np.sqrt(n_independent_frames) ``` **Why this works**: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty. ### RMSF Analysis The `RMSFCalculator` **subsamples** to independent frames before computing fluctuations: ```python # Compute RMSD timeseries for ACF rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame) # Estimate correlation time tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq) # Select only independent frames (spaced by 2τ) frame_indices = get_independent_indices( n_frames=n_frames_total, correlation_time=correlation_time, timestep=timestep, start_frame=start_frame, ) # Compute RMSF using ONLY independent frames rmsf_values = self._compute_rmsf(u, atoms, frame_indices) ``` **Why this is necessary**: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility. --- ## Summary Table | Quantity | Statistical Type | Effect of Correlation | PolyzyMD Approach | |----------|-----------------|----------------------|-------------------| | Mean distance | 1st moment | Reduces precision (no bias) | All frames + corrected SEM | | Contact fraction | 1st moment | Reduces precision (no bias) | All frames + corrected SEM | | RMSF | 2nd moment | **Introduces bias** | Subsample to independent frames | | Distance variance | 2nd moment | **Introduces bias** | Subsample to independent frames | --- ## Practical Recommendations ### 1. Check Your Correlation Time Always examine the correlation time relative to your trajectory length: ```python from polyzymd.analysis.core.autocorrelation import ( compute_acf, estimate_correlation_time, check_statistical_reliability, ) acf = compute_acf(observable, timestep=10.0, timestep_unit="ps") tau = estimate_correlation_time(acf, n_frames=len(observable)) print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}") print(f"Independent samples: {tau.n_independent}") print(f"Statistically reliable: {tau.is_reliable}") ``` ```{warning} If `n_independent < 10`, your results may not be statistically reliable. Consider: 1. Running longer simulations 2. Using multiple independent replicates 3. Reporting results with appropriate caveats ``` ### 2. Use Multiple Replicates Independent replicates provide truly uncorrelated samples. PolyzyMD's aggregation functions properly combine replicates: ```bash # Aggregate across 5 independent replicates polyzymd analyze distances config.yaml \ --replicates 1-5 \ --equilibration 100ns ``` The aggregated SEM is computed from the **variance across replicate means**, which is statistically robust regardless of within-trajectory correlation. ### 3. Report Uncertainties Correctly Always report uncertainties that account for correlation: ```{admonition} Good Practice :class: tip "The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each)." ``` ```{admonition} Poor Practice :class: warning "The mean distance was 3.42 ± 0.02 Å" (using naive SEM with N = 10,000 frames) ``` ### 4. Understand What You're Measuring Before running analysis, ask yourself: 1. **Am I computing a mean or a variance?** 2. **What timescale does this process occur on?** 3. **Is my trajectory long enough to sample this process multiple times?** For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples. --- ## LiveCoMS References This implementation follows the guidelines from: > **Grossfield, A., et al.** (2018). "Best Practices for Quantification of > Uncertainty and Sampling Quality in Molecular Simulations." > *Living Journal of Computational Molecular Science*, 1(1), 5067. > [DOI: 10.33011/livecoms.1.1.5067](https://doi.org/10.33011/livecoms.1.1.5067) Key sections: - **Section 3.1**: Uncertainty in averages (first moments) - **Section 3.2**: Uncertainty in fluctuations (second moments) - **Section 4**: Practical recommendations for MD analysis Additional references: > **Chodera, J. D., et al.** (2007). "Use of the Weighted Histogram Analysis > Method for the Analysis of Simulated and Parallel Tempering Simulations." > *Journal of Chemical Theory and Computation*, 3(1), 26-41. > **Flyvbjerg, H., & Petersen, H. G.** (1989). "Error estimates on averages > of correlated data." *Journal of Chemical Physics*, 91(1), 461-466. --- ## API Reference The statistical functions are available in `polyzymd.analysis.core.autocorrelation`: | Function | Description | |----------|-------------| | `compute_acf()` | Compute autocorrelation function via FFT | | `estimate_correlation_time()` | Estimate τ from ACF or timeseries | | `get_independent_indices()` | Get frame indices spaced by 2τ | | `statistical_inefficiency()` | Compute g directly from timeseries | | `statistical_inefficiency_multiple()` | Compute g from multiple timeseries | | `n_effective()` | Compute N_eff = N / g | | `check_statistical_reliability()` | Warn if N_eff < 10 | See the {doc}`/api/core` documentation for full API details.