# Statistics Best Practices for MD Analysis This guide explains the statistical methods used in PolyzyMD's analysis module, following the **LiveCoMS Best Practices** for uncertainty quantification in molecular simulations (Grossfield et al., 2018). ```{contents} On This Page :local: :depth: 2 ``` ## Why Statistics Matter in MD Molecular dynamics trajectories are **highly correlated in time**. Unlike independent experimental measurements, consecutive MD frames show the system in nearly identical configurations. This correlation has profound implications for how we analyze and report results. ```{warning} Ignoring correlation in MD analysis leads to: - **Overconfident uncertainties** (SEM too small by factors of 10-100×) - **Biased variance estimates** (RMSF, heat capacity systematically wrong) - **Irreproducible conclusions** (apparent significance that doesn't replicate) ``` PolyzyMD automatically handles these issues using methods from the `polyzymd.analyses.shared.autocorrelation` module. --- ## Key Concepts ### Autocorrelation Function (ACF) The autocorrelation function measures how correlated a signal is with itself at different time lags: $$C(\tau) = \frac{\langle (x(t) - \mu)(x(t+\tau) - \mu) \rangle}{\sigma^2}$$ - $C(0) = 1$ (perfectly correlated with itself) - $C(\tau) \to 0$ as $\tau \to \infty$ (decorrelation at long times) ```python from polyzymd.analyses.shared.autocorrelation import compute_acf # Compute ACF from a timeseries (e.g., RMSD, distance) acf_result = compute_acf( timeseries, timestep=10.0, # ps between frames timestep_unit="ps", ) # acf_result.acf contains the autocorrelation values # acf_result.lags contains the time lags ``` ### Correlation Time (τ) The **correlation time** τ is the characteristic timescale for decorrelation. Frames separated by more than 2τ are approximately independent. PolyzyMD estimates τ using numerical integration of the ACF (the most robust method for noisy data): $$\tau = \int_0^{\infty} C(t) \, dt$$ ```python from polyzymd.analyses.shared.autocorrelation import estimate_correlation_time tau_result = estimate_correlation_time( acf_result, method="integration", # Most robust for MD data ) print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}") print(f"Independent samples: {tau_result.n_independent}") ``` ### Statistical Inefficiency (g) The **statistical inefficiency** quantifies how much correlation inflates variance estimates: $$g = 1 + 2\sum_{t=1}^{N} C(t) \left(1 - \frac{t}{N}\right)$$ The effective number of independent samples is: $$N_{\text{eff}} = \frac{N}{g}$$ ```{tip} A typical 100 ns trajectory with 10,000 frames might have only 50-200 effective independent samples, depending on the observable and system dynamics. ``` --- ## The Critical Distinction: Means vs. Variances **This is the most important concept in MD statistics.** Different quantities require different statistical treatments depending on whether they involve first moments (means) or second moments (variances). ### First Moments: Use All Data, Correct Uncertainty For quantities that are **averages** (first moments): - Mean distance - Mean RMSD - Contact fraction - Average energy **Correlation affects precision but NOT accuracy.** The sample mean $\bar{x}$ is an unbiased estimator regardless of correlation. Correlated samples still provide valid information about the mean—they just don't provide as much *independent* information. **Correct approach**: Use all frames, but correct the standard error: $$\text{SEM} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} = \frac{\sigma}{\sqrt{N/g}}$$ This is exactly what PolyzyMD's distances plugin does internally in `compute_replicate()`: ```python # From analyses/distances/__init__.py (DistanceCalculator internals) # SEM = std / sqrt(n_independent) if n_independent_frames > 0: sem_distance = float(std_dist / np.sqrt(n_independent_frames)) ``` ### Second Moments: Subsample to Independent Frames For quantities that measure **fluctuations** (second moments): - RMSF (root mean square fluctuation) - Heat capacity (energy variance) - Compressibility (volume variance) - Order parameters from variance **Correlation introduces systematic BIAS, not just imprecision.** Correlated samples systematically **underestimate** the true variance because consecutive frames show the system in similar configurations. #### Physical Intuition: The Pendulum Analogy Imagine measuring the position variance of a swinging pendulum: | Sampling Rate | What You See | Apparent Variance | |--------------|--------------|-------------------| | 1000 Hz | Smooth continuous motion | **Small** (consecutive samples similar) | | 0.1 Hz | Independent snapshots at random phases | **Correct** (true swing amplitude) | The fast sampling doesn't capture the full range of motion between samples—it oversamples the slow transitions and underestimates fluctuations. **Correct approach**: Use only independent frames (spaced by ≥2τ): ```python # From analyses/rmsf plugin internals frame_indices = get_independent_indices( n_frames=n_frames_total, correlation_time=correlation_time, timestep=timestep, start_frame=start_frame, ) ``` --- ## How PolyzyMD Implements This ### Distance analysis The distances plugin computes distances for **all frames** in `compute_replicate()` and uses autocorrelation to correct the uncertainty: ```python # Compute distances for every frame for i, ts in enumerate(u.trajectory): dist = np.linalg.norm(pos2 - pos1) distances.append(dist) # Compute autocorrelation tau_result = estimate_correlation_time(distances_arr, ...) # Correct SEM using effective sample size sem_distance = std_dist / np.sqrt(n_independent_frames) ``` **Why this works**: The mean distance estimate benefits from all data points. More frames = more precise mean estimate. The corrected SEM properly reflects our actual uncertainty. ### RMSF analysis The RMSF plugin **subsamples** to independent frames in `compute_replicate()` before computing fluctuations: ```python # Compute RMSD timeseries for ACF rmsd_timeseries = self._compute_rmsd_timeseries(u, atoms, start_frame) # Estimate correlation time tau_result = estimate_correlation_time(acf_result, n_frames=n_frames_after_eq) # Select only independent frames (spaced by 2τ) frame_indices = get_independent_indices( n_frames=n_frames_total, correlation_time=correlation_time, timestep=timestep, start_frame=start_frame, ) # Compute RMSF using ONLY independent frames rmsf_values = self._compute_rmsf(u, atoms, frame_indices) ``` **Why this is necessary**: RMSF measures fluctuations (variance). Using correlated frames would systematically underestimate protein flexibility. --- ## Summary Table | Quantity | Statistical Type | Effect of Correlation | PolyzyMD Approach | |----------|-----------------|----------------------|-------------------| | Mean distance | 1st moment | Reduces precision (no bias) | All frames + corrected SEM | | Contact fraction | 1st moment | Reduces precision (no bias) | All frames + corrected SEM | | RMSF | 2nd moment | **Introduces bias** | Subsample to independent frames | | Distance variance | 2nd moment | **Introduces bias** | Subsample to independent frames | --- ## Practical Recommendations ### 1. Check Your Correlation Time Always examine the correlation time relative to your trajectory length: ```python from polyzymd.analyses.shared.autocorrelation import ( compute_acf, estimate_correlation_time, check_statistical_reliability, ) acf = compute_acf(observable, timestep=10.0, timestep_unit="ps") tau = estimate_correlation_time(acf, n_frames=len(observable)) print(f"Correlation time: {tau.tau:.1f} {tau.tau_unit}") print(f"Independent samples: {tau.n_independent}") print(f"Statistically reliable: {tau.is_reliable}") ``` ```{warning} If `n_independent < 10`, your results may not be statistically reliable. Consider: 1. Running longer simulations 2. Using multiple independent replicates 3. Reporting results with appropriate caveats ``` ### 2. Use Multiple Replicates Independent replicates provide truly uncorrelated samples. PolyzyMD's aggregation functions properly combine replicates: ```bash # Aggregate across configured independent replicates polyzymd compare run distances -f comparison.yaml --eq-time 100ns ``` Replicates are configured per condition in `comparison.yaml`. The aggregated SEM is computed from the **variance across replicate means**, which is statistically robust regardless of within-trajectory correlation. ### 3. Report Uncertainties Correctly Always report uncertainties that account for correlation: ```{admonition} Good Practice :class: tip "The mean Ser77-His133 distance was 3.42 ± 0.15 Å (SEM, N_eff = 47 independent samples from 5 replicates of 100 ns each)." ``` ```{admonition} Poor Practice :class: warning "The mean distance was 3.42 ± 0.02 Å" (using naive SEM with N = 10,000 frames) ``` ### 4. Understand What You're Measuring Before running analysis, ask yourself: 1. **Am I computing a mean or a variance?** 2. **What timescale does this process occur on?** 3. **Is my trajectory long enough to sample this process multiple times?** For slow processes (e.g., loop conformational changes with τ ~ 10 ns), even a 100 ns trajectory may only contain ~10 independent samples. --- ## Multiple Comparison Correction ### Why It Matters When comparing *N* conditions pairwise, the number of hypothesis tests grows as *N*(*N*−1)/2. With 5 conditions this is already 10 tests; with 10 conditions, 45 tests. Without correction, the probability of at least one false positive (the **familywise error rate**) increases rapidly — even when every null hypothesis is true. ### Benjamini-Hochberg Procedure When `posthoc_method` is `"ttest_bh"` (the default), PolyzyMD uses the **Benjamini-Hochberg (BH)** step-up procedure to control the **false discovery rate** (FDR) — the expected proportion of false positives among rejected hypotheses. The algorithm: 1. Rank all *m* p-values in ascending order: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. 2. For each rank *k*, compare $p_{(k)}$ to the threshold $(k / m) \times \alpha$. 3. Find the largest *k* where $p_{(k)} \le (k / m) \times \alpha$. 4. Reject all hypotheses with rank ≤ *k*. Adjusted p-values are computed with step-down monotonicity enforcement so that $p_{\text{adj},(k)} \le p_{\text{adj},(k+1)}$ always holds. ### How PolyzyMD Applies BH Correction Each plugin defines its own hypothesis family: - **Contacts** — All pairwise t-test p-values (both coverage and contact_fraction metrics) form **one hypothesis family**. ANOVA p-values form a separate small family (typically 2 tests). - **Binding free energy** — Same-temperature pairwise entries form **one family per temperature group**. - **Polymer affinity** — Same pattern as binding free energy: one family per temperature group. ### Tukey's HSD Alternative When `posthoc_method` is `"tukey_hsd"`, PolyzyMD uses **Tukey's Honestly Significant Difference** test instead of pairwise t-tests. Tukey HSD tests all pairs simultaneously using a single studentized range distribution, controlling the **family-wise error rate** (FWER) rather than FDR. Key differences from BH-corrected t-tests: - Tukey HSD is best suited for balanced designs (equal replicates per condition). - It assumes equal variance across conditions. - No separate FDR correction is needed — Tukey p-values are already family-wise corrected. - The `p_value_adjusted` field mirrors `p_value` for Tukey results. - The `t_statistic` field is `NaN` for Tukey results (the test uses the studentized range distribution, not a t-distribution). Choose Tukey HSD when all conditions are equally important and sample sizes are similar. Choose BH-corrected t-tests when you have specific pairs of interest or heterogeneous sample sizes. ### The `fdr_alpha` Parameter `fdr_alpha` is configured in the `defaults:` block of `comparison.yaml` (applies to all plugins using the default comparison pipeline) or per plugin in the `plugins:` block for plugins with their own `fdr_alpha` setting. The default is 0.05. When `posthoc_method` is `"ttest_bh"`, `fdr_alpha` controls the BH false-discovery-rate threshold. When `posthoc_method` is `"tukey_hsd"`, `fdr_alpha` is used as the family-wise alpha threshold for determining significance. It is also used as the ANOVA significance threshold regardless of post-hoc method. Lowering the value (e.g., 0.01) makes the correction more conservative; raising it allows more discoveries at the cost of a higher expected false positive rate. ```yaml defaults: fdr_alpha: 0.01 # More conservative than default (applies to all plugins) posthoc_method: "ttest_bh" # or "tukey_hsd" plugins: contacts: fdr_alpha: 0.01 # Per-plugin override (takes precedence for this plugin) ``` ### Raw vs Adjusted p-Values Both raw and BH-adjusted p-values are shown in formatted output. The `significant` field in saved JSON and the significance markers (`*`, `**`, `***`) in CLI tables are based on the **adjusted** value. This ensures that reported significance accounts for the number of comparisons performed. ### Relationship to Within-Trajectory Statistics FDR correction and autocorrelation handling address **different** sources of statistical error: | Problem | Scope | PolyzyMD Solution | |---------|-------|-------------------| | **Autocorrelation** | Within a single trajectory — consecutive frames are not independent | Correct SEM via $N_{\text{eff}}$ or subsample by 2τ (see earlier sections) | | **Multiple comparisons** | Across conditions — many hypothesis tests inflate false positives | BH correction on pairwise p-values (`ttest_bh`) or Tukey HSD FWER control (`tukey_hsd`) | Both corrections are important. Autocorrelation handling ensures that the per-condition uncertainty is not artificially small; FDR correction ensures that the cross-condition significance claims are not inflated by the number of tests performed. For the full field-level reference of post-hoc output (methods, configuration keys, output JSON fields, and edge cases), see {doc}`/reference/posthoc_testing`. --- ## LiveCoMS References This implementation follows the guidelines from: > **Grossfield, A., et al.** (2018). "Best Practices for Quantification of > Uncertainty and Sampling Quality in Molecular Simulations." > *Living Journal of Computational Molecular Science*, 1(1), 5067. > [DOI: 10.33011/livecoms.1.1.5067](https://doi.org/10.33011/livecoms.1.1.5067) Key sections: - **Section 3.1**: Uncertainty in averages (first moments) - **Section 3.2**: Uncertainty in fluctuations (second moments) - **Section 4**: Practical recommendations for MD analysis Additional references: > **Chodera, J. D., et al.** (2007). "Use of the Weighted Histogram Analysis > Method for the Analysis of Simulated and Parallel Tempering Simulations." > *Journal of Chemical Theory and Computation*, 3(1), 26-41. > **Flyvbjerg, H., & Petersen, H. G.** (1989). "Error estimates on averages > of correlated data." *Journal of Chemical Physics*, 91(1), 461-466. --- ## API Reference The statistical functions are available in `polyzymd.analyses.shared.autocorrelation`: | Function | Description | |----------|-------------| | `compute_acf()` | Compute autocorrelation function via FFT | | `estimate_correlation_time()` | Estimate τ from ACF or timeseries | | `get_independent_indices()` | Get frame indices spaced by 2τ | | `statistical_inefficiency()` | Compute g directly from timeseries | | `statistical_inefficiency_multiple()` | Compute g from multiple timeseries | | `n_effective()` | Compute N_eff = N / g | | `check_statistical_reliability()` | Warn if N_eff < 10 | See the {doc}`/api/core` documentation for full API details.