# RMSF Analysis: Statistical Best Practices A comprehensive guide to statistically rigorous RMSF analysis, following LiveCoMS recommendations for uncertainty quantification in molecular dynamics simulations. ```{note} **Just need quick results?** See the [Quick Start Guide](../how_to/analysis_rmsf_quickstart.md) for copy-paste commands and minimal setup. ``` ```{seealso} **For foundational statistical concepts** (autocorrelation, correlation time, the difference between means vs. variances), see the [Statistics Best Practices Guide](analysis_statistics_best_practices.md). This page focuses on **RMSF-specific** guidance: what the warning messages mean, how to interpret RMSF values, and how to compare conditions. ``` ## Introduction RMSF (Root Mean Square Fluctuation) measures how much each residue fluctuates around its average position. Because RMSF is a **variance-based** quantity (a second moment), it requires special statistical treatment—correlated samples don't just reduce precision, they introduce **systematic bias**. This guide explains: - What RMSF measures and how to interpret values - What the warning messages mean and what to do about them - How to properly compare conditions with statistical rigor - Common pitfalls and how to avoid them ```{tip} **Why does RMSF need subsampling?** Unlike mean distances (which can use all frames with corrected SEM), variance estimates from correlated samples are systematically biased—they underestimate true fluctuations. PolyzyMD automatically subsamples to independent frames when computing RMSF. See the [Statistics Best Practices](analysis_statistics_best_practices.md#the-critical-distinction-means-vs-variances) for the full explanation. ``` ## What is RMSF? **Root Mean Square Fluctuation (RMSF)** measures how much each atom or residue fluctuates around its average position during a simulation: $$ \text{RMSF}_i = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( \mathbf{r}_i(t) - \langle \mathbf{r}_i \rangle \right)^2} $$ Where: - $\mathbf{r}_i(t)$ is the position of atom/residue $i$ at time $t$ - $\langle \mathbf{r}_i \rangle$ is the time-averaged position - $T$ is the number of frames ### What RMSF Measures | High RMSF | Low RMSF | |-----------|----------| | Flexible regions | Rigid regions | | Loops, termini | Core β-sheets | | Solvent-exposed | Buried residues | | Functionally mobile | Structurally constrained | ### Relationship to B-factors RMSF is related to crystallographic B-factors (temperature factors): $$ B_i = \frac{8\pi^2}{3} \langle \Delta r_i^2 \rangle = \frac{8\pi^2}{3} \text{RMSF}_i^2 $$ This allows comparison between simulation and experimental data, though crystal packing effects mean the correspondence is imperfect. ## How PolyzyMD Handles Correlation for RMSF PolyzyMD automatically performs autocorrelation analysis and **subsamples to independent frames** before computing RMSF. This is essential because RMSF measures variance, which is systematically biased by correlated samples. ```{seealso} For the mathematical details of autocorrelation functions, correlation time estimation, and the LiveCoMS formula, see the [Statistics Best Practices Guide](analysis_statistics_best_practices.md). ``` ### What PolyzyMD Does 1. **Computes RMSD timeseries** after alignment to reference structure 2. **Estimates correlation time (τ)** via ACF integration 3. **Selects independent frames** spaced by ≥2τ 4. **Computes RMSF** using only these independent frames ### Example Output When you run RMSF analysis, PolyzyMD reports: ``` Correlation time: 15394 ps (15.4 ns) Statistical inefficiency: 308.9 Independent samples: 6 (from 2000 frames) ``` This means: - The RMSD decorrelates over ~15 ns timescales - Each independent sample represents ~300 correlated frames - You effectively have only 6 independent measurements ## Understanding the Warnings ### The "Low Statistical Reliability" Warning When N_independent < 10, PolyzyMD issues this warning: ``` WARNING: Low statistical reliability: only 6 independent samples (recommended >= 10). Correlation time τ = 15394 ps is comparable to or longer than the trajectory sampling window. Consider: (1) extending simulation time, (2) using multiple independent trajectories, or (3) interpreting results with caution. See Grossfield et al. (2018) LiveCoMS 1:5067. ``` ### What This Warning Means The warning indicates that your trajectory may not have sampled enough independent conformations to precisely estimate equilibrium properties. It does **not** mean: - Your simulation is broken - Your data is useless - You must start over ### Decision Tree: What To Do ``` Got "Low statistical reliability" warning? │ ├─ Are you using multiple independent replicates? │ ├─ YES → You're probably fine. Replicate-based statistics are robust. │ └─ NO → Consider running 3-5 replicates instead of extending. │ ├─ What kind of conclusion do you need? │ ├─ Qualitative (stable vs unstable) → Warning is informational only. │ └─ Quantitative (precise mean ± SEM) → Need more sampling. │ └─ Are you comparing conditions? ├─ YES → Focus on replicate-to-replicate variation for significance. └─ NO → Report uncertainty based on replicate spread if available. ``` ### When the Warning is Not a Problem 1. **You have multiple replicates**: Inter-replicate variation provides a robust uncertainty estimate independent of within-trajectory correlation. 2. **You're making qualitative conclusions**: "The protein is stable" doesn't require precise RMSF values. 3. **The effect size is large**: A 50% difference in RMSF is meaningful even with uncertain absolute values. ## Replicates vs Longer Simulations ### The LiveCoMS Recommendation > "Multiple independent simulations are preferable to a single long simulation" > — Grossfield et al. (2018) ### Why Replicates Are Better | Multiple Replicates | Single Long Simulation | |--------------------|------------------------| | Truly independent samples | Samples remain correlated | | Tests reproducibility | May be trapped in metastable state | | Detects rare events/outliers | Rare events may dominate or be missed | | Parallelizable | Sequential | | Robust to system issues | Single point of failure | ### The Independence Argument Even with τ = 20 ns, a 200 ns simulation yields only ~10 independent samples. Running that same simulation 3 times gives you 3 **truly independent** measurements of the mean RMSF: ``` Replicate 1: mean RMSF = 0.755 Å Replicate 2: mean RMSF = 0.693 Å Replicate 3: mean RMSF = 0.696 Å Replicate mean: 0.715 Å Replicate std: 0.035 Å Replicate SEM: 0.020 Å (= std/√3) ``` This replicate-based uncertainty is **valid regardless of within-trajectory correlation** because the replicates started from different initial velocities. ### How Many Replicates? | Replicates | Power to Detect | Practical Guidance | |------------|-----------------|-------------------| | 3 | Large effects (d > 2) | Minimum for publication | | 5 | Medium effects (d > 1.3) | Recommended standard | | 10 | Small effects (d > 0.9) | High-precision studies | For most enzyme-polymer studies, **3-5 replicates per condition** is a reasonable balance between statistical power and computational cost. ## Handling Incomplete Data During active research, simulations are often in progress. You may want to analyze available data without waiting for all replicates to complete. PolyzyMD handles this gracefully by skipping missing or failed replicates with informative warnings. ### Missing Replicates When a requested replicate's data cannot be found, PolyzyMD logs a warning and continues with the remaining replicates: ``` WARNING: Skipping replicate 2: trajectory data not found. Working directory not found: /scratch/user/project/run_2 Has replicate 2 been simulated? ``` ### Failed Replicates If a replicate exists but analysis fails (e.g., corrupted trajectory, missing atoms), PolyzyMD logs the error and continues: ``` WARNING: Skipping replicate 3: analysis failed with error: Could not read DCD file: unexpected end of file ``` ### Aggregation Summary When aggregating with incomplete data, PolyzyMD reports which replicates were successfully analyzed: ``` WARNING: Aggregating 2 of 3 requested replicates. Skipped: [2] ``` ### Minimum Requirements **At least 2 successful replicates are required for aggregation.** This is because SEM calculation requires n ≥ 2. If fewer than 2 replicates succeed, PolyzyMD raises an error: ``` ValueError: Aggregation requires at least 2 successful replicates, but only 1 succeeded. Failed replicates: [2, 3] ``` ### Output File Naming Output filenames reflect the actual replicates used, not the requested range: | Requested | Successful | Filename | |-----------|------------|----------| | 1-3 | 1, 2, 3 | `rmsf_reps1-3_eq100ns.json` | | 1-3 | 1, 3 | `rmsf_reps1_3_eq100ns.json` | | 1-5 | 1, 2, 4 | `rmsf_reps1_2_4_eq100ns.json` | Note: Contiguous ranges use a hyphen (`1-3`), non-contiguous use underscores (`1_3`). ### Best Practices for Incomplete Data 1. **Investigate missing data**: If replicates consistently fail, check: - Did the simulation complete? Check SLURM logs for timeouts or errors. - Are trajectory files in the expected location? Verify paths in config.yaml. - Is the trajectory corrupted? Try loading it manually with MDAnalysis. 2. **Document which replicates were used**: When publishing results from incomplete data, clearly state which replicates contributed to aggregated statistics. The JSON output includes a `replicates` field for this purpose. 3. **Re-run with complete data**: Once all simulations finish, re-run analysis with `--recompute` to include all replicates: ```bash polyzymd compare run rmsf -f comparison.yaml --recompute ``` 4. **Consider statistical implications**: Results from 2 replicates have larger uncertainty than 3+. Be appropriately cautious when interpreting results with fewer replicates than planned. ### Example: Analyzing During Active Simulations A common workflow when simulations are still running: ```bash # Request all 5 planned replicates in comparison.yaml, but only 3 have completed polyzymd compare run rmsf -f comparison.yaml --eq-time 100ns # Output shows: # Skipping replicate 4: trajectory data not found... # Skipping replicate 5: trajectory data not found... # Aggregating 3 of 5 requested replicates. Skipped: [4, 5] # # Results saved to: rmsf_reps1-3_eq100ns.json ``` Later, when all simulations complete: ```bash # Re-run to include all replicates polyzymd compare run rmsf -f comparison.yaml --eq-time 100ns --recompute # Now all 5 are included: # Results saved to: rmsf_reps1-5_eq100ns.json ``` ## Comparing Conditions ### Experimental Design To compare two conditions (e.g., polymer vs no-polymer): 1. Run **N replicates** of each condition (N ≥ 3) 2. Compute RMSF for each replicate 3. Use **replicate means** as your data points (not per-frame values) 4. Perform statistical test on replicate means ### The Statistical Test Use a two-sample t-test on the per-replicate mean RMSF values: ```python from scipy import stats import numpy as np # Per-replicate mean RMSF (not per-frame!) condition_A = [0.755, 0.693, 0.696] # No polymer condition_B = [0.558, 0.738, 0.496] # With polymer # Two-sample t-test t_stat, p_value = stats.ttest_ind(condition_A, condition_B) print(f"t-statistic: {t_stat:.3f}") print(f"p-value: {p_value:.4f}") ``` ### Effect Size The p-value tells you if the difference is **statistically significant**, but not if it's **meaningful**. Effect size quantifies the magnitude: ```python # Cohen's d mean_A = np.mean(condition_A) mean_B = np.mean(condition_B) pooled_std = np.sqrt((np.var(condition_A, ddof=1) + np.var(condition_B, ddof=1)) / 2) cohens_d = (mean_A - mean_B) / pooled_std print(f"Cohen's d: {cohens_d:.2f}") ``` | Cohen's d | Interpretation | |-----------|---------------| | < 0.2 | Negligible effect | | 0.2 - 0.5 | Small effect | | 0.5 - 0.8 | Medium effect | | > 0.8 | Large effect | ### Power Analysis With small sample sizes, you can only detect large effects. The **minimum detectable difference** depends on: - Sample size per group (N) - Within-group variability (σ) - Desired significance level (α, typically 0.05) - Desired power (typically 0.80) ```python # Minimum detectable difference (approximate) # For t-test with α=0.05, power=0.80 critical_t = {3: 2.78, 5: 2.31, 10: 2.10} # df = 2*(N-1) se_diff = pooled_std * np.sqrt(2/N) min_detectable = critical_t[N] * se_diff print(f"Minimum detectable difference: {min_detectable:.3f} Å") ``` ### Worked Example From a real analysis comparing polymer vs no-polymer: ``` RESULTS SUMMARY =============== No Polymer With Polymer Replicate 1 0.755 Å 0.558 Å Replicate 2 0.693 Å 0.738 Å Replicate 3 0.696 Å 0.496 Å ----------------------------------------------- Mean 0.715 Å 0.597 Å Std 0.035 Å 0.126 Å STATISTICAL ANALYSIS ==================== Difference: 0.117 Å (16.4% reduction with polymer) t-statistic: 1.56 p-value: 0.194 Cohen's d: 1.27 (large effect) INTERPRETATION ============== - Large effect size (d = 1.27) suggests a real difference - p > 0.05 means we cannot reject null hypothesis - High variability in polymer condition (replicate 2 is outlier) - CONCLUSION: Suggestive but not conclusive; need more replicates ``` ### Non-Significant p-value with Large Effect Size This situation (p > 0.05 but large Cohen's d) is common with small samples. It means: 1. The effect **might be real** but you lack power to detect it 2. More replicates would likely achieve significance 3. You should **not** conclude "no effect" — only "insufficient evidence" The appropriate response is to run additional replicates, not to abandon the hypothesis. ## Interpreting RMSF Values ### Typical Ranges | RMSF (Å) | Protein Region | Interpretation | |----------|---------------|----------------| | 0.3 - 0.5 | Core β-sheets, buried helices | Very rigid | | 0.5 - 1.0 | Surface helices, structured loops | Normal flexibility | | 1.0 - 2.0 | Exposed loops, active site flaps | Flexible | | 2.0 - 5.0 | Termini, disordered regions | Highly flexible | | > 5.0 | Disordered tails, unfolded regions | May indicate instability | ### Factors Affecting RMSF **Temperature**: Higher T → higher RMSF (more kinetic energy) **Solvent**: - Water → normal flexibility - Organic co-solvents → may increase or decrease depending on interactions - Crowding agents → typically reduce RMSF **Simulation parameters**: - Force field → affects flexibility - Timestep → too large can cause instability - Thermostat → Langevin adds friction, affects dynamics **System composition**: - Polymers can restrict or enhance motion - Ligand binding often rigidifies active site - Oligomerization affects interface flexibility ### Active Site Flexibility For enzyme studies, active site RMSF is particularly relevant: - **Lower RMSF** may indicate: - Better substrate positioning - More stable catalytic geometry - Potentially higher activity (if not too rigid) - **Higher RMSF** may indicate: - Conformational sampling for substrate binding - Induced fit dynamics - Potentially reduced activity if catalytic residues misalign The "optimal" flexibility depends on the enzyme mechanism and should be interpreted in context of experimental activity data. ### External Reference for Catalytic Competence When studying enzyme catalysis across multiple conditions (polymer baths, temperatures, mutants), the standard RMSF modes (`centroid`, `average`) use a **condition-specific** reference — each condition's trajectory determines its own reference structure. This makes direct comparison difficult: a low RMSF might mean the enzyme is rigid in a *non-functional* conformation. The `external` reference mode solves this by using a **condition-independent** reference, typically a crystal structure that represents the catalytically competent geometry. RMSF then measures deviation from the functional state: $$ \text{RMSF}_i^{\text{ext}} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( \mathbf{r}_i(t) - \mathbf{r}_i^{\text{crystal}} \right)^2} $$ **Interpretation changes with external reference:** | Metric | Standard RMSF (centroid/average) | External Reference RMSF | |--------|----------------------------------|------------------------| | Low value | Residue is rigid | Residue stays near crystal geometry | | High value | Residue is flexible | Residue deviates from functional state | | Condition comparison | Which condition is more flexible? | Which condition best maintains catalytic geometry? | **Practical guidance:** - Focus on **catalytic residues** (triad, oxyanion hole) rather than overall protein RMSF — these are the residues whose geometry directly affects activity - Use **residue range truncation** (`resid 5:175`) to exclude flexible termini that dominate statistics and obscure active-site signals - Report both overall protein and active-site RMSF separately — polymer effects may stabilize the active site without affecting overall flexibility ```{tip} **Which reference mode for enzymes?** Use `centroid` or `average` to answer "how flexible is the protein?" Use `external` to answer "how well does the protein maintain its catalytically competent geometry?" These are complementary questions — consider running both analyses. ``` ```{seealso} For detailed setup instructions (Python, YAML, CLI), see the [External PDB section](analysis_reference_selection.md#external-pdb) of the Reference Selection Guide. ``` ## Common Pitfalls ### Pitfall 1: Using Unaligned Trajectories **Symptom**: RMSF values > 10 Å, sometimes 50+ Å **Cause**: Without alignment, RMSF includes global translation and rotation of the entire protein through the simulation box. **Solution**: PolyzyMD automatically aligns trajectories. If you see very high values, check that: - Your selection string matches atoms in the system - The reference structure is valid - Trajectory files are complete ### Pitfall 2: Insufficient Equilibration **Symptom**: RMSF values drift over time; different equilibration times give different results **Cause**: Including non-equilibrated frames in the analysis **Solution**: Use `--eq-time` to skip the equilibration period: ```bash # Skip first 10% of a 200 ns simulation polyzymd compare run rmsf -f comparison.yaml --eq-time 20ns ``` Monitor RMSD vs time — equilibration is complete when RMSD plateaus. ### Pitfall 3: Over-Interpreting Small Differences **Symptom**: Claiming significance for 0.05 Å differences **Cause**: Not accounting for uncertainty **Solution**: Always report uncertainty and perform statistical tests: ```python # WRONG: "Condition A (0.715 Å) is more stable than B (0.720 Å)" # RIGHT: "Condition A (0.715 ± 0.020 Å) and B (0.720 ± 0.025 Å) # are not significantly different (p = 0.89)" ``` ### Pitfall 4: Ignoring Replicate Variation **Symptom**: Reporting within-trajectory SEM as the uncertainty **Cause**: Treating correlated frames as independent **Solution**: Use replicate-based statistics: ```python # WRONG: SEM from 2000 correlated frames sem_wrong = np.std(all_frames) / np.sqrt(2000) # Way too small # RIGHT: SEM from 3 independent replicates replicate_means = [0.755, 0.693, 0.696] sem_right = np.std(replicate_means, ddof=1) / np.sqrt(3) # Correct ``` ### Pitfall 5: Cherry-Picking Replicates **Symptom**: Excluding "outlier" replicates without justification **Cause**: Desire for cleaner results **Solution**: Include all replicates unless there's a **technical** reason to exclude (e.g., simulation crashed, obvious system error). Biological variability is real and should be reported. If one replicate differs substantially, investigate **why**: - Did a conformational change occur? - Did the ligand unbind? - Is there a slow process being sampled? ## Advanced: Python API ### Accessing Autocorrelation Results ```python from pathlib import Path import json from polyzymd.analyses.discovery import get_analysis from polyzymd.analyses.orchestrator import run_comparison from polyzymd.config.comparison import ComparisonConfig # Programmatic access goes through the RMSFAnalysis plugin config = ComparisonConfig.from_yaml("comparison.yaml") analysis = get_analysis("rmsf")() pipeline_result = run_comparison(analysis, config, equilibration="100ns") # Load the saved comparison result for detailed inspection comparison_path = Path(pipeline_result["comparison_path"]) with comparison_path.open() as f: comparison_data = json.load(f) # Access condition-level summary for condition in comparison_data["conditions"]: print( f"{condition['label']}: " f"mean RMSF={condition['mean_rmsf']:.3f} Å, " f"SEM={condition['sem_rmsf']:.3f} Å" ) # Access per-residue data for the first condition first_condition = comparison_data["conditions"][0] for resid, rmsf in zip(first_condition["residue_ids"], first_condition["mean_rmsf_per_residue"]): print(f"Residue {resid}: {rmsf:.3f} Å") ``` ### Custom Statistical Analysis ```python import numpy as np from scipy import stats # Load multiple conditions conditions = { 'no_polymer': [0.755, 0.693, 0.696], 'polymer_A': [0.558, 0.738, 0.496], 'polymer_B': [0.612, 0.598, 0.621], } # Pairwise comparisons with Bonferroni correction n_comparisons = 3 alpha = 0.05 / n_comparisons for name_a, values_a in conditions.items(): for name_b, values_b in conditions.items(): if name_a >= name_b: continue t, p = stats.ttest_ind(values_a, values_b) sig = "**" if p < alpha else "" print(f"{name_a} vs {name_b}: p = {p:.4f} {sig}") ``` ### Exporting for External Analysis ```python import pandas as pd import json # Load aggregated result with open("analysis//rmsf/aggregated/rmsf_aggregated.json") as f: data = json.load(f) # Create DataFrame for export df = pd.DataFrame({ 'residue_id': data['residue_ids'], 'residue_name': data['residue_names'], 'mean_rmsf': data['mean_rmsf_per_residue'], 'sem_rmsf': data['sem_rmsf_per_residue'], }) # Export to CSV df.to_csv("rmsf_results.csv", index=False) # Export to Excel with multiple sheets with pd.ExcelWriter("rmsf_analysis.xlsx") as writer: df.to_excel(writer, sheet_name="Per-Residue", index=False) summary = pd.DataFrame({ 'Metric': ['Mean RMSF', 'SEM', 'Min', 'Max', 'N_replicates'], 'Value': [ data['overall_mean_rmsf'], data['overall_sem_rmsf'], data['overall_min_rmsf'], data['overall_max_rmsf'], data['n_replicates'], ] }) summary.to_excel(writer, sheet_name="Summary", index=False) ``` ## References ### Primary Reference **Grossfield A, Patrone PN, Roe DR, Schultz AJ, Siderius DW, Zuckerman DM.** (2018) "Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations." *Living Journal of Computational Molecular Science* 1(1):5067. https://doi.org/10.33011/livecoms.1.1.5067 This is the definitive guide to uncertainty quantification in MD. The methodology in PolyzyMD follows their recommendations. ### Additional References **Flyvbjerg H, Petersen HG.** (1989) "Error estimates on averages of correlated data." *Journal of Chemical Physics* 91:461-466. https://doi.org/10.1063/1.457480 Classic paper on block averaging for correlated data. **Chodera JD, Swope WC, Pitera JW, Seok C, Dill KA.** (2007) "Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel Tempering Simulations." *Journal of Chemical Theory and Computation* 3:26-41. https://doi.org/10.1021/ct0502864 Introduces statistical inefficiency for MD analysis. **Knapp B, Frantal S, Greshake B, Schwarz R, et al.** (2018) "Is an Intuitive Convergence Definition of Molecular Dynamics Simulations Solely Based on the Root Mean Square Deviation Possible?" *Journal of Computational Biology* 25:1069-1077. Discussion of RMSD-based convergence assessment. ## See Also - [Quick Start Guide](../how_to/analysis_rmsf_quickstart.md) — Get results fast - [Implementation Verification](analysis_rmsf_verification.md) — 3-way benchmark against MDAnalysis and GROMACS - [Statistics Best Practices](analysis_statistics_best_practices.md) — Foundational statistics for MD - [Reference Structure Selection](analysis_reference_selection.md) — Choose alignment reference - [LiveCoMS Best Practices](https://livecomsjournal.org/index.php/livecoms/article/view/v1i1e5067) — Full methodology paper