# RMSF Implementation Verification PolyzyMD uses a custom NumPy-based RMSF implementation rather than relying directly on MDAnalysis's built-in `RMSF` analysis class. This page summarizes the design rationale and benchmark evidence from representative LipA cases. It records what those checks support, and just as importantly, what they do not prove about every possible topology, atom selection, trajectory format, or reference-mapping workflow. ## Why a Custom Implementation? MDAnalysis provides a built-in [`RMSF`](https://docs.mdanalysis.org/stable/documentation_pages/analysis/rms.html#MDAnalysis.analysis.rms.RMSF) class, but it cannot support two features PolyzyMD requires: 1. **Autocorrelation-based frame subsampling.** PolyzyMD selects approximately decorrelated frames using an autocorrelation-based heuristic (following [Grossfield et al., 2018](https://doi.org/10.33011/livecoms.1.1.5067)), producing arbitrary non-uniform frame indices. MDAnalysis's `RMSF` class only accepts uniform `start/stop/step` slicing. 2. **External reference positions.** In `external` reference mode, RMSF is computed as deviations from a crystal structure's atomic positions rather than the trajectory's own average. MDAnalysis's `RMSF` class always uses the trajectory average internally. The core computation is mathematically identical across all implementations: $$ \text{RMSF}_i = \sqrt{\frac{1}{N}\sum_{t=1}^{N} \left|\mathbf{r}_i(t) - \mathbf{r}_i^{\text{ref}}\right|^2} $$ ## Implementation boundary The supported contributor interface is the public `RMSFAnalysis` plugin and the general PolyzyMD analysis lifecycle documented in the contributor guide. The RMSF plugin package contains private implementation files that are useful provenance for maintainers, but they are not stable extension points and should not be imported by contributor plugins. Conceptually, the RMSF workflow is part of the MDAnalysis job/artifact lifecycle: it constructs trajectory-native jobs, computes an internal RMSD time series for autocorrelation analysis, applies the selected frame policy, and writes canonical replicate and condition artifacts. The numerical RMSF kernel remains a plain NumPy calculation inside that lifecycle. ## Benchmark Methodology The benchmark evidence includes 3-way comparisons for trajectory-average reference modes and an independent reference-kernel check for external-reference mode: | Method | Implementation | Notes | |--------|---------------|-------| | **PolyzyMD** | Custom NumPy (two-pass) | The code under test | | **MDAnalysis** | `MDAnalysis.analysis.rms.RMSF` class | Industry-standard Python MD library | | **GROMACS** | `gmx rmsf` (v2026.0) | Industry-standard C simulation engine | ### Test System - **Protein:** Lipase A (LipA) from *Bacillus subtilis*, 181 residues - **Trajectory:** 2,500 frames from a 1,000 ns NPT production run - **Selection:** 171 C-alpha atoms (residues 5--175, excluding flexible termini) - **Topology:** `solvated_system.pdb` (11,636 atoms) ### Modes Tested Three RMSF reference modes were checked in the benchmarked workflow: | Mode | Alignment Reference | RMSF Reference | Methods | |------|-------------------|---------------|---------| | **Average** | Trajectory average structure | Trajectory average positions | 3-way | | **Centroid** | K-Means centroid frame (k=1) | Trajectory average positions | 3-way | | **External** | External crystal PDB | External PDB positions | independent reference-kernel check | GROMACS is excluded from external mode because `gmx rmsf` always computes RMSF relative to the trajectory average — the `-s` file is used only for fitting, not as the RMSF reference. There is no option to override this behavior. Native MDAnalysis `RMSF` also does not validate PolyzyMD's external-reference mode, because the MDAnalysis class uses trajectory-average positions internally. The external-reference path was therefore checked against an independent NumPy/reference-kernel calculation using the same aligned coordinate arrays and external reference mapping. ### Procedure For average and centroid modes: 1. Load the trajectory and align all frames to the reference using MDAnalysis `AlignTraj` (in-memory). 2. Compute per-residue C-alpha RMSF using each method on the **same aligned coordinates**. 3. For GROMACS: export the aligned trajectory to GRO/XTC, run `gmx rmsf -nofit -res`, convert output from nm to Angstroms. 4. Compare per-residue values: Pearson correlation, mean absolute difference, maximum absolute difference. For external-reference mode, the same aligned coordinate arrays were compared against an independent NumPy/reference-kernel calculation rather than against native MDAnalysis `RMSF` or GROMACS. ## Results ### Average Mode (3-way) | Comparison | Pearson *r* | Mean \|delta\| (Angstrom) | Max \|delta\| (Angstrom) | |---|---|---|---| | PolyzyMD vs MDAnalysis | 1.00000000 | 0.000000 | 0.000000 | | PolyzyMD vs GROMACS | 0.99999998 | 0.000250 | 0.000579 | | MDAnalysis vs GROMACS | 0.99999998 | 0.000250 | 0.000579 | ### Centroid Mode (3-way) | Comparison | Pearson *r* | Mean \|delta\| (Angstrom) | Max \|delta\| (Angstrom) | |---|---|---|---| | PolyzyMD vs MDAnalysis | 1.00000000 | 0.000000 | 0.000000 | | PolyzyMD vs GROMACS | 0.99999998 | 0.000249 | 0.000616 | | MDAnalysis vs GROMACS | 0.99999998 | 0.000249 | 0.000616 | ### External Mode (independent check) | Comparison | Pearson *r* | Mean \|delta\| (Angstrom) | Max \|delta\| (Angstrom) | |---|---|---|---| | PolyzyMD vs independent reference kernel | 1.00000000 | 0.000000 | 0.000000 | ## Interpretation **PolyzyMD and MDAnalysis produce bit-identical results** for the benchmarked average and centroid modes. This is expected — both run in the same Python process on the same NumPy arrays, and the RMSF formula is deterministic once the alignment and frame set are fixed. **GROMACS differs by < 0.0007 Angstrom** in the average and centroid benchmarks. This small discrepancy is fully explained by the GRO coordinate format, which truncates positions to 3 decimal places in nanometers (0.001 nm = 0.01 Angstrom precision). The residual pattern is consistent with rounding noise, not a systematic bias. **Conclusion:** The custom RMSF implementation is validated for the benchmarked LipA C-alpha selection, aligned workflow, and tested reference modes within the tolerances shown above. These results support the design choice to keep a custom kernel while adding features PolyzyMD requires, especially non-uniform frame subsampling and external-reference handling. They are not a universal guarantee for every topology, atom selection, periodic-boundary treatment, reference mapping, or trajectory format. ## Benchmark availability The original benchmark script and input/output assets are not currently tracked in the repository. This page records the validation result and its scope until a maintained, reproducible benchmark is added to the project. ## See Also - [RMSF Quick Start](../how_to/analysis_rmsf_quickstart.md) — Get RMSF results in 5 minutes - [RMSF Best Practices](analysis_rmsf_best_practices.md) — Statistical rigor, interpretation, pitfalls - [Reference Structure Selection](analysis_reference_selection.md) — Average, centroid, and external reference modes