RMSF Implementation Verification

PolyzyMD uses a custom NumPy-based RMSF implementation rather than MDAnalysis’s built-in RMSF analysis class. This page documents the 3-way benchmark that validates the custom implementation against two industry-standard tools.

Why a Custom Implementation?

MDAnalysis provides a built-in RMSF class, but it cannot support two features PolyzyMD requires:

  1. Autocorrelation-based frame subsampling. PolyzyMD identifies statistically independent frames via autocorrelation analysis (following Grossfield et al., 2018), producing arbitrary non-uniform frame indices. MDAnalysis’s RMSF class only accepts uniform start/stop/step slicing.

  2. External reference positions. In external reference mode, RMSF is computed as deviations from a crystal structure’s atomic positions rather than the trajectory’s own average. MDAnalysis’s RMSF class always uses the trajectory average internally.

The core computation is mathematically identical across all implementations:

\[ \text{RMSF}_i = \sqrt{\frac{1}{N}\sum_{t=1}^{N} \left|\mathbf{r}_i(t) - \mathbf{r}_i^{\text{ref}}\right|^2} \]

Source Code

The RMSF and RMSD calculations now live in the RMSF plugin class:

  • RMSF + RMSD workflow: RMSFAnalysis.compute_replicate() in src/polyzymd/analyses/rmsf/__init__.py. The method performs the same two-pass RMSF logic and internal RMSD timeseries calculation for autocorrelation analysis.

The implementation is plain NumPy with no MDAnalysis analysis class dependencies.

Benchmark Methodology

A 3-way benchmark was performed comparing:

Method

Implementation

Notes

PolyzyMD

Custom NumPy (two-pass)

The code under test

MDAnalysis

MDAnalysis.analysis.rms.RMSF class

Industry-standard Python MD library

GROMACS

gmx rmsf (v2026.0)

Industry-standard C simulation engine

Test System

  • Protein: Lipase A (LipA) from Bacillus subtilis, 181 residues

  • Trajectory: 2,500 frames from a 1,000 ns NPT production run

  • Selection: 171 C-alpha atoms (residues 5–175, excluding flexible termini)

  • Topology: solvated_system.pdb (11,636 atoms)

Modes Tested

Three RMSF reference modes were benchmarked:

Mode

Alignment Reference

RMSF Reference

Methods

Average

Trajectory average structure

Trajectory average positions

3-way

Centroid

K-Means centroid frame (k=1)

Trajectory average positions

3-way

External

External crystal PDB

External PDB positions

2-way

GROMACS is excluded from external mode because gmx rmsf always computes RMSF relative to the trajectory average — the -s file is used only for fitting, not as the RMSF reference. There is no option to override this behavior.

Procedure

For each mode:

  1. Load the trajectory and align all frames to the reference using MDAnalysis AlignTraj (in-memory).

  2. Compute per-residue C-alpha RMSF using each method on the same aligned coordinates.

  3. For GROMACS: export the aligned trajectory to GRO/XTC, run gmx rmsf -nofit -res, convert output from nm to Angstroms.

  4. Compare per-residue values: Pearson correlation, mean absolute difference, maximum absolute difference.

Results

Average Mode (3-way)

Comparison

Pearson r

Mean |delta| (Angstrom)

Max |delta| (Angstrom)

PolyzyMD vs MDAnalysis

1.00000000

0.000000

0.000000

PolyzyMD vs GROMACS

0.99999998

0.000250

0.000579

MDAnalysis vs GROMACS

0.99999998

0.000250

0.000579

Centroid Mode (3-way)

Comparison

Pearson r

Mean |delta| (Angstrom)

Max |delta| (Angstrom)

PolyzyMD vs MDAnalysis

1.00000000

0.000000

0.000000

PolyzyMD vs GROMACS

0.99999998

0.000249

0.000616

MDAnalysis vs GROMACS

0.99999998

0.000249

0.000616

External Mode (2-way)

Comparison

Pearson r

Mean |delta| (Angstrom)

Max |delta| (Angstrom)

PolyzyMD vs MDAnalysis

1.00000000

0.000000

0.000000

Interpretation

PolyzyMD and MDAnalysis produce bit-identical results in all three modes. This is expected — both run in the same Python process on the same NumPy arrays, and the RMSF formula is deterministic.

GROMACS differs by < 0.0007 Angstrom (max absolute deviation across all residues and modes). This small discrepancy is fully explained by the GRO coordinate format, which truncates positions to 3 decimal places in nanometers (0.001 nm = 0.01 Angstrom precision). The residual pattern is consistent with rounding noise, not a systematic bias.

Conclusion: PolyzyMD’s custom RMSF implementation is verified correct against both MDAnalysis (bit-identical) and GROMACS (limited only by coordinate format precision). The custom implementation is functionally equivalent to industry-standard tools while supporting the additional features (non-uniform frame subsampling, external reference positions) that PolyzyMD requires.

Reproducing the Benchmark

The benchmark script is located at scripts/benchmark_rmsf_3way.py in the repository root. To reproduce:

pixi run -e build python scripts/benchmark_rmsf_3way.py

Requirements:

  • PolyzyMD build pixi environment (MDAnalysis, NumPy, scikit-learn, matplotlib)

  • GROMACS >= 2020 (gmx rmsf) — configure the GMX variable at the top of the script

  • A simulation directory with topology (PDB) and trajectory (DCD) files

The script outputs per-mode CSV data and a publication-quality benchmark figure to scripts/benchmark_rmsf_output/.

See Also