RMSF Implementation Verification
PolyzyMD uses a custom NumPy-based RMSF implementation rather than
MDAnalysis’s built-in RMSF analysis class. This page documents the
3-way benchmark that validates the custom implementation against two
industry-standard tools.
Why a Custom Implementation?
MDAnalysis provides a built-in
RMSF
class, but it cannot support two features PolyzyMD requires:
Autocorrelation-based frame subsampling. PolyzyMD identifies statistically independent frames via autocorrelation analysis (following Grossfield et al., 2018), producing arbitrary non-uniform frame indices. MDAnalysis’s
RMSFclass only accepts uniformstart/stop/stepslicing.External reference positions. In
externalreference mode, RMSF is computed as deviations from a crystal structure’s atomic positions rather than the trajectory’s own average. MDAnalysis’sRMSFclass always uses the trajectory average internally.
The core computation is mathematically identical across all implementations:
Source Code
The RMSF and RMSD calculations live in a single file:
RMSF:
_compute_rmsf()insrc/polyzymd/analysis/rmsf/calculator.py(lines 633–683). Two-pass algorithm: (1) compute average positions from selected frames (or use external reference), (2) accumulate squared deviations.RMSD:
_compute_rmsd_timeseries()in the same file (lines 613–631). Used internally for autocorrelation analysis, not exposed to users.
Both methods are plain NumPy with no MDAnalysis analysis class
dependencies. The MDAnalysis.analysis.rms.RMSF and
MDAnalysis.analysis.rms.RMSD imports that existed previously were
confirmed dead code and removed in commit 19a247e.
Benchmark Methodology
A 3-way benchmark was performed comparing:
Method |
Implementation |
Notes |
|---|---|---|
PolyzyMD |
Custom NumPy (two-pass) |
The code under test |
MDAnalysis |
|
Industry-standard Python MD library |
GROMACS |
|
Industry-standard C simulation engine |
Test System
Protein: Lipase A (LipA) from Bacillus subtilis, 181 residues
Trajectory: 2,500 frames from a 1,000 ns NPT production run
Selection: 171 C-alpha atoms (residues 5–175, excluding flexible termini)
Topology:
solvated_system.pdb(11,636 atoms)
Modes Tested
Three RMSF reference modes were benchmarked:
Mode |
Alignment Reference |
RMSF Reference |
Methods |
|---|---|---|---|
Average |
Trajectory average structure |
Trajectory average positions |
3-way |
Centroid |
K-Means centroid frame (k=1) |
Trajectory average positions |
3-way |
External |
External crystal PDB |
External PDB positions |
2-way |
GROMACS is excluded from external mode because gmx rmsf always
computes RMSF relative to the trajectory average — the -s file is
used only for fitting, not as the RMSF reference. There is no option
to override this behavior.
Procedure
For each mode:
Load the trajectory and align all frames to the reference using MDAnalysis
AlignTraj(in-memory).Compute per-residue C-alpha RMSF using each method on the same aligned coordinates.
For GROMACS: export the aligned trajectory to GRO/XTC, run
gmx rmsf -nofit -res, convert output from nm to Angstroms.Compare per-residue values: Pearson correlation, mean absolute difference, maximum absolute difference.
Results
Average Mode (3-way)
Comparison |
Pearson r |
Mean |delta| (Angstrom) |
Max |delta| (Angstrom) |
|---|---|---|---|
PolyzyMD vs MDAnalysis |
1.00000000 |
0.000000 |
0.000000 |
PolyzyMD vs GROMACS |
0.99999998 |
0.000250 |
0.000579 |
MDAnalysis vs GROMACS |
0.99999998 |
0.000250 |
0.000579 |
Centroid Mode (3-way)
Comparison |
Pearson r |
Mean |delta| (Angstrom) |
Max |delta| (Angstrom) |
|---|---|---|---|
PolyzyMD vs MDAnalysis |
1.00000000 |
0.000000 |
0.000000 |
PolyzyMD vs GROMACS |
0.99999998 |
0.000249 |
0.000616 |
MDAnalysis vs GROMACS |
0.99999998 |
0.000249 |
0.000616 |
External Mode (2-way)
Comparison |
Pearson r |
Mean |delta| (Angstrom) |
Max |delta| (Angstrom) |
|---|---|---|---|
PolyzyMD vs MDAnalysis |
1.00000000 |
0.000000 |
0.000000 |
Interpretation
PolyzyMD and MDAnalysis produce bit-identical results in all three modes. This is expected — both run in the same Python process on the same NumPy arrays, and the RMSF formula is deterministic.
GROMACS differs by < 0.0007 Angstrom (max absolute deviation across all residues and modes). This small discrepancy is fully explained by the GRO coordinate format, which truncates positions to 3 decimal places in nanometers (0.001 nm = 0.01 Angstrom precision). The residual pattern is consistent with rounding noise, not a systematic bias.
Conclusion: PolyzyMD’s custom RMSF implementation is verified correct against both MDAnalysis (bit-identical) and GROMACS (limited only by coordinate format precision). The custom implementation is functionally equivalent to industry-standard tools while supporting the additional features (non-uniform frame subsampling, external reference positions) that PolyzyMD requires.
Reproducing the Benchmark
The benchmark script is located at scripts/benchmark_rmsf_3way.py in the
repository root. To reproduce:
pixi run -e build python scripts/benchmark_rmsf_3way.py
Requirements:
PolyzyMD
buildpixi environment (MDAnalysis, NumPy, scikit-learn, matplotlib)GROMACS >= 2020 (
gmx rmsf) — configure theGMXvariable at the top of the scriptA simulation directory with topology (PDB) and trajectory (DCD) files
The script outputs per-mode CSV data and a publication-quality benchmark
figure to scripts/benchmark_rmsf_output/.
See Also
RMSF Quick Start — Get RMSF results in 5 minutes
RMSF Best Practices — Statistical rigor, interpretation, pitfalls
Reference Structure Selection — Average, centroid, and external reference modes