Analysis System Concepts

Your simulations are done. You have trajectories on disk and you want to measure something — RMSF, contacts, distances, whatever. This page explains how PolyzyMD’s analysis system is put together so that when you run a command or read an output file, you know what happened and where to look.

The analysis pipeline

Every analysis in PolyzyMD follows the same four-stage pipeline:

replicate stage  →  aggregate  →  compare  →  plot

Here is what each stage does:

| Stage | Scope | What it produces |
|---|---|---|
| replicate stage | One replicate of one condition | ReplicateArtifact at analysis/<condition_label>/<plugin_name>/run_<N>/result.json |
| aggregate | All replicates of one condition | ConditionArtifact at analysis/<condition_label>/<plugin_name>/aggregated/result.json |
| compare | All conditions together | ComparisonArtifact or active custom comparison result at comparison/<plugin_name>/result.json |
| plot | All conditions together | Figures saved in the configured format, with png as the default and pdf or svg also supported |

Each artifact stores a validated payload plus metadata, provenance, warnings, and references to sidecar files when an analysis needs large tables or arrays outside the main JSON document.

Trajectory-native plugins generally create MDAAnalysisJob objects for their per-replicate computation. The corresponding collectors translate completed jobs into ReplicateArtifact objects. PolyzyMD then owns the surrounding workflow: condition aggregation, cross-condition comparison, artifact storage, and plot orchestration.

Plots are deliberately downstream of this artifact layer. They read cached artifacts and sidecars only; they do not reload trajectories or rerun the analysis calculation.

You don’t call these stages yourself. When you run polyzymd compare run, the CLI walks through the pipeline automatically. But knowing the stages helps when you need to debug (“Which stage failed?”) or interpret output (“Is this a per-replicate file or an aggregated file?”).
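To answer "Is this a per-replicate file or an aggregated file?" it can help to build the expected paths yourself. The following is a minimal sketch based only on the layout in the table above; `artifact_paths` is a hypothetical helper, not part of PolyzyMD, and the directory name `No_Polymer` is an assumed example of a filesystem-sanitized label.

```python
from pathlib import Path


def artifact_paths(project_dir, condition_dir, plugin_name, replicates):
    """Return the expected result.json locations for one condition and plugin.

    condition_dir is the condition directory name as it appears on disk
    (already sanitized); this hypothetical helper does not sanitize labels.
    """
    project = Path(project_dir)
    plugin_root = project / "analysis" / condition_dir / plugin_name
    return {
        # One ReplicateArtifact per replicate number
        "replicates": {
            n: plugin_root / f"run_{n}" / "result.json" for n in replicates
        },
        # One ConditionArtifact per condition and plugin
        "aggregated": plugin_root / "aggregated" / "result.json",
        # One comparison result per plugin, shared by all conditions
        "comparison": project / "comparison" / plugin_name / "result.json",
    }


# "No_Polymer" is an assumed sanitized directory name, used for illustration.
paths = artifact_paths("comparison_project", "No_Polymer", "rmsf", [1, 2, 3])
print(paths["replicates"][2].as_posix())
print(paths["aggregated"].as_posix())
print(paths["comparison"].as_posix())
```

If a file is missing at one of these locations, the stage that writes it is the first place to look.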

comparison.yaml — the control file

The comparison.yaml file is the single input that defines an analysis run. It tells PolyzyMD what simulations to analyze, what to measure, and how to compare the results.

Here is a minimal example:

name: "polymer_stability_study"

conditions:
  - label: "No Polymer"
    config: "../no_polymer/config.yaml"
    replicates: [1, 2, 3]
  - label: "100% SBMA"
    config: "../sbma_100/config.yaml"
    replicates: [1, 2, 3]

control: "No Polymer"

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"
  contacts: {}

The key sections are:

conditions

Each entry points to a simulation’s config.yaml and lists which replicate numbers to include. The label is a human-readable name that shows up in plots and result files. When labels appear in directory names, PolyzyMD sanitizes them for the filesystem; for example, 100% SBMA may be written as 100_SBMA in paths while remaining 100% SBMA in summaries and plots.

control

Which condition to use as the baseline for statistical comparisons. Set this to the label of your reference condition (typically an unmodified or no-polymer system). If you only have one condition or don’t want relative comparisons, set it to null or leave it out.

defaults.equilibration_time

How much trajectory to discard from the beginning of each run. Early frames are typically not equilibrated, so the pipeline skips them. Specify as a string with units: "10ns", "5000ps", etc. The default is "10ns".

plugins

Which analyses to run and their settings. Each key is a plugin name (like rmsf or contacts), and the value is a settings block for that plugin. An empty block {} means “run with defaults.” Only plugins listed here are executed — if you don’t include sasa, SASA won’t be computed.

For the complete schema with all fields, see comparison.yaml Schema Reference.
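To make the effect of defaults.equilibration_time concrete, here is a minimal sketch of how a time string such as "10ns" or "5000ps" could be turned into a number of discarded frames. It is not PolyzyMD's parser: `time_to_ps` and `first_kept_frame` are hypothetical names, only the two units shown above are handled, and the sketch assumes frame 0 sits at time zero with a fixed interval between saved frames.

```python
import math
import re

# Assumption: only the "ps" and "ns" units mentioned in the text are supported.
_PS_PER_UNIT = {"ps": 1.0, "ns": 1000.0}
_TIME_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(ps|ns)\s*$")


def time_to_ps(text):
    """Convert a string like '10ns' or '5000ps' to picoseconds."""
    match = _TIME_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized time string: {text!r}")
    value, unit = match.groups()
    return float(value) * _PS_PER_UNIT[unit]


def first_kept_frame(equilibration_time, frame_interval_ps):
    """Index of the first frame at or after the equilibration cutoff.

    Assumes frame 0 is at t = 0 and frames are evenly spaced.
    """
    return math.ceil(time_to_ps(equilibration_time) / frame_interval_ps)


print(time_to_ps("10ns"))
print(time_to_ps("5000ps"))
print(first_kept_frame("10ns", frame_interval_ps=100.0))
```

With frames saved every 100 ps, a "10ns" cutoff in this sketch means analysis would start at frame index 100.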

Conditions and replicates

These two terms come up everywhere in the analysis output, so it helps to be precise about what they mean in PolyzyMD.

A condition is one simulation setup. Examples: “No Polymer”, “SBMA-100”, “PEG-50”. Each condition has its own config.yaml that defines the system (which protein, which polymer, which force field, etc.).

A replicate is a separate run of the same condition, intended to sample the same setup independently. Replicates are identified by number — 1, 2, 3, and so on. Each replicate uses the same config.yaml but usually starts from a different random seed, initial velocity assignment, or starting configuration. These choices help separate trajectories, but they do not guarantee statistical independence by themselves. Interpretation also depends on equilibration, stationarity, decorrelation, and whether the simulated timescales are long enough for the process being measured.

The pipeline processes data in this order:

  1. Per-replicate: the replicate stage (the per-replicate compute step) runs once for each replicate of each condition and writes a ReplicateArtifact. With 2 conditions of 3 replicates each, that is 6 replicate artifacts per plugin.

  2. Per-condition: aggregate runs once per condition, combining replicate artifacts into a ConditionArtifact. In the same example, that is 2 aggregate calls per plugin.

  3. Cross-condition: compare runs once, looking at all conditions together and writing a ComparisonArtifact or an active custom comparison result. plot then reads those cached outputs and any referenced sidecars.
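The counting in the list above can be sketched as a small enumeration. This is an illustration only; `plan_stages` is a hypothetical function, and it assumes each plugin gets its own artifacts, as the analysis/<condition_label>/<plugin_name>/ layout suggests.

```python
def plan_stages(conditions, plugins):
    """Enumerate the units of work for each stage.

    conditions maps a condition label to its list of replicate numbers.
    """
    # One replicate artifact per (condition, plugin, replicate)
    replicate = [
        (label, plugin, n)
        for label, reps in conditions.items()
        for plugin in plugins
        for n in reps
    ]
    # One aggregate call per (condition, plugin)
    aggregate = [(label, plugin) for label in conditions for plugin in plugins]
    # One comparison per plugin, across all conditions
    compare = list(plugins)
    return replicate, aggregate, compare


conditions = {"No Polymer": [1, 2, 3], "100% SBMA": [1, 2, 3]}

replicate, aggregate, compare = plan_stages(conditions, ["rmsf"])
print(len(replicate), len(aggregate), len(compare))

# Adding a second plugin doubles every count.
replicate2, aggregate2, compare2 = plan_stages(conditions, ["rmsf", "contacts"])
print(len(replicate2), len(aggregate2), len(compare2))
```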

Plugins — the analysis modules

PolyzyMD ships with 9 analysis plugins. Each plugin is a self-contained module that knows how to compute one type of measurement, aggregate it, compare across conditions, and generate plots.

The available plugins are:

| Plugin name | What it measures |
|---|---|
| rmsd | Root-mean-square deviation over time |
| rg | Radius of gyration over time |
| rmsf | Root-mean-square fluctuation per residue |
| contacts | Intermolecular contacts between protein and other components |
| distances | Distances between specified atom groups |
| catalytic_triad | Catalytic triad geometry (active-site distances) |
| secondary_structure | Secondary structure content (helix, sheet, coil fractions) |
| sasa | Solvent-accessible surface area |
| hydrogen_bonds | Hydrogen bond occupancy and lifetimes |

Each plugin has a Settings model with configurable parameters. Most parameters have sensible defaults, so you often just need plugin_name: {} in your comparison.yaml to get started.

For contributors, the plugin boundary is the supported extension point: a plugin defines its settings, replicate computation, aggregation behavior, comparison behavior, plotting behavior, and formatting behavior without changing the core orchestration code. The conceptual boundary is important because PolyzyMD owns artifact storage and orchestration, while plugins own the domain-specific measurement and interpretation logic. For a contributor-focused walkthrough, see Extend PolyzyMD with MDAnalysis-native analyses.

You configure plugins in the plugins: block. For example, to run RMSF with a custom selection and contacts with defaults:

plugins:
  rmsf:
    selection: "protein and name CA"
  contacts: {}
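Before launching a long run, a quick sanity check of the parsed comparison.yaml can catch typos in plugin names or in the control label. The sketch below works on a plain dictionary, as a YAML loader would return it. `check_comparison_config` is a hypothetical helper rather than PolyzyMD's own validation, and the rule that labels must be unique is an assumption. A plugin name outside the nine listed above is only reported as a note, since custom plugins may be valid.

```python
# The nine plugin names listed in the table above.
SHIPPED_PLUGINS = {
    "rmsd", "rg", "rmsf", "contacts", "distances",
    "catalytic_triad", "secondary_structure", "sasa", "hydrogen_bonds",
}


def check_comparison_config(config):
    """Return a list of human-readable notes about a parsed comparison.yaml."""
    notes = []
    labels = [cond["label"] for cond in config.get("conditions", [])]
    # Assumption: labels should be unique so results do not collide on disk.
    if len(set(labels)) != len(labels):
        notes.append("condition labels are not unique")
    control = config.get("control")
    if control is not None and control not in labels:
        notes.append(f"control {control!r} does not match any condition label")
    for name in config.get("plugins", {}):
        if name not in SHIPPED_PLUGINS:
            notes.append(f"{name!r} is not one of the shipped plugins")
    return notes


config = {
    "name": "polymer_stability_study",
    "conditions": [
        {"label": "No Polymer", "config": "../no_polymer/config.yaml", "replicates": [1, 2, 3]},
        {"label": "100% SBMA", "config": "../sbma_100/config.yaml", "replicates": [1, 2, 3]},
    ],
    "control": "No Polymer",
    "plugins": {"rmsf": {"selection": "protein and name CA"}, "contacts": {}},
}
print(check_comparison_config(config))

# A misspelled control label and plugin name both produce notes.
bad = dict(config, control="No polymer", plugins={"rmsff": {}})
print(check_comparison_config(bad))
```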

Statistical comparison

When you have two or more conditions, the compare stage produces statistical output so you can assess whether differences are meaningful. There are two comparison paths:

  • Default scalar/artifact comparison: plugins that expose scalar metrics can use the framework’s default comparison behavior. In that path, PolyzyMD can compute pairwise tests, effect sizes, optional omnibus statistics, and metric rankings from the condition artifacts.

  • Custom comparison: plugins with richer result structures can implement their own comparison behavior. These plugins still write comparison output, but they may not produce the same tests, tables, or rankings as the default scalar path.

For plugins using the default scalar comparison path, PolyzyMD computes:

  • Pairwise t-tests between conditions, with Benjamini–Hochberg false discovery rate (FDR) correction to account for multiple comparisons.

  • Effect sizes (Cohen’s d) for each pair, so you can see not just whether a difference is significant but how large it is.

  • ANOVA when there are three or more conditions, as an omnibus test before the pairwise comparisons.

  • Rankings of conditions according to each metric’s directionality. These rankings are screening aids for follow-up interpretation, not biological truth by themselves.

The comparison results are saved as JSON and also printed to the terminal when you run polyzymd compare run. For details on interpreting these outputs, see Comparison and Plotting Reference.
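For readers who want to see what two of these quantities mean numerically, here is a standard-library sketch of Cohen's d and the Benjamini–Hochberg adjustment. It is not PolyzyMD's implementation. The pooled-standard-deviation form of Cohen's d is one common definition and is an assumption here, and the per-replicate values and p-values are made-up illustration data.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    pooled = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(sample_a) - mean(sample_b)) / pooled


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-replicate means of some scalar metric for two conditions.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
d = cohens_d(treated, control)
print(d)

# Hypothetical raw p-values from three pairwise tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print([round(p, 3) for p in adjusted])
```

With only a few replicates per condition, both numbers are coarse estimates, which is one reason the rankings are described above as screening aids.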

Output structure

After running polyzymd compare run, your project directory will contain:

comparison_project/
├── comparison.yaml
├── analysis/
│   └── <condition_label>/
│       └── <plugin_name>/
│           ├── run_<N>/
│           │   └── result.json # ReplicateArtifact
│           └── aggregated/
│               └── result.json # ConditionArtifact
├── comparison/
│   └── <plugin_name>/
│       └── result.json         # ComparisonArtifact or active custom result
└── figures/
    └── <plugin_name>/
        └── *.<format>          # Plots; png by default, pdf/svg supported

The three output directories map directly to the pipeline stages:

  • analysis/ holds the replicate-stage and aggregate output. Each condition gets its own filesystem-sanitized subdirectory, and within that, each plugin gets a directory with ReplicateArtifact files in run_1/, run_2/, … and a ConditionArtifact in aggregated/result.json.

  • comparison/ holds the compare output. One result.json per plugin stores a ComparisonArtifact or an active custom comparison result. Default scalar comparisons include framework-generated tests and rankings; custom comparison outputs may use plugin-specific summaries.

  • figures/ holds the plot output. One subdirectory per plugin with PNG files by default, or another configured format such as PDF or SVG. Plots are generated from cached artifacts and sidecars only.

Results are cached: if you rerun the pipeline without changing settings, the replicate stage skips replicates that already have canonical artifacts on disk.
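To see which replicates would still need work on a rerun, you can check for the per-replicate result files yourself. The sketch below only tests whether result.json exists. How PolyzyMD decides that a cached artifact is canonical or still matches the current settings is not covered here, and `missing_replicates` is a hypothetical helper. The demo builds a throwaway directory tree, so it does not touch a real project.

```python
import tempfile
from pathlib import Path


def missing_replicates(project_dir, condition_dir, plugin_name, replicates):
    """Replicate numbers with no run_<N>/result.json on disk."""
    root = Path(project_dir) / "analysis" / condition_dir / plugin_name
    return [
        n for n in replicates
        if not (root / f"run_{n}" / "result.json").is_file()
    ]


# Demo on a temporary directory: only replicate 1 has a result file.
with tempfile.TemporaryDirectory() as tmp:
    # "No_Polymer" is an assumed sanitized directory name.
    run_1 = Path(tmp) / "analysis" / "No_Polymer" / "rmsf" / "run_1"
    run_1.mkdir(parents=True)
    (run_1 / "result.json").write_text("{}")
    missing = missing_replicates(tmp, "No_Polymer", "rmsf", [1, 2, 3])

print(missing)
```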

See also