# Analysis System Concepts

Your simulations are done. You have trajectories on disk and you want to measure something — RMSF, contacts, distances, whatever. This page explains how PolyzyMD's analysis system is put together so that when you run a command or read an output file, you know what happened and where to look.

## The analysis pipeline

Every analysis in PolyzyMD follows the same four-stage pipeline:

```text
compute_replicate → aggregate → compare → plot
```

Here is what each stage does:

| Stage | Scope | What it produces |
|-------|-------|------------------|
| **compute_replicate** | One replicate of one condition | Raw per-replicate result (e.g., RMSF values for each residue) |
| **aggregate** | All replicates of one condition | Means, standard errors, and distributions across replicates |
| **compare** | All conditions together | Statistical tests, effect sizes, and rankings |
| **plot** | All conditions together | Figures saved as PNG files |

You don't call these stages yourself. When you run `polyzymd compare run`, the CLI walks through the pipeline automatically. But knowing the stages helps when you need to debug ("Which stage failed?") or interpret output ("Is this a per-replicate file or an aggregated file?").

## `comparison.yaml` — the control file

The `comparison.yaml` file is the single input that defines an analysis run. It tells PolyzyMD what simulations to analyze, what to measure, and how to compare the results.

Here is a minimal example:

```yaml
name: "polymer_stability_study"

conditions:
  - label: "No Polymer"
    config: "../no_polymer/config.yaml"
    replicates: [1, 2, 3]
  - label: "100% SBMA"
    config: "../sbma_100/config.yaml"
    replicates: [1, 2, 3]

control: "No Polymer"

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"
  contacts: {}
```

The key sections are:

### `conditions`

Each entry points to a simulation's `config.yaml` and lists which replicate numbers to include.
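As a rough illustration of how these entries expand into individual trajectories, here is a minimal Python sketch. The plain-dict layout and the `replicate_units` helper are hypothetical, written only for this page; they are not PolyzyMD's API or internal data model.

```python
# Plain-dict mirror of the `conditions` block from the minimal example.
# Illustrative stand-in only; PolyzyMD's own config objects may look different.
conditions = [
    {"label": "No Polymer", "config": "../no_polymer/config.yaml", "replicates": [1, 2, 3]},
    {"label": "100% SBMA", "config": "../sbma_100/config.yaml", "replicates": [1, 2, 3]},
]


def replicate_units(conditions):
    """Expand each condition into one (label, config, replicate) tuple per replicate."""
    return [
        (cond["label"], cond["config"], rep)
        for cond in conditions
        for rep in cond["replicates"]
    ]


units = replicate_units(conditions)
for label, config, rep in units:
    print(f"{label!r}: replicate {rep} (system defined in {config})")
```

With two conditions of three replicates each, the list has six entries, one for each trajectory that the `compute_replicate` stage would process.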
The `label` is a human-readable name that shows up in plots and result files.

### `control`

Which condition to use as the baseline for statistical comparisons. Set this to the label of your reference condition (typically an unmodified or no-polymer system). If you only have one condition or don't want relative comparisons, set it to `null` or leave it out.

### `defaults.equilibration_time`

How much trajectory to discard from the beginning of each run. Early frames are typically not equilibrated, so the pipeline skips them. Specify as a string with units: `"10ns"`, `"5000ps"`, etc. The default is `"10ns"`.

### `plugins`

Which analyses to run and their settings. Each key is a plugin name (like `rmsf` or `contacts`), and the value is a settings block for that plugin. An empty block `{}` means "run with defaults." Only plugins listed here are executed — if you don't include `sasa`, SASA won't be computed.

For the complete schema with all fields, see {doc}`../reference/comparison_yaml`.

## Conditions and replicates

These two terms come up everywhere in the analysis output, so it helps to be precise about what they mean in PolyzyMD.

A **condition** is one simulation setup. Examples: "No Polymer", "SBMA-100", "PEG-50". Each condition has its own `config.yaml` that defines the system (which protein, which polymer, which force field, etc.).

A **replicate** is an independent run of the same condition. Replicates are identified by number — 1, 2, 3, and so on. Each replicate uses the same `config.yaml` but starts from a different random seed or initial configuration, producing an independent trajectory.

The pipeline processes data in this order:

1. **Per-replicate**: `compute_replicate` runs once for each replicate of each condition. If you have 2 conditions with 3 replicates each, that's 6 compute calls.
2. **Per-condition**: `aggregate` runs once per condition, combining the replicate results. That's 2 aggregate calls.
3. **Cross-condition**: `compare` and `plot` each run once, looking at all conditions together.

## Plugins — the analysis modules

PolyzyMD ships with 13 analysis plugins. Each plugin is a self-contained module that knows how to compute one type of measurement, aggregate it, compare across conditions, and generate plots.

The available plugins are:

| Plugin name | What it measures |
|-------------|-----------------|
| `rmsd` | Root-mean-square deviation over time |
| `rg` | Radius of gyration over time |
| `rmsf` | Root-mean-square fluctuation per residue |
| `contacts` | Intermolecular contacts between protein and other components |
| `distances` | Distances between specified atom groups |
| `catalytic_triad` | Catalytic triad geometry (active-site distances) |
| `secondary_structure` | Secondary structure content (helix, sheet, coil fractions) |
| `sasa` | Solvent-accessible surface area |
| `hydrogen_bonds` | Hydrogen bond occupancy and lifetimes |
| `binding_free_energy` | Per-contact binding free energy estimates (experimental) |
| `exposure` | Exposure dynamics of active-site residues (experimental) |
| `polymer_affinity` | Polymer–protein interaction scoring (experimental) |
| `polymer_bridging` | Polymer bridging topology between protein regions (experimental) |

Each plugin has a `Settings` model with configurable parameters. Most parameters have sensible defaults, so you often just need `plugin_name: {}` in your `comparison.yaml` to get started.

You configure plugins in the `plugins:` block. For example, to run RMSF with a custom selection and contacts with defaults:

```yaml
plugins:
  rmsf:
    selection: "protein and name CA"
  contacts: {}
```

## Statistical comparison

When you have two or more conditions, the compare stage produces statistical output so you can assess whether differences are meaningful. Here is what PolyzyMD computes:

- **Pairwise t-tests** between each pair of conditions, with Benjamini–Hochberg FDR correction to account for multiple comparisons.
- **Effect sizes** (Cohen's d) for each pair, so you can see not just whether a difference is significant but how large it is.
- **ANOVA** when there are three or more conditions, as an omnibus test before the pairwise comparisons.
- **Rankings** of conditions from best to worst on each metric (where "best" depends on the metric — lower RMSF is better, higher helix fraction is better).

The comparison results are saved as JSON and also printed to the terminal when you run `polyzymd compare run`.

For details on interpreting these outputs, see {doc}`../reference/analysis_comparison_reference`.

## Output structure

After running `polyzymd compare run`, your project directory will contain:

```text
comparison_project/
├── comparison.yaml
├── analysis/
│   └── /
│       └── /
│           ├── run_/          # Per-replicate results
│           └── aggregated/    # Cross-replicate aggregation
├── comparison/
│   └── /
│       └── result.json        # Cross-condition comparison
└── figures/
    └── /
        └── *.png              # Plots
```

The three output directories map directly to the pipeline stages:

- **`analysis/`** holds the compute and aggregate output. Each condition gets its own subdirectory, and within that, each plugin gets a directory with per-replicate (`run_1/`, `run_2/`, ...) and `aggregated/` results.
- **`comparison/`** holds the compare output. One `result.json` per plugin with the statistical tests and rankings.
- **`figures/`** holds the plot output. One subdirectory per plugin with PNG files.

Results are cached: if you rerun the pipeline without changing settings, the compute stage skips replicates that already have results on disk.

## See also

- {doc}`../tutorials/first_analysis` — Hands-on tutorial for running your first analysis
- {doc}`../how_to/analysis_compare_conditions` — Practical guide to setting up a multi-condition comparison
- {doc}`../reference/comparison_yaml` — Full `comparison.yaml` schema reference
- {doc}`../reference/analysis_comparison_reference` — Plugin listing and statistical terms reference