Comparison and Plotting Reference

Use this page when you need quick lookup information for polyzymd compare, comparison.yaml, output paths, or plotting behavior.

Comparison Project Layout

polyzymd compare init -n my_study creates a workspace like this:

my_study/
├── comparison.yaml
├── comparison/
├── figures/
└── structures/

Core `comparison.yaml` Fields

name: "polymer_stability_study"
description: "Optional human-readable summary"
control: "No Polymer"  # optional

conditions:
  - label: "No Polymer"
    config: "no_polymer/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"

Per-Plugin Statistical Settings

Some plugins support per-plugin statistical settings configured under the plugins: block in comparison.yaml. These control false discovery rate correction, effect-size filtering, and output truncation for cross-condition comparisons.

Canonical YAML Example

plugins:
  contacts:
    cutoff: 4.5
    fdr_alpha: 0.05
    min_effect_size: 0.5
    top_residues: 10

Settings Support Matrix

Setting	contacts	Default
`fdr_alpha`	✓	0.05
`min_effect_size`	✓	0.5
`top_residues`	✓	10

Setting Descriptions

fdr_alpha — Significance threshold for pairwise comparisons. When posthoc_method is "ttest_bh", this controls the Benjamini-Hochberg false discovery rate. When posthoc_method is "tukey_hsd", this is the family-wise alpha threshold. Also used as the ANOVA significance threshold. Lower values are more conservative.
min_effect_size — Minimum Cohen’s d required for practical significance. Pairs that meet or exceed this threshold are highlighted with “†” in formatted output; all pairs are shown regardless.
top_residues — Maximum number of contacted residues shown per condition, ranked by aggregated contact_fraction_mean. Affects both saved JSON and CLI output.
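The three settings above can be illustrated with a minimal, standalone Python sketch. It is not PolyzyMD's implementation: the function names, the dictionary keys `label`, `p`, and `cohens_d`, the use of the absolute value of Cohen's d, the `<=` comparison against `fdr_alpha`, and the descending sort order are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def flag_pairs(pairs, fdr_alpha=0.05, min_effect_size=0.5):
    """Annotate pairwise results (hypothetical dict layout) with two flags."""
    p_adj = benjamini_hochberg([pair["p"] for pair in pairs])
    flagged = []
    for pair, q in zip(pairs, p_adj):
        flagged.append({
            "label": pair["label"],
            "p_adj": q,
            # Assumption: a pair is significant when p_adj <= fdr_alpha.
            "significant": q <= fdr_alpha,
            # Assumption: the dagger marker uses the magnitude of Cohen's d.
            "dagger": abs(pair["cohens_d"]) >= min_effect_size,
        })
    return flagged


def keep_top_residues(residues, top_residues=10):
    """Keep the N residues with the largest contact_fraction_mean."""
    ranked = sorted(residues, key=lambda r: r["contact_fraction_mean"], reverse=True)
    return ranked[:top_residues]
```

Note that the effect-size flag is independent of the significance flag: a pair can carry the dagger marker without passing the `fdr_alpha` threshold, and the reverse.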

Stable Plugin Keys

Stable analysis plugins:

rmsd
rg
rmsf
contacts
distances
catalytic_triad
secondary_structure
sasa
hydrogen_bonds

Plugin Summary Table

Plugin	Default compare?	Primary metric	Key feature	Statistical method
`rmsd`	No (custom)	`mean_rmsd`	Backbone stability over time	Per-run pairwise t-tests + ANOVA
`rg`	No (custom)	`mean_rg`	Protein compactness	Per-run pairwise t-tests + ANOVA
`rmsf`	Yes	`mean_rmsf`	Per-residue flexibility	FDR-corrected pairwise t-tests + ANOVA
`contacts`	No (custom)	Coverage + contact fraction	Per-residue contact mapping	FDR-corrected pairwise t-tests per residue
`distances`	No (custom)	Multiple distance metrics	Named distance pairs	Per-distance t-tests + ANOVA
`catalytic_triad`	Yes	`simultaneous_contact_fraction`	Active-site geometry	FDR-corrected pairwise t-tests + ANOVA
`secondary_structure`	Yes	`helix_fraction`	Secondary structure content	FDR-corrected pairwise t-tests + ANOVA
`sasa`	No (custom)	Per-run mean SASA	Multi-run target/context model	Per-run pairwise t-tests + ANOVA
`hydrogen_bonds`	Custom loader with default-style scalar statistics	`mean_hbonds_per_frame` per summary	Flexible named groups + summaries + composition analysis	FDR-corrected pairwise t-tests + ANOVA per configured summary

Path Rules

relative paths in config: are resolved relative to the directory that contains comparison.yaml
absolute paths are used as-is
replicates must be an explicit list such as [1, 2, 3]
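The path rules above can be sketched with `pathlib`. This is an illustration only: `resolve_config_path` and `check_replicates` are hypothetical helper names, not PolyzyMD functions, and the exact validation that polyzymd compare validate performs may differ.

```python
from pathlib import Path


def resolve_config_path(comparison_yaml, config_entry):
    """Resolve a condition's config entry the way the path rules describe."""
    entry = Path(config_entry)
    if entry.is_absolute():
        # Absolute paths are used as-is.
        return entry
    # Relative paths are anchored at the directory holding comparison.yaml.
    return Path(comparison_yaml).parent / entry


def check_replicates(replicates):
    """Return True only for an explicit, non-empty list of integers."""
    return (
        isinstance(replicates, list)
        and len(replicates) > 0
        and all(isinstance(r, int) and not isinstance(r, bool) for r in replicates)
    )
```

For example, with comparison.yaml at `/work/my_study/comparison.yaml`, the entry `no_polymer/config.yaml` resolves to `/work/my_study/no_polymer/config.yaml`, while a range-like string such as `"1-3"` fails the replicate check.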

Replicate Counts

All stable shipped analyses support replicates: [1] for smoke tests and protocol validation. One-replicate runs compute aggregate metrics and plots, but inferential statistics, FDR correction, and uncertainty bands require at least two independent replicates per condition. Singleton pairwise tests and ANOVA are reported as not testable rather than significant.
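The two-replicate requirement follows from the sample standard deviation, which is undefined for a single value. A minimal sketch of the guard, using hypothetical function names and a hypothetical result layout rather than PolyzyMD's actual output schema:

```python
import math
import statistics


def summarize_condition(values):
    """Summarize per-replicate values for one condition."""
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        # A single replicate has no sample variance, so no SEM and no tests.
        return {"n": n, "mean": mean, "sem": None, "testable": False}
    sem = statistics.stdev(values) / math.sqrt(n)
    return {"n": n, "mean": mean, "sem": sem, "testable": True}


def pair_is_testable(values_a, values_b):
    """A pairwise test needs at least two replicates in both conditions."""
    return len(values_a) >= 2 and len(values_b) >= 2
```

A one-replicate condition therefore still yields a mean for plots and aggregate tables, but any pairwise comparison involving it is marked as not testable.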

Commands

Command	Purpose
`polyzymd compare init -n NAME`	Create a comparison workspace
`polyzymd compare validate`	Check `comparison.yaml` before running
`polyzymd compare run TYPE`	Run one analysis plugin
`polyzymd compare run --list`	List available comparison types
`polyzymd compare run-all`	Run every enabled plugin in one pass
`polyzymd compare plot-all`	Generate configured figures
`polyzymd compare plot-all --list-available`	List available plots and experimental labels
`polyzymd compare submit ANALYSIS`	Submit a SLURM DAG for one analysis plugin
`polyzymd compare status ANALYSIS`	Show status of a submitted SLURM DAG
`polyzymd compare finalize ANALYSIS`	Run comparison + plotting from on-disk aggregated results

Common Stable Commands

The commands below assume that you are inside the pixi environment (pixi shell -e build) or that each command is prefixed with pixi run -e build.

polyzymd compare run rmsd
polyzymd compare run rg
polyzymd compare run rmsf
polyzymd compare run contacts
polyzymd compare run distances
polyzymd compare run catalytic_triad
polyzymd compare run secondary_structure
polyzymd compare run sasa
polyzymd compare run hydrogen_bonds
polyzymd compare run-all
polyzymd compare plot-all

Output Locations

per-replicate cache files are written under analysis/<condition>/<analysis>/run_<replicate>/
per-condition aggregate files are written under analysis/<condition>/<analysis>/aggregated/
cross-condition comparison JSON files are written to comparison/<analysis>/result.json
figures are written under the configured plot_settings.output_dir, usually figures/<analysis>/
polyzymd compare init scaffolds comparison/, figures/, and structures/ next to comparison.yaml; analysis/ is created and populated during analysis runs

Typical comparison cache paths:

comparison/rmsd/result.json
comparison/rg/result.json
comparison/rmsf/result.json
comparison/contacts/result.json
comparison/distances/result.json
comparison/catalytic_triad/result.json
comparison/secondary_structure/result.json
comparison/sasa/result.json
comparison/hydrogen_bonds/result.json
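The output locations above follow a regular pattern, which the sketch below encodes as small path builders. The helper names are hypothetical, and two points are assumptions: that analysis/ sits next to comparison.yaml in the project directory, and that `<condition>` is passed in exactly as it appears on disk (the source does not say whether labels are sanitized).

```python
from pathlib import Path


def replicate_cache_dir(project_dir, condition, analysis, replicate):
    """analysis/<condition>/<analysis>/run_<replicate>/"""
    return Path(project_dir) / "analysis" / condition / analysis / f"run_{replicate}"


def aggregate_dir(project_dir, condition, analysis):
    """analysis/<condition>/<analysis>/aggregated/"""
    return Path(project_dir) / "analysis" / condition / analysis / "aggregated"


def comparison_result(project_dir, analysis):
    """comparison/<analysis>/result.json"""
    return Path(project_dir) / "comparison" / analysis / "result.json"
```

A quick existence check such as `comparison_result("my_study", "rmsd").exists()` is one way to confirm that a comparison finished before running the plotting step.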

Plotting Smoke Test

For a final smoke test after comparisons finish:

polyzymd compare plot-all --list-available
polyzymd compare plot-all

Plugin-Specific Metadata Fields

Some plugins include additional metadata in their comparison output beyond the standard ranking and statistical fields. These fields are additive diagnostics — they do not affect rankings, p-values, or effect sizes.

RMSD Convergence Output

The RMSD plugin includes per-run convergence diagnostics generated by the sliding-window convergence heuristic in analyses/shared/convergence.py. These fields appear in the per-condition summaries within comparison/rmsd/result.json:

Field	Type	Description
`convergence_fraction`	`float`	Fraction of replicates that converged (0.0–1.0)
`n_converged_replicates`	`int`	Count of replicates where sustained convergence was detected
`mean_convergence_time_ns`	`float | null`	Mean convergence time across converged replicates (ns)
`median_convergence_time_ns`	`float | null`	Median convergence time across converged replicates (ns)

Note

Convergence metadata is purely informational. It does not influence the RMSD ranking, pairwise t-tests, ANOVA, or effect-size calculations. Use it to identify conditions where one or more replicates failed to reach a stable plateau, which may warrant longer production runs or additional replicates.
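To show how the four fields relate to each other, here is a minimal sketch that derives them from per-replicate convergence times. It does not reproduce the sliding-window heuristic itself. The function name and the input convention (one entry per replicate, `None` when no sustained convergence was detected) are assumptions for illustration.

```python
import statistics


def summarize_convergence(times_ns):
    """Aggregate per-replicate convergence times (None = not converged)."""
    converged = [t for t in times_ns if t is not None]
    n_total = len(times_ns)
    return {
        "convergence_fraction": len(converged) / n_total if n_total else 0.0,
        "n_converged_replicates": len(converged),
        # Mean and median are null when no replicate converged.
        "mean_convergence_time_ns": statistics.fmean(converged) if converged else None,
        "median_convergence_time_ns": statistics.median(converged) if converged else None,
    }
```

With three replicates where one never converges, for instance `[12.0, None, 20.0]`, the fraction is 2/3 and both the mean and the median are computed over the two converged replicates only.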

Statistical Terms

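p-value: probability of observing a difference at least as extreme as the measured one, assuming the null hypothesis is true
Cohen's d: standardized effect size, the difference between two condition means expressed in units of their standard deviation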
ANOVA: omnibus test across multiple conditions
SEM: standard error of the mean across replicates
Benjamini-Hochberg (BH): step-up procedure for controlling the false discovery rate across multiple hypothesis tests
Adjusted p-value (p_adj): p-value corrected for multiple comparisons via the BH procedure (for ttest_bh) or family-wise Tukey adjustment (for tukey_hsd)
False Discovery Rate (FDR): expected proportion of false positives among rejected hypotheses
Effect size threshold: minimum Cohen’s d required for a pairwise difference to be considered practically significant
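As a worked illustration of the effect-size term, the sketch below computes Cohen's d with a pooled standard deviation. The pooled-variance form is the common textbook definition; whether PolyzyMD uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def cohens_d(values_a, values_b):
    """Cohen's d for two groups of replicate values, using pooled SD."""
    n_a, n_b = len(values_a), len(values_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two replicates")
    var_a = statistics.variance(values_a)
    var_b = statistics.variance(values_b)
    # Pooled variance weights each group by its degrees of freedom.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0.0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.fmean(values_a) - statistics.fmean(values_b)) / pooled_sd
```

For the groups `[2, 4, 6]` and `[1, 3, 5]`, both sample variances are 4, the pooled standard deviation is 2, and the means differ by 1, so d is 0.5, which sits exactly at the default `min_effect_size` threshold.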

For interpretation guidance rather than lookup, see:

Statistical Best Practices for Analysis
How to Compare Simulation Conditions
Post-Hoc Testing Reference — full post-hoc method details, output fields, and edge cases