# `comparison.yaml` Schema Reference

The `comparison.yaml` file defines a cross-condition analysis project. It specifies which simulation conditions to compare, which analysis plugins to run, and how to visualize results.

Create one with `polyzymd compare init -n ` and place it at the root of your comparison project directory.

Source of truth: {func}`polyzymd.config.comparison.ComparisonConfig` in `src/polyzymd/config/comparison.py`.

```{important}
Plugin settings path fields are resolved relative to the directory containing `comparison.yaml`. For example, in `plugins.rmsf.reference_file`, condition `config` paths, and other plugin-declared path fields, a relative path like `structures/enzyme.pdb` is interpreted as:

`/structures/enzyme.pdb`
```

For CLI commands that consume this file, see {doc}`analysis_comparison_reference`. For directory layout and data expectations, see {doc}`data_requirements`.

Typical local workflow:

```bash
pixi run -e build polyzymd compare validate -f comparison.yaml
pixi run -e build polyzymd compare run rmsf -f comparison.yaml
pixi run -e build polyzymd compare plot-all -f comparison.yaml
```

Typical SLURM workflow:

```bash
pixi run -e build polyzymd compare submit sasa -f comparison.yaml --dry-run
pixi run -e build polyzymd compare submit sasa -f comparison.yaml --partition
pixi run -e build polyzymd compare status sasa -f comparison.yaml
pixi run -e build polyzymd compare finalize sasa -f comparison.yaml
pixi run -e build polyzymd compare plot-all -f comparison.yaml
```

---

## Minimal Working Example

```yaml
name: "polymer_stability_study"

conditions:
  - label: "No Polymer"
    config: "../no_polymer/config.yaml"
    replicates: [1, 2, 3]

  - label: "100% SBMA"
    config: "../sbma_100/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "10ns"

plugins:
  rmsf:
    selection: "protein and name CA"
```

---

## Top-Level Fields

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `name` | string | **yes** | — | Human-readable project name |
| `description` | string | no | `null` | Description of what is being compared |
| `control` | string | no | `null` | Label of the control condition. Must match a `label` in `conditions`. Used for relative comparisons (e.g., Δ from control). |
| `conditions` | list | **yes** | — | List of condition entries (min 1 required) |
| `defaults` | mapping | no | see below | Default analysis parameters |
| `plugins` | mapping | no | `{}` | Analysis plugin settings: **what** to compute |
| `mda_backend_policy` | mapping | no | `{}` | Optional MDAnalysis internal backend policy for job-backed analyses |
| `plot_settings` | mapping | no | see below | Plot customization: **how** to visualize |

Unknown top-level keys raise a `ValueError` listing the invalid keys and valid alternatives. Use `plugins:` for analysis plugin settings; unsupported keys such as `analysis_settings:` are rejected.

---

## `conditions[*]`

Each entry describes one simulation condition to include in the comparison.

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `label` | string | **yes** | — | Display name (must be unique across all conditions) |
| `config` | path | **yes** | — | Path to the simulation's `config.yaml`. Relative paths resolved from `comparison.yaml` location. |
| `replicates` | list of int | **yes** | — | Replicate numbers to include. A single `int` is auto-wrapped to a list. |

---

## `defaults`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `equilibration_time` | string | `"10ns"` | Time to discard as equilibration (e.g., `"10ns"`, `"5000ps"`) |
| `fdr_alpha` | float (0, 1] | `0.05` | Significance threshold for pairwise comparisons and ANOVA. Used as the Benjamini-Hochberg FDR threshold when `posthoc_method` is `"ttest_bh"`, and as the family-wise alpha threshold when `posthoc_method` is `"tukey_hsd"`. |
| `posthoc_method` | `"ttest_bh"` or `"tukey_hsd"` | `"ttest_bh"` | Post-hoc pairwise comparison method. See {doc}`posthoc_testing` for details. |
| `ttest_method` | `"student"` or `"welch"` | `"student"` | Two-sample t-test variance assumption. Only used when `posthoc_method` is `"ttest_bh"`. |

`equilibration_time` is interpreted as an absolute MDAnalysis trajectory timestamp when the loaded trajectory exposes finite frame times. This handles continuation runs where the first loaded segment may begin after 0 ps. If frame timestamps are unavailable, PolyzyMD treats the first loaded frame as time zero.

## `mda_backend_policy`

The default policy is empty and forwards no backend-related keyword arguments to MDAnalysis. This avoids nested oversubscription: PolyzyMD schedules work across conditions/replicates, while each replicate remains serial unless you explicitly opt into an MDAnalysis backend.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `backend` | string | `null` | Backend name forwarded to `AnalysisBase.run()`, such as `"multiprocessing"` or `"dask"` |
| `n_workers` | positive int | `null` | Worker count forwarded only when `backend` is set |
| `n_parts` | positive int | `null` | Optional partition count forwarded only when `backend` is set |

Example opt-in for local MDAnalysis internal parallelism:

```yaml
mda_backend_policy:
  backend: "multiprocessing"
  n_workers: 2
  n_parts: 2
```

Function-adapter jobs generated by the simple scaffold reject non-default backend policies; use an `AnalysisBase`-compatible job for MDAnalysis internal parallelism.

---

## `plugins`

Presence of a key **enables** that analysis. The value is a mapping of that plugin's settings. An empty mapping (`rmsf: {}`) enables the plugin with all defaults.
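
As a sketch of the enable-by-presence rule, the fragment below enables three plugins: `rmsf` with all defaults, `secondary_structure` with a single override, and `sasa` with its `runs` list. The field names are taken from the tables in this reference; the run label, chain letter, and selection string are illustrative assumptions, not values from a real project.

```yaml
plugins:
  # Empty mapping: the plugin is enabled and every setting takes its default.
  rmsf: {}

  # Single override; settings that are not listed keep their defaults.
  secondary_structure:
    chain_id: "B"   # illustrative value; the documented default is "A"

  # `runs` is marked as required for sasa in this reference, so it is supplied.
  sasa:
    runs:
      - label: "protein_surface"      # hypothetical run label
        target_selection: "protein"   # illustrative MDAnalysis selection

# Plugins with no key here (rmsd, rg, contacts, ...) are presumably left disabled,
# since only the presence of a key enables an analysis.
```

Keeping untouched plugins as empty mappings rather than copying their defaults into the file means the project follows whatever defaults the installed PolyzyMD version defines.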
### `plugins.rmsf`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `selection` | string | `"protein and name CA"` | MDAnalysis selection string for RMSF computation |
| `reference_mode` | string | `"centroid"` | Reference structure: `"centroid"`, `"average"`, `"frame"`, or `"external"` |
| `reference_frame` | int | `null` | Required when `reference_mode` is `"frame"` |
| `reference_file` | path | `null` | Path to external PDB reference structure. Required when `reference_mode` is `"external"`. Also used for secondary structure annotation on profile plots. |
| `alignment_selection` | string | `"protein and name CA"` | MDAnalysis selection used for trajectory alignment before RMSF calculation |
| `centroid_selection` | string | `"protein"` | MDAnalysis selection used to compute the centroid reference structure when `reference_mode` is `"centroid"` |

### `plugins.secondary_structure`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `chain_id` | string | `"A"` | Chain letter for the protein to analyze via DSSP |

### `plugins.sasa`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `runs` | list | **(required)** | List of SASA run definitions (see sub-fields) |
| `probe_radius_nm` | float | `0.14` | MDTraj Shrake-Rupley probe radius in nanometers |
| `n_sphere_points` | int | `960` | Number of sphere points for MDTraj Shrake-Rupley SASA |
| `chunk_size` | int | `100` | Frames per chunk for memory management |

Each entry in `runs`:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | string | **(required)** | Name for this SASA computation |
| `target_selection` | string | **(required)** | MDAnalysis selection for the target surface |
| `context_selection` | string | same as `target_selection` | Atoms to include in SASA context (affects shadowing) |
| `stride` | int | `1` | Frame stride |

### `plugins.catalytic_triad`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | string | `"catalytic_triad"` | Display name for the triad analysis |
| `description` | string | `null` | Optional description of the triad (e.g., `"Ser-His-Asp catalytic triad"`) |
| `threshold` | float | `3.5` | Distance threshold in Angstroms (H-bond cutoff) |
| `pairs` | list | **(required)** | List of atom pair definitions |

Each entry in `pairs`:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | string | **(required)** | Display label (e.g., `"Asp-His"`) |
| `selection_a` | string | **(required)** | MDAnalysis selection for atom/group A. Supports `midpoint(...)` syntax. |
| `selection_b` | string | **(required)** | MDAnalysis selection for atom/group B |

### `plugins.distances`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `threshold` | float | `3.5` | Global default threshold in Angstroms |
| `pairs` | list | **(required)** | List of distance pair definitions |
| `use_pbc` | bool | `true` | Apply periodic boundary conditions to distance calculations |
| `align_trajectory` | bool | `true` | Align trajectory before computing distances |
| `alignment_selection` | string | `"protein and name CA"` | MDAnalysis selection used for trajectory alignment |
| `alignment_mode` | string | `"centroid"` | Alignment reference mode: `"centroid"` or `"frame"` |
| `alignment_frame` | int | `null` | Frame index to use as reference when `alignment_mode` is `"frame"` |

Each entry in `pairs`:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | string | **(required)** | Display label (e.g., `"Ser77-Substrate"`) |
| `selection_a` | string | **(required)** | MDAnalysis selection for group A. Supports `com(...)` syntax. |
| `selection_b` | string | **(required)** | MDAnalysis selection for group B |
| `threshold` | float | global `threshold` | Per-pair threshold override |
| `below_label` | string | `"Below {threshold}Å"` | Display text for d ≤ threshold |
| `above_label` | string | `"Above {threshold}Å"` | Display text for d > threshold |

### `plugins.contacts`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `polymer_selection` | string | `"chainid C"` | MDAnalysis selection for polymer atoms |
| `protein_selection` | string | `"chainid A"` | MDAnalysis selection for protein atoms |
| `cutoff` | float | `4.5` | Contact distance cutoff in Angstroms |
| `grouping` | string | `"aa_class"` | Residue grouping: `"aa_class"`, `"secondary_structure"`, or `"none"` |
| `compute_residence_times` | bool | `true` | Whether to compute aggregate residence-time summaries and plots. When `false`, per-replicate contact events are still stored and the canonical artifact identity changes. |
| `protein_groups` | mapping | `null` | Custom residue groups: `{group_name: [resid, ...]}` |
| `protein_partitions` | mapping | `null` | Mutually exclusive partitions for contact-fraction and residence-time plots: `{partition_name: [group_name, ...]}` |
| `polymer_types` | list of string | `null` | Explicit polymer type labels. If `null`, types are auto-detected from topology. |
| `fdr_alpha` | float | `0.05` | Per-plugin FDR threshold |
| `min_effect_size` | float | `0.5` | Minimum Cohen's d for practical significance |
| `top_residues` | int | `10` | Max residues shown per condition in formatted output |

### `plugins.rmsd`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `runs` | list | **(required)** | List of RMSD run definitions |

Each entry in `runs`:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | string | **(required)** | Name for this RMSD computation (e.g., `"backbone"`) |
| `selection` | string | **(required)** | MDAnalysis selection for RMSD atoms |
| `alignment_selection` | string | same as `selection` | MDAnalysis selection for alignment |
| `reference_mode` | string | `"centroid"` | Reference structure mode: `"centroid"` or `"frame"` |
| `reference_frame` | int | `0` | Frame index to use as reference when `reference_mode` is `"frame"` |
| `reference_file` | path | `null` | Path to external PDB reference structure |
| `centroid_selection` | string | `null` | MDAnalysis selection for centroid computation. If `null`, uses `alignment_selection`. |
| `convergence_window_size_ns` | float | `15.0` | Rolling window size in nanoseconds for convergence detection |
| `convergence_step_size_ns` | float | `5.0` | Step size in nanoseconds between convergence windows |
| `convergence_slope_threshold` | float | `0.0005` | Maximum slope (Å/ns) for a window to be considered converged |
| `convergence_sustained_for_ns` | float | `15.0` | Duration in nanoseconds that convergence must be sustained |

### `plugins.rg`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `runs` | list | **(required)** | List of Rg run definitions |

Each entry in `runs`:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | string | **(required)** | Name for this Rg computation |
| `selection` | string | **(required)** | MDAnalysis selection for Rg atoms |
| `calculation_mode` | string | `"selection"` | Computation mode: `"selection"` (single Rg for the whole selection) or `"fragments"` (per-fragment Rg) |
| `fragment_weighting` | string | `"equal"` | How to weight fragments when `calculation_mode` is `"fragments"`: `"equal"` or `"mass"` |
| `save_fragment_distribution` | bool | `true` | Save per-frame fragment Rg distributions |
| `histogram_bins` | int | `50` | Number of bins for Rg distribution histograms |

### `plugins.hydrogen_bonds`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `groups` | mapping | `{"protein": "chainid A", "polymer": "chainid C"}` | Named atom groups: `{name: "MDAnalysis selection"}` |
| `summaries` | list or mapping | one default summary (`protein_polymer` between `protein` and `polymer`) | Named H-bond summaries (see below) |
| `distance_cutoff` | float | `3.0` | H-bond distance cutoff in Angstroms |
| `angle_cutoff` | float | `150` | H-bond angle cutoff in degrees |
| `update_selections` | bool | `true` | Update atom selections every frame |
| `top_n_pairs` | int | `15` | Number of top residue pairs to report |
| `allow_empty_groups` | bool | `true` | Allow empty group selections: `true` = warn and skip summaries when a group matches no atoms; `false` = raise error |
| `allow_overlapping_composition` | bool | `false` | Whether overlapping composition partitions are allowed |
| `composition` | mapping | `null` | Composition analysis settings |
| `timestep_ps` | float | `null` | Override trajectory timestep in picoseconds for time-axis plots |

Time-axis plots assume uniformly saved frames. PolyzyMD converts frame index to time as `frame_index * timestep_ps`; variable-timestep concatenated trajectories are not supported.

Each summary entry in `summaries` has:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | yes | Unique summary name |
| `between` | `[group_a, group_b]` | exactly one of `between` / `within` | Inter-group H-bonds |
| `within` | `group_name` | exactly one of `between` / `within` | Intra-group H-bonds |

For mapping-form input, keys are treated as `name` values.

Hydrogen detection uses MDAnalysis `HydrogenBondAnalysis` with hydrogens selected as `() and element H`; topologies need explicit hydrogens and usable element metadata.

`composition` sub-fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `partitions` | mapping | — | Named partitions: `{name: "MDAnalysis selection"}` |

---

## `plot_settings`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `output_dir` | path | `"figures/"` | Directory for generated plots (relative to `comparison.yaml`) |
| `format` | string | `"png"` | Image format: `"png"`, `"pdf"`, or `"svg"` |
| `dpi` | int | `300` | Resolution for raster formats. Range: 50–600. |
| `style` | string | `"compact"` | PolyzyMD theme preset: `"compact"`, `"large_elements"`, or `"low_ink"` |
| `color_palette` | string | `"tab10"` | Seaborn/matplotlib color palette name |
| `semantic_colors` | mapping | disabled | Optional condition-label color and display-order rules for condition-series plots |
| `theme` | mapping | from style preset | Visual theme overrides (see below) |

`style` selects a PolyzyMD built-in theme preset for standard analysis plots. It is not a matplotlib or seaborn stylesheet, and it does not control `format`, `dpi`, per-analysis figure sizes, or color palettes. `theme` values are merged on top of the selected preset, so you can choose a base style and override only the fields that need project-specific changes.

### `plot_settings.semantic_colors`

Semantic colors let a comparison project encode condition meaning directly in figures. The settings are optional and disabled by default; when disabled, plots keep using `color_palette` and each plotter's existing category colors.

Semantic ordering is **plot-only**. It changes the display order of conditions in figures, but it does not mutate comparison statistics, rankings, cached artifacts, or JSON result files.

Top-level fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | bool | `false` | Opt in to semantic condition colors and plot ordering |
| `order` | list of string | `[]` | Explicit plot display order by condition label. Labels not present keep their relative order after condition-level `order` sorting. |
| `manual_colors` | mapping | `{}` | Direct color overrides by exact condition label. Highest precedence color rule. |
| `conditions` | mapping | `{}` | Per-condition semantic metadata keyed by exact condition label |
| `families` | mapping | `{}` | Family-level colormap rules keyed by family name |
| `control_color` | color | `"black"` | Color used for the configured `control` condition or a condition with `role: control` |
| `missing_color` | color | `"lightgray"` | Fallback color for conditions with incomplete semantic metadata |
| `default_color` | color or `null` | `null` | Fallback for labels missing from `conditions`. If `null`, the regular palette color is used. |

`conditions.