Analysis Base Classes

API reference for polyzymd.analyses.base, including the plugin base class, context objects, and shared comparison/result models.

Base class and context objects for the PolyzyMD analysis plugin system.

Every analysis in PolyzyMD — RMSF, contacts, distances, etc. — is a single class that inherits from Analysis. The framework discovers these classes automatically (no registry edits) and owns replicate iteration, caching, dependency ordering, and CLI wiring.

How to Add a New Analysis

1. Create src/polyzymd/analyses/<name>/ as a sub-package.
2. Define a Settings model (Pydantic v2 BaseModel) as a class attribute.
3. Subclass Analysis and implement the required methods.
4. Done: the framework discovers the new class via pkgutil.
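The discovery step can be pictured with a small sketch. This is not the code in analyses.discovery; discover_subclasses is a hypothetical helper that only shows how pkgutil can walk a package and collect subclasses of a base class, assuming each plugin exposes a name attribute.

```python
import importlib
import inspect
import pkgutil
from types import ModuleType


def discover_subclasses(package: ModuleType, base: type) -> dict[str, type]:
    """Import every submodule of ``package`` and collect subclasses of ``base``.

    Hypothetical helper for illustration; not the PolyzyMD discovery API.
    """
    found: dict[str, type] = {}
    prefix = package.__name__ + "."
    # walk_packages yields one ModuleInfo per module or sub-package below the package.
    for info in pkgutil.walk_packages(package.__path__, prefix):
        module = importlib.import_module(info.name)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base:
                # Key by the plugin's ``name`` attribute, falling back to the class name.
                found[getattr(obj, "name", obj.__name__)] = obj
    return found
```

Called as discover_subclasses(some_package, SomeBase), the helper returns a mapping from plugin name to class, which is why adding a sub-package needs no registry edit in a scheme like this.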

Required methods:

compute_replicate(ctx, replicate) -> dict | BaseModel
aggregate(ctx, results)           -> dict | BaseModel | None

Optional overrides (sensible defaults provided):

filter_conditions(conditions)     -> list[Condition]
compare(ctx)                      -> ComparisonResult | BaseModel | None
plot(ctx)                         -> list[Path]
format(result, output_format)     -> str
extract_metrics(summary)          -> dict[str, MetricValue]

Notes

The orchestrator auto-saves results returned by compute_replicate() and aggregate() to disk. Simple plugins can rely on this fallback. Plugins that need equilibration-aware caching or custom filenames should save explicitly (see rmsf/ for the pattern).

See also

analyses.stats: Shared statistical utility functions.
analyses.discovery: Automatic plugin discovery.
analyses.orchestrator: Framework engine for running analyses.

class polyzymd.analyses.base.BasePlotSettings[source]

Bases: BaseModel

Base class for per-analysis plot settings.

Each analysis plugin that supports plot customization should subclass this in its _plot_settings.py module and set PlotSettingsModel = MyPlotSettings on its Analysis subclass.

The class is intentionally minimal — it exists only so the framework can enforce a common type for all per-analysis plot settings.

model_config = {}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

class polyzymd.analyses.base.SlurmResourceHint(*, mem=None, time=None, cpus_per_task=None)[source]

Bases: BaseModel

Per-plugin SLURM resource hints for HPC submission.

These values are used as default SLURM resources when users do not pass explicit resource flags on the CLI. Explicit CLI flags always take precedence over plugin hints.

Parameters:

mem (str | None) – Memory request string, for example "16G".
time (str | None) – Walltime string, for example "04:00:00".
cpus_per_task (int | None) – Number of CPUs per task.

mem: str | None

time: str | None

cpus_per_task: int | None

model_config = {}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

class polyzymd.analyses.base.Condition(label, config_path, replicates, sim_config)[source]

Bases: object

A single simulation condition within a comparison.

Mirrors the essential fields of ConditionConfig but is decoupled from the comparison config module, so analyses do not need to import it.

label

Human-readable condition name (e.g. “100% SBMA”).

Type:: str

config_path

Path to this condition’s config.yaml.

Type:: Path

replicates

1-indexed replicate numbers to process.

Type:: tuple[int, …]

sim_config

Loaded simulation configuration.

Type:: SimulationConfig

label: str

config_path: Path

replicates: tuple[int, ...]

sim_config: SimulationConfig

classmethod from_condition_config(cond)[source]

Create from a ConditionConfig (lazy-loads SimulationConfig).

__init__(label, config_path, replicates, sim_config)

class polyzymd.analyses.base.ReplicateContext(condition, replicate, sim_config, output_dir, equilibration, recompute, settings, result_path=None)[source]

Bases: object

Context passed to Analysis.compute_replicate().

Provides everything needed to analyse a single replicate of a single condition.

condition

The condition being analysed.

Type:: Condition

replicate

1-indexed replicate number.

Type:: int

sim_config

Already-loaded simulation configuration.

Type:: SimulationConfig

output_dir

Where to write per-replicate results (<analysis_root>/<condition_label>/<analysis_name>/run_<rep>).

Type:: Path

equilibration

Equilibration time string (e.g. "10ns").

Type:: str

recompute

If True, ignore cached results and recompute.

Type:: bool

settings

Analysis-specific settings (the analysis’s Settings class).

Type:: BaseModel

result_path

Canonical cache path for the per-replicate result. May be None if the plugin is invoked outside the normal orchestrator pipeline.

Type:: Path | None

condition: Condition

replicate: int

sim_config: SimulationConfig

output_dir: Path

equilibration: str

recompute: bool

settings: BaseModel

result_path: Path | None = None

__init__(condition, replicate, sim_config, output_dir, equilibration, recompute, settings, result_path=None)

class polyzymd.analyses.base.AggregateContext(condition, replicates, output_dir, equilibration, settings, result_path=None)[source]

Bases: object

Context passed to Analysis.aggregate().

condition

The condition being aggregated.

Type:: Condition

replicates

Replicate numbers that were successfully computed.

Type:: tuple[int, …]

output_dir

Where to write the aggregated result (<analysis_root>/<condition_label>/<analysis_name>/aggregated).

Type:: Path

equilibration

Equilibration time string.

Type:: str

settings

Analysis-specific settings.

Type:: BaseModel

result_path

Canonical cache path for the aggregated result. May be None if the plugin is invoked outside the normal orchestrator pipeline.

Type:: Path | None

condition: Condition

replicates: tuple[int, ...]

output_dir: Path

equilibration: str

settings: BaseModel

result_path: Path | None = None

__init__(condition, replicates, output_dir, equilibration, settings, result_path=None)
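The two output_dir layouts documented for ReplicateContext and AggregateContext share a common prefix. The helpers below are hypothetical and only spell out that layout with pathlib; whether the framework sanitizes condition labels before using them as directory names is not stated here, so the example uses a label that is already filesystem-safe.

```python
from pathlib import Path


def analysis_dir(analysis_root: Path, condition_label: str, analysis_name: str) -> Path:
    """<analysis_root>/<condition_label>/<analysis_name>"""
    return Path(analysis_root) / condition_label / analysis_name


def replicate_output_dir(
    analysis_root: Path, condition_label: str, analysis_name: str, rep: int
) -> Path:
    """Per-replicate directory, using the 1-indexed replicate number."""
    return analysis_dir(analysis_root, condition_label, analysis_name) / f"run_{rep}"


def aggregated_output_dir(
    analysis_root: Path, condition_label: str, analysis_name: str
) -> Path:
    """Directory for the result aggregated across replicates."""
    return analysis_dir(analysis_root, condition_label, analysis_name) / "aggregated"


run_dir = replicate_output_dir(Path("analysis"), "sbma_100", "rmsf", 2)
agg_dir = aggregated_output_dir(Path("analysis"), "sbma_100", "rmsf")
```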

class polyzymd.analyses.base.ComparisonContext(name, conditions, excluded_conditions, control_label, analysis_dirs, results_dir, equilibration, settings, recompute, fdr_alpha=0.05, ttest_method='student', posthoc_method='ttest_bh', result_path=None, failed_conditions=<factory>, aggregated_results=<factory>)[source]

Bases: object

Context passed to Analysis.compare().

Provides all conditions, their analysis directories, and the comparison-level configuration.

name

Comparison project name (from comparison.yaml).

Type:: str

conditions

Conditions that passed filter_conditions().

Type:: list[Condition]

excluded_conditions

Conditions removed by filter_conditions().

Type:: list[Condition]

failed_conditions

Conditions that were valid but failed during compute/aggregate (e.g., insufficient replicates). Empty by default.

Type:: list[Condition]

control_label

Label of the control condition (None if not specified or if the control was excluded).

Type:: str | None

analysis_dirs

Mapping condition_label -> analysis_dir (contains run_N/ and aggregated/).

Type:: dict[str, Path]

results_dir

Analysis-specific comparison directory.

Type:: Path

equilibration

Equilibration time string.

Type:: str

settings

Analysis-specific settings.

Type:: BaseModel

fdr_alpha

Significance threshold for pairwise tests and ANOVA. Used as the BH false-discovery-rate threshold when posthoc_method is "ttest_bh" and as the family-wise significance threshold when posthoc_method is "tukey_hsd".

Type:: float

ttest_method

Two-sample t-test method for default scalar pairwise tests. "student" uses equal variances and "welch" does not.

Type:: str

posthoc_method

Post-hoc testing method for default scalar pairwise tests. "ttest_bh" applies pairwise t-tests with BH correction and "tukey_hsd" applies Tukey HSD across all groups.

Type:: str

recompute

Whether to force recomputation.

Type:: bool

result_path

Canonical cache path for the comparison result.

Type:: Path | None

aggregated_results

Mapping condition_label -> aggregated result for conditions that succeeded. Plugins can use this instead of re-loading from disk.

Type:: dict[str, Any]

name: str

conditions: list[Condition]

excluded_conditions: list[Condition]

control_label: str | None

analysis_dirs: dict[str, Path]

results_dir: Path

equilibration: str

settings: BaseModel

recompute: bool

fdr_alpha: float = 0.05

ttest_method: str = 'student'

posthoc_method: str = 'ttest_bh'

result_path: Path | None = None

failed_conditions: list[Condition]

aggregated_results: dict[str, Any]

property effective_control: str | None: Return control label if the control was not excluded.

__init__(name, conditions, excluded_conditions, control_label, analysis_dirs, results_dir, equilibration, settings, recompute, fdr_alpha=0.05, ttest_method='student', posthoc_method='ttest_bh', result_path=None, failed_conditions=<factory>, aggregated_results=<factory>)
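The difference between the two ttest_method values can be made concrete with the textbook formulas. This sketch uses only the standard library and is not the framework's implementation; in particular, the sign convention (first group minus second) is an assumption.

```python
import math
from statistics import mean, variance


def t_statistic(
    a: list[float], b: list[float], method: str = "student"
) -> tuple[float, float]:
    """Return (t, degrees of freedom) for a two-sample t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (ddof=1)
    diff = mean(a) - mean(b)
    if method == "student":
        # Pooled variance assumes both groups share one variance.
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        return diff / math.sqrt(pooled * (1 / na + 1 / nb)), float(na + nb - 2)
    if method == "welch":
        sa, sb = va / na, vb / nb
        # Welch-Satterthwaite approximation for the degrees of freedom.
        df = (sa + sb) ** 2 / (sa**2 / (na - 1) + sb**2 / (nb - 1))
        return diff / math.sqrt(sa + sb), df
    raise ValueError(f"unknown ttest_method: {method!r}")
```

With equal group sizes the two methods give the same t value and differ only in the degrees of freedom, which is what changes the p-value.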

class polyzymd.analyses.base.PlotContext(conditions, analysis_dirs, results_dir, output_dir, settings, plot_settings=<factory>, comparison_path=None, control_label=None, equilibration='0ns')[source]

Bases: object

Context passed to Analysis.plot().

conditions

All conditions included in the comparison.

Type:: list[Condition]

analysis_dirs

Mapping condition_label -> analysis_dir.

Type:: dict[str, Path]

results_dir

Where comparison result JSONs live.

Type:: Path

output_dir

Where to write figures.

Type:: Path

settings

Analysis-specific settings.

Type:: BaseModel

plot_settings

Global plot settings (theme, DPI, format, etc.). The framework guarantees this is never None — a PlotSettings() default is provided when the comparison config has no plot_settings: section. Plugins can access this directly without None guards.

Type:: PlotSettings

comparison_path

Canonical comparison result path for this analysis.

Type:: Path | None

control_label

Label of the control condition, or None if not specified / excluded. Mirrors ComparisonContext.control_label.

Type:: str | None

equilibration

Equilibration time string used for equilibration-aware cache filenames in plot helpers.

Type:: str

Notes

PlotContext does not carry pre-loaded aggregated results. Use Analysis._build_plot_data() to collect per-condition paths, then Analysis._load_aggregated_result() to load each result:

def plot(self, ctx: PlotContext) -> list[Path]:
    data, labels = self._build_plot_data(ctx)
    paths: list[Path] = []
    for label in labels:
        agg_dir = data[label]["aggregated_dir"]
        summary = self._load_aggregated_result(agg_dir)
        # ... plot data from summary and append each saved figure path to paths ...
    return paths

conditions: list[Condition]

analysis_dirs: dict[str, Path]

results_dir: Path

output_dir: Path

settings: BaseModel

plot_settings: PlotSettings

comparison_path: Path | None = None

control_label: str | None = None

equilibration: str = '0ns'

__post_init__()[source]

Ensure plot settings is always materialized for plugins.

__init__(conditions, analysis_dirs, results_dir, output_dir, settings, plot_settings=<factory>, comparison_path=None, control_label=None, equilibration='0ns')

class polyzymd.analyses.base.MetricValue(name, mean, sem, replicate_values, higher_is_better=True, direction_labels=('decreased', 'unchanged', 'increased'))[source]

Bases: object

A single scalar metric extracted from a condition summary.

Used by the default Analysis.compare() implementation. If your analysis overrides compare() entirely, you don’t need this.

name

Metric identifier (e.g. "mean_rmsf", "coverage").

Type:: str

mean

Mean value across replicates.

Type:: float

sem

Standard error of the mean.

Type:: float

replicate_values

Per-replicate values (for t-tests / ANOVA).

Type:: list[float]

higher_is_better

If True, higher values rank first. If False, lower values rank first (e.g. RMSF). If None, no universal quality direction is assumed and conditions are ranked by descending mean value for neutral display.

Type:: bool | None

direction_labels

(negative_label, unchanged_label, positive_label) for interpreting percent-change direction. Defaults to ("decreased", "unchanged", "increased").

Type:: tuple[str, str, str]

name: str

mean: float

sem: float

replicate_values: list[float]

higher_is_better: bool | None = True

direction_labels: tuple[str, str, str] = ('decreased', 'unchanged', 'increased')

__init__(name, mean, sem, replicate_values, higher_is_better=True, direction_labels=('decreased', 'unchanged', 'increased'))

class polyzymd.analyses.base.ConditionSummary(*, label, n_replicates=0, **extra_data)[source]

Bases: BaseModel

Summary statistics for one condition in a scalar comparison.

For simple scalar analyses (RMSF, catalytic_triad, secondary_structure), dynamic <metric>_mean, <metric>_sem, and <metric>_replicate_values fields are added via model_extra.

label

Condition display name.

Type:: str

n_replicates

Number of replicates included.

Type:: int

model_config = {'extra': 'allow'}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

label: str

n_replicates: int
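The dynamic field naming for ConditionSummary can be sketched with a helper that builds the three keys for one metric. scalar_summary_fields is hypothetical, and computing the SEM as the sample standard deviation over the square root of n is an assumption about how the values are derived.

```python
import math
from statistics import mean, stdev


def scalar_summary_fields(metric: str, values: list[float]) -> dict[str, object]:
    """Build the three dynamic field names for one scalar metric."""
    n = len(values)
    sem = stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return {
        f"{metric}_mean": mean(values),
        f"{metric}_sem": sem,
        f"{metric}_replicate_values": list(values),
    }


extra = scalar_summary_fields("mean_rmsf", [1.0, 2.0, 3.0])
summary_kwargs = {"label": "control", "n_replicates": 3, **extra}
```

Because the model is configured with extra set to "allow", keyword arguments like these that are not declared fields end up in model_extra when passed to the constructor.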

class polyzymd.analyses.base.PairwiseResult(*, condition_a, condition_b, metric='default', t_statistic, p_value, p_value_adjusted=None, posthoc_method='ttest_bh', cohens_d, effect_size_interpretation, direction, significant, percent_change)[source]

Bases: BaseModel

Statistical comparison between two conditions for one metric.

condition_a

Label of first condition (typically control/reference).

Type:: str

condition_b

Label of second condition (typically treatment).

Type:: str

metric

Name of the metric being compared.

Type:: str

t_statistic

T-test statistic.

Type:: float

p_value

Two-tailed p-value.

Type:: float

p_value_adjusted

Multiplicity-corrected p-value. For "ttest_bh" this is the Benjamini-Hochberg adjusted value; for "tukey_hsd" this mirrors the Tukey family-wise p-value (already corrected). None only for legacy payloads missing this field.

Type:: float | None

posthoc_method

Post-hoc method used to generate this pairwise p-value.

Type:: str

cohens_d

Effect size (Cohen’s d).

Type:: float

effect_size_interpretation

"negligible", "small", "medium", or "large".

Type:: str

direction

Interpretation of change (e.g. "stabilizing").

Type:: str

significant

Whether the comparison is significant. Uses adjusted p-value when available, otherwise raw p-value.

Type:: bool

percent_change

Percent change from condition_a to condition_b.

Type:: float

model_config = {'ser_json_inf_nan': 'strings'}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

condition_a: str

condition_b: str

metric: str

t_statistic: float

p_value: float

p_value_adjusted: float | None

posthoc_method: str

cohens_d: float

effect_size_interpretation: str

direction: str

significant: bool

percent_change: float
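For the "ttest_bh" method, p_value_adjusted is described as the Benjamini-Hochberg adjusted value. The standard step-up adjustment is sketched below in plain Python; it is not taken from the PolyzyMD source, and comparing with "less than or equal" against fdr_alpha mirrors the ANOVA description rather than a documented rule for pairwise tests.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]  # 0.05 plays the role of fdr_alpha
```

In this example the raw p-values 0.03 and 0.04 would pass an unadjusted 0.05 cutoff, but their adjusted values do not, which is the reason the significant flag prefers the adjusted value when it is available.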

class polyzymd.analyses.base.ANOVAResult(*, metric='default', f_statistic, p_value, significant)[source]

Bases: BaseModel

One-way ANOVA result for one metric.

metric

Name of the metric tested.

Type:: str

f_statistic

F-statistic from ANOVA.

Type:: float

p_value

P-value for the test.

Type:: float

significant

Whether p_value is less than or equal to the configured significance threshold.

Type:: bool

metric: str

f_statistic: float

p_value: float

significant: bool

model_config = {}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.
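The f_statistic field holds the usual one-way ANOVA ratio of between-group to within-group mean squares. The sketch below computes only that ratio with the standard library, as an illustration rather than the framework's code; obtaining p_value additionally needs the F distribution, which a routine such as scipy.stats.f_oneway provides.

```python
from statistics import mean


def anova_f_statistic(groups: list[list[float]]) -> float:
    """One-way ANOVA F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Three conditions with three replicate values each.
f_value = anova_f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```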

class polyzymd.analyses.base.ComparisonResult(*, analysis_type, name, control_label=None, fdr_alpha=None, ttest_method='student', posthoc_method='ttest_bh', conditions=<factory>, pairwise_comparisons=<factory>, anova=None, ranking=<factory>, rankings_by_metric=None, equilibration_time='0ns', created_at='', polyzymd_version='')[source]

Bases: BaseModel

Serializable result of a cross-condition comparison.

This is the universal comparison output model. The default Analysis.compare() returns an instance of this class. Complex analyses (contacts, distances, exposure, BFE, polymer_affinity) may return their own typed Pydantic models — as long as those models have a .save() method, the framework handles them identically.

The CLI calls result.save(path) and analysis.format(result) for every comparison, so all result objects must support these two operations.

analysis_type

Analysis identifier (e.g. "rmsf").

Type:: str

name

Comparison project name.

Type:: str

control_label

Control condition label.

Type:: str | None

fdr_alpha

Significance threshold for pairwise tests and ANOVA. Used as the BH false-discovery-rate threshold ("ttest_bh") or the Tukey family-wise threshold ("tukey_hsd"). None when unknown (legacy payloads).

Type:: float | None

ttest_method

Two-sample t-test method used for pairwise tests.

Type:: str

posthoc_method

Post-hoc testing method used for pairwise tests.

Type:: str

conditions

Per-condition summary statistics.

Type:: list[ConditionSummary]

pairwise_comparisons

Pairwise statistical tests.

Type:: list[PairwiseResult]

anova

ANOVA results (None if there are fewer than 3 conditions).

Type:: list[ANOVAResult] | None

ranking

Condition labels ranked by primary metric (best first).

Type:: list[str]

rankings_by_metric

Per-metric rankings for multi-metric analyses.

Type:: dict[str, list[str]] | None

equilibration_time

Equilibration time used.

Type:: str

created_at

ISO 8601 timestamp.

Type:: str

polyzymd_version

PolyzyMD version string.

Type:: str

model_config = {'ser_json_inf_nan': 'strings'}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

analysis_type: str

name: str

control_label: str | None

fdr_alpha: float | None

ttest_method: str

posthoc_method: str

conditions: list[ConditionSummary]

pairwise_comparisons: list[PairwiseResult]

anova: list[ANOVAResult] | None

ranking: list[str]

rankings_by_metric: dict[str, list[str]] | None

equilibration_time: str

created_at: str

polyzymd_version: str

save(path)[source]

Save result to JSON file.

Parameters:: path (Path or str) – Output path.
Returns:: Path to saved file.
Return type:: Path

classmethod load(path)[source]

Load result from JSON file.

Parameters:: path (Path or str) – Path to JSON file.
Returns:: Loaded result.
Return type:: Self
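The ser_json_inf_nan setting matters because statistics such as percent_change or t_statistic can be infinite or NaN, which strict JSON cannot represent as numbers. The sketch below imitates that behaviour with the standard library only; the helper names are hypothetical, and the exact string spellings ("Infinity", "-Infinity", "NaN") are an assumption about what Pydantic's "strings" mode emits.

```python
import json
import math
from pathlib import Path


def encode_non_finite(value):
    """Recursively replace inf/nan floats with string spellings."""
    if isinstance(value, float) and not math.isfinite(value):
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    if isinstance(value, dict):
        return {k: encode_non_finite(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [encode_non_finite(v) for v in value]
    return value


def save_json(payload: dict, path: Path) -> Path:
    """Write payload as strict JSON and return the path, like save(path)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # allow_nan=False makes json.dumps raise if a non-finite float slipped through.
    path.write_text(json.dumps(encode_non_finite(payload), indent=2, allow_nan=False))
    return path
```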

class polyzymd.analyses.base.BaseConditionSummary(*, label, config_path, n_replicates, replicate_values)[source]

Bases: BaseModel, ABC

Abstract base class for condition-level custom comparison summaries.

label

Display name for this condition

Type:: str

config_path

Path to the simulation config file

Type:: str

n_replicates

Number of replicates included

Type:: int

replicate_values

Per-replicate values of the primary metric

Type:: list[float]

label: str

config_path: str

n_replicates: int

replicate_values: list[float]

abstract property primary_metric_value: float: Return the primary metric value for ranking and comparison.

abstract property primary_metric_sem: float: Return the SEM of the primary metric.

model_config = {}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.
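A concrete summary has to supply the two abstract properties. The sketch below uses dataclasses instead of Pydantic so that it runs without dependencies; SummaryBase and RgSummary are stand-ins, and using the replicate mean as the primary metric is just one plausible choice.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class SummaryBase(ABC):
    """Stand-in for BaseConditionSummary (the real class is a Pydantic model)."""

    label: str
    config_path: str
    n_replicates: int
    replicate_values: list[float]

    @property
    @abstractmethod
    def primary_metric_value(self) -> float: ...

    @property
    @abstractmethod
    def primary_metric_sem(self) -> float: ...


@dataclass
class RgSummary(SummaryBase):
    """Hypothetical summary whose primary metric is the replicate mean."""

    @property
    def primary_metric_value(self) -> float:
        return mean(self.replicate_values)

    @property
    def primary_metric_sem(self) -> float:
        n = len(self.replicate_values)
        return stdev(self.replicate_values) / math.sqrt(n) if n > 1 else 0.0


summary = RgSummary("control", "config.yaml", 3, [1.0, 2.0, 3.0])
```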

class polyzymd.analyses.base.BaseComparisonResult(*, metric, name, control_label=None, conditions, pairwise_comparisons, anova=None, ranking, equilibration_time, created_at, polyzymd_version)[source]

Bases: BaseModel, ABC, Generic[TConditionSummary, TPairwiseResult]

Abstract base class for custom plugin comparison results.

metric

The primary metric being compared

Type:: str

name

Name of the comparison project

Type:: str

control_label

Label of the control condition

Type:: str | None

conditions

Condition summaries

Type:: list[TConditionSummary]

pairwise_comparisons

Pairwise statistical comparisons

Type:: list[TPairwiseResult]

anova

ANOVA result(s)

Type:: ANOVAResult | list[ANOVAResult] | None

ranking

Condition labels ranked by primary metric

Type:: list[str]

equilibration_time

Equilibration time used

Type:: str

created_at

Timestamp for result generation

Type:: datetime

polyzymd_version

PolyzyMD version used

Type:: str

model_config = {'ser_json_inf_nan': 'strings'}: Pydantic model configuration, a dictionary conforming to pydantic.config.ConfigDict.

comparison_type: ClassVar[str] = 'base'

metric: str

name: str

control_label: str | None

conditions: list[TConditionSummary]

pairwise_comparisons: list[TPairwiseResult]

anova: ANOVAResult | list[ANOVAResult] | None

ranking: list[str]

equilibration_time: str

created_at: datetime

polyzymd_version: str

save(path)[source]

Save result to JSON file.

Parameters:: path (Path | str) – Output path
Returns:: Path to saved file
Return type:: Path

classmethod load(path)[source]

Load result from JSON file.

Parameters:: path (Path | str) – Path to JSON file
Returns:: Loaded result
Return type:: Self

get_condition(label)[source]

Get a condition by label.

Parameters:: label (str) – Condition label
Returns:: The matching condition summary
Return type:: TConditionSummary
Raises:: KeyError – If condition not found

get_comparison(label)[source]

Get a pairwise comparison by condition pair.

Parameters:: label (str | tuple[str, str]) – Comparison key. A tuple (condition_a, condition_b) performs an exact pair lookup; a single condition_b string performs a legacy lookup by treatment label only.
Returns:: The comparison, or None if not found
Return type:: TPairwiseResult | None

Notes

Legacy lookup by condition_b can be ambiguous for all-vs-all comparisons. Prefer tuple lookup for unambiguous retrieval.

class polyzymd.analyses.base.Analysis[source]

Bases: ABC

Base class for all PolyzyMD analyses.

Subclasses represent a complete analysis lifecycle: per-replicate computation, aggregation across replicates, cross-condition comparison, plotting, and CLI formatting.

Class Variables

name: str

Unique identifier used in config files and CLI (e.g. "rmsf").

Settings: type[BaseModel]

Pydantic model for this analysis’s settings.

PlotSettingsModel: type[BasePlotSettings] | None

Optional per-analysis plot settings model. When set, the comparison configuration loader parses plot_settings.<name> using this model and provides default-constructed values on attribute access when omitted in YAML. Defaults to None.

AggregatedResultClass: type[BaseModel] | None

Optional Pydantic model class for aggregated results. When set, the default _deserialize_result() uses this class’s .load(path) method (if available) or .model_validate_json() to load aggregated results from disk. When None (the default), aggregated results are loaded as plain dicts via json.loads().

Setting this class variable replaces the need to override _deserialize_result() in most cases.

Example:

from polyzymd.analyses.rmsf._results import RMSFAggregatedResult

class RMSFAnalysis(Analysis):
    name = "rmsf"
    AggregatedResultClass = RMSFAggregatedResult
    ...

aliases: tuple[str, …]

Alternative CLI names (e.g. ("triad",) for catalytic_triad).

dependencies: tuple[str, …]

Names of analyses that must run before this one (topological sort).

min_replicates: int

Minimum successful replicates required for aggregation.

has_compute_stage: bool

Whether the framework should run compute_replicate().

has_aggregate_stage: bool

Whether the framework should run aggregate().

slurm_resource_hint: SlurmResourceHint | None

Optional per-plugin SLURM resource defaults for HPC submission.

settings_path_fields: tuple[str, …]

Settings field names that contain filesystem paths to resolve relative to comparison.yaml.
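The dependencies class variable implies a topological ordering of analyses. A minimal sketch with the standard library's graphlib is shown below; execution_order is a hypothetical helper, the dependency of polymer_affinity on contacts is invented for the example, and the orchestrator's real ordering code may differ.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies: dict[str, tuple[str, ...]]) -> list[str]:
    """Order analysis names so that each runs after its dependencies."""
    # TopologicalSorter takes a mapping of node -> predecessors.
    sorter = TopologicalSorter(dependencies)
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(sorter.static_order())


deps = {
    "rmsf": (),
    "contacts": (),
    "polymer_affinity": ("contacts",),  # hypothetical dependency for illustration
}
order = execution_order(deps)
```

Any order returned this way places contacts before polymer_affinity, while analyses without dependencies may appear in any relative position.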

Examples

Simple plugin using the default comparison pipeline (t-tests, ANOVA, ranking). Implement extract_metrics() — the framework deserializes aggregated results automatically via json.loads():

from polyzymd.analyses.base import (
    AggregateContext, Analysis, MetricValue, ReplicateContext,
)
from pydantic import BaseModel

class RgAnalysis(Analysis):
    name = "rg"

    class Settings(BaseModel):
        selection: str = "protein and name CA"

    def compute_replicate(self, ctx, replicate):
        import MDAnalysis as mda
        import numpy as np
        # Use ctx.sim_config, ctx.settings — never load configs yourself
        ...
        return {"mean_rg": float(np.mean(rg_values)), "replicate": replicate}

    def aggregate(self, ctx, results):
        import numpy as np
        values = [r["mean_rg"] for r in results]
        return {"mean_rg": float(np.mean(values)),
                "sem_rg": float(np.std(values, ddof=1) / np.sqrt(len(values))),
                "replicate_values": values}

    def extract_metrics(self, summary):
        return {"mean_rg": MetricValue(
            name="mean_rg", mean=summary["mean_rg"],
            sem=summary["sem_rg"],
            replicate_values=summary["replicate_values"],
            higher_is_better=False,
            direction_labels=("compacting", "unchanged", "expanding"),
        )}

If your aggregated results use a typed Pydantic model, set AggregatedResultClass to have the framework deserialize into that model automatically instead of returning a plain dict:

class MyAnalysis(Analysis):
    name = "my_analysis"
    AggregatedResultClass = MyAggregatedResult  # your Pydantic model
    ...  # framework auto-deserializes via .load() or model_validate_json()

Custom compare plugin — override compare() entirely for multi-metric or entry-table analyses. See analyses/contacts/ or analyses/distances/ for full examples.