Architecture

This page explains how PolyzyMD is organized, why the major subsystems are separated, and which boundaries matter when extending the project. It is a conceptual map, not a step-by-step contributor tutorial.

The high-level shape of the project

PolyzyMD follows the lifecycle of an enzyme-polymer molecular dynamics study:

load and validate configuration
build a molecular system
run simulation workflows locally or through SLURM
analyze trajectories into durable artifacts
compare conditions and create plots or reports

That lifecycle is reflected in the active package layout:

src/polyzymd/
├── analyses/      # artifact-native analysis and comparison plugin system
├── builders/      # molecular system construction
├── cli/           # command-line entry points
├── config/        # simulation and comparison configuration
├── core/          # shared domain types
├── data/          # bundled package data
├── engines/       # engine-specific integration layer
├── exporters/     # output format exporters
├── simulation/    # local simulation execution
├── templates/     # packaged example/scaffold templates
├── utils/         # general utilities
└── workflow/      # orchestration and SLURM support

Older analysis/ and compare/ directories may still exist in the source tree, but they are not the primary architecture for new analysis or comparison work. Current analysis and comparison behavior is concentrated in analyses/, config/comparison.py, and cli/compare.py.

Why the code is split this way

The main boundaries separate three phases: defining a study, running simulations, and interpreting results. Keeping these responsibilities apart lets users and contributors change one phase without accidentally coupling it to another.

Configuration describes intent

config/ holds schema and loading logic for YAML configuration, including comparison configuration. It validates what a study should do before lower-level builders or analysis plugins act on it.
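As a rough illustration of "validate intent before anything acts on it", the sketch below turns a parsed YAML mapping into a typed object or fails early. The `StudyConfig` class, its three fields, and `validate_config` are hypothetical names invented for this example; they are not PolyzyMD's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyConfig:
    """Hypothetical, heavily simplified stand-in for a validated config object."""

    name: str
    temperature_kelvin: float
    replicates: int


REQUIRED_KEYS = {"name", "temperature_kelvin", "replicates"}


def validate_config(raw: dict) -> StudyConfig:
    """Check a parsed YAML mapping and return a typed config, or raise ValueError."""
    missing = REQUIRED_KEYS - set(raw)
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")

    temperature = float(raw["temperature_kelvin"])
    replicates = int(raw["replicates"])
    if temperature <= 0:
        raise ValueError("temperature_kelvin must be positive")
    if replicates < 1:
        raise ValueError("replicates must be at least 1")
    return StudyConfig(str(raw["name"]), temperature, replicates)


# Builders and analysis plugins would only ever receive the validated object.
config = validate_config({"name": "demo", "temperature_kelvin": 300, "replicates": 3})
```

The point of the pattern is that a malformed study fails at load time, before any expensive build or simulation step starts.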

Builders create simulation-ready systems

builders/ turns input structures into simulation-ready molecular systems by assembling enzyme, substrate, polymer, solvent, and related components. The builder layer stays focused on construction; it does not own policy for long-running jobs or for analysis.

Simulation and workflow execute the study

simulation/ runs local minimization, equilibration, checkpointing, continuation, and production segments. workflow/ handles orchestration around those runs, especially SLURM job generation, resubmission, and recovery flows.

engines/ isolates engine-specific integration details such as OpenMM or GROMACS support. This keeps high-level workflows from depending directly on one engine’s file formats or object model.
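One common way to get this kind of isolation is to have workflows depend on a small interface rather than on a concrete engine. The sketch below assumes that approach; the `Engine` protocol, `run_segment`, and both fake engine classes are hypothetical and do not describe the real contents of engines/.

```python
from typing import List, Protocol


class Engine(Protocol):
    """Hypothetical interface a workflow could depend on instead of a concrete engine."""

    name: str

    def run_segment(self, n_steps: int) -> str:
        ...


class FakeOpenMMEngine:
    """Stand-in only: a real adapter would call into OpenMM here."""

    name = "openmm"

    def run_segment(self, n_steps: int) -> str:
        return f"{self.name}: ran {n_steps} steps"


class FakeGromacsEngine:
    """Stand-in only: a real adapter would write and launch GROMACS inputs here."""

    name = "gromacs"

    def run_segment(self, n_steps: int) -> str:
        return f"{self.name}: ran {n_steps} steps"


def run_production(engine: Engine, n_segments: int, steps_per_segment: int) -> List[str]:
    """Workflow code sees only the interface, never engine file formats or objects."""
    return [engine.run_segment(steps_per_segment) for _ in range(n_segments)]


log = run_production(FakeGromacsEngine(), n_segments=2, steps_per_segment=10)
```

Swapping `FakeGromacsEngine()` for `FakeOpenMMEngine()` changes nothing in `run_production`, which is the property the engine layer is meant to protect.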

Analyses interpret completed trajectories

analyses/ is the current analysis and comparison architecture. It is both a plugin system and an artifact lifecycle. Plugins measure trajectories, produce replicate artifacts, aggregate those artifacts per condition, compare conditions, and optionally plot or format results.

This design keeps trajectory processing separate from ensemble interpretation: MDAnalysis handles per-trajectory analysis idioms, while PolyzyMD handles study structure, artifact identity, aggregation, comparison, and CLI integration.

The current `analyses/` boundary

The analysis package is split by public surface and private implementation:

src/polyzymd/analyses/
├── base.py          # stable contributor facade: Analysis, contexts, metrics
├── discovery.py     # plugin auto-discovery
├── orchestrator.py  # comparison workflow orchestration facade
├── stats.py         # default scalar comparison pipeline
├── mda/             # public MDAnalysis extension layer
├── shared/          # reusable utilities shared by plugins
├── _framework/      # private/internal lifecycle, I/O, and contracts
└── <plugin>/        # built-in and contributed analysis plugins

The important public/private boundary is intentional:

polyzymd.analyses.base is the stable contributor facade for Analysis, lifecycle contexts, metrics, and comparison result models.
polyzymd.analyses.mda is the public MDAnalysis extension layer for jobs, frame selection, artifacts, artifact storage, aggregation, and Universe handling.
_framework/ modules are private implementation details behind the public facade.
Plugin helper modules named _*.py are private to their plugin package unless that plugin explicitly documents them as public.

Contributors should not import from _framework/ or rely on another plugin’s private helper modules. The stable import surface is deliberately narrower than the internal package layout so PolyzyMD can evolve lifecycle internals without breaking plugins.

How the MDAnalysis lifecycle is divided

PolyzyMD and MDAnalysis share responsibility during trajectory analysis, but not at the same layer.

PolyzyMD resolves topology and trajectory paths from the study context, applies frame-selection policy, and provides or caches loaded MDAnalysis Universe objects through the MDA lifecycle. It also owns replicate discovery, cache identity, artifact storage, aggregation, cross-condition comparison, and CLI output.

MDAnalysis owns the per-trajectory analysis idioms: selecting atoms, iterating frames, running AnalysisBase-compatible work, and producing MDAnalysis-style Results objects. PolyzyMD collectors then translate completed MDAnalysis jobs into project-level artifacts.

Conceptually, the current analysis flow is:

config -> builders -> simulation/workflow -> analyses/artifacts -> comparison -> plots

Within analyses/, that becomes:

MDA jobs
  -> MDAnalysis work and Results
  -> collectors
  -> ReplicateArtifact objects
  -> condition artifacts
  -> comparison artifacts or documented custom outputs
  -> plots and formatted CLI output

Most trajectory-native plugins implement build_mda_jobs() and usually provide a collector so completed jobs become ReplicateArtifact objects. Those replicate artifacts aggregate into per-condition artifacts, which then feed default comparison artifacts or a plugin’s documented custom comparison output. Plots should read cached artifacts and sidecars rather than reloading trajectories or rerunning compute-stage analysis.
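The replicate-to-condition-to-comparison chain described above can be sketched with plain dataclasses. This is a toy model under stated assumptions: `ReplicateValue`, `ConditionSummary`, `aggregate`, and `compare_to_reference` are hypothetical names, each artifact carries a single scalar, and nothing is written to disk. The real ReplicateArtifact objects and their storage are richer than this.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, Iterable


@dataclass(frozen=True)
class ReplicateValue:
    """Hypothetical per-replicate result: one scalar from one trajectory."""

    condition: str
    replicate: int
    value: float


@dataclass(frozen=True)
class ConditionSummary:
    """Hypothetical per-condition aggregate built only from replicate results."""

    condition: str
    n: int
    mean: float
    sd: float


def aggregate(replicates: Iterable[ReplicateValue]) -> Dict[str, ConditionSummary]:
    """Group replicate values by condition and summarize each group."""
    grouped: Dict[str, list] = {}
    for rep in replicates:
        grouped.setdefault(rep.condition, []).append(rep.value)
    return {
        cond: ConditionSummary(cond, len(vals), mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for cond, vals in grouped.items()
    }


def compare_to_reference(summaries: Dict[str, ConditionSummary], reference: str) -> Dict[str, float]:
    """Difference in condition means relative to a reference condition."""
    ref_mean = summaries[reference].mean
    return {c: s.mean - ref_mean for c, s in summaries.items() if c != reference}


replicates = [
    ReplicateValue("control", 1, 1.0),
    ReplicateValue("control", 2, 2.0),
    ReplicateValue("control", 3, 3.0),
    ReplicateValue("polymer", 1, 2.0),
    ReplicateValue("polymer", 2, 3.0),
    ReplicateValue("polymer", 3, 4.0),
]
summaries = aggregate(replicates)
differences = compare_to_reference(summaries, reference="control")
```

Note that `aggregate` and `compare_to_reference` never touch a trajectory. Each stage consumes only the output of the previous one, which is what allows later stages to be rerun from cached results.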

For concrete commands and code examples, see Extend PolyzyMD with MDAnalysis-native analyses.

Comparison infrastructure is distributed

There is no separate active comparison stack that new plugins should target. Comparison behavior is distributed across focused modules:

config/comparison.py describes comparison and plotting settings.
cli/compare.py exposes the polyzymd compare command group.
analyses/stats.py implements the default scalar comparison pipeline, including summaries, rankings, pairwise comparisons, and formatting helpers.
analyses/shared/inferential_statistics.py provides lower-level statistical primitives such as t-tests, ANOVA, and effect sizes.
analyses/mda/ provides the artifact layer that carries replicate, condition, and comparison data through the lifecycle.

This distribution keeps comparison close to the artifacts it consumes while still allowing the CLI and configuration layers to remain stable entry points.
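To make "lower-level statistical primitives" concrete, here is a standard-library sketch of two textbook quantities: Cohen's d with a pooled standard deviation, and Welch's t statistic. The function names are hypothetical and the code is not taken from analyses/shared/inferential_statistics.py; it only shows the kind of calculation such a module would encapsulate.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation.

    Requires at least two values per group and nonzero pooled variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic, which does not assume equal group variances.

    Returns only the statistic; a p-value would also need the
    Welch-Satterthwaite degrees of freedom and a t distribution.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Per-replicate scalars for two conditions (made-up numbers).
polymer = [2.0, 3.0, 4.0]
control = [1.0, 2.0, 3.0]
effect_size = cohens_d(polymer, control)
t_statistic = welch_t(polymer, control)
```

Both functions operate on per-replicate scalars, so they fit naturally downstream of cached replicate artifacts rather than raw trajectories.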

Supporting packages

`core/` and `utils/`

core/ and utils/ provide shared infrastructure such as common types, experimental workflow labeling, and helper functionality that should not be duplicated across the package.

`data/` and `templates/`

data/ stores bundled package resources such as force-field or template data that need to ship with PolyzyMD. It is not a user results directory.

templates/ contains packaged templates and examples used by scaffolding and setup flows. These are starting points for generated files, not the authoritative runtime schema.

`exporters/`

exporters/ contains format-export support for moving PolyzyMD outputs into other molecular simulation ecosystems. Exporters sit at the edge of the package rather than in the core build or simulation lifecycle.

How data moves through the system

At a conceptual level, data moves from declared intent to generated evidence:

config.yaml
  -> validated config objects
  -> system builders
  -> simulation objects and run directories
  -> local or SLURM execution
  -> ReplicateArtifact files and sidecars
  -> condition artifacts
  -> comparison artifacts or documented custom outputs
  -> plots and reports

This separation is intentional:

users can stop after building or running
analysis can be repeated without rebuilding simulations
comparison workflows can reuse cached analysis outputs
plotting can be rerun without recomputing statistics
plugin internals can change while public artifact and facade contracts remain stable

Design patterns you will encounter

Lazy imports for heavy dependencies

Modules that depend on OpenMM, OpenFF, MDAnalysis, or other heavy scientific packages often import those packages inside functions or methods instead of at module import time. This keeps lightweight CLI and documentation operations usable even when optional heavy dependencies are absent.
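A minimal sketch of the lazy-import pattern follows. The helper `require` and the functions `describe_version` and `load_universe` are hypothetical names for illustration; only `importlib.import_module` and `MDAnalysis.Universe(topology, trajectory)` are real library calls.

```python
import importlib


def require(module_name: str, purpose: str):
    """Import a dependency on first use and fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(f"{module_name} is required for {purpose}") from exc


def describe_version() -> str:
    """Lightweight operation: no heavy dependency is imported to run this."""
    return "demo 0.0"


def load_universe(topology: str, trajectory: str):
    """Heavy operation: MDAnalysis is imported only when this function is called."""
    mda = require("MDAnalysis", "trajectory analysis")
    return mda.Universe(topology, trajectory)
```

Because the import lives inside `load_universe`, importing the module that defines it stays cheap, and `describe_version()` works on a machine without MDAnalysis installed.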

Plugin-based extension points

Analysis is the primary extensibility axis. A plugin is a single module or a package under analyses/ that defines a subclass of Analysis. The framework discovers both shapes automatically via pkgutil, so contributors do not need registries, decorators, or core imports to make a plugin available.

The reason for this design is the open-closed principle: new analyses should be added by extension, not by modifying the orchestrator or CLI every time a metric is introduced.
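The sketch below shows one way pkgutil-based discovery can find both single-file and package plugins. It is not the code in discovery.py: the `discover` function, the `demo_analyses` package, and the `RMSF` and `SASA` classes are hypothetical, and the demo writes a throwaway package to a temporary directory so the example is self-contained.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path
from typing import Dict


def discover(package_name: str, base: type) -> Dict[str, type]:
    """Import each public submodule or subpackage and collect subclasses of base."""
    package = importlib.import_module(package_name)
    found: Dict[str, type] = {}
    for info in pkgutil.iter_modules(package.__path__, prefix=f"{package_name}."):
        if info.name.rsplit(".", 1)[-1].startswith("_"):
            continue  # private modules are not plugins
        module = importlib.import_module(info.name)
        for obj in vars(module).values():
            if isinstance(obj, type) and issubclass(obj, base) and obj is not base:
                found[obj.name] = obj
    return found


def write_demo_package(root: Path) -> None:
    """Create a tiny package with one single-file plugin and one package plugin."""
    pkg = root / "demo_analyses"
    (pkg / "sasa").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "_base.py").write_text("class Analysis:\n    name = 'base'\n")
    (pkg / "rmsf.py").write_text(
        "from demo_analyses._base import Analysis\n"
        "class RMSF(Analysis):\n    name = 'rmsf'\n"
    )
    (pkg / "sasa" / "__init__.py").write_text(
        "from demo_analyses._base import Analysis\n"
        "class SASA(Analysis):\n    name = 'sasa'\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    write_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        base_class = importlib.import_module("demo_analyses._base").Analysis
        plugins = discover("demo_analyses", base_class)
    finally:
        sys.path.remove(tmp)

plugin_names = sorted(plugins)
```

Adding a third plugin in this sketch means adding one file or directory. Neither `discover` nor its caller changes, which is the open-closed property described above.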

Public facade over private lifecycle internals

The analysis framework uses private modules internally because lifecycle code, artifact I/O, and validation contracts are implementation details. The public facade keeps contributor imports stable while giving maintainers room to improve internals.

This is why documentation points contributors to polyzymd.analyses.base and polyzymd.analyses.mda, not to _framework/.
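A project that wants to enforce this boundary mechanically could scan plugin source for imports that reach into `_framework`. The checker below is a hypothetical sketch using the standard `ast` module, not a tool that ships with PolyzyMD. The module path `polyzymd.analyses._framework.io` and the name `save` in the sample are also invented, and the sample source is only parsed, never imported.

```python
import ast
from typing import List


def private_framework_imports(source: str) -> List[str]:
    """Return dotted import targets that contain a `_framework` path component."""
    flagged: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            prefix = node.module or ""
            targets = [f"{prefix}.{alias.name}" for alias in node.names]
        else:
            continue
        flagged.extend(t for t in targets if "_framework" in t.split("."))
    return flagged


# Uses the public facade named in this page.
good_plugin = "from polyzymd.analyses.base import Analysis\n"

# Hypothetical private path, shown only to trigger the check.
bad_plugin = "from polyzymd.analyses._framework.io import save\n"

good_hits = private_framework_imports(good_plugin)
bad_hits = private_framework_imports(bad_plugin)
```

A check like this could run in continuous integration against contributed plugins, leaving maintainers free to reorganize private modules.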

Where contributors usually need to look

Configuration behavior: src/polyzymd/config/
Build behavior: src/polyzymd/builders/
Run, restart, or cluster behavior: src/polyzymd/simulation/ and src/polyzymd/workflow/
Analysis and comparison plugins: src/polyzymd/analyses/
Comparison configuration and CLI entry points: config/comparison.py and cli/compare.py
CLI commands: src/polyzymd/cli/

For the chain-ID convention used by selections and interpretation, see Understanding Residue Assignment in PolyzyMD.

A practical mental model

If you are new to the codebase, think in layers:

config describes what should happen
builders and simulation make it happen for one system
workflow makes it practical on clusters
engines isolates engine-specific details where possible
analyses plugins measure trajectories and preserve evidence as artifacts
comparison workflows interpret differences across study conditions

That mental model is usually enough to find the right subsystem before diving into module-level details or API reference pages.