Architecture
This page explains how PolyzyMD is organized, why the major subsystems are separated, and which boundaries matter when extending the project. It is a conceptual map, not a step-by-step contributor tutorial.
The high-level shape of the project
PolyzyMD follows the lifecycle of an enzyme-polymer molecular dynamics study:
load and validate configuration
build a molecular system
run simulation workflows locally or through SLURM
analyze trajectories into durable artifacts
compare conditions and create plots or reports
That lifecycle is reflected in the active package layout:
src/polyzymd/
├── analyses/ # artifact-native analysis and comparison plugin system
├── builders/ # molecular system construction
├── cli/ # command-line entry points
├── config/ # simulation and comparison configuration
├── core/ # shared domain types
├── data/ # bundled package data
├── engines/ # engine-specific integration layer
├── exporters/ # output format exporters
├── simulation/ # local simulation execution
├── templates/ # packaged example/scaffold templates
├── utils/ # general utilities
└── workflow/ # orchestration and SLURM support
Older analysis/ and compare/ directories may still exist in the source tree,
but they are not the primary architecture for new analysis or comparison work.
Current analysis and comparison behavior is concentrated in analyses/,
config/comparison.py, and cli/compare.py.
Why the code is split this way
The main boundary is between defining a study, running simulations, and interpreting results. Keeping these responsibilities separate lets users and contributors change one phase without accidentally coupling it to another.
Configuration describes intent
config/ holds schema and loading logic for YAML configuration, including
comparison configuration. It validates what a study should do before lower-level
builders or analysis plugins act on it.
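As a rough illustration of "validate intent before anything acts on it", the sketch below checks a parsed YAML mapping and turns it into a typed object. `StudyConfig`, `load_study_config`, and the three field names are hypothetical and far simpler than the real schema in config/; they only show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyConfig:
    """Hypothetical, heavily simplified stand-in for a validated study config."""

    name: str
    temperature_kelvin: float
    replicates: int


def load_study_config(raw: dict) -> StudyConfig:
    """Validate a parsed YAML mapping before any builder sees it."""
    missing = {"name", "temperature_kelvin", "replicates"} - raw.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    config = StudyConfig(
        name=str(raw["name"]),
        temperature_kelvin=float(raw["temperature_kelvin"]),
        replicates=int(raw["replicates"]),
    )
    # Reject values that parse correctly but make no physical sense.
    if config.temperature_kelvin <= 0:
        raise ValueError("temperature_kelvin must be positive")
    if config.replicates < 1:
        raise ValueError("replicates must be at least 1")
    return config


config = load_study_config(
    {"name": "demo", "temperature_kelvin": 300, "replicates": 3}
)
```

Failing early here means a typo in a configuration file surfaces as a clear error rather than as a half-built system or a wasted cluster job.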
Builders create simulation-ready systems
builders/ turns input structures into simulation-ready molecular systems by
assembling enzyme, substrate, polymer, solvent, and related components. The
builder layer stays focused on construction; it does not own long-running job
or analysis policy.
Simulation and workflow execute the study
simulation/ runs local minimization, equilibration, checkpointing,
continuation, and production segments. workflow/ handles orchestration around
those runs, especially SLURM job generation, resubmission, and recovery flows.
engines/ isolates engine-specific integration details such as OpenMM or
GROMACS support. This keeps high-level workflows from depending directly on one
engine’s file formats or object model.
Analyses interpret completed trajectories
analyses/ is the current analysis and comparison architecture. It is both a
plugin system and an artifact lifecycle. Plugins measure trajectories, produce
replicate artifacts, aggregate those artifacts per condition, compare conditions,
and optionally plot or format results.
This design keeps trajectory processing separate from ensemble interpretation: MDAnalysis handles per-trajectory analysis idioms, while PolyzyMD handles study structure, artifact identity, aggregation, comparison, and CLI integration.
The current analyses/ boundary
The analysis package is split by public surface and private implementation:
src/polyzymd/analyses/
├── base.py # stable contributor facade: Analysis, contexts, metrics
├── discovery.py # plugin auto-discovery
├── orchestrator.py # comparison workflow orchestration facade
├── stats.py # default scalar comparison pipeline
├── mda/ # public MDAnalysis extension layer
├── shared/ # reusable utilities shared by plugins
├── _framework/ # private/internal lifecycle, I/O, and contracts
└── <plugin>/ # built-in and contributed analysis plugins
The important public/private boundary is intentional:
polyzymd.analyses.base is the stable contributor facade for Analysis, lifecycle contexts, metrics, and comparison result models.
polyzymd.analyses.mda is the public MDAnalysis extension layer for jobs, frame selection, artifacts, artifact storage, aggregation, and Universe handling.
_framework/ modules are private implementation details behind the public facade.
Plugin helper modules named _*.py are private to their plugin package unless that plugin explicitly documents them as public.
Contributors should not import from _framework/ or rely on another plugin’s
private helper modules. The stable import surface is deliberately narrower than
the internal package layout so PolyzyMD can evolve lifecycle internals without
breaking plugins.
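One way to keep a plugin on the right side of this boundary is to scan its source for imports that reach into `_framework`. The checker below is a minimal sketch and not part of PolyzyMD. The `lifecycle` module and `run_stage` name in the sample source are invented for illustration. The source is only parsed, never imported.

```python
import ast


def find_private_framework_imports(source: str) -> list[str]:
    """Return imported module paths that reach into a `_framework` package."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        # Match on whole path components so unrelated names are not flagged.
        offenders.extend(m for m in modules if "_framework" in m.split("."))
    return offenders


# Hypothetical plugin source: the first import uses the public facade,
# the second reaches into private internals.
plugin_source = """
from polyzymd.analyses.base import Analysis
from polyzymd.analyses._framework.lifecycle import run_stage
"""
print(find_private_framework_imports(plugin_source))
# → ['polyzymd.analyses._framework.lifecycle']
```

This simple version does not catch `from polyzymd.analyses import _framework` or relative imports; a stricter check would also inspect the imported names.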
How the MDAnalysis lifecycle is divided
PolyzyMD and MDAnalysis share responsibility during trajectory analysis, but not at the same layer.
PolyzyMD resolves topology and trajectory paths from the study context, applies
frame-selection policy, and provides or caches loaded MDAnalysis Universe
objects through the MDA lifecycle. It also owns replicate discovery, cache
identity, artifact storage, aggregation, cross-condition comparison, and CLI
output.
MDAnalysis owns the per-trajectory analysis idioms: selecting atoms, iterating
frames, running AnalysisBase-compatible work, and producing MDAnalysis-style
Results objects. PolyzyMD collectors then translate completed MDAnalysis jobs
into project-level artifacts.
Conceptually, the current analysis flow is:
config -> builders -> simulation/workflow -> analyses/artifacts -> comparison -> plots
Within analyses/, that becomes:
MDA jobs
-> MDAnalysis work and Results
-> collectors
-> ReplicateArtifact objects
-> condition artifacts
-> comparison artifacts or documented custom outputs
-> plots and formatted CLI output
Most trajectory-native plugins implement build_mda_jobs() and usually provide
a collector so completed jobs become ReplicateArtifact objects. Those
replicate artifacts aggregate into per-condition artifacts, which then feed
default comparison artifacts or a plugin’s documented custom comparison output.
Plots should read cached artifacts and sidecars rather than reloading
trajectories or rerunning compute-stage analysis.
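The flow above can be sketched in plain Python. The classes below are simplified stand-ins that borrow the `ReplicateArtifact` name for readability; `ConditionArtifact`, `collect`, and `aggregate` are hypothetical names, and the real artifact classes presumably carry more metadata. Per-frame lists stand in for MDAnalysis Results, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class ReplicateArtifact:
    """Simplified stand-in for one replicate's stored result."""

    condition: str
    replicate: int
    value: float


@dataclass(frozen=True)
class ConditionArtifact:
    """Hypothetical per-condition summary built from replicate artifacts."""

    condition: str
    n_replicates: int
    mean: float
    std: float


def collect(condition, replicate, per_frame_values):
    """Collector step: reduce one finished job's results to a replicate artifact."""
    return ReplicateArtifact(condition, replicate, mean(per_frame_values))


def aggregate(artifacts):
    """Aggregation step: combine replicate artifacts for a single condition."""
    conditions = {a.condition for a in artifacts}
    if len(conditions) != 1:
        raise ValueError("aggregate() expects artifacts from exactly one condition")
    values = [a.value for a in artifacts]
    spread = stdev(values) if len(values) > 1 else 0.0
    return ConditionArtifact(conditions.pop(), len(values), mean(values), spread)


replicates = [
    collect("polymer_A", 1, [1.0, 2.0, 3.0]),
    collect("polymer_A", 2, [2.0, 3.0, 4.0]),
]
summary = aggregate(replicates)
```

Because each stage consumes only the previous stage's artifacts, a plot or comparison can start from `summary` without touching a trajectory again.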
For concrete commands and code examples, see Extend PolyzyMD with MDAnalysis-native analyses.
Comparison infrastructure is distributed
There is no separate active comparison stack that new plugins should target. Comparison behavior is distributed across focused modules:
config/comparison.py describes comparison and plotting settings.
cli/compare.py exposes the polyzymd compare command group.
analyses/stats.py implements the default scalar comparison pipeline, including summaries, rankings, pairwise comparisons, and formatting helpers.
analyses/shared/inferential_statistics.py provides lower-level statistical primitives such as t-tests, ANOVA, and effect sizes.
analyses/mda/ provides the artifact layer that carries replicate, condition, and comparison data through the lifecycle.
This distribution keeps comparison close to the artifacts it consumes while still allowing the CLI and configuration layers to remain stable entry points.
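To make "pairwise comparisons" and "effect sizes" concrete, here is a minimal standard-library sketch that computes Cohen's d for every pair of conditions. It is not the implementation in analyses/stats.py or analyses/shared/inferential_statistics.py. The function names, condition labels, and values are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (needs >= 2 values per group)."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def pairwise_effect_sizes(replicate_values):
    """Compare every pair of conditions using their per-replicate scalars."""
    return {
        (left, right): cohens_d(replicate_values[left], replicate_values[right])
        for left, right in combinations(sorted(replicate_values), 2)
    }


# Made-up per-replicate scalars keyed by hypothetical condition labels.
values = {
    "no_polymer": [1.0, 2.0, 3.0],
    "polymer_A": [3.0, 4.0, 5.0],
}
effects = pairwise_effect_sizes(values)
print(effects[("no_polymer", "polymer_A")])  # → -2.0
```

The sketch divides by the pooled standard deviation, so it fails with a `ZeroDivisionError` when every replicate in both groups is identical; production code would need to handle that case.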
Supporting packages
core/ and utils/
core/ and utils/ provide shared infrastructure such as common types,
experimental workflow labeling, and helper functionality that should not be
duplicated across the package.
data/ and templates/
data/ stores bundled package resources such as force-field or template data
that need to ship with PolyzyMD. It is not a user results directory.
templates/ contains packaged templates and examples used by scaffolding and
setup flows. These are starting points for generated files, not the
authoritative runtime schema.
exporters/
exporters/ contains format-export support for moving PolyzyMD outputs into
other molecular simulation ecosystems. Exporters sit at the edge of the package
rather than in the core build or simulation lifecycle.
How data moves through the system
At a conceptual level, data moves from declared intent to generated evidence:
config.yaml
-> validated config objects
-> system builders
-> simulation objects and run directories
-> local or SLURM execution
-> ReplicateArtifact files and sidecars
-> condition artifacts
-> comparison artifacts or documented custom outputs
-> plots and reports
This separation is intentional:
users can stop after building or running
analysis can be repeated without rebuilding simulations
comparison workflows can reuse cached analysis outputs
plotting can be rerun without recomputing statistics
plugin internals can change while public artifact and facade contracts remain stable
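Reuse of this kind depends on each cached output having a stable identity. The sketch below derives a key from an analysis name and its parameters, then skips recomputation when a matching file exists. The key scheme, file naming, and function names are assumptions made for illustration and do not describe PolyzyMD's actual cache format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(analysis: str, params: dict) -> str:
    """Derive a stable identity from the analysis name and its parameters."""
    payload = json.dumps({"analysis": analysis, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compute(cache_dir: Path, analysis: str, params: dict, compute):
    """Return a cached result if one exists; otherwise compute and store it."""
    path = cache_dir / f"{analysis}-{cache_key(analysis, params)}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result


calls = []


def expensive():
    """Stand-in for a compute stage; records how often it actually runs."""
    calls.append(1)
    return {"mean": 2.5}


cache_dir = Path(tempfile.mkdtemp())
first = load_or_compute(cache_dir, "rmsd", {"selection": "protein"}, expensive)
second = load_or_compute(cache_dir, "rmsd", {"selection": "protein"}, expensive)
print(len(calls))  # → 1
```

Changing any parameter changes the key, so a stale result is never returned for a different question.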
Design patterns you will encounter
Lazy imports for heavy dependencies
Modules that depend on OpenMM, OpenFF, MDAnalysis, or other heavy scientific packages often import those packages inside functions or methods instead of at module import time. This keeps lightweight CLI and documentation operations usable even when optional heavy dependencies are absent.
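A minimal sketch of the pattern, assuming a small helper named `require` (hypothetical, not a PolyzyMD function): the heavy package is imported only when a function that needs it is called, and a missing dependency produces a message that names the feature.

```python
import importlib


def require(module_name: str, feature: str):
    """Import a heavy dependency only when a feature actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'"
        ) from exc


def count_atoms(topology_path: str) -> int:
    """Hypothetical helper: MDAnalysis is imported on call, not at module import."""
    mda = require("MDAnalysis", "trajectory analysis")
    return mda.Universe(topology_path).atoms.n_atoms
```

Importing the module that defines `count_atoms` costs nothing extra; only calling it triggers the MDAnalysis import, so commands such as help output keep working in a minimal environment.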
Plugin-based extension points
Analysis is the primary extensibility axis. Plugins are single files or packages
under analyses/ that define an Analysis subclass. The framework discovers both
shapes automatically via pkgutil, so contributors do not need registries,
decorators, or core imports to make a plugin available.
The reason for this design is the open-closed principle: new analyses should be added by extension, not by modifying the orchestrator or CLI every time a metric is introduced.
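The discovery idea can be sketched with `pkgutil.iter_modules`, which lists single-file modules and sub-packages alike. The code below is not the logic in discovery.py. It builds a throwaway `demo_analyses` package on disk, with hypothetical `rmsd` and `contacts` plugins, so the example runs without PolyzyMD installed.

```python
import importlib
import inspect
import pkgutil
import sys
import tempfile
from pathlib import Path


def discover_plugins(package_name: str, base_class_name: str = "Analysis") -> dict:
    """Import every child module or package and collect subclasses of the base."""
    package = importlib.import_module(package_name)
    base = getattr(importlib.import_module(f"{package_name}.base"), base_class_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__):
        if info.name == "base" or info.name.startswith("_"):
            continue  # skip the facade itself and private helpers
        module = importlib.import_module(f"{package_name}.{info.name}")
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base:
                found[info.name] = obj
    return found


# Build a throwaway package on disk so the sketch is self-contained.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_analyses"
(pkg / "contacts").mkdir(parents=True)
(pkg / "__init__.py").write_text("")
(pkg / "base.py").write_text("class Analysis:\n    pass\n")
# Single-file plugin.
(pkg / "rmsd.py").write_text(
    "from demo_analyses.base import Analysis\nclass RMSD(Analysis):\n    pass\n"
)
# Package-shaped plugin.
(pkg / "contacts" / "__init__.py").write_text(
    "from demo_analyses.base import Analysis\nclass Contacts(Analysis):\n    pass\n"
)
sys.path.insert(0, str(root))

plugins = discover_plugins("demo_analyses")
print(sorted(plugins))  # → ['contacts', 'rmsd']
```

Adding a third plugin means adding a file or directory. Neither `discover_plugins` nor any caller changes, which is the open-closed property the design aims for.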
Public facade over private lifecycle internals
The analysis framework uses private modules internally because lifecycle code, artifact I/O, and validation contracts are implementation details. The public facade keeps contributor imports stable while giving maintainers room to improve internals.
This is why documentation points contributors to polyzymd.analyses.base and
polyzymd.analyses.mda, not to _framework/.
Where contributors usually need to look
Configuration behavior: src/polyzymd/config/
Build behavior: src/polyzymd/builders/
Run, restart, or cluster behavior: src/polyzymd/simulation/ and src/polyzymd/workflow/
Analysis and comparison plugins: src/polyzymd/analyses/
Comparison configuration and CLI entry points: config/comparison.py and cli/compare.py
CLI commands: src/polyzymd/cli/
For the chain-ID convention used by selections and interpretation, see Understanding Residue Assignment in PolyzyMD.
A practical mental model
If you are new to the codebase, think in layers:
config describes what should happen
builders and simulation make it happen for one system
workflow makes it practical on clusters
engines isolates engine-specific details where possible
analyses plugins measure trajectories and preserve evidence as artifacts
comparison workflows interpret differences across study conditions
That mental model is usually enough to find the right subsystem before diving into module-level details or API reference pages.