# Architecture

This page explains how PolyzyMD is organized, why the major subsystems are separated, and which boundaries matter when extending the project. It is a conceptual map, not a step-by-step contributor tutorial.

## The high-level shape of the project

PolyzyMD follows the lifecycle of an enzyme-polymer molecular dynamics study:

1. load and validate configuration
2. build a molecular system
3. run simulation workflows locally or through SLURM
4. analyze trajectories into durable artifacts
5. compare conditions and create plots or reports

That lifecycle is reflected in the active package layout:

```text
src/polyzymd/
├── analyses/     # artifact-native analysis and comparison plugin system
├── builders/     # molecular system construction
├── cli/          # command-line entry points
├── config/       # simulation and comparison configuration
├── core/         # shared domain types
├── data/         # bundled package data
├── engines/      # engine-specific integration layer
├── exporters/    # output format exporters
├── simulation/   # local simulation execution
├── templates/    # packaged example/scaffold templates
├── utils/        # general utilities
└── workflow/     # orchestration and SLURM support
```

Older `analysis/` and `compare/` directories may still exist in the source tree, but they are not the primary architecture for new analysis or comparison work. Current analysis and comparison behavior is concentrated in `analyses/`, `config/comparison.py`, and `cli/compare.py`.

## Why the code is split this way

The main boundary is between defining a study, running simulations, and interpreting results. Keeping these responsibilities separate lets users and contributors change one phase without accidentally coupling it to another.

### Configuration describes intent

`config/` holds schema and loading logic for YAML configuration, including comparison configuration. It validates what a study should do before lower-level builders or analysis plugins act on it.

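As a rough illustration of validating intent before anything acts on it, the sketch below uses plain dataclasses. `StudyConfig`, its fields, and `load_study_config` are hypothetical names invented for this example; the real schema lives in `config/` and may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyConfig:
    """Hypothetical, minimal stand-in for a validated study configuration."""

    name: str
    temperature_k: float
    replicates: int

    def __post_init__(self) -> None:
        # Reject impossible values up front, before any builder runs.
        if not self.name:
            raise ValueError("name must be a non-empty string")
        if self.temperature_k <= 0:
            raise ValueError("temperature_k must be positive")
        if self.replicates < 1:
            raise ValueError("replicates must be at least 1")


def load_study_config(raw: dict) -> StudyConfig:
    """Turn a raw mapping (for example, parsed YAML) into a validated object."""
    allowed = {"name", "temperature_k", "replicates"}
    unknown = set(raw) - allowed
    if unknown:
        # Unknown keys are usually typos; fail loudly instead of ignoring them.
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    return StudyConfig(**raw)


config = load_study_config(
    {"name": "demo", "temperature_k": 300.0, "replicates": 3}
)
```

The point of the pattern is that downstream code only ever receives an object that has already passed validation, so builders and analysis plugins do not need to repeat those checks.
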
### Builders create simulation-ready systems

`builders/` turns input structures into simulation-ready molecular systems by assembling enzyme, substrate, polymer, solvent, and related components. The builder layer stays focused on construction; it does not own long-running job or analysis policy.

### Simulation and workflow execute the study

`simulation/` runs local minimization, equilibration, checkpointing, continuation, and production segments. `workflow/` handles orchestration around those runs, especially SLURM job generation, resubmission, and recovery flows.

`engines/` isolates engine-specific integration details such as OpenMM or GROMACS support. This keeps high-level workflows from depending directly on one engine's file formats or object model.

### Analyses interpret completed trajectories

`analyses/` is the current analysis and comparison architecture. It is both a plugin system and an artifact lifecycle. Plugins measure trajectories, produce replicate artifacts, aggregate those artifacts per condition, compare conditions, and optionally plot or format results.

This design keeps trajectory processing separate from ensemble interpretation: MDAnalysis handles per-trajectory analysis idioms, while PolyzyMD handles study structure, artifact identity, aggregation, comparison, and CLI integration.

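The replicate, condition, and comparison stages described above can be pictured with a small standard-library sketch. The names `ReplicateValue`, `ConditionSummary`, `aggregate`, and `compare_to_baseline` are hypothetical and exist only for this illustration; they are not PolyzyMD's artifact classes, and the real lifecycle also handles storage, caching, and identity.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicateValue:
    """Hypothetical per-replicate scalar measured from one trajectory."""

    condition: str
    replicate: int
    value: float


@dataclass(frozen=True)
class ConditionSummary:
    """Hypothetical per-condition aggregate over its replicates."""

    condition: str
    n: int
    mean: float
    std: float


def aggregate(replicates: list[ReplicateValue]) -> dict[str, ConditionSummary]:
    """Group replicate values by condition and summarize each group."""
    grouped: dict[str, list[float]] = {}
    for rep in replicates:
        grouped.setdefault(rep.condition, []).append(rep.value)
    return {
        name: ConditionSummary(
            condition=name,
            n=len(values),
            mean=statistics.mean(values),
            # A sample standard deviation needs at least two replicates.
            std=statistics.stdev(values) if len(values) > 1 else 0.0,
        )
        for name, values in grouped.items()
    }


def compare_to_baseline(
    summaries: dict[str, ConditionSummary], baseline: str
) -> dict[str, float]:
    """Return each condition's mean difference relative to the baseline."""
    base = summaries[baseline].mean
    return {
        name: s.mean - base for name, s in summaries.items() if name != baseline
    }


replicates = [
    ReplicateValue("control", 1, 1.0),
    ReplicateValue("control", 2, 3.0),
    ReplicateValue("polymer", 1, 5.0),
]
summaries = aggregate(replicates)
differences = compare_to_baseline(summaries, "control")
```

Each stage consumes only the output of the previous one, which is why later stages can be rerun without touching trajectories again.
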
## The current `analyses/` boundary

The analysis package is split by public surface and private implementation:

```text
src/polyzymd/analyses/
├── base.py          # stable contributor facade: Analysis, contexts, metrics
├── discovery.py     # plugin auto-discovery
├── orchestrator.py  # comparison workflow orchestration facade
├── stats.py         # default scalar comparison pipeline
├── mda/             # public MDAnalysis extension layer
├── shared/          # reusable utilities shared by plugins
├── _framework/      # private/internal lifecycle, I/O, and contracts
└── /                # built-in and contributed analysis plugins
```

The important public/private boundary is intentional:

- `polyzymd.analyses.base` is the stable contributor facade for `Analysis`, lifecycle contexts, metrics, and comparison result models.
- `polyzymd.analyses.mda` is the public MDAnalysis extension layer for jobs, frame selection, artifacts, artifact storage, aggregation, and Universe handling.
- `_framework/` modules are private implementation details behind the public facade.
- Plugin helper modules named `_*.py` are private to their plugin package unless that plugin explicitly documents them as public.

Contributors should not import from `_framework/` or rely on another plugin's private helper modules. The stable import surface is deliberately narrower than the internal package layout so PolyzyMD can evolve lifecycle internals without breaking plugins.

## How the MDAnalysis lifecycle is divided

PolyzyMD and MDAnalysis share responsibility during trajectory analysis, but not at the same layer.

PolyzyMD resolves topology and trajectory paths from the study context, applies frame-selection policy, and provides or caches loaded MDAnalysis `Universe` objects through the MDA lifecycle. It also owns replicate discovery, cache identity, artifact storage, aggregation, cross-condition comparison, and CLI output.

MDAnalysis owns the per-trajectory analysis idioms: selecting atoms, iterating frames, running `AnalysisBase`-compatible work, and producing MDAnalysis-style `Results` objects. PolyzyMD collectors then translate completed MDAnalysis jobs into project-level artifacts.

Conceptually, the current analysis flow is:

```text
config -> builders -> simulation/workflow -> analyses/artifacts -> comparison -> plots
```

Within `analyses/`, that becomes:

```text
MDA jobs
  -> MDAnalysis work and Results
  -> collectors
  -> ReplicateArtifact objects
  -> condition artifacts
  -> comparison artifacts or documented custom outputs
  -> plots and formatted CLI output
```

Most trajectory-native plugins implement `build_mda_jobs()` and usually provide a collector so completed jobs become `ReplicateArtifact` objects. Those replicate artifacts aggregate into per-condition artifacts, which then feed default comparison artifacts or a plugin's documented custom comparison output. Plots should read cached artifacts and sidecars rather than reloading trajectories or rerunning compute-stage analysis.

For concrete commands and code examples, see {doc}`../contributor_guide/extending_analyses`.

## Comparison infrastructure is distributed

There is no separate active comparison stack that new plugins should target. Comparison behavior is distributed across focused modules:

- `config/comparison.py` describes comparison and plotting settings.
- `cli/compare.py` exposes the `polyzymd compare` command group.
- `analyses/stats.py` implements the default scalar comparison pipeline, including summaries, rankings, pairwise comparisons, and formatting helpers.
- `analyses/shared/inferential_statistics.py` provides lower-level statistical primitives such as t-tests, ANOVA, and effect sizes.
- `analyses/mda/` provides the artifact layer that carries replicate, condition, and comparison data through the lifecycle.

This distribution keeps comparison close to the artifacts it consumes while still allowing the CLI and configuration layers to remain stable entry points.

## Supporting packages

### `core/` and `utils/`

`core/` and `utils/` provide shared infrastructure such as common types, experimental workflow labeling, and helper functionality that should not be duplicated across the package.

### `data/` and `templates/`

`data/` stores bundled package resources such as force-field or template data that need to ship with PolyzyMD. It is not a user results directory.

`templates/` contains packaged templates and examples used by scaffolding and setup flows. These are starting points for generated files, not the authoritative runtime schema.

### `exporters/`

`exporters/` contains format-export support for moving PolyzyMD outputs into other molecular simulation ecosystems. Exporters sit at the edge of the package rather than in the core build or simulation lifecycle.

## How data moves through the system

At a conceptual level, data moves from declared intent to generated evidence:

```text
config.yaml
  -> validated config objects
  -> system builders
  -> simulation objects and run directories
  -> local or SLURM execution
  -> ReplicateArtifact files and sidecars
  -> condition artifacts
  -> comparison artifacts or documented custom outputs
  -> plots and reports
```

This separation is intentional:

- users can stop after building or running
- analysis can be repeated without rebuilding simulations
- comparison workflows can reuse cached analysis outputs
- plotting can be rerun without recomputing statistics
- plugin internals can change while public artifact and facade contracts remain stable

## Design patterns you will encounter

### Lazy imports for heavy dependencies

Modules that depend on OpenMM, OpenFF, MDAnalysis, or other heavy scientific packages often import those packages inside functions or methods instead of at module import time.
This keeps lightweight CLI and documentation operations usable even when optional heavy dependencies are absent.

### Plugin-based extension points

Analysis is the primary extensibility axis. Plugins are single files or packages under `analyses/` that subclass `Analysis`. The framework discovers both shapes automatically via `pkgutil`, so contributors do not need registries, decorators, or core imports to make a plugin available.

The reason for this design is the open-closed principle: new analyses should be added by extension, not by modifying the orchestrator or CLI every time a metric is introduced.

### Public facade over private lifecycle internals

The analysis framework uses private modules internally because lifecycle code, artifact I/O, and validation contracts are implementation details. The public facade keeps contributor imports stable while giving maintainers room to improve internals.

This is why documentation points contributors to `polyzymd.analyses.base` and `polyzymd.analyses.mda`, not to `_framework/`.

## Where contributors usually need to look

- **Configuration behavior:** `src/polyzymd/config/`
- **Build behavior:** `src/polyzymd/builders/`
- **Run, restart, or cluster behavior:** `src/polyzymd/simulation/` and `src/polyzymd/workflow/`
- **Analysis and comparison plugins:** `src/polyzymd/analyses/`
- **Comparison configuration and CLI entry points:** `config/comparison.py` and `cli/compare.py`
- **CLI commands:** `src/polyzymd/cli/`

For the chain-ID convention used by selections and interpretation, see {doc}`residue_assignment`.

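For contributors heading into `analyses/`, the `pkgutil`-based discovery mentioned under "Plugin-based extension points" can be sketched as follows. This is an illustration of the general pattern, not PolyzyMD's `discovery.py`: the throwaway `demo_analyses` package, its `Sasa` and `Rmsd` classes, the `name` attribute, and the rule that underscore-prefixed modules are skipped are all assumptions made for the example.

```python
import importlib
import inspect
import pkgutil
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path) -> None:
    """Write a throwaway package with one file plugin and one package plugin."""
    pkg = root / "demo_analyses"
    (pkg / "rmsd").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "base.py").write_text("class Analysis:\n    name = 'base'\n")
    (pkg / "_private.py").write_text("")  # helper module, not a plugin
    (pkg / "sasa.py").write_text(
        "from demo_analyses.base import Analysis\n"
        "class Sasa(Analysis):\n    name = 'sasa'\n"
    )
    (pkg / "rmsd" / "__init__.py").write_text(
        "from demo_analyses.base import Analysis\n"
        "class Rmsd(Analysis):\n    name = 'rmsd'\n"
    )


def discover(package_name: str) -> dict[str, type]:
    """Import every public child module and collect Analysis subclasses."""
    package = importlib.import_module(package_name)
    base = importlib.import_module(f"{package_name}.base").Analysis
    found: dict[str, type] = {}
    # iter_modules yields both plain modules and sub-packages.
    for info in pkgutil.iter_modules(package.__path__, prefix=f"{package_name}."):
        if info.name.rsplit(".", 1)[-1].startswith("_"):
            continue  # assumption: underscore-prefixed modules are private
        module = importlib.import_module(info.name)
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, base) and obj is not base:
                found[obj.name] = obj
    return found


with tempfile.TemporaryDirectory() as tmp:
    make_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()  # the package was created after startup
    try:
        plugins = discover("demo_analyses")
    finally:
        sys.path.remove(tmp)
```

Because the loop walks the package rather than a hand-maintained list, adding a new file or sub-package is enough to make a plugin visible, which is the open-closed property the design is after.
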
## A practical mental model

If you are new to the codebase, think in layers:

- `config` describes what should happen
- `builders` and `simulation` make it happen for one system
- `workflow` makes it practical on clusters
- `engines` isolates engine-specific details where possible
- `analyses` plugins measure trajectories and preserve evidence as artifacts
- comparison workflows interpret differences across study conditions

That mental model is usually enough to find the right subsystem before diving into module-level details or API reference pages.

## Related pages

- contributor workflows: {doc}`../contributor_guide/contributing`
- extending analyses: {doc}`../contributor_guide/extending_analyses`
- chain conventions: {doc}`residue_assignment`
- SLURM usage: {doc}`../how_to/hpc_slurm`
- API reference: {doc}`../api/index`