Convert a plugin into an advanced package

This how-to is for contributors whose analysis plugin has outgrown a single module. Use it when the plugin already works and the next task is to make its structure easier to maintain, test, and review.

An advanced package is still one plugin. The public entry point remains one Analysis subclass discovered by PolyzyMD. The package split only moves helper code into focused private modules inside that plugin package.

Decide whether to split

Do not split a plugin just because it might grow later. Keep a plugin as a single module when:

it has one small Settings model and one readable Analysis subclass;
build_mda_jobs() and the collector fit on the same screen;
plotting is absent or limited to one short helper;
results are plain artifacts with compact payloads and no custom output model;
formatting uses the default formatter or a short format() method; and
tests can find the behavior without navigating several files.

Split into a package when the current module is becoming hard to review. Good signals include:

MDAnalysis job setup or collectors need several helper classes or functions;
the plugin uses array or table sidecars and needs loading helpers;
plotting has several artifact-only figures or repeated data preparation;
aggregation, comparison, or CLI output uses structured Pydantic models;
formatting has enough logic to distract from the lifecycle methods; or
maintainers are repeatedly asking for clearer separation of responsibilities.

The goal is not abstraction for its own sake. The goal is a smaller public facade in __init__.py plus private modules that each have one reason to change.

Use built-in packages as examples, not import targets

Built-in plugins such as rmsf, distances, contacts, sasa, rg, rmsd, hydrogen_bonds, secondary_structure, and catalytic_triad show real package shapes. Their private helper modules are useful examples of organization:

_mda.py for trajectory-native job and collector helpers;
_plotters.py for artifact-only plotting helpers;
_models.py for domain schemas that validate artifact payload entries;
_formatters.py for substantial CLI formatting; and
plugin-specific modules such as _comparison.py or _filters.py only when a plugin has a genuine need.

Do not import private modules from another plugin. For contributor plugins, use the public facades only:

from polyzymd.analyses.base import Analysis, PlotContext
from polyzymd.analyses.mda import MDAAnalysisJob, MDAReplicateJobContext

Documented utilities from polyzymd.analyses.shared are also acceptable when an existing shared helper fits the task. Do not import from polyzymd.analyses._framework; it is a private implementation detail behind the public facades.
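
The import rules above can be checked mechanically. The following is a minimal sketch, not part of PolyzyMD: the helper name find_private_imports is hypothetical, and it assumes that any underscore-prefixed segment below polyzymd.analyses marks a private module. It only parses source text, so nothing is actually imported.

```python
import ast

ANALYSES_ROOT = "polyzymd.analyses"


def find_private_imports(source: str) -> list[str]:
    """Return absolute imports that reach into private PolyzyMD analysis modules."""
    offenders: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so `from pkg import _mod` is caught too.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # relative imports stay inside the plugin's own package
        for name in names:
            if not name.startswith(ANALYSES_ROOT + "."):
                continue
            parts = name[len(ANALYSES_ROOT) + 1 :].split(".")
            # Assumption: a leading underscore on any segment means "private".
            if any(part.startswith("_") for part in parts):
                offenders.append(name)
    return offenders
```

Running it over each file in a contributor plugin should return an empty list. Relative imports such as from ._mda import ... are skipped on purpose because they stay inside the plugin package.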

Target package layout

For a plugin named solvent_shell, convert the single module into a directory like this:

src/polyzymd/analyses/solvent_shell/
├── __init__.py
├── _mda.py
├── _plotters.py
├── _models.py
└── _formatters.py

Create only the files that have a current job. If your plugin has no custom formatting yet, skip _formatters.py. If plain artifact payload dictionaries are enough, skip _models.py.
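
As a rough illustration of "create only the files that have a current job", the sketch below scaffolds the package directory with the standard library. The function scaffold_plugin_package and the placeholder docstrings it writes are hypothetical and not part of PolyzyMD.

```python
from __future__ import annotations

from pathlib import Path


def scaffold_plugin_package(
    analyses_dir: Path, plugin: str, private_modules: tuple[str, ...] = ()
) -> list[Path]:
    """Create <plugin>/__init__.py plus only the requested private modules."""
    for filename in private_modules:
        if not (filename.startswith("_") and filename.endswith(".py")):
            raise ValueError(f"expected a private module like '_mda.py', got {filename!r}")

    package_dir = analyses_dir / plugin
    package_dir.mkdir(parents=True, exist_ok=True)

    created: list[Path] = []
    for filename in ("__init__.py", *private_modules):
        path = package_dir / filename
        if path.exists():
            continue  # never overwrite code that has already been moved
        path.write_text(f'"""{plugin} plugin: {filename}."""\n', encoding="utf-8")
        created.append(path)
    return created
```

For example, scaffold_plugin_package(Path("src/polyzymd/analyses"), "solvent_shell", ("_mda.py", "_plotters.py")) would leave _models.py and _formatters.py absent until the plugin needs them.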

File	Responsibility	Keep out
`__init__.py`	Public plugin class, settings model, class variables, lifecycle wiring, and small orchestration methods.	Long MDAnalysis loops, large plot functions, bulky formatters.
`_mda.py`	MDAnalysis job functions, `AnalysisBase`-compatible workers, collectors, sidecar writing, and helpers that translate runtime results into `ReplicateArtifact` objects.	Public plugin registration, CLI formatting, plot rendering.
`_plotters.py`	Artifact-only plot helpers that load `ConditionArtifact`, `ComparisonArtifact`, and registered sidecars, then write figures.	Trajectory loading, MDAnalysis job execution, recomputation.
`_models.py`	Pydantic domain schemas for validating nested artifact payload entries or custom comparison outputs.	Alternate aggregate cache loaders, raw MDAnalysis `Results` objects, or large frame-by-frame arrays that belong in sidecars.
`_formatters.py`	Substantial text/table formatting used by `format()`.	Scientific computation, artifact writes, plotting side effects.

Keep `__init__.py` as the public facade

After the split, __init__.py should still be the place where discovery finds the plugin class. Keep the public class readable and import private helpers from the same package with relative imports.

from typing import ClassVar

from pydantic import BaseModel, Field

from polyzymd.analyses.base import Analysis, PlotContext
from polyzymd.analyses.mda import MDACollectorContext, MDAReplicateJobContext

from ._formatters import format_solvent_shell
from ._mda import SolventShellCollector, build_solvent_shell_jobs
from ._plotters import plot_solvent_shell_summary


class SolventShellSettings(BaseModel):
    """Settings for solvent-shell analysis."""

    selection: str = Field(default="protein and chainid A")


class SolventShellAnalysis(Analysis):
    """Analyze solvent-shell behavior around the protein."""

    name: ClassVar[str] = "solvent_shell"
    Settings: ClassVar[type[BaseModel]] = SolventShellSettings

    def build_mda_jobs(self, ctx: MDAReplicateJobContext):
        return build_solvent_shell_jobs(ctx)

    def build_mda_collector(self, ctx: MDACollectorContext):
        del ctx
        return SolventShellCollector()

    def plot(self, ctx: PlotContext):
        return plot_solvent_shell_summary(ctx)

    def format(self, result, output_format: str = "text") -> str:
        return format_solvent_shell(result, output_format=output_format)

This file wires together the lifecycle but does not hide heavy computation or plotting details inside the class body.

Move trajectory work to `_mda.py`

Put trajectory-native details in _mda.py: job construction, function-adapter jobs, AnalysisBase-compatible workers, collectors, and sidecar writes. Import heavy dependencies lazily inside functions or methods that need them.

from polyzymd.analyses.mda import (
    MDAAnalysisJob,
    MDACollectorContext,
    MDAJobResult,
    MDAReplicateJobContext,
    ReplicateArtifact,
    frame_selection_payload,
)


def calculate_shell_counts(
    universe,
    *,
    selection: str,
    start=None,
    stop=None,
    step=None,
    frames=None,
    **_frame_kwargs,
):
    """Calculate compact per-replicate values for one trajectory."""

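    import numpy as np  # imported lazily so plugin discovery stays fast

    atoms = universe.select_atoms(selection)
    values: list[float] = []
    # Explicit frame indices take precedence over a start/stop/step window.
    iterator = (
        universe.trajectory[frames]
        if frames is not None
        else universe.trajectory[start:stop:step]
    )

    for _ts in iterator:
        # Placeholder metric: the number of selected atoms in the current frame.
        values.append(float(np.asarray(atoms.positions).shape[0]))

    # Return compact scalars only; large per-frame arrays belong in sidecars.
    return {"mean_count": float(np.mean(values)), "n_frames": len(values)}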


def build_solvent_shell_jobs(ctx: MDAReplicateJobContext) -> list[MDAAnalysisJob]:
    settings = ctx.settings
    return [
        MDAAnalysisJob.from_function(
            name="solvent_shell_counts",
            function=calculate_shell_counts,
            universe=ctx.universe,
            frame_selection=ctx.frame_selection,
            universe_policy=ctx.universe_policy,
            function_kwargs={"selection": settings.selection},
        )
    ]


class SolventShellCollector:
    """Collect one completed job into a replicate artifact."""

    def __call__(
        self,
        ctx: MDACollectorContext,
        completed_jobs: list[MDAJobResult],
    ) -> ReplicateArtifact:
        job = completed_jobs[0]
        mean_count = float(job.results["mean_count"])
        metadata = {"result_kind": "solvent_shell_replicate"}
        if ctx.settings_fingerprint is not None:
            metadata["settings_fingerprint"] = ctx.settings_fingerprint

        return ReplicateArtifact(
            analysis_name=ctx.analysis_name,
            condition_label=ctx.condition_label,
            replicate=ctx.replicate,
            payload={"metrics": {"mean_shell_count": mean_count}},
            provenance={"frame_selection": frame_selection_payload(ctx.frame_selection)},
            metadata=metadata,
            warnings=list(ctx.warnings),
        )

The collector returns a durable artifact. It should not serialize raw MDAnalysis Results objects. If the job produces arrays or event tables, write registered sidecars as shown in the how-to "Store large analysis outputs with artifact sidecars".

Function-adapter workers receive frame-selection keyword arguments from PolyzyMD’s FrameSelection. They must respect both explicit frames selectors and start/stop/step selectors so equilibration cuts, analysis windows, and strides are honored consistently.
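
The precedence between the two selector styles can be sketched without MDAnalysis. The helper below is hypothetical and only mirrors the indexing used in calculate_shell_counts: explicit frames win, otherwise start/stop/step follow ordinary Python slice semantics.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_frame_indices(
    n_frames: int,
    *,
    start: int | None = None,
    stop: int | None = None,
    step: int | None = None,
    frames: Sequence[int] | None = None,
) -> list[int]:
    """Return the frame indices a worker would visit for a trajectory of n_frames."""
    if frames is not None:
        # Explicit selectors win, mirroring `universe.trajectory[frames]`.
        return [int(index) for index in frames]
    # Slice semantics, mirroring `universe.trajectory[start:stop:step]`.
    return list(range(n_frames)[start:stop:step])
```

A worker that ignores frames, or that always iterates the full trajectory, would silently drop an equilibration cut or stride, which is why both branches need test coverage.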

Move artifact-only plotting to `_plotters.py`

Plot helpers should read cached artifacts and registered sidecars only. They should not load trajectories, create MDAAnalysisJob objects, or rerun compute work.

from pathlib import Path

from polyzymd.analyses.base import PlotContext
from polyzymd.analyses.mda import ArtifactStore


def plot_solvent_shell_summary(ctx: PlotContext) -> list[Path]:
    """Plot from cached condition artifacts and sidecars."""

    import matplotlib.pyplot as plt

    output_path = ctx.output_dir / "solvent_shell_summary.png"
    labels: list[str] = []
    values: list[float] = []

    for condition in ctx.conditions:
        analysis_dir = ctx.analysis_dirs.get(condition.label)
        if analysis_dir is None:
            continue
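        # Read the cached condition artifact; no trajectory is loaded here.
        artifact = ArtifactStore(analysis_dir).read_condition_result()
        metric = artifact.payload.get("metrics", {}).get("mean_shell_count")
        if metric is None:
            continue
        labels.append(condition.label)
        # The aggregated metric is expected to be a mapping with a "mean" entry.
        values.append(float(metric["mean"]))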

    if not values:
        return []

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel("Mean shell count")
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)
    return [output_path]

Keep all heavy plotting dependencies inside the plotting function. This preserves fast imports for users who only need configuration, discovery, or API docs.
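
One way to keep this rule from regressing is a small static check in the plugin tests. The sketch below is an assumption-laden example rather than a PolyzyMD utility: the HEAVY_PACKAGES set and the function name are hypothetical, and only direct module-level statements are inspected, so imports nested in a top-level if or try block are not seen.

```python
import ast

# Assumed list of heavy top-level packages; adjust it for your plugin.
HEAVY_PACKAGES = frozenset({"matplotlib", "MDAnalysis", "numpy"})


def module_level_heavy_imports(source: str) -> list[str]:
    """List heavy packages imported at module scope rather than inside functions."""
    found: list[str] = []
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            roots = [node.module.split(".")[0]]
        else:
            continue
        found.extend(root for root in roots if root in HEAVY_PACKAGES)
    return found
```

Calling it on the text of _plotters.py or _mda.py should return an empty list when every heavy import lives inside the function that needs it.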

Add `_models.py` only for domain schemas

Use _models.py when the plugin has a custom Pydantic output model or typed helpers that make aggregation and comparison clearer. Do not add a result model only to wrap one scalar already stored under payload["metrics"].

Good uses for _models.py include:

custom comparison outputs with several tables or ranked entries;
aggregate summaries that need schema validation beyond artifact payloads;
small typed models that point to sidecars by reference; and
typed custom comparison models when the default comparison artifact is not enough.

Large arrays, per-frame matrices, and event streams should remain in registered sidecars, with _models.py storing only validated summaries or sidecar references.
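
The idea of a summary that points to a sidecar by reference can be sketched as follows. To stay self-contained, this example uses standard-library dataclasses as a stand-in for the Pydantic models the plugin would normally define; every class name, field name, and path here is hypothetical.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SidecarRef:
    """Pointer to a registered sidecar; the array itself stays on disk."""

    name: str
    relative_path: str
    shape: tuple[int, ...]

    def __post_init__(self) -> None:
        if not self.name or not self.relative_path:
            raise ValueError("a sidecar reference needs a name and a path")
        if any(dim < 0 for dim in self.shape):
            raise ValueError("shape dimensions must be non-negative")


@dataclass(frozen=True)
class ShellSummary:
    """Validated summary: compact scalars plus a reference, never the raw array."""

    mean_shell_count: float
    n_frames: int
    per_frame_counts: SidecarRef

    def __post_init__(self) -> None:
        if self.n_frames < 0:
            raise ValueError("n_frames must be non-negative")

    def to_payload(self) -> dict:
        """Return a plain nested dict suitable for an artifact payload."""
        return asdict(self)
```

The payload stays small no matter how long the trajectory is, because only the reference and its shape are stored alongside the summary values.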

Add `_formatters.py` only for substantial formatting

Use _formatters.py when format() has enough branches or table-building logic to obscure the Analysis subclass. Keep format() in __init__.py as a small delegation method and put the formatting implementation in _formatters.py.

def format_solvent_shell(result, *, output_format: str = "text") -> str:
    """Format solvent-shell comparison output for the CLI."""

    if output_format == "json":
        return result.model_dump_json(indent=2)
    return "Solvent-shell comparison complete."

Do not compute new scientific results in a formatter. It should only render data that aggregation or comparison already produced.
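
A render-only helper might look like the sketch below. It assumes the per-condition means have already been computed upstream and are passed in as a plain mapping; the function name and the input shape are hypothetical, not the plugin's actual result model.

```python
from collections.abc import Mapping


def render_mean_table(means: Mapping[str, float], *, header: str = "Mean shell count") -> str:
    """Render precomputed per-condition means as an aligned text table."""
    if not means:
        return "No conditions to report."

    width = max(len("Condition"), *(len(label) for label in means))
    lines = [f"{'Condition':<{width}}  {header}"]
    lines.append("-" * len(lines[0]))
    for label, value in means.items():
        # Formatting only: the value is displayed, never recomputed.
        lines.append(f"{label:<{width}}  {value:.2f}")
    return "\n".join(lines)
```

Because the helper takes plain values and returns a string, it can be unit-tested without trajectories, artifacts, or plotting dependencies.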

Migrate in small steps

Start from a passing single-file plugin. Run the focused plugin tests before the split so you know whether a later failure came from the refactor.
Create the package directory. Replace solvent_shell.py with solvent_shell/__init__.py and keep the public Analysis subclass's class name, its name attribute, and its Settings model unchanged.
Move trajectory helpers first. Put job functions, workers, collectors, and sidecar writes in _mda.py. Keep public imports from polyzymd.analyses.base, polyzymd.analyses.mda, and documented polyzymd.analyses.shared utilities.
Move plot helpers next. Put plot data loading and rendering in _plotters.py. Verify plots read artifacts and sidecars only.
Move structured models only if needed. Add _models.py when the plugin has a custom result contract. Otherwise, keep relying on canonical artifacts.
Move formatting only if needed. Add _formatters.py when CLI rendering is substantial enough to deserve its own file.
Run focused tests and docs checks. Use the plugin test file and the docs build before opening a pull request.

Example validation commands:

pixi run -e build pytest tests/analyses/plugins/test_solvent_shell.py -v
pixi run -e build make -C docs clean html
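
Alongside those commands, a small snapshot check can confirm that the refactor left the public surface alone. This is a minimal sketch with a hypothetical helper name; it relies only on the fact that solvent_shell.py and solvent_shell/__init__.py are imported under the same dotted module name, and on the name and Settings class variables shown earlier.

```python
import importlib


def public_surface(module_name: str, class_name: str) -> dict[str, str]:
    """Capture the parts of a plugin class that the package split must not change."""
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return {
        "class": cls.__name__,
        "name": cls.name,
        "settings": cls.Settings.__name__,
    }
```

Record the returned dictionary before the split, for example from public_surface("polyzymd.analyses.solvent_shell", "SolventShellAnalysis"), and compare it with the value produced after the files have moved.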

Migration checklist

Before you consider the package split complete, check that:

__init__.py still exposes exactly one public plugin class for discovery.
The plugin’s name and Settings are unchanged unless you intentionally made a user-facing configuration change.
All imports from PolyzyMD use public facades or documented shared utilities.
No contributor code imports from another plugin’s private helper modules.
Heavy dependencies such as MDAnalysis, NumPy, or matplotlib are imported lazily, inside the functions that need them, when practical.
Collectors return ReplicateArtifact objects and write large arrays or tables as registered sidecars.
Aggregation and comparison consume artifacts or validated structured outputs, not raw runtime containers.
plot() delegates to helpers that load cached artifacts and sidecars only.
_models.py and _formatters.py exist only if they have clear current responsibilities.
Focused plugin tests pass after the file move.

Success state

You have converted a plugin into an advanced package when the public plugin class is easier to read, private helpers have focused responsibilities, imports stay on public PolyzyMD facades, heavy dependencies remain lazy, and plotting is still artifact-only.