# Convert a plugin into an advanced package

This how-to is for contributors whose analysis plugin has outgrown a single
module. Use it when the plugin already works and the next task is to make its
structure easier to maintain, test, and review.

An advanced package is still one plugin. The public entry point remains one
`Analysis` subclass discovered by PolyzyMD. The package split only moves helper
code into focused private modules inside that plugin package.

## Decide whether to split

Do **not** split a plugin just because it might grow later. Keep a plugin as a
single module when:

- it has one small `Settings` model and one readable `Analysis` subclass;
- `build_mda_jobs()` and the collector fit on the same screen;
- plotting is absent or limited to one short helper;
- results are plain artifacts with compact payloads and no custom output model;
- formatting uses the default formatter or a short `format()` method; and
- tests can find the behavior without navigating several files.

Split into a package when the current module is becoming hard to review. Good
signals include:

- MDAnalysis job setup or collectors need several helper classes or functions;
- the plugin uses array or table sidecars and needs loading helpers;
- plotting has several artifact-only figures or repeated data preparation;
- aggregation, comparison, or CLI output uses structured Pydantic models;
- formatting has enough logic to distract from the lifecycle methods; or
- maintainers repeatedly ask for clearer separation of responsibilities.

The goal is not abstraction for its own sake. The goal is a smaller public
facade in `__init__.py` plus private modules that each have one reason to
change.

## Use built-in packages as examples, not import targets

Built-in plugins such as `rmsf`, `distances`, `contacts`, `sasa`, `rg`, `rmsd`,
`hydrogen_bonds`, `secondary_structure`, and `catalytic_triad` show real package
shapes. Their private helper modules are useful examples of organization:

- `_mda.py` for trajectory-native job and collector helpers;
- `_plotters.py` for artifact-only plotting helpers;
- `_models.py` for domain schemas that validate artifact payload entries;
- `_formatters.py` for substantial CLI formatting; and
- plugin-specific modules such as `_comparison.py` or `_filters.py` only when a
  plugin has a genuine need.

Do not import private modules from another plugin. For contributor plugins, use
the public facades only:

```python
from polyzymd.analyses.base import Analysis, PlotContext
from polyzymd.analyses.mda import MDAAnalysisJob, MDAReplicateJobContext
```

Documented utilities from `polyzymd.analyses.shared` are also acceptable when an
existing shared helper fits the task. Do not import from
`polyzymd.analyses._framework`; it is a private implementation detail behind the
public facades.
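
One way to enforce this import rule in a plugin test is to scan the plugin source for imports that reach into private PolyzyMD modules. The sketch below is not part of PolyzyMD. The helper name `find_private_polyzymd_imports` and the imported names `runner` and `helper` are hypothetical. It assumes that any absolute import under `polyzymd.analyses` with an underscore-prefixed component is private, and it allows relative imports from the plugin's own package.

```python
import ast


def find_private_polyzymd_imports(source: str) -> list[str]:
    """Return dotted names of absolute imports that touch private modules."""
    offenders: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            candidates = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for dotted in candidates:
            parts = dotted.split(".")
            is_analyses = parts[:2] == ["polyzymd", "analyses"]
            if is_analyses and any(part.startswith("_") for part in parts[2:]):
                offenders.append(dotted)
    return offenders


# Hypothetical plugin source: two allowed facades, two private imports,
# one relative import from the plugin's own package, one shared utility.
EXAMPLE_SOURCE = """
from polyzymd.analyses.base import Analysis, PlotContext
from polyzymd.analyses.mda import MDAAnalysisJob
from polyzymd.analyses._framework import runner
from polyzymd.analyses.rmsf._mda import helper
from ._mda import build_solvent_shell_jobs
import polyzymd.analyses.shared
"""

offenders = find_private_polyzymd_imports(EXAMPLE_SOURCE)
```

In a real test, the source string would come from reading each `.py` file in the plugin package, and the test would assert that the returned list is empty.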

## Target package layout

For a plugin named `solvent_shell`, convert the single module into a directory
like this:

```text
src/polyzymd/analyses/solvent_shell/
├── __init__.py
├── _mda.py
├── _plotters.py
├── _models.py
└── _formatters.py
```

Create only the files that have a current job. If your plugin has no custom
formatting yet, skip `_formatters.py`. If plain artifact payload dictionaries
are enough, skip `_models.py`.

| File | Responsibility | Keep out |
| --- | --- | --- |
| `__init__.py` | Public plugin class, settings model, class variables, lifecycle wiring, and small orchestration methods. | Long MDAnalysis loops, large plot functions, bulky formatters. |
| `_mda.py` | MDAnalysis job functions, `AnalysisBase`-compatible workers, collectors, sidecar writing, and helpers that translate runtime results into `ReplicateArtifact` objects. | Public plugin registration, CLI formatting, plot rendering. |
| `_plotters.py` | Artifact-only plot helpers that load `ConditionArtifact`, `ComparisonArtifact`, and registered sidecars, then write figures. | Trajectory loading, MDAnalysis job execution, recomputation. |
| `_models.py` | Pydantic domain schemas for validating nested artifact payload entries or custom comparison outputs. | Alternate aggregate cache loaders, raw MDAnalysis `Results` objects, or large frame-by-frame arrays that belong in sidecars. |
| `_formatters.py` | Substantial text/table formatting used by `format()`. | Scientific computation, artifact writes, plotting side effects. |

## Keep `__init__.py` as the public facade

After the split, `__init__.py` should still be the place where discovery finds
the plugin class. Keep the public class readable and import private helpers from
the same package with relative imports.

```python
from typing import ClassVar

from pydantic import BaseModel, Field

from polyzymd.analyses.base import Analysis, PlotContext
from polyzymd.analyses.mda import MDACollectorContext, MDAReplicateJobContext

from ._formatters import format_solvent_shell
from ._mda import SolventShellCollector, build_solvent_shell_jobs
from ._plotters import plot_solvent_shell_summary


class SolventShellSettings(BaseModel):
    """Settings for solvent-shell analysis."""

    selection: str = Field(default="protein and chainid A")


class SolventShellAnalysis(Analysis):
    """Analyze solvent-shell behavior around the protein."""

    name: ClassVar[str] = "solvent_shell"
    Settings: ClassVar[type[BaseModel]] = SolventShellSettings

    def build_mda_jobs(self, ctx: MDAReplicateJobContext):
        return build_solvent_shell_jobs(ctx)

    def build_mda_collector(self, ctx: MDACollectorContext):
        del ctx
        return SolventShellCollector()

    def plot(self, ctx: PlotContext):
        return plot_solvent_shell_summary(ctx)

    def format(self, result, output_format: str = "text") -> str:
        return format_solvent_shell(result, output_format=output_format)
```

This file wires together the lifecycle but does not hide heavy computation or
plotting details inside the class body.
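
A plugin test can guard the facade by checking that the package namespace defines exactly one public `Analysis` subclass. The following is a minimal sketch that does not reproduce PolyzyMD's discovery mechanism. The helper `public_plugin_classes` is hypothetical, and the `Analysis` class and `facade` module below are stand-ins so the example runs without PolyzyMD installed.

```python
import inspect
import types


def public_plugin_classes(module: types.ModuleType, base: type) -> list[type]:
    """Return public subclasses of ``base`` that are defined in ``module``."""
    return [
        obj
        for _name, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, base)
        and obj is not base
        and obj.__module__ == module.__name__  # ignore classes imported from elsewhere
        and not obj.__name__.startswith("_")
    ]


class Analysis:
    """Stand-in for polyzymd.analyses.base.Analysis (assumption for this sketch)."""


# Build a fake facade module that mimics solvent_shell/__init__.py.
facade = types.ModuleType("solvent_shell")
facade.Analysis = Analysis  # imported base class, must not be counted
facade.SolventShellAnalysis = type(
    "SolventShellAnalysis", (Analysis,), {"__module__": "solvent_shell"}
)
facade.SolventShellCollector = type(
    "SolventShellCollector", (), {"__module__": "solvent_shell._mda"}
)

found = public_plugin_classes(facade, Analysis)
```

In a real test, `facade` would be the imported plugin package and `base` would be the `Analysis` class from `polyzymd.analyses.base`.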

## Move trajectory work to `_mda.py`

Put trajectory-native details in `_mda.py`: job construction, function-adapter
jobs, `AnalysisBase`-compatible workers, collectors, and sidecar writes. Import
heavy dependencies lazily inside functions or methods that need them.

```python
from polyzymd.analyses.mda import (
    MDAAnalysisJob,
    MDACollectorContext,
    MDAJobResult,
    MDAReplicateJobContext,
    ReplicateArtifact,
    frame_selection_payload,
)


def calculate_shell_counts(
    universe,
    *,
    selection: str,
    start=None,
    stop=None,
    step=None,
    frames=None,
    **_frame_kwargs,
):
    """Calculate compact per-replicate values for one trajectory."""

    import numpy as np

    atoms = universe.select_atoms(selection)
    values: list[float] = []
    iterator = (
        universe.trajectory[frames]
        if frames is not None
        else universe.trajectory[start:stop:step]
    )

    for _ts in iterator:
        # Placeholder metric: record the number of selected atoms per frame.
        values.append(float(len(atoms)))

    return {"mean_count": float(np.mean(values)), "n_frames": len(values)}


def build_solvent_shell_jobs(ctx: MDAReplicateJobContext) -> list[MDAAnalysisJob]:
    settings = ctx.settings
    return [
        MDAAnalysisJob.from_function(
            name="solvent_shell_counts",
            function=calculate_shell_counts,
            universe=ctx.universe,
            frame_selection=ctx.frame_selection,
            universe_policy=ctx.universe_policy,
            function_kwargs={"selection": settings.selection},
        )
    ]


class SolventShellCollector:
    """Collect one completed job into a replicate artifact."""

    def __call__(
        self,
        ctx: MDACollectorContext,
        completed_jobs: list[MDAJobResult],
    ) -> ReplicateArtifact:
        # build_solvent_shell_jobs() creates exactly one job for this plugin.
        job = completed_jobs[0]
        mean_count = float(job.results["mean_count"])
        metadata = {"result_kind": "solvent_shell_replicate"}
        if ctx.settings_fingerprint is not None:
            metadata["settings_fingerprint"] = ctx.settings_fingerprint

        return ReplicateArtifact(
            analysis_name=ctx.analysis_name,
            condition_label=ctx.condition_label,
            replicate=ctx.replicate,
            payload={"metrics": {"mean_shell_count": mean_count}},
            provenance={"frame_selection": frame_selection_payload(ctx.frame_selection)},
            metadata=metadata,
            warnings=list(ctx.warnings),
        )
```

The collector returns a durable artifact. It should not serialize raw MDAnalysis
`Results` objects. If the job produces arrays or event tables, write registered
sidecars as shown in {doc}`sidecars`.

Function-adapter workers receive frame-selection keyword arguments from
PolyzyMD's `FrameSelection`. They must respect both explicit `frames` selectors
and `start`/`stop`/`step` selectors so equilibration cuts, analysis windows, and
strides are honored consistently.

## Move artifact-only plotting to `_plotters.py`

Plot helpers should read cached artifacts and registered sidecars only. They
should not load trajectories, create `MDAAnalysisJob` objects, or rerun compute
work.

```python
from pathlib import Path

from polyzymd.analyses.base import PlotContext
from polyzymd.analyses.mda import ArtifactStore


def plot_solvent_shell_summary(ctx: PlotContext) -> list[Path]:
    """Plot from cached condition artifacts and sidecars."""

    import matplotlib.pyplot as plt

    output_path = ctx.output_dir / "solvent_shell_summary.png"
    labels: list[str] = []
    values: list[float] = []

    for condition in ctx.conditions:
        analysis_dir = ctx.analysis_dirs.get(condition.label)
        if analysis_dir is None:
            continue
        artifact = ArtifactStore(analysis_dir).read_condition_result()
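        # Condition-level metrics are read as aggregated summaries, so the
        # plotted value is the "mean" entry rather than a bare scalar.
        metric = artifact.payload.get("metrics", {}).get("mean_shell_count")
        if metric is None:
            continue
        labels.append(condition.label)
        values.append(float(metric["mean"]))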

    if not values:
        return []

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_ylabel("Mean shell count")
    fig.tight_layout()
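    # Create the output directory defensively in case it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(output_path)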
    plt.close(fig)
    return [output_path]
```

Keep all heavy plotting dependencies inside the plotting function. This preserves
fast imports for users who only need configuration, discovery, or API docs.
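
A static check can catch heavy imports that slip back to module level during a refactor. This is a sketch under stated assumptions: the helper `eager_heavy_imports` is hypothetical, the set of "heavy" modules is an illustrative choice, and only direct module-level statements are inspected (imports nested in module-level `if` or `try` blocks are not detected).

```python
import ast

# Assumed list of modules that should only be imported inside functions.
HEAVY_MODULES = frozenset({"MDAnalysis", "matplotlib", "numpy"})


def eager_heavy_imports(source: str, heavy: frozenset[str] = HEAVY_MODULES) -> list[str]:
    """Return heavy top-level packages imported at module scope in ``source``."""
    found: set[str] = set()
    for node in ast.parse(source).body:  # module-level statements only
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in heavy:
                found.add(root)
    return sorted(found)


LAZY_SOURCE = """
from pathlib import Path

def plot(ctx):
    import matplotlib.pyplot as plt
    return plt
"""

EAGER_SOURCE = """
import matplotlib.pyplot as plt
from numpy import mean
"""

lazy_result = eager_heavy_imports(LAZY_SOURCE)
eager_result = eager_heavy_imports(EAGER_SOURCE)
```

Running this over `_plotters.py` and `_mda.py` in a test gives an early warning before import time regresses for users who never plot.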

## Add `_models.py` only for domain schemas

Use `_models.py` when the plugin has a custom Pydantic output model or typed
helpers that make aggregation and comparison clearer. Do not add a result model
only to wrap one scalar already stored under `payload["metrics"]`.

Good uses for `_models.py` include:

- typed custom comparison outputs with several tables or ranked entries, when
  the default comparison artifact is not enough;
- aggregate summaries that need schema validation beyond artifact payloads; and
- small typed models that point to sidecars by reference.

Large arrays, per-frame matrices, and event streams should remain in registered
sidecars, with `_models.py` storing only validated summaries or sidecar
references.
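
As an illustration of a summary model that points to a sidecar instead of embedding the array, a `_models.py` module might look like the sketch below. All class names, field names, and the sidecar path are hypothetical. The example assumes Pydantic v2, consistent with the `model_dump_json` call used elsewhere on this page.

```python
from pydantic import BaseModel, Field, ValidationError


class SidecarRef(BaseModel):
    """Hypothetical reference to a registered sidecar, not the array itself."""

    name: str
    relative_path: str
    n_rows: int = Field(ge=0)


class ShellCountSummary(BaseModel):
    """Hypothetical validated summary stored in an artifact payload entry."""

    mean_shell_count: float = Field(ge=0.0)
    n_frames: int = Field(gt=0)
    per_frame_counts: SidecarRef


def is_valid_summary(entry: dict) -> bool:
    """Return True when ``entry`` satisfies the ShellCountSummary schema."""
    try:
        ShellCountSummary.model_validate(entry)
    except ValidationError:
        return False
    return True


# Example payload entry: compact numbers plus a pointer to the large array.
entry = {
    "mean_shell_count": 42.5,
    "n_frames": 500,
    "per_frame_counts": {
        "name": "shell_counts",
        "relative_path": "sidecars/shell_counts.npy",
        "n_rows": 500,
    },
}
summary = ShellCountSummary.model_validate(entry)
```

The model holds only values that are cheap to serialize and validate. The per-frame data stays in the sidecar and is loaded on demand by plotting or comparison helpers.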

## Add `_formatters.py` only for substantial formatting

Use `_formatters.py` when `format()` has enough branches or table-building logic
to obscure the `Analysis` subclass. Keep `format()` in `__init__.py` as a small
delegation method and put the formatting implementation in `_formatters.py`.

```python
def format_solvent_shell(result, *, output_format: str = "text") -> str:
    """Format solvent-shell comparison output for the CLI."""

    if output_format == "json":
        return result.model_dump_json(indent=2)
    return "Solvent-shell comparison complete."
```

Do not compute new scientific results in a formatter. It should only render data
that aggregation or comparison already produced.

## Migrate in small steps

1. **Start from a passing single-file plugin.** Run the focused plugin tests
   before the split so you know whether a later failure came from the refactor.
2. **Create the package directory.** Replace `solvent_shell.py` with
   `solvent_shell/__init__.py` and keep the public `Analysis` subclass name,
   `name`, and `Settings` unchanged.
3. **Move trajectory helpers first.** Put job functions, workers, collectors, and
   sidecar writes in `_mda.py`. Keep public imports from
   `polyzymd.analyses.base`, `polyzymd.analyses.mda`, and documented
   `polyzymd.analyses.shared` utilities.
4. **Move plot helpers next.** Put plot data loading and rendering in
   `_plotters.py`. Verify plots read artifacts and sidecars only.
5. **Move structured models only if needed.** Add `_models.py` when the plugin
   has a custom result contract. Otherwise, keep relying on canonical artifacts.
6. **Move formatting only if needed.** Add `_formatters.py` when CLI rendering is
   substantial enough to deserve its own file.
7. **Run focused tests and docs checks.** Use the plugin test file and the docs
   build before opening a pull request.

Example validation commands:

```bash
pixi run -e build pytest tests/analyses/plugins/test_solvent_shell.py -v
pixi run -e build make -C docs clean html
```
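
Step 2 of the migration is a plain file move. The sketch below shows the filesystem change with a hypothetical helper, `module_to_package`, and runs it against a temporary directory. In a Git checkout, `git mv` is usually preferable because it keeps the file history easy to follow.

```python
import tempfile
from pathlib import Path


def module_to_package(module_path: Path) -> Path:
    """Move ``name.py`` to ``name/__init__.py`` and return the new path."""
    if module_path.suffix != ".py" or not module_path.is_file():
        raise ValueError(f"Not a Python module file: {module_path}")
    package_dir = module_path.with_suffix("")
    if package_dir.exists():
        raise FileExistsError(f"Package directory already exists: {package_dir}")
    package_dir.mkdir()
    init_path = package_dir / "__init__.py"
    module_path.rename(init_path)
    return init_path


# Demonstrate the move on a throwaway file instead of a real plugin.
with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "solvent_shell.py"
    module.write_text("name = 'solvent_shell'\n")
    init_path = module_to_package(module)
    package_name = init_path.parent.name
    content_preserved = init_path.read_text() == "name = 'solvent_shell'\n"
    old_module_removed = not module.exists()
```

After the move, the import path `polyzymd.analyses.solvent_shell` is unchanged, so the plugin tests from step 1 should still pass before any helper is extracted.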

## Migration checklist

Before you consider the package split complete, check that:

- `__init__.py` still exposes exactly one public plugin class for discovery.
- The plugin's `name` and `Settings` are unchanged unless you intentionally made
  a user-facing configuration change.
- All imports from PolyzyMD use public facades or documented shared utilities.
- No contributor code imports from another plugin's private helper modules.
- Heavy dependencies such as MDAnalysis, matplotlib, and NumPy (for array-heavy
  routines) are imported lazily inside the functions that need them, when
  practical.
- Collectors return `ReplicateArtifact` objects and write large arrays or tables
  as registered sidecars.
- Aggregation and comparison consume artifacts or validated structured outputs,
  not raw runtime containers.
- `plot()` delegates to helpers that load cached artifacts and sidecars only.
- `_models.py` and `_formatters.py` exist only if they have clear current
  responsibilities.
- Focused plugin tests pass after the file move.

## Success state

You have converted a plugin into an advanced package when the public plugin class
is easier to read, private helpers have focused responsibilities, imports stay on
public PolyzyMD facades, heavy dependencies remain lazy, and plotting is still
artifact-only.