Analysis Shared Utilities

This reference page documents contributor-facing utilities in polyzymd.analyses.shared. These modules provide reusable building blocks for analysis plugins; framework internals and plugin-private helpers are documented with their owning packages.

The package root re-exports common helpers for convenience. Import specialized selectors, grouping classes, and module-specific helpers from their submodules.

Trajectory loading and windows

Use these modules to locate trajectories, parse time values, and resolve the trajectory window passed into MDAnalysis job lifecycles.

Trajectory loading utilities for PolyzyMD analysis.

This module provides config-aware trajectory loading that understands PolyzyMD’s directory structure and daisy-chain continuation patterns. File discovery is delegated to the active simulation engine so that both OpenMM and GROMACS directory layouts are handled transparently.

Key Features

Config-based path resolution (config.yaml is the single source of truth)
Engine-aware file discovery (OpenMM daisy-chain, GROMACS flat layout)
Automatic detection of daisy-chain trajectory segments
Support for both scratch and projects directories
Lazy loading and memory-efficient iteration

class polyzymd.analyses.shared.loader.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1, topology_format=None, trajectory_format=None, warnings=<factory>)[source]

Bases: object

Information about discovered trajectory files.

topology_file

Path to topology file (PDB)

Type:: Path

trajectory_files

List of trajectory files (DCD) in order

Type:: list[Path]

n_segments

Number of daisy-chain segments

Type:: int

working_directory

Base working directory for this replicate

Type:: Path

replicate

Replicate number

Type:: int

topology_format

Engine-reported topology format, when available.

Type:: str or None, optional

trajectory_format

Engine-reported trajectory format, when available.

Type:: str or None, optional

warnings

Discovery warnings that should be preserved in downstream provenance.

Type:: list[str]

topology_file: Path

trajectory_files: list[Path]

n_segments: int = 0

working_directory: Path

replicate: int = 1

topology_format: str | None = None

trajectory_format: str | None = None

warnings: list[str]

property n_trajectory_files: int: Number of trajectory files found.

validate()[source]

Validate that all files exist.

__init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1, topology_format=None, trajectory_format=None, warnings=<factory>)

class polyzymd.analyses.shared.loader.TrajectoryLoader(config, engine_override=None)[source]

Bases: object

Config-aware trajectory loader for PolyzyMD simulations.

This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:

Daisy-chain continuation segments (OpenMM)
Flat production directories (GROMACS)
Scratch vs projects directory resolution
Multiple replicates

File discovery is delegated to the simulation engine resolved from the config’s engine field. The engine is created lazily on the first call that needs it, so construction remains cheap. Engine resolution errors propagate unless an explicit engine_override is supplied.

Parameters:

config (SimulationConfig) – PolyzyMD simulation configuration.
engine_override (str or None, optional) – Force a specific engine name ("openmm" or "gromacs") instead of reading config.engine.

Examples

>>> from polyzymd.config import load_config
>>> config = load_config("config.yaml")
>>> loader = TrajectoryLoader(config)
>>>
>>> # Load single replicate
>>> u = loader.load_universe(replicate=1)
>>> print(f"Loaded {len(u.trajectory)} frames")
>>>
>>> # Get trajectory info without loading
>>> info = loader.get_trajectory_info(replicate=1)
>>> print(f"Found {info.n_segments} segments")
>>>
>>> # Load multiple replicates
>>> for rep in range(1, 6):
...     u = loader.load_universe(replicate=rep)
...     # ... analyze
>>>
>>> # Explicit engine override for GROMACS directories
>>> loader = TrajectoryLoader(config, engine_override="gromacs")

Notes

Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).

__init__(config, engine_override=None)[source]

get_trajectory_info(replicate)[source]

Get trajectory file information for a replicate.

Parameters:: replicate (int) – Replicate number (1-indexed)
Returns:: Information about discovered trajectory files
Return type:: TrajectoryInfo
Raises:: FileNotFoundError – If working directory or required files don’t exist

load_universe(replicate, cache=True)[source]

Load MDAnalysis Universe for a replicate.

Parameters:

replicate (int) – Replicate number (1-indexed)
cache (bool, optional) – If True (default), cache the Universe for reuse

Returns:

MDAnalysis Universe with trajectory loaded

Return type:

Universe

Notes

For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.

iter_replicates(replicates)[source]

Iterate over multiple replicates.

Parameters:: replicates (sequence of int) – Replicate numbers to load
Yields:: tuple of (int, Universe) – Replicate number and loaded Universe

Examples

>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]):
...     rmsf = compute_rmsf(u)
...     results[rep] = rmsf

get_frame_times(replicate, unit='ns')[source]

Get time values for each frame.

Parameters:

replicate (int) – Replicate number
unit (str, optional) – Time unit for output. Options: “ps”, “ns”. Default is “ns”.

Returns:

Array of time values for each frame

Return type:

NDArray[np.float64]

get_timestep(replicate, unit='ps')[source]

Get the trajectory timestep (time between frames).

Parameters:

replicate (int) – Replicate number
unit (str, optional) – Time unit. Options: “ps”, “ns”. Default is “ps”.

Returns:

Time between consecutive frames

Return type:

float

get_first_frame_time(replicate, unit='ps')[source]

Return the first loaded frame timestamp when available.

MDAnalysis reports trajectory times in picoseconds. This method probes cached Universe metadata without changing the caller-visible current frame when the reader exposes a restorable frame index.

Parameters:

replicate (int) – Replicate number.
unit (str, optional) – Time unit for output. Options are "ps" and "ns", by default "ps".

Returns:

Finite first-frame timestamp in the requested unit, or None when the trajectory does not expose a usable timestamp.

Return type:

float | None

Raises:

ValueError – Raised when unit is not "ps" or "ns".

clear_cache()[source]

Clear the Universe cache to free memory.

find_topology(working_dir)[source]

Find topology file in working directory.

Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS, solvated_system.pdb preference for OpenMM).

This method is used by several plugins that pass an explicit working_dir unrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to 1.

Parameters:: working_dir (Path) – Directory to search for topology files.
Returns:: Path to the topology file.
Return type:: Path
Raises:: FileNotFoundError – If no topology file is found.
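
The replicate inference described above (a run_<N> directory name, falling back to 1) could be sketched as follows. This is an illustrative sketch only: infer_replicate_sketch is a hypothetical name, and the exact matching rule used by PolyzyMD is an assumption.

```python
import re
from pathlib import Path


def infer_replicate_sketch(working_dir: Path, default: int = 1) -> int:
    """Return N for a directory named run_<N>, otherwise the default.

    Hypothetical helper: assumes the whole directory name must match run_<N>.
    """
    match = re.fullmatch(r"run_(\d+)", working_dir.name)
    return int(match.group(1)) if match else default
```

Only the final path component is inspected, so a parent directory named run_<N> would not affect the result in this sketch.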

polyzymd.analyses.shared.loader.parse_time_string(time_str)[source]

Parse a time string with units into value and unit.

Parameters:: time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.
Returns:: Numeric value and unit string
Return type:: tuple of (float, str)

Examples

>>> parse_time_string("100ns")
(100.0, 'ns')
>>> parse_time_string("5000 ps")
(5000.0, 'ps')
>>> parse_time_string("100")  # Default to ns
(100.0, 'ns')
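
One way such a parser could work is with a regular expression, as in the minimal sketch below. The name parse_time_sketch and the pattern are assumptions made for illustration; only the default unit of ns is taken from the examples above.

```python
import re
from typing import Tuple

# Hypothetical pattern: a number, optional whitespace, then an optional unit.
_TIME_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]*)\s*$")


def parse_time_sketch(time_str: str, default_unit: str = "ns") -> Tuple[float, str]:
    """Split a string such as "100ns" or "5000 ps" into (value, unit)."""
    match = _TIME_RE.match(time_str)
    if match is None:
        raise ValueError(f"Cannot parse time string: {time_str!r}")
    value = float(match.group(1))
    # An empty unit falls back to the default, as in the last example above.
    unit = match.group(2).lower() or default_unit
    return value, unit
```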

polyzymd.analyses.shared.loader.convert_time(value, from_unit, to_unit)[source]

Convert time between units.

Parameters:

value (float) – Time value
from_unit (str) – Source unit (“fs”, “ps”, “ns”)
to_unit (str) – Target unit (“fs”, “ps”, “ns”)

Returns:

Converted time value

Return type:

float

polyzymd.analyses.shared.loader.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]

Convert time to frame index.

Parameters:

time (float) – Time value
time_unit (str) – Unit of time value
timestep (float) – Time between frames
timestep_unit (str) – Unit of timestep (default: “ps”)

Returns:

Frame index (0-indexed)

Return type:

int
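
The unit conversion and the time-to-frame mapping can be illustrated with a short sketch. The function names are hypothetical, and truncation toward zero is an assumed rounding rule; the rounding used by time_to_frame is not stated in this reference.

```python
# Conversion factors to picoseconds for the three documented units.
_TO_PS = {"fs": 1e-3, "ps": 1.0, "ns": 1e3}


def convert_time_sketch(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between fs, ps, and ns by going through picoseconds."""
    if from_unit not in _TO_PS or to_unit not in _TO_PS:
        raise ValueError(f"Unsupported time unit: {from_unit!r} or {to_unit!r}")
    return value * _TO_PS[from_unit] / _TO_PS[to_unit]


def time_to_frame_sketch(
    time: float, time_unit: str, timestep: float, timestep_unit: str = "ps"
) -> int:
    """Map a time to a 0-indexed frame (assumption: truncate toward zero)."""
    time_ps = convert_time_sketch(time, time_unit, "ps")
    timestep_ps = convert_time_sketch(timestep, timestep_unit, "ps")
    if timestep_ps <= 0:
        raise ValueError("timestep must be positive")
    return int(time_ps / timestep_ps)
```

With a 10 ps timestep, 10 ns maps to frame 1000 in this sketch.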

Trajectory window helpers for trajectory-backed analyses.

This module centralizes the frame-window logic shared by analysis plugins that need to combine equilibration skipping with MDAnalysis run() slice arguments. The helpers return a validated window that can be passed directly to trajectory-native runners, so PolyzyMD does not have to reimplement the frame loop.

class polyzymd.analyses.shared.window.TrajectoryWindow(start, stop, step, equilibration_start, n_frames_total, n_frames_selected, timestep_ps, equilibration_ps, equilibration=None, first_frame_time_ps=None, selected_start_time_ps=None, equilibration_time_reference='loaded_frame_zero', warning_message=None)[source]

Bases: object

Validated frame window for a trajectory-backed analysis.

Parameters:

start (int) – Inclusive start frame for the analysis run.
stop (int) – Exclusive stop frame for the analysis run.
step (int) – Frame stride passed to runner.run(step=...).
equilibration_start (int) – Start frame implied by the equilibration time alone.
n_frames_total (int) – Total number of frames in the trajectory.
n_frames_selected (int) – Number of frames selected by start, stop, and step.
timestep_ps (float) – Trajectory timestep in picoseconds.
equilibration_ps (float) – Equilibration time converted to picoseconds.
equilibration (str | None, optional) – Original equilibration time string used to resolve the window.
first_frame_time_ps (float | None, optional) – Absolute MDAnalysis timestamp of the first loaded frame in picoseconds, when available.
selected_start_time_ps (float | None, optional) – Timestamp of the selected start frame in the active time reference.
equilibration_time_reference (str, optional) – Time reference used to interpret equilibration. "trajectory_timestamp" means absolute MDAnalysis timestamps were available; "loaded_frame_zero" means the fallback origin was used, with loaded frame 0 treated as time zero.
warning_message (str | None) – Non-fatal equilibration warning generated during validation.

start: int

stop: int

step: int

equilibration_start: int

n_frames_total: int

n_frames_selected: int

timestep_ps: float

equilibration_ps: float

equilibration: str | None = None

first_frame_time_ps: float | None = None

selected_start_time_ps: float | None = None

equilibration_time_reference: str = 'loaded_frame_zero'

warning_message: str | None = None

run_kwargs()[source]

Return keyword arguments for MDAnalysis runner run().

Returns:: start, stop, and step values for run().
Return type:: dict[str, int]

__init__(start, stop, step, equilibration_start, n_frames_total, n_frames_selected, timestep_ps, equilibration_ps, equilibration=None, first_frame_time_ps=None, selected_start_time_ps=None, equilibration_time_reference='loaded_frame_zero', warning_message=None)

polyzymd.analyses.shared.window.resolve_replicate_trajectory_window(*, loader, replicate, equilibration, n_frames_total, start=None, stop=None, step=1, min_frames=1, timestep_ps=None)[source]

Resolve a validated window using loader trajectory timing metadata.

Parameters:

loader (TrajectoryLoader) – Loader for the replicate being analyzed.
replicate (int) – Replicate number.
equilibration (str) – Equilibration time string such as "10ns".
n_frames_total (int) – Total number of frames in the trajectory.
start (int | None, optional) – Absolute start frame for the analysis window. When None, the equilibration-resolved start frame is used.
stop (int | None, optional) – Absolute exclusive stop frame. When None, the full remaining trajectory is used.
step (int, optional) – Frame stride, by default 1.
min_frames (int, optional) – Minimum required number of selected frames, by default 1.
timestep_ps (float | None, optional) – Explicit timestep override in picoseconds. When None, the loader timestep is used.

Returns:

Validated trajectory window with materialized run() arguments.

Return type:

TrajectoryWindow

polyzymd.analyses.shared.window.resolve_trajectory_window(*, equilibration, n_frames_total, timestep_ps, start=None, stop=None, step=1, min_frames=1, first_frame_time_ps=None)[source]

Resolve and validate a trajectory frame window.

When the first loaded frame has a finite MDAnalysis timestamp, equilibration is interpreted as an absolute trajectory time. The start frame is the first loaded frame whose timestamp is greater than or equal to the equilibration time. When timestamp metadata is unavailable, equilibration is measured from loaded frame 0, which is treated as time zero.

Parameters:

equilibration (str) – Equilibration time string such as "10ns".
n_frames_total (int) – Total number of frames in the trajectory.
timestep_ps (float) – Time between consecutive frames in picoseconds.
start (int | None, optional) – Absolute start frame for the analysis window. When None, the equilibration-resolved start frame is used.
stop (int | None, optional) – Absolute exclusive stop frame. When None, the trajectory end is used.
step (int, optional) – Frame stride, by default 1.
min_frames (int, optional) – Minimum required number of selected frames, by default 1.
first_frame_time_ps (float | None, optional) – Absolute MDAnalysis timestamp of loaded frame 0 in picoseconds. Non-finite values are ignored; equilibration is then measured relative to loaded frame 0.

Returns:

Validated frame window.

Return type:

TrajectoryWindow

Raises:

ValueError – Raised when the timestep, equilibration, or window arguments are inconsistent with the trajectory.
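
The two time references can be compared in a small standalone sketch. WindowSketch and resolve_window_sketch are hypothetical names, the equilibration time is passed in picoseconds rather than as a string, and the explicit start and stop overrides are omitted; none of this is the PolyzyMD implementation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WindowSketch:
    start: int
    stop: int
    step: int
    n_frames_selected: int
    time_reference: str

    def run_kwargs(self) -> dict:
        # Same start/stop/step mapping that run_kwargs() is documented to return.
        return {"start": self.start, "stop": self.stop, "step": self.step}


def resolve_window_sketch(
    equilibration_ps: float,
    n_frames_total: int,
    timestep_ps: float,
    first_frame_time_ps: Optional[float] = None,
    step: int = 1,
    min_frames: int = 1,
) -> WindowSketch:
    if timestep_ps <= 0 or step < 1:
        raise ValueError("timestep_ps must be positive and step must be >= 1")
    if first_frame_time_ps is not None and math.isfinite(first_frame_time_ps):
        # Absolute timestamps: first frame whose time is >= equilibration.
        offset_ps = equilibration_ps - first_frame_time_ps
        reference = "trajectory_timestamp"
    else:
        # No usable timestamp: treat loaded frame 0 as time zero.
        offset_ps = equilibration_ps
        reference = "loaded_frame_zero"
    start = max(0, math.ceil(offset_ps / timestep_ps))
    stop = n_frames_total
    n_selected = len(range(start, stop, step))
    if n_selected < min_frames:
        raise ValueError(f"Only {n_selected} frames selected; need {min_frames}")
    return WindowSketch(start, stop, step, n_selected, reference)
```

For a trajectory whose first loaded frame is stamped at 5000 ps, a 10 ns equilibration skips 500 frames at a 10 ps timestep, whereas the frame-zero fallback skips 1000.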

Alignment and representative frames

Alignment helpers in polyzymd.analyses.shared.alignment standardize reference-mode handling. Centroid helpers support plugins that need representative frames or structures.

Representative frame finding utilities.

This module provides functions to find representative frames from MD trajectories using different methods. The representative frame is commonly used as a reference for trajectory alignment before RMSF calculations.

The module documents three ways to choose the reference:

centroid (aligned-mean representative frame): Finds the frame closest to the aligned mean structure. Uses all protein atoms by default to capture side-chain conformations. Best for finding a representative equilibrium conformation while removing rigid-body translation and rotation effects.

average: Aligns to an average structure computed from all frames. The average structure is synthetic and may have unphysical geometry (e.g., distorted bond lengths or angles). Best for a purely mathematical measure of thermal fluctuations around the mean.

frame: Uses a specific, user-specified frame as the reference. Best for analyzing fluctuations relative to a known functional state, such as a catalytically competent conformation.

Time-series statistics and convergence

These modules provide statistical summaries, autocorrelation-aware estimates, inferential tests, and convergence diagnostics used by built-in and contributor plugins.

Autocorrelation analysis for independent sampling.

MD trajectories are highly correlated in time: consecutive frames are not independent samples. This module provides tools to:

Compute the autocorrelation function (ACF) of an observable
Estimate the correlation time (τ) from the ACF
Compute statistical inefficiency (g) for proper uncertainty quantification
Select independent frames based on τ for proper statistics

Key Concepts

Autocorrelation function (ACF): Measures how correlated a signal is with itself at different time lags. ACF(0) = 1, and ACF decays toward 0.
Correlation time (τ): Characteristic time for decorrelation. Frames separated by > 2τ are approximately independent.
Statistical inefficiency (g): Factor by which variance is inflated due to correlation. g = 1 + 2*Σ C(t)*(1-t/N). N_eff = N/g.
Independent samples: For proper SEM calculation, we need N_eff independent samples, not N_frames correlated observations.

Methods for τ estimation

First zero crossing: τ is lag where ACF first crosses zero
Exponential fit: Fit ACF = exp(-t/τ) and extract τ
Integration: τ = ∫ACF(t)dt from 0 to first zero (or cutoff)

Statistical Validity

The number of effective independent samples (N_eff) is computed as:: N_eff = N / g = N / (1 + 2*Σ C(t)*(1-t/N))

This matches the algorithm from Chodera et al. (2007) with the finite-size correction factor (1-t/N). When N_eff < 10, statistical estimates (mean, SEM) may be unreliable, and users should be warned per LiveCoMS best practices (Grossfield et al., 2018).

For multiple timeseries of different lengths (e.g., replicates), use statistical_inefficiency_multiple() which correctly handles the averaging.

References

Flyvbjerg & Petersen (1989) J. Chem. Phys. 91:461 (block averaging)
Chodera et al. (2007) J. Chem. Theory Comput. 3:26 (statistical inefficiency)
Grossfield et al. (2018) LiveCoMS 1:5067 (uncertainty quantification)

class polyzymd.analyses.shared.autocorrelation.CorrelationTimeMethod(value)[source]

Bases: str, Enum

Method for estimating correlation time from ACF.

FIRST_ZERO = 'first_zero'

EXPONENTIAL_FIT = 'exponential_fit'

INTEGRATION = 'integration'

class polyzymd.analyses.shared.autocorrelation.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]

Bases: object

Result of autocorrelation function computation.

lags

Time lags in the same units as timestep

Type:: NDArray[np.float64]

acf

Autocorrelation values (normalized, ACF[0] = 1)

Type:: NDArray[np.float64]

timestep

Time between frames

Type:: float

timestep_unit

Unit of timestep (e.g., “ps”, “ns”)

Type:: str

n_samples

Number of samples in the original timeseries

Type:: int

lags: NDArray[np.float64]

acf: NDArray[np.float64]

timestep: float

timestep_unit: str

n_samples: int

to_dict()[source]

Convert to dictionary for serialization.

__init__(lags, acf, timestep, timestep_unit, n_samples)

class polyzymd.analyses.shared.autocorrelation.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]

Bases: object

Result of correlation time estimation.

tau

Estimated correlation time

Type:: float

tau_unit

Unit of tau (same as timestep unit)

Type:: str

method

Method used for estimation

Type:: str

n_independent

Estimated number of independent samples in trajectory

Type:: int

statistical_inefficiency

g = 1 + 2*tau/dt, factor by which variance is inflated

Type:: float

warning

Warning message if statistics may be unreliable (e.g., N_ind < 10)

Type:: str | None

tau: float

tau_unit: str

method: str

n_independent: int

statistical_inefficiency: float

warning: str | None = None

property is_reliable: bool: Return True if statistics are likely reliable (N_ind >= 10).

to_dict()[source]

Convert to dictionary for serialization.

__init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)

polyzymd.analyses.shared.autocorrelation.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]

Compute autocorrelation function of a 1D timeseries.

Uses FFT-based computation for efficiency.

Parameters:

timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)
max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.
timestep (float, optional) – Time between frames. Default is 1.0.
timestep_unit (str, optional) – Unit of timestep. Default is “frames”.

Returns:

Container with lags, acf values, and metadata

Return type:

ACFResult

Examples

>>> # Compute ACF of RMSD timeseries
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])  # from MDAnalysis
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")

Notes

The ACF is normalized so that ACF[0] = 1.

For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²

For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.
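
The normalization and the degenerate case can be illustrated with a direct-summation sketch. Note that this is not the FFT-based computation the function is documented to use; acf_sketch, the epsilon value, and the division by (N - lag) are assumptions made for illustration.

```python
def acf_sketch(values, max_lag=None, eps=1e-12):
    """Normalized autocorrelation by direct summation, O(N * max_lag)."""
    n = len(values)
    if max_lag is None:
        max_lag = n // 4  # same default as documented for max_lag
    mean = sum(values) / n
    dev = [v - mean for v in values]
    var = sum(d * d for d in dev) / n
    if var < eps:
        # Degenerate case: constant series, ACF[0] = 1 and 0 at positive lags.
        return [1.0] + [0.0] * max_lag
    acf = []
    for lag in range(max_lag + 1):
        # Average of (x(t) - mean) * (x(t + lag) - mean) over valid pairs.
        cov = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / (n - lag)
        acf.append(cov / var)
    return acf
```

A strictly alternating series gives ACF values of 1, -1, 1 at lags 0, 1, 2, which is a quick sanity check of the sign convention.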

polyzymd.analyses.shared.autocorrelation.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]

Estimate correlation time from ACF or raw timeseries.

Parameters:

acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries
timestep (float, optional) – Time between frames (only used if passing raw timeseries)
timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)
method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ: "first_zero" (lag where the ACF first crosses zero), "exponential_fit" (fit ACF = exp(-t/τ)), or "integration" (τ = ∫ACF(t)dt; recommended, most robust).
n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.

Returns:

Contains tau, method used, n_independent, statistical_inefficiency

Return type:

CorrelationTimeResult

Examples

>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> tau_result = estimate_correlation_time(acf_result, method="integration")
>>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
>>> print(f"Independent samples: {tau_result.n_independent}")

Notes

The “integration” method is most robust for noisy ACFs. It computes:: τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt

Integration stops at first zero crossing to avoid noise contribution.
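
A minimal sketch of the truncated sum follows. integrate_tau_sketch is a hypothetical name, and starting the sum at lag 0 with a simple rectangle rule is an assumption; the quadrature actually used may differ.

```python
def integrate_tau_sketch(acf, timestep=1.0):
    """Sum ACF[i] * dt from lag 0, stopping at the first value <= 0."""
    tau = 0.0
    for value in acf:
        if value <= 0.0:
            break  # ignore everything after the first zero crossing
        tau += value * timestep
    return tau
```

Values after the first non-positive entry are ignored even if the ACF becomes positive again, which is what keeps noise in the tail out of the estimate.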

polyzymd.analyses.shared.autocorrelation.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]

Get frame indices for independent samples.

Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.

Parameters:

n_frames (int) – Total number of frames in trajectory
correlation_time (float) – Correlation time τ (in same units as timestep)
timestep (float, optional) – Time between frames. Default is 1.0.
start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed frames (PyMOL convention).

Returns:

Array of frame indices (0-indexed) that are approximately independent

Return type:

NDArray[np.int64]

Examples

>>> # Get independent frames for RMSF calculation
>>> tau_result = estimate_correlation_time(rmsd, timestep=10.0)
>>> indices = get_independent_indices(
...     n_frames=10000,
...     correlation_time=tau_result.tau,
...     timestep=10.0,
...     start_frame=1000,  # Skip first 1000 frames for equilibration
... )
>>> print(f"Using {len(indices)} independent frames")

Notes

Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.

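The spacing is set to 2*τ/timestep frames. For an exponentially decaying ACF, the residual correlation at a lag of 2τ is exp(-2) ≈ 0.14, so the selected frames are only approximately independent; a spacing of 3τ would be needed to bring the ACF below 0.05.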
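
The index selection can be sketched in a few lines. independent_indices_sketch is a hypothetical name, and rounding the stride up to a whole number of frames is an assumption.

```python
import math


def independent_indices_sketch(n_frames, correlation_time, timestep=1.0, start_frame=0):
    """Return 0-indexed frames spaced by at least 2 * tau."""
    # Round the stride up so the spacing is never shorter than 2 * tau.
    stride = max(1, math.ceil(2.0 * correlation_time / timestep))
    return list(range(start_frame, n_frames, stride))
```

With τ = 50 ps, a 10 ps timestep, and 1000 equilibration frames skipped, a 10000-frame trajectory yields 900 frames at a stride of 10.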

polyzymd.analyses.shared.autocorrelation.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]

Compute statistical inefficiency g directly from a timeseries.

The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:

Var(mean) = Var(x) * g / N

This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)

where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).

Parameters:

timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)
mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.
fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.

Returns:

Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.

Return type:

float

Examples

>>> # Binary contact timeseries
>>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...])
>>> g = statistical_inefficiency(contacts)
>>> n_eff = len(contacts) / g
>>> print(f"Effective samples: {n_eff:.1f}")

>>> # Continuous observable
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])
>>> g = statistical_inefficiency(rmsd)

Notes

This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.

For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26
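
The sum g = 1 + 2 * Σ C(t) * (1 - t/N) with the mintime guard can be written out directly. The sketch below uses plain summation rather than an FFT, and statistical_inefficiency_sketch is a hypothetical name; treat it as an illustration of the formula, not as the module's code.

```python
def statistical_inefficiency_sketch(values, mintime=3):
    """Estimate g = 1 + 2 * sum_t C(t) * (1 - t/N), truncated at C(t) <= 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return 1.0  # constant series (assumption: report no inflation)
    g = 1.0
    for t in range(1, n - 1):
        # Normalized autocorrelation at lag t.
        c = sum(dev[i] * dev[i + t] for i in range(n - t)) / ((n - t) * var)
        # Stop at the first non-positive C(t), but only after mintime lags.
        if c <= 0.0 and t > mintime:
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)  # g is bounded below by 1
```

A series that switches state once (50 zeros followed by 50 ones) gives g well above 1, while a strictly alternating series is clamped to 1.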

polyzymd.analyses.shared.autocorrelation.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]

Compute statistical inefficiency from multiple timeseries of different lengths.

This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.

Parameters:

timeseries_list (list[ArrayLike]) – List of 1D timeseries arrays (can have different lengths)
mintime (int) – Minimum number of lags before checking for zero crossing. Default is 3.

Returns:

Statistical inefficiency g (>= 1.0)

Return type:

float

Examples

>>> # Three replicates with different lengths
>>> ts1 = np.array([0, 1, 1, 0, 0, 1])  # 6 frames
>>> ts2 = np.array([1, 1, 0, 0, 0])      # 5 frames
>>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1])  # 7 frames
>>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])

Notes

This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.

The algorithm:

Compute global mean μ across all timeseries
For each lag t: sum the (x - μ) products across all timeseries where t < N_k, sum the sample counts across the same timeseries, and divide the first sum by the second to get C(t)
Sum with finite-size correction

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26
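
The pooled algorithm above could be sketched as follows. statistical_inefficiency_multiple_sketch is a hypothetical name, and using the longest series length in the finite-size correction is an assumption of this sketch.

```python
def statistical_inefficiency_multiple_sketch(series_list, mintime=3):
    """Pool several series of different lengths into one estimate of g."""
    lengths = [len(s) for s in series_list]
    n_total = sum(lengths)
    n_max = max(lengths)
    # Global mean and variance over every sample from every series.
    mean = sum(sum(s) for s in series_list) / n_total
    devs = [[v - mean for v in s] for s in series_list]
    var = sum(d * d for dev in devs for d in dev) / n_total
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n_max - 1):
        numerator = 0.0
        count = 0
        for dev in devs:
            if t >= len(dev):
                continue  # this series is too short to contribute at lag t
            numerator += sum(dev[i] * dev[i + t] for i in range(len(dev) - t))
            count += len(dev) - t
        c = numerator / (count * var)
        if c <= 0.0 and t > mintime:
            break
        # Assumption: finite-size correction uses the longest series length.
        g += 2.0 * c * (1.0 - t / n_max)
    return max(g, 1.0)
```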

polyzymd.analyses.shared.autocorrelation.n_effective(n_samples, g)[source]

Compute number of effective independent samples.

Parameters:

n_samples (int) – Total number of samples
g (float) – Statistical inefficiency

Returns:

Effective number of independent samples (N_eff = N / g)

Return type:

float

polyzymd.analyses.shared.autocorrelation.check_statistical_reliability(n_eff, threshold=10)[source]

Check if statistics are reliable based on effective sample count.

Parameters:

n_eff (float) – Number of effective independent samples
threshold (int) – Minimum recommended independent samples. Default is 10.

Returns:

is_reliable (bool) – True if n_eff >= threshold
warning (str | None) – Warning message if not reliable, None otherwise

Return type:

tuple[bool, str | None]

Examples

>>> g = statistical_inefficiency(contacts)
>>> n_eff = n_effective(len(contacts), g)
>>> is_ok, warning = check_statistical_reliability(n_eff)
>>> if not is_ok:
...     print(warning)

Statistical functions for replicate aggregation.

This module provides statistical utilities for combining results across multiple simulation replicates with proper error propagation.

Key design decisions:

All uncertainties are reported as the standard error of the mean (SEM)
SEM = std / sqrt(N), where N is the number of independent samples
Hierarchical aggregation preserves proper statistics at each level

Functions

compute_sem: Standard error of the mean for a 1D array
aggregate_per_residue_stats: Combine per-residue values across replicates
aggregate_region_stats: Combine region-averaged values across replicates
weighted_mean_with_sem: Weighted average with proper error propagation

class polyzymd.analyses.shared.statistics.StatResult(mean, sem, n_samples)[source]

Bases: object

Container for mean +/- SEM results.

mean

The mean value

Type:: float

sem

Standard error of the mean

Type:: float

n_samples

Number of samples used in computation

Type:: int

mean: float

sem: float

n_samples: int

to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(mean, sem, n_samples)

class polyzymd.analyses.shared.statistics.PerResidueStats(residue_ids, means, sems, n_replicates)[source]

Bases: object

Container for per-residue statistics across replicates.

residue_ids

Residue identifiers (1-indexed, following PyMOL convention)

Type:: NDArray[np.int64]

means

Mean value for each residue across replicates

Type:: NDArray[np.float64]

sems

SEM for each residue across replicates

Type:: NDArray[np.float64]

n_replicates

Number of replicates aggregated

Type:: int

residue_ids: NDArray[np.int64]

means: NDArray[np.float64]

sems: NDArray[np.float64]

n_replicates: int

to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(residue_ids, means, sems, n_replicates)

polyzymd.analyses.shared.statistics.compute_sem(values, ddof=1)[source]

Compute mean and standard error of the mean.

SEM = std / sqrt(N) where N is the number of samples.

Parameters:

values (array_like) – 1D array of values (e.g., one value per replicate)
ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).

Returns:

Container with mean, sem, and n_samples

Return type:

StatResult

Examples

>>> values = [2.5, 2.7, 2.6, 2.4, 2.8]  # RMSF from 5 replicates
>>> result = compute_sem(values)
>>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A")
RMSF = 2.60 +/- 0.07 A

Notes

For a single value, SEM is undefined (returns 0.0).
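
The calculation can be reproduced with the standard library alone. compute_sem_sketch is a hypothetical stand-in that returns a plain tuple instead of a StatResult.

```python
import math
import statistics


def compute_sem_sketch(values, ddof=1):
    """Return (mean, sem, n) with SEM = std / sqrt(N)."""
    n = len(values)
    mean = statistics.fmean(values)
    if n <= ddof:
        # Too few samples for a standard deviation; 0.0 as documented for one value.
        return mean, 0.0, n
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return mean, math.sqrt(variance) / math.sqrt(n), n
```

For the five replicate values in the example above, the sample standard deviation is about 0.158, so the SEM is 0.158 / sqrt(5) ≈ 0.07.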

polyzymd.analyses.shared.statistics.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]

Aggregate per-residue values across replicates.

For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.

Parameters:

per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.
residue_ids (array, optional) – 1-indexed residue identifiers, following PyMOL convention. If None, uses 1, 2, 3, …

Returns:

Container with residue_ids, means, sems, n_replicates

Return type:

PerResidueStats

Raises:

ValueError – If arrays have inconsistent lengths or no replicates provided
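
Per-residue aggregation amounts to a column-wise mean and SEM over replicates. The sketch below uses plain lists and a hypothetical name, aggregate_per_residue_sketch, and returns two lists rather than a PerResidueStats object.

```python
import math


def aggregate_per_residue_sketch(per_replicate_values):
    """Return per-residue (means, sems) across replicates."""
    if not per_replicate_values:
        raise ValueError("No replicates provided")
    n_res = len(per_replicate_values[0])
    if any(len(rep) != n_res for rep in per_replicate_values):
        raise ValueError("All replicates must have the same number of residues")
    n_rep = len(per_replicate_values)
    means, sems = [], []
    for column in zip(*per_replicate_values):  # one column per residue
        mean = sum(column) / n_rep
        if n_rep > 1:
            var = sum((v - mean) ** 2 for v in column) / (n_rep - 1)
            sem = math.sqrt(var / n_rep)
        else:
            sem = 0.0  # a single replicate has no spread to estimate
        means.append(mean)
        sems.append(sem)
    return means, sems
```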

polyzymd.analyses.shared.statistics.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]

Aggregate region-averaged values across replicates.

For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.

This implements hierarchical aggregation in two steps: first average within each replicate (over the selected residues), then compute the mean +/- SEM across the replicate means.

Parameters:

per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.
residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.

Returns:

Mean +/- SEM of region-averaged values across replicates

Return type:

StatResult
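
The two-level aggregation can be sketched with plain lists. aggregate_region_sketch is a hypothetical name and returns a (mean, sem, n) tuple rather than a StatResult.

```python
import math


def aggregate_region_sketch(per_replicate_values, residue_mask=None):
    """Average within each replicate, then take mean and SEM across replicates."""
    replicate_means = []
    for rep in per_replicate_values:
        if residue_mask is None:
            selected = list(rep)
        else:
            selected = [v for v, keep in zip(rep, residue_mask) if keep]
        replicate_means.append(sum(selected) / len(selected))
    n = len(replicate_means)
    mean = sum(replicate_means) / n
    if n < 2:
        return mean, 0.0, n
    # The spread is taken over replicate means, not over individual residues.
    var = sum((m - mean) ** 2 for m in replicate_means) / (n - 1)
    return mean, math.sqrt(var / n), n
```

Averaging within each replicate first means N in the SEM is the number of replicates, so residues within one replicate are not counted as independent samples.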

polyzymd.analyses.shared.statistics.weighted_mean_with_sem(means, sems, weights=None)[source]

Compute weighted mean with proper error propagation.

Useful for combining results from different conditions or analyses with different uncertainties.

Parameters:

means (array_like) – Mean values from each source
sems (array_like) – SEM values from each source
weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.

Returns:

Weighted mean with propagated uncertainty

Return type:

StatResult

Notes

For inverse-variance weighting, the combined SEM is:

SEM_combined = 1 / sqrt(sum(1/SEM_i^2))

For arbitrary weights, uses standard error propagation:

SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)
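As a worked illustration of the two formulas in the notes, the sketch below evaluates both branches in plain Python. The name weighted_mean_sem_sketch is hypothetical, the return value is a tuple rather than a StatResult, and zero-valued SEMs are not handled here.

```python
import math


def weighted_mean_sem_sketch(means, sems, weights=None):
    """Hypothetical helper: weighted mean with propagated SEM."""
    if weights is None:
        # Inverse-variance weights: w_i = 1 / SEM_i^2.
        weights = [1.0 / s**2 for s in sems]
        total = sum(weights)
        combined = sum(w * m for w, m in zip(weights, means)) / total
        # SEM_combined = 1 / sqrt(sum(1 / SEM_i^2))
        return combined, 1.0 / math.sqrt(total)

    total = sum(weights)
    combined = sum(w * m for w, m in zip(weights, means)) / total
    # SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)
    sem = math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sems))) / total
    return combined, sem
```

With equal SEMs the two branches agree: two sources with SEM 1.0 give a combined SEM of 1/sqrt(2) either way.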

Inferential statistical tests shared across analysis comparisons.

This module provides statistical functions for comparing analysis results across multiple conditions, including t-tests, ANOVA, and effect sizes.

It is the canonical home for inferential statistics used by analysis plugins and comparison utilities.

All functions use SciPy for statistical calculations.

class polyzymd.analyses.shared.inferential_statistics.TTestResult(t_statistic, p_value)[source]

Bases: object

Result of a two-sample t-test.

t_statistic

The t-statistic

Type:: float

p_value

Two-tailed p-value

Type:: float

t_statistic: float

p_value: float

property significant: bool: Whether the result is significant at p < 0.05.

Note

This uses a hardcoded alpha=0.05 threshold. The comparison pipeline overrides significance with configurable thresholds (BH-adjusted or Tukey). Use this property only as a convenience default.

to_dict()[source]

Convert to dictionary.

__init__(t_statistic, p_value)

class polyzymd.analyses.shared.inferential_statistics.EffectSize(cohens_d, interpretation, direction)[source]

Bases: object

Cohen’s d effect size with interpretation.

cohens_d

The effect size (positive = group1 > group2)

Type:: float

interpretation

Categorical interpretation: “negligible”, “small”, “medium”, “large”

Type:: str

direction

“higher” (d > 0), “lower” (d < 0), or “unchanged” (d == 0).

Type:: str

interpretation: str

direction: str

to_dict()[source]

Convert to dictionary.

__init__(cohens_d, interpretation, direction)
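For orientation, a conventional way to compute and label Cohen's d is sketched below. This page does not state the formula or cutoffs the module uses, so the pooled standard deviation and the 0.2 / 0.5 / 0.8 thresholds are assumptions based on the common textbook convention; the function names are hypothetical. Only the direction labels come from the attribute description above.

```python
import math
from statistics import mean, variance


def cohens_d_sketch(group1, group2):
    """Hypothetical helper: Cohen's d using a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    # Positive d means group1 > group2. Raises ZeroDivisionError if both
    # groups have zero variance; a real implementation would guard this.
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


def describe_effect_sketch(d):
    """Label magnitude (assumed cutoffs) and direction of an effect size."""
    magnitude = abs(d)
    if magnitude < 0.2:
        interpretation = "negligible"
    elif magnitude < 0.5:
        interpretation = "small"
    elif magnitude < 0.8:
        interpretation = "medium"
    else:
        interpretation = "large"
    direction = "higher" if d > 0 else "lower" if d < 0 else "unchanged"
    return interpretation, direction
```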

class polyzymd.analyses.shared.inferential_statistics.ANOVAResult(f_statistic, p_value)[source]

Bases: object

Result of one-way ANOVA.

f_statistic

The F-statistic

Type:: float

p_value

P-value for the test

Type:: float

f_statistic: float

p_value: float

property significant: bool: Whether the result is significant at p < 0.05.

Note

This uses a hardcoded alpha=0.05 threshold. The comparison pipeline overrides significance with configurable thresholds. Use this property only as a convenience default.

to_dict()[source]

Convert to dictionary.

__init__(f_statistic, p_value)

class polyzymd.analyses.shared.inferential_statistics.BHResult(raw_p_value, adjusted_p_value, significant, rank)[source]

Bases: object

Result of Benjamini-Hochberg correction for one hypothesis.

raw_p_value

Original uncorrected p-value.

Type:: float | None

adjusted_p_value

BH-adjusted p-value (q-value). None if raw was None.

Type:: float | None

significant

Whether adjusted_p_value <= alpha.

Type:: bool

rank

1-based rank among non-None p-values (smallest=1). None if raw was None.

Type:: int | None

raw_p_value: float | None

adjusted_p_value: float | None

significant: bool

rank: int | None

__init__(raw_p_value, adjusted_p_value, significant, rank)

polyzymd.analyses.shared.inferential_statistics.benjamini_hochberg(p_values, alpha=0.05)[source]

Apply Benjamini-Hochberg FDR correction to a family of p-values.

Implements the Benjamini-Hochberg (1995) step-up procedure to control the false discovery rate. The correction adjusts p-values such that declaring significance at adjusted_p <= alpha controls the expected proportion of false discoveries at level alpha.

None and NaN entries in p_values (e.g. cross-temperature pairs where statistics are suppressed, or degenerate tests with undefined p-values) are passed through — the corresponding BHResult has adjusted_p_value=None and significant=False.

Parameters:

p_values (Sequence[float | None]) – Raw two-tailed p-values. None entries are preserved.
alpha (float, optional) – FDR significance threshold, by default 0.05.

Returns:

One entry per input p-value, in the same order.

Return type:

list[BHResult]

References

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57(1), 289-300.
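The step-up procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the module's code: the name bh_adjust_sketch is hypothetical, results are plain tuples rather than BHResult objects, and the family size m is assumed to count only the non-missing p-values.

```python
import math


def bh_adjust_sketch(p_values, alpha=0.05):
    """Return (adjusted_p, significant, rank) per input, in input order."""
    # Keep only usable p-values, remembering their original positions.
    valid = sorted(
        (p, i)
        for i, p in enumerate(p_values)
        if p is not None and not math.isnan(p)
    )
    m = len(valid)  # assumption: missing entries do not count toward m
    results = [(None, False, None) for _ in p_values]

    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        p, index = valid[rank - 1]
        running_min = min(running_min, p * m / rank)
        results[index] = (running_min, running_min <= alpha, rank)
    return results
```

For the inputs [0.01, None, 0.04, 0.03], the raw value 0.03 has rank 2 and a scaled value of 0.045, which the running minimum pulls down to 0.04, the adjusted value of the rank-3 entry.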

polyzymd.analyses.shared.inferential_statistics.independent_ttest(group1, group2, method='student')[source]

Perform a two-sample independent t-test.

Tests the null hypothesis that two independent samples have identical expected values.

The method parameter controls the variance assumption:

"student" uses Student’s t-test (equal_var=True), which assumes equal population variances
"welch" uses Welch’s t-test (equal_var=False), which does not assume equal variances

Use "student" when homoscedasticity is a reasonable assumption. Use "welch" when variances may differ across conditions.

Parameters:

group1 (array_like) – First group of values (e.g., control replicate means)
group2 (array_like) – Second group of values (e.g., treatment replicate means)
method (str, optional) – T-test method to use: "student" or "welch", by default "student".

Returns:

Result containing t-statistic and p-value

Return type:

TTestResult

Examples

>>> from polyzymd.analyses.shared.inferential_statistics import independent_ttest
>>> control = [0.715, 0.693, 0.696]  # No polymer RMSF
>>> treatment = [0.517, 0.586]        # 100% SBMA RMSF
>>> result = independent_ttest(control, treatment)
>>> print(f"t = {result.t_statistic:.3f}, p = {result.p_value:.4f}")

Raises:: ValueError – If method is not "student" or "welch"

polyzymd.analyses.shared.inferential_statistics.one_way_anova(*groups)[source]

Perform classical one-way ANOVA across multiple groups.

Tests the null hypothesis that all groups have the same mean using scipy.stats.f_oneway (equal variance assumption).

Parameters:: *groups (array_like) – Variable number of groups to compare. Each group must have at least 2 observations; groups with fewer observations cause the function to return NaN statistics.
Returns:: Result containing F-statistic and p-value. Both are NaN if any group has fewer than 2 observations.
Return type:: ANOVAResult

Examples

>>> from polyzymd.analyses.shared.inferential_statistics import one_way_anova
>>> no_poly = [0.715, 0.693, 0.696]
>>> sbma = [0.517, 0.586]
>>> egma = [0.558, 0.738, 0.496]
>>> result = one_way_anova(no_poly, sbma, egma)
>>> print(f"F = {result.f_statistic:.3f}, p = {result.p_value:.4f}")

class polyzymd.analyses.shared.inferential_statistics.TukeyHSDResult(group_i, group_j, statistic, p_value)[source]

Bases: object

Result of Tukey’s HSD test for one pair of groups.

group_i

Index of the first group.

Type:: int

group_j

Index of the second group.

Type:: int

statistic

Mean difference (group_j - group_i).

Type:: float

p_value

Tukey-adjusted p-value for this pair.

Type:: float

group_i: int

group_j: int

statistic: float

p_value: float

__init__(group_i, group_j, statistic, p_value)

polyzymd.analyses.shared.inferential_statistics.tukey_hsd(*groups)[source]

Run Tukey’s Honestly Significant Difference test.

Computes family-wise-adjusted p-values for all pairwise group comparisons using scipy.stats.tukey_hsd.

Parameters:: *groups (array_like) – Variable number of groups to compare. Each group must have at least 2 observations.
Returns:: One result per unique pair (i < j), ordered by (i, j). Returns an empty list if fewer than 2 groups are provided or any group has fewer than 2 observations.
Return type:: list[TukeyHSDResult]

Examples

>>> from polyzymd.analyses.shared.inferential_statistics import tukey_hsd
>>> results = tukey_hsd([1, 2, 3], [4, 5, 6], [7, 8, 9])
>>> for r in results:
...     print(f"({r.group_i}, {r.group_j}): p={r.p_value:.4f}")

polyzymd.analyses.shared.inferential_statistics.percent_change(control_mean, treatment_mean)[source]

Calculate percent change from control.

Parameters:

control_mean (float) – Mean value of control condition
treatment_mean (float) – Mean value of treatment condition

Returns:

Percent change: (treatment - control) / control * 100. Negative = reduction, positive = increase.

Special handling for zero control values:

0 -> 0 returns 0.0
0 -> positive returns math.inf
0 -> negative returns -math.inf

If either input is non-finite (NaN or +/-inf), returns math.nan.

Return type:

float
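The documented edge cases can be restated as a short sketch. The function name percent_change_sketch is hypothetical, and the order of the checks is an assumption; the return values follow the rules listed above.

```python
import math


def percent_change_sketch(control_mean, treatment_mean):
    """Hypothetical re-statement of the documented percent-change rules."""
    # Non-finite inputs (NaN or +/-inf) yield NaN.
    if not (math.isfinite(control_mean) and math.isfinite(treatment_mean)):
        return math.nan
    # Zero control: the ratio is undefined, so return 0 or a signed infinity.
    if control_mean == 0:
        if treatment_mean == 0:
            return 0.0
        return math.inf if treatment_mean > 0 else -math.inf
    return (treatment_mean - control_mean) / control_mean * 100
```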

Convergence diagnostics for sliding-window timeseries analysis.

This module implements a sliding-window convergence heuristic adapted from a collaborator notebook used for RMSD equilibration checks.

class polyzymd.analyses.shared.convergence.ConvergenceResult(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)[source]

Bases: object

Container for convergence diagnostics.

converged

Whether sustained convergence was detected.

Type:: bool

assessable

Whether convergence could be assessed from available data.

Type:: bool

convergence_time_ns

Start time of the first sustained converged period.

Type:: float | None

window_start_times_ns

Start times for each sliding window.

Type:: list[float]

window_mean_values

Mean signal value in each sliding window.

Type:: list[float]

slope_times_ns

Time points associated with slope estimates.

Type:: list[float]

slopes

Slopes between successive window means.

Type:: list[float]

window_size_ns

Sliding window width in ns.

Type:: float

step_size_ns

Sliding window stride in ns.

Type:: float

slope_threshold

Absolute slope cutoff used for convergence.

Type:: float

sustained_for_ns

Required sustained duration below slope threshold.

Type:: float

message

Human-readable status message.

Type:: str

converged: bool

assessable: bool

convergence_time_ns: float | None

window_start_times_ns: list[float]

window_mean_values: list[float]

slope_times_ns: list[float]

slopes: list[float]

window_size_ns: float

step_size_ns: float

slope_threshold: float

sustained_for_ns: float

message: str

__init__(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)

polyzymd.analyses.shared.convergence.find_convergence_time(time_ns, values, window_size_ns=15.0, step_size_ns=5.0, slope_threshold=0.0005, sustained_for_ns=15.0)[source]

Find sustained convergence time using a sliding-window slope heuristic.

Parameters:

time_ns (array_like) – Monotonically increasing time values in ns.
values (array_like) – Signal values sampled at time_ns.
window_size_ns (float, optional) – Width of each averaging window in ns.
step_size_ns (float, optional) – Sliding step between successive windows in ns.
slope_threshold (float, optional) – Absolute slope threshold for classifying a window-to-window change as converged.
sustained_for_ns (float, optional) – Required cumulative duration below threshold before declaring convergence.

Returns:

Full convergence diagnostics, including intermediate window means and slope traces.

Return type:

ConvergenceResult

Raises:

ValueError – Raised when inputs are invalid.
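One plausible reading of the sliding-window slope heuristic is sketched below. The details are assumptions, since this page does not specify them: windows are half-open intervals starting at the first time point, slopes are differences of successive window means divided by the spacing of their start times, and convergence is declared once consecutive below-threshold slopes span at least sustained_for_ns. The name find_convergence_sketch is hypothetical and it returns only the convergence time, not a ConvergenceResult.

```python
def find_convergence_sketch(
    time_ns,
    values,
    window_size_ns=15.0,
    step_size_ns=5.0,
    slope_threshold=0.0005,
    sustained_for_ns=15.0,
):
    """Return the start time (ns) of the first sustained flat period, or None."""
    starts, means = [], []
    start = time_ns[0]
    # Slide a fixed-width window across the series and record each mean.
    while start + window_size_ns <= time_ns[-1]:
        window = [
            v for t, v in zip(time_ns, values)
            if start <= t < start + window_size_ns
        ]
        if window:
            starts.append(start)
            means.append(sum(window) / len(window))
        start += step_size_ns

    run_start = None
    for i in range(len(means) - 1):
        slope = (means[i + 1] - means[i]) / (starts[i + 1] - starts[i])
        if abs(slope) < slope_threshold:
            if run_start is None:
                run_start = starts[i]
            # Declare convergence once the flat run is long enough.
            if starts[i + 1] - run_start >= sustained_for_ns:
                return run_start
        else:
            run_start = None  # a steep slope resets the run
    return None
```

Requiring a sustained run, rather than a single flat slope, avoids declaring convergence on a brief plateau.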

Plotting

Plotting helpers centralize figure themes, output paths, axis styling, legends, grouped bars, and matrix annotations.

Shared plotting utilities for analysis plugins.

This module provides reusable plotting helper functions extracted from the plotter infrastructure. Analysis plugins import these functions to apply consistent styling, save figures with watermarks, and render common chart elements (grouped bars, heatmap annotations, etc.) without inheriting from a base class.

All functions accept a PlotSettings object (from polyzymd.config.comparison) so that user-configured themes, palettes, and DPI settings are respected automatically.

Examples

>>> import matplotlib.pyplot as plt
>>> from polyzymd.analyses.shared.plotting import (
...     apply_axis_style, apply_legend, get_palette_colors, save_figure,
... )
>>>
>>> # x, y, plot_settings, and output_dir are assumed to be defined by the caller
>>> fig, ax = plt.subplots()
>>> colors = get_palette_colors(3, plot_settings)
>>> ax.bar(x, y, color=colors[0])
>>> apply_axis_style(ax, plot_settings, title="My Plot", ylabel="Value (Å)")
>>> apply_legend(ax, plot_settings)
>>> save_figure(fig, output_dir / "my_plot.png", plot_settings)

class polyzymd.analyses.shared.plotting.ArtifactPlotData(analysis_dir, condition_artifact, replicate_artifacts, aggregated_dir, run_dirs)[source]

Bases: object

Canonical artifacts loaded for plot-time data access.

analysis_dir: Path

condition_artifact: Any | None

replicate_artifacts: dict[int, Any]

aggregated_dir: Path

run_dirs: dict[int, Path]

__init__(analysis_dir, condition_artifact, replicate_artifacts, aggregated_dir, run_dirs)

polyzymd.analyses.shared.plotting.load_canonical_plot_artifacts(analysis_dir, replicates, *, require_condition=False, require_replicates=True)[source]

Load plot inputs from canonical MDAnalysis artifacts only.

The loader reads aggregated/result.json and the configured run_N/result.json files through ArtifactStore. It never scans directories, opens non-canonical JSON files, or imports trajectory packages.

Parameters:

analysis_dir (Path) – Condition-level analysis directory containing aggregated and run_N subdirectories.
replicates (sequence of int) – Configured replicate IDs to load. Extra run directories are ignored.
require_condition (bool, optional) – Raise when aggregated/result.json is absent, by default False.
require_replicates (bool, optional) – Raise when any configured run_N/result.json is absent, by default True.

Returns:

Loaded canonical condition and replicate artifacts.

Return type:

ArtifactPlotData

polyzymd.analyses.shared.plotting.get_theme(plot_settings)[source]

Return the resolved PlotTheme from plot_settings.

Parameters:: plot_settings (PlotSettings) – Global plot settings (carries a .theme property).
Return type:: PlotTheme

polyzymd.analyses.shared.plotting.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]

Apply standard axis chrome from the theme.

Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.

Parameters:

ax (matplotlib Axes) – Target axes to style.
plot_settings (PlotSettings) – Global plot settings.
title (str, optional) – Axes title.
xlabel (str, optional) – X-axis label.
ylabel (str, optional) – Y-axis label.

polyzymd.analyses.shared.plotting.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]

Apply legend with themed defaults.

Uses theme.legend_loc and theme.legend_bbox unless overridden by the caller. Extra kwargs are forwarded to ax.legend().

Parameters:

ax (matplotlib Axes) – Target axes.
plot_settings (PlotSettings) – Global plot settings.
loc (str, optional) – Override theme.legend_loc.
bbox_to_anchor (tuple of float or None, optional) – Override theme.legend_bbox. Pass None explicitly to suppress the bbox (e.g. for inside-axes placement).
fontsize (int, optional) – Override theme.legend_fontsize.
**kwargs (Any) – Forwarded to ax.legend().

polyzymd.analyses.shared.plotting.get_palette_colors(n, plot_settings)[source]

Get n distinct colors from the configured palette.

Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.

Parameters:

n (int) – Number of colors needed.
plot_settings (PlotSettings) – Global plot settings (carries color_palette).

Returns:

List of color values (RGB tuples or matplotlib color specs).

Return type:

list

polyzymd.analyses.shared.plotting.order_condition_labels(labels, plot_settings)[source]

Return condition labels in semantic plot order when enabled.

Ordering only affects plot display order. It does not alter comparison statistics, rankings, or condition result files.

Parameters:

labels (sequence of str) – Condition labels in their original order.
plot_settings (PlotSettings) – Global plot settings carrying optional semantic color settings.

Returns:

Ordered labels for plotting.

Return type:

list of str

polyzymd.analyses.shared.plotting.get_condition_colors(labels, plot_settings, *, control_label=None)[source]

Return colors for condition labels using semantic settings if enabled.

Parameters:

labels (sequence of str) – Condition labels in plot order.
plot_settings (PlotSettings) – Global plot settings carrying optional semantic color settings.
control_label (str, optional) – Label that should use the configured semantic control color.

Returns:

Color values aligned to labels.

Return type:

list

polyzymd.analyses.shared.plotting.get_condition_color_map(labels, plot_settings, *, control_label=None)[source]

Return a label-to-color map using semantic condition color rules.

Resolution precedence is manual color, condition color, control color, family/value color, missing metadata fallback, then existing palette fallback. Invalid color or colormap values warn and continue to a safe fallback.

Parameters:

labels (sequence of str) – Condition labels in their original palette-alignment order.
plot_settings (PlotSettings) – Global plot settings carrying optional semantic color settings.
control_label (str, optional) – Label that should use the configured semantic control color.

Returns:

Mapping from each label to its resolved matplotlib-compatible color.

Return type:

dict of str to Any

polyzymd.analyses.shared.plotting.get_output_path(output_dir, name, plot_settings)[source]

Generate output file path with correct format extension.

Parameters:

output_dir (Path) – Output directory.
name (str) – Base filename (without extension).
plot_settings (PlotSettings) – Global plot settings (carries format).

Returns:

Full output path with extension.

Return type:

Path

polyzymd.analyses.shared.plotting.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]

Save figure with DPI, watermark, and optional experimental stamp.

Parameters:

fig (matplotlib Figure) – Figure to save.
output_path (Path) – Output file path.
plot_settings (PlotSettings) – Global plot settings (carries dpi and theme).
experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.
close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.

Returns:

Path to saved figure.

Return type:

Path

polyzymd.analyses.shared.plotting.finite_numeric_values(values)[source]

Return finite numeric values as a one-dimensional float array.

Parameters:: values (Any) – Candidate scalar or sequence of replicate values. None and non-numeric inputs are treated as missing data.
Returns:: One-dimensional array containing only finite floats. The array is empty when no finite numeric values are available.
Return type:: numpy.ndarray

polyzymd.analyses.shared.plotting.replicate_jitter_offsets(n_values, bar_width)[source]

Return deterministic offsets for replicate dot overlays.

Offsets are centred on the corresponding bar position so overlays are reproducible across runs and independent of random-number state.

Parameters:

n_values (int) – Number of replicate values to display for one bar.
bar_width (float) – Width or height of the corresponding bar, depending on orientation.

Returns:

Jitter offsets centred around zero.

Return type:

numpy.ndarray
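Deterministic jitter can be produced by spacing the offsets evenly instead of drawing random numbers. The sketch below shows one way to do that; the function name jitter_offsets_sketch, the spread fraction of the bar width, and the list return type are all assumptions, since this page does not state how the library spaces the offsets.

```python
def jitter_offsets_sketch(n_values, bar_width, spread=0.5):
    """Hypothetical helper: evenly spaced offsets centred on zero."""
    if n_values <= 0:
        return []
    if n_values == 1:
        return [0.0]  # a single dot sits on the bar centre
    # Assumption: dots occupy `spread` of the bar width, symmetric about 0.
    half = bar_width * spread / 2
    step = 2 * half / (n_values - 1)
    return [-half + i * step for i in range(n_values)]
```

Because the offsets depend only on n_values and bar_width, repeated runs place every dot at the same position.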

polyzymd.analyses.shared.plotting.has_replicate_uncertainty(replicate_values=None, *, n_replicates=None)[source]

Return whether replicate-level uncertainty can be displayed.

Parameters:

replicate_values (Any, optional) – Per-condition or per-bar replicate values. Finite numeric entries are counted after coercion.
n_replicates (int or None, optional) – Explicit replicate count when the raw replicate values are not available.

Returns:

True when at least two finite independent replicate values are present.

Return type:

bool

polyzymd.analyses.shared.plotting.suppress_singleton_errors(errors, replicate_values)[source]

Return errors with singleton replicate uncertainties suppressed.

Parameters:

errors (sequence of float) – SEM or uncertainty values aligned to replicate_values.
replicate_values (sequence or None) – Per-bar replicate values used to decide whether an error bar is statistically displayable.

Returns:

Sanitized error values. Returns None when no bar has replicate uncertainty, allowing callers to omit error bars entirely.

Return type:

list of float or None
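The interplay between finite-value filtering and singleton suppression can be sketched as below. The function names are hypothetical, plain lists stand in for NumPy arrays, and replacing a singleton's error with 0.0 is an assumption (this page does not say which placeholder value the library uses).

```python
import math


def finite_values_sketch(values):
    """Hypothetical helper: keep only finite numeric entries as floats."""
    if values is None:
        return []
    if isinstance(values, (int, float)):
        values = [values]
    return [
        float(v)
        for v in values
        if isinstance(v, (int, float))
        and not isinstance(v, bool)
        and math.isfinite(v)
    ]


def suppress_singletons_sketch(errors, replicate_values):
    """Zero out errors for bars with fewer than two finite replicates."""
    if replicate_values is None:
        return None
    displayable = [len(finite_values_sketch(v)) >= 2 for v in replicate_values]
    if not any(displayable):
        return None  # caller can omit error bars entirely
    # Assumption: suppressed bars get 0.0 rather than NaN.
    return [float(e) if ok else 0.0 for e, ok in zip(errors, displayable)]
```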

polyzymd.analyses.shared.plotting.scatter_replicate_values(ax, bar_positions, replicate_values, plot_settings, *, orientation='vertical', bar_width=0.8, dot_color=None, dot_size=None, dot_alpha=None, zorder=5)[source]

Overlay jittered per-replicate values on bars.

For vertical bars, bar positions are x-coordinates, jitter is applied in x, and replicate values are plotted on y. For horizontal bars, bar positions are y-coordinates, jitter is applied in y, and replicate values are plotted on x.

Parameters:

ax (matplotlib.axes.Axes) – Axes containing the bar chart.
bar_positions (sequence of float or numpy.ndarray) – Bar centre positions aligned to replicate_values.
replicate_values (sequence of Any) – Per-bar replicate values. Each item may be a scalar or sequence; only finite numeric values are plotted.
plot_settings (PlotSettings) – Global plot settings whose theme provides default dot styling.
orientation ({"vertical", "horizontal"}, optional) – Bar orientation, by default "vertical".
bar_width (float, optional) – Width or height of the bars, by default 0.8.
dot_color (Any, optional) – Override for theme dot colour.
dot_size (float, optional) – Override for theme dot size. Dots are skipped when non-positive.
dot_alpha (float, optional) – Override for theme dot alpha. Dots are skipped when non-positive.
zorder (float, optional) – Matplotlib z-order for dot overlays, by default 5.

Returns:

Number of scatter calls emitted.

Return type:

int

Raises:

ValueError – If orientation is not "vertical" or "horizontal", or if bar_positions and replicate_values are not the same length.

polyzymd.analyses.shared.plotting.scatter_stacked_segment_replicates(ax, x_position, bottom_value, replicate_values, plot_settings, *, replicate_base_values=None, positive_base_values=None, negative_base_values=None, bar_width=0.8, dot_color=None, dot_size=None, dot_alpha=None, placement='center', zorder=5)[source]

Overlay replicate dots on stacked segments.

The per-component replicate value is a segment height, not an absolute stacked coordinate. Plotting at base + replicate / 2 places each dot at the center of the component-specific replicate segment. Callers should pass replicate-specific bases when earlier stacked components vary by replicate. Signed stacks may pass separate positive and negative bases so each dot is placed on the same sign stack as its own replicate value.

Parameters:

ax (matplotlib.axes.Axes) – Axes containing the stacked bar chart.
x_position (float) – Center x-coordinate of the condition bar.
bottom_value (float) – Aggregate stack baseline for the current segment.
replicate_values (sequence of Any) – Component-specific per-replicate segment heights.
plot_settings (PlotSettings) – Plot configuration used for dot styling.
replicate_base_values (sequence of Any, optional) – Per-replicate cumulative stack bases for unsigned stacks. When omitted, bottom_value is used for every replicate for backward compatibility.
positive_base_values (sequence of Any, optional) – Per-replicate cumulative positive stack bases for signed stacks.
negative_base_values (sequence of Any, optional) – Per-replicate cumulative negative stack bases for signed stacks.
bar_width (float, optional) – Width used for deterministic jitter, by default 0.8.
dot_color (Any, optional) – Override for theme dot colour.
dot_size (float, optional) – Override for theme dot size.
dot_alpha (float, optional) – Override for theme dot alpha.
placement ({"center", "end"}, optional) – Dot placement within each replicate segment. "center" uses base + replicate / 2 and "end" uses base + replicate.
zorder (float, optional) – Matplotlib z-order for dot overlays, by default 5.

Returns:

Number of scatter calls emitted.

Return type:

int

Raises:

ValueError – If replicate base arrays do not align with replicate_values.
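The placement arithmetic described above (base + replicate / 2 for "center", base + replicate for "end") can be checked with a small sketch. The helper name stacked_dot_positions_sketch is hypothetical and it covers only the unsigned-stack case with per-replicate bases.

```python
def stacked_dot_positions_sketch(replicate_values, replicate_bases, placement="center"):
    """Hypothetical helper: y-coordinates for dots on one stacked segment."""
    if len(replicate_values) != len(replicate_bases):
        raise ValueError("replicate bases must align with replicate values")
    if placement not in ("center", "end"):
        raise ValueError("placement must be 'center' or 'end'")
    # Each replicate value is a segment height, not an absolute coordinate.
    factor = 0.5 if placement == "center" else 1.0
    return [base + factor * value for value, base in zip(replicate_values, replicate_bases)]
```

For segment heights [2.0, 4.0] on bases [1.0, 1.5], "center" yields [2.0, 3.5] and "end" yields [3.0, 5.5].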

polyzymd.analyses.shared.plotting.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]

Render grouped bars with optional error bars and reference line.

Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from plot_settings.theme. Callers can override any of them via **style_overrides using the theme field names as keys.

Parameters:

ax (matplotlib Axes) – Target axes.
x (np.ndarray) – 1-D array of group centre positions (e.g. np.arange(n_groups)).
series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.
colors (sequence) – One colour per condition (same length as series).
plot_settings (PlotSettings) – Global plot settings.
bar_width (float | None, optional) – Width of each individual bar. When None (default) the width is computed as 0.8 / len(series).
show_error (bool, optional) – If False, error bars are suppressed, by default True.
reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to None to skip, by default 0.0.
reference_label (str, optional) – Legend label for the reference line, by default "Neutral (0)".
replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as replicate_values[condition_idx][group_idx] -> sequence of floats (one per replicate). When None (default), no dots are drawn.
**style_overrides – Override any theme field for this call only. Accepted keys: bar_alpha, bar_capsize, bar_edgecolor, bar_linewidth, dot_size, dot_alpha, dot_color, reference_line_color, reference_line_style, reference_line_width.

polyzymd.analyses.shared.plotting.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]

Annotate heatmap cells with formatted values.

Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).

Parameters:

ax (matplotlib Axes) – The axes containing the heatmap image.
matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.
plot_settings (PlotSettings) – Global plot settings.
fmt (str, optional) – Format spec for the value, by default ".2f".
fontsize (int | None, optional) – Annotation font size. When None (default), uses plot_settings.theme.annotation_fontsize.
threshold (float, optional) – Absolute-value threshold above which text turns white.
sem_matrix (np.ndarray | None, optional) – If provided, a second line ±{sem} is appended when the SEM value is finite.
show_sign (bool, optional) – Prefix positive values with "+", by default True.
linespacing (float | None, optional) – Passed to ax.text(linespacing=...) when SEM is shown.
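The label text and colour rules can be sketched without matplotlib. The function names are hypothetical, and using Python's "+" format flag for show_sign is an assumption (it also prefixes zero with "+").

```python
import math


def cell_label_sketch(value, sem=None, fmt=".2f", show_sign=True):
    """Hypothetical helper: build the annotation text for one heatmap cell."""
    if math.isnan(value):
        return None  # NaN cells are skipped
    spec = ("+" if show_sign else "") + fmt
    text = format(value, spec)
    # Append a second line with the SEM only when it is finite.
    if sem is not None and math.isfinite(sem):
        text += "\n±" + format(sem, fmt)
    return text


def cell_text_colour_sketch(value, threshold=0.3):
    """White text on strongly coloured cells, black otherwise."""
    return "white" if abs(value) > threshold else "black"
```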

polyzymd.analyses.shared.plotting.symmetric_clim(values, pad=0.1)[source]

Compute symmetric colour limits centred on zero.

Parameters:

values (sequence of float or ndarray) – Finite data values to derive limits from.
pad (float, optional) – Extra padding added to both sides, by default 0.1.

Returns:

(vmin, vmax) with vmin == -vmax (before padding).

Return type:

tuple[float, float]
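A minimal sketch of zero-centred colour limits follows. The name symmetric_clim_sketch is hypothetical, and two details are assumptions: non-finite values are ignored, and pad is added to the largest absolute value.

```python
import math


def symmetric_clim_sketch(values, pad=0.1):
    """Hypothetical helper: (vmin, vmax) symmetric about zero."""
    magnitudes = [abs(v) for v in values if math.isfinite(v)]
    # Assumption: an empty input falls back to limits of (-pad, +pad).
    bound = max(magnitudes, default=0.0) + pad
    return -bound, bound
```

Symmetric limits keep zero at the midpoint of a diverging colormap, so positive and negative values of equal magnitude get equally saturated colours.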

Selections, selectors, and grouping

Selection helpers extend MDAnalysis selections. Selector and grouping packages provide reusable abstractions for selecting molecules or classifying residues in plugin settings and analysis code.

Special selection syntax for distance calculations.

This module provides parsing of extended selection syntax for defining atom positions in distance calculations. It supports:

Standard MDAnalysis selections: “resid 133 and name OD1”
Midpoint of multiple atoms: “midpoint(resid 133 and name OD1 OD2)”
Center of mass of a group: “com(resid 50-75)”

Examples

>>> from polyzymd.analyses.shared.selections import get_position_from_selection
>>>
>>> # Standard selection - single atom
>>> pos = get_position_from_selection(universe, "resid 77 and name OG")
>>>
>>> # Midpoint of Asp carboxyl oxygens
>>> pos = get_position_from_selection(universe, "midpoint(resid 133 and name OD1 OD2)")
>>>
>>> # Center of mass of lid domain
>>> pos = get_position_from_selection(universe, "com(resid 50-75)")

Notes

The midpoint() syntax is particularly useful for catalytic residues where the functional position is between two atoms (e.g., Asp carboxyl oxygens, Glu carboxyl oxygens).

The com() syntax is useful for domain motions where you want to track the center of mass of a group of residues (e.g., lid opening in lipases).

polyzymd.analyses.shared.selections.translate_selection(selection)[source]

Translate PolyzyMD selection keywords to MDAnalysis equivalents.

This allows users to use the same selection syntax in analysis as they use in config.yaml for restraints and other atom selections.

Translations

pdbindex N → id N (PDB ATOM serial number)

The pdbindex keyword refers to the 1-indexed atom serial number from the PDB ATOM record (columns 7-11), which is what PyMOL displays as “id”. In MDAnalysis, this is accessed via the id selection keyword.

Note: MDAnalysis also has bynum which is 1-indexed positional (i.e., bynum 1 = first atom, bynum 2 = second atom), but this does NOT correspond to PDB serial numbers when there are gaps in numbering. We use id because it matches actual PDB serial numbers.

Parameters:: selection (str) – Selection string with possible PolyzyMD-specific keywords
Returns:: Selection string with MDAnalysis-compatible keywords
Return type:: str
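The keyword translation amounts to a whole-word substitution, which can be sketched with a regular expression. The name translate_selection_sketch is hypothetical and the regex is an assumption about how such a translation could be done, not the module's actual code.

```python
import re


def translate_selection_sketch(selection):
    """Hypothetical helper: rewrite `pdbindex` as the MDAnalysis `id` keyword."""
    # \b restricts the match to the whole word, so names that merely
    # contain "pdbindex" as a substring are left untouched.
    return re.sub(r"\bpdbindex\b", "id", selection)
```

The word boundary also matches directly after an opening parenthesis, so wrapped forms such as midpoint(pdbindex 100 ...) are translated too.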

Examples

>>> translate_selection("pdbindex 100 and name CA")
'id 100 and name CA'

>>> translate_selection("midpoint(pdbindex 100 and name OD1 OD2)")
'midpoint(id 100 and name OD1 OD2)'

class polyzymd.analyses.shared.selections.SelectionMode(value)[source]

Bases: str, Enum

Mode for position calculation from atom selection.

SINGLE = 'single'

CENTROID = 'centroid'

MIDPOINT = 'midpoint'

COM = 'com'

class polyzymd.analyses.shared.selections.ParsedSelection(selection, mode, original)[source]

Bases: object

Result of parsing a selection string.

selection

The MDAnalysis selection string (without wrapper function)

Type:: str

mode

How to compute the position

Type:: SelectionMode

original

The original input string

Type:: str

selection: str

mode: SelectionMode

original: str

__init__(selection, mode, original)

polyzymd.analyses.shared.selections.parse_selection_string(selection)[source]

Parse a selection string to extract mode and MDAnalysis selection.

Also translates PolyzyMD-specific keywords (like pdbindex) to their MDAnalysis equivalents (like id).

Parameters:: selection (str) – Selection string, possibly with special syntax:

“resid 77 and name OG” - standard MDAnalysis
“midpoint(resid 133 and name OD1 OD2)” - midpoint mode
“com(resid 50-75)” - center of mass mode
“pdbindex 100 and name CA” - PolyzyMD pdbindex (translated to id)

Returns:: Parsed selection with mode and clean selection string
Return type:: ParsedSelection
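Parsing the wrapper syntax can be sketched with one anchored regular expression. The helper name parse_selection_sketch is hypothetical, it returns a (mode, selection) tuple of strings rather than a ParsedSelection, and treating an unwrapped selection as "single" mode is an assumption.

```python
import re

# Matches "midpoint(...)" or "com(...)" spanning the whole string.
_WRAPPER_SKETCH = re.compile(r"^\s*(midpoint|com)\(\s*(.*?)\s*\)\s*$", re.IGNORECASE)


def parse_selection_sketch(selection):
    """Hypothetical helper: split a selection into (mode, inner selection)."""
    match = _WRAPPER_SKETCH.match(selection)
    if match:
        mode, inner = match.group(1).lower(), match.group(2)
    else:
        mode, inner = "single", selection.strip()  # assumed default mode
    # Translate the PolyzyMD keyword to its MDAnalysis equivalent.
    inner = re.sub(r"\bpdbindex\b", "id", inner)
    return mode, inner
```

Anchoring the closing parenthesis at the end of the string lets the inner selection contain its own parentheses, for example midpoint((resid 1 or resid 2) and name CA).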

Examples

>>> parsed = parse_selection_string("midpoint(resid 133 and name OD1 OD2)")
>>> parsed.mode
<SelectionMode.MIDPOINT: 'midpoint'>
>>> parsed.selection
'resid 133 and name OD1 OD2'

>>> parsed = parse_selection_string("pdbindex 100 and name CA")
>>> parsed.selection
'id 100 and name CA'

polyzymd.analyses.shared.selections.select_atoms(universe, selection)[source]

Select atoms from universe using potentially special syntax.

Parameters:

universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)

Returns:

Selected atoms

Return type:

AtomGroup

Raises:

ValueError – If selection matches no atoms, with diagnostic info

polyzymd.analyses.shared.selections.get_position(atoms, mode=SelectionMode.SINGLE)[source]

Get position from atom group based on mode.

Parameters:

atoms (AtomGroup) – MDAnalysis AtomGroup
mode (SelectionMode) – How to compute position:

SINGLE: Position of single atom (error if multiple)
CENTROID/MIDPOINT: Center of geometry
COM: Center of mass

Returns:

3D position vector [x, y, z]

Return type:

NDArray[np.float64]

Raises:

ValueError – If mode is SINGLE but multiple atoms selected
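
The three position modes differ only in how atoms are weighted. The following is a minimal sketch under stated assumptions, not the library code: Mode and position_sketch are hypothetical names, and plain Python lists replace the AtomGroup and NumPy arrays so the example runs without dependencies.

```python
from enum import Enum


class Mode(str, Enum):
    """Hypothetical stand-in for SelectionMode."""
    SINGLE = "single"
    CENTROID = "centroid"
    MIDPOINT = "midpoint"
    COM = "com"


def position_sketch(coords, masses, mode=Mode.SINGLE):
    """coords: list of (x, y, z) tuples; masses: list of floats, same length."""
    if not coords:
        raise ValueError("empty selection")
    if mode is Mode.SINGLE:
        if len(coords) != 1:
            raise ValueError(f"SINGLE mode needs exactly 1 atom, got {len(coords)}")
        return list(coords[0])
    if mode is Mode.COM:
        # Center of mass: weight each atom by its mass.
        weights = list(masses)
    else:
        # CENTROID and MIDPOINT: unweighted center of geometry.
        weights = [1.0] * len(coords)
    total = sum(weights)
    return [
        sum(w * c[axis] for w, c in zip(weights, coords)) / total
        for axis in range(3)
    ]
```

For two atoms of unequal mass, MIDPOINT returns the geometric middle while COM shifts toward the heavier atom.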

polyzymd.analyses.shared.selections.get_position_from_selection(universe, selection)[source]

Get position from selection string in one step.

This is a convenience function that combines parsing, selection, and position calculation.

Parameters:

universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)

Returns:

3D position vector [x, y, z]

Return type:

NDArray[np.float64]

Examples

>>> # Single atom
>>> pos = get_position_from_selection(u, "resid 77 and name OG")
>>>
>>> # Midpoint of Asp carboxyl
>>> pos = get_position_from_selection(u, "midpoint(resid 133 and name OD1 OD2)")

polyzymd.analyses.shared.selections.validate_selection(universe, selection)[source]

Validate a selection string and return diagnostic info.

Parameters:

universe (Universe) – MDAnalysis Universe
selection (str) – Selection string to validate

Returns:

Diagnostic information:

valid: bool
n_atoms: int
mode: str
atoms: list of atom info dicts
error: str (if invalid)
diagnostics: str (detailed diagnostics if invalid)

Return type:

dict

polyzymd.analyses.shared.selections.format_selection_for_label(selection)[source]

Convert selection string to a short label for filenames/display.

Parameters:: selection (str) – Selection string (standard or special syntax)
Returns:: Short label (e.g., “res133_mid” or “res77_OG”)
Return type:: str

Examples

>>> format_selection_for_label("midpoint(resid 133 and name OD1 OD2)")
'res133_mid'
>>> format_selection_for_label("resid 77 and name OG")
'res77_OG'

Molecular selector abstractions for analysis plugins.

This module provides a unified interface for selecting atoms, residues, or molecular groups from an MDAnalysis Universe. Selectors enable:

Consistent selection logic across different analysis types
Configurable protein/polymer/solvent selection
Support for arbitrary user-defined selections
Proximity-based selections (e.g., “residues near active site”)

The Strategy pattern is used so users can define custom selectors by subclassing MolecularSelector.

Examples

>>> from polyzymd.analyses.shared.selectors import (
...     PolymerResiduesByType,
...     ProteinResidues,
...     ProteinResiduesNearReference,
... )
>>>
>>> # Select all protein residues
>>> protein_selector = ProteinResidues()
>>> protein_residues = protein_selector.select(universe)
>>>
>>> # Select polymer residues by monomer type
>>> polymer_selector = PolymerResiduesByType(residue_names=["SBM", "EGP"])
>>> polymer_residues = polymer_selector.select(universe)
>>>
>>> # Select protein residues near catalytic triad
>>> triad_selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0
... )
>>> nearby_residues = triad_selector.select(universe)

class polyzymd.analyses.shared.selectors.MolecularSelector[source]

Bases: ABC

Abstract base class for molecular selections.

Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.

This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.

Examples

>>> class ActiveSiteSelector(MolecularSelector):
...     def __init__(self, active_site_resids: list[int]):
...         self.resids = active_site_resids
...
...     def select(self, universe: Universe) -> SelectionResult:
...         resid_str = " ".join(str(r) for r in self.resids)
...         atoms = universe.select_atoms(f"resid {resid_str}")
...         return SelectionResult(
...             atoms=atoms,
...             residues=atoms.residues,
...             label="active_site",
...             metadata={"resids": self.resids}
...         )
...
...     @property
...     def label(self) -> str:
...         return "active_site"

abstractmethod select(universe)[source]

Select atoms/residues from a Universe.

Parameters:: universe (Universe) – MDAnalysis Universe to select from
Returns:: Container with selected atoms, residues, and metadata
Return type:: SelectionResult

abstract property label: str: Short label identifying this selector (for filenames/logging).

validate(universe)[source]

Validate the selector against a Universe.

Returns diagnostic information about whether the selection would succeed and what it would select.

Parameters:: universe (Universe) – MDAnalysis Universe to validate against
Returns:: Validation results with keys:

valid: bool
n_atoms: int
n_residues: int
error: str (if invalid)
warnings: list[str]

Return type:: dict

class polyzymd.analyses.shared.selectors.MDAnalysisSelector(selection, label=None)[source]

Bases: MolecularSelector

Simple selector using an MDAnalysis selection string.

This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.

Parameters:

selection (str) – MDAnalysis selection string (e.g., “protein”, “resname SBM EGM”, “resid 1-50 and name CA”)
label (str, optional) – Human-readable label. If not provided, uses a sanitized version of the selection string.

Examples

>>> # Select polymer residues by name
>>> selector = MDAnalysisSelector("resname SBM EGM")
>>> result = selector.select(universe)
>>>
>>> # Select protein backbone near ligand
>>> selector = MDAnalysisSelector(
...     "protein and backbone and around 5.0 resname LIG",
...     label="protein_near_ligand"
... )

__init__(selection, label=None)[source]

select(universe)[source]

Select atoms using the MDAnalysis selection string.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SelectionResult(atoms, residues, label, metadata=<factory>)[source]

Bases: object

Container for selection results with metadata.

atoms

The selected atoms

Type:: AtomGroup

residues

The residues containing the selected atoms

Type:: ResidueGroup

label

Human-readable label for this selection

Type:: str

metadata

Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)

Type:: dict

atoms: AtomGroup

residues: ResidueGroup

label: str

metadata: dict

property n_atoms: int: Number of selected atoms.

property n_residues: int: Number of selected residues.

property residue_ids: numpy.typing.NDArray[numpy.int64]: 1-indexed residue IDs (PyMOL convention).

property residue_names: list[str]: Residue names for each residue.

__init__(atoms, residues, label, metadata=<factory>)

class polyzymd.analyses.shared.selectors.CompositeSelector(selectors, mode='union', label=None)[source]

Bases: MolecularSelector

Combines multiple selectors with AND/OR logic.

Useful for complex selections like “protein residues that are both aromatic AND within 5 Å of the active site”.

Parameters:

selectors (list[MolecularSelector]) – List of selectors to combine
mode ({"union", "intersection"}) – How to combine selections: - “union”: Include atoms selected by ANY selector (OR) - “intersection”: Include only atoms selected by ALL selectors (AND)
label (str, optional) – Custom label. If not provided, generates from component labels.
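
The two modes correspond to set union and set intersection over the atoms each component selector returns. The sketch below illustrates that semantics on plain atom-index lists; combine_indices is a hypothetical helper, not part of PolyzyMD, and the sorted output order is an assumption.

```python
from functools import reduce


def combine_indices(index_sets, mode="union"):
    """Combine per-selector atom indices with OR (union) or AND (intersection)."""
    if mode not in {"union", "intersection"}:
        raise ValueError(f"unknown mode: {mode!r}")
    sets = [set(s) for s in index_sets]
    if not sets:
        return []
    op = set.union if mode == "union" else set.intersection
    return sorted(reduce(op, sets))


# Example: indices picked by an "aromatic" selector and a "near site" selector.
aromatic = [3, 7, 12]
near_site = [7, 12, 20]
both = combine_indices([aromatic, near_site], mode="intersection")
either = combine_indices([aromatic, near_site], mode="union")
```

With these inputs, the intersection keeps only indices 7 and 12, which matches the “aromatic AND near the active site” case described above.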

__init__(selectors, mode='union', label=None)[source]

select(universe)[source]

Select atoms using combined selectors.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResidues(selection_modifier=None)[source]

Bases: MolecularSelector

Select all protein residues.

Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.

Parameters:: selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.

__init__(selection_modifier=None)[source]

select(universe)[source]

Select all protein atoms/residues.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]

Bases: MolecularSelector

Select protein residues by amino acid group classification.

Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.

Parameters:

grouping (ResidueGrouping) – Classification scheme for amino acids
groups (list[str]) – Names of groups to include (e.g., [“aromatic”, “charged_positive”])
exclude (bool, optional) – If True, select residues NOT in the specified groups. Default False.

Examples

>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Select aromatic residues
>>> grouping = ProteinAAClassification()
>>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"])
>>>
>>> # Select all charged residues
>>> selector = ProteinResiduesByGroup(
...     grouping,
...     groups=["charged_positive", "charged_negative"]
... )
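
Group-based selection reduces to a lookup from residue name to group, followed by an optional inversion. The sketch below assumes an illustrative mapping; GROUPS is not the table used by ProteinAAClassification, and residues_in_groups is a hypothetical helper operating on (resid, resname) pairs.

```python
# Illustrative mapping only; the real classification may differ.
GROUPS = {
    "aromatic": {"PHE", "TYR", "TRP"},
    "charged_positive": {"LYS", "ARG"},
    "charged_negative": {"ASP", "GLU"},
}


def residues_in_groups(residues, groups, exclude=False):
    """Keep (resid, resname) pairs whose name is in any requested group.

    With exclude=True the test is inverted, mirroring the documented flag.
    """
    wanted = set().union(*(GROUPS[g] for g in groups))
    return [(resid, name) for resid, name in residues if (name in wanted) != exclude]
```

An unknown group name raises KeyError here, which surfaces typos early; whether the library does the same is not stated in this reference.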

__init__(grouping, groups, exclude=False)[source]

select(universe)[source]

Select protein residues matching the specified groups.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]

Bases: MolecularSelector

Select protein residues within a cutoff distance of reference atoms.

Useful for selecting residues near active sites, binding pockets, or other regions of interest.

Parameters:

reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)
cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.
include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.
frame (int, optional) – Frame index used for the distance calculation. Default 0.

Examples

>>> # Select residues within 5A of catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0,
... )
>>>
>>> # Select residues near substrate binding site (not including the site itself)
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resname LIG",
...     cutoff=4.0,
...     include_reference=False,
... )
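
The cutoff rule ("any atom within this distance of any reference atom") can be sketched without MDAnalysis. The example below is a brute-force illustration on (resid, xyz) tuples; residues_near is a hypothetical name, the inclusive comparison (<=) is an assumption, and a real implementation would likely use a neighbor search instead of nested loops.

```python
import math


def residues_near(atoms, reference_resids, cutoff, include_reference=True):
    """atoms: list of (resid, (x, y, z)); returns sorted residue IDs."""
    ref_set = set(reference_resids)
    ref_points = [xyz for resid, xyz in atoms if resid in ref_set]
    selected = set()
    for resid, xyz in atoms:
        # A residue qualifies as soon as one of its atoms is close enough.
        if any(math.dist(xyz, ref) <= cutoff for ref in ref_points):
            selected.add(resid)
    if not include_reference:
        selected -= ref_set
    return sorted(selected)
```

Reference residues always satisfy the distance test against themselves, so include_reference=False has to remove them explicitly.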

__init__(reference_selection, cutoff, include_reference=True, frame=0)[source]

select(universe)[source]

Select protein residues near the reference atoms.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]

Bases: MolecularSelector

Select polymer chains from the system.

For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.

Parameters:

chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.
residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.
chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.
segids (list[str], optional) – If provided, select only polymers with these segment IDs.

Notes

The PolyzyMD chain convention is:

Chain A: Protein/Enzyme
Chain B: Substrate/Ligand
Chain C: Polymers
Chain D+: Solvent (water, ions, co-solvents)

For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.

Examples

>>> # PolyzyMD system (recommended)
>>> selector = PolymerChains()  # Uses chain C
>>>
>>> # Non-PolyzyMD system
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"])
>>>
>>> # PolyzyMD system with specific polymer types
>>> selector = PolymerChains(residue_names=["SBM"])  # SBM in chain C only
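
The interaction between chain_id and residue_names described above can be sketched as two independent filters. This is an illustration on plain dicts, not the selector's code: select_polymer is a hypothetical helper, and because the default polymer residue names are not listed in this reference, the sketch requires explicit names when chain_id is None instead of falling back to defaults.

```python
def select_polymer(atoms, chain_id="C", residue_names=None):
    """atoms: list of dicts with "chain" and "resname" keys."""
    if chain_id is None and not residue_names:
        # Simplification: the real selector falls back to default names here.
        raise ValueError("provide residue_names when chain_id is None")
    names = set(residue_names) if residue_names else None
    picked = []
    for atom in atoms:
        if chain_id is not None and atom["chain"] != chain_id:
            continue  # outside the polymer chain
        if names is not None and atom["resname"] not in names:
            continue  # filtered out by residue name
        picked.append(atom)
    return picked


atoms = [
    {"chain": "A", "resname": "ALA"},
    {"chain": "C", "resname": "SBM"},
    {"chain": "C", "resname": "EGM"},
    {"chain": "X", "resname": "SBM"},
]
by_chain = select_polymer(atoms)
by_name = select_polymer(atoms, chain_id=None, residue_names=["SBM"])
sbm_in_c = select_polymer(atoms, residue_names=["SBM"])
```

In this toy system, name-only selection also picks up the SBM residue outside chain C, which is one way name matching can select more than chain-based selection does.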

__init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]

select(universe)[source]

Select polymer atoms/residues.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerResiduesByType(residue_names, exclude=False)[source]

Bases: MolecularSelector

Select polymer residues by monomer type (residue name).

This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.

Parameters:

residue_names (list[str]) – Residue names to select (e.g., [“SBM”, “EGP”] for SBMA-EGMA copolymer)
exclude (bool, optional) – If True, select polymer residues NOT matching these names. Default False.

Examples

>>> # Select SBMA monomers only
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"])
>>>
>>> # Select non-SBMA monomers
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)

__init__(residue_names, exclude=False)[source]

select(universe)[source]

Select polymer residues by type.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]

Bases: MolecularSelector

Select individual segments (residues) within polymer chains.

This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.

Parameters:

residue_names (list[str], optional) – Residue names that identify polymer residues.
chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.
segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.

Notes

A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.

__init__(residue_names=None, chain_index=None, segment_indices=None)[source]

select(universe)[source]

Select polymer segments.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]

Bases: MolecularSelector

Select solvent (water) molecules.

Parameters:

residue_names (list[str], optional) – Residue names for water. Default uses common water names.
exclude_near (str, optional) – Exclude waters within a cutoff of this selection. E.g., “protein” to exclude waters in first hydration shell.
exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.

Examples

>>> # Select all water
>>> selector = SolventMolecules()
>>>
>>> # Select bulk water (exclude first shell around protein)
>>> selector = SolventMolecules(
...     exclude_near="protein",
...     exclude_cutoff=5.0
... )
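
One way to express the exclusion is as an MDAnalysis selection string that combines resname, not, byres, and around. The sketch below only builds such a string; bulk_water_selection is a hypothetical helper, the water names passed in are examples rather than the selector's defaults, and the use of byres to drop whole molecules is an assumption about the intended behavior. The resulting string should be checked against the installed MDAnalysis version.

```python
def bulk_water_selection(residue_names, exclude_near=None, exclude_cutoff=3.0):
    """Build a selection string for water, optionally excluding a shell."""
    base = "resname " + " ".join(residue_names)
    if exclude_near is None:
        return base
    # byres expands the nearby atoms to whole residues before negating,
    # so a water is dropped if any of its atoms lies inside the cutoff.
    shell = f"byres (around {exclude_cutoff} ({exclude_near}))"
    return f"({base}) and not ({shell})"


all_water = bulk_water_selection(["HOH", "WAT"])
bulk = bulk_water_selection(["HOH", "WAT"], exclude_near="protein", exclude_cutoff=5.0)
```

The string can then be passed to Universe.select_atoms, or wrapped in an MDAnalysisSelector with a custom label.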

__init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]

select(universe)[source]

Select water molecules.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.CosolventMolecules(residue_names=None)[source]

Bases: MolecularSelector

Select cosolvent molecules (e.g., DMSO, acetonitrile).

Parameters:: residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.

Examples

>>> # Select DMSO molecules
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])

__init__(residue_names=None)[source]

select(universe)[source]

Select cosolvent molecules.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SubstrateMolecule(residue_name, n_molecules=None)[source]

Bases: MolecularSelector

Select substrate or ligand molecules.

Parameters:

residue_name (str) – Residue name of the substrate.
n_molecules (int, optional) – Expected number of substrate molecules. If provided, validates that exactly this many are found. Default None (no validation).

Examples

>>> # Select resorufin butyrate substrate
>>> selector = SubstrateMolecule(residue_name="RBU")
>>>
>>> # Select single substrate, validate count
>>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)
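
The count check can be pictured as follows. This is a sketch under stated assumptions: check_count is a hypothetical helper, each substrate molecule is assumed to be a single residue, and raising ValueError on a mismatch is an assumption since the reference does not say how the validation failure is reported.

```python
def check_count(residue_names, residue_name, n_molecules=None):
    """residue_names: one name per residue in the system."""
    found = sum(1 for name in residue_names if name == residue_name)
    if n_molecules is not None and found != n_molecules:
        raise ValueError(
            f"expected {n_molecules} {residue_name} molecule(s), found {found}"
        )
    return found
```

Leaving n_molecules as None skips the check entirely, matching the documented default.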

__init__(residue_name, n_molecules=None)[source]

select(universe)[source]

Select substrate molecules.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.IonSelector(residue_names=None, ion_type='all')[source]

Bases: MolecularSelector

Select ion molecules (Na+, Cl-, etc.).

Parameters:

residue_names (list[str], optional) – Residue names for ions. Default includes common ions.
ion_type ({"all", "cations", "anions"}, optional) – Filter to specific ion types. Default “all”.

Examples

>>> # Select all ions
>>> selector = IonSelector()
>>>
>>> # Select only sodium ions
>>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])

DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']

DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']
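
The ion_type filter can be read as a choice among these default lists. The sketch below copies the two lists from above; ion_names is a hypothetical helper, and the rule that explicit residue_names take precedence over ion_type is an assumption, since this reference does not describe how the two parameters interact.

```python
DEFAULT_CATIONS = ["NA", "Na+", "SOD", "K", "K+", "POT", "MG", "Mg2+", "CA", "Ca2+"]
DEFAULT_ANIONS = ["CL", "Cl-", "CLA", "BR", "Br-"]


def ion_names(ion_type="all", residue_names=None):
    """Return the residue names an ion selection would match."""
    if residue_names is not None:
        # Assumption: explicit names override the ion_type defaults.
        return list(residue_names)
    if ion_type == "cations":
        return list(DEFAULT_CATIONS)
    if ion_type == "anions":
        return list(DEFAULT_ANIONS)
    if ion_type == "all":
        return DEFAULT_CATIONS + DEFAULT_ANIONS
    raise ValueError(f"unknown ion_type: {ion_type!r}")
```

Note that "CA" appears among the cation names, so a selection by residue name alone should be restricted to non-protein atoms if alpha-carbon naming could cause confusion in a given topology.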

__init__(residue_names=None, ion_type='all')[source]

select(universe)[source]

Select ion molecules.

property label: str: Short label identifying this selector (for filenames/logging).

Base class for molecular selectors.

This module defines the abstract base class for all molecular selectors, providing a consistent interface for selecting atoms, residues, or groups from an MDAnalysis Universe.

The Strategy pattern allows users to define custom selection logic by subclassing MolecularSelector and implementing the select() method.

class polyzymd.analyses.shared.selectors.base.SelectionResult(atoms, residues, label, metadata=<factory>)[source]

Bases: object

Container for selection results with metadata.

atoms

The selected atoms

Type:: AtomGroup

residues

The residues containing the selected atoms

Type:: ResidueGroup

label

Human-readable label for this selection

Type:: str

metadata

Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)

Type:: dict

atoms: AtomGroup

residues: ResidueGroup

label: str

metadata: dict

property n_atoms: int: Number of selected atoms.

property n_residues: int: Number of selected residues.

property residue_ids: numpy.typing.NDArray[numpy.int64]: 1-indexed residue IDs (PyMOL convention).

property residue_names: list[str]: Residue names for each residue.

__init__(atoms, residues, label, metadata=<factory>)

class polyzymd.analyses.shared.selectors.base.MolecularSelector[source]

Bases: ABC

Abstract base class for molecular selections.

Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.

This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.

Examples

>>> class ActiveSiteSelector(MolecularSelector):
...     def __init__(self, active_site_resids: list[int]):
...         self.resids = active_site_resids
...
...     def select(self, universe: Universe) -> SelectionResult:
...         resid_str = " ".join(str(r) for r in self.resids)
...         atoms = universe.select_atoms(f"resid {resid_str}")
...         return SelectionResult(
...             atoms=atoms,
...             residues=atoms.residues,
...             label="active_site",
...             metadata={"resids": self.resids}
...         )
...
...     @property
...     def label(self) -> str:
...         return "active_site"

abstractmethod select(universe)[source]

Select atoms/residues from a Universe.

Parameters:: universe (Universe) – MDAnalysis Universe to select from
Returns:: Container with selected atoms, residues, and metadata
Return type:: SelectionResult

abstract property label: str: Short label identifying this selector (for filenames/logging).

validate(universe)[source]

Validate the selector against a Universe.

Returns diagnostic information about whether the selection would succeed and what it would select.

Parameters:: universe (Universe) – MDAnalysis Universe to validate against
Returns:: Validation results with keys:

valid: bool
n_atoms: int
n_residues: int
error: str (if invalid)
warnings: list[str]

Return type:: dict

class polyzymd.analyses.shared.selectors.base.MDAnalysisSelector(selection, label=None)[source]

Bases: MolecularSelector

Simple selector using an MDAnalysis selection string.

This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.

Parameters:

selection (str) – MDAnalysis selection string (e.g., “protein”, “resname SBM EGM”, “resid 1-50 and name CA”)
label (str, optional) – Human-readable label. If not provided, uses a sanitized version of the selection string.

Examples

>>> # Select polymer residues by name
>>> selector = MDAnalysisSelector("resname SBM EGM")
>>> result = selector.select(universe)
>>>
>>> # Select protein backbone near ligand
>>> selector = MDAnalysisSelector(
...     "protein and backbone and around 5.0 resname LIG",
...     label="protein_near_ligand"
... )

__init__(selection, label=None)[source]

select(universe)[source]

Select atoms using the MDAnalysis selection string.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.base.CompositeSelector(selectors, mode='union', label=None)[source]

Bases: MolecularSelector

Combines multiple selectors with AND/OR logic.

Useful for complex selections like “protein residues that are both aromatic AND within 5 Å of the active site”.

Parameters:

selectors (list[MolecularSelector]) – List of selectors to combine
mode ({"union", "intersection"}) – How to combine selections: - “union”: Include atoms selected by ANY selector (OR) - “intersection”: Include only atoms selected by ALL selectors (AND)
label (str, optional) – Custom label. If not provided, generates from component labels.

__init__(selectors, mode='union', label=None)[source]

select(universe)[source]

Select atoms using combined selectors.

property label: str: Short label identifying this selector (for filenames/logging).

Protein residue selectors.

This module provides selectors for protein residues:

ProteinResidues: Select all protein residues
ProteinResiduesByGroup: Select protein residues by amino acid classification
ProteinResiduesNearReference: Select residues within cutoff of reference atoms

Examples

>>> # Select all protein residues
>>> selector = ProteinResidues()
>>> result = selector.select(universe)
>>>
>>> # Select aromatic residues only
>>> selector = ProteinResiduesByGroup(
...     grouping=ProteinAAClassification(),
...     groups=["aromatic"]
... )
>>>
>>> # Select residues near catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0
... )

class polyzymd.analyses.shared.selectors.protein.ProteinResidues(selection_modifier=None)[source]

Bases: MolecularSelector

Select all protein residues.

Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.

Parameters:: selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.

__init__(selection_modifier=None)[source]

select(universe)[source]

Select all protein atoms/residues.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.protein.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]

Bases: MolecularSelector

Select protein residues by amino acid group classification.

Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.

Parameters:

grouping (ResidueGrouping) – Classification scheme for amino acids
groups (list[str]) – Names of groups to include (e.g., [“aromatic”, “charged_positive”])
exclude (bool, optional) – If True, select residues NOT in the specified groups. Default False.

Examples

>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Select aromatic residues
>>> grouping = ProteinAAClassification()
>>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"])
>>>
>>> # Select all charged residues
>>> selector = ProteinResiduesByGroup(
...     grouping,
...     groups=["charged_positive", "charged_negative"]
... )

__init__(grouping, groups, exclude=False)[source]

select(universe)[source]

Select protein residues matching the specified groups.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.protein.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]

Bases: MolecularSelector

Select protein residues within a cutoff distance of reference atoms.

Useful for selecting residues near active sites, binding pockets, or other regions of interest.

Parameters:

reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)
cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.
include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.
frame (int, optional) – Frame index used for the distance calculation. Default 0.

Examples

>>> # Select residues within 5A of catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0,
... )
>>>
>>> # Select residues near substrate binding site (not including the site itself)
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resname LIG",
...     cutoff=4.0,
...     include_reference=False,
... )

__init__(reference_selection, cutoff, include_reference=True, frame=0)[source]

select(universe)[source]

Select protein residues near the reference atoms.

property label: str: Short label identifying this selector (for filenames/logging).

Polymer chain and residue selectors.

This module provides selectors for polymer chains and residues:

PolymerChains: Select all polymer chains
PolymerResiduesByType: Select polymer residues by residue name (monomer type)
PolymerSegments: Select individual segments (residues) within polymer chains

For systems built with PolyzyMD, use chain_id=”C” (the default) to select polymers based on the PolyzyMD chain convention:

Chain A: Protein/Enzyme
Chain B: Substrate/Ligand
Chain C: Polymers
Chain D+: Solvent (water, ions, co-solvents)

Examples

>>> # Select polymer chain C (PolyzyMD default)
>>> selector = PolymerChains()
>>> result = selector.select(universe)
>>>
>>> # Select by residue names (for non-PolyzyMD systems)
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGP"])
>>>
>>> # Select specific polymer types within chain C
>>> selector = PolymerResiduesByType(residue_names=["SBM"])

class polyzymd.analyses.shared.selectors.polymer.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]

Bases: MolecularSelector

Select polymer chains from the system.

For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.

Parameters:

chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.
residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.
chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.
segids (list[str], optional) – If provided, select only polymers with these segment IDs.

Notes

The PolyzyMD chain convention is:

Chain A: Protein/Enzyme
Chain B: Substrate/Ligand
Chain C: Polymers
Chain D+: Solvent (water, ions, co-solvents)

For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.

Examples

>>> # PolyzyMD system (recommended)
>>> selector = PolymerChains()  # Uses chain C
>>>
>>> # Non-PolyzyMD system
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"])
>>>
>>> # PolyzyMD system with specific polymer types
>>> selector = PolymerChains(residue_names=["SBM"])  # SBM in chain C only

__init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]

select(universe)[source]

Select polymer atoms/residues.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.polymer.PolymerResiduesByType(residue_names, exclude=False)[source]

Bases: MolecularSelector

Select polymer residues by monomer type (residue name).

This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.

Parameters:

residue_names (list[str]) – Residue names to select (e.g., [“SBM”, “EGP”] for SBMA-EGMA copolymer)
exclude (bool, optional) – If True, select polymer residues NOT matching these names. Default False.

Examples

>>> # Select SBMA monomers only
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"])
>>>
>>> # Select non-SBMA monomers
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)

__init__(residue_names, exclude=False)[source]

select(universe)[source]

Select polymer residues by type.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.polymer.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]

Bases: MolecularSelector

Select individual segments (residues) within polymer chains.

This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.

Parameters:

residue_names (list[str], optional) – Residue names that identify polymer residues.
chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.
segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.

Notes

A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.

__init__(residue_names=None, chain_index=None, segment_indices=None)[source]

select(universe)[source]

Select polymer segments.

property label: str: Short label identifying this selector (for filenames/logging).

Solvent, cosolvent, and substrate selectors.

This module provides selectors for non-protein, non-polymer molecules:

SolventMolecules: Select water molecules
CosolventMolecules: Select cosolvent (e.g., DMSO)
SubstrateMolecule: Select substrate/ligand molecules

Examples

>>> # Select water molecules
>>> selector = SolventMolecules()
>>>
>>> # Select DMSO cosolvent
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
>>>
>>> # Select substrate by residue name
>>> selector = SubstrateMolecule(residue_name="RBU")  # Resorufin butyrate

class polyzymd.analyses.shared.selectors.solvent.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]

Bases: MolecularSelector

Select solvent (water) molecules.

Parameters:

residue_names (list[str], optional) – Residue names for water. Default uses common water names.
exclude_near (str, optional) – Selection string. Waters within exclude_cutoff of this selection are excluded, e.g., “protein” to exclude waters in the first hydration shell.
exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.

Examples

>>> # Select all water
>>> selector = SolventMolecules()
>>>
>>> # Select bulk water (exclude first shell around protein)
>>> selector = SolventMolecules(
...     exclude_near="protein",
...     exclude_cutoff=5.0
... )

__init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]

select(universe)[source]

Select water molecules.

property label: str: Short label identifying this selector (for filenames/logging).
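How exclude_near and exclude_cutoff are turned into a selection is not shown on this page. The following is a minimal sketch, assuming the selector composes an MDAnalysis selection string with the `around` keyword. The helper name `bulk_water_selection` and the water residue names are hypothetical and not part of PolyzyMD.

```python
def bulk_water_selection(water_names, exclude_near=None, exclude_cutoff=3.0):
    """Build an MDAnalysis-style selection string for water (illustrative only)."""
    selection = "resname " + " ".join(water_names)
    if exclude_near is not None:
        # "around D sel" matches atoms within D Angstroms of "sel"; negate it
        # to keep only waters farther away than the cutoff.
        selection = f"({selection}) and not (around {exclude_cutoff} ({exclude_near}))"
    return selection


# Hypothetical water residue names; the selector's real defaults are not listed here.
bulk = bulk_water_selection(["HOH", "WAT"], exclude_near="protein", exclude_cutoff=5.0)
print(bulk)
```

Note that `around` operates on atoms, so a water molecule straddling the cutoff could be split. Wrapping the excluded part in `byres` is one way to keep whole molecules together, if that matters for the analysis.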

class polyzymd.analyses.shared.selectors.solvent.CosolventMolecules(residue_names=None)[source]

Bases: MolecularSelector

Select cosolvent molecules (e.g., DMSO, acetonitrile).

Parameters:: residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.

Examples

>>> # Select DMSO molecules
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])

__init__(residue_names=None)[source]

select(universe)[source]

Select cosolvent molecules.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.solvent.SubstrateMolecule(residue_name, n_molecules=None)[source]

Bases: MolecularSelector

Select substrate or ligand molecules.

Parameters:

residue_name (str) – Residue name of the substrate.
n_molecules (int, optional) – Expected number of substrate molecules. If provided, validates that exactly this many are found. Default None (no validation).

Examples

>>> # Select resorufin butyrate substrate
>>> selector = SubstrateMolecule(residue_name="RBU")
>>>
>>> # Select single substrate, validate count
>>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)

__init__(residue_name, n_molecules=None)[source]

select(universe)[source]

Select substrate molecules.

property label: str: Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.solvent.IonSelector(residue_names=None, ion_type='all')[source]

Bases: MolecularSelector

Select ion molecules (Na+, Cl-, etc.).

Parameters:

residue_names (list[str], optional) – Residue names for ions. Default includes common ions.
ion_type ({"all", "cations", "anions"}, optional) – Filter to specific ion types. Default “all”.

Examples

>>> # Select all ions
>>> selector = IonSelector()
>>>
>>> # Select only sodium ions
>>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])

DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']

DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']

__init__(residue_names=None, ion_type='all')[source]

select(universe)[source]

Select ion molecules.

property label: str: Short label identifying this selector (for filenames/logging).

Residue grouping abstractions for analysis plugins.

This module provides classification systems for residues:

ResidueGrouping: Abstract base class for residue classification
ProteinAAClassification: Standard amino acid classification
CustomGrouping: User-defined classification scheme

Examples

>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Classify amino acids
>>> grouping = ProteinAAClassification()
>>> print(grouping.classify("PHE"))  # "aromatic"
>>> print(grouping.classify("LYS"))  # "charged_positive"
>>>
>>> # Get all residues in a group
>>> aromatics = grouping.get_residues_in_group("aromatic")
>>> # Returns: ["PHE", "TRP", "TYR", "HIS"]

class polyzymd.analyses.shared.groupings.ResidueGrouping[source]

Bases: ABC

Abstract base class for residue classification schemes.

Subclasses must implement classify() to map residue names to group labels.

Examples

>>> class MyPolymerGrouping(ResidueGrouping):
...     def classify(self, resname: str) -> str:
...         if resname in ["SBM", "SBMA"]:
...             return "zwitterionic"
...         elif resname in ["EGP", "EGMA"]:
...             return "hydrophilic"
...         return "unknown"
...
...     @property
...     def available_groups(self) -> list[str]:
...         return ["zwitterionic", "hydrophilic", "unknown"]

abstractmethod classify(resname)[source]

Classify a residue name into a group.

Parameters:: resname (str) – Residue name (3-letter code for amino acids)
Returns:: Group label for this residue type
Return type:: str

abstract property available_groups: list[str]: List of all group labels in this classification scheme.

get_residues_in_group(group)[source]

Get all residue names that belong to a group.

Parameters:: group (str) – Group label
Returns:: Residue names in this group
Return type:: list[str]
Raises:: ValueError – If group is not in available_groups

to_dict()[source]

Serialize grouping scheme to dictionary.

class polyzymd.analyses.shared.groupings.ProteinAAClassification(include_his_aromatic=True)[source]

Bases: ResidueGrouping

Standard amino acid classification.

Groups amino acids into:

aromatic: PHE, TRP, TYR, HIS
charged_positive: ARG, LYS
charged_negative: ASP, GLU
polar: ASN, CYS, GLN, SER, THR
nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL

This classification matches the scaffold notebooks and common biochemistry conventions.

Parameters:: include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.

Examples

>>> grouping = ProteinAAClassification()
>>> grouping.classify("PHE")
'aromatic'
>>> grouping.classify("LYS")
'charged_positive'
>>> grouping.get_residues_in_group("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']

__init__(include_his_aromatic=True)[source]

classify(resname)[source]

Classify amino acid by residue name.

property available_groups: list[str]: List of all group labels in this classification scheme.

get_charged_groups()[source]

Convenience: get both charged group names.

get_hydrophobic_groups()[source]

Convenience: groups typically considered hydrophobic.

get_hydrophilic_groups()[source]

Convenience: groups typically considered hydrophilic.

class polyzymd.analyses.shared.groupings.CustomGrouping(classification, default_group='other')[source]

Bases: ResidueGrouping

User-defined residue classification.

Allows arbitrary mapping from residue names to group labels.

Parameters:

classification (dict[str, str]) – Mapping from residue name to group label.
default_group (str, optional) – Group label for unclassified residues. Default “other”.

Examples

>>> # Custom polymer classification
>>> grouping = CustomGrouping({
...     "SBM": "zwitterionic",
...     "SBMA": "zwitterionic",
...     "EGP": "peg_like",
...     "EGMA": "peg_like",
... }, default_group="unknown")
>>> grouping.classify("SBM")
'zwitterionic'

__init__(classification, default_group='other')[source]

classify(resname)[source]

Classify residue by name using custom mapping.

property available_groups: list[str]: List of all group labels in this classification scheme.

classmethod from_groups(groups, default_group='other')[source]

Create grouping from group -> residue list mapping.

Parameters:

groups (dict[str, list[str]]) – Mapping from group name to list of residue names
default_group (str) – Group for unlisted residues

Return type:

CustomGrouping

Examples

>>> grouping = CustomGrouping.from_groups({
...     "zwitterionic": ["SBM", "SBMA"],
...     "peg_like": ["EGP", "EGMA", "OEGMA"],
... })
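The two construction styles differ only in the direction of the mapping: the constructor takes residue name -> group, while from_groups takes group -> residue list. The sketch below shows that inversion on plain dictionaries. `invert_groups` is a hypothetical stand-in, and rejecting a residue listed under two groups is an assumption, not documented behavior.

```python
def invert_groups(groups: dict[str, list[str]]) -> dict[str, str]:
    """Turn {group: [resnames]} into {resname: group} (illustrative only)."""
    classification: dict[str, str] = {}
    for group, residues in groups.items():
        for resname in residues:
            if resname in classification:
                # Assumption: an ambiguous residue is treated as an error.
                raise ValueError(f"{resname} is listed in more than one group")
            classification[resname] = group
    return classification


mapping = invert_groups({
    "zwitterionic": ["SBM", "SBMA"],
    "peg_like": ["EGP", "EGMA", "OEGMA"],
})
# Unlisted residues fall back to a default label, mirroring default_group.
label = mapping.get("XYZ", "other")
```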

Base classes for residue grouping/classification.

This module provides the abstract base class for residue classification schemes and concrete implementations for protein amino acids.

The Strategy pattern allows users to define custom classification schemes for polymers, modified residues, or other systems.

class polyzymd.analyses.shared.groupings.base.ResidueGrouping[source]

Bases: ABC

Abstract base class for residue classification schemes.

Subclasses must implement classify() to map residue names to group labels.

Examples

>>> class MyPolymerGrouping(ResidueGrouping):
...     def classify(self, resname: str) -> str:
...         if resname in ["SBM", "SBMA"]:
...             return "zwitterionic"
...         elif resname in ["EGP", "EGMA"]:
...             return "hydrophilic"
...         return "unknown"
...
...     @property
...     def available_groups(self) -> list[str]:
...         return ["zwitterionic", "hydrophilic", "unknown"]

abstractmethod classify(resname)[source]

Classify a residue name into a group.

Parameters:: resname (str) – Residue name (3-letter code for amino acids)
Returns:: Group label for this residue type
Return type:: str

abstract property available_groups: list[str]: List of all group labels in this classification scheme.

get_residues_in_group(group)[source]

Get all residue names that belong to a group.

Parameters:: group (str) – Group label
Returns:: Residue names in this group
Return type:: list[str]
Raises:: ValueError – If group is not in available_groups

to_dict()[source]

Serialize grouping scheme to dictionary.

class polyzymd.analyses.shared.groupings.base.ProteinAAClassification(include_his_aromatic=True)[source]

Bases: ResidueGrouping

Standard amino acid classification.

Groups amino acids into:

aromatic: PHE, TRP, TYR, HIS
charged_positive: ARG, LYS
charged_negative: ASP, GLU
polar: ASN, CYS, GLN, SER, THR
nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL

This classification matches the scaffold notebooks and common biochemistry conventions.

Parameters:: include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.

Examples

>>> grouping = ProteinAAClassification()
>>> grouping.classify("PHE")
'aromatic'
>>> grouping.classify("LYS")
'charged_positive'
>>> grouping.get_residues_in_group("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']

__init__(include_his_aromatic=True)[source]

classify(resname)[source]

Classify amino acid by residue name.

property available_groups: list[str]: List of all group labels in this classification scheme.

get_charged_groups()[source]

Convenience: get both charged group names.

get_hydrophobic_groups()[source]

Convenience: groups typically considered hydrophobic.

get_hydrophilic_groups()[source]

Convenience: groups typically considered hydrophilic.

class polyzymd.analyses.shared.groupings.base.CustomGrouping(classification, default_group='other')[source]

Bases: ResidueGrouping

User-defined residue classification.

Allows arbitrary mapping from residue names to group labels.

Parameters:

classification (dict[str, str]) – Mapping from residue name to group label.
default_group (str, optional) – Group label for unclassified residues. Default “other”.

Examples

>>> # Custom polymer classification
>>> grouping = CustomGrouping({
...     "SBM": "zwitterionic",
...     "SBMA": "zwitterionic",
...     "EGP": "peg_like",
...     "EGMA": "peg_like",
... }, default_group="unknown")
>>> grouping.classify("SBM")
'zwitterionic'

__init__(classification, default_group='other')[source]

classify(resname)[source]

Classify residue by name using custom mapping.

property available_groups: list[str]: List of all group labels in this classification scheme.

classmethod from_groups(groups, default_group='other')[source]

Create grouping from group -> residue list mapping.

Parameters:

groups (dict[str, list[str]]) – Mapping from group name to list of residue names
default_group (str) – Group for unlisted residues

Return type:

CustomGrouping

Examples

>>> grouping = CustomGrouping.from_groups({
...     "zwitterionic": ["SBM", "SBMA"],
...     "peg_like": ["EGP", "EGMA", "OEGMA"],
... })

Amino acid classification and SASA reference data.

This module provides centralized reference data for amino acid properties:

Maximum accessible surface area (maxASA) from Tien et al. 2013
Standard amino acid classification by physicochemical properties
Default MDAnalysis selection strings for each AA class

These constants are used by:

Protein grouping in contact analysis
Template generation for analysis configs

References

Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635. doi: 10.1371/journal.pone.0080635. PMID: 24278298; PMCID: PMC3836772.

class polyzymd.analyses.shared.aa_classification.AAClass(value)[source]

Bases: str, Enum

Standard amino acid classifications.

AROMATIC = 'aromatic'

POLAR = 'polar'

NONPOLAR = 'nonpolar'

CHARGED_POSITIVE = 'charged_positive'

CHARGED_NEGATIVE = 'charged_negative'

UNKNOWN = 'unknown'
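Because the enum also derives from str, its members compare equal to plain strings and serialize as strings. The sketch below uses a reduced stand-in enum (`AAClassSketch`, hypothetical) to show this standard Python behavior without importing PolyzyMD.

```python
import json
from enum import Enum


class AAClassSketch(str, Enum):
    """Reduced stand-in with the same (str, Enum) bases as AAClass."""

    AROMATIC = "aromatic"
    POLAR = "polar"
    UNKNOWN = "unknown"


# Members compare equal to their string values.
matches_plain_string = AAClassSketch.AROMATIC == "aromatic"

# Lookup by value returns the existing member.
member = AAClassSketch("polar")

# str-based members are written as ordinary JSON strings.
encoded = json.dumps({"class": AAClassSketch.AROMATIC})
```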

polyzymd.analyses.shared.aa_classification.get_aa_class(resname)[source]

Get amino acid classification for a residue name.

Parameters:: resname (str) – 3-letter amino acid code (case-insensitive)
Returns:: Classification: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’, or ‘unknown’
Return type:: str

Examples

>>> get_aa_class("PHE")
'aromatic'
>>> get_aa_class("lys")
'charged_positive'
>>> get_aa_class("UNK")
'unknown'

polyzymd.analyses.shared.aa_classification.get_max_asa(resname)[source]

Get maximum accessible surface area for a residue name.

Parameters:: resname (str) – 3-letter amino acid code (case-insensitive)
Returns:: Maximum ASA in Angstrom^2, or None if residue not in table
Return type:: float or None

Examples

>>> get_max_asa("ALA")
121.0
>>> get_max_asa("TRP")
264.0
>>> get_max_asa("UNK")  # Returns None for unknown residues
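A common use of maxASA values is to normalize an absolute SASA into a relative solvent accessibility. The sketch below assumes that use case. It relies on a two-entry stand-in table holding only the values shown in the examples above, and `relative_sasa` is a hypothetical helper rather than a PolyzyMD function.

```python
from typing import Optional

# Stand-in table with only the two values shown in the examples above.
MAX_ASA_SKETCH = {"ALA": 121.0, "TRP": 264.0}


def relative_sasa(resname: str, sasa: float) -> Optional[float]:
    """Return SASA divided by maxASA, or None for residues not in the table."""
    max_asa = MAX_ASA_SKETCH.get(resname.upper())  # case-insensitive lookup
    if max_asa is None:
        return None
    return sasa / max_asa


half_exposed = relative_sasa("ala", 60.5)   # 60.5 / 121.0
unknown = relative_sasa("UNK", 10.0)        # not in the table
```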

polyzymd.analyses.shared.aa_classification.get_residues_for_class(aa_class)[source]

Get all residue names belonging to an amino acid class.

Parameters:: aa_class (str) – One of: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’
Returns:: List of 3-letter amino acid codes in this class
Return type:: list[str]
Raises:: ValueError – If aa_class is not a valid classification

Examples

>>> get_residues_for_class("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']

polyzymd.analyses.shared.aa_classification.get_selection_for_class(aa_class)[source]

Get MDAnalysis selection string for an amino acid class.

Parameters:: aa_class (str) – One of: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’
Returns:: MDAnalysis selection string
Return type:: str
Raises:: ValueError – If aa_class is not a valid classification

Examples

>>> get_selection_for_class("aromatic")
'protein and resname PHE TRP TYR HIS'
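The selection string in the example is consistent with joining the class's residue names after a `protein and resname` prefix. A minimal sketch of that construction follows. `selection_from_residues` is a hypothetical helper, and the library may build the string differently.

```python
def selection_from_residues(residues: list[str]) -> str:
    """Join residue names into an MDAnalysis-style selection string."""
    if not residues:
        raise ValueError("at least one residue name is required")
    return "protein and resname " + " ".join(residues)


aromatic_selection = selection_from_residues(["PHE", "TRP", "TYR", "HIS"])
```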

Diagnostics and path helpers

Diagnostics helpers validate selections and analysis inputs. The module is polyzymd.analyses.shared.diagnostics. Path helpers standardize artifact-oriented file locations used by analysis plugins.

Path utilities for the analysis plugin system.

polyzymd.analyses.shared.paths.sanitize_label(label)[source]

Convert a condition label to a filesystem-safe directory name.

Replaces % with pct, spaces with underscores, strips remaining non-alphanumeric chars (except hyphens, underscores, dots), and collapses consecutive underscores.

Parameters:: label (str) – Original condition label.
Returns:: Filesystem-safe label.
Return type:: str
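The rules above can be sketched with a few string and regex operations. This is an illustrative re-implementation under stated assumptions (ASCII alphanumerics only, rules applied in the order listed), not the library's code. `sanitize_label_sketch` is a hypothetical name.

```python
import re


def sanitize_label_sketch(label: str) -> str:
    """Apply the documented replacement rules in order (illustrative only)."""
    text = label.replace("%", "pct")              # "%" becomes "pct"
    text = text.replace(" ", "_")                 # spaces become underscores
    text = re.sub(r"[^A-Za-z0-9._-]", "", text)   # drop remaining unsafe characters
    text = re.sub(r"_+", "_", text)               # collapse consecutive underscores
    return text


safe = sanitize_label_sketch("50% SBMA / EGMA (1:1)")
print(safe)  # this sketch prints 50pct_SBMA_EGMA_11
```

The real function may differ in edge cases such as leading or trailing underscores or non-ASCII letters.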

polyzymd.analyses.shared.paths.format_replicate_cache_token(replicates)[source]

Format replicate IDs for cache filenames without range collisions.

Contiguous replicate IDs are compacted as a range, while non-contiguous IDs are listed explicitly so (1, 3) cannot collide with (1, 2, 3).

Parameters:: replicates (Sequence[int]) – Iterable of replicate IDs.
Returns:: Cache-safe token such as "reps1-3" or "reps1_3".
Return type:: str
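The collision-avoidance rule can be sketched as follows. `replicate_token_sketch` is a hypothetical stand-in. Sorting and de-duplicating the input, rejecting an empty input, and the single-replicate form are assumptions not stated above.

```python
from collections.abc import Sequence


def replicate_token_sketch(replicates: Sequence[int]) -> str:
    """Compact contiguous IDs as a range; list non-contiguous IDs explicitly."""
    ids = sorted(set(replicates))
    if not ids:
        raise ValueError("at least one replicate ID is required")  # assumption
    contiguous = ids[-1] - ids[0] + 1 == len(ids)
    if contiguous and len(ids) > 1:
        return f"reps{ids[0]}-{ids[-1]}"
    # Non-contiguous (or single) IDs are joined with underscores.
    return "reps" + "_".join(str(i) for i in ids)


range_token = replicate_token_sketch((1, 2, 3))
explicit_token = replicate_token_sketch((1, 3))
```

Using different separators for the range form and the explicit form is what keeps (1, 3) and (1, 2, 3) from mapping to the same filename.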

Multi-run comparison and formatting

Multi-run helpers support plugins that compare several named runs or entities per condition, such as RMSD, radius of gyration, and SASA analyses.

Shared helpers for multi-run comparison orchestration.

These helpers keep run-wise comparison logic concise across plugins that compare multiple named runs (RMSD, Rg, SASA).

polyzymd.analyses.shared.multi_run_comparison.filter_summaries_with_run(summaries, run_label, get_run_fn, logger=None)[source]

Filter condition summaries to those containing a specific run.

Parameters:

summaries (dict[str, Any]) – Mapping from condition label to condition summary.
run_label (str) – Run label to keep.
get_run_fn (Callable[[Any, str], Any]) – Callback that returns run summary for (summary, run_label) and raises KeyError when the run is missing.
logger (logging.Logger | None, optional) – Optional logger for missing-run warnings.

Returns:

Subset of summaries with run data available.

Return type:

dict[str, Any]
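The get_run_fn contract (return the run summary or raise KeyError) suggests a simple try/except filter. The sketch below follows that contract with toy dictionaries. `filter_with_run_sketch`, the warning text, and the data layout are hypothetical.

```python
import logging
from typing import Any, Callable, Optional


def filter_with_run_sketch(
    summaries: dict[str, Any],
    run_label: str,
    get_run_fn: Callable[[Any, str], Any],
    logger: Optional[logging.Logger] = None,
) -> dict[str, Any]:
    """Keep only the conditions for which get_run_fn does not raise KeyError."""
    kept: dict[str, Any] = {}
    for condition, summary in summaries.items():
        try:
            get_run_fn(summary, run_label)
        except KeyError:
            if logger is not None:
                logger.warning("Condition %r has no run %r", condition, run_label)
            continue
        kept[condition] = summary
    return kept


# Toy summaries: plain dicts keyed by run label (hypothetical layout).
toy = {"control": {"backbone": 1.2, "ligand": 0.8}, "50pct_SBMA": {"backbone": 1.0}}
kept = filter_with_run_sketch(toy, "ligand", lambda summary, run: summary[run])
```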

polyzymd.analyses.shared.multi_run_comparison.build_condition_pairs(condition_labels, control_label, on_control_missing='all_pairs', logger=None)[source]

Build pairwise condition pairs for comparison.

Parameters:

condition_labels (list[str]) – Ordered condition labels to compare.
control_label (str | None) – Preferred control label for control-vs-treatment comparisons.
on_control_missing (str, optional) – Behavior when control_label is requested but unavailable. Supported values: "all_pairs" (fall back to all-vs-all) and "skip" (return no pairs).
logger (logging.Logger | None, optional) – Optional logger for fallback/skip messages.

Returns:

Pair list as (condition_a, condition_b) tuples.

Return type:

list[tuple[str, str]]

Raises:

ValueError – Raised when on_control_missing is not "all_pairs" or "skip".
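The pairing behavior described above can be sketched as shown below. `build_pairs_sketch` is a hypothetical stand-in. Treating control_label=None as all-vs-all and placing the control first in each pair are assumptions, not documented behavior.

```python
import itertools
from typing import Optional


def build_pairs_sketch(
    condition_labels: list[str],
    control_label: Optional[str],
    on_control_missing: str = "all_pairs",
) -> list[tuple[str, str]]:
    """Build control-vs-treatment pairs, or fall back as configured."""
    if on_control_missing not in ("all_pairs", "skip"):
        raise ValueError(f"unsupported on_control_missing: {on_control_missing!r}")
    all_pairs = list(itertools.combinations(condition_labels, 2))
    if control_label is None:
        return all_pairs  # assumption: no control requested means all-vs-all
    if control_label in condition_labels:
        return [(control_label, c) for c in condition_labels if c != control_label]
    # Control was requested but is not among the available conditions.
    return all_pairs if on_control_missing == "all_pairs" else []


labels = ["no_polymer", "sbma", "egma"]
with_control = build_pairs_sketch(labels, "no_polymer")
fallback = build_pairs_sketch(labels, "missing")
skipped = build_pairs_sketch(labels, "missing", on_control_missing="skip")
```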

polyzymd.analyses.shared.multi_run_comparison.apply_fdr_correction(pairwise_results, anova_by_run=None, fdr_alpha=0.05, get_p_value=None, set_corrected=None)[source]

Apply Benjamini-Hochberg FDR correction across statistical result families.

Parameters:

pairwise_results (list[Any]) – Pairwise comparison result objects.
anova_by_run (dict[Any, Any] | list[Any] | None, optional) – ANOVA result objects, as either list-like or dict-like container.
fdr_alpha (float, optional) – FDR threshold.
get_p_value (Callable[[Any], float | None] | None, optional) – Callback extracting raw p-value from a result object. Defaults to reading .p_value.
set_corrected (Callable[[Any, Any], None] | None, optional) – Callback applying BH output to each result object. Defaults to setting .p_value_adjusted (when available) and .significant.
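For readers unfamiliar with the Benjamini-Hochberg procedure, the sketch below computes BH-adjusted p-values for a flat list in pure Python. It does not reproduce how this function pools pairwise and ANOVA results or handles missing p-values. `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05):
    """Return (adjusted_p_values, significant_flags) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    significant = [p <= alpha for p in adjusted]
    return adjusted, significant


adjusted, significant = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

With these four inputs only the first test remains significant at alpha = 0.05: its adjusted value is 0.04, while 0.03 and 0.04 both adjust to roughly 0.053.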

Shared formatting helpers for multi-run analysis outputs.

polyzymd.analyses.shared.multi_run_formatting.is_sem_estimable(n_replicates)[source]

Return whether SEM can be estimated from replicate-level values.

Parameters:: n_replicates (int) – Number of replicate values contributing to the summary.
Returns:: True when at least two replicates are available.
Return type:: bool

polyzymd.analyses.shared.multi_run_formatting.format_sem_value(sem, n_replicates, *, precision=2, unit='')[source]

Format SEM without implying singleton uncertainty is estimable.

Parameters:

sem (float | None) – SEM value to display when enough replicates are available.
n_replicates (int) – Number of replicates contributing to the summary.
precision (int, optional) – Decimal places for numeric SEM values, by default 2.
unit (str, optional) – Unit suffix appended to numeric SEM values, by default "".

Returns:

"n/a" for singleton summaries, otherwise a formatted SEM value.

Return type:

str

polyzymd.analyses.shared.multi_run_formatting.format_sem_phrase(sem, n_replicates, *, precision=2, unit='')[source]

Format a compact "SEM: ..." phrase for summaries.

Parameters:

sem (float | None) – SEM value to display when enough replicates are available.
n_replicates (int) – Number of replicates contributing to the summary.
precision (int, optional) – Decimal places for numeric SEM values, by default 2.
unit (str, optional) – Unit suffix appended to numeric SEM values, by default "".

Returns:

"SEM: n/a (single replicate)" for singleton summaries, otherwise a numeric SEM phrase.

Return type:

str
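The singleton handling shared by the SEM helpers can be sketched as follows. The function names are hypothetical stand-ins. The exact numeric layout, the placement of the unit suffix, and returning "n/a" for a missing SEM value are assumptions, since only the singleton strings are documented above.

```python
from typing import Optional


def sem_value_sketch(
    sem: Optional[float], n_replicates: int, *, precision: int = 2, unit: str = ""
) -> str:
    """Return 'n/a' when SEM is not estimable, else a fixed-precision number."""
    if n_replicates < 2 or sem is None:  # a missing value is treated as n/a (assumption)
        return "n/a"
    return f"{sem:.{precision}f}{unit}"


def sem_phrase_sketch(
    sem: Optional[float], n_replicates: int, *, precision: int = 2, unit: str = ""
) -> str:
    """Wrap the SEM value in a short 'SEM: ...' phrase."""
    if n_replicates < 2:
        return "SEM: n/a (single replicate)"
    value = sem_value_sketch(sem, n_replicates, precision=precision, unit=unit)
    return f"SEM: {value}"


single = sem_phrase_sketch(0.5, 1)
triplicate = sem_phrase_sketch(0.1234, 3, unit=" nm")
```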

polyzymd.analyses.shared.multi_run_formatting.make_section_title(title, width)[source]

Build a section title and separator lines.

polyzymd.analyses.shared.multi_run_formatting.make_ranked_table_header(*, mean_label)[source]

Build standard ranked-table headers for text output.

polyzymd.analyses.shared.multi_run_formatting.make_ranked_markdown_header(*, mean_label)[source]

Build standard ranked-table headers for markdown output.

polyzymd.analyses.shared.multi_run_formatting.format_pairwise_line(*, condition_a, condition_b, direction, p_value, effect_size, effect_label, percent_change, significant, prefix='Pairwise')[source]

Format one standard pairwise comparison line.

polyzymd.analyses.shared.multi_run_formatting.format_anova_line(*, f_statistic, p_value, significant)[source]

Format one standard ANOVA line.

polyzymd.analyses.shared.multi_run_formatting.format_markdown_bullet(prefix, line)[source]

Format a markdown bullet line with consistent prefixing.

polyzymd.analyses.shared.multi_run_formatting.make_ranked_rows(ranking, get_values)[source]

Build ranked rows as (label, mean, sem, rank) tuples.