Analysis Shared Utilities

API reference for polyzymd.analyses.shared and its commonly used submodules.

polyzymd.analyses.shared

Shared infrastructure for analysis plugins.

This package provides a curated set of commonly used shared utilities. For advanced or less common helpers, import directly from the relevant shared submodules.

In particular, selectors, custom selections, and result I/O helpers are not re-exported from this package root and should be imported from:

  • polyzymd.analyses.shared.selectors

  • polyzymd.analyses.shared.selections

  • polyzymd.analyses.shared.result_io

Sub-modules

loader

Trajectory loading, time parsing, frame conversion.

alignment

Trajectory alignment (centroid/average/frame/external).

centroid

K-Means centroid frame finding, reference mode dispatch.

statistics

SEM, per-residue/region aggregation, weighted mean.

aggregation

Replicate collection, distance pair aggregation.

autocorrelation

ACF, correlation time, statistical inefficiency.

pbc

Minimum image distance, PBC-aware distance matrix.

selections

Extended selection syntax (midpoint, COM), position retrieval.

diagnostics

Selection diagnostics, equilibration validation.

config_hash

Config hashing for cache validation.

constants

Shared default values (cutoffs, thresholds).

defaults

AnalysisDefaults model (equilibration_time).

logging_utils

Colored terminal logging.

plotting

Shared plotting helpers (axis styling, figure saving, grouped bars, etc.).

class polyzymd.analyses.shared.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)[source]

Bases: object

Information about discovered trajectory files.

topology_file

Path to topology file (PDB)

Type:

Path

trajectory_files

List of trajectory files (DCD) in order

Type:

list[Path]

n_segments

Number of daisy-chain segments

Type:

int

working_directory

Base working directory for this replicate

Type:

Path

replicate

Replicate number

Type:

int

topology_file: Path
trajectory_files: list[Path]
n_segments: int = 0
working_directory: Path
replicate: int = 1
property n_trajectory_files: int

Number of trajectory files found.

validate()[source]

Validate that all files exist.

__init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)
class polyzymd.analyses.shared.TrajectoryLoader(config, engine_override=None)[source]

Bases: object

Config-aware trajectory loader for PolyzyMD simulations.

This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:

  • Daisy-chain continuation segments (OpenMM)

  • Flat production directories (GROMACS)

  • Scratch vs projects directory resolution

  • Multiple replicates

File discovery is delegated to the simulation engine resolved from the config’s engine field. The engine is created lazily on the first call that needs it, so construction remains cheap.

Parameters:
  • config (SimulationConfig) – PolyzyMD simulation configuration.

  • engine_override (str or None, optional) – Force a specific engine name ("openmm" or "gromacs") instead of reading config.engine.

Examples

>>> from polyzymd.config import load_config
>>> config = load_config("config.yaml")
>>> loader = TrajectoryLoader(config)
>>>
>>> # Load single replicate
>>> u = loader.load_universe(replicate=1)
>>> print(f"Loaded {len(u.trajectory)} frames")
>>>
>>> # Get trajectory info without loading
>>> info = loader.get_trajectory_info(replicate=1)
>>> print(f"Found {info.n_segments} segments")
>>>
>>> # Load multiple replicates
>>> for rep in range(1, 6):
...     u = loader.load_universe(replicate=rep)
...     # ... analyze
>>>
>>> # Explicit engine override for GROMACS directories
>>> loader = TrajectoryLoader(config, engine_override="gromacs")

Notes

Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).

__init__(config, engine_override=None)[source]
get_trajectory_info(replicate)[source]

Get trajectory file information for a replicate.

Parameters:

replicate (int) – Replicate number (1-indexed)

Returns:

Information about discovered trajectory files

Return type:

TrajectoryInfo

Raises:

FileNotFoundError – If working directory or required files don’t exist

load_universe(replicate, cache=True)[source]

Load MDAnalysis Universe for a replicate.

Parameters:
  • replicate (int) – Replicate number (1-indexed)

  • cache (bool, optional) – If True (default), cache the Universe for reuse

Returns:

MDAnalysis Universe with trajectory loaded

Return type:

Universe

Notes

For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.

iter_replicates(replicates)[source]

Iterate over multiple replicates.

Parameters:

replicates (sequence of int) – Replicate numbers to load

Yields:

tuple of (int, Universe) – Replicate number and loaded Universe

Examples

>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]):
...     rmsf = compute_rmsf(u)
...     results[rep] = rmsf
get_frame_times(replicate, unit='ns')[source]

Get time values for each frame.

Parameters:
  • replicate (int) – Replicate number

  • unit (str, optional) – Time unit for output. Options: “ps”, “ns”. Default is “ns”.

Returns:

Array of time values for each frame

Return type:

NDArray[np.float64]

get_timestep(replicate, unit='ps')[source]

Get the trajectory timestep (time between frames).

Parameters:
  • replicate (int) – Replicate number

  • unit (str, optional) – Time unit. Options: “ps”, “ns”. Default is “ps”.

Returns:

Time between consecutive frames

Return type:

float

clear_cache()[source]

Clear the Universe cache to free memory.

find_topology(working_dir)[source]

Find topology file in working directory.

Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS, solvated_system.pdb preference for OpenMM).

This method is used by several plugins that pass an explicit working_dir unrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to 1.

Parameters:

working_dir (Path) – Directory to search for topology files.

Returns:

Path to the topology file.

Return type:

Path

Raises:

FileNotFoundError – If no topology file is found.

polyzymd.analyses.shared.parse_time_string(time_str)[source]

Parse a time string with units into value and unit.

Parameters:

time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.

Returns:

Numeric value and unit string

Return type:

tuple of (float, str)

Examples

>>> parse_time_string("100ns")
(100.0, "ns")
>>> parse_time_string("5000 ps")
(5000.0, "ps")
>>> parse_time_string("100")  # Default to ns
(100.0, "ns")
polyzymd.analyses.shared.convert_time(value, from_unit, to_unit)[source]

Convert time between units.

Parameters:
  • value (float) – Time value

  • from_unit (str) – Source unit (“fs”, “ps”, “ns”)

  • to_unit (str) – Target unit (“fs”, “ps”, “ns”)

Returns:

Converted time value

Return type:

float

polyzymd.analyses.shared.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]

Convert time to frame index.

Parameters:
  • time (float) – Time value

  • time_unit (str) – Unit of time value

  • timestep (float) – Time between frames

  • timestep_unit (str) – Unit of timestep (default: “ps”)

Returns:

Frame index (0-indexed)

Return type:

int

class polyzymd.analyses.shared.AlignmentConfig(*, enabled=True, reference_mode='centroid', reference_frame=None, selection='protein and name CA', centroid_selection='protein', reference_file=None)[source]

Bases: BaseModel

Configuration for trajectory alignment.

This model defines how a trajectory should be aligned before analysis. By default, alignment is ENABLED to remove rotational drift and COM motion that can add noise to distance measurements.

enabled

Whether to align the trajectory before analysis. Default is True. When enabled, the user is notified via INFO-level logging.

Type:

bool

reference_mode

How to select the reference structure:

  • “centroid”: Representative aligned frame (closest to aligned mean)

  • “average”: Mathematical average structure

  • “frame”: Specific frame number

Type:

ReferenceMode

reference_frame

Frame number (1-indexed) when reference_mode=”frame”. Required when mode is “frame”, ignored otherwise.

Type:

int | None

selection

MDAnalysis selection string for superposition. Atoms matching this selection are used to compute the optimal rotation/translation. Default: “protein and name CA” (backbone alpha-carbons).

Type:

str

centroid_selection

Selection for representative-frame detection when mode=”centroid”. Default: “protein” (all protein atoms for better discrimination).

Type:

str

Examples

>>> # Default: align to centroid using CA atoms
>>> config = AlignmentConfig()
>>> # Align to a specific frame
>>> config = AlignmentConfig(
...     reference_mode="frame",
...     reference_frame=500,
... )
>>> # Disable alignment (not recommended for most analyses)
>>> config = AlignmentConfig(enabled=False)
enabled: bool
reference_mode: ReferenceMode
reference_frame: int | None
selection: str
centroid_selection: str
reference_file: Path | None
validate_reference_params()[source]

Validate reference_frame and reference_file for their modes.

to_dict()[source]

Convert to dictionary for cache key hashing.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.align_trajectory(universe, config, start_frame=0, stop_frame=None)[source]

Align trajectory in-memory to a reference structure.

This function performs trajectory alignment using MDAnalysis, removing rotational drift and center-of-mass motion. The alignment is performed in-place, modifying the Universe’s coordinates.

Parameters:
  • universe (Universe) – MDAnalysis Universe to align. Will be modified in-place.

  • config (AlignmentConfig) – Alignment configuration specifying reference mode and selections.

  • start_frame (int, optional) – First frame to consider (0-indexed). Default is 0. Frames before this are typically equilibration and should be excluded.

  • stop_frame (int, optional) – Last frame (exclusive). If None, uses all frames after start_frame.

Returns:

Reference frame index (0-indexed), or None if using average mode. This can be stored in results for reproducibility.

Return type:

int | None

Raises:
  • ImportError – If MDAnalysis is not available.

  • ValueError – If reference_mode is “frame” but the frame number is out of range.

Notes

When alignment is performed, an INFO-level log message is emitted to notify the user. This ensures users are aware that their trajectory coordinates have been modified.

Examples

>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory
>>>
>>> config = AlignmentConfig(reference_mode="centroid")
>>> ref_frame = align_trajectory(universe, config, start_frame=100)
>>> print(f"Aligned to frame {ref_frame}")
polyzymd.analyses.shared.get_alignment_description(config)[source]

Generate a human-readable description of the alignment configuration.

Parameters:

config (AlignmentConfig) – Alignment configuration.

Returns:

Description suitable for documentation or result files.

Return type:

str

class polyzymd.analyses.shared.StatResult(mean, sem, n_samples)[source]

Bases: object

Container for mean +/- SEM results.

mean

The mean value

Type:

float

sem

Standard error of the mean

Type:

float

n_samples

Number of samples used in computation

Type:

int

mean: float
sem: float
n_samples: int
to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(mean, sem, n_samples)
class polyzymd.analyses.shared.PerResidueStats(residue_ids, means, sems, n_replicates)[source]

Bases: object

Container for per-residue statistics across replicates.

residue_ids

Residue identifiers (1-indexed, following PyMOL convention)

Type:

NDArray[np.int64]

means

Mean value for each residue across replicates

Type:

NDArray[np.float64]

sems

SEM for each residue across replicates

Type:

NDArray[np.float64]

n_replicates

Number of replicates aggregated

Type:

int

residue_ids: numpy.typing.NDArray.numpy.int64
means: numpy.typing.NDArray.numpy.float64
sems: numpy.typing.NDArray.numpy.float64
n_replicates: int
to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(residue_ids, means, sems, n_replicates)
polyzymd.analyses.shared.compute_sem(values, ddof=1)[source]

Compute mean and standard error of the mean.

SEM = std / sqrt(N) where N is the number of samples.

Parameters:
  • values (array_like) – 1D array of values (e.g., one value per replicate)

  • ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).

Returns:

Container with mean, sem, and n_samples

Return type:

StatResult

Examples

>>> values = [2.5, 2.7, 2.6, 2.4, 2.8]  # RMSF from 5 replicates
>>> result = compute_sem(values)
>>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A")
RMSF = 2.60 +/- 0.07 A

Notes

For a single value, SEM is undefined (returns 0.0).

polyzymd.analyses.shared.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]

Aggregate per-residue values across replicates.

For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.

Parameters:
  • per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.

  • residue_ids (array, optional) – 1-indexed residue identifiers. If None, uses 1, 2, 3, … Following PyMOL convention (1-indexed).

Returns:

Container with residue_ids, means, sems, n_replicates

Return type:

PerResidueStats

Raises:

ValueError – If arrays have inconsistent lengths or no replicates provided

polyzymd.analyses.shared.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]

Aggregate region-averaged values across replicates.

For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.

This implements the correct hierarchical aggregation: 1. First average within each replicate (over selected residues) 2. Then compute mean +/- SEM across replicate means

Parameters:
  • per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.

  • residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.

Returns:

Mean +/- SEM of region-averaged values across replicates

Return type:

StatResult

polyzymd.analyses.shared.weighted_mean_with_sem(means, sems, weights=None)[source]

Compute weighted mean with proper error propagation.

Useful for combining results from different conditions or analyses with different uncertainties.

Parameters:
  • means (array_like) – Mean values from each source

  • sems (array_like) – SEM values from each source

  • weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.

Returns:

Weighted mean with propagated uncertainty

Return type:

StatResult

Notes

For inverse-variance weighting, the combined SEM is:

SEM_combined = 1 / sqrt(sum(1/SEM_i^2))

For arbitrary weights, uses standard error propagation:

SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)

class polyzymd.analyses.shared.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]

Bases: object

Result of autocorrelation function computation.

lags

Time lags in the same units as timestep

Type:

NDArray[np.float64]

acf

Autocorrelation values (normalized, ACF[0] = 1)

Type:

NDArray[np.float64]

timestep

Time between frames

Type:

float

timestep_unit

Unit of timestep (e.g., “ps”, “ns”)

Type:

str

n_samples

Number of samples in the original timeseries

Type:

int

lags: numpy.typing.NDArray.numpy.float64
acf: numpy.typing.NDArray.numpy.float64
timestep: float
timestep_unit: str
n_samples: int
to_dict()[source]

Convert to dictionary for serialization.

__init__(lags, acf, timestep, timestep_unit, n_samples)
class polyzymd.analyses.shared.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]

Bases: object

Result of correlation time estimation.

tau

Estimated correlation time

Type:

float

tau_unit

Unit of tau (same as timestep unit)

Type:

str

method

Method used for estimation

Type:

str

n_independent

Estimated number of independent samples in trajectory

Type:

int

statistical_inefficiency

g = 1 + 2*tau/dt, factor by which variance is inflated

Type:

float

warning

Warning message if statistics may be unreliable (e.g., N_ind < 10)

Type:

str | None

tau: float
tau_unit: str
method: str
n_independent: int
statistical_inefficiency: float
warning: str | None = None
property is_reliable: bool

Return True if statistics are likely reliable (N_ind >= 10).

to_dict()[source]

Convert to dictionary for serialization.

__init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)
polyzymd.analyses.shared.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]

Compute autocorrelation function of a 1D timeseries.

Uses FFT-based computation for efficiency.

Parameters:
  • timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)

  • max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.

  • timestep (float, optional) – Time between frames. Default is 1.0.

  • timestep_unit (str, optional) – Unit of timestep. Default is “frames”.

Returns:

Container with lags, acf values, and metadata

Return type:

ACFResult

Examples

>>> # Compute ACF of RMSD timeseries
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])  # from MDAnalysis
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")

Notes

The ACF is normalized so that ACF[0] = 1.

For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²

For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.

polyzymd.analyses.shared.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]

Estimate correlation time from ACF or raw timeseries.

Parameters:
  • acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries

  • timestep (float, optional) – Time between frames (only used if passing raw timeseries)

  • timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)

  • method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ: - “first_zero”: Lag where ACF first crosses zero - “exponential_fit”: Fit ACF = exp(-t/τ) - “integration”: τ = ∫ACF(t)dt (recommended, most robust)

  • n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.

Returns:

Contains tau, method used, n_independent, statistical_inefficiency

Return type:

CorrelationTimeResult

Examples

>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> tau_result = estimate_correlation_time(acf_result, method="integration")
>>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
>>> print(f"Independent samples: {tau_result.n_independent}")

Notes

The “integration” method is most robust for noisy ACFs. It computes:

τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt

Integration stops at first zero crossing to avoid noise contribution.

polyzymd.analyses.shared.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]

Get frame indices for independent samples.

Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.

Parameters:
  • n_frames (int) – Total number of frames in trajectory

  • correlation_time (float) – Correlation time τ (in same units as timestep)

  • timestep (float, optional) – Time between frames. Default is 1.0.

  • start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed (PyMOL convention).

Returns:

Array of frame indices (0-indexed) that are approximately independent

Return type:

NDArray[np.int64]

Examples

>>> # Get independent frames for RMSF calculation
>>> tau_result = estimate_correlation_time(rmsd, timestep=10.0)
>>> indices = get_independent_indices(
...     n_frames=10000,
...     correlation_time=tau_result.tau,
...     timestep=10.0,
...     start_frame=1000,  # Skip first 1000 frames for equilibration
... )
>>> print(f"Using {len(indices)} independent frames")

Notes

Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.

The spacing is set to 2*τ/timestep, which gives frames with negligible correlation (ACF < 0.05 for exponential decay).

polyzymd.analyses.shared.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]

Compute statistical inefficiency g directly from a timeseries.

The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:

Var(mean) = Var(x) * g / N

This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)

where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).

Parameters:
  • timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)

  • mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.

  • fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.

Returns:

Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.

Return type:

float

Examples

>>> # Binary contact timeseries
>>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...])
>>> g = statistical_inefficiency(contacts)
>>> n_eff = len(contacts) / g
>>> print(f"Effective samples: {n_eff:.1f}")
>>> # Continuous observable
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])
>>> g = statistical_inefficiency(rmsd)

Notes

This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.

For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26

polyzymd.analyses.shared.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]

Compute statistical inefficiency from multiple timeseries of different lengths.

This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.

Parameters:
  • timeseries_list (list[ArrayLike]) – List of 1D timeseries arrays (can have different lengths)

  • mintime (int) – Minimum number of lags before checking for zero crossing. Default is 3.

Returns:

Statistical inefficiency g (>= 1.0)

Return type:

float

Examples

>>> # Three replicates with different lengths
>>> ts1 = np.array([0, 1, 1, 0, 0, 1])  # 6 frames
>>> ts2 = np.array([1, 1, 0, 0, 0])      # 5 frames
>>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1])  # 7 frames
>>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])

Notes

This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.

The algorithm:

  1. Compute global mean μ across all timeseries

  2. For each lag t: - Compute sum of (x - μ) products across all timeseries where t < N_k - Compute sum of sample counts across all timeseries where t < N_k - Average to get C(t)

  3. Sum with finite-size correction

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26

polyzymd.analyses.shared.n_effective(n_samples, g)[source]

Compute number of effective independent samples.

Parameters:
  • n_samples (int) – Total number of samples

  • g (float) – Statistical inefficiency

Returns:

Effective number of independent samples (N_eff = N / g)

Return type:

float

polyzymd.analyses.shared.check_statistical_reliability(n_eff, threshold=10)[source]

Check if statistics are reliable based on effective sample count.

Parameters:
  • n_eff (float) – Number of effective independent samples

  • threshold (int) – Minimum recommended independent samples. Default is 10.

Returns:

  • is_reliable (bool) – True if n_eff >= threshold

  • warning (str | None) – Warning message if not reliable, None otherwise

Return type:

tuple[bool, str | None]

Examples

>>> g = statistical_inefficiency(contacts)
>>> n_eff = n_effective(len(contacts), g)
>>> is_ok, warning = check_statistical_reliability(n_eff)
>>> if not is_ok:
...     print(warning)
polyzymd.analyses.shared.minimum_image_distance(pos1, pos2, box=None)[source]

Calculate minimum image distance between two positions.

Uses the minimum image convention for periodic boundary conditions. Both orthorhombic and triclinic boxes are fully supported.

Parameters:
  • pos1 (NDArray) – Position vectors of shape (3,) in Angstroms.

  • pos2 (NDArray) – Position vectors of shape (3,) in Angstroms.

  • box (NDArray, optional) – Box dimensions in MDAnalysis format: [Lx, Ly, Lz, alpha, beta, gamma] or just [Lx, Ly, Lz] (assumes orthorhombic). If None, no PBC correction is applied.

Returns:

Minimum image distance in Angstroms.

Return type:

float

polyzymd.analyses.shared.pairwise_distances_pbc(positions1, positions2, box=None)[source]

Calculate pairwise distances between two sets of positions.

Uses the minimum image convention for periodic boundary conditions. Both orthorhombic and triclinic boxes are fully supported.

Parameters:
  • positions1 (NDArray) – First set of positions, shape (N, 3) in Angstroms.

  • positions2 (NDArray) – Second set of positions, shape (M, 3) in Angstroms.

  • box (NDArray, optional) – Box dimensions in MDAnalysis format: [Lx, Ly, Lz, alpha, beta, gamma] or just [Lx, Ly, Lz] (assumes orthorhombic). If None, no PBC correction is applied.

Returns:

Distance matrix of shape (N, M) in Angstroms.

Return type:

NDArray[np.float64]

class polyzymd.analyses.shared.AnalysisDefaults(*, equilibration_time='10ns', fdr_alpha=0.05, ttest_method='student', posthoc_method='ttest_bh')[source]

Bases: BaseModel

Default parameters applied to all analyses.

equilibration_time

Time to skip for equilibration (e.g., “10ns”, “5000ps”)

Type:

str

fdr_alpha

False discovery rate threshold for Benjamini-Hochberg correction in default scalar pairwise comparisons. Only used when posthoc_method is "ttest_bh".

Type:

float

ttest_method

Two-sample t-test method used by the default scalar comparison. "student" assumes equal variances and "welch" relaxes that assumption. Only used when posthoc_method is "ttest_bh".

Type:

str

posthoc_method

Post-hoc testing method for pairwise comparisons. "ttest_bh" (default) runs pairwise t-tests with Benjamini-Hochberg FDR correction. "tukey_hsd" runs Tukey’s Honestly Significant Difference test across all groups.

Type:

str

equilibration_time: str
fdr_alpha: float
ttest_method: str
posthoc_method: str
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.compute_config_hash(config)[source]

Compute hash of config parameters relevant to analysis.

Includes parameters that affect trajectory interpretation: - enzyme configuration - substrate configuration - polymer configuration - thermodynamics (temperature, pressure) - output paths (for trajectory location)

Excludes parameters that don’t affect analysis of completed trajectories: - simulation_phases (equilibration_stages/production settings) - force_field (already baked into trajectory)

Parameters:

config (SimulationConfig) – PolyzyMD simulation configuration

Returns:

Hex digest of SHA-256 hash (first 16 characters for brevity)

Return type:

str

Examples

>>> from polyzymd.config import load_config
>>> config = load_config("config.yaml")
>>> hash_val = compute_config_hash(config)
>>> print(f"Config hash: {hash_val}")
Config hash: a3b2c1d4e5f67890
polyzymd.analyses.shared.validate_config_hash(stored_hash, current_config, warn=True)[source]

Check if stored hash matches current config.

If the hashes don’t match, this indicates the config has changed since the analysis was performed. This could mean: 1. The user modified the config (bad practice - should create new project) 2. The analysis was performed on a different config file 3. A bug in the hashing algorithm

Parameters:
  • stored_hash (str) – Hash stored in cached analysis results

  • current_config (SimulationConfig) – Current configuration being used

  • warn (bool, optional) – If True (default), print a loud warning when hashes don’t match

Returns:

True if hashes match, False otherwise

Return type:

bool

Examples

>>> stored_hash = loaded_result.get("config_hash", "")
>>> config = load_config("config.yaml")
>>> if not validate_config_hash(stored_hash, config):
...     print("Warning: Results may be stale!")
class polyzymd.analyses.shared.ConvergenceResult(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)[source]

Bases: object

Container for convergence diagnostics.

converged

Whether sustained convergence was detected.

Type:

bool

assessable

Whether convergence could be assessed from available data.

Type:

bool

convergence_time_ns

Start time of the first sustained converged period.

Type:

float | None

window_start_times_ns

Start times for each sliding window.

Type:

list[float]

window_mean_values

Mean signal value in each sliding window.

Type:

list[float]

slope_times_ns

Time points associated with slope estimates.

Type:

list[float]

slopes

Slopes between successive window means.

Type:

list[float]

window_size_ns

Sliding window width in ns.

Type:

float

step_size_ns

Sliding window stride in ns.

Type:

float

slope_threshold

Absolute slope cutoff used for convergence.

Type:

float

sustained_for_ns

Required sustained duration below slope threshold.

Type:

float

message

Human-readable status message.

Type:

str

converged: bool
assessable: bool
convergence_time_ns: float | None
window_start_times_ns: list[float]
window_mean_values: list[float]
slope_times_ns: list[float]
slopes: list[float]
window_size_ns: float
step_size_ns: float
slope_threshold: float
sustained_for_ns: float
message: str
__init__(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)
polyzymd.analyses.shared.find_convergence_time(time_ns, values, window_size_ns=15.0, step_size_ns=5.0, slope_threshold=0.0005, sustained_for_ns=15.0)[source]

Find sustained convergence time using a sliding-window slope heuristic.

Parameters:
  • time_ns (array_like) – Monotonically increasing time values in ns.

  • values (array_like) – Signal values sampled at time_ns.

  • window_size_ns (float, optional) – Width of each averaging window in ns.

  • step_size_ns (float, optional) – Sliding step between successive windows in ns.

  • slope_threshold (float, optional) – Absolute slope threshold for classifying a window-to-window change as converged.

  • sustained_for_ns (float, optional) – Required cumulative duration below threshold before declaring convergence.

Returns:

Full convergence diagnostics, including intermediate window means and slope traces.

Return type:

ConvergenceResult

Raises:

ValueError – Raised when inputs are invalid.

polyzymd.analyses.shared.get_theme(plot_settings)[source]

Return the resolved PlotTheme from plot_settings.

Parameters:

plot_settings (PlotSettings) – Global plot settings (carries a .theme property).

Return type:

PlotTheme

polyzymd.analyses.shared.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]

Apply standard axis chrome from the theme.

Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.

Parameters:
  • ax (matplotlib Axes) – Target axes to style.

  • plot_settings (PlotSettings) – Global plot settings.

  • title (str, optional) – Axes title.

  • xlabel (str, optional) – X-axis label.

  • ylabel (str, optional) – Y-axis label.

polyzymd.analyses.shared.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]

Apply legend with themed defaults.

Uses theme.legend_loc and theme.legend_bbox unless overridden by the caller. Extra kwargs are forwarded to ax.legend().

Parameters:
  • ax (matplotlib Axes) – Target axes.

  • plot_settings (PlotSettings) – Global plot settings.

  • loc (str, optional) – Override theme.legend_loc.

  • bbox_to_anchor (tuple of float or None, optional) – Override theme.legend_bbox. Pass None explicitly to suppress the bbox (e.g. for inside-axes placement).

  • fontsize (int, optional) – Override theme.legend_fontsize.

  • **kwargs (Any) – Forwarded to ax.legend().

polyzymd.analyses.shared.get_colors(n, plot_settings)[source]

Get n distinct colors from the configured palette.

Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.

Parameters:
  • n (int) – Number of colors needed.

  • plot_settings (PlotSettings) – Global plot settings (carries color_palette).

Returns:

List of color values (RGB tuples or matplotlib color specs).

Return type:

list

polyzymd.analyses.shared.get_output_path(output_dir, name, plot_settings)[source]

Generate output file path with correct format extension.

Parameters:
  • output_dir (Path) – Output directory.

  • name (str) – Base filename (without extension).

  • plot_settings (PlotSettings) – Global plot settings (carries format).

Returns:

Full output path with extension.

Return type:

Path

polyzymd.analyses.shared.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]

Save figure with DPI, watermark, and optional experimental stamp.

Parameters:
  • fig (matplotlib Figure) – Figure to save.

  • output_path (Path) – Output file path.

  • plot_settings (PlotSettings) – Global plot settings (carries dpi and theme).

  • experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.

  • close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.

Returns:

Path to saved figure.

Return type:

Path

polyzymd.analyses.shared.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]

Render grouped bars with optional error bars and reference line.

Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from plot_settings.theme. Callers can override any of them via **style_overrides using the theme field names as keys.

Parameters:
  • ax (matplotlib Axes) – Target axes.

  • x (np.ndarray) – 1-D array of group centre positions (e.g. np.arange(n_groups)).

  • series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.

  • colors (sequence) – One colour per condition (same length as series).

  • plot_settings (PlotSettings) – Global plot settings.

  • bar_width (float | None, optional) – Width of each individual bar. When None (default) the width is computed as 0.8 / len(series).

  • show_error (bool, optional) – If False, error bars are suppressed, by default True.

  • reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to None to skip, by default 0.0.

  • reference_label (str, optional) – Legend label for the reference line, by default "Neutral (0)".

  • replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as replicate_values[condition_idx][group_idx] -> sequence of floats (one per replicate). When None (default), no dots are drawn.

  • **style_overrides – Override any theme field for this call only. Accepted keys: bar_alpha, bar_capsize, bar_edgecolor, bar_linewidth, dot_size, dot_alpha, dot_color, reference_line_color, reference_line_style, reference_line_width.

polyzymd.analyses.shared.find_json(directory, preferred, glob_pattern='*.json')[source]

Locate a JSON result file inside directory.

Tries the preferred filename first; if it does not exist, falls back to the first file matching glob_pattern (sorted lexicographically so results are deterministic).

Parameters:
  • directory (Path) – Directory to search.

  • preferred (str) – Exact filename to try first (e.g. "rmsf_aggregated.json").

  • glob_pattern (str, optional) – Glob to use as fallback, by default "*.json".

Returns:

Path to the located file, or None if nothing was found.

Return type:

Path | None

polyzymd.analyses.shared.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]

Annotate heatmap cells with formatted values.

Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).

Parameters:
  • ax (matplotlib Axes) – The axes containing the heatmap image.

  • matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.

  • plot_settings (PlotSettings) – Global plot settings.

  • fmt (str, optional) – Format spec for the value, by default ".2f".

  • fontsize (int | None, optional) – Annotation font size. When None (default), uses plot_settings.theme.annotation_fontsize.

  • threshold (float, optional) – Absolute-value threshold above which text turns white.

  • sem_matrix (np.ndarray | None, optional) – If provided, a second line ±{sem} is appended when the SEM value is finite.

  • show_sign (bool, optional) – Prefix positive values with "+" , by default True.

  • linespacing (float | None, optional) – Passed to ax.text(linespacing=...) when SEM is shown.

polyzymd.analyses.shared.symmetric_clim(values, pad=0.1)[source]

Compute symmetric colour limits centred on zero.

Parameters:
  • values (sequence of float or ndarray) – Finite data values to derive limits from.

  • pad (float, optional) – Extra padding added to both sides, by default 0.1.

Returns:

(vmin, vmax) with vmin == -vmax (before padding).

Return type:

tuple[float, float]

polyzymd.analyses.shared.loader

Trajectory loading utilities for PolyzyMD analysis.

This module provides config-aware trajectory loading that understands PolyzyMD’s directory structure and daisy-chain continuation patterns. File discovery is delegated to the active simulation engine so that both OpenMM and GROMACS directory layouts are handled transparently.

Key Features

  • Config-based path resolution (config.yaml is single source of truth)

  • Engine-aware file discovery (OpenMM daisy-chain, GROMACS flat layout)

  • Automatic detection of daisy-chain trajectory segments

  • Support for both scratch and projects directories

  • Lazy loading and memory-efficient iteration

class polyzymd.analyses.shared.loader.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)[source]

Bases: object

Information about discovered trajectory files.

topology_file

Path to topology file (PDB)

Type:

Path

trajectory_files

List of trajectory files (DCD) in order

Type:

list[Path]

n_segments

Number of daisy-chain segments

Type:

int

working_directory

Base working directory for this replicate

Type:

Path

replicate

Replicate number

Type:

int

topology_file: Path
trajectory_files: list[Path]
n_segments: int = 0
working_directory: Path
replicate: int = 1
property n_trajectory_files: int

Number of trajectory files found.

validate()[source]

Validate that all files exist.

__init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)
class polyzymd.analyses.shared.loader.TrajectoryLoader(config, engine_override=None)[source]

Bases: object

Config-aware trajectory loader for PolyzyMD simulations.

This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:

  • Daisy-chain continuation segments (OpenMM)

  • Flat production directories (GROMACS)

  • Scratch vs projects directory resolution

  • Multiple replicates

File discovery is delegated to the simulation engine resolved from the config’s engine field. The engine is created lazily on the first call that needs it, so construction remains cheap.

Parameters:
  • config (SimulationConfig) – PolyzyMD simulation configuration.

  • engine_override (str or None, optional) – Force a specific engine name ("openmm" or "gromacs") instead of reading config.engine.

Examples

>>> from polyzymd.config import load_config
>>> config = load_config("config.yaml")
>>> loader = TrajectoryLoader(config)
>>>
>>> # Load single replicate
>>> u = loader.load_universe(replicate=1)
>>> print(f"Loaded {len(u.trajectory)} frames")
>>>
>>> # Get trajectory info without loading
>>> info = loader.get_trajectory_info(replicate=1)
>>> print(f"Found {info.n_segments} segments")
>>>
>>> # Load multiple replicates
>>> for rep in range(1, 6):
...     u = loader.load_universe(replicate=rep)
...     # ... analyze
>>>
>>> # Explicit engine override for GROMACS directories
>>> loader = TrajectoryLoader(config, engine_override="gromacs")

Notes

Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).

__init__(config, engine_override=None)[source]
get_trajectory_info(replicate)[source]

Get trajectory file information for a replicate.

Parameters:

replicate (int) – Replicate number (1-indexed)

Returns:

Information about discovered trajectory files

Return type:

TrajectoryInfo

Raises:

FileNotFoundError – If working directory or required files don’t exist

load_universe(replicate, cache=True)[source]

Load MDAnalysis Universe for a replicate.

Parameters:
  • replicate (int) – Replicate number (1-indexed)

  • cache (bool, optional) – If True (default), cache the Universe for reuse

Returns:

MDAnalysis Universe with trajectory loaded

Return type:

Universe

Notes

For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.

iter_replicates(replicates)[source]

Iterate over multiple replicates.

Parameters:

replicates (sequence of int) – Replicate numbers to load

Yields:

tuple of (int, Universe) – Replicate number and loaded Universe

Examples

>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]):
...     rmsf = compute_rmsf(u)
...     results[rep] = rmsf
get_frame_times(replicate, unit='ns')[source]

Get time values for each frame.

Parameters:
  • replicate (int) – Replicate number

  • unit (str, optional) – Time unit for output. Options: “ps”, “ns”. Default is “ns”.

Returns:

Array of time values for each frame

Return type:

NDArray[np.float64]

get_timestep(replicate, unit='ps')[source]

Get the trajectory timestep (time between frames).

Parameters:
  • replicate (int) – Replicate number

  • unit (str, optional) – Time unit. Options: “ps”, “ns”. Default is “ps”.

Returns:

Time between consecutive frames

Return type:

float

clear_cache()[source]

Clear the Universe cache to free memory.

find_topology(working_dir)[source]

Find topology file in working directory.

Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS, solvated_system.pdb preference for OpenMM).

This method is used by several plugins that pass an explicit working_dir unrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to 1.

Parameters:

working_dir (Path) – Directory to search for topology files.

Returns:

Path to the topology file.

Return type:

Path

Raises:

FileNotFoundError – If no topology file is found.

polyzymd.analyses.shared.loader.parse_time_string(time_str)[source]

Parse a time string with units into value and unit.

Parameters:

time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.

Returns:

Numeric value and unit string

Return type:

tuple of (float, str)

Examples

>>> parse_time_string("100ns")
(100.0, "ns")
>>> parse_time_string("5000 ps")
(5000.0, "ps")
>>> parse_time_string("100")  # Default to ns
(100.0, "ns")
polyzymd.analyses.shared.loader.convert_time(value, from_unit, to_unit)[source]

Convert time between units.

Parameters:
  • value (float) – Time value

  • from_unit (str) – Source unit (“fs”, “ps”, “ns”)

  • to_unit (str) – Target unit (“fs”, “ps”, “ns”)

Returns:

Converted time value

Return type:

float

polyzymd.analyses.shared.loader.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]

Convert time to frame index.

Parameters:
  • time (float) – Time value

  • time_unit (str) – Unit of time value

  • timestep (float) – Time between frames

  • timestep_unit (str) – Unit of timestep (default: “ps”)

Returns:

Frame index (0-indexed)

Return type:

int

polyzymd.analyses.shared.alignment

Trajectory alignment utilities.

This module provides reusable trajectory alignment infrastructure for all analysis modules. The alignment is performed in-memory using MDAnalysis and removes rotational drift and center-of-mass motion that can add noise to distance and flexibility measurements.

Reference Modes

  • centroid: Align to a representative aligned frame (closest to aligned mean structure). Best for measuring fluctuations around a representative equilibrium state.

  • average: Align to mathematical average structure. Best for pure thermal fluctuation analysis (note: average may have unphysical geometry).

  • frame: Align to a specific user-specified frame number. Best for comparing to a known functional conformation.

  • external: Align to an external PDB structure (e.g., crystal structure). Best for measuring deviations from a catalytically competent reference geometry that is independent of the simulation conditions.

Usage

>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory
>>>
>>> # Create configuration
>>> config = AlignmentConfig(
...     enabled=True,
...     reference_mode="centroid",
...     selection="protein and name CA",
... )
>>>
>>> # Align trajectory in-place
>>> ref_frame = align_trajectory(universe, config, start_frame=100)
>>> # Universe is now aligned; ref_frame is the reference frame index

Notes

Alignment modifies the Universe in-place. If you need the original coordinates, load a fresh Universe after alignment.

The alignment selection should typically be stable atoms that define the reference frame. Common choices: - “protein and name CA”: Backbone alpha-carbons (default) - “protein and backbone”: Full backbone (N, CA, C) - “protein”: All protein atoms (includes flexible side chains)

References

class polyzymd.analyses.shared.alignment.AlignmentConfig(*, enabled=True, reference_mode='centroid', reference_frame=None, selection='protein and name CA', centroid_selection='protein', reference_file=None)[source]

Bases: BaseModel

Configuration for trajectory alignment.

This model defines how a trajectory should be aligned before analysis. By default, alignment is ENABLED to remove rotational drift and COM motion that can add noise to distance measurements.

enabled

Whether to align the trajectory before analysis. Default is True. When enabled, the user is notified via INFO-level logging.

Type:

bool

reference_mode

How to select the reference structure:

  • “centroid”: Representative aligned frame (closest to aligned mean)

  • “average”: Mathematical average structure

  • “frame”: Specific frame number

Type:

ReferenceMode

reference_frame

Frame number (1-indexed) when reference_mode=”frame”. Required when mode is “frame”, ignored otherwise.

Type:

int | None

selection

MDAnalysis selection string for superposition. Atoms matching this selection are used to compute the optimal rotation/translation. Default: “protein and name CA” (backbone alpha-carbons).

Type:

str

centroid_selection

Selection for representative-frame detection when mode=”centroid”. Default: “protein” (all protein atoms for better discrimination).

Type:

str

Examples

>>> # Default: align to centroid using CA atoms
>>> config = AlignmentConfig()
>>> # Align to a specific frame
>>> config = AlignmentConfig(
...     reference_mode="frame",
...     reference_frame=500,
... )
>>> # Disable alignment (not recommended for most analyses)
>>> config = AlignmentConfig(enabled=False)
enabled: bool
reference_mode: ReferenceMode
reference_frame: int | None
selection: str
centroid_selection: str
reference_file: Path | None
validate_reference_params()[source]

Validate reference_frame and reference_file for their modes.

to_dict()[source]

Convert to dictionary for cache key hashing.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.alignment.align_trajectory(universe, config, start_frame=0, stop_frame=None)[source]

Align trajectory in-memory to a reference structure.

This function performs trajectory alignment using MDAnalysis, removing rotational drift and center-of-mass motion. The alignment is performed in-place, modifying the Universe’s coordinates.

Parameters:
  • universe (Universe) – MDAnalysis Universe to align. Will be modified in-place.

  • config (AlignmentConfig) – Alignment configuration specifying reference mode and selections.

  • start_frame (int, optional) – First frame to consider (0-indexed). Default is 0. Frames before this are typically equilibration and should be excluded.

  • stop_frame (int, optional) – Last frame (exclusive). If None, uses all frames after start_frame.

Returns:

Reference frame index (0-indexed), or None if using average mode. This can be stored in results for reproducibility.

Return type:

int | None

Raises:
  • ImportError – If MDAnalysis is not available.

  • ValueError – If reference_mode is “frame” but the frame number is out of range.

Notes

When alignment is performed, an INFO-level log message is emitted to notify the user. This ensures users are aware that their trajectory coordinates have been modified.

Examples

>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory
>>>
>>> config = AlignmentConfig(reference_mode="centroid")
>>> ref_frame = align_trajectory(universe, config, start_frame=100)
>>> print(f"Aligned to frame {ref_frame}")
polyzymd.analyses.shared.alignment.get_alignment_description(config)[source]

Generate a human-readable description of the alignment configuration.

Parameters:

config (AlignmentConfig) – Alignment configuration.

Returns:

Description suitable for documentation or result files.

Return type:

str

polyzymd.analyses.shared.autocorrelation

Autocorrelation analysis for independent sampling.

MD trajectories are highly correlated in time - consecutive frames are not independent samples. This module provides tools to:

  1. Compute the autocorrelation function (ACF) of an observable

  2. Estimate the correlation time (τ) from the ACF

  3. Compute statistical inefficiency (g) for proper uncertainty quantification

  4. Select independent frames based on τ for proper statistics

Key Concepts

  • Autocorrelation function (ACF): Measures how correlated a signal is with itself at different time lags. ACF(0) = 1, and ACF decays toward 0.

  • Correlation time (τ): Characteristic time for decorrelation. Frames separated by > 2τ are approximately independent.

  • Statistical inefficiency (g): Factor by which variance is inflated due to correlation. g = 1 + 2*Σ C(t)*(1-t/N). N_eff = N/g.

  • Independent samples: For proper SEM calculation, we need N_eff independent samples, not N_frames correlated observations.

Methods for τ estimation

  • First zero crossing: τ is lag where ACF first crosses zero

  • Exponential fit: Fit ACF = exp(-t/τ) and extract τ

  • Integration: τ = ∫ACF(t)dt from 0 to first zero (or cutoff)

Statistical Validity

The number of effective independent samples (N_eff) is computed as:

N_eff = N / g = N / (1 + 2*Σ C(t)*(1-t/N))

This matches the algorithm from Chodera et al. (2007) with the finite-size correction factor (1-t/N). When N_eff < 10, statistical estimates (mean, SEM) may be unreliable, and users should be warned per LiveCoMS best practices (Grossfield et al., 2018).

For multiple timeseries of different lengths (e.g., replicates), use statistical_inefficiency_multiple() which correctly handles the averaging.

References

  • Flyvbjerg & Petersen (1989) J. Chem. Phys. 91:461 (block averaging)

  • Chodera et al. (2007) J. Chem. Theory Comput. 3:26 (statistical inefficiency)

  • Grossfield et al. (2018) LiveCoMS 1:5067 (uncertainty quantification)

class polyzymd.analyses.shared.autocorrelation.CorrelationTimeMethod(value)[source]

Bases: str, Enum

Method for estimating correlation time from ACF.

FIRST_ZERO = 'first_zero'
EXPONENTIAL_FIT = 'exponential_fit'
INTEGRATION = 'integration'
class polyzymd.analyses.shared.autocorrelation.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]

Bases: object

Result of autocorrelation function computation.

lags

Time lags in the same units as timestep

Type:

NDArray[np.float64]

acf

Autocorrelation values (normalized, ACF[0] = 1)

Type:

NDArray[np.float64]

timestep

Time between frames

Type:

float

timestep_unit

Unit of timestep (e.g., “ps”, “ns”)

Type:

str

n_samples

Number of samples in the original timeseries

Type:

int

lags: numpy.typing.NDArray.numpy.float64
acf: numpy.typing.NDArray.numpy.float64
timestep: float
timestep_unit: str
n_samples: int
to_dict()[source]

Convert to dictionary for serialization.

__init__(lags, acf, timestep, timestep_unit, n_samples)
class polyzymd.analyses.shared.autocorrelation.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]

Bases: object

Result of correlation time estimation.

tau

Estimated correlation time

Type:

float

tau_unit

Unit of tau (same as timestep unit)

Type:

str

method

Method used for estimation

Type:

str

n_independent

Estimated number of independent samples in trajectory

Type:

int

statistical_inefficiency

g = 1 + 2*tau/dt, factor by which variance is inflated

Type:

float

warning

Warning message if statistics may be unreliable (e.g., N_ind < 10)

Type:

str | None

tau: float
tau_unit: str
method: str
n_independent: int
statistical_inefficiency: float
warning: str | None = None
property is_reliable: bool

Return True if statistics are likely reliable (N_ind >= 10).

to_dict()[source]

Convert to dictionary for serialization.

__init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)
polyzymd.analyses.shared.autocorrelation.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]

Compute autocorrelation function of a 1D timeseries.

Uses FFT-based computation for efficiency.

Parameters:
  • timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)

  • max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.

  • timestep (float, optional) – Time between frames. Default is 1.0.

  • timestep_unit (str, optional) – Unit of timestep. Default is “frames”.

Returns:

Container with lags, acf values, and metadata

Return type:

ACFResult

Examples

>>> # Compute ACF of RMSD timeseries
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])  # from MDAnalysis
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")

Notes

The ACF is normalized so that ACF[0] = 1.

For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²

For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.

polyzymd.analyses.shared.autocorrelation.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]

Estimate correlation time from ACF or raw timeseries.

Parameters:
  • acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries

  • timestep (float, optional) – Time between frames (only used if passing raw timeseries)

  • timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)

  • method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ: - “first_zero”: Lag where ACF first crosses zero - “exponential_fit”: Fit ACF = exp(-t/τ) - “integration”: τ = ∫ACF(t)dt (recommended, most robust)

  • n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.

Returns:

Contains tau, method used, n_independent, statistical_inefficiency

Return type:

CorrelationTimeResult

Examples

>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> tau_result = estimate_correlation_time(acf_result, method="integration")
>>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
>>> print(f"Independent samples: {tau_result.n_independent}")

Notes

The “integration” method is most robust for noisy ACFs. It computes:

τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt

Integration stops at first zero crossing to avoid noise contribution.

polyzymd.analyses.shared.autocorrelation.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]

Get frame indices for independent samples.

Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.

Parameters:
  • n_frames (int) – Total number of frames in trajectory

  • correlation_time (float) – Correlation time τ (in same units as timestep)

  • timestep (float, optional) – Time between frames. Default is 1.0.

  • start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed (PyMOL convention).

Returns:

Array of frame indices (0-indexed) that are approximately independent

Return type:

NDArray[np.int64]

Examples

>>> # Get independent frames for RMSF calculation
>>> tau_result = estimate_correlation_time(rmsd, timestep=10.0)
>>> indices = get_independent_indices(
...     n_frames=10000,
...     correlation_time=tau_result.tau,
...     timestep=10.0,
...     start_frame=1000,  # Skip first 1000 frames for equilibration
... )
>>> print(f"Using {len(indices)} independent frames")

Notes

Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.

The spacing is set to 2*τ/timestep, which gives frames with negligible correlation (ACF < 0.05 for exponential decay).

polyzymd.analyses.shared.autocorrelation.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]

Compute statistical inefficiency g directly from a timeseries.

The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:

Var(mean) = Var(x) * g / N

This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)

where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).

Parameters:
  • timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)

  • mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.

  • fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.

Returns:

Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.

Return type:

float

Examples

>>> # Binary contact timeseries
>>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...])
>>> g = statistical_inefficiency(contacts)
>>> n_eff = len(contacts) / g
>>> print(f"Effective samples: {n_eff:.1f}")
>>> # Continuous observable
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])
>>> g = statistical_inefficiency(rmsd)

Notes

This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.

For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26

polyzymd.analyses.shared.autocorrelation.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]

Compute statistical inefficiency from multiple timeseries of different lengths.

This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.

Parameters:
  • timeseries_list (list[ArrayLike]) – List of 1D timeseries arrays (can have different lengths)

  • mintime (int) – Minimum number of lags before checking for zero crossing. Default is 3.

Returns:

Statistical inefficiency g (>= 1.0)

Return type:

float

Examples

>>> # Three replicates with different lengths
>>> ts1 = np.array([0, 1, 1, 0, 0, 1])  # 6 frames
>>> ts2 = np.array([1, 1, 0, 0, 0])      # 5 frames
>>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1])  # 7 frames
>>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])

Notes

This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.

The algorithm:

  1. Compute global mean μ across all timeseries

  2. For each lag t: - Compute sum of (x - μ) products across all timeseries where t < N_k - Compute sum of sample counts across all timeseries where t < N_k - Average to get C(t)

  3. Sum with finite-size correction

References

Chodera et al. (2007) J. Chem. Theory Comput. 3:26

polyzymd.analyses.shared.autocorrelation.n_effective(n_samples, g)[source]

Compute number of effective independent samples.

Parameters:
  • n_samples (int) – Total number of samples

  • g (float) – Statistical inefficiency

Returns:

Effective number of independent samples (N_eff = N / g)

Return type:

float

polyzymd.analyses.shared.autocorrelation.check_statistical_reliability(n_eff, threshold=10)[source]

Check if statistics are reliable based on effective sample count.

Parameters:
  • n_eff (float) – Number of effective independent samples

  • threshold (int) – Minimum recommended independent samples. Default is 10.

Returns:

  • is_reliable (bool) – True if n_eff >= threshold

  • warning (str | None) – Warning message if not reliable, None otherwise

Return type:

tuple[bool, str | None]

Examples

>>> g = statistical_inefficiency(contacts)
>>> n_eff = n_effective(len(contacts), g)
>>> is_ok, warning = check_statistical_reliability(n_eff)
>>> if not is_ok:
...     print(warning)

polyzymd.analyses.shared.statistics

Statistical functions for replicate aggregation.

This module provides statistical utilities for combining results across multiple simulation replicates with proper error propagation.

Key design decisions: - All uncertainties are reported as Standard Error of the Mean (SEM) - SEM = std / sqrt(N) where N is the number of independent samples - Hierarchical aggregation preserves proper statistics at each level

Functions

compute_sem

Standard error of the mean for a 1D array

aggregate_per_residue_stats

Combine per-residue values across replicates

aggregate_region_stats

Combine region-averaged values across replicates

weighted_mean_with_sem

Weighted average with proper error propagation

class polyzymd.analyses.shared.statistics.StatResult(mean, sem, n_samples)[source]

Bases: object

Container for mean +/- SEM results.

mean

The mean value

Type:

float

sem

Standard error of the mean

Type:

float

n_samples

Number of samples used in computation

Type:

int

mean: float
sem: float
n_samples: int
to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(mean, sem, n_samples)
class polyzymd.analyses.shared.statistics.PerResidueStats(residue_ids, means, sems, n_replicates)[source]

Bases: object

Container for per-residue statistics across replicates.

residue_ids

Residue identifiers (1-indexed, following PyMOL convention)

Type:

NDArray[np.int64]

means

Mean value for each residue across replicates

Type:

NDArray[np.float64]

sems

SEM for each residue across replicates

Type:

NDArray[np.float64]

n_replicates

Number of replicates aggregated

Type:

int

residue_ids: numpy.typing.NDArray.numpy.int64
means: numpy.typing.NDArray.numpy.float64
sems: numpy.typing.NDArray.numpy.float64
n_replicates: int
to_dict()[source]

Convert to dictionary for JSON serialization.

__init__(residue_ids, means, sems, n_replicates)
polyzymd.analyses.shared.statistics.compute_sem(values, ddof=1)[source]

Compute mean and standard error of the mean.

SEM = std / sqrt(N) where N is the number of samples.

Parameters:
  • values (array_like) – 1D array of values (e.g., one value per replicate)

  • ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).

Returns:

Container with mean, sem, and n_samples

Return type:

StatResult

Examples

>>> values = [2.5, 2.7, 2.6, 2.4, 2.8]  # RMSF from 5 replicates
>>> result = compute_sem(values)
>>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A")
RMSF = 2.60 +/- 0.07 A

Notes

For a single value, SEM is undefined (returns 0.0).

polyzymd.analyses.shared.statistics.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]

Aggregate per-residue values across replicates.

For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.

Parameters:
  • per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.

  • residue_ids (array, optional) – 1-indexed residue identifiers. If None, uses 1, 2, 3, … Following PyMOL convention (1-indexed).

Returns:

Container with residue_ids, means, sems, n_replicates

Return type:

PerResidueStats

Raises:

ValueError – If arrays have inconsistent lengths or no replicates provided

polyzymd.analyses.shared.statistics.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]

Aggregate region-averaged values across replicates.

For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.

This implements the correct hierarchical aggregation: 1. First average within each replicate (over selected residues) 2. Then compute mean +/- SEM across replicate means

Parameters:
  • per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.

  • residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.

Returns:

Mean +/- SEM of region-averaged values across replicates

Return type:

StatResult

polyzymd.analyses.shared.statistics.weighted_mean_with_sem(means, sems, weights=None)[source]

Compute weighted mean with proper error propagation.

Useful for combining results from different conditions or analyses with different uncertainties.

Parameters:
  • means (array_like) – Mean values from each source

  • sems (array_like) – SEM values from each source

  • weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.

Returns:

Weighted mean with propagated uncertainty

Return type:

StatResult

Notes

For inverse-variance weighting, the combined SEM is:

SEM_combined = 1 / sqrt(sum(1/SEM_i^2))

For arbitrary weights, uses standard error propagation:

SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)

polyzymd.analyses.shared.plotting

Shared plotting utilities for analysis plugins.

This module provides reusable plotting helper functions extracted from the plotter infrastructure. Analysis plugins import these functions to apply consistent styling, save figures with watermarks, and render common chart elements (grouped bars, heatmap annotations, etc.) without inheriting from a base class.

All functions accept a PlotSettings object (from polyzymd.config.comparison) so that user-configured themes, palettes, and DPI settings are respected automatically.

Examples

>>> from polyzymd.analyses.shared.plotting import (
...     apply_axis_style, apply_legend, get_colors, save_figure,
... )
>>>
>>> fig, ax = plt.subplots()
>>> colors = get_colors(3, plot_settings)
>>> ax.bar(x, y, color=colors[0])
>>> apply_axis_style(ax, plot_settings, title="My Plot", ylabel="Value (Å)")
>>> apply_legend(ax, plot_settings)
>>> save_figure(fig, output_dir / "my_plot.png", plot_settings)
polyzymd.analyses.shared.plotting.get_theme(plot_settings)[source]

Return the resolved PlotTheme from plot_settings.

Parameters:

plot_settings (PlotSettings) – Global plot settings (carries a .theme property).

Return type:

PlotTheme

polyzymd.analyses.shared.plotting.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]

Apply standard axis chrome from the theme.

Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.

Parameters:
  • ax (matplotlib Axes) – Target axes to style.

  • plot_settings (PlotSettings) – Global plot settings.

  • title (str, optional) – Axes title.

  • xlabel (str, optional) – X-axis label.

  • ylabel (str, optional) – Y-axis label.

polyzymd.analyses.shared.plotting.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]

Apply legend with themed defaults.

Uses theme.legend_loc and theme.legend_bbox unless overridden by the caller. Extra kwargs are forwarded to ax.legend().

Parameters:
  • ax (matplotlib Axes) – Target axes.

  • plot_settings (PlotSettings) – Global plot settings.

  • loc (str, optional) – Override theme.legend_loc.

  • bbox_to_anchor (tuple of float or None, optional) – Override theme.legend_bbox. Pass None explicitly to suppress the bbox (e.g. for inside-axes placement).

  • fontsize (int, optional) – Override theme.legend_fontsize.

  • **kwargs (Any) – Forwarded to ax.legend().

polyzymd.analyses.shared.plotting.get_colors(n, plot_settings)[source]

Get n distinct colors from the configured palette.

Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.

Parameters:
  • n (int) – Number of colors needed.

  • plot_settings (PlotSettings) – Global plot settings (carries color_palette).

Returns:

List of color values (RGB tuples or matplotlib color specs).

Return type:

list

polyzymd.analyses.shared.plotting.get_output_path(output_dir, name, plot_settings)[source]

Generate output file path with correct format extension.

Parameters:
  • output_dir (Path) – Output directory.

  • name (str) – Base filename (without extension).

  • plot_settings (PlotSettings) – Global plot settings (carries format).

Returns:

Full output path with extension.

Return type:

Path

polyzymd.analyses.shared.plotting.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]

Save figure with DPI, watermark, and optional experimental stamp.

Parameters:
  • fig (matplotlib Figure) – Figure to save.

  • output_path (Path) – Output file path.

  • plot_settings (PlotSettings) – Global plot settings (carries dpi and theme).

  • experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.

  • close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.

Returns:

Path to saved figure.

Return type:

Path

polyzymd.analyses.shared.plotting.find_json(directory, preferred, glob_pattern='*.json')[source]

Locate a JSON result file inside directory.

Tries the preferred filename first; if it does not exist, falls back to the first file matching glob_pattern (sorted lexicographically so results are deterministic).

Parameters:
  • directory (Path) – Directory to search.

  • preferred (str) – Exact filename to try first (e.g. "rmsf_aggregated.json").

  • glob_pattern (str, optional) – Glob to use as fallback, by default "*.json".

Returns:

Path to the located file, or None if nothing was found.

Return type:

Path | None

polyzymd.analyses.shared.plotting.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]

Render grouped bars with optional error bars and reference line.

Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from plot_settings.theme. Callers can override any of them via **style_overrides using the theme field names as keys.

Parameters:
  • ax (matplotlib Axes) – Target axes.

  • x (np.ndarray) – 1-D array of group centre positions (e.g. np.arange(n_groups)).

  • series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.

  • colors (sequence) – One colour per condition (same length as series).

  • plot_settings (PlotSettings) – Global plot settings.

  • bar_width (float | None, optional) – Width of each individual bar. When None (default) the width is computed as 0.8 / len(series).

  • show_error (bool, optional) – If False, error bars are suppressed, by default True.

  • reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to None to skip, by default 0.0.

  • reference_label (str, optional) – Legend label for the reference line, by default "Neutral (0)".

  • replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as replicate_values[condition_idx][group_idx] -> sequence of floats (one per replicate). When None (default), no dots are drawn.

  • **style_overrides – Override any theme field for this call only. Accepted keys: bar_alpha, bar_capsize, bar_edgecolor, bar_linewidth, dot_size, dot_alpha, dot_color, reference_line_color, reference_line_style, reference_line_width.

polyzymd.analyses.shared.plotting.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]

Annotate heatmap cells with formatted values.

Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).

Parameters:
  • ax (matplotlib Axes) – The axes containing the heatmap image.

  • matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.

  • plot_settings (PlotSettings) – Global plot settings.

  • fmt (str, optional) – Format spec for the value, by default ".2f".

  • fontsize (int | None, optional) – Annotation font size. When None (default), uses plot_settings.theme.annotation_fontsize.

  • threshold (float, optional) – Absolute-value threshold above which text turns white.

  • sem_matrix (np.ndarray | None, optional) – If provided, a second line ±{sem} is appended when the SEM value is finite.

  • show_sign (bool, optional) – Prefix positive values with "+" , by default True.

  • linespacing (float | None, optional) – Passed to ax.text(linespacing=...) when SEM is shown.

polyzymd.analyses.shared.plotting.symmetric_clim(values, pad=0.1)[source]

Compute symmetric colour limits centred on zero.

Parameters:
  • values (sequence of float or ndarray) – Finite data values to derive limits from.

  • pad (float, optional) – Extra padding added to both sides, by default 0.1.

Returns:

(vmin, vmax) with vmin == -vmax (before padding).

Return type:

tuple[float, float]

polyzymd.analyses.shared.selections

Special selection syntax for distance calculations.

This module provides parsing of extended selection syntax for defining atom positions in distance calculations. It supports:

  1. Standard MDAnalysis selections: “resid 133 and name OD1”

  2. Midpoint of multiple atoms: “midpoint(resid 133 and name OD1 OD2)”

  3. Center of mass of a group: “com(resid 50-75)”

Examples

>>> from polyzymd.analyses.shared.selections import parse_selection_string, get_position
>>>
>>> # Standard selection - single atom
>>> ag = parse_selection(universe, "resid 77 and name OG")
>>> pos = get_position(ag)  # Returns position of single atom
>>>
>>> # Midpoint of Asp carboxyl oxygens
>>> ag = parse_selection(universe, "midpoint(resid 133 and name OD1 OD2)")
>>> pos = get_position(ag)  # Returns midpoint of OD1 and OD2
>>>
>>> # Center of mass of lid domain
>>> ag = parse_selection(universe, "com(resid 50-75)")
>>> pos = get_position(ag)  # Returns COM of residues 50-75

Notes

The midpoint() syntax is particularly useful for catalytic residues where the functional position is between two atoms (e.g., Asp carboxyl oxygens, Glu carboxyl oxygens).

The com() syntax is useful for domain motions where you want to track the center of mass of a group of residues (e.g., lid opening in lipases).

polyzymd.analyses.shared.selections.translate_selection(selection)[source]

Translate PolyzyMD selection keywords to MDAnalysis equivalents.

This allows users to use the same selection syntax in analysis as they use in config.yaml for restraints and other atom selections.

Translations

  • pdbindex Nid N (PDB ATOM serial number)

The pdbindex keyword refers to the 1-indexed atom serial number from the PDB ATOM record (column 7-11), which is what PyMOL displays as “id”. In MDAnalysis, this is accessed via the id selection keyword.

Note: MDAnalysis also has bynum which is 1-indexed positional (i.e., bynum 1 = first atom, bynum 2 = second atom), but this does NOT correspond to PDB serial numbers when there are gaps in numbering. We use id because it matches actual PDB serial numbers.

param selection:

Selection string with possible PolyzyMD-specific keywords

type selection:

str

returns:

Selection string with MDAnalysis-compatible keywords

rtype:

str

Examples

>>> translate_selection("pdbindex 100 and name CA")
"id 100 and name CA"
>>> translate_selection("midpoint(pdbindex 100 and name OD1 OD2)")
"midpoint(id 100 and name OD1 OD2)"
class polyzymd.analyses.shared.selections.SelectionMode(value)[source]

Bases: str, Enum

Mode for position calculation from atom selection.

SINGLE = 'single'
CENTROID = 'centroid'
MIDPOINT = 'midpoint'
COM = 'com'
class polyzymd.analyses.shared.selections.ParsedSelection(selection, mode, original)[source]

Bases: object

Result of parsing a selection string.

selection

The MDAnalysis selection string (without wrapper function)

Type:

str

mode

How to compute the position

Type:

SelectionMode

original

The original input string

Type:

str

selection: str
mode: SelectionMode
original: str
__init__(selection, mode, original)
polyzymd.analyses.shared.selections.parse_selection_string(selection)[source]

Parse a selection string to extract mode and MDAnalysis selection.

Also translates PolyzyMD-specific keywords (like pdbindex) to their MDAnalysis equivalents (like id).

Parameters:

selection (str) – Selection string, possibly with special syntax: - “resid 77 and name OG” - standard MDAnalysis - “midpoint(resid 133 and name OD1 OD2)” - midpoint mode - “com(resid 50-75)” - center of mass mode - “pdbindex 100 and name CA” - PolyzyMD pdbindex (translated to id)

Returns:

Parsed selection with mode and clean selection string

Return type:

ParsedSelection

Examples

>>> parsed = parse_selection_string("midpoint(resid 133 and name OD1 OD2)")
>>> parsed.mode
<SelectionMode.MIDPOINT: 'midpoint'>
>>> parsed.selection
"resid 133 and name OD1 OD2"
>>> parsed = parse_selection_string("pdbindex 100 and name CA")
>>> parsed.selection
"id 100 and name CA"
polyzymd.analyses.shared.selections.select_atoms(universe, selection)[source]

Select atoms from universe using potentially special syntax.

Parameters:
  • universe (Universe) – MDAnalysis Universe

  • selection (str) – Selection string (standard or special syntax)

Returns:

Selected atoms

Return type:

AtomGroup

Raises:

ValueError – If selection matches no atoms, with diagnostic info

polyzymd.analyses.shared.selections.get_position(atoms, mode=SelectionMode.SINGLE)[source]

Get position from atom group based on mode.

Parameters:
  • atoms (AtomGroup) – MDAnalysis AtomGroup

  • mode (SelectionMode) – How to compute position: - SINGLE: Position of single atom (error if multiple) - CENTROID/MIDPOINT: Center of geometry - COM: Center of mass

Returns:

3D position vector [x, y, z]

Return type:

NDArray[np.float64]

Raises:

ValueError – If mode is SINGLE but multiple atoms selected

polyzymd.analyses.shared.selections.get_position_from_selection(universe, selection)[source]

Get position from selection string in one step.

This is a convenience function that combines parsing, selection, and position calculation.

Parameters:
  • universe (Universe) – MDAnalysis Universe

  • selection (str) – Selection string (standard or special syntax)

Returns:

3D position vector [x, y, z]

Return type:

NDArray[np.float64]

Examples

>>> # Single atom
>>> pos = get_position_from_selection(u, "resid 77 and name OG")
>>>
>>> # Midpoint of Asp carboxyl
>>> pos = get_position_from_selection(u, "midpoint(resid 133 and name OD1 OD2)")
polyzymd.analyses.shared.selections.validate_selection(universe, selection)[source]

Validate a selection string and return diagnostic info.

Parameters:
  • universe (Universe) – MDAnalysis Universe

  • selection (str) – Selection string to validate

Returns:

Diagnostic information: - valid: bool - n_atoms: int - mode: str - atoms: list of atom info dicts - error: str (if invalid) - diagnostics: str (detailed diagnostics if invalid)

Return type:

dict

polyzymd.analyses.shared.selections.format_selection_for_label(selection)[source]

Convert selection string to a short label for filenames/display.

Parameters:

selection (str) – Selection string (standard or special syntax)

Returns:

Short label (e.g., “Asp133_mid” or “Ser77_OG”)

Return type:

str

Examples

>>> format_selection_for_label("midpoint(resid 133 and name OD1 OD2)")
"res133_mid"
>>> format_selection_for_label("resid 77 and name OG")
"res77_OG"

polyzymd.analyses.shared.config_hash

Config hashing for analysis cache validation.

When analysis results are cached, we store a hash of the relevant config parameters. If the config changes, we warn the user that cached results may be invalid.

This module provides: - compute_config_hash: Generate a hash of analysis-relevant config parameters - validate_config_hash: Check if stored hash matches current config

Design Decision:

Config immutability is expected. If a user modifies config parameters (e.g., temperature), they should create a new project directory. The hash validation is a safety check, not an enforcement mechanism.

polyzymd.analyses.shared.config_hash.compute_config_hash(config)[source]

Compute hash of config parameters relevant to analysis.

Includes parameters that affect trajectory interpretation: - enzyme configuration - substrate configuration - polymer configuration - thermodynamics (temperature, pressure) - output paths (for trajectory location)

Excludes parameters that don’t affect analysis of completed trajectories: - simulation_phases (equilibration_stages/production settings) - force_field (already baked into trajectory)

Parameters:

config (SimulationConfig) – PolyzyMD simulation configuration

Returns:

Hex digest of SHA-256 hash (first 16 characters for brevity)

Return type:

str

Examples

>>> from polyzymd.config import load_config
>>> config = load_config("config.yaml")
>>> hash_val = compute_config_hash(config)
>>> print(f"Config hash: {hash_val}")
Config hash: a3b2c1d4e5f67890
polyzymd.analyses.shared.config_hash.settings_fingerprint(settings)[source]

Compute a short deterministic fingerprint for analysis settings.

The fingerprint is derived from canonical JSON produced with json.dumps(settings.model_dump(mode="json"), sort_keys=True), then hashed with SHA-256. It is intended for cache identity, so changing settings (for example contacts cutoff) naturally changes cache filenames.

Parameters:

settings (BaseModel) – Analysis plugin settings model.

Returns:

First 8 hexadecimal characters of the SHA-256 digest.

Return type:

str

polyzymd.analyses.shared.config_hash.compute_cache_identity(*, config_hash, settings=None, settings_fp=None, cache_params=None, length=12)[source]

Compute a deterministic cache identity across config and settings.

This helper unifies cache identity generation for analysis caches. The identity combines:

  • Simulation config hash

  • Analysis settings fingerprint

  • Extra cache parameters when needed

Parameters:
  • config_hash (str) – Hash of simulation config returned by compute_config_hash().

  • settings (BaseModel or None, optional) – Analysis settings model. Used only when settings_fp is not provided.

  • settings_fp (str or None, optional) – Precomputed settings fingerprint. If provided, this takes precedence over computing from settings.

  • cache_params (Mapping[str, object] or None, optional) – Additional cache identity inputs such as equilibration or selection.

  • length (int, optional) – Number of hex characters to return, by default 12.

Returns:

Short hex identity safe for filenames.

Return type:

str

Raises:

ValueError – If neither settings nor settings_fp is provided.

polyzymd.analyses.shared.config_hash.extract_settings_fingerprint_from_path(cache_path)[source]

Extract settings fingerprint from a cache filename when present.

Parameters:

cache_path (str or Path) – Path to a cached result file.

Returns:

Parsed 8-character fingerprint, or None when filename does not encode a settings fingerprint.

Return type:

str | None

polyzymd.analyses.shared.config_hash.validate_settings_fingerprint(stored_fingerprint, current_settings, *, warn=True, source=None)[source]

Validate cached settings fingerprint against current analysis settings.

Parameters:
  • stored_fingerprint (str or None) – Fingerprint read from cached result metadata or filename.

  • current_settings (BaseModel) – Current plugin settings used for this analysis invocation.

  • warn (bool, optional) – Emit warnings on mismatch or missing fingerprint, by default True.

  • source (str or Path or None, optional) – Optional cache source path for diagnostics.

Returns:

True when cache settings are compatible with current settings, otherwise False.

Return type:

bool

Notes

Legacy cache files may not encode settings fingerprints. These files are treated as compatible for backward compatibility, with a warning to encourage recomputation.

polyzymd.analyses.shared.config_hash.validate_config_hash(stored_hash, current_config, warn=True)[source]

Check if stored hash matches current config.

If the hashes don’t match, this indicates the config has changed since the analysis was performed. This could mean: 1. The user modified the config (bad practice - should create new project) 2. The analysis was performed on a different config file 3. A bug in the hashing algorithm

Parameters:
  • stored_hash (str) – Hash stored in cached analysis results

  • current_config (SimulationConfig) – Current configuration being used

  • warn (bool, optional) – If True (default), print a loud warning when hashes don’t match

Returns:

True if hashes match, False otherwise

Return type:

bool

Examples

>>> stored_hash = loaded_result.get("config_hash", "")
>>> config = load_config("config.yaml")
>>> if not validate_config_hash(stored_hash, config):
...     print("Warning: Results may be stale!")

polyzymd.analyses.shared.sasa

Shared SASA computation utilities.

This module provides reusable helpers for SASA analyses that need selection validation, atom-level and residue-level SASA traces, and raw artifact persistence.

class polyzymd.analyses.shared.sasa.SASAComputationResult(atom_sasa_a2, residue_sasa_a2, total_sasa_a2, frames, time_ns, target_atom_indices, context_atom_indices, residue_keys, residue_chainids, residue_resids, residue_resnames)[source]

Bases: object

Container for raw SASA computation outputs.

Parameters:
  • atom_sasa_a2 (NDArray[np.float64]) – Per-atom SASA trace in Ų, shape (n_frames, n_target_atoms).

  • residue_sasa_a2 (NDArray[np.float64]) – Per-residue SASA trace in Ų, shape (n_frames, n_target_residues).

  • total_sasa_a2 (NDArray[np.float64]) – Total target SASA in Ų, shape (n_frames,).

  • frames (NDArray[np.int64]) – 0-indexed frame indices used for analysis.

  • time_ns (NDArray[np.float64]) – Time axis in ns corresponding to frames.

  • target_atom_indices (NDArray[np.int64]) – Universe-global atom indices for target atoms.

  • context_atom_indices (NDArray[np.int64]) – Universe-global atom indices for context atoms.

  • residue_keys (list[str]) – Residue identity keys in chainID:resid:resname format.

  • residue_chainids (list[str]) – Chain IDs for each residue key.

  • residue_resids (list[int]) – Residue IDs for each residue key.

  • residue_resnames (list[str]) – Residue names for each residue key.

atom_sasa_a2: numpy.typing.NDArray.numpy.float64
residue_sasa_a2: numpy.typing.NDArray.numpy.float64
total_sasa_a2: numpy.typing.NDArray.numpy.float64
frames: numpy.typing.NDArray.numpy.int64
time_ns: numpy.typing.NDArray.numpy.float64
target_atom_indices: numpy.typing.NDArray.numpy.int64
context_atom_indices: numpy.typing.NDArray.numpy.int64
residue_keys: list[str]
residue_chainids: list[str]
residue_resids: list[int]
residue_resnames: list[str]
__init__(atom_sasa_a2, residue_sasa_a2, total_sasa_a2, frames, time_ns, target_atom_indices, context_atom_indices, residue_keys, residue_chainids, residue_resids, residue_resnames)
polyzymd.analyses.shared.sasa.resolve_selection_indices(universe, selection, *, role, run_label)[source]

Resolve an MDAnalysis selection to atom indices.

Parameters:
  • universe (Any) – MDAnalysis universe.

  • selection (str) – MDAnalysis selection string.

  • role (str) – Selection role used in warning context.

  • run_label (str) – Human-readable run label.

Returns:

Selected atom group and its universe-global atom indices.

Return type:

tuple[Any, NDArray[np.int64]]

polyzymd.analyses.shared.sasa.validate_target_subset(target_indices, context_indices, *, run_label, target_selection, context_selection)[source]

Validate target-selection atoms are a subset of context-selection atoms.

Raises:

ValueError – If any target atom is absent from the context selection.

polyzymd.analyses.shared.sasa.compute_sasa(universe, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, start_frame, stop_frame, timestep_ps, chunk_size=100, stride=1)[source]

Compute target SASA over a selected context.

Parameters:
  • universe (Any) – MDAnalysis universe.

  • run_label (str) – Human-readable run label.

  • target_selection (str) – Selection of atoms whose SASA is reported.

  • context_selection (str) – Selection of atoms considered during SASA computation.

  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of sphere points for Shrake-Rupley.

  • start_frame (int) – First frame index (inclusive).

  • stop_frame (int) – Last frame index (exclusive).

  • timestep_ps (float) – Timestep in ps.

  • chunk_size (int, optional) – Number of analyzed frames processed per Shrake-Rupley chunk, by default 100.

  • stride (int, optional) – Frame stride applied before chunking (1 = analyze every frame), by default 1.

Returns:

Raw atom-level, residue-level, and total SASA traces in Ų.

Return type:

SASAComputationResult

Raises:

ValueError – If frame bounds or stride/chunk parameters are invalid.

polyzymd.analyses.shared.sasa.save_sasa_artifacts(npz_path, metadata_path, result, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]

Save raw SASA arrays plus JSON sidecar metadata.

Parameters:
  • npz_path (Path) – Output path for compressed NumPy archive.

  • metadata_path (Path) – Output path for JSON metadata sidecar.

  • result (SASAComputationResult) – Raw SASA arrays to persist.

  • run_label (str) – Run label.

  • target_selection (str) – Target selection string.

  • context_selection (str) – Context selection string.

  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of sphere points.

  • equilibration (str) – Equilibration string.

polyzymd.analyses.shared.sasa.load_sasa_artifacts(npz_path, metadata_path)[source]

Load SASA raw arrays and metadata from disk.

Parameters:
  • npz_path (Path) – Path to compressed NumPy archive.

  • metadata_path (Path) – Path to JSON metadata sidecar.

Returns:

Reconstructed raw result and parsed metadata dictionary.

Return type:

tuple[SASAComputationResult, dict[str, Any]]

class polyzymd.analyses.shared.sasa.SASAArtifactContract(schema_name, schema_version, compatibility_version, producer, engine, mode, units, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration, compatibility_hash)[source]

Bases: object

Canonical contract metadata for a persisted SASA artifact.

Parameters:
  • schema_name (str) – Artifact schema identifier.

  • schema_version (int) – Artifact schema version.

  • compatibility_version (int) – Compatibility hash payload version.

  • producer (str) – Producer module name.

  • engine (str) – SASA engine identifier.

  • mode (str) – SASA mode. Canonical value is "atom".

  • units (str) – SASA units. Canonical value is "A^2".

  • run_label (str) – Human-readable run label.

  • target_selection (str) – Target atom selection used for SASA reporting.

  • context_selection (str) – Context atom selection used during SASA computation.

  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of Shrake-Rupley sphere points.

  • equilibration (str) – Equilibration descriptor.

  • compatibility_hash (str) – Deterministic compatibility hash.

schema_name: str
schema_version: int
compatibility_version: int
producer: str
engine: str
mode: str
units: str
run_label: str
target_selection: str
context_selection: str
probe_radius_nm: float
n_sphere_points: int
equilibration: str
compatibility_hash: str
__init__(schema_name, schema_version, compatibility_version, producer, engine, mode, units, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration, compatibility_hash)
class polyzymd.analyses.shared.sasa.SASAArtifactCompatibilityQuery(probe_radius_nm, n_sphere_points, equilibration, selection=None, context_selection=None)[source]

Bases: object

Compatibility query metadata for sibling artifact lookup.

Parameters:
  • probe_radius_nm (float) – Probe radius expected by the consumer.

  • n_sphere_points (int) – Sphere point count expected by the consumer.

  • equilibration (str) – Equilibration label expected by the consumer.

  • selection (str | None, optional) – Target selection string for advisory hash comparison.

  • context_selection (str | None, optional) – Context selection string for advisory hash comparison.

probe_radius_nm: float
n_sphere_points: int
equilibration: str
selection: str | None = None
context_selection: str | None = None
__init__(probe_radius_nm, n_sphere_points, equilibration, selection=None, context_selection=None)
class polyzymd.analyses.shared.sasa.SASAArtifactCompatibility(is_compatible, is_legacy, schema_version, selection_hash_matches, mismatched_fields)[source]

Bases: object

Compatibility decision for a single SASA artifact metadata payload.

Parameters:
  • is_compatible (bool) – True when metadata-level required fields are compatible.

  • is_legacy (bool) – True when schema version fields are absent.

  • schema_version (int | None) – Parsed schema version if present.

  • selection_hash_matches (bool | None) – Advisory selection hash result when query selections are supplied.

  • mismatched_fields (tuple[str, ...]) – Names of fields that made the artifact incompatible.

is_compatible: bool
is_legacy: bool
schema_version: int | None
selection_hash_matches: bool | None
mismatched_fields: tuple[str, ...]
__init__(is_compatible, is_legacy, schema_version, selection_hash_matches, mismatched_fields)
class polyzymd.analyses.shared.sasa.SASASiblingArtifactMatch(sibling_analysis_dir, npz_path, metadata_path, metadata, compatibility)[source]

Bases: object

A sibling SASA artifact candidate and its compatibility outcome.

Parameters:
  • sibling_analysis_dir (Path) – Directory containing sibling SASA artifacts for the replicate.

  • npz_path (Path) – Path to the SASA NPZ payload.

  • metadata_path (Path) – Path to the JSON metadata sidecar.

  • metadata (dict[str, Any]) – Parsed metadata dictionary.

  • compatibility (SASAArtifactCompatibility) – Compatibility decision for this artifact.

sibling_analysis_dir: Path
npz_path: Path
metadata_path: Path
metadata: dict[str, Any]
compatibility: SASAArtifactCompatibility
__init__(sibling_analysis_dir, npz_path, metadata_path, metadata, compatibility)
polyzymd.analyses.shared.sasa.compute_sasa_artifact_compatibility_hash(*, probe_radius_nm, n_sphere_points, selection, equilibration, context_selection=None)[source]

Compute deterministic compatibility hash for SASA artifacts.

Parameters:
  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of Shrake-Rupley sphere points.

  • selection (str) – Target selection string.

  • equilibration (str) – Equilibration label.

  • context_selection (str | None, optional) – Context selection string. Defaults to selection when omitted.

Returns:

First 16 characters of the SHA-256 compatibility digest.

Return type:

str

polyzymd.analyses.shared.sasa.build_sasa_artifact_contract(*, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]

Build canonical SASA artifact contract metadata.

Parameters:
  • run_label (str) – Human-readable run label.

  • target_selection (str) – Target atom selection used for SASA reporting.

  • context_selection (str) – Context atom selection used during SASA computation.

  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of Shrake-Rupley sphere points.

  • equilibration (str) – Equilibration descriptor.

Returns:

Fully populated canonical metadata contract.

Return type:

SASAArtifactContract

polyzymd.analyses.shared.sasa.build_sasa_artifact_metadata(result, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]

Build flat SASA artifact metadata with schema versioning fields.

Parameters:
  • result (SASAComputationResult) – Raw SASA arrays and residue metadata.

  • run_label (str) – Human-readable run label.

  • target_selection (str) – Target atom selection.

  • context_selection (str) – Context atom selection.

  • probe_radius_nm (float) – Probe radius in nm.

  • n_sphere_points (int) – Number of Shrake-Rupley sphere points.

  • equilibration (str) – Equilibration descriptor.

Returns:

Flat metadata dictionary for JSON sidecar persistence.

Return type:

dict[str, Any]

polyzymd.analyses.shared.sasa.check_sasa_artifact_compatibility(metadata, query)[source]

Evaluate metadata-level compatibility for reusable SASA artifacts.

Parameters:
  • metadata (dict[str, Any]) – Parsed SASA metadata sidecar.

  • query (SASAArtifactCompatibilityQuery) – Consumer compatibility query.

Returns:

Compatibility decision containing definitive and advisory signals.

Return type:

SASAArtifactCompatibility

polyzymd.analyses.shared.sasa.adapt_canonical_sasa_to_exposure(result, *, exposure_threshold)[source]

Adapt canonical shared SASA result into exposure SASA trajectory format.

Parameters:
  • result (SASAComputationResult) – Canonical shared SASA computation result in Ų.

  • exposure_threshold (float) – Relative SASA threshold used by exposure consumers.

Returns:

SASATrajectoryResult instance for exposure analysis.

Return type:

Any

polyzymd.analyses.shared.sasa.find_sibling_sasa_artifacts(replicate_analysis_dir, query)[source]

Find compatible sibling SASA artifacts for a replicate analysis directory.

Parameters:
  • replicate_analysis_dir (Path) – Replicate analysis directory, for example analysis/<cond>/exposure/run_1.

  • query (SASAArtifactCompatibilityQuery) – Compatibility query used to pre-filter candidates.

Returns:

Compatible sibling artifacts sorted by preference.

Return type:

list[SASASiblingArtifactMatch]

polyzymd.analyses.shared.surface_exposure

Surface exposure filtering using SASA calculations.

Shared utility for determining which protein residues are surface-exposed based on solvent-accessible surface area (SASA) calculations. Surface exposure data is consumed by multiple analysis plugins (contacts, binding_free_energy, polymer_affinity) for binding preference analysis, since buried residues cannot participate in polymer-protein contacts.

Uses rust_sasa_python for fast SASA computation on initial PDB structures.

The surface exposure threshold determines which residues are considered “exposed” based on their relative SASA (actual SASA / max theoretical SASA). A threshold of 0.2 means residues with at least 20% of their theoretical maximum surface area exposed are considered surface-accessible.

Examples

>>> from polyzymd.analyses.shared.surface_exposure import SurfaceExposureFilter
>>> filter = SurfaceExposureFilter(threshold=0.2)
>>> result = filter.calculate("enzyme.pdb")
>>> print(f"Found {result.exposed_count} surface-exposed residues")
>>> print(result.exposed_by_aa_class())
{'aromatic': 12, 'polar': 25, 'nonpolar': 18, ...}
class polyzymd.analyses.shared.surface_exposure.ResidueExposure(resid, resname, chain_id, sasa, max_sasa, relative_sasa, is_exposed, aa_class)[source]

Bases: object

SASA data for a single protein residue.

resid

Residue ID (1-indexed, as in PDB)

Type:

int

resname

3-letter residue name (e.g., “ALA”, “PHE”)

Type:

str

chain_id

Chain identifier from PDB

Type:

str

sasa

Calculated SASA in Angstrom^2

Type:

float

max_sasa

Maximum theoretical SASA for this residue type (Tien et al. 2013)

Type:

float

relative_sasa

Ratio of actual to maximum SASA (0.0 to ~1.0+)

Type:

float

is_exposed

True if relative_sasa >= threshold

Type:

bool

aa_class

Amino acid classification (aromatic, polar, etc.)

Type:

str

resid: int
resname: str
chain_id: str
sasa: float
max_sasa: float
relative_sasa: float
is_exposed: bool
aa_class: str
__init__(resid, resname, chain_id, sasa, max_sasa, relative_sasa, is_exposed, aa_class)
class polyzymd.analyses.shared.surface_exposure.SurfaceExposureResult(residue_exposures=<factory>, threshold=0.2, pdb_path='', probe_radius=1.4, n_points=1000)[source]

Bases: object

Surface exposure analysis result for a protein structure.

Contains SASA data for all residues and provides convenience methods for filtering and grouping by exposure status and amino acid class.

residue_exposures

SASA data for each residue

Type:

list[ResidueExposure]

threshold

Relative SASA threshold used (e.g., 0.2 for 20%)

Type:

float

pdb_path

Path to the PDB file analyzed

Type:

str

probe_radius

Probe radius used for SASA calculation

Type:

float

n_points

Number of points used for SASA calculation

Type:

int

residue_exposures: list[ResidueExposure]
threshold: float = 0.2
pdb_path: str = ''
probe_radius: float = 1.4
n_points: int = 1000
property exposed_resids: set[int]

Set of residue IDs that are surface-exposed.

property buried_resids: set[int]

Set of residue IDs that are buried (not surface-exposed).

property all_resids: set[int]

Set of all residue IDs.

property exposed_count: int

Number of surface-exposed residues.

property total_count: int

Total number of residues analyzed.

exposed_by_aa_class()[source]

Count of exposed residues per amino acid class.

Returns:

Mapping from AA class to count of exposed residues. E.g., {‘aromatic’: 12, ‘polar’: 25, …}

Return type:

dict[str, int]

total_by_aa_class()[source]

Count of all residues per amino acid class (exposed + buried).

Returns:

Mapping from AA class to total count.

Return type:

dict[str, int]

exposed_in_selection(resids)[source]

Get exposed residues within a specific selection.

Parameters:

resids (set[int]) – Set of residue IDs to filter

Returns:

Intersection of resids with exposed_resids

Return type:

set[int]

get_exposure(resid)[source]

Get exposure data for a specific residue.

Parameters:

resid (int) – Residue ID to look up

Returns:

Exposure data, or None if resid not found

Return type:

ResidueExposure or None

to_dict()[source]

Serialize to dictionary for JSON storage.

classmethod from_dict(data)[source]

Deserialize from dictionary.

__init__(residue_exposures=<factory>, threshold=0.2, pdb_path='', probe_radius=1.4, n_points=1000)
class polyzymd.analyses.shared.surface_exposure.SurfaceExposureFilter(threshold=0.2, probe_radius=1.4, n_points=1000)[source]

Bases: object

Compute surface exposure from initial PDB structure.

Uses rust_sasa_python for fast SASA calculation. Residues are classified as “exposed” if their relative SASA (actual/max theoretical) meets or exceeds the threshold.

Parameters:
  • threshold (float) – Relative SASA threshold (0-1). Residues with SASA/maxSASA >= threshold are considered exposed. Default 0.2 (20% of max theoretical).

  • probe_radius (float) – Probe radius for SASA calculation in Angstroms. Default 1.4 (water molecule radius).

  • n_points (int) – Number of points for SASA calculation. Higher = more accurate but slower. Default 1000.

Examples

>>> filter = SurfaceExposureFilter(threshold=0.2)
>>> result = filter.calculate("enzyme.pdb")
>>> exposed = result.exposed_resids
>>> print(f"Found {len(exposed)} exposed residues")
>>> # Get exposed residues by AA class
>>> by_class = result.exposed_by_aa_class()
>>> print(f"Exposed aromatics: {by_class.get('aromatic', 0)}")
__init__(threshold=0.2, probe_radius=1.4, n_points=1000)[source]
calculate(pdb_path)[source]

Calculate surface exposure for all protein residues.

Parameters:

pdb_path (Path or str) – Path to PDB file (typically the enzyme PDB from config). Should contain only the protein (not polymer/solvent).

Returns:

Surface exposure data for all residues

Return type:

SurfaceExposureResult

Raises:

polyzymd.analyses.shared.binding_preference

Binding preference metrics for polymer-protein contacts.

Shared compute layer consumed by contacts, binding_free_energy, and polymer_affinity plugins.

class polyzymd.analyses.shared.binding_preference.PolymerComposition(*, residue_counts=<factory>, heavy_atom_counts=<factory>)[source]

Bases: BaseModel

Polymer composition data extracted from trajectory.

Contains counts of residues and heavy atoms for each polymer type, enabling normalization of enrichment ratios by polymer availability.

This data is used for dual normalization of binding preference:

  • Residue-based: Normalizes by number of polymer residues per type. Matches the experimental viewpoint where concentrations are specified in terms of monomer units.

  • Atom-based: Normalizes by number of heavy atoms (non-hydrogen) per type. Accounts for differences in monomer size, since larger monomers have more surface area and thus more opportunity for contacts.

residue_counts

Number of residues per polymer type (e.g., {“SBM”: 50, “EGM”: 50})

Type:

dict[str, int]

heavy_atom_counts

Number of heavy atoms (non-hydrogen) per polymer type. Heavy atoms are defined as all atoms with element != ‘H’.

Type:

dict[str, int]

Examples

>>> composition = PolymerComposition(
...     residue_counts={"SBM": 50, "EGM": 50},
...     heavy_atom_counts={"SBM": 750, "EGM": 400},
... )
>>> composition.total_residues
100
>>> composition.residue_fraction("SBM")
0.5
>>> composition.heavy_atom_fraction("SBM")
0.652  # SBM has larger monomers
residue_counts: dict[str, int]
heavy_atom_counts: dict[str, int]
property total_residues: int

Total polymer residues across all types.

property total_heavy_atoms: int

Total heavy atoms across all polymer types.

residue_fraction(polymer_type)[source]

Fraction of residues that are this polymer type.

Parameters:

polymer_type (str) – Polymer type name (e.g., “SBM”)

Returns:

Fraction in range [0, 1], or 0.0 if type not found

Return type:

float

heavy_atom_fraction(polymer_type)[source]

Fraction of heavy atoms that are this polymer type.

Parameters:

polymer_type (str) – Polymer type name (e.g., “SBM”)

Returns:

Fraction in range [0, 1], or 0.0 if type not found

Return type:

float

polymer_types()[source]

Get sorted list of polymer types in this composition.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.BindingPreferenceEntry(*, polymer_type, protein_group, total_contact_frames, mean_contact_fraction, n_residues_in_group, n_exposed_in_group, n_residues_contacted=0, contact_share=0.0, expected_share=0.0, enrichment=None, polymer_residue_count=0, total_polymer_residues=0, polymer_heavy_atom_count=0, total_polymer_heavy_atoms=0)[source]

Bases: BaseModel

Single entry in the binding preference matrix.

Represents the binding preference metrics for one (polymer_type, protein_group) combination.

Enrichment Interpretation (centered at zero)

The enrichment ratio measures whether a polymer type contacts a protein group more or less than expected based on surface availability.

  • enrichment > 0: Preferential binding (more contacts than expected)
    • +0.5 means “50% more contacts than expected”

    • +1.0 means “2× as many contacts as expected”

  • enrichment = 0: Neutral (contact frequency matches surface availability)

  • enrichment < 0: Avoidance (fewer contacts than expected)
    • -0.3 means “30% fewer contacts than expected”

  • enrichment = -1: Complete avoidance (no contacts at all)

The expected share is based on protein surface availability:

expected_share = n_exposed_in_group / total_exposed_residues

This normalization answers: “Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally, more than proportionally, or less?”

polymer_type

Polymer residue type (e.g., “SBM”, “EGM”)

Type:

str

protein_group

Protein group label (e.g., “aromatic”, “charged_positive”)

Type:

str

total_contact_frames

Sum of contact frames across all exposed residues in this group.

Type:

int

mean_contact_fraction

Average per-residue contact fraction within this group.

Type:

float

n_residues_in_group

Total residues in this protein group (exposed + buried)

Type:

int

n_exposed_in_group

Surface-exposed residues in this group (used for enrichment)

Type:

int

n_residues_contacted

Number of exposed residues that had at least one contact

Type:

int

contact_share

Fraction of this polymer’s total contacts that went to this group.

Type:

float

expected_share

Expected contact share based on surface availability (n_exposed_in_group / total_exposed_residues)

Type:

float

enrichment

Zero-centered enrichment: (contact_share / expected_share) - 1

Type:

float | None

polymer_residue_count

Number of residues of this polymer type (metadata)

Type:

int

total_polymer_residues

Total polymer residues across all types (metadata)

Type:

int

polymer_heavy_atom_count

Number of heavy atoms for this polymer type (metadata)

Type:

int

total_polymer_heavy_atoms

Total polymer heavy atoms across all types (metadata)

Type:

int

polymer_type: str
protein_group: str
total_contact_frames: int
mean_contact_fraction: float
n_residues_in_group: int
n_exposed_in_group: int
n_residues_contacted: int
contact_share: float
expected_share: float
enrichment: float | None
polymer_residue_count: int
total_polymer_residues: int
polymer_heavy_atom_count: int
total_polymer_heavy_atoms: int
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.BindingPreferenceResult(**data)[source]

Bases: BaseModel

Complete binding preference analysis result.

Provides enrichment-normalized metrics for polymer-protein binding preferences, answering questions like:

  • “Does SBMA preferentially bind aromatic residues?”

  • “How does EGMA’s preference for charged residues compare to SBMA?”

  • “Which amino acid class does this polymer type prefer?”

Enrichment values are centered at zero: - enrichment > 0: Preferential binding - enrichment = 0: Neutral (random chance) - enrichment < 0: Avoidance

entries

All (polymer_type × protein_group) combinations

Type:

list[BindingPreferenceEntry]

n_frames

Total frames analyzed

Type:

int

total_exposed_residues

Number of surface-exposed residues considered

Type:

int

surface_exposure_threshold

SASA threshold used for surface filtering

Type:

float

protein_groups_used

Mapping of group name to MDAnalysis selection string

Type:

dict[str, str]

polymer_types_used

Mapping of polymer type name to MDAnalysis selection string

Type:

dict[str, str]

polymer_composition

Polymer composition data (residue/atom counts per type, metadata only)

Type:

PolymerComposition

system_coverage

System-level coverage metrics collapsed across polymer types. Answers: “What does this polymer mixture collectively cover?”

Type:

SystemCoverageResult, optional

schema_version

Version for forward compatibility. Version 4 adds system_coverage.

Type:

int

entries: list[BindingPreferenceEntry]
n_frames: int
total_exposed_residues: int
surface_exposure_threshold: float | None
protein_groups_used: dict[str, str]
polymer_types_used: dict[str, str]
polymer_composition: PolymerComposition | None
system_coverage: SystemCoverageResult | None
binding_preference: PolymerBindingPreferenceResult | None
metadata: dict[str, Any]
schema_version: int
to_dataframe()[source]

Convert to pandas DataFrame for analysis/plotting.

Returns:

Columns: polymer_type, protein_group, total_contact_frames, mean_contact_fraction, n_residues_in_group, n_exposed_in_group, n_residues_contacted, contact_share, expected_share, enrichment

Return type:

pd.DataFrame

enrichment_matrix()[source]

Get enrichment as nested dict: {polymer_type: {protein_group: value}}.

Enrichment values are centered at zero and normalized by protein surface availability: - > 0: Preferential binding (more contacts than expected) - = 0: Neutral (matches surface availability) - < 0: Avoidance (fewer contacts than expected)

Returns:

Nested mapping of enrichment values. Missing/invalid values are returned as 0.0.

Return type:

dict[str, dict[str, float]]

Examples

>>> matrix = result.enrichment_matrix()
>>> print(matrix["SBM"]["aromatic"])
0.45  # 45% more contacts than expected based on surface availability
contact_fraction_matrix()[source]

Get mean contact fractions as nested dict.

Returns:

Nested mapping: {polymer_type: {protein_group: mean_frac}}

Return type:

dict[str, dict[str, float]]

contact_share_matrix()[source]

Get contact shares as nested dict.

Returns:

Nested mapping: {polymer_type: {protein_group: contact_share}}

Return type:

dict[str, dict[str, float]]

get_enrichment(polymer_type, protein_group)[source]

Get enrichment for a specific (polymer_type, protein_group) pair.

Parameters:
  • polymer_type (str) – Polymer type name

  • protein_group (str) – Protein group name

Returns:

Enrichment value (centered at zero), or None if pair not found. Enrichment is based on protein surface availability: (contact_share / expected_share) - 1

Return type:

float or None

get_entry(polymer_type, protein_group)[source]

Get the full entry for a (polymer_type, protein_group) pair.

Parameters:
  • polymer_type (str) – Polymer type name

  • protein_group (str) – Protein group name

Returns:

Full entry, or None if not found

Return type:

BindingPreferenceEntry or None

polymer_types()[source]

Get list of polymer types in this result.

protein_groups()[source]

Get list of protein groups in this result.

save(path)[source]

Save to JSON file.

classmethod load(path)[source]

Load from JSON file.

Parameters:

path (str or Path) – Path to JSON file

Returns:

Loaded result

Return type:

BindingPreferenceResult

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.binding_preference.compute_binding_preference(contact_result, surface_exposure, protein_groups, polymer_composition, polymer_types=None, protein_group_selections=None, polymer_type_selections=None, protein_partitions=None)[source]

Compute binding preference from contact results.

This function computes enrichment ratios for each (polymer_type, protein_group) combination, answering: “Does this polymer type preferentially bind this protein group compared to random chance?”

The enrichment calculation accounts for: 1. Surface exposure (only exposed residues are considered) 2. Contact duration (contact frames are summed, not binary counts) 3. Polymer composition (normalization by residue count or heavy atom count)

Parameters:
  • contact_result (Any) – Raw contact analysis results from trajectory

  • surface_exposure (SurfaceExposureResult) – Surface exposure data for filtering buried residues

  • protein_groups (dict[str, set[int]]) – Mapping of group name to residue IDs in that group. Only surface-exposed residues in each group are counted. Example: {“aromatic”: {12, 45, 67}, “charged”: {23, 34}}

  • polymer_composition (PolymerComposition) – Polymer composition data (residue and heavy atom counts per type). Used for dual normalization of enrichment ratios.

  • polymer_types (list[str], optional) – If provided, only compute for these polymer residue types. If None, all polymer types found in contacts are used.

  • protein_group_selections (dict[str, str], optional) – Original MDAnalysis selections (for metadata/reproducibility)

  • polymer_type_selections (dict[str, str], optional) – Original MDAnalysis selections (for metadata/reproducibility)

  • protein_partitions (dict[str, list[str]], optional) – User-defined partitions for system coverage plots. Each partition maps a name to a list of group names from protein_groups. Groups within a partition must be mutually exclusive. Example: {“lid_helices”: [“lid_helix_5”, “lid_helix_10”]}

Returns:

Binding preference metrics with dual enrichment normalization

Return type:

BindingPreferenceResult

Notes

Enrichment calculation is centered at zero

For each (polymer_type, protein_group) pair:

contact_share = polymer_contacts_to_group / polymer_total_contacts

Two normalization methods are computed:

  1. Residue-based (matches experimental concentration ratios):

    expected_by_residue = polymer_residue_count / total_polymer_residues enrichment_by_residue = (contact_share / expected_by_residue) - 1

  2. Atom-based (accounts for monomer size via heavy atoms):

    expected_by_atoms = polymer_heavy_atoms / total_polymer_heavy_atoms enrichment_by_atoms = (contact_share / expected_by_atoms) - 1

Interpretation (both methods): - enrichment > 0: Preferential binding (more contacts than expected) - enrichment = 0: Neutral (matches random chance) - enrichment < 0: Avoidance (fewer contacts than expected) - enrichment = -1: Complete avoidance (no contacts at all)

When to use which metric: - enrichment_by_residue: Direct comparison to experimental concentration ratios - enrichment_by_atoms: Reveals true chemical affinity vs. geometric/steric effects

polyzymd.analyses.shared.binding_preference.aggregate_binding_preference(results)[source]

Aggregate binding preference across replicates.

Computes mean ± SEM for both residue-based and atom-based enrichment ratios across multiple replicates.

Parameters:

results (list[BindingPreferenceResult]) – Binding preference results from multiple replicates

Returns:

Aggregated results with mean and SEM for both normalization methods

Return type:

AggregatedBindingPreferenceResult

polyzymd.analyses.shared.binding_preference.aggregate_binding_preference_results(results)

Aggregate binding preference across replicates.

Computes mean ± SEM for both residue-based and atom-based enrichment ratios across multiple replicates.

Parameters:

results (list[BindingPreferenceResult]) – Binding preference results from multiple replicates

Returns:

Aggregated results with mean and SEM for both normalization methods

Return type:

AggregatedBindingPreferenceResult

class polyzymd.analyses.shared.binding_preference.AggregatedBindingPreferenceEntry(*, polymer_type, protein_group, mean_enrichment=None, sem_enrichment=None, per_replicate_enrichments=<factory>, mean_contact_fraction=0.0, sem_contact_fraction=0.0, mean_contact_share=0.0, expected_share=0.0, n_exposed_in_group=0, n_residues_in_group=0, n_replicates=0)[source]

Bases: BaseModel

Aggregated binding preference for one (polymer_type, protein_group) pair.

Contains mean ± SEM across replicates for enrichment based on protein surface availability.

Enrichment values are centered at zero: - > 0: Preferential binding (more contacts than expected by surface area) - = 0: Neutral (contact frequency matches surface availability) - < 0: Avoidance (fewer contacts than expected by surface area)

The expected share is based on protein surface availability:

expected_share = n_exposed_in_group / total_exposed_residues

This normalization answers: “Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally, more than proportionally, or less?”

polymer_type: str
protein_group: str
mean_enrichment: float | None
sem_enrichment: float | None
per_replicate_enrichments: list[float]
mean_contact_fraction: float
sem_contact_fraction: float
mean_contact_share: float
expected_share: float
n_exposed_in_group: int
n_residues_in_group: int
n_replicates: int
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedBindingPreferenceResult(**data)[source]

Bases: BaseModel

Binding preference aggregated across replicates.

Contains mean ± SEM for all metrics across multiple replicates. Enrichment is normalized by protein surface availability.

Enrichment values are centered at zero: - > 0: Preferential binding (more contacts than expected by surface area) - = 0: Neutral (contact frequency matches surface availability) - < 0: Avoidance (fewer contacts than expected by surface area)

entries: list[AggregatedBindingPreferenceEntry]
n_replicates: int
total_exposed_residues: int
surface_exposure_threshold: float | None
protein_groups_used: dict[str, str]
polymer_types_used: dict[str, str]
polymer_composition: PolymerComposition | None
system_coverage: AggregatedSystemCoverageResult | None
binding_preference: AggregatedPolymerBindingPreferenceResult | None
schema_version: int
to_dataframe()[source]

Convert to pandas DataFrame.

enrichment_matrix()[source]

Get mean enrichment as nested dict.

Enrichment values are centered at zero and normalized by protein surface availability: - > 0: Preferential binding (more contacts than expected) - = 0: Neutral (matches surface availability) - < 0: Avoidance (fewer contacts than expected)

Returns:

Nested mapping: {polymer_type: {protein_group: mean_enrichment}}. Missing/invalid values are returned as 0.0.

Return type:

dict[str, dict[str, float]]

get_entry(polymer_type, protein_group)[source]

Get entry for a (polymer_type, protein_group) pair.

polymer_types()[source]

Get list of polymer types.

protein_groups()[source]

Get list of protein groups.

save(path)[source]

Save to JSON file.

classmethod load(path)[source]

Load from JSON file.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.SystemCoverageEntry(*, protein_group, total_contact_frames=0, coverage_share=0.0, expected_share=0.0, coverage_enrichment=None, n_exposed_in_group=0, n_residues_in_group=0, polymer_contributions=<factory>)[source]

Bases: BaseModel

System-level coverage entry for one protein group.

While BindingPreferenceEntry answers “What does SBMA prefer?”, this entry answers “What fraction of ALL polymer contacts in this system go to this protein group?” — collapsing across polymer types for condition-level analysis.

Use Case

Compare copolymer compositions (conditions) with each other. For example: “Does a 70:30 SBMA:EGMA mixture cover aromatic residues differently than a 30:70 mixture?”

Coverage Enrichment Calculation

For each protein group:

coverage_share = Σ(all polymer contacts to group) / Σ(all polymer contacts) expected_share = n_exposed_in_group / total_exposed_residues coverage_enrichment = (coverage_share / expected_share) - 1

Interpretation (centered at zero):

  • coverage_enrichment > 0: Preferential coverage (more than surface predicts)
    • +0.5 means “50% more coverage than expected”

  • coverage_enrichment = 0: Neutral (coverage matches surface availability)

  • coverage_enrichment < 0: Under-coverage (less than surface predicts)
    • -0.3 means “30% less coverage than expected”

  • coverage_enrichment = -1: Complete avoidance (no coverage at all)

protein_group

Protein group label (e.g., “aromatic”, “charged_positive”)

Type:

str

total_contact_frames

Sum of contact frames from ALL polymer types to this group.

Type:

int

coverage_share

Fraction of all polymer contacts that went to this group.

Type:

float

expected_share

Expected coverage based on protein surface availability (n_exposed_in_group / total_exposed_residues)

Type:

float

coverage_enrichment

Zero-centered enrichment: (coverage_share / expected_share) - 1

Type:

float | None

n_exposed_in_group

Surface-exposed residues in this group

Type:

int

n_residues_in_group

Total residues in this group (exposed + buried)

Type:

int

polymer_contributions

Breakdown of coverage by polymer type: {“SBMA”: 0.35, “EGMA”: 0.65} Values sum to 1.0 (fraction of contacts to this group from each polymer)

Type:

dict[str, float]

protein_group: str
total_contact_frames: int
coverage_share: float
expected_share: float
coverage_enrichment: float | None
n_exposed_in_group: int
n_residues_in_group: int
polymer_contributions: dict[str, float]
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.PartitionCoverageEntry(*, partition_element, total_contact_frames=0, coverage_share=0.0, expected_share=0.0, coverage_enrichment=None, n_exposed_in_element=0, n_residues_in_element=0, polymer_contributions=<factory>)[source]

Bases: BaseModel

Coverage metrics for one element in a partition.

A partition element is a mutually exclusive subset of protein residues. Within a partition, all elements together cover the entire protein surface exactly once (no residue is counted in multiple elements).

This ensures that: - coverage_share sums to 1.0 across all elements in the partition - expected_share sums to 1.0 across all elements in the partition - enrichment is mathematically valid (no inflated denominators)

partition_element

Name of this partition element (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)

Type:

str

total_contact_frames

Sum of contact frames from ALL polymer types to residues in this element

Type:

int

coverage_share

Fraction of all polymer contacts that went to this element. Sums to 1.0 across all elements in the partition.

Type:

float

expected_share

Expected coverage based on surface availability (n_exposed / total_exposed). Sums to 1.0 across all elements in the partition.

Type:

float

coverage_enrichment

Zero-centered enrichment: (coverage_share / expected_share) - 1

Type:

float | None

n_exposed_in_element

Number of surface-exposed residues in this element

Type:

int

n_residues_in_element

Total residues in this element (exposed + buried)

Type:

int

polymer_contributions

Breakdown of coverage by polymer type (sums to 1.0 for this element)

Type:

dict[str, float]

partition_element: str
total_contact_frames: int
coverage_share: float
expected_share: float
coverage_enrichment: float | None
n_exposed_in_element: int
n_residues_in_element: int
polymer_contributions: dict[str, float]
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.PartitionCoverageResult(*, partition_name, partition_type, entries=<factory>, total_coverage_share=1.0, total_expected_share=1.0)[source]

Bases: BaseModel

Coverage analysis for a complete partition of the protein surface.

A partition divides the protein surface into mutually exclusive regions that together cover the entire surface. This ensures valid enrichment calculations where both coverage_share and expected_share sum to 1.0.

Partition Types

  • aa_class: 5-way partition by amino acid class (aromatic, polar, nonpolar, charged_positive, charged_negative)

  • binary_custom: 2-way partition for a custom group vs rest_of_protein (e.g., “lid_helix_5” vs “rest_of_protein”)

  • combined_custom: N+1 way partition with all non-overlapping custom groups plus “rest_of_protein”

partition_name

Descriptive name (e.g., “aa_class”, “lid_helix_5_vs_rest”)

Type:

str

partition_type

One of: “aa_class”, “binary_custom”, “combined_custom”

Type:

str

entries

Coverage metrics for each element in the partition

Type:

list[PartitionCoverageEntry]

total_coverage_share

Validation check: should be ~1.0

Type:

float

total_expected_share

Validation check: should be ~1.0

Type:

float

partition_name: str
partition_type: Literal['aa_class', 'binary_custom', 'combined_custom', 'user_defined']
entries: list[PartitionCoverageEntry]
total_coverage_share: float
total_expected_share: float
to_dataframe()[source]

Convert to pandas DataFrame for analysis/plotting.

coverage_enrichment_dict()[source]

Get coverage enrichment as dict: {element: enrichment}.

coverage_share_dict()[source]

Get coverage shares as dict: {element: share}.

expected_share_dict()[source]

Get expected shares as dict: {element: share}.

get_entry(element_name)[source]

Get the entry for a specific partition element.

element_names()[source]

Get list of partition element names.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.SystemCoverageResult(*, aa_class_coverage, custom_group_coverages=<factory>, combined_custom_coverage=None, user_defined_partitions=<factory>, n_frames=0, total_contact_frames=0, total_exposed_residues=0, surface_exposure_threshold=None, custom_group_selections=<factory>, polymer_types_included=<factory>, has_overlapping_custom_groups=False, overlapping_group_pairs=<factory>, schema_version=2)[source]

Bases: BaseModel

System-level coverage analysis with proper partition structure.

This result uses partitions to ensure mathematically valid enrichment calculations. A partition divides the protein surface into mutually exclusive regions, avoiding the overlap bug where custom groups and AA class groups can inflate the expected_share denominator.

Partition Strategy

  1. AA Class Partition (always computed): 5-way partition by amino acid class. Every surface residue belongs to exactly one class.

  2. Binary Custom Partitions (per custom group): Each custom group is compared to “rest_of_protein”. This answers: “Does my lid_helix_5 have enriched polymer contacts vs non-lid regions?”

  3. Combined Custom Partition (optional): If custom groups don’t overlap, all custom groups + rest_of_protein form a single partition. If groups overlap, this is not computed and an error is raised if explicitly requested.

  4. User-Defined Partitions (from protein_partitions config): Custom partitions specified by the user in the YAML config. Each partition references groups from protein_groups and must be mutually exclusive. ‘rest_of_protein’ is auto-added if the groups don’t cover all exposed protein residues. One plot per partition is generated.

aa_class_coverage

5-way partition by amino acid class. Always computed.

Type:

PartitionCoverageResult

custom_group_coverages

Binary partitions for each custom group vs rest_of_protein. Keys are custom group names.

Type:

dict[str, PartitionCoverageResult]

combined_custom_coverage

All custom groups + rest_of_protein as a single partition. Only computed if custom groups don’t overlap.

Type:

PartitionCoverageResult | None

user_defined_partitions

User-defined partitions from protein_partitions config. Keys are partition names, values are the computed coverage partitions. ‘rest_of_protein’ is auto-added if groups don’t fully cover the protein.

Type:

dict[str, PartitionCoverageResult]

n_frames

Total frames analyzed

Type:

int

total_contact_frames

Sum of all polymer contacts across all groups

Type:

int

total_exposed_residues

Number of surface-exposed protein residues

Type:

int

surface_exposure_threshold

SASA threshold used for surface filtering

Type:

float | None

custom_group_selections

Custom group name to MDAnalysis selection (for metadata)

Type:

dict[str, str]

polymer_types_included

Polymer types that contributed to coverage

Type:

list[str]

has_overlapping_custom_groups

True if custom groups share residues (combined partition not computed)

Type:

bool

overlapping_group_pairs

Pairs of custom groups that overlap (for diagnostics)

Type:

list[tuple[str, str]]

schema_version

Schema version (2 = partition-based)

Type:

int

aa_class_coverage: PartitionCoverageResult
custom_group_coverages: dict[str, PartitionCoverageResult]
combined_custom_coverage: PartitionCoverageResult | None
user_defined_partitions: dict[str, PartitionCoverageResult]
n_frames: int
total_contact_frames: int
total_exposed_residues: int
surface_exposure_threshold: float | None
custom_group_selections: dict[str, str]
polymer_types_included: list[str]
has_overlapping_custom_groups: bool
overlapping_group_pairs: list[tuple[str, str]]
schema_version: int
get_aa_class_enrichment(aa_class)[source]

Get coverage enrichment for an AA class.

Parameters:

aa_class (str) – One of: aromatic, polar, nonpolar, charged_positive, charged_negative

Returns:

Coverage enrichment, or None if not found

Return type:

float | None

get_custom_group_enrichment(group_name)[source]

Get coverage enrichment for a custom group (vs rest_of_protein).

Parameters:

group_name (str) – Custom group name (e.g., “lid_helix_5”)

Returns:

Coverage enrichment for the custom group, or None if not found

Return type:

float | None

aa_class_enrichment_dict()[source]

Get AA class enrichments as dict: {aa_class: enrichment}.

custom_group_enrichment_dict()[source]

Get custom group enrichments as dict: {group_name: enrichment}.

Each custom group’s enrichment is relative to rest_of_protein.

aa_class_names()[source]

Get list of AA class names in canonical order.

custom_group_names()[source]

Get list of custom group names.

user_partition_names()[source]

Get list of user-defined partition names.

get_user_partition(partition_name)[source]

Get a user-defined partition by name.

Parameters:

partition_name (str) – Name of the partition (e.g., “lid_helices”)

Returns:

The partition result, or None if not found

Return type:

PartitionCoverageResult | None

save(path)[source]

Save to JSON file.

Parameters:

path (str or Path) – Output path for JSON file

classmethod load(path)[source]

Load from JSON file.

Parameters:

path (str or Path) – Path to JSON file

Returns:

Loaded result

Return type:

SystemCoverageResult

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.PartitionBindingEntry(*, partition_element, polymer_type, total_contact_frames=0, contact_share=0.0, expected_share=0.0, enrichment=None, n_exposed_in_element=0, n_residues_in_element=0, n_residues_contacted=0)[source]

Bases: BaseModel

Binding metrics for one partition element for a specific polymer type.

A partition element is a mutually exclusive subset of protein residues. Within a partition, all elements together cover the entire protein surface exactly once (no residue is counted in multiple elements).

This entry is for a SINGLE polymer type, answering: “What fraction of SBMA’s contacts go to aromatic residues?”

This ensures that: - contact_share sums to 1.0 across all elements in the partition (for this polymer) - expected_share sums to 1.0 across all elements in the partition - enrichment is mathematically valid (no inflated denominators)

partition_element

Name of this partition element (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)

Type:

str

polymer_type

Polymer type this entry is for (e.g., “SBM”, “EGM”)

Type:

str

total_contact_frames

Sum of contact frames from THIS polymer type to residues in this element

Type:

int

contact_share

Fraction of this polymer’s contacts that went to this element. Sums to 1.0 across all elements in the partition (for this polymer).

Type:

float

expected_share

Expected share based on surface availability (n_exposed / total_exposed). Sums to 1.0 across all elements in the partition.

Type:

float

enrichment

Zero-centered enrichment: (contact_share / expected_share) - 1

Type:

float | None

n_exposed_in_element

Number of surface-exposed residues in this element

Type:

int

n_residues_in_element

Total residues in this element (exposed + buried)

Type:

int

n_residues_contacted

Number of exposed residues that had at least one contact from this polymer

Type:

int

partition_element: str
polymer_type: str
total_contact_frames: int
contact_share: float
expected_share: float
enrichment: float | None
n_exposed_in_element: int
n_residues_in_element: int
n_residues_contacted: int
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.PartitionBindingResult(*, partition_name, partition_type, polymer_type, entries=<factory>, total_contact_share=1.0, total_expected_share=1.0, total_contact_frames=0)[source]

Bases: BaseModel

Binding preference for a complete partition for ONE polymer type.

A partition divides the protein surface into mutually exclusive regions that together cover the entire surface. This class stores the binding preference of a single polymer type across all partition elements.

This ensures valid enrichment calculations where both contact_share and expected_share sum to 1.0.

Partition Types

  • aa_class: 5-way partition by amino acid class (aromatic, polar, nonpolar, charged_positive, charged_negative)

  • user_defined: N+1 way partition with user-specified groups plus “rest_of_protein” (auto-added if groups don’t cover all residues)

partition_name

Descriptive name (e.g., “aa_class”, “lid_helices”)

Type:

str

partition_type

One of: “aa_class”, “user_defined”

Type:

str

polymer_type

Polymer type this result is for (e.g., “SBM”, “EGM”)

Type:

str

entries

Binding metrics for each element in the partition

Type:

list[PartitionBindingEntry]

total_contact_share

Validation check: should be ~1.0

Type:

float

total_expected_share

Validation check: should be ~1.0

Type:

float

total_contact_frames

Total contact frames from this polymer type (across all elements)

Type:

int

partition_name: str
partition_type: Literal['aa_class', 'user_defined']
polymer_type: str
entries: list[PartitionBindingEntry]
total_contact_share: float
total_expected_share: float
total_contact_frames: int
to_dataframe()[source]

Convert to pandas DataFrame for analysis/plotting.

enrichment_dict()[source]

Get enrichment as dict: {element: enrichment}.

contact_share_dict()[source]

Get contact shares as dict: {element: share}.

expected_share_dict()[source]

Get expected shares as dict: {element: share}.

get_entry(element_name)[source]

Get the entry for a specific partition element.

element_names()[source]

Get list of partition element names.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.PolymerBindingPreferenceResult(*, aa_class_binding=<factory>, user_defined_partitions=<factory>, n_frames=0, total_exposed_residues=0, surface_exposure_threshold=None, polymer_types=<factory>, polymer_composition=None, protein_groups_used=<factory>, schema_version=5)[source]

Bases: BaseModel

Per-polymer binding preference using proper partition structure.

This result stores binding preference for ALL polymer types, with each polymer having its own partition-based enrichment calculations.

Unlike SystemCoverageResult (which collapses all polymer contacts), this maintains per-polymer data to answer: “Does SBMA prefer aromatic residues more than EGMA does?”

Partition Strategy (per polymer type)

  1. AA Class Partition (always computed): 5-way partition by amino acid class. Every surface residue belongs to exactly one class. Each polymer type gets its own enrichment values.

  2. User-Defined Partitions (from protein_partitions config): Custom partitions specified by the user. Each partition references groups from protein_groups. ‘rest_of_protein’ is auto-added if groups don’t cover all exposed protein residues. Each polymer type gets its own enrichment values per partition.

aa_class_binding

AA class partition binding for each polymer type. Keys are polymer type names (e.g., “SBM”, “EGM”).

Type:

dict[str, PartitionBindingResult]

user_defined_partitions

User-defined partitions for each polymer type. Outer keys are partition names, inner keys are polymer types. Example: {“lid_helices”: {“SBM”: …, “EGM”: …}}

Type:

dict[str, dict[str, PartitionBindingResult]]

n_frames

Total frames analyzed

Type:

int

total_exposed_residues

Number of surface-exposed protein residues

Type:

int

surface_exposure_threshold

SASA threshold used for surface filtering

Type:

float | None

polymer_types

Polymer types included in this result

Type:

list[str]

polymer_composition

Polymer composition metadata

Type:

PolymerComposition | None

schema_version

Schema version (5 = partition-based binding preference)

Type:

int

aa_class_binding: dict[str, PartitionBindingResult]
user_defined_partitions: dict[str, dict[str, PartitionBindingResult]]
n_frames: int
total_exposed_residues: int
surface_exposure_threshold: float | None
polymer_types: list[str]
polymer_composition: PolymerComposition | None
protein_groups_used: dict[str, str]
schema_version: int
get_aa_class_enrichment(polymer_type, aa_class)[source]

Get binding enrichment for an AA class for a specific polymer.

Parameters:
  • polymer_type (str) – Polymer type (e.g., “SBM”)

  • aa_class (str) – One of: aromatic, polar, nonpolar, charged_positive, charged_negative

Returns:

Binding enrichment, or None if not found

Return type:

float | None

get_user_partition_enrichment(partition_name, polymer_type, element_name)[source]

Get binding enrichment for a user partition element for a specific polymer.

Parameters:
  • partition_name (str) – Name of the user-defined partition (e.g., “lid_helices”)

  • polymer_type (str) – Polymer type (e.g., “SBM”)

  • element_name (str) – Element within the partition (e.g., “lid_helix_5”)

Returns:

Binding enrichment, or None if not found

Return type:

float | None

aa_class_enrichment_matrix()[source]

Get AA class enrichments as nested dict: {polymer_type: {aa_class: enrichment}}.

Returns:

Nested mapping of enrichment values.

Return type:

dict[str, dict[str, float]]

user_partition_enrichment_matrix(partition_name)[source]

Get user partition enrichments as nested dict.

Parameters:

partition_name (str) – Name of the user-defined partition

Returns:

{polymer_type: {element_name: enrichment}}

Return type:

dict[str, dict[str, float]]

aa_class_names()[source]

Get list of AA class names in canonical order.

user_partition_names()[source]

Get list of user-defined partition names.

save(path)[source]

Save to JSON file.

classmethod load(path)[source]

Load from JSON file.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedPartitionBindingEntry(*, partition_element, polymer_type, mean_contact_share=0.0, sem_contact_share=0.0, mean_enrichment=None, sem_enrichment=None, per_replicate_enrichments=<factory>, expected_share=0.0, n_exposed_in_element=0, n_residues_in_element=0, n_replicates=0)[source]

Bases: BaseModel

Aggregated binding metrics for one partition element for a specific polymer type.

Contains mean ± SEM across replicates for binding preference, enabling statistical comparison of binding enrichment across conditions.

partition_element

Element name (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)

Type:

str

polymer_type

Polymer type this entry is for (e.g., “SBM”, “EGM”)

Type:

str

mean_contact_share

Mean contact share across replicates

Type:

float

sem_contact_share

Standard error of contact share

Type:

float

mean_enrichment

Mean enrichment across replicates

Type:

float | None

sem_enrichment

Standard error of enrichment

Type:

float | None

per_replicate_enrichments

Enrichment values from each replicate

Type:

list[float]

expected_share

Expected share based on surface availability

Type:

float

n_exposed_in_element

Surface-exposed residues in this element

Type:

int

n_residues_in_element

Total residues in this element

Type:

int

n_replicates

Number of replicates with valid data

Type:

int

partition_element: str
polymer_type: str
mean_contact_share: float
sem_contact_share: float
mean_enrichment: float | None
sem_enrichment: float | None
per_replicate_enrichments: list[float]
expected_share: float
n_exposed_in_element: int
n_residues_in_element: int
n_replicates: int
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedPartitionBindingResult(*, partition_name, partition_type, polymer_type, entries=<factory>, mean_total_contact_share=1.0, n_replicates=0)[source]

Bases: BaseModel

Aggregated binding preference for a partition for ONE polymer type.

Contains aggregated statistics across replicates for all partition elements.

partition_name

Descriptive name (e.g., “aa_class”, “lid_helices”)

Type:

str

partition_type

One of: “aa_class”, “user_defined”

Type:

str

polymer_type

Polymer type this result is for

Type:

str

entries

Aggregated binding metrics for each element

Type:

list[AggregatedPartitionBindingEntry]

mean_total_contact_share

Mean of total_contact_share across replicates (validation: should be ~1.0)

Type:

float

n_replicates

Number of replicates

Type:

int

partition_name: str
partition_type: Literal['aa_class', 'user_defined']
polymer_type: str
entries: list[AggregatedPartitionBindingEntry]
mean_total_contact_share: float
n_replicates: int
enrichment_dict()[source]

Get mean enrichment as dict: {element: mean_enrichment}.

element_names()[source]

Get list of partition element names.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedPolymerBindingPreferenceResult(*, aa_class_binding=<factory>, user_defined_partitions=<factory>, n_replicates=0, total_exposed_residues=0, surface_exposure_threshold=None, polymer_types=<factory>, schema_version=5)[source]

Bases: BaseModel

Aggregated per-polymer binding preference across replicates.

Contains mean ± SEM for all partition-based binding metrics.

aa_class_binding

Aggregated AA class partition binding for each polymer type.

Type:

dict[str, AggregatedPartitionBindingResult]

user_defined_partitions

Aggregated user-defined partitions for each polymer type.

Type:

dict[str, dict[str, AggregatedPartitionBindingResult]]

n_replicates

Number of replicates

Type:

int

total_exposed_residues

Number of surface-exposed protein residues

Type:

int

surface_exposure_threshold

SASA threshold used

Type:

float | None

polymer_types

Polymer types included

Type:

list[str]

schema_version

Schema version

Type:

int

aa_class_binding: dict[str, AggregatedPartitionBindingResult]
user_defined_partitions: dict[str, dict[str, AggregatedPartitionBindingResult]]
n_replicates: int
total_exposed_residues: int
surface_exposure_threshold: float | None
polymer_types: list[str]
schema_version: int
aa_class_enrichment_matrix()[source]

Get AA class enrichments as nested dict: {polymer_type: {aa_class: mean_enrichment}}.

aa_class_names()[source]

Get list of AA class names in canonical order.

user_partition_names()[source]

Get list of user-defined partition names.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.binding_preference.aggregate_polymer_binding_preference(results)[source]

Aggregate per-polymer binding preference across replicates.

Parameters:

results (list[PolymerBindingPreferenceResult]) – Per-polymer binding preference results from multiple replicates

Returns:

Aggregated results with mean and SEM for all partitions

Return type:

AggregatedPolymerBindingPreferenceResult

class polyzymd.analyses.shared.binding_preference.AggregatedPartitionCoverageEntry(*, partition_element, mean_coverage_share=0.0, sem_coverage_share=0.0, mean_coverage_enrichment=None, sem_coverage_enrichment=None, per_replicate_enrichments=<factory>, expected_share=0.0, n_exposed_in_element=0, n_residues_in_element=0, n_replicates=0, mean_polymer_contributions=<factory>)[source]

Bases: BaseModel

Aggregated coverage for one partition element across replicates.

Contains mean ± SEM for coverage metrics, enabling statistical comparison of coverage enrichment across conditions.

partition_element

Element name (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)

Type:

str

mean_coverage_share

Mean coverage share across replicates

Type:

float

sem_coverage_share

Standard error of coverage share

Type:

float

mean_coverage_enrichment

Mean coverage enrichment across replicates

Type:

float | None

sem_coverage_enrichment

Standard error of coverage enrichment

Type:

float | None

per_replicate_enrichments

Coverage enrichment values from each replicate

Type:

list[float]

expected_share

Expected coverage based on surface availability

Type:

float

n_exposed_in_element

Surface-exposed residues in this element

Type:

int

n_residues_in_element

Total residues in this element

Type:

int

n_replicates

Number of replicates with valid data

Type:

int

mean_polymer_contributions

Mean polymer contributions across replicates

Type:

dict[str, float]

partition_element: str
mean_coverage_share: float
sem_coverage_share: float
mean_coverage_enrichment: float | None
sem_coverage_enrichment: float | None
per_replicate_enrichments: list[float]
expected_share: float
n_exposed_in_element: int
n_residues_in_element: int
n_replicates: int
mean_polymer_contributions: dict[str, float]
model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedPartitionCoverageResult(*, partition_name, partition_type, entries=<factory>, n_replicates=0)[source]

Bases: BaseModel

Aggregated coverage for a partition across replicates.

Contains mean ± SEM for all elements in the partition.

partition_name

Name of the partition

Type:

str

partition_type

One of: “aa_class”, “binary_custom”, “combined_custom”

Type:

str

entries

Aggregated coverage for each element

Type:

list[AggregatedPartitionCoverageEntry]

n_replicates

Number of replicates aggregated

Type:

int

partition_name: str
partition_type: Literal['aa_class', 'binary_custom', 'combined_custom', 'user_defined']
entries: list[AggregatedPartitionCoverageEntry]
n_replicates: int
to_dataframe()[source]

Convert to pandas DataFrame.

coverage_enrichment_dict()[source]

Get mean coverage enrichment as dict: {element: enrichment}.

get_entry(element_name)[source]

Get entry for a specific partition element.

element_names()[source]

Get list of partition element names.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class polyzymd.analyses.shared.binding_preference.AggregatedSystemCoverageResult(*, aa_class_coverage, custom_group_coverages=<factory>, combined_custom_coverage=None, user_defined_partitions=<factory>, n_replicates=0, total_exposed_residues=0, surface_exposure_threshold=None, custom_group_selections=<factory>, polymer_types_included=<factory>, has_overlapping_custom_groups=False, schema_version=2)[source]

Bases: BaseModel

System coverage aggregated across replicates (schema v2).

Contains aggregated partition coverages with mean ± SEM statistics for statistical comparison between conditions.

aa_class_coverage

Aggregated 5-way AA class partition

Type:

AggregatedPartitionCoverageResult

custom_group_coverages

Aggregated binary partitions for each custom group

Type:

dict[str, AggregatedPartitionCoverageResult]

combined_custom_coverage

Aggregated combined custom partition (if applicable)

Type:

AggregatedPartitionCoverageResult | None

n_replicates

Number of replicates aggregated

Type:

int

total_exposed_residues

Number of surface-exposed protein residues

Type:

int

surface_exposure_threshold

SASA threshold used for surface filtering

Type:

float | None

custom_group_selections

Custom group name to MDAnalysis selection

Type:

dict[str, str]

polymer_types_included

Polymer types that contributed to coverage

Type:

list[str]

has_overlapping_custom_groups

True if custom groups share residues

Type:

bool

schema_version

Schema version (2 = partition-based)

Type:

int

aa_class_coverage: AggregatedPartitionCoverageResult
custom_group_coverages: dict[str, AggregatedPartitionCoverageResult]
combined_custom_coverage: AggregatedPartitionCoverageResult | None
user_defined_partitions: dict[str, AggregatedPartitionCoverageResult]
n_replicates: int
total_exposed_residues: int
surface_exposure_threshold: float | None
custom_group_selections: dict[str, str]
polymer_types_included: list[str]
has_overlapping_custom_groups: bool
schema_version: int
get_aa_class_enrichment(aa_class)[source]

Get mean coverage enrichment for an AA class.

get_custom_group_enrichment(group_name)[source]

Get mean coverage enrichment for a custom group (vs rest_of_protein).

aa_class_enrichment_dict()[source]

Get AA class mean enrichments as dict: {aa_class: enrichment}.

custom_group_enrichment_dict()[source]

Get custom group mean enrichments as dict: {group_name: enrichment}.

aa_class_names()[source]

Get list of AA class names in canonical order.

custom_group_names()[source]

Get list of custom group names.

save(path)[source]

Save to JSON file.

classmethod load(path)[source]

Load from JSON file.

model_config = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

polyzymd.analyses.shared.binding_preference.aggregate_system_coverage(results)[source]

Aggregate system coverage across replicates.

Computes mean ± SEM for coverage metrics across multiple replicates for all partitions (AA class, custom groups, combined).

Parameters:

results (list[SystemCoverageResult]) – System coverage results from multiple replicates

Returns:

Aggregated results with mean and SEM for all partitions

Return type:

AggregatedSystemCoverageResult

polyzymd.analyses.shared.binding_preference.resolve_protein_group_selections(universe, protein_group_selections=None)[source]

Resolve protein group MDAnalysis selections to residue IDs.

Converts MDAnalysis selection strings into sets of residue IDs at analysis time. This allows the user to define groups with flexible selections while we work with concrete residue IDs internally.

Parameters:
  • universe (Universe) – MDAnalysis Universe with loaded topology

  • protein_group_selections (dict[str, str], optional) – Mapping of group name to MDAnalysis selection string. If None, uses DEFAULT_AA_CLASS_SELECTIONS.

Returns:

Mapping of group name to set of residue IDs

Return type:

dict[str, set[int]]

Examples

>>> selections = {"aromatic": "protein and resname PHE TRP TYR HIS"}
>>> groups = resolve_protein_group_selections(universe, selections)
>>> print(groups["aromatic"])
{12, 45, 67, 89, ...}  # Set of aromatic residue IDs
polyzymd.analyses.shared.binding_preference.resolve_protein_groups_from_surface_exposure(surface_exposure, include_default_aa_groups=True, custom_protein_groups=None)[source]

Resolve protein groups from surface exposure data without Universe.

This function derives protein group → residue ID mappings using only the surface exposure result (which already contains resid, resname, and aa_class for each residue). This allows binding preference computation without requiring an MDAnalysis Universe at comparison time.

The function supports:

  • Default AA class groups (aromatic, polar, nonpolar, charged_positive, charged_negative) derived from the aa_class field in surface exposure data

  • Custom user-defined groups specified as resid lists

  • Override behavior: if a custom group has the same name as a default, the custom definition takes precedence

Parameters:
  • surface_exposure (SurfaceExposureResult) – Surface exposure analysis result containing residue data

  • include_default_aa_groups (bool, default True) – If True, include default AA class groupings (aromatic, polar, etc.) derived from surface exposure data

  • custom_protein_groups (dict[str, list[int]], optional) – User-defined protein groups as {group_name: [resid1, resid2, …]}. If a group name matches a default AA class, it overrides that default.

Returns:

Mapping of group name to set of residue IDs

Return type:

dict[str, set[int]]

Examples

>>> from polyzymd.analyses.shared.surface_exposure import SurfaceExposureFilter
>>> filter = SurfaceExposureFilter(threshold=0.2)
>>> surface_result = filter.calculate("enzyme.pdb")
>>> # Get default AA groups + custom active_site group
>>> groups = resolve_protein_groups_from_surface_exposure(
...     surface_result,
...     include_default_aa_groups=True,
...     custom_protein_groups={"active_site": [77, 133, 156]}
... )
>>> print(groups.keys())
dict_keys(['aromatic', 'polar', 'nonpolar', 'charged_positive', 'charged_negative', 'active_site'])
polyzymd.analyses.shared.binding_preference.resolve_polymer_type_selections(universe, polymer_type_selections=None, polymer_chain='C')[source]

Resolve polymer type selections and return list of polymer types.

If explicit selections are provided, validates them and returns the keys. If None, auto-detects polymer types from the standard polymer chain.

Parameters:
  • universe (Universe) – MDAnalysis Universe with loaded topology

  • polymer_type_selections (dict[str, str], optional) – Mapping of type name to MDAnalysis selection string. If None, auto-detects from chainID <polymer_chain>.

  • polymer_chain (str) – Chain ID for auto-detection when polymer_type_selections is None. Defaults to "C" (PolyzyMD chain convention).

Returns:

List of polymer type names (resnames or selection keys)

Return type:

list[str]

Examples

>>> # Auto-detect from chain C
>>> polymer_types = resolve_polymer_type_selections(universe, None)
>>> print(polymer_types)
['SBM', 'EGM']
>>> # Explicit selections
>>> selections = {"SBMA": "chainID C and resname SBM"}
>>> polymer_types = resolve_polymer_type_selections(universe, selections)
>>> print(polymer_types)
['SBMA']
polyzymd.analyses.shared.binding_preference.extract_polymer_composition(universe, polymer_type_selections=None, polymer_chain='C')[source]

Extract polymer composition (residue and heavy atom counts) from trajectory.

This function counts the number of residues and heavy atoms for each polymer type in the system. The composition data is used for dual normalization of binding preference enrichment ratios.

Parameters:
  • universe (Universe) – MDAnalysis Universe with loaded topology

  • polymer_type_selections (dict[str, str], optional) – Mapping of type name to MDAnalysis selection string. If None, auto-detects from chainID <polymer_chain> using unique resnames.

  • polymer_chain (str) – Chain ID for auto-detection when polymer_type_selections is None. Defaults to "C" (PolyzyMD chain convention).

Returns:

Composition data with residue_counts and heavy_atom_counts per type. Heavy atoms are defined as all atoms with element != ‘H’.

Return type:

PolymerComposition

Notes

For auto-detection (polymer_type_selections=None), the function: 1. Selects all atoms in the specified polymer chain 2. Groups residues by resname 3. Counts residues and heavy atoms per resname

For explicit selections, each selection string defines which atoms belong to that polymer type. The function counts residues (unique resids) and heavy atoms within each selection.

Examples

>>> # Auto-detect from chain C
>>> composition = extract_polymer_composition(universe)
>>> print(composition.residue_counts)
{'SBM': 50, 'EGM': 50}
>>> print(composition.heavy_atom_counts)
{'SBM': 750, 'EGM': 400}
>>> # Explicit selections
>>> selections = {"SBMA": "chainID C and resname SBM"}
>>> composition = extract_polymer_composition(universe, selections)
polyzymd.analyses.shared.binding_preference.compute_binding_preference_from_config(contact_result, universe, enzyme_pdb_path, config)[source]

Compute binding preference using contacts plugin settings.

This is a convenience function that orchestrates the full binding preference computation using configuration from analysis.yaml. It: 1. Calculates surface exposure from enzyme PDB 2. Resolves protein group selections to residue IDs 3. Computes binding preference with enrichment ratios

Parameters:
  • contact_result (Any) – Contact analysis results from trajectory

  • universe (Universe) – MDAnalysis Universe (used for resolving selections)

  • enzyme_pdb_path (Path or str) – Path to enzyme PDB file for SASA calculation

  • config (Any) – Settings object with binding preference fields

Returns:

Binding preference with enrichment ratios

Return type:

BindingPreferenceResult

Notes

This function requires rust_sasa_python to be installed for surface exposure calculation. Install with:

pip install rust-sasa-python

Examples

>>> from polyzymd.analyses.contacts import ContactsSettings
>>> config = ContactsSettings(
...     compute_binding_preference=True,
...     surface_exposure_threshold=0.2,
... )
>>> result = compute_binding_preference_from_config(
...     contact_result, universe, "enzyme.pdb", config
... )
>>> print(result.enrichment_matrix())
{'SBM': {'aromatic': 1.45, 'polar': 0.82, ...}, ...}
polyzymd.analyses.shared.binding_preference.compute_condition_binding_preference(cond, sim_config, analysis_dir, *, enzyme_pdb, contact_results_by_replicate, load_contact_result, threshold, include_default_aa_groups, custom_protein_groups, protein_partitions, polymer_type_selections, polymer_chain, settings_fp)[source]

Compute condition-level binding preference from contacts data.

Parameters:
  • cond (ConditionLike) – Condition to compute.

  • sim_config (Any) – Simulation config used for topology lookup.

  • analysis_dir (Path) – Contacts analysis directory where outputs are saved.

  • enzyme_pdb (Path) – Enzyme PDB path used for SASA filtering.

  • contact_results_by_replicate (Mapping[int, Path]) – Mapping from replicate index to contacts JSON path.

  • load_contact_result (callable) – Callable used to deserialize contacts results.

  • threshold (float) – Relative SASA threshold for exposed residues.

  • include_default_aa_groups (bool) – Whether default AA class groups are included.

  • custom_protein_groups (dict[str, list[int]] | None) – Optional custom protein groups.

  • protein_partitions (dict[str, list[str]] | None) – Optional user-defined protein partitions.

  • polymer_type_selections (dict[str, str] | None) – Optional explicit polymer selection map.

  • polymer_chain (str) – Polymer chain ID for auto-detection.

  • settings_fp (str | None) – Optional settings fingerprint used for cache filenames.

Returns:

Aggregated result, or None if computation failed.

Return type:

AggregatedBindingPreferenceResult | None

polyzymd.analyses.shared.binding_preference_helpers

Shared helpers for computing binding preference from contacts data.

Used by the contacts, binding_free_energy, and polymer_affinity analysis plugins to compute SASA-based binding preference enrichment.

Public functions

  • find_enzyme_pdb()

  • resolve_enzyme_pdb()

  • try_load_cached_binding_preference()

class polyzymd.analyses.shared.binding_preference_helpers.ConditionLike(*args, **kwargs)[source]

Bases: Protocol

Minimal condition protocol required by BP helper functions.

label

Condition label

Type:

str

replicates

Replicate IDs associated with this condition

Type:

Sequence[int]

label: str
replicates: Sequence[int]
__init__(*args, **kwargs)
polyzymd.analyses.shared.binding_preference_helpers.find_enzyme_pdb(sim_config)[source]

Find enzyme PDB file from simulation config.

Searches common locations relative to the project directory.

Parameters:

sim_config (SimulationConfig) – Simulation configuration (must have output.projects_directory).

Returns:

Path to enzyme PDB, or None if not found.

Return type:

Path or None

Raises:

ValueError – If a glob pattern matches multiple candidate enzyme PDB files.

polyzymd.analyses.shared.binding_preference_helpers.resolve_enzyme_pdb(enzyme_pdb_setting, source_path, sim_config)[source]

Resolve the enzyme PDB path from settings or auto-discovery.

Parameters:
  • enzyme_pdb_setting (str or None) – Explicit enzyme PDB path from analysis settings (e.g., enzyme_pdb_for_sasa). If relative, resolved against source_path’s parent directory.

  • source_path (Path or None) – Path to the comparison.yaml file (used to resolve relative paths).

  • sim_config (Any) – Simulation configuration for auto-discovery fallback.

Returns:

Resolved enzyme PDB path, or None if not found.

Return type:

Path or None

polyzymd.analyses.shared.binding_preference_helpers.try_load_cached_binding_preference(cond, analysis_dir, *, settings_fp=None)[source]

Try to load cached binding preference results for a condition.

Searches for binding preference files in order of preference: 1. binding_preference_aggregated.json 2. binding_preference_aggregated_reps*.json (glob pattern) 3. binding_preference.json (single replicate) 4. Per-replicate files (binding_preference_rep{N}.json)

Parameters:
  • cond (ConditionLike) – Condition to load.

  • analysis_dir (Path) – Analysis directory for this condition.

  • settings_fp (str or None, optional) – Settings fingerprint for cache lookup. When provided, fingerprinted cache files are searched first, then legacy filenames.

Returns:

Loaded result, or None if not found.

Return type:

AggregatedBindingPreferenceResult | BindingPreferenceResult | None

polyzymd.analyses.shared.groupings

Residue grouping abstractions for analysis plugins.

This module provides classification systems for residues:

  • ResidueGrouping: Abstract base class for residue classification

  • ProteinAAClassification: Standard amino acid classification

  • CustomGrouping: User-defined classification scheme

Examples

>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Classify amino acids
>>> grouping = ProteinAAClassification()
>>> print(grouping.classify("PHE"))  # "aromatic"
>>> print(grouping.classify("LYS"))  # "charged_positive"
>>>
>>> # Get all residues in a group
>>> aromatics = grouping.get_residues_in_group("aromatic")
>>> # Returns: ["PHE", "TRP", "TYR", "HIS"]
class polyzymd.analyses.shared.groupings.ResidueGrouping[source]

Bases: ABC

Abstract base class for residue classification schemes.

Subclasses must implement classify() to map residue names to group labels.

Examples

>>> class MyPolymerGrouping(ResidueGrouping):
...     def classify(self, resname: str) -> str:
...         if resname in ["SBM", "SBMA"]:
...             return "zwitterionic"
...         elif resname in ["EGP", "EGMA"]:
...             return "hydrophilic"
...         return "unknown"
...
...     @property
...     def available_groups(self) -> list[str]:
...         return ["zwitterionic", "hydrophilic", "unknown"]
abstractmethod classify(resname)[source]

Classify a residue name into a group.

Parameters:

resname (str) – Residue name (3-letter code for amino acids)

Returns:

Group label for this residue type

Return type:

str

abstract property available_groups: list[str]

List of all group labels in this classification scheme.

get_residues_in_group(group)[source]

Get all residue names that belong to a group.

Parameters:

group (str) – Group label

Returns:

Residue names in this group

Return type:

list[str]

Raises:

ValueError – If group is not in available_groups

to_dict()[source]

Serialize grouping scheme to dictionary.

class polyzymd.analyses.shared.groupings.ProteinAAClassification(include_his_aromatic=True)[source]

Bases: ResidueGrouping

Standard amino acid classification.

Groups amino acids into: - aromatic: PHE, TRP, TYR, HIS - charged_positive: ARG, LYS - charged_negative: ASP, GLU - polar: ASN, CYS, GLN, SER, THR - nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL

This classification matches the scaffold notebooks and common biochemistry conventions.

Parameters:

include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.

Examples

>>> grouping = ProteinAAClassification()
>>> grouping.classify("PHE")
'aromatic'
>>> grouping.classify("LYS")
'charged_positive'
>>> grouping.get_residues_in_group("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']
__init__(include_his_aromatic=True)[source]
classify(resname)[source]

Classify amino acid by residue name.

property available_groups: list[str]

List of all group labels in this classification scheme.

get_charged_groups()[source]

Convenience: get both charged group names.

get_hydrophobic_groups()[source]

Convenience: groups typically considered hydrophobic.

get_hydrophilic_groups()[source]

Convenience: groups typically considered hydrophilic.

class polyzymd.analyses.shared.groupings.CustomGrouping(classification, default_group='other')[source]

Bases: ResidueGrouping

User-defined residue classification.

Allows arbitrary mapping from residue names to group labels.

Parameters:
  • classification (dict[str, str]) – Mapping from residue name to group label.

  • default_group (str, optional) – Group label for unclassified residues. Default “other”.

Examples

>>> # Custom polymer classification
>>> grouping = CustomGrouping({
...     "SBM": "zwitterionic",
...     "SBMA": "zwitterionic",
...     "EGP": "peg_like",
...     "EGMA": "peg_like",
... }, default_group="unknown")
>>> grouping.classify("SBM")
'zwitterionic'
__init__(classification, default_group='other')[source]
classify(resname)[source]

Classify residue by name using custom mapping.

property available_groups: list[str]

List of all group labels in this classification scheme.

classmethod from_groups(groups, default_group='other')[source]

Create grouping from group -> residue list mapping.

Parameters:
  • groups (dict[str, list[str]]) – Mapping from group name to list of residue names

  • default_group (str) – Group for unlisted residues

Return type:

CustomGrouping

Examples

>>> grouping = CustomGrouping.from_groups({
...     "zwitterionic": ["SBM", "SBMA"],
...     "peg_like": ["EGP", "EGMA", "OEGMA"],
... })

polyzymd.analyses.shared.selectors

Molecular selector abstractions for analysis plugins.

This module provides a unified interface for selecting atoms, residues, or molecular groups from an MDAnalysis Universe. Selectors enable:

  1. Consistent selection logic across different analysis types

  2. Configurable protein/polymer/solvent selection

  3. Support for arbitrary user-defined selections

  4. Proximity-based selections (e.g., “residues near active site”)

The Strategy pattern is used so users can define custom selectors by subclassing MolecularSelector.

Examples

>>> from polyzymd.analyses.shared.selectors import ProteinResidues, PolymerChains
>>>
>>> # Select all protein residues
>>> protein_selector = ProteinResidues()
>>> protein_residues = protein_selector.select(universe)
>>>
>>> # Select polymer chains by type
>>> polymer_selector = PolymerResiduesByType(residue_names=["SBM", "EGP"])
>>> polymer_residues = polymer_selector.select(universe)
>>>
>>> # Select protein residues near catalytic triad
>>> triad_selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0
... )
>>> nearby_residues = triad_selector.select(universe)
class polyzymd.analyses.shared.selectors.MolecularSelector[source]

Bases: ABC

Abstract base class for molecular selections.

Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.

This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.

Examples

>>> class ActiveSiteSelector(MolecularSelector):
...     def __init__(self, active_site_resids: list[int]):
...         self.resids = active_site_resids
...
...     def select(self, universe: Universe) -> SelectionResult:
...         resid_str = " ".join(str(r) for r in self.resids)
...         atoms = universe.select_atoms(f"resid {resid_str}")
...         return SelectionResult(
...             atoms=atoms,
...             residues=atoms.residues,
...             label="active_site",
...             metadata={"resids": self.resids}
...         )
...
...     @property
...     def label(self) -> str:
...         return "active_site"
abstractmethod select(universe)[source]

Select atoms/residues from a Universe.

Parameters:

universe (Universe) – MDAnalysis Universe to select from

Returns:

Container with selected atoms, residues, and metadata

Return type:

SelectionResult

abstract property label: str

Short label identifying this selector (for filenames/logging).

validate(universe)[source]

Validate the selector against a Universe.

Returns diagnostic information about whether the selection would succeed and what it would select.

Parameters:

universe (Universe) – MDAnalysis Universe to validate against

Returns:

Validation results with keys: - valid: bool - n_atoms: int - n_residues: int - error: str (if invalid) - warnings: list[str]

Return type:

dict

class polyzymd.analyses.shared.selectors.MDAnalysisSelector(selection, label=None)[source]

Bases: MolecularSelector

Simple selector using an MDAnalysis selection string.

This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.

Parameters:
  • selection (str) – MDAnalysis selection string (e.g., “protein”, “resname SBM EGM”, “resid 1-50 and name CA”)

  • label (str, optional) – Human-readable label. If not provided, uses a sanitized version of the selection string.

Examples

>>> # Select polymer residues by name
>>> selector = MDAnalysisSelector("resname SBM EGM")
>>> result = selector.select(universe)
>>>
>>> # Select protein backbone near ligand
>>> selector = MDAnalysisSelector(
...     "protein and backbone and around 5.0 resname LIG",
...     label="protein_near_ligand"
... )
__init__(selection, label=None)[source]
select(universe)[source]

Select atoms using the MDAnalysis selection string.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SelectionResult(atoms, residues, label, metadata=<factory>)[source]

Bases: object

Container for selection results with metadata.

atoms

The selected atoms

Type:

AtomGroup

residues

The residues containing the selected atoms

Type:

ResidueGroup

label

Human-readable label for this selection

Type:

str

metadata

Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)

Type:

dict

atoms: AtomGroup
residues: ResidueGroup
label: str
metadata: dict
property n_atoms: int

Number of selected atoms.

property n_residues: int

Number of selected residues.

property residue_ids: numpy.typing.NDArray.numpy.int64

1-indexed residue IDs (PyMOL convention).

property residue_names: list[str]

Residue names for each residue.

__init__(atoms, residues, label, metadata=<factory>)
class polyzymd.analyses.shared.selectors.CompositeSelector(selectors, mode='union', label=None)[source]

Bases: MolecularSelector

Combines multiple selectors with AND/OR logic.

Useful for complex selections like “protein residues that are both aromatic AND within 5A of the active site”.

Parameters:
  • selectors (list[MolecularSelector]) – List of selectors to combine

  • mode ({"union", "intersection"}) – How to combine selections: - “union”: Include atoms selected by ANY selector (OR) - “intersection”: Include only atoms selected by ALL selectors (AND)

  • label (str, optional) – Custom label. If not provided, generates from component labels.

__init__(selectors, mode='union', label=None)[source]
select(universe)[source]

Select atoms using combined selectors.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResidues(selection_modifier=None)[source]

Bases: MolecularSelector

Select all protein residues.

Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.

Parameters:

selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.

__init__(selection_modifier=None)[source]
select(universe)[source]

Select all protein atoms/residues.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]

Bases: MolecularSelector

Select protein residues by amino acid group classification.

Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.

Parameters:
  • grouping (ResidueGrouping) – Classification scheme for amino acids

  • groups (list[str]) – Names of groups to include (e.g., [“aromatic”, “charged_positive”])

  • exclude (bool, optional) – If True, select residues NOT in the specified groups. Default False.

Examples

>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Select aromatic residues
>>> grouping = ProteinAAClassification()
>>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"])
>>>
>>> # Select all charged residues
>>> selector = ProteinResiduesByGroup(
...     grouping,
...     groups=["charged_positive", "charged_negative"]
... )
__init__(grouping, groups, exclude=False)[source]
select(universe)[source]

Select protein residues matching the specified groups.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]

Bases: MolecularSelector

Select protein residues within a cutoff distance of reference atoms.

Useful for selecting residues near active sites, binding pockets, or other regions of interest.

Parameters:
  • reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)

  • cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.

  • include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.

  • frame (int, optional) – Frame to use for distance calculation. Default is current frame (0).

Examples

>>> # Select residues within 5A of catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0,
... )
>>>
>>> # Select residues near substrate binding site (not including the site itself)
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resname LIG",
...     cutoff=4.0,
...     include_reference=False,
... )
__init__(reference_selection, cutoff, include_reference=True, frame=0)[source]
select(universe)[source]

Select protein residues near the reference atoms.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]

Bases: MolecularSelector

Select polymer chains from the system.

For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.

Parameters:
  • chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.

  • residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.

  • chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.

  • segids (list[str], optional) – If provided, select only polymers with these segment IDs.

Notes

The PolyzyMD chain convention is: - Chain A: Protein/Enzyme - Chain B: Substrate/Ligand - Chain C: Polymers - Chain D+: Solvent (water, ions, co-solvents)

For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.

Examples

>>> # PolyzyMD system (recommended)
>>> selector = PolymerChains()  # Uses chain C
>>>
>>> # Non-PolyzyMD system
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"])
>>>
>>> # PolyzyMD system with specific polymer types
>>> selector = PolymerChains(residue_names=["SBM"])  # SBM in chain C only
__init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
select(universe)[source]

Select polymer atoms/residues.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerResiduesByType(residue_names, exclude=False)[source]

Bases: MolecularSelector

Select polymer residues by monomer type (residue name).

This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.

Parameters:
  • residue_names (list[str]) – Residue names to select (e.g., [“SBM”, “EGP”] for SBMA-EGMA copolymer)

  • exclude (bool, optional) – If True, select polymer residues NOT matching these names. Default False.

Examples

>>> # Select SBMA monomers only
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"])
>>>
>>> # Select non-SBMA monomers
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)
__init__(residue_names, exclude=False)[source]
select(universe)[source]

Select polymer residues by type.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]

Bases: MolecularSelector

Select individual segments (residues) within polymer chains.

This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.

Parameters:
  • residue_names (list[str], optional) – Residue names that identify polymer residues.

  • chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.

  • segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.

Notes

A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.

__init__(residue_names=None, chain_index=None, segment_indices=None)[source]
select(universe)[source]

Select polymer segments.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]

Bases: MolecularSelector

Select solvent (water) molecules.

Parameters:
  • residue_names (list[str], optional) – Residue names for water. Default uses common water names.

  • exclude_near (str, optional) – Exclude waters within a cutoff of this selection. E.g., “protein” to exclude waters in first hydration shell.

  • exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.

Examples

>>> # Select all water
>>> selector = SolventMolecules()
>>>
>>> # Select bulk water (exclude first shell around protein)
>>> selector = SolventMolecules(
...     exclude_near="protein",
...     exclude_cutoff=5.0
... )
__init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
select(universe)[source]

Select water molecules.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.CosolventMolecules(residue_names=None)[source]

Bases: MolecularSelector

Select cosolvent molecules (e.g., DMSO, acetonitrile).

Parameters:

residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.

Examples

>>> # Select DMSO molecules
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
__init__(residue_names=None)[source]
select(universe)[source]

Select cosolvent molecules.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.SubstrateMolecule(residue_name, n_molecules=None)[source]

Bases: MolecularSelector

Select substrate or ligand molecules.

Parameters:
  • residue_name (str) – Residue name of the substrate.

  • n_molecules (int, optional) – Expected number of substrate molecules. If provided, validates that exactly this many are found. Default None (no validation).

Examples

>>> # Select resorufin butyrate substrate
>>> selector = SubstrateMolecule(residue_name="RBU")
>>>
>>> # Select single substrate, validate count
>>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)
__init__(residue_name, n_molecules=None)[source]
select(universe)[source]

Select substrate molecules.

property label: str

Short label identifying this selector (for filenames/logging).

class polyzymd.analyses.shared.selectors.IonSelector(residue_names=None, ion_type='all')[source]

Bases: MolecularSelector

Select ion molecules (Na+, Cl-, etc.).

Parameters:
  • residue_names (list[str], optional) – Residue names for ions. Default includes common ions.

  • ion_type ({"all", "cations", "anions"}, optional) – Filter to specific ion types. Default “all”.

Examples

>>> # Select all ions
>>> selector = IonSelector()
>>>
>>> # Select only sodium ions
>>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])
DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']
DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']
__init__(residue_names=None, ion_type='all')[source]
select(universe)[source]

Select ion molecules.

property label: str

Short label identifying this selector (for filenames/logging).