Analysis Shared Utilities
This reference page documents contributor-facing utilities in
polyzymd.analyses.shared. These modules provide reusable building blocks for
analysis plugins; framework internals and plugin-private helpers are documented
with their owning packages.
The package root re-exports common helpers for convenience. Import specialized selectors, grouping classes, and module-specific helpers from their submodules.
Trajectory loading and windows
Use these modules to locate trajectories, parse time values, and resolve the trajectory window passed into MDAnalysis job lifecycles.
Trajectory loading utilities for PolyzyMD analysis.
This module provides config-aware trajectory loading that understands PolyzyMD’s directory structure and daisy-chain continuation patterns. File discovery is delegated to the active simulation engine so that both OpenMM and GROMACS directory layouts are handled transparently.
Key Features
Config-based path resolution (config.yaml is single source of truth)
Engine-aware file discovery (OpenMM daisy-chain, GROMACS flat layout)
Automatic detection of daisy-chain trajectory segments
Support for both scratch and projects directories
Lazy loading and memory-efficient iteration
- class polyzymd.analyses.shared.loader.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1, topology_format=None, trajectory_format=None, warnings=<factory>)[source]
Bases: object
Information about discovered trajectory files.
- topology_file
Path to topology file (PDB)
- Type:
Path
- trajectory_files
List of trajectory files (DCD) in order
- Type:
list[Path]
- n_segments
Number of daisy-chain segments
- Type:
int
- working_directory
Base working directory for this replicate
- Type:
Path
- replicate
Replicate number
- Type:
int
- topology_format
Engine-reported topology format, when available.
- Type:
str or None, optional
- trajectory_format
Engine-reported trajectory format, when available.
- Type:
str or None, optional
- topology_file: Path
- n_segments: int = 0
- working_directory: Path
- replicate: int = 1
- property n_trajectory_files: int
Number of trajectory files found.
- validate()[source]
Validate that all files exist.
- __init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1, topology_format=None, trajectory_format=None, warnings=<factory>)
- class polyzymd.analyses.shared.loader.TrajectoryLoader(config, engine_override=None)[source]
Bases: object
Config-aware trajectory loader for PolyzyMD simulations.
This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:
Daisy-chain continuation segments (OpenMM)
Flat production directories (GROMACS)
Scratch vs projects directory resolution
Multiple replicates
File discovery is delegated to the simulation engine resolved from the config’s engine field. The engine is created lazily on the first call that needs it, so construction remains cheap. Engine resolution errors propagate unless an explicit engine_override is supplied.
- Parameters:
config (SimulationConfig) – PolyzyMD simulation configuration.
engine_override (str or None, optional) – Force a specific engine name ("openmm" or "gromacs") instead of reading config.engine.
Examples
>>> from polyzymd.config import load_config
>>> from polyzymd.analyses.shared.loader import TrajectoryLoader
>>> config = load_config("config.yaml")
>>> loader = TrajectoryLoader(config)
>>>
>>> # Load single replicate
>>> u = loader.load_universe(replicate=1)
>>> print(f"Loaded {len(u.trajectory)} frames")
>>>
>>> # Get trajectory info without loading
>>> info = loader.get_trajectory_info(replicate=1)
>>> print(f"Found {info.n_segments} segments")
>>>
>>> # Load multiple replicates
>>> for rep in range(1, 6):
...     u = loader.load_universe(replicate=rep)
...     # ... analyze
>>>
>>> # Explicit engine override for GROMACS directories
>>> loader = TrajectoryLoader(config, engine_override="gromacs")
Notes
Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).
- __init__(config, engine_override=None)[source]
- get_trajectory_info(replicate)[source]
Get trajectory file information for a replicate.
- Parameters:
replicate (int) – Replicate number (1-indexed)
- Returns:
Information about discovered trajectory files
- Return type:
TrajectoryInfo
- Raises:
FileNotFoundError – If working directory or required files don’t exist
- load_universe(replicate, cache=True)[source]
Load MDAnalysis Universe for a replicate.
- Parameters:
- Returns:
MDAnalysis Universe with trajectory loaded
- Return type:
Universe
Notes
For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.
- iter_replicates(replicates)[source]
Iterate over multiple replicates.
- Parameters:
replicates (sequence of int) – Replicate numbers to load
- Yields:
tuple of (int, Universe) – Replicate number and loaded Universe
Examples
>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]):
...     rmsf = compute_rmsf(u)
...     results[rep] = rmsf
- get_frame_times(replicate, unit='ns')[source]
Get time values for each frame.
- get_timestep(replicate, unit='ps')[source]
Get the trajectory timestep (time between frames).
- get_first_frame_time(replicate, unit='ps')[source]
Return the first loaded frame timestamp when available.
MDAnalysis reports trajectory times in picoseconds. This method probes cached Universe metadata without changing the caller-visible current frame when the reader exposes a restorable frame index.
- Parameters:
- Returns:
Finite first-frame timestamp in the requested unit, or None when the trajectory does not expose a usable timestamp.
- Return type:
float | None
- Raises:
ValueError – Raised when unit is not "ps" or "ns".
- clear_cache()[source]
Clear the Universe cache to free memory.
- find_topology(working_dir)[source]
Find topology file in working directory.
Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS, solvated_system.pdb preference for OpenMM).
This method is used by several plugins that pass an explicit working_dir unrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to 1.
- Parameters:
working_dir (Path) – Directory to search for topology files.
- Returns:
Path to the topology file.
- Return type:
Path
- Raises:
FileNotFoundError – If no topology file is found.
- polyzymd.analyses.shared.loader.parse_time_string(time_str)[source]
Parse a time string with units into value and unit.
- Parameters:
time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.
- Returns:
Numeric value and unit string
- Return type:
Examples
>>> parse_time_string("100ns")
(100.0, "ns")
>>> parse_time_string("5000 ps")
(5000.0, "ps")
>>> parse_time_string("100")  # Default to ns
(100.0, "ns")
- polyzymd.analyses.shared.loader.convert_time(value, from_unit, to_unit)[source]
Convert time between units.
- polyzymd.analyses.shared.loader.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]
Convert time to frame index.
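The three time helpers above can be approximated with a short standalone sketch. It is not the PolyzyMD implementation: the names sketch_parse_time, sketch_convert_time, and sketch_time_to_frame are hypothetical, the accepted unit set is an assumption, and the frame conversion assumes that frame 0 sits at time zero and that times round down to the preceding frame.

```python
import re

# Picoseconds per unit. The set of accepted units is an assumption of this sketch.
_PS_PER_UNIT = {"fs": 1e-3, "ps": 1.0, "ns": 1e3, "us": 1e6}


def sketch_parse_time(text: str, default_unit: str = "ns") -> tuple[float, str]:
    """Split '100ns' or '5000 ps' into (value, unit); bare numbers get default_unit."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"Cannot parse time string: {text!r}")
    unit = match.group(2).lower() or default_unit
    if unit not in _PS_PER_UNIT:
        raise ValueError(f"Unknown time unit: {unit!r}")
    return float(match.group(1)), unit


def sketch_convert_time(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a time value between units by going through picoseconds."""
    return value * _PS_PER_UNIT[from_unit] / _PS_PER_UNIT[to_unit]


def sketch_time_to_frame(time: float, time_unit: str,
                         timestep: float, timestep_unit: str = "ps") -> int:
    """Frame index at or before `time`, assuming frame 0 is at t = 0 (floor rounding)."""
    time_ps = sketch_convert_time(time, time_unit, "ps")
    dt_ps = sketch_convert_time(timestep, timestep_unit, "ps")
    if dt_ps <= 0:
        raise ValueError("timestep must be positive")
    return int(time_ps // dt_ps)
```

With a 10 ps timestep, sketch_time_to_frame(10, "ns", 10.0) resolves to frame 1000, which is the kind of value the window helpers below consume as an equilibration start.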
Trajectory window helpers for trajectory-backed analyses.
This module centralizes the frame-window logic shared by analysis plugins that
need to combine equilibration skipping with MDAnalysis run() slice
arguments. The helpers return a validated window that can be passed directly to
trajectory-native runners without PolyzyMD re-owning the frame loop.
- class polyzymd.analyses.shared.window.TrajectoryWindow(start, stop, step, equilibration_start, n_frames_total, n_frames_selected, timestep_ps, equilibration_ps, equilibration=None, first_frame_time_ps=None, selected_start_time_ps=None, equilibration_time_reference='loaded_frame_zero', warning_message=None)[source]
Bases: object
Validated frame window for a trajectory-backed analysis.
- Parameters:
start (int) – Inclusive start frame for the analysis run.
stop (int) – Exclusive stop frame for the analysis run.
step (int) – Frame stride passed to runner.run(step=...).
equilibration_start (int) – Start frame implied by the equilibration time alone.
n_frames_total (int) – Total number of frames in the trajectory.
n_frames_selected (int) – Number of frames selected by start, stop, and step.
timestep_ps (float) – Trajectory timestep in picoseconds.
equilibration_ps (float) – Equilibration time converted to picoseconds.
equilibration (str | None, optional) – Original equilibration time string used to resolve the window.
first_frame_time_ps (float | None, optional) – Absolute MDAnalysis timestamp of the first loaded frame in picoseconds, when available.
selected_start_time_ps (float | None, optional) – Timestamp of the selected start frame in the active time reference.
equilibration_time_reference (str, optional) – Time reference used to interpret equilibration. "trajectory_timestamp" means absolute MDAnalysis timestamps were available; "loaded_frame_zero" means the fallback loaded-frame-relative origin was used.
warning_message (str | None) – Non-fatal equilibration warning generated during validation.
- start: int
- stop: int
- step: int
- equilibration_start: int
- n_frames_total: int
- n_frames_selected: int
- timestep_ps: float
- equilibration_ps: float
- equilibration_time_reference: str = 'loaded_frame_zero'
- run_kwargs()[source]
Return keyword arguments for MDAnalysis runner run().
- __init__(start, stop, step, equilibration_start, n_frames_total, n_frames_selected, timestep_ps, equilibration_ps, equilibration=None, first_frame_time_ps=None, selected_start_time_ps=None, equilibration_time_reference='loaded_frame_zero', warning_message=None)
- polyzymd.analyses.shared.window.resolve_replicate_trajectory_window(*, loader, replicate, equilibration, n_frames_total, start=None, stop=None, step=1, min_frames=1, timestep_ps=None)[source]
Resolve a validated window using loader trajectory timing metadata.
- Parameters:
loader (TrajectoryLoader) – Loader for the replicate being analyzed.
replicate (int) – Replicate number.
equilibration (str) – Equilibration time string such as "10ns".
n_frames_total (int) – Total number of frames in the trajectory.
start (int | None, optional) – Absolute start frame for the analysis window. When None, the equilibration-resolved start frame is used.
stop (int | None, optional) – Absolute exclusive stop frame. When None, the full remaining trajectory is used.
step (int, optional) – Frame stride, by default 1.
min_frames (int, optional) – Minimum required number of selected frames, by default 1.
timestep_ps (float | None, optional) – Explicit timestep override in picoseconds. When None, the loader timestep is used.
- Returns:
Validated trajectory window with materialized run() arguments.
- Return type:
TrajectoryWindow
- polyzymd.analyses.shared.window.resolve_trajectory_window(*, equilibration, n_frames_total, timestep_ps, start=None, stop=None, step=1, min_frames=1, first_frame_time_ps=None)[source]
Resolve and validate a trajectory frame window.
When the first loaded frame has a finite MDAnalysis timestamp, equilibration is interpreted as an absolute trajectory time. The start frame is the first loaded frame whose timestamp is greater than or equal to the equilibration time. When timestamp metadata is unavailable, the fallback loaded-frame-relative origin is used.
- Parameters:
equilibration (str) – Equilibration time string such as "10ns".
n_frames_total (int) – Total number of frames in the trajectory.
timestep_ps (float) – Time between consecutive frames in picoseconds.
start (int | None, optional) – Absolute start frame for the analysis window. When None, the equilibration-resolved start frame is used.
stop (int | None, optional) – Absolute exclusive stop frame. When None, the trajectory end is used.
step (int, optional) – Frame stride, by default 1.
min_frames (int, optional) – Minimum required number of selected frames, by default 1.
first_frame_time_ps (float | None, optional) – Absolute MDAnalysis timestamp of loaded frame 0 in picoseconds. Non-finite values are ignored, and the fallback loaded-frame-relative behavior is used instead.
- Returns:
Validated frame window.
- Return type:
TrajectoryWindow
- Raises:
ValueError – Raised when the timestep, equilibration, or window arguments are inconsistent with the trajectory.
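The window resolution described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: sketch_resolve_window is a hypothetical name, it takes the equilibration time already converted to picoseconds, it returns a plain dict instead of a TrajectoryWindow, and it assumes an explicit start replaces the equilibration-derived start unchanged.

```python
import math


def sketch_resolve_window(equilibration_ps, n_frames_total, timestep_ps,
                          start=None, stop=None, step=1, min_frames=1,
                          first_frame_time_ps=None):
    """Resolve (start, stop, step) from an equilibration time and validate it."""
    if timestep_ps <= 0 or step < 1:
        raise ValueError("timestep_ps must be positive and step must be >= 1")

    # Use absolute timestamps only when the first-frame time is finite.
    if first_frame_time_ps is not None and math.isfinite(first_frame_time_ps):
        offset_ps = equilibration_ps - first_frame_time_ps
        reference = "trajectory_timestamp"
    else:
        offset_ps = equilibration_ps
        reference = "loaded_frame_zero"

    # First frame whose time is greater than or equal to the equilibration time.
    equilibration_start = max(0, math.ceil(offset_ps / timestep_ps))

    start = equilibration_start if start is None else start
    stop = n_frames_total if stop is None else stop
    if not 0 <= start <= stop <= n_frames_total:
        raise ValueError("window is inconsistent with the trajectory length")

    n_selected = len(range(start, stop, step))
    if n_selected < min_frames:
        raise ValueError("window selects fewer frames than min_frames")

    return {
        "start": start,
        "stop": stop,
        "step": step,
        "equilibration_start": equilibration_start,
        "n_frames_selected": n_selected,
        "equilibration_time_reference": reference,
    }
```

For a 5000-frame trajectory saved every 10 ps, a 10 ns equilibration gives start 1000 when frame 0 is treated as time zero, and start 500 when the first loaded frame already carries a 5000 ps timestamp.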
Alignment and representative frames
Alignment helpers in polyzymd.analyses.shared.alignment standardize
reference-mode handling. Centroid helpers support plugins that need
representative frames or structures.
Representative frame finding utilities.
This module provides functions to find representative frames from MD trajectories using different methods. The representative frame is commonly used as a reference for trajectory alignment before RMSF calculations.
- centroid (aligned-mean representative frame)
Finds the frame closest to the aligned mean structure. Uses all protein atoms by default to capture side chain conformations. Best for: Finding a representative equilibrium conformation while removing rigid-body translation and rotation effects.
- average
Aligns to an average structure computed from all frames. Note: The average structure is synthetic and may have unphysical geometry (e.g., distorted bond lengths/angles). Best for: Pure mathematical measure of thermal fluctuations around the mean.
- frame
Uses a specific frame as the reference (user-specified). Best for: Analyzing fluctuations relative to a known functional state, such as a catalytically competent conformation.
- polyzymd.analyses.shared.centroid.find_centroid_frame(universe, selection='protein', start_frame=0, stop_frame=None, verbose=True)[source]
Find a representative aligned frame.
This function identifies the frame closest to the aligned mean structure. It first performs rigid-body alignment of each frame to a common reference, computes the mean coordinates in aligned space, then returns the trajectory frame with minimum RMSD to that aligned mean.
This approach avoids contamination from translation/rotation and provides a scientifically defensible representative frame for downstream alignment and RMSF calculations.
- Parameters:
universe (MDAnalysis.Universe) – Universe containing the trajectory to analyze.
selection (str, optional) – MDAnalysis selection string for atoms used to find the representative frame. Default is “protein” (all protein atoms) to capture both backbone and side chain conformations.
start_frame (int, optional) – First frame to include in analysis (0-indexed). Default is 0. Use this to skip equilibration frames.
stop_frame (int, optional) – Last frame to include (exclusive). Default is None (all frames).
verbose (bool, optional) – If True, log progress messages. Default is True.
- Returns:
Index of the representative frame (0-indexed, relative to the full trajectory, not to start_frame).
- Return type:
Notes
The algorithm aligns all candidate frames to a common reference frame, computes the aligned mean structure, then selects the frame with minimum RMSD to that mean.
Using all protein atoms (default) rather than just CA atoms captures the full conformational state including side chain rotamers.
Examples
>>> import MDAnalysis as mda
>>> u = mda.Universe("topology.pdb", "trajectory.dcd")
>>> # Find centroid after 100 frames of equilibration
>>> centroid_idx = find_centroid_frame(u, start_frame=100)
>>> print(f"Representative aligned frame: {centroid_idx}")
>>> # Use only backbone atoms
>>> centroid_idx = find_centroid_frame(u, selection="protein and backbone")
See also
find_reference_frame: High-level function supporting multiple methods
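The final selection step, picking the frame with minimum RMSD to the mean structure, can be sketched without MDAnalysis. The sketch assumes the frames have already been rigid-body aligned (the alignment step is omitted here), and sketch_closest_to_mean is a hypothetical name rather than part of the package.

```python
import math


def sketch_closest_to_mean(frames, start_frame=0, stop_frame=None):
    """Index of the frame closest (by RMSD) to the mean of pre-aligned frames.

    frames: list of frames, each a list of (x, y, z) tuples for the same atoms.
    The returned index is relative to the full list, not to start_frame.
    """
    stop = len(frames) if stop_frame is None else stop_frame
    window = frames[start_frame:stop]
    if not window:
        raise ValueError("no frames in the requested window")
    n_atoms = len(window[0])

    # Mean position of every atom over the window.
    mean = [
        tuple(sum(frame[a][k] for frame in window) / len(window) for k in range(3))
        for a in range(n_atoms)
    ]

    def rmsd_to_mean(frame):
        squared = sum(
            (frame[a][k] - mean[a][k]) ** 2 for a in range(n_atoms) for k in range(3)
        )
        return math.sqrt(squared / n_atoms)

    best_local = min(range(len(window)), key=lambda i: rmsd_to_mean(window[i]))
    return start_frame + best_local
```

Adding start_frame back to the local index mirrors the documented return convention: the result indexes the full trajectory even when early frames are skipped.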
- polyzymd.analyses.shared.centroid.find_reference_frame(universe, mode='centroid', selection='protein', start_frame=0, stop_frame=None, specific_frame=None, verbose=True)[source]
Find a reference frame for trajectory alignment.
This is the high-level interface for selecting a reference structure for RMSF calculations. It supports multiple methods for choosing the reference, each appropriate for different scientific questions.
- Parameters:
universe (MDAnalysis.Universe) – Universe containing the trajectory.
mode ({"centroid", "average", "frame", "external"}, optional) –
Method for selecting the reference. Default is "centroid".
"centroid": Representative aligned frame mode. Returns the frame index closest to the aligned mean structure.
"average": Use average structure as reference. Returns None (caller should use AverageStructure).
"frame": Use a specific frame specified by specific_frame. Returns the specified frame index (converted to 0-indexed).
"external": Use an external PDB file as reference. Returns None (alignment handled by align_trajectory).
selection (str, optional) – MDAnalysis selection string for atoms to use. Default is “protein”. Only used for “centroid” mode.
start_frame (int, optional) – First frame for analysis (0-indexed). Default is 0.
stop_frame (int, optional) – Last frame for analysis (exclusive). Default is None.
specific_frame (int, optional) – Frame index to use when mode="frame" (1-indexed, PyMOL convention). Required when mode="frame".
verbose (bool, optional) – Log progress messages. Default is True.
- Returns:
Frame index (0-indexed) to use as reference, or None if mode="average" (the caller should compute an average structure) or mode="external" (alignment is handled by align_trajectory).
- Return type:
int or None
- Raises:
ValueError – If mode="frame" but specific_frame is not provided, or if specific_frame is out of range.
Examples
>>> # Find a representative aligned frame (equilibrium conformation)
>>> ref_frame = find_reference_frame(u, mode="centroid", start_frame=100)
>>> # Use average structure
>>> ref_frame = find_reference_frame(u, mode="average")
>>> # ref_frame is None, use AverageStructure instead
>>> # Use a specific frame (e.g., catalytically competent state)
>>> ref_frame = find_reference_frame(u, mode="frame", specific_frame=500)
See also
find_centroid_frame: Low-level representative frame selection
Time-series statistics and convergence
These modules provide statistical summaries, autocorrelation-aware estimates, inferential tests, and convergence diagnostics used by built-in and contributor plugins.
Autocorrelation analysis for independent sampling.
MD trajectories are highly correlated in time - consecutive frames are not independent samples. This module provides tools to:
Compute the autocorrelation function (ACF) of an observable
Estimate the correlation time (τ) from the ACF
Compute statistical inefficiency (g) for proper uncertainty quantification
Select independent frames based on τ for proper statistics
Key Concepts
Autocorrelation function (ACF): Measures how correlated a signal is with itself at different time lags. ACF(0) = 1, and ACF decays toward 0.
Correlation time (τ): Characteristic time for decorrelation. Frames separated by > 2τ are approximately independent.
Statistical inefficiency (g): Factor by which variance is inflated due to correlation. g = 1 + 2*Σ C(t)*(1-t/N). N_eff = N/g.
Independent samples: For proper SEM calculation, we need N_eff independent samples, not N_frames correlated observations.
Methods for τ estimation
First zero crossing: τ is lag where ACF first crosses zero
Exponential fit: Fit ACF = exp(-t/τ) and extract τ
Integration: τ = ∫ACF(t)dt from 0 to first zero (or cutoff)
Statistical Validity
- The number of effective independent samples (N_eff) is computed as:
N_eff = N / g = N / (1 + 2*Σ C(t)*(1-t/N))
This matches the algorithm from Chodera et al. (2007) with the finite-size correction factor (1-t/N). When N_eff < 10, statistical estimates (mean, SEM) may be unreliable, and users should be warned per LiveCoMS best practices (Grossfield et al., 2018).
For multiple timeseries of different lengths (e.g., replicates), use statistical_inefficiency_multiple() which correctly handles the averaging.
References
Flyvbjerg & Petersen (1989) J. Chem. Phys. 91:461 (block averaging)
Chodera et al. (2007) J. Chem. Theory Comput. 3:26 (statistical inefficiency)
Grossfield et al. (2018) LiveCoMS 1:5067 (uncertainty quantification)
- class polyzymd.analyses.shared.autocorrelation.CorrelationTimeMethod(value)[source]
Method for estimating correlation time from ACF.
- FIRST_ZERO = 'first_zero'
- EXPONENTIAL_FIT = 'exponential_fit'
- INTEGRATION = 'integration'
- class polyzymd.analyses.shared.autocorrelation.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]
Bases: object
Result of autocorrelation function computation.
- lags
Time lags in the same units as timestep
- Type:
NDArray[np.float64]
- acf
Autocorrelation values (normalized, ACF[0] = 1)
- Type:
NDArray[np.float64]
- timestep
Time between frames
- Type:
float
- timestep_unit
Unit of timestep (e.g., “ps”, “ns”)
- Type:
str
- n_samples
Number of samples in the original timeseries
- Type:
int
- lags: NDArray[np.float64]
- acf: NDArray[np.float64]
- timestep: float
- timestep_unit: str
- n_samples: int
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(lags, acf, timestep, timestep_unit, n_samples)
- class polyzymd.analyses.shared.autocorrelation.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]
Bases: object
Result of correlation time estimation.
- tau
Estimated correlation time
- Type:
float
- tau_unit
Unit of tau (same as timestep unit)
- Type:
str
- method
Method used for estimation
- Type:
str
- n_independent
Estimated number of independent samples in trajectory
- Type:
int
- statistical_inefficiency
g = 1 + 2*tau/dt, factor by which variance is inflated
- Type:
float
- warning
Warning message if statistics may be unreliable (e.g., N_ind < 10)
- Type:
str | None
- tau: float
- tau_unit: str
- method: str
- n_independent: int
- statistical_inefficiency: float
- property is_reliable: bool
Return True if statistics are likely reliable (N_ind >= 10).
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)
- polyzymd.analyses.shared.autocorrelation.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]
Compute autocorrelation function of a 1D timeseries.
Uses FFT-based computation for efficiency.
- Parameters:
timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)
max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.
timestep (float, optional) – Time between frames. Default is 1.0.
timestep_unit (str, optional) – Unit of timestep. Default is “frames”.
- Returns:
Container with lags, acf values, and metadata
- Return type:
ACFResult
Examples
>>> # Compute ACF of RMSD timeseries
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])  # from MDAnalysis
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")
Notes
The ACF is normalized so that ACF[0] = 1.
For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²
For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.
- polyzymd.analyses.shared.autocorrelation.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]
Estimate correlation time from ACF or raw timeseries.
- Parameters:
acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries
timestep (float, optional) – Time between frames (only used if passing raw timeseries)
timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)
method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ:
"first_zero": Lag where ACF first crosses zero
"exponential_fit": Fit ACF = exp(-t/τ)
"integration": τ = ∫ACF(t)dt (recommended, most robust)
n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.
- Returns:
Contains tau, method used, n_independent, statistical_inefficiency
- Return type:
CorrelationTimeResult
Examples
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps")
>>> tau_result = estimate_correlation_time(acf_result, method="integration")
>>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}")
>>> print(f"Independent samples: {tau_result.n_independent}")
Notes
- The “integration” method is most robust for noisy ACFs. It computes:
τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt
Integration stops at first zero crossing to avoid noise contribution.
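The integration estimate can be illustrated with a small pure-Python sketch. It uses a direct O(N²) sum rather than the FFT path mentioned above, starts the rectangle-rule sum at lag 0, and treats a variance below 1e-12 as constant; those choices and the names sketch_acf and sketch_tau_integration are assumptions of the sketch, not the package's implementation.

```python
def sketch_acf(series, max_lag=None):
    """Normalized autocorrelation by direct summation, with ACF[0] == 1."""
    n = len(series)
    max_lag = n // 4 if max_lag is None else max_lag
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev) / n
    if var < 1e-12:
        # Degenerate (constant) series: ACF[0] = 1, all positive lags 0.
        return [1.0] + [0.0] * max_lag
    return [
        sum(dev[i] * dev[i + lag] for i in range(n - lag)) / ((n - lag) * var)
        for lag in range(max_lag + 1)
    ]


def sketch_tau_integration(acf, timestep=1.0):
    """Sum ACF[i] * dt from lag 0 up to (not including) the first value <= 0."""
    tau = 0.0
    for value in acf:
        if value <= 0.0:
            break
        tau += value * timestep
    return tau
```

Stopping at the first non-positive value keeps the noisy tail of the ACF out of the sum, which is the robustness argument given in the note above.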
- polyzymd.analyses.shared.autocorrelation.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]
Get frame indices for independent samples.
Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.
- Parameters:
n_frames (int) – Total number of frames in trajectory
correlation_time (float) – Correlation time τ (in same units as timestep)
timestep (float, optional) – Time between frames. Default is 1.0.
start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed (PyMOL convention).
- Returns:
Array of frame indices (0-indexed) that are approximately independent
- Return type:
NDArray[np.int64]
Examples
>>> # Get independent frames for RMSF calculation
>>> tau_result = estimate_correlation_time(rmsd, timestep=10.0)
>>> indices = get_independent_indices(
...     n_frames=10000,
...     correlation_time=tau_result.tau,
...     timestep=10.0,
...     start_frame=1000,  # Skip first 1000 frames for equilibration
... )
>>> print(f"Using {len(indices)} independent frames")
Notes
Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.
The spacing is set to 2*τ/timestep. For a purely exponential ACF this leaves a residual correlation of about exp(-2) ≈ 0.14 between consecutive selected frames, which is treated as approximately independent.
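The frame selection amounts to striding the trajectory by 2τ expressed in frames. The sketch below assumes the stride is rounded up to a whole number of frames and never drops below 1; the rounding rule and the name sketch_independent_indices are assumptions, not confirmed details of get_independent_indices.

```python
import math


def sketch_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0):
    """0-indexed frames spaced by at least 2 * tau, starting at start_frame."""
    if timestep <= 0:
        raise ValueError("timestep must be positive")
    # Convert 2 * tau into a whole number of frames (rounded up, minimum 1).
    stride = max(1, math.ceil(2.0 * correlation_time / timestep))
    return list(range(start_frame, n_frames, stride))
```

With τ = 50 ps, a 10 ps timestep, and 1000 equilibration frames skipped, a 10000-frame trajectory yields every tenth frame from index 1000 onward.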
- polyzymd.analyses.shared.autocorrelation.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]
Compute statistical inefficiency g directly from a timeseries.
The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:
Var(mean) = Var(x) * g / N
This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)
where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).
- Parameters:
timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)
mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.
fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.
- Returns:
Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.
- Return type:
Examples
>>> # Binary contact timeseries
>>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...])
>>> g = statistical_inefficiency(contacts)
>>> n_eff = len(contacts) / g
>>> print(f"Effective samples: {n_eff:.1f}")
>>> # Continuous observable
>>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...])
>>> g = statistical_inefficiency(rmsd)
Notes
This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.
For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
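The formula g = 1 + 2 * Σ C(t) * (1 - t/N) can be written out directly. The sketch below uses a direct sum instead of an FFT, stops at the first non-positive C(t) once t exceeds mintime, and returns 1.0 for a constant series; the name sketch_statistical_inefficiency and the constant-series handling are assumptions, not the package's code.

```python
def sketch_statistical_inefficiency(series, mintime=3):
    """Statistical inefficiency g >= 1 with the finite-size factor (1 - t/N)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return 1.0  # constant series: treated as uncorrelated in this sketch

    g = 1.0
    for t in range(1, n - 1):
        # Normalized autocorrelation at lag t.
        c = sum(dev[i] * dev[i + t] for i in range(n - t)) / ((n - t) * var)
        # Terminate at the first non-positive value, but only after mintime lags.
        if c <= 0.0 and t > mintime:
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)
```

A series that switches state once, such as ten 0s followed by ten 1s, gives g well above 1, while a rapidly alternating series is clamped to the lower bound of 1.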
- polyzymd.analyses.shared.autocorrelation.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]
Compute statistical inefficiency from multiple timeseries of different lengths.
This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.
- Parameters:
- Returns:
Statistical inefficiency g (>= 1.0)
- Return type:
Examples
>>> # Three replicates with different lengths
>>> ts1 = np.array([0, 1, 1, 0, 0, 1])  # 6 frames
>>> ts2 = np.array([1, 1, 0, 0, 0])  # 5 frames
>>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1])  # 7 frames
>>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])
Notes
This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.
The algorithm:
1. Compute global mean μ across all timeseries
2. For each lag t:
   - Compute sum of (x - μ) products across all timeseries where t < N_k
   - Compute sum of sample counts across all timeseries where t < N_k
   - Average to get C(t)
3. Sum with finite-size correction
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
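The three steps listed above can be sketched as follows. The sketch assumes the finite-size factor uses the longest series length and that termination follows the same mintime rule as the single-series case; sketch_statistical_inefficiency_multiple is a hypothetical name and this is not the package's implementation.

```python
def sketch_statistical_inefficiency_multiple(series_list, mintime=3):
    """Pooled statistical inefficiency for series of possibly different lengths."""
    lengths = [len(s) for s in series_list]
    n_total = sum(lengths)
    n_max = max(lengths)

    # Step 1: global mean and variance across all series.
    mean = sum(sum(s) for s in series_list) / n_total
    devs = [[x - mean for x in s] for s in series_list]
    var = sum(d * d for dev in devs for d in dev) / n_total
    if var == 0.0:
        return 1.0  # constant data: treated as uncorrelated in this sketch

    g = 1.0
    for t in range(1, n_max - 1):
        # Step 2: pool products and pair counts over series long enough for lag t.
        numerator = 0.0
        denominator = 0
        for dev in devs:
            n_k = len(dev)
            if t < n_k:
                numerator += sum(dev[i] * dev[i + t] for i in range(n_k - t))
                denominator += n_k - t
        c = numerator / (denominator * var)
        if c <= 0.0 and t > mintime:
            break
        # Step 3: accumulate with the finite-size correction.
        g += 2.0 * c * (1.0 - t / n_max)
    return max(g, 1.0)
```

Pooling the numerator and denominator before dividing means longer replicates contribute more lag pairs, instead of every replicate's ACF being weighted equally.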
- polyzymd.analyses.shared.autocorrelation.n_effective(n_samples, g)[source]
Compute number of effective independent samples.
- polyzymd.analyses.shared.autocorrelation.check_statistical_reliability(n_eff, threshold=10)[source]
Check if statistics are reliable based on effective sample count.
- Parameters:
- Returns:
is_reliable (bool) – True if n_eff >= threshold
warning (str | None) – Warning message if not reliable, None otherwise
- Return type:
Examples
>>> g = statistical_inefficiency(contacts)
>>> n_eff = n_effective(len(contacts), g)
>>> is_ok, warning = check_statistical_reliability(n_eff)
>>> if not is_ok:
...     print(warning)
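The relationship N_eff = N / g and the threshold check are simple enough to sketch directly. The function names and the wording of the warning message below are hypothetical; only the N / g formula and the default threshold of 10 come from the documentation above.

```python
def sketch_n_effective(n_samples, g):
    """Effective number of independent samples, N_eff = N / g."""
    if g < 1.0:
        raise ValueError("statistical inefficiency must be >= 1")
    return n_samples / g


def sketch_check_reliability(n_eff, threshold=10):
    """Return (is_reliable, warning); warning is None when n_eff >= threshold."""
    if n_eff >= threshold:
        return True, None
    message = (
        f"Only {n_eff:.1f} effective samples (< {threshold}); "
        "mean and SEM estimates may be unreliable."
    )
    return False, message
```

For example, 1000 frames with g = 25 correspond to 40 effective samples, which passes the default threshold, while 100 frames with the same g do not.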
Statistical functions for replicate aggregation.
This module provides statistical utilities for combining results across multiple simulation replicates with proper error propagation.
Key design decisions:
- All uncertainties are reported as Standard Error of the Mean (SEM)
- SEM = std / sqrt(N) where N is the number of independent samples
- Hierarchical aggregation preserves proper statistics at each level
Functions
- compute_sem
Standard error of the mean for a 1D array
- aggregate_per_residue_stats
Combine per-residue values across replicates
- aggregate_region_stats
Combine region-averaged values across replicates
- weighted_mean_with_sem
Weighted average with proper error propagation
- class polyzymd.analyses.shared.statistics.StatResult(mean, sem, n_samples)[source]
Bases: object
Container for mean +/- SEM results.
- mean
The mean value
- Type:
float
- sem
Standard error of the mean
- Type:
float
- n_samples
Number of samples used in computation
- Type:
int
- mean: float
- sem: float
- n_samples: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(mean, sem, n_samples)
- class polyzymd.analyses.shared.statistics.PerResidueStats(residue_ids, means, sems, n_replicates)[source]
Bases: object
Container for per-residue statistics across replicates.
- residue_ids
Residue identifiers (1-indexed, following PyMOL convention)
- Type:
NDArray[np.int64]
- means
Mean value for each residue across replicates
- Type:
NDArray[np.float64]
- sems
SEM for each residue across replicates
- Type:
NDArray[np.float64]
- n_replicates
Number of replicates aggregated
- Type:
int
- residue_ids: NDArray[np.int64]
- means: NDArray[np.float64]
- sems: NDArray[np.float64]
- n_replicates: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(residue_ids, means, sems, n_replicates)
- polyzymd.analyses.shared.statistics.compute_sem(values, ddof=1)[source]
Compute mean and standard error of the mean.
SEM = std / sqrt(N) where N is the number of samples.
- Parameters:
values (array_like) – 1D array of values (e.g., one value per replicate)
ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).
- Returns:
Container with mean, sem, and n_samples
- Return type:
StatResult
Examples
>>> values = [2.5, 2.7, 2.6, 2.4, 2.8]  # RMSF from 5 replicates
>>> result = compute_sem(values)
>>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A")
RMSF = 2.60 +/- 0.07 A
Notes
For a single value, SEM is undefined (returns 0.0).
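The SEM formula above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `sem_sketch` is a hypothetical name, and returning 0.0 whenever `n - ddof <= 0` is an assumption that extends the documented single-value convention.

```python
import math
import statistics


def sem_sketch(values, ddof=1):
    """Return (mean, sem, n_samples) for a 1D sequence; SEM = std / sqrt(N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = statistics.fmean(values)
    if n - ddof <= 0:
        # SEM is undefined here; follow the documented convention of 0.0
        # for a single value (assumed to apply to any n <= ddof).
        return mean, 0.0, n
    variance = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return mean, math.sqrt(variance) / math.sqrt(n), n


mean, sem, n = sem_sketch([2.5, 2.7, 2.6, 2.4, 2.8])
print(f"RMSF = {mean:.2f} +/- {sem:.2f} A")  # RMSF = 2.60 +/- 0.07 A
```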
- polyzymd.analyses.shared.statistics.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]
Aggregate per-residue values across replicates.
For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.
residue_ids (array, optional) – 1-indexed residue identifiers. If None, uses 1, 2, 3, … Following PyMOL convention (1-indexed).
- Returns:
Container with residue_ids, means, sems, n_replicates
- Return type:
PerResidueStats
- Raises:
ValueError – If arrays have inconsistent lengths or no replicates provided
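As a rough sketch of the aggregation described above, the following computes a column-wise mean and SEM across replicates in plain Python. The function name `per_residue_mean_sem` and the tuple return value are illustrative assumptions; the real function returns a `PerResidueStats` container.

```python
import math


def per_residue_mean_sem(per_replicate_values):
    """Column-wise mean and SEM (ddof=1) across equally long replicate rows."""
    if not per_replicate_values:
        raise ValueError("no replicates provided")
    n_res = len(per_replicate_values[0])
    if any(len(row) != n_res for row in per_replicate_values):
        raise ValueError("replicate arrays have inconsistent lengths")
    n_rep = len(per_replicate_values)
    means, sems = [], []
    # zip(*rows) yields one tuple per residue holding that residue's
    # value from every replicate.
    for column in zip(*per_replicate_values):
        mean = sum(column) / n_rep
        if n_rep > 1:
            variance = sum((v - mean) ** 2 for v in column) / (n_rep - 1)
            sem = math.sqrt(variance / n_rep)
        else:
            sem = 0.0  # assumed convention for a single replicate
        means.append(mean)
        sems.append(sem)
    residue_ids = list(range(1, n_res + 1))  # default 1-indexed ids
    return residue_ids, means, sems


replicates = [[1.0, 2.0, 3.0], [1.2, 2.2, 2.8], [0.8, 1.8, 3.2]]
ids, means, sems = per_residue_mean_sem(replicates)
```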
- polyzymd.analyses.shared.statistics.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]
Aggregate region-averaged values across replicates.
For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.
This implements the correct hierarchical aggregation:
1. First average within each replicate (over selected residues)
2. Then compute mean +/- SEM across replicate means
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.
residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.
- Returns:
Mean +/- SEM of region-averaged values across replicates
- Return type:
StatResult
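The two-step aggregation can be sketched as follows. This is a plain-Python illustration under stated assumptions: `region_mean_sem` is a hypothetical name, it returns a tuple rather than a `StatResult`, and it assumes the mask selects at least one residue.

```python
import math


def region_mean_sem(per_replicate_values, residue_mask=None):
    """Average within each replicate first, then mean/SEM across replicates."""
    replicate_means = []
    for row in per_replicate_values:
        # Step 1: average over the selected residues of this replicate.
        selected = [
            v for i, v in enumerate(row)
            if residue_mask is None or residue_mask[i]
        ]
        replicate_means.append(sum(selected) / len(selected))
    # Step 2: mean and SEM across the replicate-level averages.
    n = len(replicate_means)
    mean = sum(replicate_means) / n
    if n < 2:
        return mean, 0.0, n  # assumed convention for a single replicate
    variance = sum((m - mean) ** 2 for m in replicate_means) / (n - 1)
    return mean, math.sqrt(variance / n), n


result = region_mean_sem(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    residue_mask=[True, True, False],
)
```

Averaging within each replicate first keeps the replicate, not the residue, as the independent unit, so the SEM reflects replicate-to-replicate spread.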
- polyzymd.analyses.shared.statistics.weighted_mean_with_sem(means, sems, weights=None)[source]
Compute weighted mean with proper error propagation.
Useful for combining results from different conditions or analyses with different uncertainties.
- Parameters:
means (array_like) – Mean values from each source
sems (array_like) – SEM values from each source
weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.
- Returns:
Weighted mean with propagated uncertainty
- Return type:
StatResult
Notes
- For inverse-variance weighting, the combined SEM is:
SEM_combined = 1 / sqrt(sum(1/SEM_i^2))
- For arbitrary weights, uses standard error propagation:
SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)
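Both formulas in the notes can be sketched in a few lines. This is an illustration only: `weighted_mean_sem` is a hypothetical name, it returns a tuple rather than a `StatResult`, and it assumes every SEM is strictly positive when inverse-variance weights are used.

```python
import math


def weighted_mean_sem(means, sems, weights=None):
    """Weighted mean with propagated SEM, following the two formulas above."""
    if weights is None:
        # Inverse-variance weights; a zero SEM would divide by zero here.
        weights = [1.0 / s**2 for s in sems]
        total = sum(weights)
        mean = sum(w * m for w, m in zip(weights, means)) / total
        return mean, 1.0 / math.sqrt(total)
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / total
    sem = math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sems))) / total
    return mean, sem


iv_mean, iv_sem = weighted_mean_sem([1.0, 3.0], [0.1, 0.2])
eq_mean, eq_sem = weighted_mean_sem([1.0, 3.0], [0.1, 0.2], weights=[1.0, 1.0])
```

Substituting w_i = 1/SEM_i^2 into the general formula reduces it to the inverse-variance expression, so the two branches agree when the same weights are passed explicitly.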
Inferential statistical tests shared across analysis comparisons.
This module provides statistical functions for comparing analysis results across multiple conditions, including t-tests, ANOVA, and effect sizes.
It is the canonical home for inferential statistics used by analysis plugins and comparison utilities.
All functions use SciPy for statistical calculations.
- class polyzymd.analyses.shared.inferential_statistics.TTestResult(t_statistic, p_value)[source]
Bases: object
Result of a two-sample t-test.
- t_statistic
The t-statistic
- Type:
float
- p_value
Two-tailed p-value
- Type:
float
- t_statistic: float
- p_value: float
- property significant: bool
Whether the result is significant at p < 0.05.
Note
This uses a hardcoded alpha=0.05 threshold. The comparison pipeline overrides significance with configurable thresholds (BH-adjusted or Tukey). Use this property only as a convenience default.
- to_dict()[source]
Convert to dictionary.
- __init__(t_statistic, p_value)
- class polyzymd.analyses.shared.inferential_statistics.EffectSize(cohens_d, interpretation, direction)[source]
Bases: object
Cohen’s d effect size with interpretation.
- cohens_d
The effect size (positive = group1 > group2)
- Type:
- interpretation
Categorical interpretation: “negligible”, “small”, “medium”, “large”
- Type:
- direction
“higher” (d > 0), “lower” (d < 0), or “unchanged” (d == 0).
- Type:
- interpretation: str
- direction: str
- to_dict()[source]
Convert to dictionary.
- __init__(cohens_d, interpretation, direction)
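For orientation, a Cohen's d calculation with a pooled standard deviation might look like the sketch below. The function name is hypothetical, and the 0.2 / 0.5 / 0.8 cut-offs are Cohen's conventional thresholds, assumed here because this page does not state which boundaries the module uses.

```python
import math
import statistics


def cohens_d_sketch(group1, group2):
    """Return (d, interpretation, direction) using a pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.fmean(group1), statistics.fmean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    d = (m1 - m2) / pooled_sd  # positive means group1 > group2
    magnitude = abs(d)
    # Conventional thresholds (assumption, not taken from the module).
    if magnitude < 0.2:
        interpretation = "negligible"
    elif magnitude < 0.5:
        interpretation = "small"
    elif magnitude < 0.8:
        interpretation = "medium"
    else:
        interpretation = "large"
    direction = "higher" if d > 0 else "lower" if d < 0 else "unchanged"
    return d, interpretation, direction


effect = cohens_d_sketch([2, 4, 6], [1, 3, 5])
```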
- class polyzymd.analyses.shared.inferential_statistics.ANOVAResult(f_statistic, p_value)[source]
Bases: object
Result of one-way ANOVA.
- f_statistic
The F-statistic
- Type:
float
- p_value
P-value for the test
- Type:
float
- f_statistic: float
- p_value: float
- property significant: bool
Whether the result is significant at p < 0.05.
Note
This uses a hardcoded alpha=0.05 threshold. The comparison pipeline overrides significance with configurable thresholds. Use this property only as a convenience default.
- to_dict()[source]
Convert to dictionary.
- __init__(f_statistic, p_value)
- class polyzymd.analyses.shared.inferential_statistics.BHResult(raw_p_value, adjusted_p_value, significant, rank)[source]
Bases: object
Result of Benjamini-Hochberg correction for one hypothesis.
- raw_p_value
Original uncorrected p-value.
- Type:
float | None
- adjusted_p_value
BH-adjusted p-value (q-value). None if raw was None.
- Type:
float | None
- significant
Whether adjusted_p_value <= alpha.
- Type:
bool
- rank
1-based rank among non-None p-values (smallest=1). None if raw was None.
- Type:
int | None
- significant: bool
- __init__(raw_p_value, adjusted_p_value, significant, rank)
- polyzymd.analyses.shared.inferential_statistics.benjamini_hochberg(p_values, alpha=0.05)[source]
Apply Benjamini-Hochberg FDR correction to a family of p-values.
Implements the Benjamini-Hochberg (1995) step-up procedure to control the false discovery rate. The correction adjusts p-values such that declaring significance at adjusted_p <= alpha controls the expected proportion of false discoveries at level alpha.
None and NaN entries in p_values (e.g. cross-temperature pairs where statistics are suppressed, or degenerate tests with undefined p-values) are passed through: the corresponding BHResult has adjusted_p_value=None and significant=False.
- Parameters:
- Returns:
One entry per input p-value, in the same order.
- Return type:
list[BHResult]
References
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57(1), 289-300.
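The step-up procedure and the None/NaN pass-through can be sketched as below. This is an illustrative reimplementation, not the module's code: `bh_adjust` is a hypothetical name, results are plain tuples instead of `BHResult` objects, and tie handling is whatever sorting by (p, index) gives.

```python
import math


def bh_adjust(p_values, alpha=0.05):
    """Return (adjusted_p, significant, rank) per input, in input order."""
    # Keep only usable p-values, remembering their original positions.
    valid = sorted(
        (p, i) for i, p in enumerate(p_values)
        if p is not None and not math.isnan(p)
    )
    m = len(valid)
    results = [(None, False, None)] * len(p_values)
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        p, idx = valid[rank - 1]
        running_min = min(running_min, p * m / rank)
        results[idx] = (running_min, running_min <= alpha, rank)
    return results


adjusted = bh_adjust([0.01, None, 0.04, 0.03, float("nan")])
```

Only the three usable p-values form the family here (m = 3); the None and NaN entries keep their positions in the output but receive no rank.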
- polyzymd.analyses.shared.inferential_statistics.independent_ttest(group1, group2, method='student')[source]
Perform a two-sample independent t-test.
Tests the null hypothesis that two independent samples have identical expected values.
The method parameter controls the variance assumption:
"student" uses Student’s t-test (equal_var=True), which assumes equal population variances
"welch" uses Welch’s t-test (equal_var=False), which does not assume equal variances
Use "student" when homoscedasticity is a reasonable assumption. Use "welch" when variances may differ across conditions.
- Parameters:
group1 (array_like) – First group of values (e.g., control replicate means)
group2 (array_like) – Second group of values (e.g., treatment replicate means)
method (str, optional) – T-test method to use: "student" or "welch", by default "student".
- Returns:
Result containing t-statistic and p-value
- Return type:
TTestResult
Examples
>>> control = [0.715, 0.693, 0.696]  # No polymer RMSF
>>> treatment = [0.517, 0.586]  # 100% SBMA RMSF
>>> result = independent_ttest(control, treatment)
>>> print(f"t = {result.t_statistic:.3f}, p = {result.p_value:.4f}")
- Raises:
ValueError – If method is not "student" or "welch"
- polyzymd.analyses.shared.inferential_statistics.one_way_anova(*groups)[source]
Perform classical one-way ANOVA across multiple groups.
Tests the null hypothesis that all groups have the same mean using scipy.stats.f_oneway (equal variance assumption).
- Parameters:
*groups (array_like) – Variable number of groups to compare. Each group must have at least 2 observations; groups with fewer observations cause the function to return NaN statistics.
- Returns:
Result containing F-statistic and p-value. Both are NaN if any group has fewer than 2 observations.
- Return type:
ANOVAResult
Examples
>>> no_poly = [0.715, 0.693, 0.696]
>>> sbma = [0.517, 0.586]
>>> egma = [0.558, 0.738, 0.496]
>>> result = one_way_anova(no_poly, sbma, egma)
>>> print(f"F = {result.f_statistic:.3f}, p = {result.p_value:.4f}")
- class polyzymd.analyses.shared.inferential_statistics.TukeyHSDResult(group_i, group_j, statistic, p_value)[source]
Bases: object
Result of Tukey’s HSD test for one pair of groups.
- group_i
Index of the first group.
- Type:
int
- group_j
Index of the second group.
- Type:
int
- statistic
Mean difference (group_j - group_i).
- Type:
float
- p_value
Tukey-adjusted p-value for this pair.
- Type:
float
- group_i: int
- group_j: int
- statistic: float
- p_value: float
- __init__(group_i, group_j, statistic, p_value)
- polyzymd.analyses.shared.inferential_statistics.tukey_hsd(*groups)[source]
Run Tukey’s Honestly Significant Difference test.
Computes family-wise-adjusted p-values for all pairwise group comparisons using scipy.stats.tukey_hsd.
- Parameters:
*groups (array_like) – Variable number of groups to compare. Each group must have at least 2 observations.
- Returns:
One result per unique pair (i < j), ordered by (i, j). Returns an empty list if fewer than 2 groups are provided or any group has fewer than 2 observations.
- Return type:
list[TukeyHSDResult]
Examples
>>> results = tukey_hsd([1, 2, 3], [4, 5, 6], [7, 8, 9])
>>> for r in results:
...     print(f"({r.group_i}, {r.group_j}): p={r.p_value:.4f}")
- polyzymd.analyses.shared.inferential_statistics.percent_change(control_mean, treatment_mean)[source]
Calculate percent change from control.
- Parameters:
- Returns:
Percent change: (treatment - control) / control * 100. Negative = reduction, Positive = increase.
Special handling for zero control values:
0 -> 0 returns 0.0
0 -> positive returns math.inf
0 -> negative returns -math.inf
If either input is non-finite (NaN or +/-inf), returns math.nan.
- Return type:
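The special cases listed above translate directly into a small function. This is a sketch that follows the documented rules; the name `percent_change_sketch` is hypothetical.

```python
import math


def percent_change_sketch(control_mean, treatment_mean):
    """Percent change from control with the documented zero/non-finite rules."""
    if not (math.isfinite(control_mean) and math.isfinite(treatment_mean)):
        return math.nan
    if control_mean == 0:
        if treatment_mean == 0:
            return 0.0
        # A change away from an exactly-zero control has no finite ratio.
        return math.inf if treatment_mean > 0 else -math.inf
    return (treatment_mean - control_mean) / control_mean * 100


change = percent_change_sketch(0.70, 0.56)  # roughly a 20% reduction
```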
Convergence diagnostics for sliding-window timeseries analysis.
This module implements a sliding-window convergence heuristic adapted from a collaborator notebook used for RMSD equilibration checks.
- class polyzymd.analyses.shared.convergence.ConvergenceResult(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)[source]
Bases: object
Container for convergence diagnostics.
- converged
Whether sustained convergence was detected.
- Type:
bool
- assessable
Whether convergence could be assessed from available data.
- Type:
bool
- convergence_time_ns
Start time of the first sustained converged period.
- Type:
float | None
- window_size_ns
Sliding window width in ns.
- Type:
float
- step_size_ns
Sliding window stride in ns.
- Type:
float
- slope_threshold
Absolute slope cutoff used for convergence.
- Type:
float
- sustained_for_ns
Required sustained duration below slope threshold.
- Type:
float
- message
Human-readable status message.
- Type:
str
- converged: bool
- assessable: bool
- window_size_ns: float
- step_size_ns: float
- slope_threshold: float
- sustained_for_ns: float
- message: str
- __init__(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)
- polyzymd.analyses.shared.convergence.find_convergence_time(time_ns, values, window_size_ns=15.0, step_size_ns=5.0, slope_threshold=0.0005, sustained_for_ns=15.0)[source]
Find sustained convergence time using a sliding-window slope heuristic.
- Parameters:
time_ns (array_like) – Monotonically increasing time values in ns.
values (array_like) – Signal values sampled at time_ns.
window_size_ns (float, optional) – Width of each averaging window in ns.
step_size_ns (float, optional) – Sliding step between successive windows in ns.
slope_threshold (float, optional) – Absolute slope threshold for classifying a window-to-window change as converged.
sustained_for_ns (float, optional) – Required cumulative duration below threshold before declaring convergence.
- Returns:
Full convergence diagnostics, including intermediate window means and slope traces.
- Return type:
ConvergenceResult
- Raises:
ValueError – Raised when inputs are invalid.
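One plausible reading of the sliding-window heuristic is sketched below. The details are assumptions rather than the module's algorithm: windows are half-open intervals starting at the first time point, a slope is the change between consecutive window means divided by the step, and the returned time is the start of the first window in a run of sub-threshold slopes that spans `sustained_for_ns`. The function name is hypothetical and it returns only the convergence time, not a full `ConvergenceResult`.

```python
def find_convergence_sketch(time_ns, values, window_size_ns=15.0,
                            step_size_ns=5.0, slope_threshold=0.0005,
                            sustained_for_ns=15.0):
    """Return the start time of the first sustained flat period, or None."""
    if len(time_ns) != len(values) or len(time_ns) < 2:
        raise ValueError("time_ns and values must match and hold >= 2 samples")
    starts, means = [], []
    start = time_ns[0]
    while start + window_size_ns <= time_ns[-1]:
        window = [v for t, v in zip(time_ns, values)
                  if start <= t < start + window_size_ns]
        if window:
            starts.append(start)
            means.append(sum(window) / len(window))
        start += step_size_ns
    # Slope between consecutive window means, in value units per ns.
    slopes = [(means[i + 1] - means[i]) / (starts[i + 1] - starts[i])
              for i in range(len(means) - 1)]
    run_start = None
    for i, slope in enumerate(slopes):
        if abs(slope) < slope_threshold:
            if run_start is None:
                run_start = i
            # Declare convergence once the flat run spans the required time.
            if starts[i + 1] - starts[run_start] >= sustained_for_ns:
                return starts[run_start]
        else:
            run_start = None
    return None


# Synthetic signal: linear decay for 30 ns, then a flat plateau.
time_axis = [float(t) for t in range(101)]
signal = [max(0.0, 30.0 - t) * 0.1 for t in time_axis]
t_conv = find_convergence_sketch(time_axis, signal)
```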
Plotting
Plotting helpers centralize figure themes, output paths, axis styling, legends, grouped bars, and matrix annotations.
Shared plotting utilities for analysis plugins.
This module provides reusable plotting helper functions extracted from the plotter infrastructure. Analysis plugins import these functions to apply consistent styling, save figures with watermarks, and render common chart elements (grouped bars, heatmap annotations, etc.) without inheriting from a base class.
All functions accept a PlotSettings object (from
polyzymd.config.comparison) so that user-configured themes, palettes,
and DPI settings are respected automatically.
Examples
>>> from polyzymd.analyses.shared.plotting import (
... apply_axis_style, apply_legend, get_palette_colors, save_figure,
... )
>>> import matplotlib.pyplot as plt
>>>
>>> fig, ax = plt.subplots()
>>> colors = get_palette_colors(3, plot_settings)
>>> ax.bar(x, y, color=colors[0])
>>> apply_axis_style(ax, plot_settings, title="My Plot", ylabel="Value (Å)")
>>> apply_legend(ax, plot_settings)
>>> save_figure(fig, output_dir / "my_plot.png", plot_settings)
- class polyzymd.analyses.shared.plotting.ArtifactPlotData(analysis_dir, condition_artifact, replicate_artifacts, aggregated_dir, run_dirs)[source]
Bases: object
Canonical artifacts loaded for plot-time data access.
- analysis_dir: Path
- aggregated_dir: Path
- __init__(analysis_dir, condition_artifact, replicate_artifacts, aggregated_dir, run_dirs)
- polyzymd.analyses.shared.plotting.load_canonical_plot_artifacts(analysis_dir, replicates, *, require_condition=False, require_replicates=True)[source]
Load plot inputs from canonical MDAnalysis artifacts only.
The loader reads aggregated/result.json and the configured run_N/result.json files through ArtifactStore. It never scans directories, opens non-canonical JSON files, or imports trajectory packages.
- Parameters:
analysis_dir (Path) – Condition-level analysis directory containing aggregated and run_N subdirectories.
replicates (sequence of int) – Configured replicate IDs to load. Extra run directories are ignored.
require_condition (bool, optional) – Raise when aggregated/result.json is absent, by default False.
require_replicates (bool, optional) – Raise when any configured run_N/result.json is absent, by default True.
- Returns:
Loaded canonical condition and replicate artifacts.
- Return type:
ArtifactPlotData
- polyzymd.analyses.shared.plotting.get_theme(plot_settings)[source]
Return the resolved PlotTheme from plot_settings.
- Parameters:
plot_settings (PlotSettings) – Global plot settings (carries a .theme property).
- Return type:
PlotTheme
- polyzymd.analyses.shared.plotting.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]
Apply standard axis chrome from the theme.
Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.
- polyzymd.analyses.shared.plotting.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]
Apply legend with themed defaults.
Uses theme.legend_loc and theme.legend_bbox unless overridden by the caller. Extra kwargs are forwarded to ax.legend().
- Parameters:
ax (matplotlib Axes) – Target axes.
plot_settings (PlotSettings) – Global plot settings.
loc (str, optional) – Override theme.legend_loc.
bbox_to_anchor (tuple of float or None, optional) – Override theme.legend_bbox. Pass None explicitly to suppress the bbox (e.g. for inside-axes placement).
fontsize (int, optional) – Override theme.legend_fontsize.
**kwargs (Any) – Forwarded to ax.legend().
- polyzymd.analyses.shared.plotting.get_palette_colors(n, plot_settings)[source]
Get n distinct colors from the configured palette.
Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.
- polyzymd.analyses.shared.plotting.order_condition_labels(labels, plot_settings)[source]
Return condition labels in semantic plot order when enabled.
Ordering only affects plot display order. It does not alter comparison statistics, rankings, or condition result files.
- polyzymd.analyses.shared.plotting.get_condition_colors(labels, plot_settings, *, control_label=None)[source]
Return colors for condition labels using semantic settings if enabled.
- Parameters:
- Returns:
Color values aligned to labels.
- Return type:
- polyzymd.analyses.shared.plotting.get_condition_color_map(labels, plot_settings, *, control_label=None)[source]
Return a label-to-color map using semantic condition color rules.
Resolution precedence is manual color, condition color, control color, family/value color, missing metadata fallback, then existing palette fallback. Invalid color or colormap values warn and continue to a safe fallback.
- Parameters:
- Returns:
Mapping from each label to its resolved matplotlib-compatible color.
- Return type:
dict of str to Any
- polyzymd.analyses.shared.plotting.get_output_path(output_dir, name, plot_settings)[source]
Generate output file path with correct format extension.
- Parameters:
output_dir (Path) – Output directory.
name (str) – Base filename (without extension).
plot_settings (PlotSettings) – Global plot settings (carries format).
- Returns:
Full output path with extension.
- Return type:
Path
- polyzymd.analyses.shared.plotting.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]
Save figure with DPI, watermark, and optional experimental stamp.
- Parameters:
fig (matplotlib Figure) – Figure to save.
output_path (Path) – Output file path.
plot_settings (PlotSettings) – Global plot settings (carries dpi and theme).
experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.
close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.
- Returns:
Path to saved figure.
- Return type:
Path
- polyzymd.analyses.shared.plotting.finite_numeric_values(values)[source]
Return finite numeric values as a one-dimensional float array.
- Parameters:
values (Any) – Candidate scalar or sequence of replicate values. None and non-numeric inputs are treated as missing data.
- Returns:
One-dimensional array containing only finite floats. The array is empty when no finite numeric values are available.
- Return type:
- polyzymd.analyses.shared.plotting.replicate_jitter_offsets(n_values, bar_width)[source]
Return deterministic offsets for replicate dot overlays.
Offsets are centred on the corresponding bar position so overlays are reproducible across runs and independent of random-number state.
- Parameters:
- Returns:
Jitter offsets centred around zero.
- Return type:
- polyzymd.analyses.shared.plotting.has_replicate_uncertainty(replicate_values=None, *, n_replicates=None)[source]
Return whether replicate-level uncertainty can be displayed.
- Parameters:
replicate_values (Any, optional) – Per-condition or per-bar replicate values. Finite numeric entries are counted after coercion.
n_replicates (int or None, optional) – Explicit replicate count when the raw replicate values are not available.
- Returns:
True when at least two finite independent replicate values are present.
- Return type:
- polyzymd.analyses.shared.plotting.suppress_singleton_errors(errors, replicate_values)[source]
Return errors with singleton replicate uncertainties suppressed.
- Parameters:
errors (sequence of float) – SEM or uncertainty values aligned to replicate_values.
replicate_values (sequence or None) – Per-bar replicate values used to decide whether an error bar is statistically displayable.
- Returns:
Sanitized error values. Returns None when no bar has replicate uncertainty, allowing callers to omit error bars entirely.
- Return type:
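The interplay between finite-value filtering, the two-replicate rule, and error suppression can be sketched as follows. The helper names are hypothetical, and replacing a suppressed error with 0.0 is an assumption; this page does not say which placeholder value the real helper uses.

```python
import math


def finite_values(values):
    """Return the finite numeric entries of a scalar or sequence as floats."""
    if values is None:
        return []
    if isinstance(values, (int, float)) and not isinstance(values, bool):
        values = [values]
    try:
        iterator = iter(values)
    except TypeError:
        return []
    return [
        float(v) for v in iterator
        if isinstance(v, (int, float))
        and not isinstance(v, bool)
        and math.isfinite(v)
    ]


def suppress_singletons(errors, replicate_values):
    """Zero out errors for bars backed by fewer than two finite replicates."""
    if replicate_values is None:
        return None
    displayable = [len(finite_values(r)) >= 2 for r in replicate_values]
    if not any(displayable):
        return None  # caller can omit error bars entirely
    # 0.0 as the suppressed value is an assumption of this sketch.
    return [e if ok else 0.0 for e, ok in zip(errors, displayable)]


cleaned = suppress_singletons([0.1, 0.2], [[1.0, 1.2], [0.9]])
```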
- polyzymd.analyses.shared.plotting.scatter_replicate_values(ax, bar_positions, replicate_values, plot_settings, *, orientation='vertical', bar_width=0.8, dot_color=None, dot_size=None, dot_alpha=None, zorder=5)[source]
Overlay jittered per-replicate values on bars.
For vertical bars, bar positions are x-coordinates, jitter is applied in x, and replicate values are plotted on y. For horizontal bars, bar positions are y-coordinates, jitter is applied in y, and replicate values are plotted on x.
- Parameters:
ax (matplotlib.axes.Axes) – Axes containing the bar chart.
bar_positions (sequence of float or numpy.ndarray) – Bar centre positions aligned to replicate_values.
replicate_values (sequence of Any) – Per-bar replicate values. Each item may be a scalar or sequence; only finite numeric values are plotted.
plot_settings (PlotSettings) – Global plot settings whose theme provides default dot styling.
orientation ({"vertical", "horizontal"}, optional) – Bar orientation, by default "vertical".
bar_width (float, optional) – Width or height of the bars, by default 0.8.
dot_color (Any, optional) – Override for theme dot colour.
dot_size (float, optional) – Override for theme dot size. Dots are skipped when non-positive.
dot_alpha (float, optional) – Override for theme dot alpha. Dots are skipped when non-positive.
zorder (float, optional) – Matplotlib z-order for dot overlays, by default 5.
- Returns:
Number of scatter calls emitted.
- Return type:
- Raises:
ValueError – If orientation is not "vertical" or "horizontal", or if bar_positions and replicate_values are not the same length.
- polyzymd.analyses.shared.plotting.scatter_stacked_segment_replicates(ax, x_position, bottom_value, replicate_values, plot_settings, *, replicate_base_values=None, positive_base_values=None, negative_base_values=None, bar_width=0.8, dot_color=None, dot_size=None, dot_alpha=None, placement='center', zorder=5)[source]
Overlay replicate dots on stacked segments.
The per-component replicate value is a segment height, not an absolute stacked coordinate. Plotting at base + replicate / 2 places each dot at the center of the component-specific replicate segment. Callers should pass replicate-specific bases when earlier stacked components vary by replicate. Signed stacks may pass separate positive and negative bases so each dot is placed on the same sign stack as its own replicate value.
- Parameters:
ax (matplotlib.axes.Axes) – Axes containing the stacked bar chart.
x_position (float) – Center x-coordinate of the condition bar.
bottom_value (float) – Aggregate stack baseline for the current segment.
replicate_values (sequence of Any) – Component-specific per-replicate segment heights.
plot_settings (PlotSettings) – Plot configuration used for dot styling.
replicate_base_values (sequence of Any, optional) – Per-replicate cumulative stack bases for unsigned stacks. When omitted, bottom_value is used for every replicate for backward compatibility.
positive_base_values (sequence of Any, optional) – Per-replicate cumulative positive stack bases for signed stacks.
negative_base_values (sequence of Any, optional) – Per-replicate cumulative negative stack bases for signed stacks.
bar_width (float, optional) – Width used for deterministic jitter, by default 0.8.
dot_color (Any, optional) – Override for theme dot colour.
dot_size (float, optional) – Override for theme dot size.
dot_alpha (float, optional) – Override for theme dot alpha.
placement ({"center", "end"}, optional) – Dot placement within each replicate segment. "center" uses base + replicate / 2 and "end" uses base + replicate.
zorder (float, optional) – Matplotlib z-order for dot overlays, by default 5.
- Returns:
Number of scatter calls emitted.
- Return type:
- Raises:
ValueError – If replicate base arrays do not align with replicate_values.
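The placement arithmetic (base + replicate / 2 for "center", base + replicate for "end") can be illustrated without matplotlib. Both helper names are hypothetical, and the rule that non-negative heights use the positive base is an assumption about how signed stacks are split.

```python
def stacked_dot_positions(heights, base_values, placement="center"):
    """Return the stacked-axis coordinate of each replicate dot."""
    if len(heights) != len(base_values):
        raise ValueError("base values must align with replicate heights")
    if placement not in ("center", "end"):
        raise ValueError("placement must be 'center' or 'end'")
    factor = 0.5 if placement == "center" else 1.0
    return [base + factor * h for base, h in zip(base_values, heights)]


def signed_dot_positions(heights, positive_bases, negative_bases,
                         placement="center"):
    """Pick the base from the stack that matches each height's sign."""
    bases = [
        pos if h >= 0 else neg
        for h, pos, neg in zip(heights, positive_bases, negative_bases)
    ]
    return stacked_dot_positions(heights, bases, placement)


centers = stacked_dot_positions([2.0, 4.0], [1.0, 1.5])
ends = stacked_dot_positions([2.0, 4.0], [1.0, 1.5], placement="end")
signed = signed_dot_positions([2.0, -1.0], [1.0, 1.0], [-0.5, -0.5])
```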
- polyzymd.analyses.shared.plotting.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]
Render grouped bars with optional error bars and reference line.
Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from plot_settings.theme. Callers can override any of them via **style_overrides using the theme field names as keys.
- Parameters:
ax (matplotlib Axes) – Target axes.
x (np.ndarray) – 1-D array of group centre positions (e.g. np.arange(n_groups)).
series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.
colors (sequence) – One colour per condition (same length as series).
plot_settings (PlotSettings) – Global plot settings.
bar_width (float | None, optional) – Width of each individual bar. When None (default) the width is computed as 0.8 / len(series).
show_error (bool, optional) – If False, error bars are suppressed, by default True.
reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to None to skip, by default 0.0.
reference_label (str, optional) – Legend label for the reference line, by default "Neutral (0)".
replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as replicate_values[condition_idx][group_idx] -> sequence of floats (one per replicate). When None (default), no dots are drawn.
**style_overrides – Override any theme field for this call only. Accepted keys: bar_alpha, bar_capsize, bar_edgecolor, bar_linewidth, dot_size, dot_alpha, dot_color, reference_line_color, reference_line_style, reference_line_width.
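To show how the default width of 0.8 / len(series) lays bars out around each group centre, here is a small geometry-only sketch. Centring the bars symmetrically on x is an assumption about the layout, and `grouped_bar_positions` is a hypothetical helper, not part of the module.

```python
def grouped_bar_positions(x, n_series, bar_width=None):
    """Return one list of bar centres per series, centred on each x."""
    if bar_width is None:
        bar_width = 0.8 / n_series  # documented default width
    positions = []
    for i in range(n_series):
        # Offsets are symmetric about zero, so the group stays centred.
        offset = (i - (n_series - 1) / 2) * bar_width
        positions.append([xi + offset for xi in x])
    return positions


bar_centres = grouped_bar_positions([0.0, 1.0], n_series=2)
# Series 0 sits 0.2 left of each group centre, series 1 sits 0.2 right.
```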
- polyzymd.analyses.shared.plotting.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]
Annotate heatmap cells with formatted values.
Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).
- Parameters:
ax (matplotlib Axes) – The axes containing the heatmap image.
matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.
plot_settings (PlotSettings) – Global plot settings.
fmt (str, optional) – Format spec for the value, by default ".2f".
fontsize (int | None, optional) – Annotation font size. When None (default), uses plot_settings.theme.annotation_fontsize.
threshold (float, optional) – Absolute-value threshold above which text turns white.
sem_matrix (np.ndarray | None, optional) – If provided, a second line ±{sem} is appended when the SEM value is finite.
show_sign (bool, optional) – Prefix positive values with "+", by default True.
linespacing (float | None, optional) – Passed to ax.text(linespacing=...) when SEM is shown.
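The label text and colour rules described above can be sketched independently of matplotlib. Both helpers are hypothetical; using Python's "+" format flag for `show_sign` (which also renders zero as "+0.00") is an assumption about the exact output.

```python
import math


def cell_label(value, sem=None, fmt=".2f", show_sign=True):
    """Build the annotation text for one heatmap cell, or None for NaN."""
    if math.isnan(value):
        return None  # NaN cells are skipped
    spec = ("+" if show_sign else "") + fmt
    label = format(value, spec)
    if sem is not None and math.isfinite(sem):
        label += "\n±" + format(sem, fmt)  # second line with the SEM
    return label


def text_colour(value, threshold=0.3):
    """White text on strongly coloured cells, black otherwise."""
    return "white" if abs(value) > threshold else "black"


label = cell_label(0.25, sem=0.034)
```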
- polyzymd.analyses.shared.plotting.symmetric_clim(values, pad=0.1)[source]
Compute symmetric colour limits centred on zero.
Selections, selectors, and grouping
Selection helpers extend MDAnalysis selections. Selector and grouping packages provide reusable abstractions for selecting molecules or classifying residues in plugin settings and analysis code.
Special selection syntax for distance calculations.
This module provides parsing of extended selection syntax for defining atom positions in distance calculations. It supports:
Standard MDAnalysis selections: “resid 133 and name OD1”
Midpoint of multiple atoms: “midpoint(resid 133 and name OD1 OD2)”
Center of mass of a group: “com(resid 50-75)”
Examples
>>> from polyzymd.analyses.shared.selections import (
...     get_position, parse_selection_string, select_atoms,
... )
>>>
>>> # Standard selection - single atom
>>> ag = select_atoms(universe, "resid 77 and name OG")
>>> pos = get_position(ag)  # Returns position of single atom
>>>
>>> # Midpoint of Asp carboxyl oxygens
>>> sel = "midpoint(resid 133 and name OD1 OD2)"
>>> ag = select_atoms(universe, sel)
>>> pos = get_position(ag, parse_selection_string(sel).mode)  # Returns midpoint of OD1 and OD2
>>>
>>> # Center of mass of lid domain
>>> sel = "com(resid 50-75)"
>>> ag = select_atoms(universe, sel)
>>> pos = get_position(ag, parse_selection_string(sel).mode)  # Returns COM of residues 50-75
Notes
The midpoint() syntax is particularly useful for catalytic residues where the functional position is between two atoms (e.g., Asp carboxyl oxygens, Glu carboxyl oxygens).
The com() syntax is useful for domain motions where you want to track the center of mass of a group of residues (e.g., lid opening in lipases).
- polyzymd.analyses.shared.selections.translate_selection(selection)[source]
Translate PolyzyMD selection keywords to MDAnalysis equivalents.
This allows users to use the same selection syntax in analysis as they use in config.yaml for restraints and other atom selections.
Translations
pdbindex N → id N (PDB ATOM serial number)
The pdbindex keyword refers to the 1-indexed atom serial number from the PDB ATOM record (columns 7-11), which is what PyMOL displays as “id”. In MDAnalysis, this is accessed via the id selection keyword.
Note: MDAnalysis also has bynum, which is 1-indexed positional (i.e., bynum 1 = first atom, bynum 2 = second atom), but this does NOT correspond to PDB serial numbers when there are gaps in numbering. We use id because it matches actual PDB serial numbers.
- Parameters:
selection (str) – Selection string with possible PolyzyMD-specific keywords
- Returns:
Selection string with MDAnalysis-compatible keywords
- Return type:
str
Examples
>>> translate_selection("pdbindex 100 and name CA")
"id 100 and name CA"
>>> translate_selection("midpoint(pdbindex 100 and name OD1 OD2)")
"midpoint(id 100 and name OD1 OD2)"
- class polyzymd.analyses.shared.selections.SelectionMode(value)[source]
-
Mode for position calculation from atom selection.
- SINGLE = 'single'
- CENTROID = 'centroid'
- MIDPOINT = 'midpoint'
- COM = 'com'
- class polyzymd.analyses.shared.selections.ParsedSelection(selection, mode, original)[source]
Bases: object
Result of parsing a selection string.
- selection
The MDAnalysis selection string (without wrapper function)
- Type:
str
- mode
How to compute the position
- Type:
SelectionMode
- original
The original input string
- Type:
str
- selection: str
- mode: SelectionMode
- original: str
- __init__(selection, mode, original)
- polyzymd.analyses.shared.selections.parse_selection_string(selection)[source]
Parse a selection string to extract mode and MDAnalysis selection.
Also translates PolyzyMD-specific keywords (like pdbindex) to their MDAnalysis equivalents (like id).
- Parameters:
selection (str) – Selection string, possibly with special syntax:
“resid 77 and name OG” - standard MDAnalysis
“midpoint(resid 133 and name OD1 OD2)” - midpoint mode
“com(resid 50-75)” - center of mass mode
“pdbindex 100 and name CA” - PolyzyMD pdbindex (translated to id)
- Returns:
Parsed selection with mode and clean selection string
- Return type:
ParsedSelection
Examples
>>> parsed = parse_selection_string("midpoint(resid 133 and name OD1 OD2)")
>>> parsed.mode
<SelectionMode.MIDPOINT: 'midpoint'>
>>> parsed.selection
"resid 133 and name OD1 OD2"
>>> parsed = parse_selection_string("pdbindex 100 and name CA")
>>> parsed.selection
"id 100 and name CA"
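A wrapper-stripping parser of this kind can be approximated with two regular expressions. The sketch below is an assumption about the approach, not the module's code: `parse_sketch`, `Mode`, and the regex are illustrative, and unwrapped selections are assumed to map to the single-atom mode.

```python
import re
from enum import Enum


class Mode(Enum):
    SINGLE = "single"
    MIDPOINT = "midpoint"
    COM = "com"


# Matches "midpoint(...)" or "com(...)" wrapping the whole string.
_WRAPPER = re.compile(r"^\s*(midpoint|com)\s*\((.*)\)\s*$",
                      re.IGNORECASE | re.DOTALL)


def parse_sketch(selection):
    """Return (mdanalysis_selection, mode) for a possibly wrapped string."""
    match = _WRAPPER.match(selection)
    if match:
        mode = Mode(match.group(1).lower())
        inner = match.group(2).strip()
    else:
        mode, inner = Mode.SINGLE, selection.strip()
    # Translate the PolyzyMD keyword to the MDAnalysis one.
    inner = re.sub(r"\bpdbindex\b", "id", inner)
    return inner, mode


parsed_mid = parse_sketch("midpoint(resid 133 and name OD1 OD2)")
parsed_pdb = parse_sketch("pdbindex 100 and name CA")
```

The word boundaries in the keyword pattern keep the substitution from touching longer tokens that merely contain "pdbindex".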
- polyzymd.analyses.shared.selections.select_atoms(universe, selection)[source]
Select atoms from universe using potentially special syntax.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)
- Returns:
Selected atoms
- Return type:
AtomGroup
- Raises:
ValueError – If selection matches no atoms, with diagnostic info
- polyzymd.analyses.shared.selections.get_position(atoms, mode=SelectionMode.SINGLE)[source]
Get position from atom group based on mode.
- Parameters:
atoms (AtomGroup) – MDAnalysis AtomGroup
mode (SelectionMode) – How to compute position:
SINGLE: Position of single atom (error if multiple)
CENTROID/MIDPOINT: Center of geometry
COM: Center of mass
- Returns:
3D position vector [x, y, z]
- Return type:
NDArray[np.float64]
- Raises:
ValueError – If mode is SINGLE but multiple atoms selected
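The difference between the modes comes down to the weights used in the average. The sketch below works on plain coordinate tuples instead of an MDAnalysis AtomGroup; `position_sketch` and its string mode names are illustrative assumptions.

```python
def position_sketch(coords, masses=None, mode="single"):
    """Return [x, y, z] for the given mode from a list of 3D coordinates."""
    if mode == "single":
        if len(coords) != 1:
            raise ValueError("single mode requires exactly one atom")
        return [float(c) for c in coords[0]]
    if mode in ("centroid", "midpoint"):
        weights = [1.0] * len(coords)  # center of geometry: equal weights
    elif mode == "com":
        if masses is None or len(masses) != len(coords):
            raise ValueError("com mode requires one mass per atom")
        weights = list(masses)  # center of mass: mass weights
    else:
        raise ValueError(f"unknown mode: {mode}")
    total = sum(weights)
    return [
        sum(w * c[axis] for w, c in zip(weights, coords)) / total
        for axis in range(3)
    ]


atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
midpoint = position_sketch(atoms, mode="midpoint")
com = position_sketch(atoms, masses=[1.0, 3.0], mode="com")
```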
- polyzymd.analyses.shared.selections.get_position_from_selection(universe, selection)[source]
Get position from selection string in one step.
This is a convenience function that combines parsing, selection, and position calculation.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)
- Returns:
3D position vector [x, y, z]
- Return type:
NDArray[np.float64]
Examples
>>> # Single atom
>>> pos = get_position_from_selection(u, "resid 77 and name OG")
>>>
>>> # Midpoint of Asp carboxyl
>>> pos = get_position_from_selection(u, "midpoint(resid 133 and name OD1 OD2)")
- polyzymd.analyses.shared.selections.validate_selection(universe, selection)[source]
Validate a selection string and return diagnostic info.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string to validate
- Returns:
Diagnostic information:
  - valid: bool
  - n_atoms: int
  - mode: str
  - atoms: list of atom info dicts
  - error: str (if invalid)
  - diagnostics: str (detailed diagnostics if invalid)
- Return type:
- polyzymd.analyses.shared.selections.format_selection_for_label(selection)[source]
Convert selection string to a short label for filenames/display.
- Parameters:
selection (str) – Selection string (standard or special syntax)
- Returns:
Short label (e.g., “res133_mid” or “res77_OG”)
- Return type:
Examples
>>> format_selection_for_label("midpoint(resid 133 and name OD1 OD2)")
"res133_mid"
>>> format_selection_for_label("resid 77 and name OG")
"res77_OG"
Molecular selector abstractions for analysis plugins.
This module provides a unified interface for selecting atoms, residues, or molecular groups from an MDAnalysis Universe. Selectors enable:
Consistent selection logic across different analysis types
Configurable protein/polymer/solvent selection
Support for arbitrary user-defined selections
Proximity-based selections (e.g., “residues near active site”)
The Strategy pattern is used so users can define custom selectors by subclassing MolecularSelector.
Examples
>>> from polyzymd.analyses.shared.selectors import (
...     PolymerResiduesByType,
...     ProteinResidues,
...     ProteinResiduesNearReference,
... )
>>>
>>> # Select all protein residues
>>> protein_selector = ProteinResidues()
>>> protein_residues = protein_selector.select(universe)
>>>
>>> # Select polymer chains by type
>>> polymer_selector = PolymerResiduesByType(residue_names=["SBM", "EGP"])
>>> polymer_residues = polymer_selector.select(universe)
>>>
>>> # Select protein residues near catalytic triad
>>> triad_selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0
... )
>>> nearby_residues = triad_selector.select(universe)
- class polyzymd.analyses.shared.selectors.MolecularSelector[source]
Bases:
ABC
Abstract base class for molecular selections.
Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.
This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.
Examples
>>> class ActiveSiteSelector(MolecularSelector):
...     def __init__(self, active_site_resids: list[int]):
...         self.resids = active_site_resids
...
...     def select(self, universe: Universe) -> SelectionResult:
...         resid_str = " ".join(str(r) for r in self.resids)
...         atoms = universe.select_atoms(f"resid {resid_str}")
...         return SelectionResult(
...             atoms=atoms,
...             residues=atoms.residues,
...             label="active_site",
...             metadata={"resids": self.resids}
...         )
...
...     @property
...     def label(self) -> str:
...         return "active_site"
- abstractmethod select(universe)[source]
Select atoms/residues from a Universe.
- Parameters:
universe (Universe) – MDAnalysis Universe to select from
- Returns:
Container with selected atoms, residues, and metadata
- Return type:
SelectionResult
- abstract property label: str
Short label identifying this selector (for filenames/logging).
- validate(universe)[source]
Validate the selector against a Universe.
Returns diagnostic information about whether the selection would succeed and what it would select.
- Parameters:
universe (Universe) – MDAnalysis Universe to validate against
- Returns:
Validation results with keys:
  - valid: bool
  - n_atoms: int
  - n_residues: int
  - error: str (if invalid)
  - warnings: list[str]
- Return type:
- class polyzymd.analyses.shared.selectors.MDAnalysisSelector(selection, label=None)[source]
Bases:
MolecularSelector
Simple selector using an MDAnalysis selection string.
This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.
- Parameters:
Examples
>>> # Select polymer residues by name
>>> selector = MDAnalysisSelector("resname SBM EGM")
>>> result = selector.select(universe)
>>>
>>> # Select protein backbone near ligand
>>> selector = MDAnalysisSelector(
...     "protein and backbone and around 5.0 resname LIG",
...     label="protein_near_ligand"
... )
- __init__(selection, label=None)[source]
- select(universe)[source]
Select atoms using the MDAnalysis selection string.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.SelectionResult(atoms, residues, label, metadata=<factory>)[source]
Bases:
object
Container for selection results with metadata.
- atoms
The selected atoms
- Type:
AtomGroup
- residues
The residues containing the selected atoms
- Type:
ResidueGroup
- label
Human-readable label for this selection
- Type:
str
- metadata
Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)
- Type:
dict
- atoms: AtomGroup
- residues: ResidueGroup
- label: str
- metadata: dict
- property n_atoms: int
Number of selected atoms.
- property n_residues: int
Number of selected residues.
- property residue_ids: NDArray[np.int64]
1-indexed residue IDs (PyMOL convention).
- __init__(atoms, residues, label, metadata=<factory>)
- class polyzymd.analyses.shared.selectors.CompositeSelector(selectors, mode='union', label=None)[source]
Bases:
MolecularSelector
Combines multiple selectors with AND/OR logic.
Useful for complex selections like “protein residues that are both aromatic AND within 5A of the active site”.
- Parameters:
selectors (list[MolecularSelector]) – List of selectors to combine
mode ({"union", "intersection"}) – How to combine selections:
  - “union”: Include atoms selected by ANY selector (OR)
  - “intersection”: Include only atoms selected by ALL selectors (AND)
label (str, optional) – Custom label. If not provided, generates from component labels.
- __init__(selectors, mode='union', label=None)[source]
- select(universe)[source]
Select atoms using combined selectors.
- property label: str
Short label identifying this selector (for filenames/logging).
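The union and intersection modes of CompositeSelector map directly onto set operations. The sketch below works on plain lists of atom indices rather than AtomGroups; combine_atom_indices is a hypothetical helper used only to illustrate the logic.

```python
def combine_atom_indices(selections, mode="union"):
    """Combine per-selector atom index lists with OR ("union") or AND ("intersection")."""
    if mode not in ("union", "intersection"):
        raise ValueError(f"unknown mode: {mode!r}")
    sets = [set(s) for s in selections]
    if not sets:
        return []
    if mode == "union":
        combined = set.union(*sets)
    else:
        combined = set.intersection(*sets)
    # Sort so the result is deterministic regardless of input order.
    return sorted(combined)


# Made-up indices standing in for two selector results.
aromatic = [3, 7, 12, 20]
near_site = [7, 8, 12, 15]
both = combine_atom_indices([aromatic, near_site], mode="intersection")
either = combine_atom_indices([aromatic, near_site], mode="union")
```

With these inputs, both holds the atoms that are aromatic and near the site, while either holds the atoms matched by at least one selector.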
- class polyzymd.analyses.shared.selectors.ProteinResidues(selection_modifier=None)[source]
Bases:
MolecularSelector
Select all protein residues.
Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.
- Parameters:
selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.
- __init__(selection_modifier=None)[source]
- select(universe)[source]
Select all protein atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]
Bases:
MolecularSelector
Select protein residues by amino acid group classification.
Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.
- Parameters:
Examples
>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Select aromatic residues
>>> grouping = ProteinAAClassification()
>>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"])
>>>
>>> # Select all charged residues
>>> selector = ProteinResiduesByGroup(
...     grouping,
...     groups=["charged_positive", "charged_negative"]
... )
- __init__(grouping, groups, exclude=False)[source]
- select(universe)[source]
Select protein residues matching the specified groups.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]
Bases:
MolecularSelector
Select protein residues within a cutoff distance of reference atoms.
Useful for selecting residues near active sites, binding pockets, or other regions of interest.
- Parameters:
reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)
cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.
include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.
frame (int, optional) – Frame to use for distance calculation. Default is current frame (0).
Examples
>>> # Select residues within 5A of catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0,
... )
>>>
>>> # Select residues near substrate binding site (not including the site itself)
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resname LIG",
...     cutoff=4.0,
...     include_reference=False,
... )
- __init__(reference_selection, cutoff, include_reference=True, frame=0)[source]
- select(universe)[source]
Select protein residues near the reference atoms.
- property label: str
Short label identifying this selector (for filenames/logging).
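The cutoff rule used by ProteinResiduesNearReference (a residue is selected if any of its atoms lies within the cutoff of any reference atom) can be sketched without MDAnalysis. The function name, the (resid, xyz) input layout, and the inclusive comparison are assumptions made for illustration.

```python
import math


def residues_near_reference(atoms, reference_resids, cutoff, include_reference=True):
    """Return sorted residue IDs with any atom within cutoff of a reference atom.

    atoms: list of (resid, (x, y, z)) pairs, one entry per atom.
    """
    reference = set(reference_resids)
    ref_coords = [xyz for resid, xyz in atoms if resid in reference]
    selected = set()
    for resid, xyz in atoms:
        if resid in reference:
            continue
        # Assumption: the cutoff is inclusive (distance <= cutoff).
        if any(math.dist(xyz, ref) <= cutoff for ref in ref_coords):
            selected.add(resid)
    if include_reference:
        # Add back only reference residues that actually exist in the input.
        selected |= reference & {resid for resid, _ in atoms}
    return sorted(selected)


# Made-up coordinates in Angstroms along the x axis.
toy_atoms = [
    (77, (0.0, 0.0, 0.0)),
    (78, (3.0, 0.0, 0.0)),
    (78, (9.0, 0.0, 0.0)),
    (90, (6.0, 0.0, 0.0)),
    (133, (20.0, 0.0, 0.0)),
]
```

Residue 78 qualifies through its first atom alone even though its second atom is far away, which is the per-residue behavior the cutoff parameter describes.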
- class polyzymd.analyses.shared.selectors.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
Bases:
MolecularSelector
Select polymer chains from the system.
For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.
- Parameters:
chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.
residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.
chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.
segids (list[str], optional) – If provided, select only polymers with these segment IDs.
Notes
The PolyzyMD chain convention is:
  - Chain A: Protein/Enzyme
  - Chain B: Substrate/Ligand
  - Chain C: Polymers
  - Chain D+: Solvent (water, ions, co-solvents)
For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.
Examples
>>> # PolyzyMD system (recommended)
>>> selector = PolymerChains()  # Uses chain C
>>>
>>> # Non-PolyzyMD system
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"])
>>>
>>> # PolyzyMD system with specific polymer types
>>> selector = PolymerChains(residue_names=["SBM"])  # SBM in chain C only
- __init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
- select(universe)[source]
Select polymer atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
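The chain convention from the Notes above can be written as a small lookup. This is only a sketch of the convention as documented; CHAIN_ROLES and role_for_chain are hypothetical names, and restricting input to single upper-case letters is an assumption.

```python
# Fixed assignments from the documented chain convention.
CHAIN_ROLES = {"A": "protein", "B": "substrate", "C": "polymer"}


def role_for_chain(chain_id: str) -> str:
    """Return the component role for a single-letter, upper-case chain ID."""
    if len(chain_id) != 1 or not ("A" <= chain_id <= "Z"):
        raise ValueError(f"expected one upper-case letter, got {chain_id!r}")
    # Chains D and later hold solvent (water, ions, co-solvents).
    return CHAIN_ROLES.get(chain_id, "solvent")
```

A lookup like this only applies to PolyzyMD-built systems; for other systems the documentation above recommends chain_id=None with explicit residue_names.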
- class polyzymd.analyses.shared.selectors.PolymerResiduesByType(residue_names, exclude=False)[source]
Bases:
MolecularSelector
Select polymer residues by monomer type (residue name).
This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.
- Parameters:
Examples
>>> # Select SBMA monomers only
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"])
>>>
>>> # Select non-SBMA monomers
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)
- __init__(residue_names, exclude=False)[source]
- select(universe)[source]
Select polymer residues by type.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]
Bases:
MolecularSelector
Select individual segments (residues) within polymer chains.
This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.
- Parameters:
residue_names (list[str], optional) – Residue names that identify polymer residues.
chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.
segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.
Notes
A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.
- __init__(residue_names=None, chain_index=None, segment_indices=None)[source]
- select(universe)[source]
Select polymer segments.
- property label: str
Short label identifying this selector (for filenames/logging).
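The 0-indexed chain_index and segment_indices parameters of PolymerSegments can be illustrated on nested lists, where each inner list is one polymer chain in residue order. The helper name and the decision to skip indices beyond the end of a shorter chain are assumptions, not documented behavior.

```python
def pick_segments(chains, chain_index=None, segment_indices=None):
    """Pick monomer units by 0-indexed chain and within-chain position.

    chains: list of chains, each a list of residue labels in chain order.
    """
    # chain_index=None means "all chains".
    picked_chains = chains if chain_index is None else [chains[chain_index]]
    result = []
    for chain in picked_chains:
        if segment_indices is None:
            result.extend(chain)
        else:
            # Assumption: positions past the end of a shorter chain are skipped.
            result.extend(chain[i] for i in segment_indices if 0 <= i < len(chain))
    return result


# Two made-up chains of different length.
toy_chains = [["SBM1", "EGP2", "SBM3"], ["EGP4", "SBM5"]]
```

Because positions are counted within each chain, segment index 0 refers to the first monomer of every selected chain rather than to one global residue.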
- class polyzymd.analyses.shared.selectors.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
Bases:
MolecularSelector
Select solvent (water) molecules.
- Parameters:
residue_names (list[str], optional) – Residue names for water. Default uses common water names.
exclude_near (str, optional) – Exclude waters within a cutoff of this selection. E.g., “protein” to exclude waters in first hydration shell.
exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.
Examples
>>> # Select all water
>>> selector = SolventMolecules()
>>>
>>> # Select bulk water (exclude first shell around protein)
>>> selector = SolventMolecules(
...     exclude_near="protein",
...     exclude_cutoff=5.0
... )
- __init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
- select(universe)[source]
Select water molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.CosolventMolecules(residue_names=None)[source]
Bases:
MolecularSelector
Select cosolvent molecules (e.g., DMSO, acetonitrile).
- Parameters:
residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.
Examples
>>> # Select DMSO molecules
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
- __init__(residue_names=None)[source]
- select(universe)[source]
Select cosolvent molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.SubstrateMolecule(residue_name, n_molecules=None)[source]
Bases:
MolecularSelector
Select substrate or ligand molecules.
- Parameters:
Examples
>>> # Select resorufin butyrate substrate
>>> selector = SubstrateMolecule(residue_name="RBU")
>>>
>>> # Select single substrate, validate count
>>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)
- __init__(residue_name, n_molecules=None)[source]
- select(universe)[source]
Select substrate molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.IonSelector(residue_names=None, ion_type='all')[source]
Bases:
MolecularSelector
Select ion molecules (Na+, Cl-, etc.).
- Parameters:
Examples
>>> # Select all ions
>>> selector = IonSelector()
>>>
>>> # Select only sodium ions
>>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])
- DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']
- DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']
- __init__(residue_names=None, ion_type='all')[source]
- select(universe)[source]
Select ion molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
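One plausible way the default name lists and ion_type could interact is sketched below. Only ion_type='all' and the two default lists appear in the signature above; the "cation" and "anion" values, the precedence of explicit residue_names, and the helper name resolve_ion_names are assumptions for illustration.

```python
# Default name lists copied from the IonSelector class attributes above.
DEFAULT_CATIONS = ["NA", "Na+", "SOD", "K", "K+", "POT", "MG", "Mg2+", "CA", "Ca2+"]
DEFAULT_ANIONS = ["CL", "Cl-", "CLA", "BR", "Br-"]


def resolve_ion_names(residue_names=None, ion_type="all"):
    """Return the residue names an ion selection would match."""
    if residue_names is not None:
        # Assumption: explicit names take precedence over ion_type.
        return list(residue_names)
    if ion_type == "all":
        return DEFAULT_CATIONS + DEFAULT_ANIONS
    if ion_type == "cation":  # hypothetical value
        return list(DEFAULT_CATIONS)
    if ion_type == "anion":  # hypothetical value
        return list(DEFAULT_ANIONS)
    raise ValueError(f"unknown ion_type: {ion_type!r}")
```

Listing several spellings per ion (for example NA, Na+, and SOD for sodium) lets one selector cover the naming schemes used by different force fields and topology formats.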
Base class for molecular selectors.
This module defines the abstract base class for all molecular selectors, providing a consistent interface for selecting atoms, residues, or groups from an MDAnalysis Universe.
The Strategy pattern allows users to define custom selection logic by subclassing MolecularSelector and implementing the select() method.
- class polyzymd.analyses.shared.selectors.base.SelectionResult(atoms, residues, label, metadata=<factory>)[source]
Bases:
object
Container for selection results with metadata.
- atoms
The selected atoms
- Type:
AtomGroup
- residues
The residues containing the selected atoms
- Type:
ResidueGroup
- label
Human-readable label for this selection
- Type:
str
- metadata
Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)
- Type:
dict
- atoms: AtomGroup
- residues: ResidueGroup
- label: str
- metadata: dict
- property n_atoms: int
Number of selected atoms.
- property n_residues: int
Number of selected residues.
- property residue_ids: NDArray[np.int64]
1-indexed residue IDs (PyMOL convention).
- __init__(atoms, residues, label, metadata=<factory>)
- class polyzymd.analyses.shared.selectors.base.MolecularSelector[source]
Bases:
ABC
Abstract base class for molecular selections.
Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.
This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.
Examples
>>> class ActiveSiteSelector(MolecularSelector):
...     def __init__(self, active_site_resids: list[int]):
...         self.resids = active_site_resids
...
...     def select(self, universe: Universe) -> SelectionResult:
...         resid_str = " ".join(str(r) for r in self.resids)
...         atoms = universe.select_atoms(f"resid {resid_str}")
...         return SelectionResult(
...             atoms=atoms,
...             residues=atoms.residues,
...             label="active_site",
...             metadata={"resids": self.resids}
...         )
...
...     @property
...     def label(self) -> str:
...         return "active_site"
- abstractmethod select(universe)[source]
Select atoms/residues from a Universe.
- Parameters:
universe (Universe) – MDAnalysis Universe to select from
- Returns:
Container with selected atoms, residues, and metadata
- Return type:
SelectionResult
- abstract property label: str
Short label identifying this selector (for filenames/logging).
- validate(universe)[source]
Validate the selector against a Universe.
Returns diagnostic information about whether the selection would succeed and what it would select.
- Parameters:
universe (Universe) – MDAnalysis Universe to validate against
- Returns:
Validation results with keys:
  - valid: bool
  - n_atoms: int
  - n_residues: int
  - error: str (if invalid)
  - warnings: list[str]
- Return type:
- class polyzymd.analyses.shared.selectors.base.MDAnalysisSelector(selection, label=None)[source]
Bases:
MolecularSelector
Simple selector using an MDAnalysis selection string.
This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.
- Parameters:
Examples
>>> # Select polymer residues by name
>>> selector = MDAnalysisSelector("resname SBM EGM")
>>> result = selector.select(universe)
>>>
>>> # Select protein backbone near ligand
>>> selector = MDAnalysisSelector(
...     "protein and backbone and around 5.0 resname LIG",
...     label="protein_near_ligand"
... )
- __init__(selection, label=None)[source]
- select(universe)[source]
Select atoms using the MDAnalysis selection string.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.base.CompositeSelector(selectors, mode='union', label=None)[source]
Bases:
MolecularSelector
Combines multiple selectors with AND/OR logic.
Useful for complex selections like “protein residues that are both aromatic AND within 5A of the active site”.
- Parameters:
selectors (list[MolecularSelector]) – List of selectors to combine
mode ({"union", "intersection"}) – How to combine selections:
  - “union”: Include atoms selected by ANY selector (OR)
  - “intersection”: Include only atoms selected by ALL selectors (AND)
label (str, optional) – Custom label. If not provided, generates from component labels.
- __init__(selectors, mode='union', label=None)[source]
- select(universe)[source]
Select atoms using combined selectors.
- property label: str
Short label identifying this selector (for filenames/logging).
Protein residue selectors.
This module provides selectors for protein residues:
ProteinResidues: Select all protein residues
ProteinResiduesByGroup: Select protein residues by amino acid classification
ProteinResiduesNearReference: Select residues within cutoff of reference atoms
Examples
>>> # Select all protein residues
>>> selector = ProteinResidues()
>>> result = selector.select(universe)
>>>
>>> # Select aromatic residues only
>>> selector = ProteinResiduesByGroup(
...     grouping=ProteinAAClassification(),
...     groups=["aromatic"]
... )
>>>
>>> # Select residues near catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0
... )
- class polyzymd.analyses.shared.selectors.protein.ProteinResidues(selection_modifier=None)[source]
Bases:
MolecularSelector
Select all protein residues.
Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.
- Parameters:
selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.
- __init__(selection_modifier=None)[source]
- select(universe)[source]
Select all protein atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.protein.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]
Bases:
MolecularSelector
Select protein residues by amino acid group classification.
Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.
- Parameters:
Examples
>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Select aromatic residues
>>> grouping = ProteinAAClassification()
>>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"])
>>>
>>> # Select all charged residues
>>> selector = ProteinResiduesByGroup(
...     grouping,
...     groups=["charged_positive", "charged_negative"]
... )
- __init__(grouping, groups, exclude=False)[source]
- select(universe)[source]
Select protein residues matching the specified groups.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.protein.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]
Bases:
MolecularSelector
Select protein residues within a cutoff distance of reference atoms.
Useful for selecting residues near active sites, binding pockets, or other regions of interest.
- Parameters:
reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)
cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.
include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.
frame (int, optional) – Frame to use for distance calculation. Default is current frame (0).
Examples
>>> # Select residues within 5A of catalytic triad
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resid 77 133 156",
...     cutoff=5.0,
... )
>>>
>>> # Select residues near substrate binding site (not including the site itself)
>>> selector = ProteinResiduesNearReference(
...     reference_selection="resname LIG",
...     cutoff=4.0,
...     include_reference=False,
... )
- __init__(reference_selection, cutoff, include_reference=True, frame=0)[source]
- select(universe)[source]
Select protein residues near the reference atoms.
- property label: str
Short label identifying this selector (for filenames/logging).
Polymer chain and residue selectors.
This module provides selectors for polymer chains and residues:
PolymerChains: Select all polymer chains
PolymerResiduesByType: Select polymer residues by residue name (monomer type)
For systems built with PolyzyMD, use chain_id="C" (the default) to select polymers based on the PolyzyMD chain convention:
  - Chain A: Protein/Enzyme
  - Chain B: Substrate/Ligand
  - Chain C: Polymers
  - Chain D+: Solvent (water, ions, co-solvents)
Examples
>>> # Select polymer chain C (PolyzyMD default)
>>> selector = PolymerChains()
>>> result = selector.select(universe)
>>>
>>> # Select by residue names (for non-PolyzyMD systems)
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGP"])
>>>
>>> # Select specific polymer types within chain C
>>> selector = PolymerResiduesByType(residue_names=["SBM"])
- class polyzymd.analyses.shared.selectors.polymer.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
Bases:
MolecularSelector
Select polymer chains from the system.
For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.
- Parameters:
chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.
residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.
chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.
segids (list[str], optional) – If provided, select only polymers with these segment IDs.
Notes
The PolyzyMD chain convention is:
  - Chain A: Protein/Enzyme
  - Chain B: Substrate/Ligand
  - Chain C: Polymers
  - Chain D+: Solvent (water, ions, co-solvents)
For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.
Examples
>>> # PolyzyMD system (recommended)
>>> selector = PolymerChains()  # Uses chain C
>>>
>>> # Non-PolyzyMD system
>>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"])
>>>
>>> # PolyzyMD system with specific polymer types
>>> selector = PolymerChains(residue_names=["SBM"])  # SBM in chain C only
- __init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
- select(universe)[source]
Select polymer atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.polymer.PolymerResiduesByType(residue_names, exclude=False)[source]
Bases:
MolecularSelector
Select polymer residues by monomer type (residue name).
This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.
- Parameters:
Examples
>>> # Select SBMA monomers only
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"])
>>>
>>> # Select non-SBMA monomers
>>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)
- __init__(residue_names, exclude=False)[source]
- select(universe)[source]
Select polymer residues by type.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.polymer.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]
Bases:
MolecularSelector
Select individual segments (residues) within polymer chains.
This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.
- Parameters:
residue_names (list[str], optional) – Residue names that identify polymer residues.
chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.
segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.
Notes
A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.
- __init__(residue_names=None, chain_index=None, segment_indices=None)[source]
- select(universe)[source]
Select polymer segments.
- property label: str
Short label identifying this selector (for filenames/logging).
Solvent, cosolvent, and substrate selectors.
This module provides selectors for non-protein, non-polymer molecules:
SolventMolecules: Select water molecules
CosolventMolecules: Select cosolvent (e.g., DMSO)
SubstrateMolecule: Select substrate/ligand molecules
Examples
>>> # Select water molecules
>>> selector = SolventMolecules()
>>>
>>> # Select DMSO cosolvent
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
>>>
>>> # Select substrate by residue name
>>> selector = SubstrateMolecule(residue_name="RBU") # Resorufin butyrate
- class polyzymd.analyses.shared.selectors.solvent.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
Bases:
MolecularSelector
Select solvent (water) molecules.
- Parameters:
residue_names (list[str], optional) – Residue names for water. Default uses common water names.
exclude_near (str, optional) – Exclude waters within a cutoff of this selection. E.g., “protein” to exclude waters in first hydration shell.
exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.
Examples
>>> # Select all water
>>> selector = SolventMolecules()
>>>
>>> # Select bulk water (exclude first shell around protein)
>>> selector = SolventMolecules(
...     exclude_near="protein",
...     exclude_cutoff=5.0
... )
- __init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
- select(universe)[source]
Select water molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.solvent.CosolventMolecules(residue_names=None)[source]
Bases:
MolecularSelector
Select cosolvent molecules (e.g., DMSO, acetonitrile).
- Parameters:
residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.
Examples
>>> # Select DMSO molecules
>>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
- __init__(residue_names=None)[source]
- select(universe)[source]
Select cosolvent molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.solvent.SubstrateMolecule(residue_name, n_molecules=None)[source]
Bases:
MolecularSelector
Select substrate or ligand molecules.
- Parameters:
Examples
>>> # Select resorufin butyrate substrate
>>> selector = SubstrateMolecule(residue_name="RBU")
>>>
>>> # Select single substrate, validate count
>>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)
- __init__(residue_name, n_molecules=None)[source]
- select(universe)[source]
Select substrate molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.solvent.IonSelector(residue_names=None, ion_type='all')[source]
Bases:
MolecularSelector
Select ion molecules (Na+, Cl-, etc.).
- Parameters:
Examples
>>> # Select all ions
>>> selector = IonSelector()
>>>
>>> # Select only sodium ions
>>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])
- DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']
- DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']
- __init__(residue_names=None, ion_type='all')[source]
- select(universe)[source]
Select ion molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
Residue grouping abstractions for analysis plugins.
This module provides classification systems for residues:
ResidueGrouping: Abstract base class for residue classification
ProteinAAClassification: Standard amino acid classification
CustomGrouping: User-defined classification scheme
Examples
>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Classify amino acids
>>> grouping = ProteinAAClassification()
>>> print(grouping.classify("PHE")) # "aromatic"
>>> print(grouping.classify("LYS")) # "charged_positive"
>>>
>>> # Get all residues in a group
>>> aromatics = grouping.get_residues_in_group("aromatic")
>>> # Returns: ["PHE", "TRP", "TYR", "HIS"]
- class polyzymd.analyses.shared.groupings.ResidueGrouping[source]
Bases: ABC
Abstract base class for residue classification schemes.
Subclasses must implement classify() to map residue names to group labels.
Examples
>>> class MyPolymerGrouping(ResidueGrouping):
...     def classify(self, resname: str) -> str:
...         if resname in ["SBM", "SBMA"]:
...             return "zwitterionic"
...         elif resname in ["EGP", "EGMA"]:
...             return "hydrophilic"
...         return "unknown"
...
...     @property
...     def available_groups(self) -> list[str]:
...         return ["zwitterionic", "hydrophilic", "unknown"]
- abstractmethod classify(resname)[source]
Classify a residue name into a group.
- abstract property available_groups: list[str]
List of all group labels in this classification scheme.
- get_residues_in_group(group)[source]
Get all residue names that belong to a group.
- Parameters:
group (str) – Group label
- Returns:
Residue names in this group
- Return type:
- Raises:
ValueError – If group is not in available_groups
- to_dict()[source]
Serialize grouping scheme to dictionary.
- class polyzymd.analyses.shared.groupings.ProteinAAClassification(include_his_aromatic=True)[source]
Bases: ResidueGrouping
Standard amino acid classification.
Groups amino acids into:
- aromatic: PHE, TRP, TYR, HIS
- charged_positive: ARG, LYS
- charged_negative: ASP, GLU
- polar: ASN, CYS, GLN, SER, THR
- nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL
This classification matches the scaffold notebooks and common biochemistry conventions.
- Parameters:
include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.
Examples
>>> grouping = ProteinAAClassification()
>>> grouping.classify("PHE")
'aromatic'
>>> grouping.classify("LYS")
'charged_positive'
>>> grouping.get_residues_in_group("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']
- __init__(include_his_aromatic=True)[source]
- classify(resname)[source]
Classify amino acid by residue name.
- get_charged_groups()[source]
Convenience: get both charged group names.
- get_hydrophobic_groups()[source]
Convenience: groups typically considered hydrophobic.
- get_hydrophilic_groups()[source]
Convenience: groups typically considered hydrophilic.
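The effect of include_his_aromatic can be illustrated with a small standalone sketch. It does not import PolyzyMD: classify_sketch and AA_GROUPS are hypothetical names, the group table is copied from the class description above, and both the placement of HIS in charged_positive when the flag is off and the "unknown" fallback are assumptions rather than documented behavior.

```python
# Group table copied from the ProteinAAClassification description.
AA_GROUPS = {
    "aromatic": ["PHE", "TRP", "TYR", "HIS"],
    "charged_positive": ["ARG", "LYS"],
    "charged_negative": ["ASP", "GLU"],
    "polar": ["ASN", "CYS", "GLN", "SER", "THR"],
    "nonpolar": ["ALA", "GLY", "ILE", "LEU", "MET", "PRO", "VAL"],
}


def classify_sketch(resname: str, include_his_aromatic: bool = True) -> str:
    """Return the group label for a residue name (illustrative only)."""
    # Assumption: with the flag off, HIS is treated as charged_positive.
    if resname == "HIS" and not include_his_aromatic:
        return "charged_positive"
    for group, members in AA_GROUPS.items():
        if resname in members:
            return group
    # Assumption: names outside the table fall back to "unknown".
    return "unknown"


print(classify_sketch("HIS"))
print(classify_sketch("HIS", include_his_aromatic=False))
```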
- class polyzymd.analyses.shared.groupings.CustomGrouping(classification, default_group='other')[source]
Bases: ResidueGrouping
User-defined residue classification.
Allows arbitrary mapping from residue names to group labels.
- Parameters:
Examples
>>> # Custom polymer classification
>>> grouping = CustomGrouping({
...     "SBM": "zwitterionic",
...     "SBMA": "zwitterionic",
...     "EGP": "peg_like",
...     "EGMA": "peg_like",
... }, default_group="unknown")
>>> grouping.classify("SBM")
'zwitterionic'
- __init__(classification, default_group='other')[source]
- classify(resname)[source]
Classify residue by name using custom mapping.
- classmethod from_groups(groups, default_group='other')[source]
Create grouping from group -> residue list mapping.
- Parameters:
- Return type:
CustomGrouping
Examples
>>> grouping = CustomGrouping.from_groups({
...     "zwitterionic": ["SBM", "SBMA"],
...     "peg_like": ["EGP", "EGMA", "OEGMA"],
... })
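The two constructor forms describe the same data in opposite directions: the plain constructor takes residue -> group, while from_groups takes group -> residue list. A minimal sketch of that inversion follows. It is standalone and does not reproduce the library's code; invert_groups is a hypothetical helper, and rejecting a residue listed under two groups is an assumption.

```python
def invert_groups(groups: dict[str, list[str]]) -> dict[str, str]:
    """Turn {group: [resnames]} into {resname: group}."""
    classification: dict[str, str] = {}
    for group, resnames in groups.items():
        for resname in resnames:
            # Assumption: a residue may belong to one group only.
            if resname in classification:
                raise ValueError(f"{resname} is listed in more than one group")
            classification[resname] = group
    return classification


mapping = invert_groups({
    "zwitterionic": ["SBM", "SBMA"],
    "peg_like": ["EGP", "EGMA", "OEGMA"],
})
# Unmapped names fall back to a default label, mirroring default_group.
print(mapping.get("SBM", "other"), mapping.get("XYZ", "other"))
```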
Base classes for residue grouping/classification.
This module provides the abstract base class for residue classification schemes and concrete implementations for protein amino acids.
The Strategy pattern allows users to define custom classification schemes for polymers, modified residues, or other systems.
- class polyzymd.analyses.shared.groupings.base.ResidueGrouping[source]
Bases: ABC
Abstract base class for residue classification schemes.
Subclasses must implement classify() to map residue names to group labels.
Examples
>>> class MyPolymerGrouping(ResidueGrouping):
...     def classify(self, resname: str) -> str:
...         if resname in ["SBM", "SBMA"]:
...             return "zwitterionic"
...         elif resname in ["EGP", "EGMA"]:
...             return "hydrophilic"
...         return "unknown"
...
...     @property
...     def available_groups(self) -> list[str]:
...         return ["zwitterionic", "hydrophilic", "unknown"]
- abstractmethod classify(resname)[source]
Classify a residue name into a group.
- abstract property available_groups: list[str]
List of all group labels in this classification scheme.
- get_residues_in_group(group)[source]
Get all residue names that belong to a group.
- Parameters:
group (str) – Group label
- Returns:
Residue names in this group
- Return type:
- Raises:
ValueError – If group is not in available_groups
- to_dict()[source]
Serialize grouping scheme to dictionary.
- class polyzymd.analyses.shared.groupings.base.ProteinAAClassification(include_his_aromatic=True)[source]
Bases: ResidueGrouping
Standard amino acid classification.
Groups amino acids into:
- aromatic: PHE, TRP, TYR, HIS
- charged_positive: ARG, LYS
- charged_negative: ASP, GLU
- polar: ASN, CYS, GLN, SER, THR
- nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL
This classification matches the scaffold notebooks and common biochemistry conventions.
- Parameters:
include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.
Examples
>>> grouping = ProteinAAClassification()
>>> grouping.classify("PHE")
'aromatic'
>>> grouping.classify("LYS")
'charged_positive'
>>> grouping.get_residues_in_group("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']
- __init__(include_his_aromatic=True)[source]
- classify(resname)[source]
Classify amino acid by residue name.
- get_charged_groups()[source]
Convenience: get both charged group names.
- get_hydrophobic_groups()[source]
Convenience: groups typically considered hydrophobic.
- get_hydrophilic_groups()[source]
Convenience: groups typically considered hydrophilic.
- class polyzymd.analyses.shared.groupings.base.CustomGrouping(classification, default_group='other')[source]
Bases: ResidueGrouping
User-defined residue classification.
Allows arbitrary mapping from residue names to group labels.
- Parameters:
Examples
>>> # Custom polymer classification
>>> grouping = CustomGrouping({
...     "SBM": "zwitterionic",
...     "SBMA": "zwitterionic",
...     "EGP": "peg_like",
...     "EGMA": "peg_like",
... }, default_group="unknown")
>>> grouping.classify("SBM")
'zwitterionic'
- __init__(classification, default_group='other')[source]
- classify(resname)[source]
Classify residue by name using custom mapping.
- classmethod from_groups(groups, default_group='other')[source]
Create grouping from group -> residue list mapping.
- Parameters:
- Return type:
CustomGrouping
Examples
>>> grouping = CustomGrouping.from_groups({
...     "zwitterionic": ["SBM", "SBMA"],
...     "peg_like": ["EGP", "EGMA", "OEGMA"],
... })
Amino acid classification and SASA reference data.
This module provides centralized reference data for amino acid properties:
Maximum accessible surface area (maxASA) from Tien et al. 2013
Standard amino acid classification by physicochemical properties
Default MDAnalysis selection strings for each AA class
These constants are used by:
Protein grouping in contact analysis
Template generation for analysis configs
References
Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635. doi: 10.1371/journal.pone.0080635. PMID: 24278298; PMCID: PMC3836772.
- class polyzymd.analyses.shared.aa_classification.AAClass(value)[source]
Standard amino acid classifications.
- AROMATIC = 'aromatic'
- POLAR = 'polar'
- NONPOLAR = 'nonpolar'
- CHARGED_POSITIVE = 'charged_positive'
- CHARGED_NEGATIVE = 'charged_negative'
- UNKNOWN = 'unknown'
- polyzymd.analyses.shared.aa_classification.get_aa_class(resname)[source]
Get amino acid classification for a residue name.
- Parameters:
resname (str) – 3-letter amino acid code (case-insensitive)
- Returns:
Classification: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’, or ‘unknown’
- Return type:
Examples
>>> get_aa_class("PHE")
'aromatic'
>>> get_aa_class("lys")
'charged_positive'
>>> get_aa_class("UNK")
'unknown'
- polyzymd.analyses.shared.aa_classification.get_max_asa(resname)[source]
Get maximum accessible surface area for a residue name.
- Parameters:
resname (str) – 3-letter amino acid code (case-insensitive)
- Returns:
Maximum ASA in Angstrom^2, or None if residue not in table
- Return type:
float or None
Examples
>>> get_max_asa("ALA")
121.0
>>> get_max_asa("TRP")
264.0
>>> get_max_asa("UNK")  # Returns None for unknown residues
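Maximum ASA values are typically used to normalize an absolute SASA into a relative solvent accessibility. The sketch below shows that arithmetic using only the two values quoted in the examples above. MAX_ASA_SKETCH and relative_sasa are hypothetical names, and how PolyzyMD's own plugins consume the table is not shown here.

```python
from typing import Optional

# Only the two maxASA values quoted in the examples above (Angstrom^2).
MAX_ASA_SKETCH = {"ALA": 121.0, "TRP": 264.0}


def relative_sasa(resname: str, sasa: float) -> Optional[float]:
    """Return SASA / maxASA, or None when the residue is not in the table."""
    max_asa = MAX_ASA_SKETCH.get(resname.upper())
    if max_asa is None:
        return None
    return sasa / max_asa


# An alanine exposing 60.5 A^2 is half exposed relative to its maximum.
print(relative_sasa("ala", 60.5))
print(relative_sasa("UNK", 10.0))
```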
- polyzymd.analyses.shared.aa_classification.get_residues_for_class(aa_class)[source]
Get all residue names belonging to an amino acid class.
- Parameters:
aa_class (str) – One of: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’
- Returns:
List of 3-letter amino acid codes in this class
- Return type:
- Raises:
ValueError – If aa_class is not a valid classification
Examples
>>> get_residues_for_class("aromatic")
['PHE', 'TRP', 'TYR', 'HIS']
- polyzymd.analyses.shared.aa_classification.get_selection_for_class(aa_class)[source]
Get MDAnalysis selection string for an amino acid class.
- Parameters:
aa_class (str) – One of: ‘aromatic’, ‘polar’, ‘nonpolar’, ‘charged_positive’, ‘charged_negative’
- Returns:
MDAnalysis selection string
- Return type:
- Raises:
ValueError – If aa_class is not a valid classification
Examples
>>> get_selection_for_class("aromatic")
'protein and resname PHE TRP TYR HIS'
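The selection string in the example above has the shape "protein and resname" followed by the class members. A minimal sketch of how such a string can be assembled from a residue list is given below. selection_from_resnames is a hypothetical helper, not the library function, and it only reproduces the string format visible in the example.

```python
def selection_from_resnames(resnames: list[str]) -> str:
    """Build an MDAnalysis-style selection string from residue names."""
    if not resnames:
        raise ValueError("at least one residue name is required")
    # MDAnalysis accepts several space-separated names after `resname`.
    return "protein and resname " + " ".join(resnames)


print(selection_from_resnames(["PHE", "TRP", "TYR", "HIS"]))
```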
Diagnostics and path helpers
Diagnostics helpers validate selections and analysis inputs. The module is
polyzymd.analyses.shared.diagnostics. Path helpers standardize
artifact-oriented file locations used by analysis plugins.
Path utilities for the analysis plugin system.
- polyzymd.analyses.shared.paths.sanitize_label(label)[source]
Convert a condition label to a filesystem-safe directory name.
Replaces % with pct and spaces with underscores, strips remaining non-alphanumeric characters (except hyphens, underscores, and dots), and collapses consecutive underscores.
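The steps described for sanitize_label can be sketched with a few regular-expression substitutions. This is a standalone approximation: sanitize_label_sketch is a hypothetical name, and the order of the steps is an assumption that may differ from the real implementation in edge cases.

```python
import re


def sanitize_label_sketch(label: str) -> str:
    """Approximate the documented label sanitization steps."""
    text = label.replace("%", "pct")             # "%" -> "pct"
    text = text.replace(" ", "_")                # spaces -> underscores
    text = re.sub(r"[^A-Za-z0-9._-]", "", text)  # drop other characters
    text = re.sub(r"_+", "_", text)              # collapse "__" runs
    return text


print(sanitize_label_sketch("50% SBMA / 50% EGMA"))
```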
- polyzymd.analyses.shared.paths.format_replicate_cache_token(replicates)[source]
Format replicate IDs for cache filenames without range collisions.
Contiguous replicate IDs are compacted as a range, while non-contiguous IDs are listed explicitly so (1, 3) cannot collide with (1, 2, 3).
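The collision-avoidance idea can be illustrated as follows. The actual token format is not documented here, so the "1-3" range form and the "1_3" explicit form are assumptions chosen for illustration, and replicate_token_sketch is a hypothetical name.

```python
from typing import Iterable


def replicate_token_sketch(replicates: Iterable[int]) -> str:
    """Return a range token for contiguous IDs, an explicit list otherwise."""
    ids = sorted(set(replicates))
    if not ids:
        raise ValueError("at least one replicate ID is required")
    # Contiguous means every integer between min and max is present.
    if len(ids) > 1 and ids[-1] - ids[0] == len(ids) - 1:
        return f"{ids[0]}-{ids[-1]}"
    return "_".join(str(i) for i in ids)


print(replicate_token_sketch((1, 2, 3)))
print(replicate_token_sketch((1, 3)))
```

Because (1, 3) is written out explicitly instead of being compacted to its endpoints, its token stays distinct from the token for (1, 2, 3).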
Multi-run comparison and formatting
Multi-run helpers support plugins that compare several named runs or entities per condition, such as RMSD, radius of gyration, and SASA analyses.
Shared helpers for multi-run comparison orchestration.
These helpers keep run-wise comparison logic concise across plugins that compare multiple named runs (RMSD, Rg, SASA).
- polyzymd.analyses.shared.multi_run_comparison.filter_summaries_with_run(summaries, run_label, get_run_fn, logger=None)[source]
Filter condition summaries to those containing a specific run.
- Parameters:
summaries (dict[str, Any]) – Mapping from condition label to condition summary.
run_label (str) – Run label to keep.
get_run_fn (Callable[[Any, str], Any]) – Callback that returns the run summary for (summary, run_label) and raises KeyError when the run is missing.
logger (logging.Logger | None, optional) – Optional logger for missing-run warnings.
- Returns:
Subset of summaries with run data available.
- Return type:
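The callback contract (return the run summary, raise KeyError when it is missing) can be sketched as below. This is a standalone illustration with a hypothetical name, filter_with_run_sketch, and plain dictionaries standing in for condition summaries. The warning text is invented for the example.

```python
import logging
from typing import Any, Callable, Optional


def filter_with_run_sketch(
    summaries: dict[str, Any],
    run_label: str,
    get_run_fn: Callable[[Any, str], Any],
    logger: Optional[logging.Logger] = None,
) -> dict[str, Any]:
    """Keep only the conditions whose summary contains run_label."""
    kept: dict[str, Any] = {}
    for condition, summary in summaries.items():
        try:
            get_run_fn(summary, run_label)
        except KeyError:
            if logger is not None:
                logger.warning("Condition %s has no run %s", condition, run_label)
            continue
        kept[condition] = summary
    return kept


# Plain dicts stand in for summary objects; dict lookup raises KeyError.
demo = {"control": {"protein": 1.2, "polymer": 3.4}, "treated": {"protein": 1.5}}
print(list(filter_with_run_sketch(demo, "polymer", lambda s, r: s[r])))
```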
- polyzymd.analyses.shared.multi_run_comparison.build_condition_pairs(condition_labels, control_label, on_control_missing='all_pairs', logger=None)[source]
Build pairwise condition pairs for comparison.
- Parameters:
condition_labels (list[str]) – Ordered condition labels to compare.
control_label (str | None) – Preferred control label for control-vs-treatment comparisons.
on_control_missing (str, optional) – Behavior when control_label is requested but unavailable. Supported values:
"all_pairs": fall back to all-vs-all
"skip": return no pairs
logger (logging.Logger | None, optional) – Optional logger for fallback/skip messages.
- Returns:
Pair list as (condition_a, condition_b) tuples.
- Return type:
- Raises:
ValueError – Raised when on_control_missing is not "all_pairs" or "skip".
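A minimal sketch of the pairing logic described above is shown below. build_pairs_sketch is a hypothetical stand-in. Pairing the control with every other condition when it is present, and returning all pairs when control_label is None, are assumptions about behavior the reference does not spell out.

```python
from itertools import combinations
from typing import Optional


def build_pairs_sketch(
    condition_labels: list[str],
    control_label: Optional[str],
    on_control_missing: str = "all_pairs",
) -> list[tuple[str, str]]:
    """Return (condition_a, condition_b) pairs for comparison."""
    if on_control_missing not in ("all_pairs", "skip"):
        raise ValueError("on_control_missing must be 'all_pairs' or 'skip'")
    all_pairs = list(combinations(condition_labels, 2))
    if control_label is None:
        return all_pairs  # assumption: no control means all-vs-all
    if control_label in condition_labels:
        # Assumption: the control is compared against every other condition.
        return [(control_label, c) for c in condition_labels if c != control_label]
    # Control requested but unavailable: apply the documented fallback.
    return all_pairs if on_control_missing == "all_pairs" else []


print(build_pairs_sketch(["ctrl", "a", "b"], "ctrl"))
print(build_pairs_sketch(["a", "b", "c"], "ctrl", on_control_missing="skip"))
```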
- polyzymd.analyses.shared.multi_run_comparison.apply_fdr_correction(pairwise_results, anova_by_run=None, fdr_alpha=0.05, get_p_value=None, set_corrected=None)[source]
Apply Benjamini-Hochberg FDR correction across statistical result families.
- Parameters:
pairwise_results (list[Any]) – Pairwise comparison result objects.
anova_by_run (dict[Any, Any] | list[Any] | None, optional) – ANOVA result objects, as either list-like or dict-like container.
fdr_alpha (float, optional) – FDR threshold.
get_p_value (Callable[[Any], float | None] | None, optional) – Callback extracting raw p-value from a result object. Defaults to reading .p_value.
set_corrected (Callable[[Any, Any], None] | None, optional) – Callback applying BH output to each result object. Defaults to setting .p_value_adjusted (when available) and .significant.
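For reference, the Benjamini-Hochberg step itself can be sketched in pure Python on a flat list of p-values. This illustrates the general procedure only. benjamini_hochberg_sketch is a hypothetical name, and how the library pools pairwise and ANOVA results or handles missing p-values is not reproduced here.

```python
def benjamini_hochberg_sketch(
    p_values: list[float], alpha: float = 0.05
) -> tuple[list[float], list[bool]]:
    """Return BH-adjusted p-values and significance flags, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    significant = [p <= alpha for p in adjusted]
    return adjusted, significant


adjusted, significant = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.20])
print(adjusted, significant)
```

In this example only the smallest p-value stays below the 0.05 threshold after adjustment, since 0.03 and 0.04 both adjust to roughly 0.053.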
Shared formatting helpers for multi-run analysis outputs.
- polyzymd.analyses.shared.multi_run_formatting.is_sem_estimable(n_replicates)[source]
Return whether SEM can be estimated from replicate-level values.
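The singleton rule shared by the SEM helpers in this module can be sketched as follows. The function names are hypothetical, and the two-replicate threshold, the numeric format, and the "n/a" result for a missing SEM value are assumptions. Only the "n/a" and "SEM: n/a (single replicate)" strings come from this reference.

```python
from typing import Optional


def sem_estimable_sketch(n_replicates: int) -> bool:
    # Assumption: a standard error needs at least two replicate values.
    return n_replicates >= 2


def sem_value_sketch(
    sem: Optional[float], n_replicates: int, *, precision: int = 2, unit: str = ""
) -> str:
    if not sem_estimable_sketch(n_replicates) or sem is None:
        return "n/a"
    return f"{sem:.{precision}f}{unit}"


def sem_phrase_sketch(
    sem: Optional[float], n_replicates: int, *, precision: int = 2, unit: str = ""
) -> str:
    if not sem_estimable_sketch(n_replicates):
        return "SEM: n/a (single replicate)"
    return "SEM: " + sem_value_sketch(sem, n_replicates, precision=precision, unit=unit)


print(sem_value_sketch(0.1234, 3))
print(sem_phrase_sketch(0.5, 1))
```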
- polyzymd.analyses.shared.multi_run_formatting.format_sem_value(sem, n_replicates, *, precision=2, unit='')[source]
Format SEM without implying singleton uncertainty is estimable.
- Parameters:
sem (float | None) – SEM value to display when enough replicates are available.
n_replicates (int) – Number of replicates contributing to the summary.
precision (int, optional) – Decimal places for numeric SEM values, by default 2.
unit (str, optional) – Unit suffix appended to numeric SEM values, by default "".
- Returns:
"n/a" for singleton summaries, otherwise a formatted SEM value.
- Return type:
- polyzymd.analyses.shared.multi_run_formatting.format_sem_phrase(sem, n_replicates, *, precision=2, unit='')[source]
Format a compact SEM: ... phrase for summaries.
- Parameters:
sem (float | None) – SEM value to display when enough replicates are available.
n_replicates (int) – Number of replicates contributing to the summary.
precision (int, optional) – Decimal places for numeric SEM values, by default 2.
unit (str, optional) – Unit suffix appended to numeric SEM values, by default "".
- Returns:
"SEM: n/a (single replicate)" for singleton summaries, otherwise a numeric SEM phrase.
- Return type:
- polyzymd.analyses.shared.multi_run_formatting.make_section_title(title, width)[source]
Build a section title and separator lines.
- polyzymd.analyses.shared.multi_run_formatting.make_ranked_table_header(*, mean_label)[source]
Build standard ranked-table headers for text output.
- polyzymd.analyses.shared.multi_run_formatting.make_ranked_markdown_header(*, mean_label)[source]
Build standard ranked-table headers for markdown output.
- polyzymd.analyses.shared.multi_run_formatting.format_pairwise_line(*, condition_a, condition_b, direction, p_value, effect_size, effect_label, percent_change, significant, prefix='Pairwise')[source]
Format one standard pairwise comparison line.
- polyzymd.analyses.shared.multi_run_formatting.format_anova_line(*, f_statistic, p_value, significant)[source]
Format one standard ANOVA line.
- polyzymd.analyses.shared.multi_run_formatting.format_markdown_bullet(prefix, line)[source]
Format a markdown bullet line with consistent prefixing.
- polyzymd.analyses.shared.multi_run_formatting.make_ranked_rows(ranking, get_values)[source]
Build ranked rows as (label, mean, sem, rank) tuples.