Analysis Shared Utilities
API reference for polyzymd.analyses.shared and its commonly used
submodules.
polyzymd.analyses.shared
Shared infrastructure for analysis plugins.
This package provides a curated set of commonly used shared utilities. For advanced or less common helpers, import directly from the relevant shared submodules.
In particular, selectors, custom selections, and result I/O helpers are not re-exported from this package root and should be imported from:
polyzymd.analyses.shared.selectorspolyzymd.analyses.shared.selectionspolyzymd.analyses.shared.result_io
Sub-modules
- loader
Trajectory loading, time parsing, frame conversion.
- alignment
Trajectory alignment (centroid/average/frame/external).
- centroid
K-Means centroid frame finding, reference mode dispatch.
- statistics
SEM, per-residue/region aggregation, weighted mean.
- aggregation
Replicate collection, distance pair aggregation.
- autocorrelation
ACF, correlation time, statistical inefficiency.
- pbc
Minimum image distance, PBC-aware distance matrix.
- selections
Extended selection syntax (midpoint, COM), position retrieval.
- diagnostics
Selection diagnostics, equilibration validation.
- config_hash
Config hashing for cache validation.
- constants
Shared default values (cutoffs, thresholds).
- defaults
AnalysisDefaults model (equilibration_time).
- logging_utils
Colored terminal logging.
- plotting
Shared plotting helpers (axis styling, figure saving, grouped bars, etc.).
- class polyzymd.analyses.shared.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)[source]
Bases:
objectInformation about discovered trajectory files.
- topology_file
Path to topology file (PDB)
- Type:
Path
- trajectory_files
List of trajectory files (DCD) in order
- Type:
list[Path]
- n_segments
Number of daisy-chain segments
- Type:
- working_directory
Base working directory for this replicate
- Type:
Path
- replicate
Replicate number
- Type:
- topology_file: Path
- n_segments: int = 0
- working_directory: Path
- replicate: int = 1
- property n_trajectory_files: int
Number of trajectory files found.
- validate()[source]
Validate that all files exist.
- __init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)
- class polyzymd.analyses.shared.TrajectoryLoader(config, engine_override=None)[source]
Bases:
objectConfig-aware trajectory loader for PolyzyMD simulations.
This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:
Daisy-chain continuation segments (OpenMM)
Flat production directories (GROMACS)
Scratch vs projects directory resolution
Multiple replicates
File discovery is delegated to the simulation engine resolved from the config’s
enginefield. The engine is created lazily on the first call that needs it, so construction remains cheap.- Parameters:
config (SimulationConfig) – PolyzyMD simulation configuration.
engine_override (str or None, optional) – Force a specific engine name (
"openmm"or"gromacs") instead of readingconfig.engine.
Examples
>>> from polyzymd.config import load_config >>> config = load_config("config.yaml") >>> loader = TrajectoryLoader(config) >>> >>> # Load single replicate >>> u = loader.load_universe(replicate=1) >>> print(f"Loaded {len(u.trajectory)} frames") >>> >>> # Get trajectory info without loading >>> info = loader.get_trajectory_info(replicate=1) >>> print(f"Found {info.n_segments} segments") >>> >>> # Load multiple replicates >>> for rep in range(1, 6): ... u = loader.load_universe(replicate=rep) ... # ... analyze >>> >>> # Explicit engine override for GROMACS directories >>> loader = TrajectoryLoader(config, engine_override="gromacs")
Notes
Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).
- __init__(config, engine_override=None)[source]
- get_trajectory_info(replicate)[source]
Get trajectory file information for a replicate.
- Parameters:
replicate (int) – Replicate number (1-indexed)
- Returns:
Information about discovered trajectory files
- Return type:
TrajectoryInfo
- Raises:
FileNotFoundError – If working directory or required files don’t exist
- load_universe(replicate, cache=True)[source]
Load MDAnalysis Universe for a replicate.
- Parameters:
- Returns:
MDAnalysis Universe with trajectory loaded
- Return type:
Universe
Notes
For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.
- iter_replicates(replicates)[source]
Iterate over multiple replicates.
- Parameters:
replicates (sequence of int) – Replicate numbers to load
- Yields:
tuple of (int, Universe) – Replicate number and loaded Universe
Examples
>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]): ... rmsf = compute_rmsf(u) ... results[rep] = rmsf
- get_frame_times(replicate, unit='ns')[source]
Get time values for each frame.
- get_timestep(replicate, unit='ps')[source]
Get the trajectory timestep (time between frames).
- clear_cache()[source]
Clear the Universe cache to free memory.
- find_topology(working_dir)[source]
Find topology file in working directory.
Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS,
solvated_system.pdbpreference for OpenMM).This method is used by several plugins that pass an explicit
working_dirunrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to1.- Parameters:
working_dir (Path) – Directory to search for topology files.
- Returns:
Path to the topology file.
- Return type:
Path
- Raises:
FileNotFoundError – If no topology file is found.
- polyzymd.analyses.shared.parse_time_string(time_str)[source]
Parse a time string with units into value and unit.
- Parameters:
time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.
- Returns:
Numeric value and unit string
- Return type:
Examples
>>> parse_time_string("100ns") (100.0, "ns") >>> parse_time_string("5000 ps") (5000.0, "ps") >>> parse_time_string("100") # Default to ns (100.0, "ns")
- polyzymd.analyses.shared.convert_time(value, from_unit, to_unit)[source]
Convert time between units.
- polyzymd.analyses.shared.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]
Convert time to frame index.
- class polyzymd.analyses.shared.AlignmentConfig(*, enabled=True, reference_mode='centroid', reference_frame=None, selection='protein and name CA', centroid_selection='protein', reference_file=None)[source]
Bases:
BaseModelConfiguration for trajectory alignment.
This model defines how a trajectory should be aligned before analysis. By default, alignment is ENABLED to remove rotational drift and COM motion that can add noise to distance measurements.
- enabled
Whether to align the trajectory before analysis. Default is True. When enabled, the user is notified via INFO-level logging.
- Type:
- reference_mode
How to select the reference structure:
“centroid”: Representative aligned frame (closest to aligned mean)
“average”: Mathematical average structure
“frame”: Specific frame number
- Type:
ReferenceMode
- reference_frame
Frame number (1-indexed) when reference_mode=”frame”. Required when mode is “frame”, ignored otherwise.
- Type:
int | None
- selection
MDAnalysis selection string for superposition. Atoms matching this selection are used to compute the optimal rotation/translation. Default: “protein and name CA” (backbone alpha-carbons).
- Type:
- centroid_selection
Selection for representative-frame detection when mode=”centroid”. Default: “protein” (all protein atoms for better discrimination).
- Type:
Examples
>>> # Default: align to centroid using CA atoms >>> config = AlignmentConfig()
>>> # Align to a specific frame >>> config = AlignmentConfig( ... reference_mode="frame", ... reference_frame=500, ... )
>>> # Disable alignment (not recommended for most analyses) >>> config = AlignmentConfig(enabled=False)
- enabled: bool
- reference_mode: ReferenceMode
- selection: str
- centroid_selection: str
- reference_file: Path | None
- validate_reference_params()[source]
Validate reference_frame and reference_file for their modes.
- to_dict()[source]
Convert to dictionary for cache key hashing.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.align_trajectory(universe, config, start_frame=0, stop_frame=None)[source]
Align trajectory in-memory to a reference structure.
This function performs trajectory alignment using MDAnalysis, removing rotational drift and center-of-mass motion. The alignment is performed in-place, modifying the Universe’s coordinates.
- Parameters:
universe (Universe) – MDAnalysis Universe to align. Will be modified in-place.
config (AlignmentConfig) – Alignment configuration specifying reference mode and selections.
start_frame (int, optional) – First frame to consider (0-indexed). Default is 0. Frames before this are typically equilibration and should be excluded.
stop_frame (int, optional) – Last frame (exclusive). If None, uses all frames after start_frame.
- Returns:
Reference frame index (0-indexed), or None if using average mode. This can be stored in results for reproducibility.
- Return type:
int | None
- Raises:
ImportError – If MDAnalysis is not available.
ValueError – If reference_mode is “frame” but the frame number is out of range.
Notes
When alignment is performed, an INFO-level log message is emitted to notify the user. This ensures users are aware that their trajectory coordinates have been modified.
Examples
>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory >>> >>> config = AlignmentConfig(reference_mode="centroid") >>> ref_frame = align_trajectory(universe, config, start_frame=100) >>> print(f"Aligned to frame {ref_frame}")
- polyzymd.analyses.shared.get_alignment_description(config)[source]
Generate a human-readable description of the alignment configuration.
- Parameters:
config (AlignmentConfig) – Alignment configuration.
- Returns:
Description suitable for documentation or result files.
- Return type:
- class polyzymd.analyses.shared.StatResult(mean, sem, n_samples)[source]
Bases:
objectContainer for mean +/- SEM results.
- mean
The mean value
- Type:
- sem
Standard error of the mean
- Type:
- n_samples
Number of samples used in computation
- Type:
- mean: float
- sem: float
- n_samples: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(mean, sem, n_samples)
- class polyzymd.analyses.shared.PerResidueStats(residue_ids, means, sems, n_replicates)[source]
Bases:
objectContainer for per-residue statistics across replicates.
- residue_ids
Residue identifiers (1-indexed, following PyMOL convention)
- Type:
NDArray[np.int64]
- means
Mean value for each residue across replicates
- Type:
NDArray[np.float64]
- sems
SEM for each residue across replicates
- Type:
NDArray[np.float64]
- n_replicates
Number of replicates aggregated
- Type:
- residue_ids: numpy.typing.NDArray.numpy.int64
- means: numpy.typing.NDArray.numpy.float64
- sems: numpy.typing.NDArray.numpy.float64
- n_replicates: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(residue_ids, means, sems, n_replicates)
- polyzymd.analyses.shared.compute_sem(values, ddof=1)[source]
Compute mean and standard error of the mean.
SEM = std / sqrt(N) where N is the number of samples.
- Parameters:
values (array_like) – 1D array of values (e.g., one value per replicate)
ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).
- Returns:
Container with mean, sem, and n_samples
- Return type:
StatResult
Examples
>>> values = [2.5, 2.7, 2.6, 2.4, 2.8] # RMSF from 5 replicates >>> result = compute_sem(values) >>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A") RMSF = 2.60 +/- 0.07 A
Notes
For a single value, SEM is undefined (returns 0.0).
- polyzymd.analyses.shared.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]
Aggregate per-residue values across replicates.
For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.
residue_ids (array, optional) – 1-indexed residue identifiers. If None, uses 1, 2, 3, … Following PyMOL convention (1-indexed).
- Returns:
Container with residue_ids, means, sems, n_replicates
- Return type:
PerResidueStats
- Raises:
ValueError – If arrays have inconsistent lengths or no replicates provided
- polyzymd.analyses.shared.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]
Aggregate region-averaged values across replicates.
For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.
This implements the correct hierarchical aggregation: 1. First average within each replicate (over selected residues) 2. Then compute mean +/- SEM across replicate means
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.
residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.
- Returns:
Mean +/- SEM of region-averaged values across replicates
- Return type:
StatResult
- polyzymd.analyses.shared.weighted_mean_with_sem(means, sems, weights=None)[source]
Compute weighted mean with proper error propagation.
Useful for combining results from different conditions or analyses with different uncertainties.
- Parameters:
means (array_like) – Mean values from each source
sems (array_like) – SEM values from each source
weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.
- Returns:
Weighted mean with propagated uncertainty
- Return type:
StatResult
Notes
- For inverse-variance weighting, the combined SEM is:
SEM_combined = 1 / sqrt(sum(1/SEM_i^2))
- For arbitrary weights, uses standard error propagation:
SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)
- class polyzymd.analyses.shared.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]
Bases:
objectResult of autocorrelation function computation.
- lags
Time lags in the same units as timestep
- Type:
NDArray[np.float64]
- acf
Autocorrelation values (normalized, ACF[0] = 1)
- Type:
NDArray[np.float64]
- timestep
Time between frames
- Type:
- timestep_unit
Unit of timestep (e.g., “ps”, “ns”)
- Type:
- n_samples
Number of samples in the original timeseries
- Type:
- lags: numpy.typing.NDArray.numpy.float64
- acf: numpy.typing.NDArray.numpy.float64
- timestep: float
- timestep_unit: str
- n_samples: int
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(lags, acf, timestep, timestep_unit, n_samples)
- class polyzymd.analyses.shared.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]
Bases:
objectResult of correlation time estimation.
- tau
Estimated correlation time
- Type:
- tau_unit
Unit of tau (same as timestep unit)
- Type:
- method
Method used for estimation
- Type:
- n_independent
Estimated number of independent samples in trajectory
- Type:
- statistical_inefficiency
g = 1 + 2*tau/dt, factor by which variance is inflated
- Type:
- warning
Warning message if statistics may be unreliable (e.g., N_ind < 10)
- Type:
str | None
- tau: float
- tau_unit: str
- method: str
- n_independent: int
- statistical_inefficiency: float
- property is_reliable: bool
Return True if statistics are likely reliable (N_ind >= 10).
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)
- polyzymd.analyses.shared.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]
Compute autocorrelation function of a 1D timeseries.
Uses FFT-based computation for efficiency.
- Parameters:
timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)
max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.
timestep (float, optional) – Time between frames. Default is 1.0.
timestep_unit (str, optional) – Unit of timestep. Default is “frames”.
- Returns:
Container with lags, acf values, and metadata
- Return type:
ACFResult
Examples
>>> # Compute ACF of RMSD timeseries >>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...]) # from MDAnalysis >>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps") >>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")
Notes
The ACF is normalized so that ACF[0] = 1.
For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²
For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.
- polyzymd.analyses.shared.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]
Estimate correlation time from ACF or raw timeseries.
- Parameters:
acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries
timestep (float, optional) – Time between frames (only used if passing raw timeseries)
timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)
method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ: - “first_zero”: Lag where ACF first crosses zero - “exponential_fit”: Fit ACF = exp(-t/τ) - “integration”: τ = ∫ACF(t)dt (recommended, most robust)
n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.
- Returns:
Contains tau, method used, n_independent, statistical_inefficiency
- Return type:
CorrelationTimeResult
Examples
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps") >>> tau_result = estimate_correlation_time(acf_result, method="integration") >>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}") >>> print(f"Independent samples: {tau_result.n_independent}")
Notes
- The “integration” method is most robust for noisy ACFs. It computes:
τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt
Integration stops at first zero crossing to avoid noise contribution.
- polyzymd.analyses.shared.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]
Get frame indices for independent samples.
Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.
- Parameters:
n_frames (int) – Total number of frames in trajectory
correlation_time (float) – Correlation time τ (in same units as timestep)
timestep (float, optional) – Time between frames. Default is 1.0.
start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed (PyMOL convention).
- Returns:
Array of frame indices (0-indexed) that are approximately independent
- Return type:
NDArray[np.int64]
Examples
>>> # Get independent frames for RMSF calculation >>> tau_result = estimate_correlation_time(rmsd, timestep=10.0) >>> indices = get_independent_indices( ... n_frames=10000, ... correlation_time=tau_result.tau, ... timestep=10.0, ... start_frame=1000, # Skip first 1000 frames for equilibration ... ) >>> print(f"Using {len(indices)} independent frames")
Notes
Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.
The spacing is set to 2*τ/timestep, which gives frames with negligible correlation (ACF < 0.05 for exponential decay).
- polyzymd.analyses.shared.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]
Compute statistical inefficiency g directly from a timeseries.
The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:
Var(mean) = Var(x) * g / N
This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)
where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).
- Parameters:
timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)
mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.
fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.
- Returns:
Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.
- Return type:
Examples
>>> # Binary contact timeseries >>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...]) >>> g = statistical_inefficiency(contacts) >>> n_eff = len(contacts) / g >>> print(f"Effective samples: {n_eff:.1f}")
>>> # Continuous observable >>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...]) >>> g = statistical_inefficiency(rmsd)
Notes
This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.
For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
- polyzymd.analyses.shared.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]
Compute statistical inefficiency from multiple timeseries of different lengths.
This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.
- Parameters:
- Returns:
Statistical inefficiency g (>= 1.0)
- Return type:
Examples
>>> # Three replicates with different lengths >>> ts1 = np.array([0, 1, 1, 0, 0, 1]) # 6 frames >>> ts2 = np.array([1, 1, 0, 0, 0]) # 5 frames >>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1]) # 7 frames >>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])
Notes
This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.
The algorithm:
Compute global mean μ across all timeseries
For each lag t: - Compute sum of (x - μ) products across all timeseries where t < N_k - Compute sum of sample counts across all timeseries where t < N_k - Average to get C(t)
Sum with finite-size correction
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
- polyzymd.analyses.shared.n_effective(n_samples, g)[source]
Compute number of effective independent samples.
- polyzymd.analyses.shared.check_statistical_reliability(n_eff, threshold=10)[source]
Check if statistics are reliable based on effective sample count.
- Parameters:
- Returns:
is_reliable (bool) – True if n_eff >= threshold
warning (str | None) – Warning message if not reliable, None otherwise
- Return type:
Examples
>>> g = statistical_inefficiency(contacts) >>> n_eff = n_effective(len(contacts), g) >>> is_ok, warning = check_statistical_reliability(n_eff) >>> if not is_ok: ... print(warning)
- polyzymd.analyses.shared.minimum_image_distance(pos1, pos2, box=None)[source]
Calculate minimum image distance between two positions.
Uses the minimum image convention for periodic boundary conditions. Both orthorhombic and triclinic boxes are fully supported.
- Parameters:
pos1 (NDArray) – Position vectors of shape (3,) in Angstroms.
pos2 (NDArray) – Position vectors of shape (3,) in Angstroms.
box (NDArray, optional) – Box dimensions in MDAnalysis format: [Lx, Ly, Lz, alpha, beta, gamma] or just [Lx, Ly, Lz] (assumes orthorhombic). If None, no PBC correction is applied.
- Returns:
Minimum image distance in Angstroms.
- Return type:
- polyzymd.analyses.shared.pairwise_distances_pbc(positions1, positions2, box=None)[source]
Calculate pairwise distances between two sets of positions.
Uses the minimum image convention for periodic boundary conditions. Both orthorhombic and triclinic boxes are fully supported.
- Parameters:
positions1 (NDArray) – First set of positions, shape (N, 3) in Angstroms.
positions2 (NDArray) – Second set of positions, shape (M, 3) in Angstroms.
box (NDArray, optional) – Box dimensions in MDAnalysis format: [Lx, Ly, Lz, alpha, beta, gamma] or just [Lx, Ly, Lz] (assumes orthorhombic). If None, no PBC correction is applied.
- Returns:
Distance matrix of shape (N, M) in Angstroms.
- Return type:
NDArray[np.float64]
- class polyzymd.analyses.shared.AnalysisDefaults(*, equilibration_time='10ns', fdr_alpha=0.05, ttest_method='student', posthoc_method='ttest_bh')[source]
Bases:
BaseModelDefault parameters applied to all analyses.
- equilibration_time
Time to skip for equilibration (e.g., “10ns”, “5000ps”)
- Type:
- fdr_alpha
False discovery rate threshold for Benjamini-Hochberg correction in default scalar pairwise comparisons. Only used when
posthoc_methodis"ttest_bh".- Type:
- ttest_method
Two-sample t-test method used by the default scalar comparison.
"student"assumes equal variances and"welch"relaxes that assumption. Only used whenposthoc_methodis"ttest_bh".- Type:
- posthoc_method
Post-hoc testing method for pairwise comparisons.
"ttest_bh"(default) runs pairwise t-tests with Benjamini-Hochberg FDR correction."tukey_hsd"runs Tukey’s Honestly Significant Difference test across all groups.- Type:
- equilibration_time: str
- fdr_alpha: float
- ttest_method: str
- posthoc_method: str
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.compute_config_hash(config)[source]
Compute hash of config parameters relevant to analysis.
Includes parameters that affect trajectory interpretation: - enzyme configuration - substrate configuration - polymer configuration - thermodynamics (temperature, pressure) - output paths (for trajectory location)
Excludes parameters that don’t affect analysis of completed trajectories: - simulation_phases (equilibration_stages/production settings) - force_field (already baked into trajectory)
- Parameters:
config (SimulationConfig) – PolyzyMD simulation configuration
- Returns:
Hex digest of SHA-256 hash (first 16 characters for brevity)
- Return type:
Examples
>>> from polyzymd.config import load_config >>> config = load_config("config.yaml") >>> hash_val = compute_config_hash(config) >>> print(f"Config hash: {hash_val}") Config hash: a3b2c1d4e5f67890
- polyzymd.analyses.shared.validate_config_hash(stored_hash, current_config, warn=True)[source]
Check if stored hash matches current config.
If the hashes don’t match, this indicates the config has changed since the analysis was performed. This could mean: 1. The user modified the config (bad practice - should create new project) 2. The analysis was performed on a different config file 3. A bug in the hashing algorithm
- Parameters:
- Returns:
True if hashes match, False otherwise
- Return type:
Examples
>>> stored_hash = loaded_result.get("config_hash", "") >>> config = load_config("config.yaml") >>> if not validate_config_hash(stored_hash, config): ... print("Warning: Results may be stale!")
- class polyzymd.analyses.shared.ConvergenceResult(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)[source]
Bases:
objectContainer for convergence diagnostics.
- converged
Whether sustained convergence was detected.
- Type:
- assessable
Whether convergence could be assessed from available data.
- Type:
- convergence_time_ns
Start time of the first sustained converged period.
- Type:
float | None
- window_size_ns
Sliding window width in ns.
- Type:
- step_size_ns
Sliding window stride in ns.
- Type:
- slope_threshold
Absolute slope cutoff used for convergence.
- Type:
- sustained_for_ns
Required sustained duration below slope threshold.
- Type:
- message
Human-readable status message.
- Type:
- converged: bool
- assessable: bool
- window_size_ns: float
- step_size_ns: float
- slope_threshold: float
- sustained_for_ns: float
- message: str
- __init__(converged, assessable, convergence_time_ns, window_start_times_ns, window_mean_values, slope_times_ns, slopes, window_size_ns, step_size_ns, slope_threshold, sustained_for_ns, message)
- polyzymd.analyses.shared.find_convergence_time(time_ns, values, window_size_ns=15.0, step_size_ns=5.0, slope_threshold=0.0005, sustained_for_ns=15.0)[source]
Find sustained convergence time using a sliding-window slope heuristic.
- Parameters:
time_ns (array_like) – Monotonically increasing time values in ns.
values (array_like) – Signal values sampled at
time_ns.window_size_ns (float, optional) – Width of each averaging window in ns.
step_size_ns (float, optional) – Sliding step between successive windows in ns.
slope_threshold (float, optional) – Absolute slope threshold for classifying a window-to-window change as converged.
sustained_for_ns (float, optional) – Required cumulative duration below threshold before declaring convergence.
- Returns:
Full convergence diagnostics, including intermediate window means and slope traces.
- Return type:
ConvergenceResult
- Raises:
ValueError – Raised when inputs are invalid.
- polyzymd.analyses.shared.get_theme(plot_settings)[source]
Return the resolved
PlotThemefrom plot_settings.- Parameters:
plot_settings (PlotSettings) – Global plot settings (carries a
.themeproperty).- Return type:
PlotTheme
- polyzymd.analyses.shared.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]
Apply standard axis chrome from the theme.
Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.
- polyzymd.analyses.shared.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]
Apply legend with themed defaults.
Uses
theme.legend_locandtheme.legend_bboxunless overridden by the caller. Extra kwargs are forwarded toax.legend().- Parameters:
ax (matplotlib Axes) – Target axes.
plot_settings (PlotSettings) – Global plot settings.
loc (str, optional) – Override
theme.legend_loc.bbox_to_anchor (tuple of float or None, optional) – Override
theme.legend_bbox. PassNoneexplicitly to suppress the bbox (e.g. for inside-axes placement).fontsize (int, optional) – Override
theme.legend_fontsize.**kwargs (Any) – Forwarded to
ax.legend().
- polyzymd.analyses.shared.get_colors(n, plot_settings)[source]
Get n distinct colors from the configured palette.
Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.
- polyzymd.analyses.shared.get_output_path(output_dir, name, plot_settings)[source]
Generate output file path with correct format extension.
- Parameters:
output_dir (Path) – Output directory.
name (str) – Base filename (without extension).
plot_settings (PlotSettings) – Global plot settings (carries
format).
- Returns:
Full output path with extension.
- Return type:
Path
- polyzymd.analyses.shared.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]
Save figure with DPI, watermark, and optional experimental stamp.
- Parameters:
fig (matplotlib Figure) – Figure to save.
output_path (Path) – Output file path.
plot_settings (PlotSettings) – Global plot settings (carries
dpiandtheme).experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.
close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.
- Returns:
Path to saved figure.
- Return type:
Path
- polyzymd.analyses.shared.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]
Render grouped bars with optional error bars and reference line.
Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from
plot_settings.theme. Callers can override any of them via**style_overridesusing the theme field names as keys.- Parameters:
ax (matplotlib Axes) – Target axes.
x (np.ndarray) – 1-D array of group centre positions (e.g.
np.arange(n_groups)).series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.
colors (sequence) – One colour per condition (same length as series).
plot_settings (PlotSettings) – Global plot settings.
bar_width (float | None, optional) – Width of each individual bar. When
None(default) the width is computed as0.8 / len(series).show_error (bool, optional) – If
False, error bars are suppressed, by defaultTrue.reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to
Noneto skip, by default0.0.reference_label (str, optional) – Legend label for the reference line, by default
"Neutral (0)".replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as
replicate_values[condition_idx][group_idx]-> sequence of floats (one per replicate). WhenNone(default), no dots are drawn.**style_overrides – Override any theme field for this call only. Accepted keys:
bar_alpha,bar_capsize,bar_edgecolor,bar_linewidth,dot_size,dot_alpha,dot_color,reference_line_color,reference_line_style,reference_line_width.
- polyzymd.analyses.shared.find_json(directory, preferred, glob_pattern='*.json')[source]
Locate a JSON result file inside directory.
Tries the preferred filename first; if it does not exist, falls back to the first file matching glob_pattern (sorted lexicographically so results are deterministic).
- polyzymd.analyses.shared.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]
Annotate heatmap cells with formatted values.
Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).
- Parameters:
ax (matplotlib Axes) – The axes containing the heatmap image.
matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.
plot_settings (PlotSettings) – Global plot settings.
fmt (str, optional) – Format spec for the value, by default
".2f".fontsize (int | None, optional) – Annotation font size. When
None(default), usesplot_settings.theme.annotation_fontsize.threshold (float, optional) – Absolute-value threshold above which text turns white.
sem_matrix (np.ndarray | None, optional) – If provided, a second line
±{sem}is appended when the SEM value is finite.show_sign (bool, optional) – Prefix positive values with
"+", by defaultTrue.linespacing (float | None, optional) – Passed to
ax.text(linespacing=...)when SEM is shown.
- polyzymd.analyses.shared.symmetric_clim(values, pad=0.1)[source]
Compute symmetric colour limits centred on zero.
polyzymd.analyses.shared.loader
Trajectory loading utilities for PolyzyMD analysis.
This module provides config-aware trajectory loading that understands PolyzyMD’s directory structure and daisy-chain continuation patterns. File discovery is delegated to the active simulation engine so that both OpenMM and GROMACS directory layouts are handled transparently.
Key Features
Config-based path resolution (config.yaml is single source of truth)
Engine-aware file discovery (OpenMM daisy-chain, GROMACS flat layout)
Automatic detection of daisy-chain trajectory segments
Support for both scratch and projects directories
Lazy loading and memory-efficient iteration
- class polyzymd.analyses.shared.loader.TrajectoryInfo(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)[source]
Bases:
objectInformation about discovered trajectory files.
- topology_file
Path to topology file (PDB)
- Type:
Path
- trajectory_files
List of trajectory files (DCD) in order
- Type:
list[Path]
- n_segments
Number of daisy-chain segments
- Type:
- working_directory
Base working directory for this replicate
- Type:
Path
- replicate
Replicate number
- Type:
- topology_file: Path
- n_segments: int = 0
- working_directory: Path
- replicate: int = 1
- property n_trajectory_files: int
Number of trajectory files found.
- validate()[source]
Validate that all files exist.
- __init__(topology_file, trajectory_files=<factory>, n_segments=0, working_directory=<factory>, replicate=1)
- class polyzymd.analyses.shared.loader.TrajectoryLoader(config, engine_override=None)[source]
Bases:
objectConfig-aware trajectory loader for PolyzyMD simulations.
This class handles the complexity of finding and loading trajectories from PolyzyMD’s output structure, including:
Daisy-chain continuation segments (OpenMM)
Flat production directories (GROMACS)
Scratch vs projects directory resolution
Multiple replicates
File discovery is delegated to the simulation engine resolved from the config’s
enginefield. The engine is created lazily on the first call that needs it, so construction remains cheap.- Parameters:
config (SimulationConfig) – PolyzyMD simulation configuration.
engine_override (str or None, optional) – Force a specific engine name (
"openmm"or"gromacs") instead of readingconfig.engine.
Examples
>>> from polyzymd.config import load_config >>> config = load_config("config.yaml") >>> loader = TrajectoryLoader(config) >>> >>> # Load single replicate >>> u = loader.load_universe(replicate=1) >>> print(f"Loaded {len(u.trajectory)} frames") >>> >>> # Get trajectory info without loading >>> info = loader.get_trajectory_info(replicate=1) >>> print(f"Found {info.n_segments} segments") >>> >>> # Load multiple replicates >>> for rep in range(1, 6): ... u = loader.load_universe(replicate=rep) ... # ... analyze >>> >>> # Explicit engine override for GROMACS directories >>> loader = TrajectoryLoader(config, engine_override="gromacs")
Notes
Frame indices in MDAnalysis are 0-indexed. For user-facing output, add 1 to follow PyMOL convention (1-indexed frames).
- __init__(config, engine_override=None)[source]
- get_trajectory_info(replicate)[source]
Get trajectory file information for a replicate.
- Parameters:
replicate (int) – Replicate number (1-indexed)
- Returns:
Information about discovered trajectory files
- Return type:
TrajectoryInfo
- Raises:
FileNotFoundError – If working directory or required files don’t exist
- load_universe(replicate, cache=True)[source]
Load MDAnalysis Universe for a replicate.
- Parameters:
- Returns:
MDAnalysis Universe with trajectory loaded
- Return type:
Universe
Notes
For daisy-chain trajectories, all segments are loaded as a continuous trajectory using MDAnalysis’s ChainReader.
- iter_replicates(replicates)[source]
Iterate over multiple replicates.
- Parameters:
replicates (sequence of int) – Replicate numbers to load
- Yields:
tuple of (int, Universe) – Replicate number and loaded Universe
Examples
>>> for rep, u in loader.iter_replicates([1, 2, 3, 4, 5]): ... rmsf = compute_rmsf(u) ... results[rep] = rmsf
- get_frame_times(replicate, unit='ns')[source]
Get time values for each frame.
- get_timestep(replicate, unit='ps')[source]
Get the trajectory timestep (time between frames).
- clear_cache()[source]
Clear the Universe cache to free memory.
- find_topology(working_dir)[source]
Find topology file in working directory.
Delegates file discovery to the simulation engine. The engine applies its own search order (e.g. PDB preference for GROMACS,
solvated_system.pdbpreference for OpenMM).This method is used by several plugins that pass an explicit
working_dirunrelated to the current replicate. The replicate index is inferred from the directory name when possible (run_<N>), falling back to1.- Parameters:
working_dir (Path) – Directory to search for topology files.
- Returns:
Path to the topology file.
- Return type:
Path
- Raises:
FileNotFoundError – If no topology file is found.
- polyzymd.analyses.shared.loader.parse_time_string(time_str)[source]
Parse a time string with units into value and unit.
- Parameters:
time_str (str) – Time string like “100ns”, “5000ps”, “100 ns”, etc.
- Returns:
Numeric value and unit string
- Return type:
Examples
>>> parse_time_string("100ns") (100.0, "ns") >>> parse_time_string("5000 ps") (5000.0, "ps") >>> parse_time_string("100") # Default to ns (100.0, "ns")
- polyzymd.analyses.shared.loader.convert_time(value, from_unit, to_unit)[source]
Convert time between units.
- polyzymd.analyses.shared.loader.time_to_frame(time, time_unit, timestep, timestep_unit='ps')[source]
Convert time to frame index.
polyzymd.analyses.shared.alignment
Trajectory alignment utilities.
This module provides reusable trajectory alignment infrastructure for all analysis modules. The alignment is performed in-memory using MDAnalysis and removes rotational drift and center-of-mass motion that can add noise to distance and flexibility measurements.
Reference Modes
centroid: Align to a representative aligned frame (closest to aligned mean structure). Best for measuring fluctuations around a representative equilibrium state.
average: Align to mathematical average structure. Best for pure thermal fluctuation analysis (note: average may have unphysical geometry).
frame: Align to a specific user-specified frame number. Best for comparing to a known functional conformation.
external: Align to an external PDB structure (e.g., crystal structure). Best for measuring deviations from a catalytically competent reference geometry that is independent of the simulation conditions.
Usage
>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory
>>>
>>> # Create configuration
>>> config = AlignmentConfig(
... enabled=True,
... reference_mode="centroid",
... selection="protein and name CA",
... )
>>>
>>> # Align trajectory in-place
>>> ref_frame = align_trajectory(universe, config, start_frame=100)
>>> # Universe is now aligned; ref_frame is the reference frame index
Notes
Alignment modifies the Universe in-place. If you need the original coordinates, load a fresh Universe after alignment.
The alignment selection should typically be stable atoms that define the reference frame. Common choices: - “protein and name CA”: Backbone alpha-carbons (default) - “protein and backbone”: Full backbone (N, CA, C) - “protein”: All protein atoms (includes flexible side chains)
References
MDAnalysis alignment: https://docs.mdanalysis.org/stable/documentation_pages/analysis/align.html
- class polyzymd.analyses.shared.alignment.AlignmentConfig(*, enabled=True, reference_mode='centroid', reference_frame=None, selection='protein and name CA', centroid_selection='protein', reference_file=None)[source]
Bases:
BaseModelConfiguration for trajectory alignment.
This model defines how a trajectory should be aligned before analysis. By default, alignment is ENABLED to remove rotational drift and COM motion that can add noise to distance measurements.
- enabled
Whether to align the trajectory before analysis. Default is True. When enabled, the user is notified via INFO-level logging.
- Type:
- reference_mode
How to select the reference structure:
“centroid”: Representative aligned frame (closest to aligned mean)
“average”: Mathematical average structure
“frame”: Specific frame number
- Type:
ReferenceMode
- reference_frame
Frame number (1-indexed) when reference_mode=”frame”. Required when mode is “frame”, ignored otherwise.
- Type:
int | None
- selection
MDAnalysis selection string for superposition. Atoms matching this selection are used to compute the optimal rotation/translation. Default: “protein and name CA” (backbone alpha-carbons).
- Type:
- centroid_selection
Selection for representative-frame detection when mode=”centroid”. Default: “protein” (all protein atoms for better discrimination).
- Type:
Examples
>>> # Default: align to centroid using CA atoms >>> config = AlignmentConfig()
>>> # Align to a specific frame >>> config = AlignmentConfig( ... reference_mode="frame", ... reference_frame=500, ... )
>>> # Disable alignment (not recommended for most analyses) >>> config = AlignmentConfig(enabled=False)
- enabled: bool
- reference_mode: ReferenceMode
- selection: str
- centroid_selection: str
- reference_file: Path | None
- validate_reference_params()[source]
Validate reference_frame and reference_file for their modes.
- to_dict()[source]
Convert to dictionary for cache key hashing.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.alignment.align_trajectory(universe, config, start_frame=0, stop_frame=None)[source]
Align trajectory in-memory to a reference structure.
This function performs trajectory alignment using MDAnalysis, removing rotational drift and center-of-mass motion. The alignment is performed in-place, modifying the Universe’s coordinates.
- Parameters:
universe (Universe) – MDAnalysis Universe to align. Will be modified in-place.
config (AlignmentConfig) – Alignment configuration specifying reference mode and selections.
start_frame (int, optional) – First frame to consider (0-indexed). Default is 0. Frames before this are typically equilibration and should be excluded.
stop_frame (int, optional) – Last frame (exclusive). If None, uses all frames after start_frame.
- Returns:
Reference frame index (0-indexed), or None if using average mode. This can be stored in results for reproducibility.
- Return type:
int | None
- Raises:
ImportError – If MDAnalysis is not available.
ValueError – If reference_mode is “frame” but the frame number is out of range.
Notes
When alignment is performed, an INFO-level log message is emitted to notify the user. This ensures users are aware that their trajectory coordinates have been modified.
Examples
>>> from polyzymd.analyses.shared.alignment import AlignmentConfig, align_trajectory >>> >>> config = AlignmentConfig(reference_mode="centroid") >>> ref_frame = align_trajectory(universe, config, start_frame=100) >>> print(f"Aligned to frame {ref_frame}")
polyzymd.analyses.shared.autocorrelation
Autocorrelation analysis for independent sampling.
MD trajectories are highly correlated in time - consecutive frames are not independent samples. This module provides tools to:
Compute the autocorrelation function (ACF) of an observable
Estimate the correlation time (τ) from the ACF
Compute statistical inefficiency (g) for proper uncertainty quantification
Select independent frames based on τ for proper statistics
Key Concepts
Autocorrelation function (ACF): Measures how correlated a signal is with itself at different time lags. ACF(0) = 1, and ACF decays toward 0.
Correlation time (τ): Characteristic time for decorrelation. Frames separated by > 2τ are approximately independent.
Statistical inefficiency (g): Factor by which variance is inflated due to correlation. g = 1 + 2*Σ C(t)*(1-t/N). N_eff = N/g.
Independent samples: For proper SEM calculation, we need N_eff independent samples, not N_frames correlated observations.
Methods for τ estimation
First zero crossing: τ is lag where ACF first crosses zero
Exponential fit: Fit ACF = exp(-t/τ) and extract τ
Integration: τ = ∫ACF(t)dt from 0 to first zero (or cutoff)
Statistical Validity
- The number of effective independent samples (N_eff) is computed as:
N_eff = N / g = N / (1 + 2*Σ C(t)*(1-t/N))
This matches the algorithm from Chodera et al. (2007) with the finite-size correction factor (1-t/N). When N_eff < 10, statistical estimates (mean, SEM) may be unreliable, and users should be warned per LiveCoMS best practices (Grossfield et al., 2018).
For multiple timeseries of different lengths (e.g., replicates), use statistical_inefficiency_multiple() which correctly handles the averaging.
References
Flyvbjerg & Petersen (1989) J. Chem. Phys. 91:461 (block averaging)
Chodera et al. (2007) J. Chem. Theory Comput. 3:26 (statistical inefficiency)
Grossfield et al. (2018) LiveCoMS 1:5067 (uncertainty quantification)
- class polyzymd.analyses.shared.autocorrelation.CorrelationTimeMethod(value)[source]
-
Method for estimating correlation time from ACF.
- FIRST_ZERO = 'first_zero'
- EXPONENTIAL_FIT = 'exponential_fit'
- INTEGRATION = 'integration'
- class polyzymd.analyses.shared.autocorrelation.ACFResult(lags, acf, timestep, timestep_unit, n_samples)[source]
Bases:
objectResult of autocorrelation function computation.
- lags
Time lags in the same units as timestep
- Type:
NDArray[np.float64]
- acf
Autocorrelation values (normalized, ACF[0] = 1)
- Type:
NDArray[np.float64]
- timestep
Time between frames
- Type:
- timestep_unit
Unit of timestep (e.g., “ps”, “ns”)
- Type:
- n_samples
Number of samples in the original timeseries
- Type:
- lags: numpy.typing.NDArray.numpy.float64
- acf: numpy.typing.NDArray.numpy.float64
- timestep: float
- timestep_unit: str
- n_samples: int
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(lags, acf, timestep, timestep_unit, n_samples)
- class polyzymd.analyses.shared.autocorrelation.CorrelationTimeResult(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)[source]
Bases:
objectResult of correlation time estimation.
- tau
Estimated correlation time
- Type:
- tau_unit
Unit of tau (same as timestep unit)
- Type:
- method
Method used for estimation
- Type:
- n_independent
Estimated number of independent samples in trajectory
- Type:
- statistical_inefficiency
g = 1 + 2*tau/dt, factor by which variance is inflated
- Type:
- warning
Warning message if statistics may be unreliable (e.g., N_ind < 10)
- Type:
str | None
- tau: float
- tau_unit: str
- method: str
- n_independent: int
- statistical_inefficiency: float
- property is_reliable: bool
Return True if statistics are likely reliable (N_ind >= 10).
- to_dict()[source]
Convert to dictionary for serialization.
- __init__(tau, tau_unit, method, n_independent, statistical_inefficiency, warning=None)
- polyzymd.analyses.shared.autocorrelation.compute_acf(timeseries, max_lag=None, timestep=1.0, timestep_unit='frames')[source]
Compute autocorrelation function of a 1D timeseries.
Uses FFT-based computation for efficiency.
- Parameters:
timeseries (array_like) – 1D array of values (e.g., RMSD over time, distance over time)
max_lag (int, optional) – Maximum lag to compute (in frames). Default is N//4 where N is the length of the timeseries.
timestep (float, optional) – Time between frames. Default is 1.0.
timestep_unit (str, optional) – Unit of timestep. Default is “frames”.
- Returns:
Container with lags, acf values, and metadata
- Return type:
ACFResult
Examples
>>> # Compute ACF of RMSD timeseries >>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...]) # from MDAnalysis >>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps") >>> print(f"ACF at lag 100ps: {acf_result.acf[10]:.3f}")
Notes
The ACF is normalized so that ACF[0] = 1.
For a stationary process: ACF(τ) = <(x(t) - μ)(x(t+τ) - μ)> / σ²
For constant or near-constant timeseries (variance below a small epsilon), this function returns a defined degenerate ACF with ACF[0] = 1 and all positive lags set to 0.
- polyzymd.analyses.shared.autocorrelation.estimate_correlation_time(acf_or_timeseries, timestep=1.0, timestep_unit='frames', method='integration', n_frames=None)[source]
Estimate correlation time from ACF or raw timeseries.
- Parameters:
acf_or_timeseries (ACFResult or array_like) – Either an ACFResult from compute_acf(), or a raw timeseries
timestep (float, optional) – Time between frames (only used if passing raw timeseries)
timestep_unit (str, optional) – Unit of timestep (only used if passing raw timeseries)
method ({"first_zero", "exponential_fit", "integration"}) – Method for estimating τ: - “first_zero”: Lag where ACF first crosses zero - “exponential_fit”: Fit ACF = exp(-t/τ) - “integration”: τ = ∫ACF(t)dt (recommended, most robust)
n_frames (int, optional) – Total number of frames (for computing n_independent). Only needed if passing ACFResult.
- Returns:
Contains tau, method used, n_independent, statistical_inefficiency
- Return type:
CorrelationTimeResult
Examples
>>> acf_result = compute_acf(rmsd, timestep=10.0, timestep_unit="ps") >>> tau_result = estimate_correlation_time(acf_result, method="integration") >>> print(f"Correlation time: {tau_result.tau:.1f} {tau_result.tau_unit}") >>> print(f"Independent samples: {tau_result.n_independent}")
Notes
- The “integration” method is most robust for noisy ACFs. It computes:
τ = ∫₀^∞ ACF(t) dt ≈ Σ ACF[i] * dt
Integration stops at first zero crossing to avoid noise contribution.
- polyzymd.analyses.shared.autocorrelation.get_independent_indices(n_frames, correlation_time, timestep=1.0, start_frame=0)[source]
Get frame indices for independent samples.
Selects frames separated by at least 2*τ (correlation time) to ensure approximate independence for statistical analysis.
- Parameters:
n_frames (int) – Total number of frames in trajectory
correlation_time (float) – Correlation time τ (in same units as timestep)
timestep (float, optional) – Time between frames. Default is 1.0.
start_frame (int, optional) – First frame to consider (after equilibration). Default is 0. Note: Frame indices are 0-indexed internally, but user-facing documentation uses 1-indexed (PyMOL convention).
- Returns:
Array of frame indices (0-indexed) that are approximately independent
- Return type:
NDArray[np.int64]
Examples
>>> # Get independent frames for RMSF calculation >>> tau_result = estimate_correlation_time(rmsd, timestep=10.0) >>> indices = get_independent_indices( ... n_frames=10000, ... correlation_time=tau_result.tau, ... timestep=10.0, ... start_frame=1000, # Skip first 1000 frames for equilibration ... ) >>> print(f"Using {len(indices)} independent frames")
Notes
Frame indices returned are 0-indexed (for direct use with MDAnalysis). When displaying to users, add 1 for PyMOL convention.
The spacing is set to 2*τ/timestep, which gives frames with negligible correlation (ACF < 0.05 for exponential decay).
- polyzymd.analyses.shared.autocorrelation.statistical_inefficiency(timeseries, mintime=3, fft=True)[source]
Compute statistical inefficiency g directly from a timeseries.
The statistical inefficiency g is the factor by which the variance of the sample mean is increased due to correlation:
Var(mean) = Var(x) * g / N
This is computed as: g = 1 + 2 * Σ C(t) * (1 - t/N)
where C(t) is the normalized autocorrelation function and the sum includes the finite-size correction factor (1 - t/N) per Chodera et al. (2007).
- Parameters:
timeseries (array_like) – 1D array of values (e.g., contact binary array, RMSD over time)
mintime (int) – Minimum number of lags to compute before checking for zero crossing. Prevents early termination from noise. Default is 3.
fft (bool) – If True, use FFT-based ACF computation (faster). Default is True.
- Returns:
Statistical inefficiency g (>= 1.0). The number of effective independent samples is N_eff = N / g.
- Return type:
Examples
>>> # Binary contact timeseries >>> contacts = np.array([0, 1, 1, 1, 0, 0, 1, 1, ...]) >>> g = statistical_inefficiency(contacts) >>> n_eff = len(contacts) / g >>> print(f"Effective samples: {n_eff:.1f}")
>>> # Continuous observable >>> rmsd = np.array([1.2, 1.3, 1.25, 1.4, ...]) >>> g = statistical_inefficiency(rmsd)
Notes
This implementation follows the algorithm from Chodera et al. (2007) J. Chem. Theory Comput. 3:26, with the finite-size correction.
For binary (0/1) data, the algorithm works correctly as the variance of a Bernoulli random variable is p(1-p).
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
- polyzymd.analyses.shared.autocorrelation.statistical_inefficiency_multiple(timeseries_list, mintime=3)[source]
Compute statistical inefficiency from multiple timeseries of different lengths.
This is critical for aggregating replicates with different frame counts. The algorithm computes a global mean μ across all timeseries, then averages the ACF numerator and denominator separately before computing g.
- Parameters:
- Returns:
Statistical inefficiency g (>= 1.0)
- Return type:
Examples
>>> # Three replicates with different lengths >>> ts1 = np.array([0, 1, 1, 0, 0, 1]) # 6 frames >>> ts2 = np.array([1, 1, 0, 0, 0]) # 5 frames >>> ts3 = np.array([0, 0, 1, 1, 1, 0, 1]) # 7 frames >>> g = statistical_inefficiency_multiple([ts1, ts2, ts3])
Notes
This implementation follows the algorithm from PyMBAR’s statistical_inefficiency_multiple(), adapted without the PyMBAR dependency.
The algorithm:
Compute global mean μ across all timeseries
For each lag t: - Compute sum of (x - μ) products across all timeseries where t < N_k - Compute sum of sample counts across all timeseries where t < N_k - Average to get C(t)
Sum with finite-size correction
References
Chodera et al. (2007) J. Chem. Theory Comput. 3:26
- polyzymd.analyses.shared.autocorrelation.n_effective(n_samples, g)[source]
Compute number of effective independent samples.
- polyzymd.analyses.shared.autocorrelation.check_statistical_reliability(n_eff, threshold=10)[source]
Check if statistics are reliable based on effective sample count.
- Parameters:
- Returns:
is_reliable (bool) – True if n_eff >= threshold
warning (str | None) – Warning message if not reliable, None otherwise
- Return type:
Examples
>>> g = statistical_inefficiency(contacts) >>> n_eff = n_effective(len(contacts), g) >>> is_ok, warning = check_statistical_reliability(n_eff) >>> if not is_ok: ... print(warning)
polyzymd.analyses.shared.statistics
Statistical functions for replicate aggregation.
This module provides statistical utilities for combining results across multiple simulation replicates with proper error propagation.
Key design decisions: - All uncertainties are reported as Standard Error of the Mean (SEM) - SEM = std / sqrt(N) where N is the number of independent samples - Hierarchical aggregation preserves proper statistics at each level
Functions
- compute_sem
Standard error of the mean for a 1D array
- aggregate_per_residue_stats
Combine per-residue values across replicates
- aggregate_region_stats
Combine region-averaged values across replicates
- weighted_mean_with_sem
Weighted average with proper error propagation
- class polyzymd.analyses.shared.statistics.StatResult(mean, sem, n_samples)[source]
Bases:
objectContainer for mean +/- SEM results.
- mean
The mean value
- Type:
- sem
Standard error of the mean
- Type:
- n_samples
Number of samples used in computation
- Type:
- mean: float
- sem: float
- n_samples: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(mean, sem, n_samples)
- class polyzymd.analyses.shared.statistics.PerResidueStats(residue_ids, means, sems, n_replicates)[source]
Bases:
objectContainer for per-residue statistics across replicates.
- residue_ids
Residue identifiers (1-indexed, following PyMOL convention)
- Type:
NDArray[np.int64]
- means
Mean value for each residue across replicates
- Type:
NDArray[np.float64]
- sems
SEM for each residue across replicates
- Type:
NDArray[np.float64]
- n_replicates
Number of replicates aggregated
- Type:
- residue_ids: numpy.typing.NDArray.numpy.int64
- means: numpy.typing.NDArray.numpy.float64
- sems: numpy.typing.NDArray.numpy.float64
- n_replicates: int
- to_dict()[source]
Convert to dictionary for JSON serialization.
- __init__(residue_ids, means, sems, n_replicates)
- polyzymd.analyses.shared.statistics.compute_sem(values, ddof=1)[source]
Compute mean and standard error of the mean.
SEM = std / sqrt(N) where N is the number of samples.
- Parameters:
values (array_like) – 1D array of values (e.g., one value per replicate)
ddof (int, optional) – Delta degrees of freedom for std calculation. Default is 1 (Bessel’s correction for sample std).
- Returns:
Container with mean, sem, and n_samples
- Return type:
StatResult
Examples
>>> values = [2.5, 2.7, 2.6, 2.4, 2.8] # RMSF from 5 replicates >>> result = compute_sem(values) >>> print(f"RMSF = {result.mean:.2f} +/- {result.sem:.2f} A") RMSF = 2.60 +/- 0.07 A
Notes
For a single value, SEM is undefined (returns 0.0).
- polyzymd.analyses.shared.statistics.aggregate_per_residue_stats(per_replicate_values, residue_ids=None)[source]
Aggregate per-residue values across replicates.
For each residue, computes mean +/- SEM across all replicates. This is the correct way to aggregate per-residue RMSF values.
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate. All arrays must have the same length.
residue_ids (array, optional) – 1-indexed residue identifiers. If None, uses 1, 2, 3, … Following PyMOL convention (1-indexed).
- Returns:
Container with residue_ids, means, sems, n_replicates
- Return type:
PerResidueStats
- Raises:
ValueError – If arrays have inconsistent lengths or no replicates provided
- polyzymd.analyses.shared.statistics.aggregate_region_stats(per_replicate_values, residue_mask=None)[source]
Aggregate region-averaged values across replicates.
For whole-protein or region-specific metrics, this computes the mean of per-replicate averages, with SEM across replicates.
This implements the correct hierarchical aggregation: 1. First average within each replicate (over selected residues) 2. Then compute mean +/- SEM across replicate means
- Parameters:
per_replicate_values (sequence of arrays) – List/tuple of 1D arrays, each containing per-residue values from one replicate.
residue_mask (bool array, optional) – Boolean mask for residue selection. If None, uses all residues.
- Returns:
Mean +/- SEM of region-averaged values across replicates
- Return type:
StatResult
- polyzymd.analyses.shared.statistics.weighted_mean_with_sem(means, sems, weights=None)[source]
Compute weighted mean with proper error propagation.
Useful for combining results from different conditions or analyses with different uncertainties.
- Parameters:
means (array_like) – Mean values from each source
sems (array_like) – SEM values from each source
weights (array_like, optional) – Weights for each source. If None, uses inverse-variance weighting (1/sem^2), which is optimal for independent measurements.
- Returns:
Weighted mean with propagated uncertainty
- Return type:
StatResult
Notes
- For inverse-variance weighting, the combined SEM is:
SEM_combined = 1 / sqrt(sum(1/SEM_i^2))
- For arbitrary weights, uses standard error propagation:
SEM_combined = sqrt(sum((w_i * SEM_i)^2)) / sum(w_i)
polyzymd.analyses.shared.plotting
Shared plotting utilities for analysis plugins.
This module provides reusable plotting helper functions extracted from the plotter infrastructure. Analysis plugins import these functions to apply consistent styling, save figures with watermarks, and render common chart elements (grouped bars, heatmap annotations, etc.) without inheriting from a base class.
All functions accept a PlotSettings object (from
polyzymd.config.comparison) so that user-configured themes, palettes,
and DPI settings are respected automatically.
Examples
>>> from polyzymd.analyses.shared.plotting import (
... apply_axis_style, apply_legend, get_colors, save_figure,
... )
>>>
>>> fig, ax = plt.subplots()
>>> colors = get_colors(3, plot_settings)
>>> ax.bar(x, y, color=colors[0])
>>> apply_axis_style(ax, plot_settings, title="My Plot", ylabel="Value (Å)")
>>> apply_legend(ax, plot_settings)
>>> save_figure(fig, output_dir / "my_plot.png", plot_settings)
- polyzymd.analyses.shared.plotting.get_theme(plot_settings)[source]
Return the resolved
PlotThemefrom plot_settings.- Parameters:
plot_settings (PlotSettings) – Global plot settings (carries a
.themeproperty).- Return type:
PlotTheme
- polyzymd.analyses.shared.plotting.apply_axis_style(ax, plot_settings, *, title=None, xlabel=None, ylabel=None)[source]
Apply standard axis chrome from the theme.
Hides spines according to theme settings, sizes tick labels, and optionally sets title / xlabel / ylabel with themed font sizes.
- polyzymd.analyses.shared.plotting.apply_legend(ax, plot_settings, *, loc=None, bbox_to_anchor=<object object>, fontsize=None, **kwargs)[source]
Apply legend with themed defaults.
Uses
theme.legend_locandtheme.legend_bboxunless overridden by the caller. Extra kwargs are forwarded toax.legend().- Parameters:
ax (matplotlib Axes) – Target axes.
plot_settings (PlotSettings) – Global plot settings.
loc (str, optional) – Override
theme.legend_loc.bbox_to_anchor (tuple of float or None, optional) – Override
theme.legend_bbox. PassNoneexplicitly to suppress the bbox (e.g. for inside-axes placement).fontsize (int, optional) – Override
theme.legend_fontsize.**kwargs (Any) – Forwarded to
ax.legend().
- polyzymd.analyses.shared.plotting.get_colors(n, plot_settings)[source]
Get n distinct colors from the configured palette.
Tries seaborn first (richer palette support), falls back to a matplotlib colormap sampled at evenly-spaced intervals.
- polyzymd.analyses.shared.plotting.get_output_path(output_dir, name, plot_settings)[source]
Generate output file path with correct format extension.
- Parameters:
output_dir (Path) – Output directory.
name (str) – Base filename (without extension).
plot_settings (PlotSettings) – Global plot settings (carries
format).
- Returns:
Full output path with extension.
- Return type:
Path
- polyzymd.analyses.shared.plotting.save_figure(fig, output_path, plot_settings, *, experimental_features=None, close=True)[source]
Save figure with DPI, watermark, and optional experimental stamp.
- Parameters:
fig (matplotlib Figure) – Figure to save.
output_path (Path) – Output file path.
plot_settings (PlotSettings) – Global plot settings (carries
dpiandtheme).experimental_features (sequence of str or None, optional) – Experimental feature ids to stamp onto the figure.
close (bool, optional) – If True, close the figure after saving. Set False when the caller needs to keep using the figure object.
- Returns:
Path to saved figure.
- Return type:
Path
- polyzymd.analyses.shared.plotting.find_json(directory, preferred, glob_pattern='*.json')[source]
Locate a JSON result file inside directory.
Tries the preferred filename first; if it does not exist, falls back to the first file matching glob_pattern (sorted lexicographically so results are deterministic).
- polyzymd.analyses.shared.plotting.grouped_bars(ax, x, series, colors, plot_settings, *, bar_width=None, show_error=True, reference_line=0.0, reference_label='Neutral (0)', replicate_values=None, **style_overrides)[source]
Render grouped bars with optional error bars and reference line.
Style values (alpha, capsize, edgecolor, linewidth, dot_size, etc.) are read from
plot_settings.theme. Callers can override any of them via**style_overridesusing the theme field names as keys.- Parameters:
ax (matplotlib Axes) – Target axes.
x (np.ndarray) – 1-D array of group centre positions (e.g.
np.arange(n_groups)).series (sequence of (label, means, errors)) – One tuple per condition. means and errors must have the same length as x.
colors (sequence) – One colour per condition (same length as series).
plot_settings (PlotSettings) – Global plot settings.
bar_width (float | None, optional) – Width of each individual bar. When
None(default) the width is computed as0.8 / len(series).show_error (bool, optional) – If
False, error bars are suppressed, by defaultTrue.reference_line (float | None, optional) – Y-value for a horizontal reference line. Set to
Noneto skip, by default0.0.reference_label (str, optional) – Legend label for the reference line, by default
"Neutral (0)".replicate_values (sequence or None, optional) – Per-replicate values for jittered dot overlay. Indexed as
replicate_values[condition_idx][group_idx]-> sequence of floats (one per replicate). WhenNone(default), no dots are drawn.**style_overrides – Override any theme field for this call only. Accepted keys:
bar_alpha,bar_capsize,bar_edgecolor,bar_linewidth,dot_size,dot_alpha,dot_color,reference_line_color,reference_line_style,reference_line_width.
- polyzymd.analyses.shared.plotting.annotate_cells(ax, matrix, plot_settings, *, fmt='.2f', fontsize=None, threshold=0.3, sem_matrix=None, show_sign=True, linespacing=None)[source]
Annotate heatmap cells with formatted values.
Iterates over every element of matrix and places a text label at the corresponding (col, row) position on ax. NaN cells are skipped. Text colour flips between black and white depending on the background intensity (controlled by threshold).
- Parameters:
ax (matplotlib Axes) – The axes containing the heatmap image.
matrix (np.ndarray) – 2-D array of values (rows x cols) matching the heatmap.
plot_settings (PlotSettings) – Global plot settings.
fmt (str, optional) – Format spec for the value, by default
".2f".fontsize (int | None, optional) – Annotation font size. When
None(default), usesplot_settings.theme.annotation_fontsize.threshold (float, optional) – Absolute-value threshold above which text turns white.
sem_matrix (np.ndarray | None, optional) – If provided, a second line
±{sem}is appended when the SEM value is finite.show_sign (bool, optional) – Prefix positive values with
"+", by defaultTrue.linespacing (float | None, optional) – Passed to
ax.text(linespacing=...)when SEM is shown.
- polyzymd.analyses.shared.plotting.symmetric_clim(values, pad=0.1)[source]
Compute symmetric colour limits centred on zero.
polyzymd.analyses.shared.selections
Special selection syntax for distance calculations.
This module provides parsing of extended selection syntax for defining atom positions in distance calculations. It supports:
Standard MDAnalysis selections: “resid 133 and name OD1”
Midpoint of multiple atoms: “midpoint(resid 133 and name OD1 OD2)”
Center of mass of a group: “com(resid 50-75)”
Examples
>>> from polyzymd.analyses.shared.selections import parse_selection_string, get_position
>>>
>>> # Standard selection - single atom
>>> ag = parse_selection(universe, "resid 77 and name OG")
>>> pos = get_position(ag) # Returns position of single atom
>>>
>>> # Midpoint of Asp carboxyl oxygens
>>> ag = parse_selection(universe, "midpoint(resid 133 and name OD1 OD2)")
>>> pos = get_position(ag) # Returns midpoint of OD1 and OD2
>>>
>>> # Center of mass of lid domain
>>> ag = parse_selection(universe, "com(resid 50-75)")
>>> pos = get_position(ag) # Returns COM of residues 50-75
Notes
The midpoint() syntax is particularly useful for catalytic residues where the functional position is between two atoms (e.g., Asp carboxyl oxygens, Glu carboxyl oxygens).
The com() syntax is useful for domain motions where you want to track the center of mass of a group of residues (e.g., lid opening in lipases).
- polyzymd.analyses.shared.selections.translate_selection(selection)[source]
Translate PolyzyMD selection keywords to MDAnalysis equivalents.
This allows users to use the same selection syntax in analysis as they use in config.yaml for restraints and other atom selections.
Translations
pdbindex N→id N(PDB ATOM serial number)
The
pdbindexkeyword refers to the 1-indexed atom serial number from the PDB ATOM record (column 7-11), which is what PyMOL displays as “id”. In MDAnalysis, this is accessed via theidselection keyword.Note: MDAnalysis also has
bynumwhich is 1-indexed positional (i.e., bynum 1 = first atom, bynum 2 = second atom), but this does NOT correspond to PDB serial numbers when there are gaps in numbering. We useidbecause it matches actual PDB serial numbers.- param selection:
Selection string with possible PolyzyMD-specific keywords
- type selection:
str
- returns:
Selection string with MDAnalysis-compatible keywords
- rtype:
str
Examples
>>> translate_selection("pdbindex 100 and name CA") "id 100 and name CA"
>>> translate_selection("midpoint(pdbindex 100 and name OD1 OD2)") "midpoint(id 100 and name OD1 OD2)"
- class polyzymd.analyses.shared.selections.SelectionMode(value)[source]
-
Mode for position calculation from atom selection.
- SINGLE = 'single'
- CENTROID = 'centroid'
- MIDPOINT = 'midpoint'
- COM = 'com'
- class polyzymd.analyses.shared.selections.ParsedSelection(selection, mode, original)[source]
Bases:
objectResult of parsing a selection string.
- selection
The MDAnalysis selection string (without wrapper function)
- Type:
- mode
How to compute the position
- Type:
SelectionMode
- original
The original input string
- Type:
- selection: str
- mode: SelectionMode
- original: str
- __init__(selection, mode, original)
- polyzymd.analyses.shared.selections.parse_selection_string(selection)[source]
Parse a selection string to extract mode and MDAnalysis selection.
Also translates PolyzyMD-specific keywords (like
pdbindex) to their MDAnalysis equivalents (likeid).- Parameters:
selection (str) – Selection string, possibly with special syntax: - “resid 77 and name OG” - standard MDAnalysis - “midpoint(resid 133 and name OD1 OD2)” - midpoint mode - “com(resid 50-75)” - center of mass mode - “pdbindex 100 and name CA” - PolyzyMD pdbindex (translated to id)
- Returns:
Parsed selection with mode and clean selection string
- Return type:
ParsedSelection
Examples
>>> parsed = parse_selection_string("midpoint(resid 133 and name OD1 OD2)") >>> parsed.mode <SelectionMode.MIDPOINT: 'midpoint'> >>> parsed.selection "resid 133 and name OD1 OD2"
>>> parsed = parse_selection_string("pdbindex 100 and name CA") >>> parsed.selection "id 100 and name CA"
- polyzymd.analyses.shared.selections.select_atoms(universe, selection)[source]
Select atoms from universe using potentially special syntax.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)
- Returns:
Selected atoms
- Return type:
AtomGroup
- Raises:
ValueError – If selection matches no atoms, with diagnostic info
- polyzymd.analyses.shared.selections.get_position(atoms, mode=SelectionMode.SINGLE)[source]
Get position from atom group based on mode.
- Parameters:
atoms (AtomGroup) – MDAnalysis AtomGroup
mode (SelectionMode) – How to compute position: - SINGLE: Position of single atom (error if multiple) - CENTROID/MIDPOINT: Center of geometry - COM: Center of mass
- Returns:
3D position vector [x, y, z]
- Return type:
NDArray[np.float64]
- Raises:
ValueError – If mode is SINGLE but multiple atoms selected
- polyzymd.analyses.shared.selections.get_position_from_selection(universe, selection)[source]
Get position from selection string in one step.
This is a convenience function that combines parsing, selection, and position calculation.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string (standard or special syntax)
- Returns:
3D position vector [x, y, z]
- Return type:
NDArray[np.float64]
Examples
>>> # Single atom >>> pos = get_position_from_selection(u, "resid 77 and name OG") >>> >>> # Midpoint of Asp carboxyl >>> pos = get_position_from_selection(u, "midpoint(resid 133 and name OD1 OD2)")
- polyzymd.analyses.shared.selections.validate_selection(universe, selection)[source]
Validate a selection string and return diagnostic info.
- Parameters:
universe (Universe) – MDAnalysis Universe
selection (str) – Selection string to validate
- Returns:
Diagnostic information: - valid: bool - n_atoms: int - mode: str - atoms: list of atom info dicts - error: str (if invalid) - diagnostics: str (detailed diagnostics if invalid)
- Return type:
- polyzymd.analyses.shared.selections.format_selection_for_label(selection)[source]
Convert selection string to a short label for filenames/display.
- Parameters:
selection (str) – Selection string (standard or special syntax)
- Returns:
Short label (e.g., “Asp133_mid” or “Ser77_OG”)
- Return type:
Examples
>>> format_selection_for_label("midpoint(resid 133 and name OD1 OD2)") "res133_mid" >>> format_selection_for_label("resid 77 and name OG") "res77_OG"
polyzymd.analyses.shared.config_hash
Config hashing for analysis cache validation.
When analysis results are cached, we store a hash of the relevant config parameters. If the config changes, we warn the user that cached results may be invalid.
This module provides: - compute_config_hash: Generate a hash of analysis-relevant config parameters - validate_config_hash: Check if stored hash matches current config
- Design Decision:
Config immutability is expected. If a user modifies config parameters (e.g., temperature), they should create a new project directory. The hash validation is a safety check, not an enforcement mechanism.
- polyzymd.analyses.shared.config_hash.compute_config_hash(config)[source]
Compute hash of config parameters relevant to analysis.
Includes parameters that affect trajectory interpretation: - enzyme configuration - substrate configuration - polymer configuration - thermodynamics (temperature, pressure) - output paths (for trajectory location)
Excludes parameters that don’t affect analysis of completed trajectories: - simulation_phases (equilibration_stages/production settings) - force_field (already baked into trajectory)
- Parameters:
config (SimulationConfig) – PolyzyMD simulation configuration
- Returns:
Hex digest of SHA-256 hash (first 16 characters for brevity)
- Return type:
Examples
>>> from polyzymd.config import load_config >>> config = load_config("config.yaml") >>> hash_val = compute_config_hash(config) >>> print(f"Config hash: {hash_val}") Config hash: a3b2c1d4e5f67890
- polyzymd.analyses.shared.config_hash.settings_fingerprint(settings)[source]
Compute a short deterministic fingerprint for analysis settings.
The fingerprint is derived from canonical JSON produced with
json.dumps(settings.model_dump(mode="json"), sort_keys=True), then hashed with SHA-256. It is intended for cache identity, so changing settings (for example contacts cutoff) naturally changes cache filenames.- Parameters:
settings (BaseModel) – Analysis plugin settings model.
- Returns:
First 8 hexadecimal characters of the SHA-256 digest.
- Return type:
- polyzymd.analyses.shared.config_hash.compute_cache_identity(*, config_hash, settings=None, settings_fp=None, cache_params=None, length=12)[source]
Compute a deterministic cache identity across config and settings.
This helper unifies cache identity generation for analysis caches. The identity combines:
Simulation config hash
Analysis settings fingerprint
Extra cache parameters when needed
- Parameters:
config_hash (str) – Hash of simulation config returned by
compute_config_hash().settings (BaseModel or None, optional) – Analysis settings model. Used only when
settings_fpis not provided.settings_fp (str or None, optional) – Precomputed settings fingerprint. If provided, this takes precedence over computing from
settings.cache_params (Mapping[str, object] or None, optional) – Additional cache identity inputs such as equilibration or selection.
length (int, optional) – Number of hex characters to return, by default 12.
- Returns:
Short hex identity safe for filenames.
- Return type:
- Raises:
ValueError – If neither
settingsnorsettings_fpis provided.
- polyzymd.analyses.shared.config_hash.extract_settings_fingerprint_from_path(cache_path)[source]
Extract settings fingerprint from a cache filename when present.
- polyzymd.analyses.shared.config_hash.validate_settings_fingerprint(stored_fingerprint, current_settings, *, warn=True, source=None)[source]
Validate cached settings fingerprint against current analysis settings.
- Parameters:
stored_fingerprint (str or None) – Fingerprint read from cached result metadata or filename.
current_settings (BaseModel) – Current plugin settings used for this analysis invocation.
warn (bool, optional) – Emit warnings on mismatch or missing fingerprint, by default True.
source (str or Path or None, optional) – Optional cache source path for diagnostics.
- Returns:
Truewhen cache settings are compatible with current settings, otherwiseFalse.- Return type:
Notes
Legacy cache files may not encode settings fingerprints. These files are treated as compatible for backward compatibility, with a warning to encourage recomputation.
- polyzymd.analyses.shared.config_hash.validate_config_hash(stored_hash, current_config, warn=True)[source]
Check if stored hash matches current config.
If the hashes don’t match, this indicates the config has changed since the analysis was performed. This could mean: 1. The user modified the config (bad practice - should create new project) 2. The analysis was performed on a different config file 3. A bug in the hashing algorithm
- Parameters:
- Returns:
True if hashes match, False otherwise
- Return type:
Examples
>>> stored_hash = loaded_result.get("config_hash", "") >>> config = load_config("config.yaml") >>> if not validate_config_hash(stored_hash, config): ... print("Warning: Results may be stale!")
polyzymd.analyses.shared.sasa
Shared SASA computation utilities.
This module provides reusable helpers for SASA analyses that need selection validation, atom-level and residue-level SASA traces, and raw artifact persistence.
- class polyzymd.analyses.shared.sasa.SASAComputationResult(atom_sasa_a2, residue_sasa_a2, total_sasa_a2, frames, time_ns, target_atom_indices, context_atom_indices, residue_keys, residue_chainids, residue_resids, residue_resnames)[source]
Bases:
objectContainer for raw SASA computation outputs.
- Parameters:
atom_sasa_a2 (NDArray[np.float64]) – Per-atom SASA trace in Ų, shape
(n_frames, n_target_atoms).residue_sasa_a2 (NDArray[np.float64]) – Per-residue SASA trace in Ų, shape
(n_frames, n_target_residues).total_sasa_a2 (NDArray[np.float64]) – Total target SASA in Ų, shape
(n_frames,).frames (NDArray[np.int64]) – 0-indexed frame indices used for analysis.
time_ns (NDArray[np.float64]) – Time axis in ns corresponding to
frames.target_atom_indices (NDArray[np.int64]) – Universe-global atom indices for target atoms.
context_atom_indices (NDArray[np.int64]) – Universe-global atom indices for context atoms.
residue_keys (list[str]) – Residue identity keys in
chainID:resid:resnameformat.residue_chainids (list[str]) – Chain IDs for each residue key.
residue_resids (list[int]) – Residue IDs for each residue key.
residue_resnames (list[str]) – Residue names for each residue key.
- atom_sasa_a2: numpy.typing.NDArray.numpy.float64
- residue_sasa_a2: numpy.typing.NDArray.numpy.float64
- total_sasa_a2: numpy.typing.NDArray.numpy.float64
- frames: numpy.typing.NDArray.numpy.int64
- time_ns: numpy.typing.NDArray.numpy.float64
- target_atom_indices: numpy.typing.NDArray.numpy.int64
- context_atom_indices: numpy.typing.NDArray.numpy.int64
- __init__(atom_sasa_a2, residue_sasa_a2, total_sasa_a2, frames, time_ns, target_atom_indices, context_atom_indices, residue_keys, residue_chainids, residue_resids, residue_resnames)
- polyzymd.analyses.shared.sasa.resolve_selection_indices(universe, selection, *, role, run_label)[source]
Resolve an MDAnalysis selection to atom indices.
- Parameters:
- Returns:
Selected atom group and its universe-global atom indices.
- Return type:
tuple[Any, NDArray[np.int64]]
- polyzymd.analyses.shared.sasa.validate_target_subset(target_indices, context_indices, *, run_label, target_selection, context_selection)[source]
Validate target-selection atoms are a subset of context-selection atoms.
- Raises:
ValueError – If any target atom is absent from the context selection.
- polyzymd.analyses.shared.sasa.compute_sasa(universe, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, start_frame, stop_frame, timestep_ps, chunk_size=100, stride=1)[source]
Compute target SASA over a selected context.
- Parameters:
universe (Any) – MDAnalysis universe.
run_label (str) – Human-readable run label.
target_selection (str) – Selection of atoms whose SASA is reported.
context_selection (str) – Selection of atoms considered during SASA computation.
probe_radius_nm (float) – Probe radius in nm.
n_sphere_points (int) – Number of sphere points for Shrake-Rupley.
start_frame (int) – First frame index (inclusive).
stop_frame (int) – Last frame index (exclusive).
timestep_ps (float) – Timestep in ps.
chunk_size (int, optional) – Number of analyzed frames processed per Shrake-Rupley chunk, by default 100.
stride (int, optional) – Frame stride applied before chunking (1 = analyze every frame), by default 1.
- Returns:
Raw atom-level, residue-level, and total SASA traces in Ų.
- Return type:
SASAComputationResult
- Raises:
ValueError – If frame bounds or stride/chunk parameters are invalid.
- polyzymd.analyses.shared.sasa.save_sasa_artifacts(npz_path, metadata_path, result, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]
Save raw SASA arrays plus JSON sidecar metadata.
- Parameters:
npz_path (Path) – Output path for compressed NumPy archive.
metadata_path (Path) – Output path for JSON metadata sidecar.
result (SASAComputationResult) – Raw SASA arrays to persist.
run_label (str) – Run label.
target_selection (str) – Target selection string.
context_selection (str) – Context selection string.
probe_radius_nm (float) – Probe radius in nm.
n_sphere_points (int) – Number of sphere points.
equilibration (str) – Equilibration string.
- polyzymd.analyses.shared.sasa.load_sasa_artifacts(npz_path, metadata_path)[source]
Load SASA raw arrays and metadata from disk.
- class polyzymd.analyses.shared.sasa.SASAArtifactContract(schema_name, schema_version, compatibility_version, producer, engine, mode, units, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration, compatibility_hash)[source]
Bases:
objectCanonical contract metadata for a persisted SASA artifact.
- Parameters:
schema_name (str) – Artifact schema identifier.
schema_version (int) – Artifact schema version.
compatibility_version (int) – Compatibility hash payload version.
producer (str) – Producer module name.
engine (str) – SASA engine identifier.
mode (str) – SASA mode. Canonical value is
"atom".units (str) – SASA units. Canonical value is
"A^2".run_label (str) – Human-readable run label.
target_selection (str) – Target atom selection used for SASA reporting.
context_selection (str) – Context atom selection used during SASA computation.
probe_radius_nm (float) – Probe radius in nm.
n_sphere_points (int) – Number of Shrake-Rupley sphere points.
equilibration (str) – Equilibration descriptor.
compatibility_hash (str) – Deterministic compatibility hash.
- schema_name: str
- schema_version: int
- compatibility_version: int
- producer: str
- engine: str
- mode: str
- units: str
- run_label: str
- target_selection: str
- context_selection: str
- probe_radius_nm: float
- n_sphere_points: int
- equilibration: str
- compatibility_hash: str
- __init__(schema_name, schema_version, compatibility_version, producer, engine, mode, units, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration, compatibility_hash)
- class polyzymd.analyses.shared.sasa.SASAArtifactCompatibilityQuery(probe_radius_nm, n_sphere_points, equilibration, selection=None, context_selection=None)[source]
Bases:
objectCompatibility query metadata for sibling artifact lookup.
- Parameters:
probe_radius_nm (float) – Probe radius expected by the consumer.
n_sphere_points (int) – Sphere point count expected by the consumer.
equilibration (str) – Equilibration label expected by the consumer.
selection (str | None, optional) – Target selection string for advisory hash comparison.
context_selection (str | None, optional) – Context selection string for advisory hash comparison.
- probe_radius_nm: float
- n_sphere_points: int
- equilibration: str
- __init__(probe_radius_nm, n_sphere_points, equilibration, selection=None, context_selection=None)
- class polyzymd.analyses.shared.sasa.SASAArtifactCompatibility(is_compatible, is_legacy, schema_version, selection_hash_matches, mismatched_fields)[source]
Bases:
objectCompatibility decision for a single SASA artifact metadata payload.
- Parameters:
is_compatible (bool) – True when metadata-level required fields are compatible.
is_legacy (bool) – True when schema version fields are absent.
schema_version (int | None) – Parsed schema version if present.
selection_hash_matches (bool | None) – Advisory selection hash result when query selections are supplied.
mismatched_fields (tuple[str, ...]) – Names of fields that made the artifact incompatible.
- is_compatible: bool
- is_legacy: bool
- __init__(is_compatible, is_legacy, schema_version, selection_hash_matches, mismatched_fields)
- class polyzymd.analyses.shared.sasa.SASASiblingArtifactMatch(sibling_analysis_dir, npz_path, metadata_path, metadata, compatibility)[source]
Bases:
objectA sibling SASA artifact candidate and its compatibility outcome.
- Parameters:
sibling_analysis_dir (Path) – Directory containing sibling SASA artifacts for the replicate.
npz_path (Path) – Path to the SASA NPZ payload.
metadata_path (Path) – Path to the JSON metadata sidecar.
compatibility (SASAArtifactCompatibility) – Compatibility decision for this artifact.
- sibling_analysis_dir: Path
- npz_path: Path
- metadata_path: Path
- compatibility: SASAArtifactCompatibility
- __init__(sibling_analysis_dir, npz_path, metadata_path, metadata, compatibility)
- polyzymd.analyses.shared.sasa.compute_sasa_artifact_compatibility_hash(*, probe_radius_nm, n_sphere_points, selection, equilibration, context_selection=None)[source]
Compute deterministic compatibility hash for SASA artifacts.
- Parameters:
- Returns:
First 16 characters of the SHA-256 compatibility digest.
- Return type:
- polyzymd.analyses.shared.sasa.build_sasa_artifact_contract(*, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]
Build canonical SASA artifact contract metadata.
- Parameters:
run_label (str) – Human-readable run label.
target_selection (str) – Target atom selection used for SASA reporting.
context_selection (str) – Context atom selection used during SASA computation.
probe_radius_nm (float) – Probe radius in nm.
n_sphere_points (int) – Number of Shrake-Rupley sphere points.
equilibration (str) – Equilibration descriptor.
- Returns:
Fully populated canonical metadata contract.
- Return type:
SASAArtifactContract
- polyzymd.analyses.shared.sasa.build_sasa_artifact_metadata(result, *, run_label, target_selection, context_selection, probe_radius_nm, n_sphere_points, equilibration)[source]
Build flat SASA artifact metadata with schema versioning fields.
- Parameters:
result (SASAComputationResult) – Raw SASA arrays and residue metadata.
run_label (str) – Human-readable run label.
target_selection (str) – Target atom selection.
context_selection (str) – Context atom selection.
probe_radius_nm (float) – Probe radius in nm.
n_sphere_points (int) – Number of Shrake-Rupley sphere points.
equilibration (str) – Equilibration descriptor.
- Returns:
Flat metadata dictionary for JSON sidecar persistence.
- Return type:
- polyzymd.analyses.shared.sasa.check_sasa_artifact_compatibility(metadata, query)[source]
Evaluate metadata-level compatibility for reusable SASA artifacts.
- polyzymd.analyses.shared.sasa.adapt_canonical_sasa_to_exposure(result, *, exposure_threshold)[source]
Adapt canonical shared SASA result into exposure SASA trajectory format.
- Parameters:
result (SASAComputationResult) – Canonical shared SASA computation result in Ų.
exposure_threshold (float) – Relative SASA threshold used by exposure consumers.
- Returns:
SASATrajectoryResultinstance for exposure analysis.- Return type:
Any
- polyzymd.analyses.shared.sasa.find_sibling_sasa_artifacts(replicate_analysis_dir, query)[source]
Find compatible sibling SASA artifacts for a replicate analysis directory.
- Parameters:
replicate_analysis_dir (Path) – Replicate analysis directory, for example
analysis/<cond>/exposure/run_1.query (SASAArtifactCompatibilityQuery) – Compatibility query used to pre-filter candidates.
- Returns:
Compatible sibling artifacts sorted by preference.
- Return type:
list[SASASiblingArtifactMatch]
polyzymd.analyses.shared.surface_exposure
Surface exposure filtering using SASA calculations.
Shared utility for determining which protein residues are surface-exposed based on solvent-accessible surface area (SASA) calculations. Surface exposure data is consumed by multiple analysis plugins (contacts, binding_free_energy, polymer_affinity) for binding preference analysis, since buried residues cannot participate in polymer-protein contacts.
Uses rust_sasa_python for fast SASA computation on initial PDB structures.
The surface exposure threshold determines which residues are considered “exposed” based on their relative SASA (actual SASA / max theoretical SASA). A threshold of 0.2 means residues with at least 20% of their theoretical maximum surface area exposed are considered surface-accessible.
Examples
>>> from polyzymd.analyses.shared.surface_exposure import SurfaceExposureFilter
>>> filter = SurfaceExposureFilter(threshold=0.2)
>>> result = filter.calculate("enzyme.pdb")
>>> print(f"Found {result.exposed_count} surface-exposed residues")
>>> print(result.exposed_by_aa_class())
{'aromatic': 12, 'polar': 25, 'nonpolar': 18, ...}
- class polyzymd.analyses.shared.surface_exposure.ResidueExposure(resid, resname, chain_id, sasa, max_sasa, relative_sasa, is_exposed, aa_class)[source]
Bases:
objectSASA data for a single protein residue.
- resid
Residue ID (1-indexed, as in PDB)
- Type:
- resname
3-letter residue name (e.g., “ALA”, “PHE”)
- Type:
- chain_id
Chain identifier from PDB
- Type:
- sasa
Calculated SASA in Angstrom^2
- Type:
- max_sasa
Maximum theoretical SASA for this residue type (Tien et al. 2013)
- Type:
- relative_sasa
Ratio of actual to maximum SASA (0.0 to ~1.0+)
- Type:
- is_exposed
True if relative_sasa >= threshold
- Type:
- aa_class
Amino acid classification (aromatic, polar, etc.)
- Type:
- resid: int
- resname: str
- chain_id: str
- sasa: float
- max_sasa: float
- relative_sasa: float
- is_exposed: bool
- aa_class: str
- __init__(resid, resname, chain_id, sasa, max_sasa, relative_sasa, is_exposed, aa_class)
- class polyzymd.analyses.shared.surface_exposure.SurfaceExposureResult(residue_exposures=<factory>, threshold=0.2, pdb_path='', probe_radius=1.4, n_points=1000)[source]
Bases:
objectSurface exposure analysis result for a protein structure.
Contains SASA data for all residues and provides convenience methods for filtering and grouping by exposure status and amino acid class.
- residue_exposures
SASA data for each residue
- Type:
list[ResidueExposure]
- threshold
Relative SASA threshold used (e.g., 0.2 for 20%)
- Type:
- pdb_path
Path to the PDB file analyzed
- Type:
- probe_radius
Probe radius used for SASA calculation
- Type:
- n_points
Number of points used for SASA calculation
- Type:
- residue_exposures: list[ResidueExposure]
- threshold: float = 0.2
- pdb_path: str = ''
- probe_radius: float = 1.4
- n_points: int = 1000
- property exposed_count: int
Number of surface-exposed residues.
- property total_count: int
Total number of residues analyzed.
- exposed_by_aa_class()[source]
Count of exposed residues per amino acid class.
- total_by_aa_class()[source]
Count of all residues per amino acid class (exposed + buried).
- exposed_in_selection(resids)[source]
Get exposed residues within a specific selection.
- get_exposure(resid)[source]
Get exposure data for a specific residue.
- Parameters:
resid (int) – Residue ID to look up
- Returns:
Exposure data, or None if resid not found
- Return type:
ResidueExposure or None
- to_dict()[source]
Serialize to dictionary for JSON storage.
- classmethod from_dict(data)[source]
Deserialize from dictionary.
- __init__(residue_exposures=<factory>, threshold=0.2, pdb_path='', probe_radius=1.4, n_points=1000)
- class polyzymd.analyses.shared.surface_exposure.SurfaceExposureFilter(threshold=0.2, probe_radius=1.4, n_points=1000)[source]
Bases:
objectCompute surface exposure from initial PDB structure.
Uses rust_sasa_python for fast SASA calculation. Residues are classified as “exposed” if their relative SASA (actual/max theoretical) meets or exceeds the threshold.
- Parameters:
threshold (float) – Relative SASA threshold (0-1). Residues with SASA/maxSASA >= threshold are considered exposed. Default 0.2 (20% of max theoretical).
probe_radius (float) – Probe radius for SASA calculation in Angstroms. Default 1.4 (water molecule radius).
n_points (int) – Number of points for SASA calculation. Higher = more accurate but slower. Default 1000.
Examples
>>> filter = SurfaceExposureFilter(threshold=0.2) >>> result = filter.calculate("enzyme.pdb") >>> exposed = result.exposed_resids >>> print(f"Found {len(exposed)} exposed residues")
>>> # Get exposed residues by AA class >>> by_class = result.exposed_by_aa_class() >>> print(f"Exposed aromatics: {by_class.get('aromatic', 0)}")
- __init__(threshold=0.2, probe_radius=1.4, n_points=1000)[source]
- calculate(pdb_path)[source]
Calculate surface exposure for all protein residues.
- Parameters:
pdb_path (Path or str) – Path to PDB file (typically the enzyme PDB from config). Should contain only the protein (not polymer/solvent).
- Returns:
Surface exposure data for all residues
- Return type:
SurfaceExposureResult
- Raises:
FileNotFoundError – If PDB file does not exist
ImportError – If rust_sasa_python is not installed
polyzymd.analyses.shared.binding_preference
Binding preference metrics for polymer-protein contacts.
Shared compute layer consumed by contacts, binding_free_energy, and polymer_affinity plugins.
- class polyzymd.analyses.shared.binding_preference.PolymerComposition(*, residue_counts=<factory>, heavy_atom_counts=<factory>)[source]
Bases:
BaseModelPolymer composition data extracted from trajectory.
Contains counts of residues and heavy atoms for each polymer type, enabling normalization of enrichment ratios by polymer availability.
This data is used for dual normalization of binding preference:
Residue-based: Normalizes by number of polymer residues per type. Matches the experimental viewpoint where concentrations are specified in terms of monomer units.
Atom-based: Normalizes by number of heavy atoms (non-hydrogen) per type. Accounts for differences in monomer size, since larger monomers have more surface area and thus more opportunity for contacts.
- residue_counts
Number of residues per polymer type (e.g., {“SBM”: 50, “EGM”: 50})
- heavy_atom_counts
Number of heavy atoms (non-hydrogen) per polymer type. Heavy atoms are defined as all atoms with element != ‘H’.
Examples
>>> composition = PolymerComposition( ... residue_counts={"SBM": 50, "EGM": 50}, ... heavy_atom_counts={"SBM": 750, "EGM": 400}, ... ) >>> composition.total_residues 100 >>> composition.residue_fraction("SBM") 0.5 >>> composition.heavy_atom_fraction("SBM") 0.652 # SBM has larger monomers
- property total_residues: int
Total polymer residues across all types.
- property total_heavy_atoms: int
Total heavy atoms across all polymer types.
- residue_fraction(polymer_type)[source]
Fraction of residues that are this polymer type.
- heavy_atom_fraction(polymer_type)[source]
Fraction of heavy atoms that are this polymer type.
- polymer_types()[source]
Get sorted list of polymer types in this composition.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.BindingPreferenceEntry(*, polymer_type, protein_group, total_contact_frames, mean_contact_fraction, n_residues_in_group, n_exposed_in_group, n_residues_contacted=0, contact_share=0.0, expected_share=0.0, enrichment=None, polymer_residue_count=0, total_polymer_residues=0, polymer_heavy_atom_count=0, total_polymer_heavy_atoms=0)[source]
Bases:
BaseModelSingle entry in the binding preference matrix.
Represents the binding preference metrics for one (polymer_type, protein_group) combination.
Enrichment Interpretation (centered at zero)
The enrichment ratio measures whether a polymer type contacts a protein group more or less than expected based on surface availability.
- enrichment > 0: Preferential binding (more contacts than expected)
+0.5 means “50% more contacts than expected”
+1.0 means “2× as many contacts as expected”
enrichment = 0: Neutral (contact frequency matches surface availability)
- enrichment < 0: Avoidance (fewer contacts than expected)
-0.3 means “30% fewer contacts than expected”
enrichment = -1: Complete avoidance (no contacts at all)
- The expected share is based on protein surface availability:
expected_share = n_exposed_in_group / total_exposed_residues
This normalization answers: “Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally, more than proportionally, or less?”
- polymer_type
Polymer residue type (e.g., “SBM”, “EGM”)
- Type:
- protein_group
Protein group label (e.g., “aromatic”, “charged_positive”)
- Type:
- total_contact_frames
Sum of contact frames across all exposed residues in this group.
- Type:
- mean_contact_fraction
Average per-residue contact fraction within this group.
- Type:
- n_residues_in_group
Total residues in this protein group (exposed + buried)
- Type:
- n_exposed_in_group
Surface-exposed residues in this group (used for enrichment)
- Type:
- n_residues_contacted
Number of exposed residues that had at least one contact
- Type:
- contact_share
Fraction of this polymer’s total contacts that went to this group.
- Type:
- expected_share
Expected contact share based on surface availability (n_exposed_in_group / total_exposed_residues)
- Type:
- enrichment
Zero-centered enrichment: (contact_share / expected_share) - 1
- Type:
float | None
- polymer_residue_count
Number of residues of this polymer type (metadata)
- Type:
- total_polymer_residues
Total polymer residues across all types (metadata)
- Type:
- polymer_heavy_atom_count
Number of heavy atoms for this polymer type (metadata)
- Type:
- total_polymer_heavy_atoms
Total polymer heavy atoms across all types (metadata)
- Type:
- polymer_type: str
- protein_group: str
- total_contact_frames: int
- mean_contact_fraction: float
- n_residues_in_group: int
- n_exposed_in_group: int
- n_residues_contacted: int
- contact_share: float
- expected_share: float
- polymer_residue_count: int
- total_polymer_residues: int
- polymer_heavy_atom_count: int
- total_polymer_heavy_atoms: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.BindingPreferenceResult(**data)[source]
Bases:
BaseModelComplete binding preference analysis result.
Provides enrichment-normalized metrics for polymer-protein binding preferences, answering questions like:
“Does SBMA preferentially bind aromatic residues?”
“How does EGMA’s preference for charged residues compare to SBMA?”
“Which amino acid class does this polymer type prefer?”
Enrichment values are centered at zero: - enrichment > 0: Preferential binding - enrichment = 0: Neutral (random chance) - enrichment < 0: Avoidance
- entries
All (polymer_type × protein_group) combinations
- Type:
list[BindingPreferenceEntry]
- n_frames
Total frames analyzed
- Type:
- total_exposed_residues
Number of surface-exposed residues considered
- Type:
- surface_exposure_threshold
SASA threshold used for surface filtering
- Type:
- polymer_composition
Polymer composition data (residue/atom counts per type, metadata only)
- Type:
PolymerComposition
- system_coverage
System-level coverage metrics collapsed across polymer types. Answers: “What does this polymer mixture collectively cover?”
- Type:
SystemCoverageResult, optional
- schema_version
Version for forward compatibility. Version 4 adds system_coverage.
- Type:
- entries: list[BindingPreferenceEntry]
- n_frames: int
- total_exposed_residues: int
- polymer_composition: PolymerComposition | None
- system_coverage: SystemCoverageResult | None
- binding_preference: PolymerBindingPreferenceResult | None
- schema_version: int
- to_dataframe()[source]
Convert to pandas DataFrame for analysis/plotting.
- Returns:
Columns: polymer_type, protein_group, total_contact_frames, mean_contact_fraction, n_residues_in_group, n_exposed_in_group, n_residues_contacted, contact_share, expected_share, enrichment
- Return type:
pd.DataFrame
- enrichment_matrix()[source]
Get enrichment as nested dict: {polymer_type: {protein_group: value}}.
Enrichment values are centered at zero and normalized by protein surface availability: - > 0: Preferential binding (more contacts than expected) - = 0: Neutral (matches surface availability) - < 0: Avoidance (fewer contacts than expected)
- Returns:
Nested mapping of enrichment values. Missing/invalid values are returned as 0.0.
- Return type:
Examples
>>> matrix = result.enrichment_matrix() >>> print(matrix["SBM"]["aromatic"]) 0.45 # 45% more contacts than expected based on surface availability
- contact_fraction_matrix()[source]
Get mean contact fractions as nested dict.
- contact_share_matrix()[source]
Get contact shares as nested dict.
- get_enrichment(polymer_type, protein_group)[source]
Get enrichment for a specific (polymer_type, protein_group) pair.
- get_entry(polymer_type, protein_group)[source]
Get the full entry for a (polymer_type, protein_group) pair.
- polymer_types()[source]
Get list of polymer types in this result.
- protein_groups()[source]
Get list of protein groups in this result.
- save(path)[source]
Save to JSON file.
- classmethod load(path)[source]
Load from JSON file.
- Parameters:
path (str or Path) – Path to JSON file
- Returns:
Loaded result
- Return type:
BindingPreferenceResult
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.binding_preference.compute_binding_preference(contact_result, surface_exposure, protein_groups, polymer_composition, polymer_types=None, protein_group_selections=None, polymer_type_selections=None, protein_partitions=None)[source]
Compute binding preference from contact results.
This function computes enrichment ratios for each (polymer_type, protein_group) combination, answering: “Does this polymer type preferentially bind this protein group compared to random chance?”
The enrichment calculation accounts for: 1. Surface exposure (only exposed residues are considered) 2. Contact duration (contact frames are summed, not binary counts) 3. Polymer composition (normalization by residue count or heavy atom count)
- Parameters:
contact_result (Any) – Raw contact analysis results from trajectory
surface_exposure (SurfaceExposureResult) – Surface exposure data for filtering buried residues
protein_groups (dict[str, set[int]]) – Mapping of group name to residue IDs in that group. Only surface-exposed residues in each group are counted. Example: {“aromatic”: {12, 45, 67}, “charged”: {23, 34}}
polymer_composition (PolymerComposition) – Polymer composition data (residue and heavy atom counts per type). Used for dual normalization of enrichment ratios.
polymer_types (list[str], optional) – If provided, only compute for these polymer residue types. If None, all polymer types found in contacts are used.
protein_group_selections (dict[str, str], optional) – Original MDAnalysis selections (for metadata/reproducibility)
polymer_type_selections (dict[str, str], optional) – Original MDAnalysis selections (for metadata/reproducibility)
protein_partitions (dict[str, list[str]], optional) – User-defined partitions for system coverage plots. Each partition maps a name to a list of group names from protein_groups. Groups within a partition must be mutually exclusive. Example: {“lid_helices”: [“lid_helix_5”, “lid_helix_10”]}
- Returns:
Binding preference metrics with dual enrichment normalization
- Return type:
BindingPreferenceResult
Notes
Enrichment calculation is centered at zero
For each (polymer_type, protein_group) pair:
contact_share = polymer_contacts_to_group / polymer_total_contacts
Two normalization methods are computed:
Residue-based (matches experimental concentration ratios):
expected_by_residue = polymer_residue_count / total_polymer_residues enrichment_by_residue = (contact_share / expected_by_residue) - 1
Atom-based (accounts for monomer size via heavy atoms):
expected_by_atoms = polymer_heavy_atoms / total_polymer_heavy_atoms enrichment_by_atoms = (contact_share / expected_by_atoms) - 1
Interpretation (both methods): - enrichment > 0: Preferential binding (more contacts than expected) - enrichment = 0: Neutral (matches random chance) - enrichment < 0: Avoidance (fewer contacts than expected) - enrichment = -1: Complete avoidance (no contacts at all)
When to use which metric: - enrichment_by_residue: Direct comparison to experimental concentration ratios - enrichment_by_atoms: Reveals true chemical affinity vs. geometric/steric effects
- polyzymd.analyses.shared.binding_preference.aggregate_binding_preference(results)[source]
Aggregate binding preference across replicates.
Computes mean ± SEM for both residue-based and atom-based enrichment ratios across multiple replicates.
- Parameters:
results (list[BindingPreferenceResult]) – Binding preference results from multiple replicates
- Returns:
Aggregated results with mean and SEM for both normalization methods
- Return type:
AggregatedBindingPreferenceResult
- polyzymd.analyses.shared.binding_preference.aggregate_binding_preference_results(results)
Aggregate binding preference across replicates.
Computes mean ± SEM for both residue-based and atom-based enrichment ratios across multiple replicates.
- Parameters:
results (list[BindingPreferenceResult]) – Binding preference results from multiple replicates
- Returns:
Aggregated results with mean and SEM for both normalization methods
- Return type:
AggregatedBindingPreferenceResult
- class polyzymd.analyses.shared.binding_preference.AggregatedBindingPreferenceEntry(*, polymer_type, protein_group, mean_enrichment=None, sem_enrichment=None, per_replicate_enrichments=<factory>, mean_contact_fraction=0.0, sem_contact_fraction=0.0, mean_contact_share=0.0, expected_share=0.0, n_exposed_in_group=0, n_residues_in_group=0, n_replicates=0)[source]
Bases:
BaseModelAggregated binding preference for one (polymer_type, protein_group) pair.
Contains mean ± SEM across replicates for enrichment based on protein surface availability.
Enrichment values are centered at zero: - > 0: Preferential binding (more contacts than expected by surface area) - = 0: Neutral (contact frequency matches surface availability) - < 0: Avoidance (fewer contacts than expected by surface area)
- The expected share is based on protein surface availability:
expected_share = n_exposed_in_group / total_exposed_residues
This normalization answers: “Given how much of the protein surface is aromatic/charged/etc., does this polymer type contact that surface proportionally, more than proportionally, or less?”
- polymer_type: str
- protein_group: str
- mean_contact_fraction: float
- sem_contact_fraction: float
- mean_contact_share: float
- expected_share: float
- n_exposed_in_group: int
- n_residues_in_group: int
- n_replicates: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedBindingPreferenceResult(**data)[source]
Bases:
BaseModelBinding preference aggregated across replicates.
Contains mean ± SEM for all metrics across multiple replicates. Enrichment is normalized by protein surface availability.
Enrichment values are centered at zero: - > 0: Preferential binding (more contacts than expected by surface area) - = 0: Neutral (contact frequency matches surface availability) - < 0: Avoidance (fewer contacts than expected by surface area)
- entries: list[AggregatedBindingPreferenceEntry]
- n_replicates: int
- total_exposed_residues: int
- polymer_composition: PolymerComposition | None
- system_coverage: AggregatedSystemCoverageResult | None
- binding_preference: AggregatedPolymerBindingPreferenceResult | None
- schema_version: int
- to_dataframe()[source]
Convert to pandas DataFrame.
- enrichment_matrix()[source]
Get mean enrichment as nested dict.
Enrichment values are centered at zero and normalized by protein surface availability: - > 0: Preferential binding (more contacts than expected) - = 0: Neutral (matches surface availability) - < 0: Avoidance (fewer contacts than expected)
- get_entry(polymer_type, protein_group)[source]
Get entry for a (polymer_type, protein_group) pair.
- polymer_types()[source]
Get list of polymer types.
- protein_groups()[source]
Get list of protein groups.
- save(path)[source]
Save to JSON file.
- classmethod load(path)[source]
Load from JSON file.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.SystemCoverageEntry(*, protein_group, total_contact_frames=0, coverage_share=0.0, expected_share=0.0, coverage_enrichment=None, n_exposed_in_group=0, n_residues_in_group=0, polymer_contributions=<factory>)[source]
Bases:
BaseModelSystem-level coverage entry for one protein group.
While BindingPreferenceEntry answers “What does SBMA prefer?”, this entry answers “What fraction of ALL polymer contacts in this system go to this protein group?” — collapsing across polymer types for condition-level analysis.
Use Case
Compare copolymer compositions (conditions) with each other. For example: “Does a 70:30 SBMA:EGMA mixture cover aromatic residues differently than a 30:70 mixture?”
Coverage Enrichment Calculation
For each protein group:
coverage_share = Σ(all polymer contacts to group) / Σ(all polymer contacts) expected_share = n_exposed_in_group / total_exposed_residues coverage_enrichment = (coverage_share / expected_share) - 1
Interpretation (centered at zero):
- coverage_enrichment > 0: Preferential coverage (more than surface predicts)
+0.5 means “50% more coverage than expected”
coverage_enrichment = 0: Neutral (coverage matches surface availability)
- coverage_enrichment < 0: Under-coverage (less than surface predicts)
-0.3 means “30% less coverage than expected”
coverage_enrichment = -1: Complete avoidance (no coverage at all)
- protein_group
Protein group label (e.g., “aromatic”, “charged_positive”)
- Type:
- total_contact_frames
Sum of contact frames from ALL polymer types to this group.
- Type:
- coverage_share
Fraction of all polymer contacts that went to this group.
- Type:
- expected_share
Expected coverage based on protein surface availability (n_exposed_in_group / total_exposed_residues)
- Type:
- coverage_enrichment
Zero-centered enrichment: (coverage_share / expected_share) - 1
- Type:
float | None
- n_exposed_in_group
Surface-exposed residues in this group
- Type:
- n_residues_in_group
Total residues in this group (exposed + buried)
- Type:
- polymer_contributions
Breakdown of coverage by polymer type: {“SBMA”: 0.35, “EGMA”: 0.65} Values sum to 1.0 (fraction of contacts to this group from each polymer)
- protein_group: str
- total_contact_frames: int
- coverage_share: float
- expected_share: float
- n_exposed_in_group: int
- n_residues_in_group: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.PartitionCoverageEntry(*, partition_element, total_contact_frames=0, coverage_share=0.0, expected_share=0.0, coverage_enrichment=None, n_exposed_in_element=0, n_residues_in_element=0, polymer_contributions=<factory>)[source]
Bases:
BaseModelCoverage metrics for one element in a partition.
A partition element is a mutually exclusive subset of protein residues. Within a partition, all elements together cover the entire protein surface exactly once (no residue is counted in multiple elements).
This ensures that: - coverage_share sums to 1.0 across all elements in the partition - expected_share sums to 1.0 across all elements in the partition - enrichment is mathematically valid (no inflated denominators)
- partition_element
Name of this partition element (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)
- Type:
- total_contact_frames
Sum of contact frames from ALL polymer types to residues in this element
- Type:
- coverage_share
Fraction of all polymer contacts that went to this element. Sums to 1.0 across all elements in the partition.
- Type:
- expected_share
Expected coverage based on surface availability (n_exposed / total_exposed). Sums to 1.0 across all elements in the partition.
- Type:
- coverage_enrichment
Zero-centered enrichment: (coverage_share / expected_share) - 1
- Type:
float | None
- n_exposed_in_element
Number of surface-exposed residues in this element
- Type:
- n_residues_in_element
Total residues in this element (exposed + buried)
- Type:
- polymer_contributions
Breakdown of coverage by polymer type (sums to 1.0 for this element)
- partition_element: str
- total_contact_frames: int
- coverage_share: float
- expected_share: float
- n_exposed_in_element: int
- n_residues_in_element: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.PartitionCoverageResult(*, partition_name, partition_type, entries=<factory>, total_coverage_share=1.0, total_expected_share=1.0)[source]
Bases:
BaseModelCoverage analysis for a complete partition of the protein surface.
A partition divides the protein surface into mutually exclusive regions that together cover the entire surface. This ensures valid enrichment calculations where both coverage_share and expected_share sum to 1.0.
Partition Types
aa_class: 5-way partition by amino acid class (aromatic, polar, nonpolar, charged_positive, charged_negative)
binary_custom: 2-way partition for a custom group vs rest_of_protein (e.g., “lid_helix_5” vs “rest_of_protein”)
combined_custom: N+1 way partition with all non-overlapping custom groups plus “rest_of_protein”
- partition_name
Descriptive name (e.g., “aa_class”, “lid_helix_5_vs_rest”)
- Type:
- partition_type
One of: “aa_class”, “binary_custom”, “combined_custom”
- Type:
- entries
Coverage metrics for each element in the partition
- Type:
list[PartitionCoverageEntry]
- total_coverage_share
Validation check: should be ~1.0
- Type:
- total_expected_share
Validation check: should be ~1.0
- Type:
- partition_name: str
- partition_type: Literal['aa_class', 'binary_custom', 'combined_custom', 'user_defined']
- entries: list[PartitionCoverageEntry]
- total_coverage_share: float
- total_expected_share: float
- to_dataframe()[source]
Convert to pandas DataFrame for analysis/plotting.
- coverage_enrichment_dict()[source]
Get coverage enrichment as dict: {element: enrichment}.
- coverage_share_dict()[source]
Get coverage shares as dict: {element: share}.
- expected_share_dict()[source]
Get expected shares as dict: {element: share}.
- get_entry(element_name)[source]
Get the entry for a specific partition element.
- element_names()[source]
Get list of partition element names.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.SystemCoverageResult(*, aa_class_coverage, custom_group_coverages=<factory>, combined_custom_coverage=None, user_defined_partitions=<factory>, n_frames=0, total_contact_frames=0, total_exposed_residues=0, surface_exposure_threshold=None, custom_group_selections=<factory>, polymer_types_included=<factory>, has_overlapping_custom_groups=False, overlapping_group_pairs=<factory>, schema_version=2)[source]
Bases:
BaseModelSystem-level coverage analysis with proper partition structure.
This result uses partitions to ensure mathematically valid enrichment calculations. A partition divides the protein surface into mutually exclusive regions, avoiding the overlap bug where custom groups and AA class groups can inflate the expected_share denominator.
Partition Strategy
AA Class Partition (always computed): 5-way partition by amino acid class. Every surface residue belongs to exactly one class.
Binary Custom Partitions (per custom group): Each custom group is compared to “rest_of_protein”. This answers: “Does my lid_helix_5 have enriched polymer contacts vs non-lid regions?”
Combined Custom Partition (optional): If custom groups don’t overlap, all custom groups + rest_of_protein form a single partition. If groups overlap, this is not computed and an error is raised if explicitly requested.
User-Defined Partitions (from protein_partitions config): Custom partitions specified by the user in the YAML config. Each partition references groups from protein_groups and must be mutually exclusive. ‘rest_of_protein’ is auto-added if the groups don’t cover all exposed protein residues. One plot per partition is generated.
- aa_class_coverage
5-way partition by amino acid class. Always computed.
- Type:
PartitionCoverageResult
- custom_group_coverages
Binary partitions for each custom group vs rest_of_protein. Keys are custom group names.
- combined_custom_coverage
All custom groups + rest_of_protein as a single partition. Only computed if custom groups don’t overlap.
- Type:
PartitionCoverageResult | None
- user_defined_partitions
User-defined partitions from protein_partitions config. Keys are partition names, values are the computed coverage partitions. ‘rest_of_protein’ is auto-added if groups don’t fully cover the protein.
- n_frames
Total frames analyzed
- Type:
- total_contact_frames
Sum of all polymer contacts across all groups
- Type:
- total_exposed_residues
Number of surface-exposed protein residues
- Type:
- surface_exposure_threshold
SASA threshold used for surface filtering
- Type:
float | None
- custom_group_selections
Custom group name to MDAnalysis selection (for metadata)
- has_overlapping_custom_groups
True if custom groups share residues (combined partition not computed)
- Type:
- overlapping_group_pairs
Pairs of custom groups that overlap (for diagnostics)
- schema_version
Schema version (2 = partition-based)
- Type:
- aa_class_coverage: PartitionCoverageResult
- combined_custom_coverage: PartitionCoverageResult | None
- n_frames: int
- total_contact_frames: int
- total_exposed_residues: int
- has_overlapping_custom_groups: bool
- schema_version: int
- get_aa_class_enrichment(aa_class)[source]
Get coverage enrichment for an AA class.
- get_custom_group_enrichment(group_name)[source]
Get coverage enrichment for a custom group (vs rest_of_protein).
- aa_class_enrichment_dict()[source]
Get AA class enrichments as dict: {aa_class: enrichment}.
- custom_group_enrichment_dict()[source]
Get custom group enrichments as dict: {group_name: enrichment}.
Each custom group’s enrichment is relative to rest_of_protein.
- aa_class_names()[source]
Get list of AA class names in canonical order.
- custom_group_names()[source]
Get list of custom group names.
- user_partition_names()[source]
Get list of user-defined partition names.
- get_user_partition(partition_name)[source]
Get a user-defined partition by name.
- Parameters:
partition_name (str) – Name of the partition (e.g., “lid_helices”)
- Returns:
The partition result, or None if not found
- Return type:
PartitionCoverageResult | None
- classmethod load(path)[source]
Load from JSON file.
- Parameters:
path (str or Path) – Path to JSON file
- Returns:
Loaded result
- Return type:
SystemCoverageResult
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.PartitionBindingEntry(*, partition_element, polymer_type, total_contact_frames=0, contact_share=0.0, expected_share=0.0, enrichment=None, n_exposed_in_element=0, n_residues_in_element=0, n_residues_contacted=0)[source]
Bases:
BaseModelBinding metrics for one partition element for a specific polymer type.
A partition element is a mutually exclusive subset of protein residues. Within a partition, all elements together cover the entire protein surface exactly once (no residue is counted in multiple elements).
This entry is for a SINGLE polymer type, answering: “What fraction of SBMA’s contacts go to aromatic residues?”
This ensures that: - contact_share sums to 1.0 across all elements in the partition (for this polymer) - expected_share sums to 1.0 across all elements in the partition - enrichment is mathematically valid (no inflated denominators)
- partition_element
Name of this partition element (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)
- Type:
- polymer_type
Polymer type this entry is for (e.g., “SBM”, “EGM”)
- Type:
- total_contact_frames
Sum of contact frames from THIS polymer type to residues in this element
- Type:
- contact_share
Fraction of this polymer’s contacts that went to this element. Sums to 1.0 across all elements in the partition (for this polymer).
- Type:
- expected_share
Expected share based on surface availability (n_exposed / total_exposed). Sums to 1.0 across all elements in the partition.
- Type:
- enrichment
Zero-centered enrichment: (contact_share / expected_share) - 1
- Type:
float | None
- n_exposed_in_element
Number of surface-exposed residues in this element
- Type:
- n_residues_in_element
Total residues in this element (exposed + buried)
- Type:
- n_residues_contacted
Number of exposed residues that had at least one contact from this polymer
- Type:
- partition_element: str
- polymer_type: str
- total_contact_frames: int
- contact_share: float
- expected_share: float
- n_exposed_in_element: int
- n_residues_in_element: int
- n_residues_contacted: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.PartitionBindingResult(*, partition_name, partition_type, polymer_type, entries=<factory>, total_contact_share=1.0, total_expected_share=1.0, total_contact_frames=0)[source]
Bases:
BaseModelBinding preference for a complete partition for ONE polymer type.
A partition divides the protein surface into mutually exclusive regions that together cover the entire surface. This class stores the binding preference of a single polymer type across all partition elements.
This ensures valid enrichment calculations where both contact_share and expected_share sum to 1.0.
Partition Types
aa_class: 5-way partition by amino acid class (aromatic, polar, nonpolar, charged_positive, charged_negative)
user_defined: N+1 way partition with user-specified groups plus “rest_of_protein” (auto-added if groups don’t cover all residues)
- partition_name
Descriptive name (e.g., “aa_class”, “lid_helices”)
- Type:
- partition_type
One of: “aa_class”, “user_defined”
- Type:
- polymer_type
Polymer type this result is for (e.g., “SBM”, “EGM”)
- Type:
- entries
Binding metrics for each element in the partition
- Type:
list[PartitionBindingEntry]
- total_contact_share
Validation check: should be ~1.0
- Type:
- total_expected_share
Validation check: should be ~1.0
- Type:
- total_contact_frames
Total contact frames from this polymer type (across all elements)
- Type:
- partition_name: str
- partition_type: Literal['aa_class', 'user_defined']
- polymer_type: str
- entries: list[PartitionBindingEntry]
- total_contact_share: float
- total_expected_share: float
- total_contact_frames: int
- to_dataframe()[source]
Convert to pandas DataFrame for analysis/plotting.
- enrichment_dict()[source]
Get enrichment as dict: {element: enrichment}.
- contact_share_dict()[source]
Get contact shares as dict: {element: share}.
- expected_share_dict()[source]
Get expected shares as dict: {element: share}.
- get_entry(element_name)[source]
Get the entry for a specific partition element.
- element_names()[source]
Get list of partition element names.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.PolymerBindingPreferenceResult(*, aa_class_binding=<factory>, user_defined_partitions=<factory>, n_frames=0, total_exposed_residues=0, surface_exposure_threshold=None, polymer_types=<factory>, polymer_composition=None, protein_groups_used=<factory>, schema_version=5)[source]
Bases:
BaseModelPer-polymer binding preference using proper partition structure.
This result stores binding preference for ALL polymer types, with each polymer having its own partition-based enrichment calculations.
Unlike SystemCoverageResult (which collapses all polymer contacts), this maintains per-polymer data to answer: “Does SBMA prefer aromatic residues more than EGMA does?”
Partition Strategy (per polymer type)
AA Class Partition (always computed): 5-way partition by amino acid class. Every surface residue belongs to exactly one class. Each polymer type gets its own enrichment values.
User-Defined Partitions (from protein_partitions config): Custom partitions specified by the user. Each partition references groups from protein_groups. ‘rest_of_protein’ is auto-added if groups don’t cover all exposed protein residues. Each polymer type gets its own enrichment values per partition.
- aa_class_binding
AA class partition binding for each polymer type. Keys are polymer type names (e.g., “SBM”, “EGM”).
- user_defined_partitions
User-defined partitions for each polymer type. Outer keys are partition names, inner keys are polymer types. Example: {“lid_helices”: {“SBM”: …, “EGM”: …}}
- n_frames
Total frames analyzed
- Type:
- total_exposed_residues
Number of surface-exposed protein residues
- Type:
- surface_exposure_threshold
SASA threshold used for surface filtering
- Type:
float | None
- polymer_composition
Polymer composition metadata
- Type:
PolymerComposition | None
- schema_version
Schema version (5 = partition-based binding preference)
- Type:
- n_frames: int
- total_exposed_residues: int
- polymer_composition: PolymerComposition | None
- schema_version: int
- get_aa_class_enrichment(polymer_type, aa_class)[source]
Get binding enrichment for an AA class for a specific polymer.
- get_user_partition_enrichment(partition_name, polymer_type, element_name)[source]
Get binding enrichment for a user partition element for a specific polymer.
- aa_class_enrichment_matrix()[source]
Get AA class enrichments as nested dict: {polymer_type: {aa_class: enrichment}}.
- user_partition_enrichment_matrix(partition_name)[source]
Get user partition enrichments as nested dict.
- aa_class_names()[source]
Get list of AA class names in canonical order.
- user_partition_names()[source]
Get list of user-defined partition names.
- save(path)[source]
Save to JSON file.
- classmethod load(path)[source]
Load from JSON file.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedPartitionBindingEntry(*, partition_element, polymer_type, mean_contact_share=0.0, sem_contact_share=0.0, mean_enrichment=None, sem_enrichment=None, per_replicate_enrichments=<factory>, expected_share=0.0, n_exposed_in_element=0, n_residues_in_element=0, n_replicates=0)[source]
Bases:
BaseModelAggregated binding metrics for one partition element for a specific polymer type.
Contains mean ± SEM across replicates for binding preference, enabling statistical comparison of binding enrichment across conditions.
- partition_element
Element name (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)
- Type:
- polymer_type
Polymer type this entry is for (e.g., “SBM”, “EGM”)
- Type:
- mean_contact_share
Mean contact share across replicates
- Type:
- sem_contact_share
Standard error of contact share
- Type:
- mean_enrichment
Mean enrichment across replicates
- Type:
float | None
- sem_enrichment
Standard error of enrichment
- Type:
float | None
- expected_share
Expected share based on surface availability
- Type:
- n_exposed_in_element
Surface-exposed residues in this element
- Type:
- n_residues_in_element
Total residues in this element
- Type:
- n_replicates
Number of replicates with valid data
- Type:
- partition_element: str
- polymer_type: str
- mean_contact_share: float
- sem_contact_share: float
- expected_share: float
- n_exposed_in_element: int
- n_residues_in_element: int
- n_replicates: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedPartitionBindingResult(*, partition_name, partition_type, polymer_type, entries=<factory>, mean_total_contact_share=1.0, n_replicates=0)[source]
Bases:
BaseModelAggregated binding preference for a partition for ONE polymer type.
Contains aggregated statistics across replicates for all partition elements.
- partition_name
Descriptive name (e.g., “aa_class”, “lid_helices”)
- Type:
- partition_type
One of: “aa_class”, “user_defined”
- Type:
- polymer_type
Polymer type this result is for
- Type:
- entries
Aggregated binding metrics for each element
- Type:
list[AggregatedPartitionBindingEntry]
- mean_total_contact_share
Mean of total_contact_share across replicates (validation: should be ~1.0)
- Type:
- n_replicates
Number of replicates
- Type:
- partition_name: str
- partition_type: Literal['aa_class', 'user_defined']
- polymer_type: str
- entries: list[AggregatedPartitionBindingEntry]
- mean_total_contact_share: float
- n_replicates: int
- enrichment_dict()[source]
Get mean enrichment as dict: {element: mean_enrichment}.
- element_names()[source]
Get list of partition element names.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedPolymerBindingPreferenceResult(*, aa_class_binding=<factory>, user_defined_partitions=<factory>, n_replicates=0, total_exposed_residues=0, surface_exposure_threshold=None, polymer_types=<factory>, schema_version=5)[source]
Bases:
BaseModelAggregated per-polymer binding preference across replicates.
Contains mean ± SEM for all partition-based binding metrics.
- aa_class_binding
Aggregated AA class partition binding for each polymer type.
- user_defined_partitions
Aggregated user-defined partitions for each polymer type.
- n_replicates
Number of replicates
- Type:
- total_exposed_residues
Number of surface-exposed protein residues
- Type:
- surface_exposure_threshold
SASA threshold used
- Type:
float | None
- schema_version
Schema version
- Type:
- n_replicates: int
- total_exposed_residues: int
- schema_version: int
- aa_class_enrichment_matrix()[source]
Get AA class enrichments as nested dict: {polymer_type: {aa_class: mean_enrichment}}.
- aa_class_names()[source]
Get list of AA class names in canonical order.
- user_partition_names()[source]
Get list of user-defined partition names.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.binding_preference.aggregate_polymer_binding_preference(results)[source]
Aggregate per-polymer binding preference across replicates.
- Parameters:
results (list[PolymerBindingPreferenceResult]) – Per-polymer binding preference results from multiple replicates
- Returns:
Aggregated results with mean and SEM for all partitions
- Return type:
AggregatedPolymerBindingPreferenceResult
- class polyzymd.analyses.shared.binding_preference.AggregatedPartitionCoverageEntry(*, partition_element, mean_coverage_share=0.0, sem_coverage_share=0.0, mean_coverage_enrichment=None, sem_coverage_enrichment=None, per_replicate_enrichments=<factory>, expected_share=0.0, n_exposed_in_element=0, n_residues_in_element=0, n_replicates=0, mean_polymer_contributions=<factory>)[source]
Bases:
BaseModelAggregated coverage for one partition element across replicates.
Contains mean ± SEM for coverage metrics, enabling statistical comparison of coverage enrichment across conditions.
- partition_element
Element name (e.g., “aromatic”, “lid_helix_5”, “rest_of_protein”)
- Type:
- mean_coverage_share
Mean coverage share across replicates
- Type:
- sem_coverage_share
Standard error of coverage share
- Type:
- mean_coverage_enrichment
Mean coverage enrichment across replicates
- Type:
float | None
- sem_coverage_enrichment
Standard error of coverage enrichment
- Type:
float | None
- expected_share
Expected coverage based on surface availability
- Type:
- n_exposed_in_element
Surface-exposed residues in this element
- Type:
- n_residues_in_element
Total residues in this element
- Type:
- n_replicates
Number of replicates with valid data
- Type:
- partition_element: str
- mean_coverage_share: float
- sem_coverage_share: float
- expected_share: float
- n_exposed_in_element: int
- n_residues_in_element: int
- n_replicates: int
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedPartitionCoverageResult(*, partition_name, partition_type, entries=<factory>, n_replicates=0)[source]
Bases:
BaseModelAggregated coverage for a partition across replicates.
Contains mean ± SEM for all elements in the partition.
- partition_name
Name of the partition
- Type:
- partition_type
One of: “aa_class”, “binary_custom”, “combined_custom”
- Type:
- entries
Aggregated coverage for each element
- Type:
list[AggregatedPartitionCoverageEntry]
- n_replicates
Number of replicates aggregated
- Type:
- partition_name: str
- partition_type: Literal['aa_class', 'binary_custom', 'combined_custom', 'user_defined']
- entries: list[AggregatedPartitionCoverageEntry]
- n_replicates: int
- to_dataframe()[source]
Convert to pandas DataFrame.
- coverage_enrichment_dict()[source]
Get mean coverage enrichment as dict: {element: enrichment}.
- get_entry(element_name)[source]
Get entry for a specific partition element.
- element_names()[source]
Get list of partition element names.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class polyzymd.analyses.shared.binding_preference.AggregatedSystemCoverageResult(*, aa_class_coverage, custom_group_coverages=<factory>, combined_custom_coverage=None, user_defined_partitions=<factory>, n_replicates=0, total_exposed_residues=0, surface_exposure_threshold=None, custom_group_selections=<factory>, polymer_types_included=<factory>, has_overlapping_custom_groups=False, schema_version=2)[source]
Bases:
BaseModelSystem coverage aggregated across replicates (schema v2).
Contains aggregated partition coverages with mean ± SEM statistics for statistical comparison between conditions.
- aa_class_coverage
Aggregated 5-way AA class partition
- Type:
AggregatedPartitionCoverageResult
- custom_group_coverages
Aggregated binary partitions for each custom group
- combined_custom_coverage
Aggregated combined custom partition (if applicable)
- Type:
AggregatedPartitionCoverageResult | None
- n_replicates
Number of replicates aggregated
- Type:
- total_exposed_residues
Number of surface-exposed protein residues
- Type:
- surface_exposure_threshold
SASA threshold used for surface filtering
- Type:
float | None
- has_overlapping_custom_groups
True if custom groups share residues
- Type:
- schema_version
Schema version (2 = partition-based)
- Type:
- aa_class_coverage: AggregatedPartitionCoverageResult
- combined_custom_coverage: AggregatedPartitionCoverageResult | None
- n_replicates: int
- total_exposed_residues: int
- has_overlapping_custom_groups: bool
- schema_version: int
- get_aa_class_enrichment(aa_class)[source]
Get mean coverage enrichment for an AA class.
- get_custom_group_enrichment(group_name)[source]
Get mean coverage enrichment for a custom group (vs rest_of_protein).
- aa_class_enrichment_dict()[source]
Get AA class mean enrichments as dict: {aa_class: enrichment}.
- custom_group_enrichment_dict()[source]
Get custom group mean enrichments as dict: {group_name: enrichment}.
- aa_class_names()[source]
Get list of AA class names in canonical order.
- custom_group_names()[source]
Get list of custom group names.
- save(path)[source]
Save to JSON file.
- classmethod load(path)[source]
Load from JSON file.
- model_config = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- polyzymd.analyses.shared.binding_preference.aggregate_system_coverage(results)[source]
Aggregate system coverage across replicates.
Computes mean ± SEM for coverage metrics across multiple replicates for all partitions (AA class, custom groups, combined).
- Parameters:
results (list[SystemCoverageResult]) – System coverage results from multiple replicates
- Returns:
Aggregated results with mean and SEM for all partitions
- Return type:
AggregatedSystemCoverageResult
- polyzymd.analyses.shared.binding_preference.resolve_protein_group_selections(universe, protein_group_selections=None)[source]
Resolve protein group MDAnalysis selections to residue IDs.
Converts MDAnalysis selection strings into sets of residue IDs at analysis time. This allows the user to define groups with flexible selections while we work with concrete residue IDs internally.
- Parameters:
- Returns:
Mapping of group name to set of residue IDs
- Return type:
Examples
>>> selections = {"aromatic": "protein and resname PHE TRP TYR HIS"} >>> groups = resolve_protein_group_selections(universe, selections) >>> print(groups["aromatic"]) {12, 45, 67, 89, ...} # Set of aromatic residue IDs
- polyzymd.analyses.shared.binding_preference.resolve_protein_groups_from_surface_exposure(surface_exposure, include_default_aa_groups=True, custom_protein_groups=None)[source]
Resolve protein groups from surface exposure data without Universe.
This function derives protein group → residue ID mappings using only the surface exposure result (which already contains resid, resname, and aa_class for each residue). This allows binding preference computation without requiring an MDAnalysis Universe at comparison time.
The function supports:
Default AA class groups (aromatic, polar, nonpolar, charged_positive, charged_negative) derived from the aa_class field in surface exposure data
Custom user-defined groups specified as resid lists
Override behavior: if a custom group has the same name as a default, the custom definition takes precedence
- Parameters:
surface_exposure (SurfaceExposureResult) – Surface exposure analysis result containing residue data
include_default_aa_groups (bool, default True) – If True, include default AA class groupings (aromatic, polar, etc.) derived from surface exposure data
custom_protein_groups (dict[str, list[int]], optional) – User-defined protein groups as {group_name: [resid1, resid2, …]}. If a group name matches a default AA class, it overrides that default.
- Returns:
Mapping of group name to set of residue IDs
- Return type:
Examples
>>> from polyzymd.analyses.shared.surface_exposure import SurfaceExposureFilter >>> filter = SurfaceExposureFilter(threshold=0.2) >>> surface_result = filter.calculate("enzyme.pdb") >>> # Get default AA groups + custom active_site group >>> groups = resolve_protein_groups_from_surface_exposure( ... surface_result, ... include_default_aa_groups=True, ... custom_protein_groups={"active_site": [77, 133, 156]} ... ) >>> print(groups.keys()) dict_keys(['aromatic', 'polar', 'nonpolar', 'charged_positive', 'charged_negative', 'active_site'])
- polyzymd.analyses.shared.binding_preference.resolve_polymer_type_selections(universe, polymer_type_selections=None, polymer_chain='C')[source]
Resolve polymer type selections and return list of polymer types.
If explicit selections are provided, validates them and returns the keys. If None, auto-detects polymer types from the standard polymer chain.
- Parameters:
universe (Universe) – MDAnalysis Universe with loaded topology
polymer_type_selections (dict[str, str], optional) – Mapping of type name to MDAnalysis selection string. If None, auto-detects from
chainID <polymer_chain>.polymer_chain (str) – Chain ID for auto-detection when polymer_type_selections is None. Defaults to
"C"(PolyzyMD chain convention).
- Returns:
List of polymer type names (resnames or selection keys)
- Return type:
Examples
>>> # Auto-detect from chain C >>> polymer_types = resolve_polymer_type_selections(universe, None) >>> print(polymer_types) ['SBM', 'EGM']
>>> # Explicit selections >>> selections = {"SBMA": "chainID C and resname SBM"} >>> polymer_types = resolve_polymer_type_selections(universe, selections) >>> print(polymer_types) ['SBMA']
- polyzymd.analyses.shared.binding_preference.extract_polymer_composition(universe, polymer_type_selections=None, polymer_chain='C')[source]
Extract polymer composition (residue and heavy atom counts) from trajectory.
This function counts the number of residues and heavy atoms for each polymer type in the system. The composition data is used for dual normalization of binding preference enrichment ratios.
- Parameters:
universe (Universe) – MDAnalysis Universe with loaded topology
polymer_type_selections (dict[str, str], optional) – Mapping of type name to MDAnalysis selection string. If None, auto-detects from
chainID <polymer_chain>using unique resnames.polymer_chain (str) – Chain ID for auto-detection when polymer_type_selections is None. Defaults to
"C"(PolyzyMD chain convention).
- Returns:
Composition data with residue_counts and heavy_atom_counts per type. Heavy atoms are defined as all atoms with element != ‘H’.
- Return type:
PolymerComposition
Notes
For auto-detection (polymer_type_selections=None), the function: 1. Selects all atoms in the specified polymer chain 2. Groups residues by resname 3. Counts residues and heavy atoms per resname
For explicit selections, each selection string defines which atoms belong to that polymer type. The function counts residues (unique resids) and heavy atoms within each selection.
Examples
>>> # Auto-detect from chain C >>> composition = extract_polymer_composition(universe) >>> print(composition.residue_counts) {'SBM': 50, 'EGM': 50} >>> print(composition.heavy_atom_counts) {'SBM': 750, 'EGM': 400}
>>> # Explicit selections >>> selections = {"SBMA": "chainID C and resname SBM"} >>> composition = extract_polymer_composition(universe, selections)
- polyzymd.analyses.shared.binding_preference.compute_binding_preference_from_config(contact_result, universe, enzyme_pdb_path, config)[source]
Compute binding preference using contacts plugin settings.
This is a convenience function that orchestrates the full binding preference computation using configuration from analysis.yaml. It: 1. Calculates surface exposure from enzyme PDB 2. Resolves protein group selections to residue IDs 3. Computes binding preference with enrichment ratios
- Parameters:
contact_result (Any) – Contact analysis results from trajectory
universe (Universe) – MDAnalysis Universe (used for resolving selections)
enzyme_pdb_path (Path or str) – Path to enzyme PDB file for SASA calculation
config (Any) – Settings object with binding preference fields
- Returns:
Binding preference with enrichment ratios
- Return type:
BindingPreferenceResult
Notes
This function requires rust_sasa_python to be installed for surface exposure calculation. Install with:
pip install rust-sasa-python
Examples
>>> from polyzymd.analyses.contacts import ContactsSettings >>> config = ContactsSettings( ... compute_binding_preference=True, ... surface_exposure_threshold=0.2, ... ) >>> result = compute_binding_preference_from_config( ... contact_result, universe, "enzyme.pdb", config ... ) >>> print(result.enrichment_matrix()) {'SBM': {'aromatic': 1.45, 'polar': 0.82, ...}, ...}
- polyzymd.analyses.shared.binding_preference.compute_condition_binding_preference(cond, sim_config, analysis_dir, *, enzyme_pdb, contact_results_by_replicate, load_contact_result, threshold, include_default_aa_groups, custom_protein_groups, protein_partitions, polymer_type_selections, polymer_chain, settings_fp)[source]
Compute condition-level binding preference from contacts data.
- Parameters:
cond (ConditionLike) – Condition to compute.
sim_config (Any) – Simulation config used for topology lookup.
analysis_dir (Path) – Contacts analysis directory where outputs are saved.
enzyme_pdb (Path) – Enzyme PDB path used for SASA filtering.
contact_results_by_replicate (Mapping[int, Path]) – Mapping from replicate index to contacts JSON path.
load_contact_result (callable) – Callable used to deserialize contacts results.
threshold (float) – Relative SASA threshold for exposed residues.
include_default_aa_groups (bool) – Whether default AA class groups are included.
custom_protein_groups (dict[str, list[int]] | None) – Optional custom protein groups.
protein_partitions (dict[str, list[str]] | None) – Optional user-defined protein partitions.
polymer_type_selections (dict[str, str] | None) – Optional explicit polymer selection map.
polymer_chain (str) – Polymer chain ID for auto-detection.
settings_fp (str | None) – Optional settings fingerprint used for cache filenames.
- Returns:
Aggregated result, or None if computation failed.
- Return type:
AggregatedBindingPreferenceResult | None
polyzymd.analyses.shared.binding_preference_helpers
Shared helpers for computing binding preference from contacts data.
Used by the contacts, binding_free_energy, and polymer_affinity analysis plugins to compute SASA-based binding preference enrichment.
Public functions
find_enzyme_pdb()
resolve_enzyme_pdb()
try_load_cached_binding_preference()
- class polyzymd.analyses.shared.binding_preference_helpers.ConditionLike(*args, **kwargs)[source]
Bases:
ProtocolMinimal condition protocol required by BP helper functions.
- label
Condition label
- Type:
- replicates
Replicate IDs associated with this condition
- Type:
Sequence[int]
- label: str
- __init__(*args, **kwargs)
- polyzymd.analyses.shared.binding_preference_helpers.find_enzyme_pdb(sim_config)[source]
Find enzyme PDB file from simulation config.
Searches common locations relative to the project directory.
- Parameters:
sim_config (SimulationConfig) – Simulation configuration (must have
output.projects_directory).- Returns:
Path to enzyme PDB, or None if not found.
- Return type:
Path or None
- Raises:
ValueError – If a glob pattern matches multiple candidate enzyme PDB files.
- polyzymd.analyses.shared.binding_preference_helpers.resolve_enzyme_pdb(enzyme_pdb_setting, source_path, sim_config)[source]
Resolve the enzyme PDB path from settings or auto-discovery.
- Parameters:
enzyme_pdb_setting (str or None) – Explicit enzyme PDB path from analysis settings (e.g.,
enzyme_pdb_for_sasa). If relative, resolved against source_path’s parent directory.source_path (Path or None) – Path to the comparison.yaml file (used to resolve relative paths).
sim_config (Any) – Simulation configuration for auto-discovery fallback.
- Returns:
Resolved enzyme PDB path, or None if not found.
- Return type:
Path or None
- polyzymd.analyses.shared.binding_preference_helpers.try_load_cached_binding_preference(cond, analysis_dir, *, settings_fp=None)[source]
Try to load cached binding preference results for a condition.
Searches for binding preference files in order of preference: 1. binding_preference_aggregated.json 2. binding_preference_aggregated_reps*.json (glob pattern) 3. binding_preference.json (single replicate) 4. Per-replicate files (binding_preference_rep{N}.json)
- Parameters:
cond (ConditionLike) – Condition to load.
analysis_dir (Path) – Analysis directory for this condition.
settings_fp (str or None, optional) – Settings fingerprint for cache lookup. When provided, fingerprinted cache files are searched first, then legacy filenames.
- Returns:
Loaded result, or None if not found.
- Return type:
AggregatedBindingPreferenceResult | BindingPreferenceResult | None
polyzymd.analyses.shared.groupings
Residue grouping abstractions for analysis plugins.
This module provides classification systems for residues:
ResidueGrouping: Abstract base class for residue classification
ProteinAAClassification: Standard amino acid classification
CustomGrouping: User-defined classification scheme
Examples
>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification
>>>
>>> # Classify amino acids
>>> grouping = ProteinAAClassification()
>>> print(grouping.classify("PHE")) # "aromatic"
>>> print(grouping.classify("LYS")) # "charged_positive"
>>>
>>> # Get all residues in a group
>>> aromatics = grouping.get_residues_in_group("aromatic")
>>> # Returns: ["PHE", "TRP", "TYR", "HIS"]
- class polyzymd.analyses.shared.groupings.ResidueGrouping[source]
Bases:
ABCAbstract base class for residue classification schemes.
Subclasses must implement classify() to map residue names to group labels.
Examples
>>> class MyPolymerGrouping(ResidueGrouping): ... def classify(self, resname: str) -> str: ... if resname in ["SBM", "SBMA"]: ... return "zwitterionic" ... elif resname in ["EGP", "EGMA"]: ... return "hydrophilic" ... return "unknown" ... ... @property ... def available_groups(self) -> list[str]: ... return ["zwitterionic", "hydrophilic", "unknown"]
- abstractmethod classify(resname)[source]
Classify a residue name into a group.
- abstract property available_groups: list[str]
List of all group labels in this classification scheme.
- get_residues_in_group(group)[source]
Get all residue names that belong to a group.
- Parameters:
group (str) – Group label
- Returns:
Residue names in this group
- Return type:
- Raises:
ValueError – If group is not in available_groups
- to_dict()[source]
Serialize grouping scheme to dictionary.
- class polyzymd.analyses.shared.groupings.ProteinAAClassification(include_his_aromatic=True)[source]
Bases:
ResidueGroupingStandard amino acid classification.
Groups amino acids into: - aromatic: PHE, TRP, TYR, HIS - charged_positive: ARG, LYS - charged_negative: ASP, GLU - polar: ASN, CYS, GLN, SER, THR - nonpolar: ALA, GLY, ILE, LEU, MET, PRO, VAL
This classification matches the scaffold notebooks and common biochemistry conventions.
- Parameters:
include_his_aromatic (bool, optional) – Whether to classify HIS as aromatic (default True). Some classifications put HIS with charged_positive.
Examples
>>> grouping = ProteinAAClassification() >>> grouping.classify("PHE") 'aromatic' >>> grouping.classify("LYS") 'charged_positive' >>> grouping.get_residues_in_group("aromatic") ['PHE', 'TRP', 'TYR', 'HIS']
- __init__(include_his_aromatic=True)[source]
- classify(resname)[source]
Classify amino acid by residue name.
- get_charged_groups()[source]
Convenience: get both charged group names.
- get_hydrophobic_groups()[source]
Convenience: groups typically considered hydrophobic.
- get_hydrophilic_groups()[source]
Convenience: groups typically considered hydrophilic.
- class polyzymd.analyses.shared.groupings.CustomGrouping(classification, default_group='other')[source]
Bases:
ResidueGroupingUser-defined residue classification.
Allows arbitrary mapping from residue names to group labels.
- Parameters:
Examples
>>> # Custom polymer classification >>> grouping = CustomGrouping({ ... "SBM": "zwitterionic", ... "SBMA": "zwitterionic", ... "EGP": "peg_like", ... "EGMA": "peg_like", ... }, default_group="unknown") >>> grouping.classify("SBM") 'zwitterionic'
- __init__(classification, default_group='other')[source]
- classify(resname)[source]
Classify residue by name using custom mapping.
- classmethod from_groups(groups, default_group='other')[source]
Create grouping from group -> residue list mapping.
- Parameters:
- Return type:
CustomGrouping
Examples
>>> grouping = CustomGrouping.from_groups({ ... "zwitterionic": ["SBM", "SBMA"], ... "peg_like": ["EGP", "EGMA", "OEGMA"], ... })
polyzymd.analyses.shared.selectors
Molecular selector abstractions for analysis plugins.
This module provides a unified interface for selecting atoms, residues, or molecular groups from an MDAnalysis Universe. Selectors enable:
Consistent selection logic across different analysis types
Configurable protein/polymer/solvent selection
Support for arbitrary user-defined selections
Proximity-based selections (e.g., “residues near active site”)
The Strategy pattern is used so users can define custom selectors by subclassing MolecularSelector.
Examples
>>> from polyzymd.analyses.shared.selectors import ProteinResidues, PolymerChains
>>>
>>> # Select all protein residues
>>> protein_selector = ProteinResidues()
>>> protein_residues = protein_selector.select(universe)
>>>
>>> # Select polymer chains by type
>>> polymer_selector = PolymerResiduesByType(residue_names=["SBM", "EGP"])
>>> polymer_residues = polymer_selector.select(universe)
>>>
>>> # Select protein residues near catalytic triad
>>> triad_selector = ProteinResiduesNearReference(
... reference_selection="resid 77 133 156",
... cutoff=5.0
... )
>>> nearby_residues = triad_selector.select(universe)
- class polyzymd.analyses.shared.selectors.MolecularSelector[source]
Bases:
ABCAbstract base class for molecular selections.
Subclasses must implement the select() method to define how atoms or residues are selected from a Universe.
This follows the Strategy pattern - different selectors can be swapped in to change selection behavior without modifying the analysis code.
Examples
>>> class ActiveSiteSelector(MolecularSelector): ... def __init__(self, active_site_resids: list[int]): ... self.resids = active_site_resids ... ... def select(self, universe: Universe) -> SelectionResult: ... resid_str = " ".join(str(r) for r in self.resids) ... atoms = universe.select_atoms(f"resid {resid_str}") ... return SelectionResult( ... atoms=atoms, ... residues=atoms.residues, ... label="active_site", ... metadata={"resids": self.resids} ... ) ... ... @property ... def label(self) -> str: ... return "active_site"
- abstractmethod select(universe)[source]
Select atoms/residues from a Universe.
- Parameters:
universe (Universe) – MDAnalysis Universe to select from
- Returns:
Container with selected atoms, residues, and metadata
- Return type:
SelectionResult
- abstract property label: str
Short label identifying this selector (for filenames/logging).
- validate(universe)[source]
Validate the selector against a Universe.
Returns diagnostic information about whether the selection would succeed and what it would select.
- Parameters:
universe (Universe) – MDAnalysis Universe to validate against
- Returns:
Validation results with keys: - valid: bool - n_atoms: int - n_residues: int - error: str (if invalid) - warnings: list[str]
- Return type:
- class polyzymd.analyses.shared.selectors.MDAnalysisSelector(selection, label=None)[source]
Bases:
MolecularSelectorSimple selector using an MDAnalysis selection string.
This is the most flexible selector - it allows arbitrary MDAnalysis selection syntax. Use this when you need direct control over the selection or when the specialized selectors don’t fit your needs.
- Parameters:
Examples
>>> # Select polymer residues by name >>> selector = MDAnalysisSelector("resname SBM EGM") >>> result = selector.select(universe) >>> >>> # Select protein backbone near ligand >>> selector = MDAnalysisSelector( ... "protein and backbone and around 5.0 resname LIG", ... label="protein_near_ligand" ... )
- __init__(selection, label=None)[source]
- select(universe)[source]
Select atoms using the MDAnalysis selection string.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.SelectionResult(atoms, residues, label, metadata=<factory>)[source]
Bases:
objectContainer for selection results with metadata.
- atoms
The selected atoms
- Type:
AtomGroup
- residues
The residues containing the selected atoms
- Type:
ResidueGroup
- label
Human-readable label for this selection
- Type:
- metadata
Additional metadata about the selection (e.g., selection string used, cutoff values, etc.)
- Type:
- atoms: AtomGroup
- residues: ResidueGroup
- label: str
- metadata: dict
- property n_atoms: int
Number of selected atoms.
- property n_residues: int
Number of selected residues.
- property residue_ids: numpy.typing.NDArray.numpy.int64
1-indexed residue IDs (PyMOL convention).
- __init__(atoms, residues, label, metadata=<factory>)
- class polyzymd.analyses.shared.selectors.CompositeSelector(selectors, mode='union', label=None)[source]
Bases:
MolecularSelectorCombines multiple selectors with AND/OR logic.
Useful for complex selections like “protein residues that are both aromatic AND within 5A of the active site”.
- Parameters:
selectors (list[MolecularSelector]) – List of selectors to combine
mode ({"union", "intersection"}) – How to combine selections: - “union”: Include atoms selected by ANY selector (OR) - “intersection”: Include only atoms selected by ALL selectors (AND)
label (str, optional) – Custom label. If not provided, generates from component labels.
- __init__(selectors, mode='union', label=None)[source]
- select(universe)[source]
Select atoms using combined selectors.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.ProteinResidues(selection_modifier=None)[source]
Bases:
MolecularSelectorSelect all protein residues.
Uses MDAnalysis “protein” selection keyword which matches standard amino acid residues.
- Parameters:
selection_modifier (str, optional) – Additional selection criteria to AND with “protein”. E.g., “and not name H*” to exclude hydrogens.
- __init__(selection_modifier=None)[source]
- select(universe)[source]
Select all protein atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.ProteinResiduesByGroup(grouping, groups, exclude=False)[source]
Bases:
MolecularSelectorSelect protein residues by amino acid group classification.
Uses a ResidueGrouping to classify amino acids (e.g., aromatic, charged, polar, nonpolar) and selects only residues in the specified groups.
- Parameters:
Examples
>>> from polyzymd.analyses.shared.groupings import ProteinAAClassification >>> >>> # Select aromatic residues >>> grouping = ProteinAAClassification() >>> selector = ProteinResiduesByGroup(grouping, groups=["aromatic"]) >>> >>> # Select all charged residues >>> selector = ProteinResiduesByGroup( ... grouping, ... groups=["charged_positive", "charged_negative"] ... )
- __init__(grouping, groups, exclude=False)[source]
- select(universe)[source]
Select protein residues matching the specified groups.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.ProteinResiduesNearReference(reference_selection, cutoff, include_reference=True, frame=0)[source]
Bases:
MolecularSelectorSelect protein residues within a cutoff distance of reference atoms.
Useful for selecting residues near active sites, binding pockets, or other regions of interest.
- Parameters:
reference_selection (str) – MDAnalysis selection string for reference atoms (e.g., “resid 77 133 156”)
cutoff (float) – Distance cutoff in Angstroms. Residues with any atom within this distance of any reference atom are selected.
include_reference (bool, optional) – Whether to include the reference residues themselves. Default True.
frame (int, optional) – Frame to use for distance calculation. Default is current frame (0).
Examples
>>> # Select residues within 5A of catalytic triad >>> selector = ProteinResiduesNearReference( ... reference_selection="resid 77 133 156", ... cutoff=5.0, ... ) >>> >>> # Select residues near substrate binding site (not including the site itself) >>> selector = ProteinResiduesNearReference( ... reference_selection="resname LIG", ... cutoff=4.0, ... include_reference=False, ... )
- __init__(reference_selection, cutoff, include_reference=True, frame=0)[source]
- select(universe)[source]
Select protein residues near the reference atoms.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.PolymerChains(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
Bases:
MolecularSelectorSelect polymer chains from the system.
For PolyzyMD-built systems, polymers are assigned to Chain C by convention. This selector uses chain ID selection by default, which is more reliable than residue name matching.
- Parameters:
chain_id (str, optional) – Chain ID for polymer selection. Default “C” (PolyzyMD convention). Set to None to use residue_names instead.
residue_names (list[str], optional) – Residue names that identify polymer residues. Only used when chain_id is None, or as a filter within the chain. Default uses common PolyzyMD polymer names.
chain_indices (list[int], optional) – If provided, select only these polymer chain indices (0-indexed) from within the selected atoms. Useful when analyzing specific polymer chains in multi-chain systems.
segids (list[str], optional) – If provided, select only polymers with these segment IDs.
Notes
The PolyzyMD chain convention is: - Chain A: Protein/Enzyme - Chain B: Substrate/Ligand - Chain C: Polymers - Chain D+: Solvent (water, ions, co-solvents)
For systems not built with PolyzyMD, set chain_id=None and provide residue_names explicitly.
Examples
>>> # PolyzyMD system (recommended) >>> selector = PolymerChains() # Uses chain C >>> >>> # Non-PolyzyMD system >>> selector = PolymerChains(chain_id=None, residue_names=["SBM", "EGM"]) >>> >>> # PolyzyMD system with specific polymer types >>> selector = PolymerChains(residue_names=["SBM"]) # SBM in chain C only
- __init__(chain_id='C', residue_names=None, chain_indices=None, segids=None)[source]
- select(universe)[source]
Select polymer atoms/residues.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.PolymerResiduesByType(residue_names, exclude=False)[source]
Bases:
MolecularSelectorSelect polymer residues by monomer type (residue name).
This selector groups polymer residues by their residue names, allowing analysis of specific monomer types within copolymers.
- Parameters:
Examples
>>> # Select SBMA monomers only >>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"]) >>> >>> # Select non-SBMA monomers >>> selector = PolymerResiduesByType(residue_names=["SBM", "SBMA"], exclude=True)
- __init__(residue_names, exclude=False)[source]
- select(universe)[source]
Select polymer residues by type.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.PolymerSegments(residue_names=None, chain_index=None, segment_indices=None)[source]
Bases:
MolecularSelectorSelect individual segments (residues) within polymer chains.
This selector provides fine-grained access to polymer segments, useful for per-segment contact analysis.
- Parameters:
residue_names (list[str], optional) – Residue names that identify polymer residues.
chain_index (int, optional) – Specific chain to select segments from (0-indexed). If None, selects from all chains.
segment_indices (list[int], optional) – Specific segment indices within chains to select. Uses 0-indexed positions within each chain.
Notes
A “segment” in this context refers to a single residue/monomer unit within a polymer chain, not MDAnalysis segments.
- __init__(residue_names=None, chain_index=None, segment_indices=None)[source]
- select(universe)[source]
Select polymer segments.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.SolventMolecules(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
Bases:
MolecularSelectorSelect solvent (water) molecules.
- Parameters:
residue_names (list[str], optional) – Residue names for water. Default uses common water names.
exclude_near (str, optional) – Exclude waters within a cutoff of this selection. E.g., “protein” to exclude waters in first hydration shell.
exclude_cutoff (float, optional) – Cutoff in Angstroms for exclude_near. Default 3.0.
Examples
>>> # Select all water >>> selector = SolventMolecules() >>> >>> # Select bulk water (exclude first shell around protein) >>> selector = SolventMolecules( ... exclude_near="protein", ... exclude_cutoff=5.0 ... )
- __init__(residue_names=None, exclude_near=None, exclude_cutoff=3.0)[source]
- select(universe)[source]
Select water molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.CosolventMolecules(residue_names=None)[source]
Bases:
MolecularSelectorSelect cosolvent molecules (e.g., DMSO, acetonitrile).
- Parameters:
residue_names (list[str], optional) – Residue names for cosolvent. Default uses common names. You should typically specify this for your system.
Examples
>>> # Select DMSO molecules >>> selector = CosolventMolecules(residue_names=["DMSO", "DMS"])
- __init__(residue_names=None)[source]
- select(universe)[source]
Select cosolvent molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.SubstrateMolecule(residue_name, n_molecules=None)[source]
Bases:
MolecularSelectorSelect substrate or ligand molecules.
- Parameters:
Examples
>>> # Select resorufin butyrate substrate >>> selector = SubstrateMolecule(residue_name="RBU") >>> >>> # Select single substrate, validate count >>> selector = SubstrateMolecule(residue_name="RBU", n_molecules=1)
- __init__(residue_name, n_molecules=None)[source]
- select(universe)[source]
Select substrate molecules.
- property label: str
Short label identifying this selector (for filenames/logging).
- class polyzymd.analyses.shared.selectors.IonSelector(residue_names=None, ion_type='all')[source]
Bases:
MolecularSelectorSelect ion molecules (Na+, Cl-, etc.).
- Parameters:
Examples
>>> # Select all ions >>> selector = IonSelector() >>> >>> # Select only sodium ions >>> selector = IonSelector(residue_names=["NA", "Na+", "SOD"])
- DEFAULT_CATIONS = ['NA', 'Na+', 'SOD', 'K', 'K+', 'POT', 'MG', 'Mg2+', 'CA', 'Ca2+']
- DEFAULT_ANIONS = ['CL', 'Cl-', 'CLA', 'BR', 'Br-']
- __init__(residue_names=None, ion_type='all')[source]
- select(universe)[source]
Select ion molecules.
- property label: str
Short label identifying this selector (for filenames/logging).