"""Result models for binding free energy comparison analysis.
Physics background
------------------
In the NPT ensemble (constant pressure, as used in all polyzymd simulations)
the correct thermodynamic potential is the **Gibbs free energy** G.
The quantity computed here is a **selectivity free energy** (ΔG_sel) that
measures how much more (or less) favorable it is for a polymer to contact
a given group of protein residues compared to what would be expected if the
polymer contacted each exposed surface residue in proportion to that residue
group's share of the total solvent-exposed protein surface.
Concretely: if aromatic residues make up 10% of the solvent-exposed surface
but receive 20% of the polymer's contacts, the polymer preferentially contacts
aromatic residues. The reference (expected) distribution is simply proportional
to surface availability — not any property of the polymer itself.
ΔG_sel(j) = -k_B·T · ln(contact_share_j / expected_share_j)
where:
contact_share_j = (contact frames involving residues in group j) /
(total contact frames across all protein residues)
— the observed fraction of polymer contacts directed
at group j
expected_share_j = (number of solvent-exposed residues in group j) /
(total number of solvent-exposed protein residues)
— the fraction of the protein surface belonging to
group j; this is the reference assuming contacts are
distributed purely by surface area
k_B = Boltzmann constant (0.0019872041 kcal mol⁻¹ K⁻¹)
T = simulation temperature in Kelvin
Because both distributions are normalized over the same partition (they sum
to 1 over all groups), there is no arbitrary additive constant — ΔG_sel is
fully determined by the data.
When units='kT' (default), the formula simplifies to:
ΔG_sel(j) / k_BT = -ln(contact_share_j / expected_share_j)
yielding a dimensionless value directly comparable to the thermal energy
scale. A value of -1.0 means the binding preference is exactly 1 k_bT
favorable relative to the surface-availability reference.
Note: contact_share / expected_share = enrichment_ratio = enrichment + 1
(where enrichment is the existing dimensionless enrichment score from binding
preference analysis). So ΔG_sel = -kT·ln(enrichment + 1), and the two
representations are mathematically equivalent; ΔG_sel simply puts the
enrichment score on a physically meaningful energy scale.
Sign convention:
ΔG_sel < 0 → preferential contact (observed > surface-availability reference)
ΔG_sel > 0 → contact avoidance (observed < surface-availability reference)
ΔG_sel = 0 → contacts match the surface-availability reference exactly
Differences between conditions (ΔG_sel,B(j) − ΔG_sel,A(j)) give a true ΔΔG,
stored in `FreeEnergyPairwiseEntry.delta_delta_G`.
Uncertainty propagation
-----------------------
When multiple independent replicates are available, two uncertainty estimates
are reported:
1. **Between-replicate SEM on ΔG_sel** (primary, used for pairwise statistics):
ΔG_sel is computed independently for each replicate, and the SEM is taken
directly across those values. This is the most statistically sound approach
for independent replicates and is the quantity used in t-tests.
2. **Delta-method propagation** (analytical approximation, stored for reference):
For the mean contact_share and its SEM, uncertainty is propagated through
the logarithm using first-order error propagation (Taylor 1997, ch. 3;
Bevington & Robinson 2003, ch. 3):
σ(ΔG_sel) ≈ k_B·T · √[(σ_cs / cs)² + (σ_es / es)²]
(or simply √[...] when units='kT')
where σ_cs = SEM of contact_share across replicates, and σ_es ≈ 0 because
expected_share is computed from a single static PDB structure (no replicate
variance). This simplifies to σ(ΔG_sel) ≈ k_B·T · (σ_cs / cs)
(or σ_cs / cs when units='kT').
References:
- Taylor, J. R. (1997). *An Introduction to Error Analysis*, 2nd ed.
University Science Books. (Ch. 3: Error propagation for functions of
one or more variables)
- Bevington, P. R. & Robinson, D. K. (2003). *Data Reduction and Error
Analysis for the Physical Sciences*, 3rd ed. McGraw-Hill. (Ch. 3)
- Wikipedia: Delta method,
https://en.wikipedia.org/wiki/Delta_method
Temperature handling:
When units='kT', ΔG_sel = -ln(ratio) is temperature-independent (the same
ratio at any temperature gives the same dimensionless value). However,
the underlying contact probabilities ARE temperature-dependent, so cross-
temperature comparisons still require caution.
When units='kcal/mol' or 'kJ/mol', ΔG_sel computed at temperature T is NOT
directly comparable to ΔG_sel at temperature T'. Pairwise statistical
comparisons are only computed between conditions sharing the same
simulation temperature.
"""
from __future__ import annotations
from datetime import datetime
from pathlib import Path
from typing import Optional
from pydantic import BaseModel, Field
from polyzymd import __version__
[docs]
class FreeEnergyEntry(BaseModel):
"""Free energy analysis for one (polymer_type, protein_group) pair in one condition.
Stores both the ΔG_sel value and the raw probability quantities used to compute
it, enabling reproducibility and downstream verification.
Attributes
----------
polymer_type : str
Polymer residue type (e.g., "SBM", "EGM").
protein_group : str
Protein amino acid group label (e.g., "aromatic", "charged_positive").
partition_name : str
Name of the partition this group belongs to (e.g., "aa_class").
contact_share : float
Observed fraction of polymer contacts directed at this group.
This is P_obs in ΔG_sel = -kT·ln(P_obs / P_ref).
expected_share : float
Surface-availability-weighted reference fraction.
This is P_ref in ΔG_sel = -kT·ln(P_obs / P_ref).
enrichment_ratio : float
contact_share / expected_share (= enrichment + 1).
Stored for traceability; ΔG_sel = -kT·ln(enrichment_ratio).
delta_G : float | None
ΔG_sel in the configured units. None when contact_share = 0 or
expected_share = 0 (log undefined or reference missing).
delta_G_uncertainty : float | None
σ(ΔG_sel) from delta-method error propagation. None if delta_G is None
or if SEM data is unavailable (single replicate).
delta_G_per_replicate : list[float]
Per-replicate ΔG_sel values used for cross-condition statistics.
units : str
Energy units ("kT", "kcal/mol", or "kJ/mol").
temperature_K : float
Simulation temperature in Kelvin (used as kT denominator).
n_replicates : int
Number of replicates with valid data for this entry.
n_exposed_in_group : int
Number of surface-exposed residues in this group (used for expected_share).
"""
polymer_type: str
protein_group: str
partition_name: str = "aa_class"
contact_share: float
expected_share: float
enrichment_ratio: float = Field(
description="contact_share / expected_share; ΔG_sel = -kT·ln(this value)"
)
delta_G: Optional[float] = Field(
default=None,
description="ΔG_sel = -k_B·T·ln(contact_share / expected_share) in units",
)
delta_G_uncertainty: Optional[float] = Field(
default=None,
description="σ(ΔG_sel) from delta-method: k_B·T·√[(σ_cs/cs)²+(σ_es/es)²]",
)
delta_G_per_replicate: list[float] = Field(
default_factory=list,
description="Per-replicate ΔG_sel values for statistical testing",
)
units: str = "kT"
temperature_K: float
n_replicates: int = 0
n_exposed_in_group: int = 0
[docs]
class FreeEnergyConditionSummary(BaseModel):
"""Free energy summary for one simulation condition.
Aggregates FreeEnergyEntry objects across all (polymer_type, protein_group)
pairs for a single condition, together with condition metadata.
Attributes
----------
label : str
Display name for this condition.
config_path : str
Path to the SimulationConfig YAML used.
temperature_K : float
Simulation temperature in Kelvin.
n_replicates : int
Number of replicates in this condition.
units : str
Energy units ("kT", "kcal/mol", or "kJ/mol").
entries : list[FreeEnergyEntry]
All (polymer_type, protein_group) ΔG_sel entries.
polymer_types : list[str]
Polymer residue types present.
protein_groups : list[str]
Protein group labels analyzed.
"""
label: str
config_path: str
temperature_K: float
n_replicates: int
units: str = "kT"
entries: list[FreeEnergyEntry] = Field(default_factory=list)
polymer_types: list[str] = Field(default_factory=list)
protein_groups: list[str] = Field(default_factory=list)
@property
def primary_metric_value(self) -> float:
"""Mean ΔG_sel across all valid entries (for BaseConditionSummary compatibility)."""
vals = [e.delta_G for e in self.entries if e.delta_G is not None]
return float(sum(vals) / len(vals)) if vals else 0.0
@property
def primary_metric_sem(self) -> float:
"""Mean σ(ΔG_sel) across all valid entries."""
vals = [e.delta_G_uncertainty for e in self.entries if e.delta_G_uncertainty is not None]
return float(sum(vals) / len(vals)) if vals else 0.0
[docs]
def get_entry(
self,
polymer_type: str,
protein_group: str,
partition_name: str | None = None,
) -> Optional[FreeEnergyEntry]:
"""Get the FreeEnergyEntry for a (polymer_type, protein_group) pair.
Parameters
----------
polymer_type : str
Polymer type.
protein_group : str
AA group label.
partition_name : str or None, optional
If given, further restrict to entries belonging to this partition.
Necessary when the same ``protein_group`` label appears in multiple
partitions (e.g., "rest_of_protein" in several user-defined
partitions).
Returns
-------
FreeEnergyEntry or None
"""
for e in self.entries:
if e.polymer_type == polymer_type and e.protein_group == protein_group:
if partition_name is not None and e.partition_name != partition_name:
continue
return e
return None
[docs]
class FreeEnergyPairwiseEntry(BaseModel):
"""Pairwise comparison between two conditions for one (polymer, group) pair.
Each condition has a per-group selectivity free energy ΔG_sel. The difference
ΔΔG = ΔG_sel,B − ΔG_sel,A is a true double-delta quantity.
Statistics are only computed when both conditions share the same simulation
temperature. If temperatures differ, all stat fields are None and the
``cross_temperature`` flag is set to True.
Attributes
----------
polymer_type : str
Polymer residue type.
protein_group : str
Protein group label.
condition_a : str
Label of the first condition.
condition_b : str
Label of the second condition.
temperature_a_K : float
Temperature of condition A in Kelvin.
temperature_b_K : float
Temperature of condition B in Kelvin.
cross_temperature : bool
True when temperatures differ — statistics are suppressed.
delta_G_a : float | None
ΔG_sel for condition A.
delta_G_b : float | None
ΔG_sel for condition B.
delta_delta_G : float | None
ΔΔG = ΔG_sel,B − ΔG_sel,A. Positive → B has less favorable selectivity.
t_statistic : float | None
T-test statistic (None for cross-temperature pairs).
p_value : float | None
Two-tailed p-value (None for cross-temperature pairs).
"""
polymer_type: str
protein_group: str
condition_a: str
condition_b: str
temperature_a_K: float
temperature_b_K: float
cross_temperature: bool = False
delta_G_a: Optional[float] = None
delta_G_b: Optional[float] = None
delta_delta_G: Optional[float] = None
t_statistic: Optional[float] = None
p_value: Optional[float] = None
[docs]
class BindingFreeEnergyResult(BaseModel):
"""Complete binding free energy comparison result.
This is the main output from BindingFreeEnergyComparator.compare().
Physics summary
---------------
Formula: ΔG_sel = -k_B·T · ln(contact_share / expected_share)
Uncertainty: σ(ΔG_sel) = k_B·T · √[(σ_cs/cs)² + (σ_es/es)²]
Temperature note: pairwise statistics are suppressed between conditions
at different temperatures. The ``mixed_temperatures`` flag indicates this
occurred. Each condition's temperature is stored in its summary.
Attributes
----------
name : str
Name of the comparison project.
units : str
Energy units ("kT", "kcal/mol", or "kJ/mol").
formula : str
Human-readable formula string (for documentation/output).
mixed_temperatures : bool
True if conditions span more than one simulation temperature.
temperature_groups : dict[float, list[str]]
Mapping of temperature (K) to condition labels at that temperature.
conditions : list[FreeEnergyConditionSummary]
Summary for each condition.
pairwise_comparisons : list[FreeEnergyPairwiseEntry]
All pairwise comparisons (cross-T pairs have stats suppressed).
polymer_types : list[str]
All polymer types found.
protein_groups : list[str]
All protein groups analyzed.
surface_exposure_threshold : float | None
SASA threshold used (from binding preference settings).
equilibration_time : str
Equilibration time used.
created_at : datetime
When the analysis was run.
polyzymd_version : str
Version of polyzymd used.
"""
name: str
units: str = "kT"
formula: str = "ΔG_sel = -ln(contact_share / expected_share) [units: k_bT]"
mixed_temperatures: bool = False
temperature_groups: dict[str, list[str]] = Field(
default_factory=dict,
description="temperature_K (as str key) → list of condition labels",
)
conditions: list[FreeEnergyConditionSummary] = Field(default_factory=list)
pairwise_comparisons: list[FreeEnergyPairwiseEntry] = Field(default_factory=list)
polymer_types: list[str] = Field(default_factory=list)
protein_groups: list[str] = Field(default_factory=list)
surface_exposure_threshold: Optional[float] = None
equilibration_time: str = ""
created_at: datetime = Field(default_factory=datetime.now)
polyzymd_version: str = Field(default_factory=lambda: __version__)
[docs]
def save(self, path: Path | str) -> Path:
"""Save result to JSON file.
Parameters
----------
path : Path or str
Output path.
Returns
-------
Path
Path to the saved file.
"""
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(self.model_dump_json(indent=2))
return path
[docs]
@classmethod
def load(cls, path: Path | str) -> "BindingFreeEnergyResult":
"""Load result from JSON file.
Parameters
----------
path : Path or str
Path to JSON file.
Returns
-------
BindingFreeEnergyResult
Loaded result.
"""
path = Path(path)
return cls.model_validate_json(path.read_text())
[docs]
def get_condition(self, label: str) -> FreeEnergyConditionSummary:
"""Get a condition summary by label.
Parameters
----------
label : str
Condition label.
Returns
-------
FreeEnergyConditionSummary
Raises
------
KeyError
If not found.
"""
for cond in self.conditions:
if cond.label == label:
return cond
raise KeyError(f"Condition '{label}' not found")