# Exposure Dynamics Analysis Understand **how your polymer interacts with transiently exposed protein residues** using dynamic SASA-based enrichment and chaperone event analysis. This module answers two complementary questions that static contact analysis cannot: 1. *"Does my polymer preferentially bind certain amino acid classes **at the moment they are surface-exposed**, controlling for instantaneous surface availability?"* 2. *"Does the polymer act as a molecular chaperone — providing contact during transient exposure events that would otherwise be unassisted?"* ```{note} Exposure dynamics analysis builds on [contacts analysis](analysis_contacts_quickstart.md) and uses SASA computed per-frame from MD trajectories, not from a static PDB. Run contacts analysis first, then enable exposure dynamics in your comparison configuration. ``` ```{warning} Exposure dynamics is currently labeled **experimental** in the presentation release. The command remains available and plots are still produced, but CLI output and figures now carry explicit experimental labeling because the metric definitions and interpretation may still change. ``` :::{admonition} Environment Setup :class: tip All commands below assume you have activated the PolyzyMD pixi environment: ```bash pixi shell -e build ``` Alternatively, prefix each command with `pixi run -e build`. ::: ## Scientific Motivation ### Why static analysis is insufficient Standard binding preference analysis (see [Binding Preference Analysis](analysis_binding_preference.md)) uses SASA from a single static PDB to define which residues are "surface accessible." This works well for residues that are consistently exposed, but misses a crucial class of events: **transient exposure**. A residue that is buried 90% of the time but exposes briefly may be the most biologically relevant interaction site — or it may represent a mechanically vulnerable region that the polymer can protect or destabilize. A static SASA filter either includes it (at a permissive threshold) or excludes it entirely (at a stringent threshold), conflating residues with fundamentally different dynamics. **Dynamic enrichment** solves this by computing the enrichment *frame by frame*, conditioning on instantaneous surface availability. A residue that exposes in only 10% of frames contributes to the enrichment *only in those 10% of frames*. ### Why chaperone fraction captures something enrichment cannot Dynamic enrichment tells you whether the polymer preferentially contacts aromatic residues when they are exposed — but it does not tell you whether the polymer is *present during exposure events* or contacts residues *after they are already exposed and re-expose repeatedly*. **Chaperone fraction** asks a different question: of all the exposure episodes a residue undergoes (contiguous stretches of frames where it is exposed), what fraction have polymer contact during the episode? A high chaperone fraction for a transient residue suggests the polymer is co-localizing during the exposure window, not merely making incidental contact at stable surface sites. --- ## The Two Metrics ### Dynamic enrichment For each polymer type $P$ and amino-acid group $G$ (e.g., "aromatic"), the enrichment is computed frame by frame over the full trajectory: ```{math} \text{For each frame } t\text{:} \quad n^G_{\text{exp}}(t) = \left|\{i \in G \mid \text{rel\_SASA}(t,i) > \theta\}\right| \quad n_{\text{exp}}(t) = \left|\{i \mid \text{rel\_SASA}(t,i) > \theta\}\right| \quad n^G_{\text{con}}(t) = \left|\{i \in G \mid \text{rel\_SASA}(t,i) > \theta \text{ AND } \text{contact}_P(t,i)\}\right| \quad \text{observed}(t) = \frac{n^G_{\text{con}}(t)}{n^G_{\text{exp}}(t)} \quad (\text{skip frame if } n^G_{\text{exp}}(t) = 0) \quad \text{expected}(t) = \frac{n^G_{\text{exp}}(t)}{n_{\text{exp}}(t)} ``` ```{math} \text{Enrichment} = \frac{\overline{\text{observed}}}{\overline{\text{expected}}} - 1 ``` where $\theta$ is the exposure threshold (default 0.20), overbars denote the mean over frames where the group has at least one exposed residue, and $\text{contact}_P(t,i)$ is True if any atom of polymer type $P$ is within the contact cutoff of residue $i$ at frame $t$. **Plain-English interpretation:** An enrichment of $+6.94$ for aromatic residues means that, on average across all frames, polymer type $P$ contacted aromatic residues **7.94× more often than their instantaneous share of the exposed surface would predict** if contacts were distributed proportionally. | Enrichment | Meaning | |------------|---------| | **> 0** | Preferential contact — polymer contacts this group more than surface share predicts | | **= 0** | Neutral — contact rate matches instantaneous surface availability | | **< 0** | Avoidance — polymer contacts this group less than surface share predicts | | **= -1** | Complete avoidance — no contacts with this group observed | ```{important} **Enrichment is not a binding free energy.** It is a conditional contact frequency ratio. A large positive value means preferential co-localization, not strong thermodynamic affinity. Do not use phrases like "binding energy" or "affinity" when reporting enrichment values. ``` #### Dynamic enrichment vs. static binding preference Both metrics answer different questions. Choose the right one for your scientific question: | Aspect | Dynamic enrichment | Static binding preference | |--------|--------------------|--------------------------| | **SASA baseline** | Per-frame instantaneous exposure | Single static PDB | | **Residue eligibility** | Changes every frame | Fixed at analysis setup | | **Normalization** | Instantaneous exposed fraction | Time-averaged surface fraction | | **Best for** | Residues with variable exposure; detecting transient-site preference | Stable surface residues; overall polymer chemistry characterization | | **Typical magnitude** | Can be large (>5) for transiently exposed groups | Usually −1 to +2 | | **Equilibration window** | Included in frame count (see [Limitations](#known-assumptions-and-limitations)) | Not applicable | Both analyses are independently valid. Do not substitute one for the other when interpreting results. ### Chaperone fraction #### Residue stability classification Before computing chaperone fractions, each residue is classified by its **exposure fraction** — the proportion of trajectory frames where its relative SASA exceeds the threshold $\theta$: | Classification | Condition | Meaning | |---------------|-----------|---------| | `stably_exposed` | exposure_fraction ≥ `transient_upper` (default 0.80) | Consistently on the surface | | `transient` | `transient_lower` < exposure_fraction < `transient_upper` | Fluctuates between buried and exposed | | `stably_buried` | exposure_fraction ≤ `transient_lower` (default 0.20) | Consistently buried | Only **transient** residues are used for condition-level chaperone fraction reporting, because chaperone events are only meaningful for residues that genuinely fluctuate. #### Event detection For each residue, the trajectory is segmented into **exposure windows**: maximal contiguous runs of frames where the residue is exposed (relative SASA > $\theta$). Windows shorter than `min_event_length` frames are discarded to remove single-frame SASA noise. Each exposure window is then classified: - **Chaperone event**: at least one polymer contact occurs during the window - **Unassisted event**: no polymer contact occurs during the window #### Per-residue chaperone fraction ```{math} \text{chaperone\_fraction} = \frac{n_{\text{chaperone events}}}{n_{\text{chaperone events}} + n_{\text{unassisted events}}} ``` A value of 1.0 means every exposure window for this residue had polymer contact; 0.0 means none did. #### Condition-level chaperone fraction The reported condition-level value is the mean chaperone fraction over all transient residues in a replicate: ```{math} \text{condition chaperone fraction} = \frac{1}{|R_{\text{transient}}|} \sum_{i \in R_{\text{transient}}} \text{chaperone\_fraction}_i ``` This value is computed per replicate; the mean ± SEM across replicates is reported in comparisons. ```{note} **What "transient" means in the JSON output:** In `exposure_dynamics.json`, each residue has a `stability` field. Residues with `"stability": "transient"` are those with exposure fraction between `transient_lower` and `transient_upper`. The condition-level chaperone fraction is `numpy.mean()` over exactly these residues. ``` --- ## Known Assumptions and Limitations ```{important} **Read this section before reporting results.** These are not caveats to hide — they are properties of the analysis that affect interpretation. Understanding them will help you explain your results to reviewers. ``` ### Exposure threshold is 0.20 by default The exposure threshold $\theta = 0.20$ means a residue is classified as exposed when its relative SASA (computed with MDTraj `shrake_rupley`, protein-only, chain A) exceeds 20% of the Tien et al. 2013 maximum ASA for that residue type. At this threshold, a typical enzyme will have ~80–100 exposed residues per frame. Lowering the threshold to 0.10 includes more partially buried residues (larger denominator, smaller enrichment magnitudes); raising to 0.30 restricts to only highly exposed residues (smaller denominator, more variable enrichments). **The threshold is stored in `sasa_metadata.json`** alongside each SASA cache. If you change the threshold, you must recompute: ```bash polyzymd compare run exposure -f comparison.yaml --recompute ``` Comparing enrichment values computed with different thresholds is not valid. ### Contact frame indices are absolute (no offset) Contact event `start_frame` values in `contacts_rep*.json` are **absolute trajectory frame indices**, numbered from 0 at the start of the full trajectory. They are not relative to the beginning of the contact analysis window or the end of equilibration. When cross-referencing SASA data manually: ```python # CORRECT: use contact frame index directly as SASA row sasa_at_contact = sasa_matrix[contact_event.start_frame] # WRONG: do NOT add any offset sasa_at_contact = sasa_matrix[contact_event.start_frame + equilibration_offset] # Wrong! ``` This means frame 0 of `sasa_trajectory.npz` corresponds to absolute frame 0 of the MD trajectory (the first frame of equilibration, if equilibration frames were saved to the trajectory file). ### Equilibration frames are included in SASA but not in contacts SASA is computed over **all** saved trajectory frames, including any equilibration period at the start. Contact analysis may begin at a later frame (stored in `contacts_rep*.json` as `start_frame`). During the equilibration window, contact arrays are all-False — no contact events are recorded. **Effect on enrichment:** Equilibration frames contribute to $n_{\text{exp}}(t)$ and $n^G_{\text{exp}}(t)$ in the denominator of the expected fraction, but never to $n^G_{\text{con}}(t)$. This slightly deflates `mean_observed` relative to a post-equilibration-only analysis. In the reference dataset (38 equilibration frames out of 322 total, ~12%), the effect is: - Systematic: affects all conditions identically - Consistent: relative comparisons between conditions are unaffected - Disclosed: `sasa_metadata.json` records `n_frames` (total) and the contacts file records `start_frame` (equilibration end) If you want post-equilibration-only enrichment, you can slice the SASA array manually — see [Independent Verification](#independent-verification) below. ### Contact cutoff is stored in the contacts cache The enrichment and chaperone fraction are both computed from the same cached contact events. The cutoff used is recorded in each `contacts_rep*.json` under `criteria_cutoff`. If you re-run contact analysis with a different cutoff, you must also recompute exposure dynamics to maintain consistency. ```{note} The default contact cutoff in `plugins.contacts.cutoff` is 4.5 Å, but cached results from earlier runs may use 4.0 Å. Always check `criteria_cutoff` in the contacts JSON if you suspect a mismatch. See [Troubleshooting](troubleshooting.md) for details. ``` ### No causation or free-energy claims - Dynamic enrichment is a **conditional contact frequency ratio**, not a binding free energy or interaction enthalpy. - Chaperone fraction is an **event co-occurrence rate**, not a mechanistic claim that the polymer caused or prevented folding. - Use language like "SBM co-localizes with aromatic residues during exposure events" rather than "SBM stabilizes aromatic residues." --- ## Quick Start ### Enable exposure dynamics in comparison.yaml `````{tab-set} ````{tab-item} YAML (Recommended) ```yaml # comparison.yaml name: "Exposure Dynamics Study" control: "No Polymer" conditions: - label: "No Polymer" config: "../no_polymer/config.yaml" replicates: [1, 2, 3] - label: "100% SBMA" config: "../sbma_100/config.yaml" replicates: [1, 2, 3] - label: "100% EGMA" config: "../egma_100/config.yaml" replicates: [1, 2, 3] plugins: contacts: polymer_selection: "chainID C" protein_selection: "protein" cutoff: 4.5 exposure: exposure_threshold: 0.20 # Relative SASA threshold for exposed/buried transient_lower: 0.20 # Below this → stably buried transient_upper: 0.80 # Above this → stably exposed min_event_length: 1 # Minimum frames for an exposure window protein_chain: "A" # Chain letter for protein # polymer_resnames: ["SBM", "EGM"] # Optional: subset of polymer types ``` Then run: ```bash polyzymd compare run exposure -f comparison.yaml ``` ```` ````{tab-item} Python Exposure dynamics is designed to be run through the CLI — there is no stable public Python API for calling the analysis directly. **Recommended: use the orchestrator** If you need programmatic access, use the public orchestrator interface: ```python from polyzymd.analyses import get_analysis, run_comparison from polyzymd.config.comparison import ComparisonConfig config = ComparisonConfig.from_yaml("comparison.yaml") exposure = get_analysis("exposure") result = run_comparison(exposure(), config) ``` **Loading results after a CLI run** You can also run the analysis via the CLI and then load the JSON results: ```python import json from pathlib import Path # After running: polyzymd compare run exposure -f comparison.yaml result_path = Path("comparison/exposure/result.json") result = json.loads(result_path.read_text()) ``` ```` ````` ### Example output ``` Exposure Dynamics - Dynamic Enrichment by Amino Acid Class ------------------------------------------------------------------ Metric: mean_t(observed[t]) / mean_t(expected[t]) - 1 Exposure threshold: 0.20 (relative SASA, Tien et al. 2013) Frames: 322 total (38 equilibration + 284 production) Polymer: SBM (100% SBMA / 0% EGMA) AA Group Enrichment Observed Expected N frames exposed ----------------------------------------------------------------------- aromatic +6.942 0.651 0.083 322 charged_negative +3.053 0.321 0.079 322 charged_positive +1.738 0.206 0.076 322 nonpolar -0.376 0.049 0.078 322 polar +0.051 0.082 0.078 322 Chaperone Fraction - Transient Residues ------------------------------------------------------------------ Condition N transient Chaperone fraction 100% SBMA / 0% EGMA 41 0.517 ± 0.023 75% SBMA / 25% EGMA 39 0.762 ± 0.018 50% SBMA / 50% EGMA 38 0.743 ± 0.021 25% SBMA / 75% EGMA 35 0.635 ± 0.019 0% SBMA / 100% EGMA 33 0.441 ± 0.025 ``` ### Interpreting these results **Dynamic enrichment (SBM → aromatic = +6.94):** - SBM contacts aromatic residues 7.94× more often than their instantaneous share of the exposed surface would predict - This is **not** an artifact of there being few aromatic residues: with $\theta = 0.20$, aromatic residues are typically exposed in most frames (the Tien et al. max ASA for TRP/TYR/PHE is generous), and the result holds consistently across all 322 frames **Why dynamic enrichment can be much larger than static enrichment:** Both metrics compute the same ratio — observed contact share divided by expected contact share — but they define "expected" differently, and this difference matters when exposure is dynamic. *Static binding preference* defines expected from a **fixed snapshot**: the fraction of all surface-exposed residues (from a single PDB) that belong to group $G$. This denominator is a constant across all frames. *Dynamic enrichment* conditions on the **instantaneous exposed set at each frame**, computes `observed(t)` and `expected(t)` independently per frame, and then takes the enrichment of the means: ``` Static: enrichment = (contact_frames_G / contact_frames_all) / (n_exposed_G_static / n_exposed_all_static) - 1 Dynamic: enrichment = mean_t( observed(t) ) / mean_t( expected(t) ) - 1 where observed(t) = n_contacted_G(t) / n_exposed_G(t) expected(t) = n_exposed_G(t) / n_exposed_all(t) ``` **Concrete example (two representative frames):** The protein has ~20 aromatic residues. At threshold 0.20, typically 12 of them are exposed per frame, but the *total* exposed surface fluctuates with protein conformation. Suppose the polymer contacts 8 of the exposed aromatics in both frames: | | Frame A | Frame B | Mean across all frames | |-|---------|---------|----------------------| | Exposed aromatics | 12 | 12 | — | | Total exposed residues | 70 | 110 | — | | `observed(t)` = 8/12 | 0.667 | 0.667 | **0.667** | | `expected(t)` = 12/70, 12/110 | 0.171 | 0.109 | **0.140** | Dynamic enrichment = **mean(observed)** / **mean(expected)** − 1 = 0.667 / 0.140 − 1 = **+3.76** Notice that `mean(expected) = 0.140` is *not* the same as computing expected from the mean exposed counts: 12 / mean(70, 110) = 12/90 = 0.133. This is because `mean_t(ratio) ≠ ratio_of_means` when both numerator and denominator vary across frames. The dynamic expected is lower when the protein tends to expose aromatics preferentially in frames where the overall exposed set is also small (correlated fluctuations). Now compare to static enrichment with aromatics comprising 14% of the fixed surface: if the observed contact share is ~0.30, static enrichment = 0.30 / 0.14 − 1 = +1.14. The dynamic value is larger because the per-frame contact rate (8 out of 12 exposed aromatics) is much higher than what the aggregated contact-frame count captures across the whole trajectory. Both metrics are valid — they answer different questions. ```{admonition} Why we report mean(observed)/mean(expected) rather than mean(observed/expected) :class: note An alternative formula would compute the **per-frame enrichment ratio** and average those ratios: $$\text{enrichment}_{\text{alt}} = \overline{\left(\frac{\text{observed}(t)}{\text{expected}(t)} - 1\right)}$$ In the example above this gives (0.667/0.171 − 1 + 0.667/0.109 − 1) / 2 = **+4.01**, slightly larger than the +3.76 from the ratio-of-means formula. We chose **not** to use this formulation for one key reason: **small-denominator instability.** In frames where only 1–2 aromatic residues happen to be exposed out of 90 total, `expected(t)` is on the order of 0.01–0.02. A single contact with that one exposed residue sends `observed(t) / expected(t)` to 50–100, dominating the mean and producing enrichment values driven by rare conformational states rather than the polymer's typical behavior. The ratio-of-means formula (`mean(observed) / mean(expected)`) avoids this: each frame contributes equally to the numerator and denominator averages independently, so no single frame can distort the result. This is consistent with the LiveCoMS best practices for mean-based MD metrics (Grossfield et al., 2018): compute the average of the quantity of interest, then form the ratio — rather than averaging the ratio itself. ``` **Chaperone fraction trend:** The 75% SBMA / 25% EGMA condition shows a higher chaperone fraction (0.762) than pure SBMA (0.517). This does not imply 75% SBMA is "better" — it means exposure events in that condition are more frequently accompanied by polymer contact. Whether this reflects beneficial protection or increased polymer interference depends on the biological context and downstream functional assays. --- ## Configuration Reference ### ExposureSettings | Field | Type | Default | Description | |-------|------|---------|-------------| | `exposure_threshold` | float | `0.20` | Relative SASA threshold for exposed/buried classification (0–1). Stored in `sasa_metadata.json`. | | `transient_lower` | float | `0.20` | Exposure fraction below which a residue is classified `stably_buried`. | | `transient_upper` | float | `0.80` | Exposure fraction above which a residue is classified `stably_exposed`. | | `min_event_length` | int | `1` | Minimum consecutive exposed frames to count as an exposure window. | | `protein_chain` | str | `"A"` | Chain letter for the protein (chain A convention). | | `protein_selection` | str | `"protein"` | MDAnalysis selection for protein atoms. | | `polymer_selection` | str | `"chainID C"` | MDAnalysis selection for polymer atoms. | | `polymer_resnames` | list[str] | `null` | Subset of polymer monomer resnames to analyze. If null, all detected types are included. | | `probe_radius_nm` | float | `0.14` | Probe radius for MDTraj `shrake_rupley`, in nm (1.4 Å water probe). | | `n_sphere_points` | int | `960` | Number of sphere points for `shrake_rupley`. Higher = more accurate, slower. | ### Comparison Settings No comparison-specific parameters beyond analysis settings defaults. Specify only `plugins.exposure` in `comparison.yaml`. --- ## Amino Acid Classification Residues are grouped by chemical class for enrichment reporting. The same groups used in binding preference analysis apply here: | Group | Amino Acids | Chemical Property | |-------|-------------|-------------------| | `aromatic` | TRP, PHE, TYR | π-stacking, hydrophobic | | `charged_positive` | LYS, ARG | Cationic at pH 7 | | `charged_negative` | ASP, GLU | Anionic at pH 7 | | `nonpolar` | ALA, VAL, LEU, ILE, MET, PRO, GLY | Hydrophobic/aliphatic | | `polar` | SER, THR, ASN, GLN, HIS, CYS | H-bond donors/acceptors | ```{note} Histidine (HIS) is classified as **polar** because its protonation state is pH-dependent. If your system uses HID/HIE/HIP protonation state naming, these are recognized and classified correctly. ``` --- ## Output Files Exposure dynamics results are written under each condition's `analysis/` directory: ``` project/ └── condition_name/ └── analysis/ ├── contacts/ │ ├── run_1/ │ │ └── result.json ← Contact events (input to exposure analysis) │ ├── run_2/ │ │ └── result.json │ └── run_3/ │ └── result.json └── exposure/ ├── run_1/ │ ├── sasa/ │ │ ├── sasa_trajectory.npz ← Per-frame SASA (n_frames × n_residues) │ │ └── sasa_metadata.json ← Threshold, resnames, aa_classes, n_frames │ └── exposure/ │ └── exposure_dynamics.json ← Per-residue stability + chaperone events ├── run_2/ │ ├── sasa/ │ │ ├── sasa_trajectory.npz │ │ └── sasa_metadata.json │ └── exposure/ │ └── exposure_dynamics.json └── run_3/ ├── sasa/ │ ├── sasa_trajectory.npz │ └── sasa_metadata.json └── exposure/ └── exposure_dynamics.json ``` ### sasa_metadata.json fields | Field | Description | |-------|-------------| | `exposure_threshold` | The $\theta$ value used for this cache | | `n_frames` | Total frames in SASA array (including equilibration) | | `n_residues` | Number of protein residues | | `resnames` | List of 3-letter residue names in chain order | | `aa_classes` | List of AA class labels (same order as resnames) | | `trajectory_path` | Source trajectory file path (provenance) | ### exposure_dynamics.json fields (per residue) | Field | Type | Description | |-------|------|-------------| | `resid` | int | 1-indexed residue ID | | `resname` | str | 3-letter amino acid code | | `aa_class` | str | Amino acid class label | | `exposure_fraction` | float | Fraction of frames where residue is exposed | | `stability` | str | `"stably_exposed"`, `"transient"`, or `"stably_buried"` | | `n_exposed_windows` | int | Number of contiguous exposed windows detected | | `n_chaperone_events` | int | Windows with at least one polymer contact | | `n_unassisted_events` | int | Windows without any polymer contact | | `chaperone_fraction` | float | `n_chaperone / (n_chaperone + n_unassisted)` | | `polymer_type_counts` | dict | `{resname: n_events}` breakdown by polymer type | | `mean_chaperone_event_duration` | float | Mean length (frames) of chaperone events | | `mean_unassisted_event_duration` | float | Mean length (frames) of unassisted events | --- ## Independent Verification You can verify both metrics independently using only NumPy and the raw cached files, without running any polyzymd code. This allows you to audit the formula implementation directly. ```{tip} Run the following script on your own data to reproduce the enrichment and chaperone fraction for a single condition and replicate. All intermediate values are printed so you can trace any discrepancy. ``` ```python """ Independent audit of dynamic enrichment and chaperone fraction. Requires only: numpy, json, pathlib — no polyzymd imports. Usage: python audit_exposure.py Edit COND, REP, and BASE to match your data. """ import json import numpy as np from pathlib import Path # ── Edit these paths ───────────────────────────────────────────────────────── BASE = Path("/path/to/your/project") COND = "condition_name" REP = 1 POLYMER_TYPE = "SBM" # resname to audit AA_GROUP = "aromatic" # group to audit # ───────────────────────────────────────────────────────────────────────────── analysis_dir = BASE / COND / "analysis" # ── Load raw SASA ───────────────────────────────────────────────────────────── sasa_dir = analysis_dir / "exposure" / f"run_{REP}" / "sasa" npz = np.load(sasa_dir / "sasa_trajectory.npz") rel_sasa = npz["relative_sasa_per_frame"] # (n_frames, n_residues) resids = npz["resids"] # (n_residues,) meta = json.loads((sasa_dir / "sasa_metadata.json").read_text()) threshold = meta["exposure_threshold"] aa_classes = meta["aa_classes"] resnames = meta["resnames"] n_frames, n_residues = rel_sasa.shape print(f"SASA: {n_frames} frames × {n_residues} residues, threshold={threshold}") # ── Load contact events ─────────────────────────────────────────────────────── contacts_json = json.loads( (analysis_dir / "contacts" / f"run_{REP}" / "result.json").read_text() ) # Frame indices are ABSOLUTE (0-indexed from trajectory frame 0) # Do NOT add any equilibration offset when indexing into rel_sasa. contact_matrix = np.zeros((n_frames, n_residues), dtype=bool) resid_to_idx = {int(r): i for i, r in enumerate(resids)} for rc in contacts_json["residue_contacts"]: res_idx = resid_to_idx.get(rc["protein_resid"]) if res_idx is None: continue for sc in rc["segment_contacts"]: if sc["polymer_resname"] != POLYMER_TYPE: continue for ev in sc["events"]: start = ev["start_frame"] dur = ev["duration_frames"] contact_matrix[start : start + dur, res_idx] = True print(f"Contact matrix built: {contact_matrix.sum()} True entries for {POLYMER_TYPE}") # ── Compute enrichment ──────────────────────────────────────────────────────── exposed_mask = rel_sasa > threshold # (n_frames, n_residues) group_mask = np.array([c == AA_GROUP for c in aa_classes]) # (n_residues,) group_exposed = exposed_mask[:, group_mask] # (n_frames, n_group) group_contact = contact_matrix[:, group_mask] # (n_frames, n_group) total_exposed = exposed_mask.sum(axis=1).astype(float) # (n_frames,) n_group_exp = group_exposed.sum(axis=1).astype(float) # (n_frames,) n_group_con = (group_exposed & group_contact).sum(axis=1).astype(float) valid = n_group_exp > 0 observed = np.where(valid, n_group_con / np.maximum(n_group_exp, 1e-12), 0.0) expected = np.where(total_exposed > 0, n_group_exp / np.maximum(total_exposed, 1e-12), 0.0) mean_obs = observed[valid].mean() if valid.any() else 0.0 mean_exp = expected[valid].mean() if valid.any() else 0.0 enrichment = mean_obs / max(mean_exp, 1e-12) - 1.0 print(f"\nEnrichment ({POLYMER_TYPE} → {AA_GROUP}):") print(f" mean_observed = {mean_obs:.6f}") print(f" mean_expected = {mean_exp:.6f}") print(f" enrichment = {enrichment:.4f}") print(f" frames with group exposed: {valid.sum()} / {n_frames}") # ── Compute chaperone fraction for one residue ──────────────────────────────── # Pick first aromatic residue as example aromatic_idxs = np.where(group_mask)[0] if len(aromatic_idxs) == 0: print("No aromatic residues found.") else: idx = aromatic_idxs[0] resid = int(resids[idx]) exp_vec = exposed_mask[:, idx] con_vec = contact_matrix[:, idx] # Find contiguous exposed windows in_window = False windows = [] for f in range(n_frames): if exp_vec[f] and not in_window: win_start = f in_window = True elif not exp_vec[f] and in_window: windows.append((win_start, f - 1)) in_window = False if in_window: windows.append((win_start, n_frames - 1)) chap = sum(1 for s, e in windows if con_vec[s:e+1].any()) total_win = len(windows) frac = chap / total_win if total_win > 0 else 0.0 print(f"\nChaperone fraction for residue {resid} ({resnames[idx]}, {aa_classes[idx]}):") print(f" exposure windows: {total_win}") print(f" chaperone events: {chap}") print(f" chaperone_fraction = {frac:.3f}") # Compare with cached value dyn_path = analysis_dir / "exposure" / f"run_{REP}" / "exposure" / "exposure_dynamics.json" if dyn_path.exists(): dyn = json.loads(dyn_path.read_text()) cached = next((r for r in dyn["residues"] if r["resid"] == resid), None) if cached: print(f" cached value = {cached['chaperone_fraction']:.3f}") match = abs(frac - cached["chaperone_fraction"]) < 1e-3 print(f" match: {'YES ✓' if match else 'NO — investigate'}") ``` --- ## Troubleshooting ### "My manually computed enrichment doesn't match" The two most common causes, in order of likelihood: 1. **Wrong threshold.** Check `sasa_metadata.json` → `exposure_threshold`. If your manual computation uses a different threshold, the exposed sets will differ. Use `meta["exposure_threshold"]` directly rather than hardcoding a value. 2. **Frame offset applied incorrectly.** Contact `start_frame` values are absolute trajectory frame indices. Do not add any equilibration offset when indexing the SASA array. See [Contact frame indices are absolute](#contact-frame-indices-are-absolute-no-offset). 3. **Different polymer type.** If `polymer_resnames` is set in your config, only the listed types are included. Verify the resname matches exactly (case-sensitive: `"SBM"` not `"sbm"`). ### "Chaperone fraction seems too high / too low" - Check `n_transient_residues` in the output. If very few residues are transient (< 10), the mean is noisy and condition comparisons may not be meaningful. - Verify `transient_lower` and `transient_upper` are appropriate for your system. A protein with slow dynamics may need a narrower transient window. - Check whether `min_event_length` is filtering too aggressively. At `min_event_length=5`, short exposure flickers are discarded; at `=1`, all windows are counted. ### "Enrichment is very large (> 5)" This is not necessarily an artifact. Large enrichment values occur when a group has a small instantaneous expected fraction but a high observed contact rate. Verify by checking: - `n_frames_with_exposed` in the result: if this is close to `n_frames`, the group is consistently exposed and the large enrichment is robust. - `observed_contact_fraction`: if this is, e.g., 0.65, then the polymer contacts 65% of exposed aromatic residues per frame — a real signal. - `expected_contact_fraction_residue`: if this is ~0.08, the group comprises only 8% of the exposed surface on average, making the ratio large. Use the [audit script](#independent-verification) to trace the per-frame values if needed. ### "Results change when I rerun" Exposure dynamics results are cached in `exposure_dynamics.json` and SASA in `sasa_trajectory.npz`. If input trajectories or contact results change, the cache is stale. Force recomputation with: ```bash polyzymd compare run exposure -f comparison.yaml --recompute ``` Or delete the cache directories manually: ```bash rm -rf condition_name/analysis/exposure/run_*/sasa/ rm -rf condition_name/analysis/exposure/run_*/exposure/ ``` --- ## See Also - [Contacts Analysis Quick Start](analysis_contacts_quickstart.md) — contacts analysis required before exposure analysis - [Binding Preference Analysis](analysis_binding_preference.md) — static SASA-based enrichment - [Statistics Best Practices](../explanation/analysis_statistics_best_practices.md) — replicate aggregation, uncertainty - [Comparing Conditions](analysis_compare_conditions.md) — multi-condition workflows - [Equilibration](equilibration.md) — how equilibration frames affect trajectory-level analyses