# Polymer Bridging Analysis (Experimental)

```{warning}
Polymer bridging is **experimental**. This plugin was contributed as a
proof-of-concept extensibility exercise. Metric definitions,
chemistry-aware profiling outputs, and interpretation guidance are all
subject to change. CLI output and generated figures carry explicit
experimental labels.
```

Quantify **per-fragment, per-frame multisite attachment** of individual polymer
chains (oligomers) to the enzyme surface directly from trajectories.

This analysis answers:
*"When a single oligomer chain contacts the protein, how often does it contact
more than one distinct protein residue — and which residue classes and monomer
types are involved?"*

:::{admonition} Environment Setup
:class: tip

All commands below assume you have activated the PolyzyMD pixi environment:

```bash
pixi shell -e build
```

Alternatively, prefix each command with `pixi run -e build`.
:::

## Core Concepts

### What Is an Observation?

The fundamental unit of data in this analysis is one **observation**: a single
polymer fragment in a single trajectory frame that makes at least one contact
with the protein (within the distance cutoff). Every observation records:

- Which protein residues are contacted
- The frame-wise C-alpha distances between those residues
- The amino acid class of each contacted residue
- The monomer identity of each contacting polymer residue
- The ordered residue-name sequence (signature) of the polymer fragment

One replicate may produce thousands of observations. All statistics are computed
over this observation population.

### What Is "Multisite"?

An observation is classified as **multisite** when its eligible valency (the
number of contacted protein residues that count toward bridging) exceeds 1.
The meaning of "eligible" depends on the `min_ca_distance_angstrom` setting:

- **`min_ca_distance_angstrom = 0`** (default): Any observation contacting 2+
  distinct residues counts as multisite, regardless of their spatial
  separation.

- **`min_ca_distance_angstrom > 0`** (e.g., 8.0 or 10.0): An observation
  is multisite only if at least two contacted residues have a frame-wise
  C-alpha separation >= the threshold. This filters out observations whose
  contacted residues all sit close together (for example, sequence
  neighbors in the same surface patch), focusing on **geometrically
  significant bridging** across the protein surface.

  Eligible valency is the number of residues that participate in at least one
  such qualifying pair. It is computed **per frame** from actual atomic
  coordinates, so the same pair of residues may qualify in one frame and not
  another if the protein is flexible.
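
The per-frame rule above can be sketched in a few lines of plain Python. This
is an illustrative sketch, not the plugin's implementation: the function name
`eligible_valency`, the input layout (a mapping from residue ID to C-alpha
coordinates for one frame), and the fallback to a valency of 1 when no pair
qualifies are all assumptions.

```python
from itertools import combinations
from math import dist


def eligible_valency(ca_coords, min_ca_distance=0.0):
    """Count contacted residues that take part in at least one qualifying pair.

    ca_coords: {residue_id: (x, y, z)} C-alpha positions (Angstrom) of the
    protein residues contacted by one polymer fragment in one frame.
    """
    if not ca_coords:
        return 0  # no contact, so there is no observation at all
    if min_ca_distance <= 0:
        return len(ca_coords)  # filter disabled: every contacted residue counts
    eligible = set()
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(ca_coords.items(), 2):
        if dist(xyz_a, xyz_b) >= min_ca_distance:
            eligible.update((res_a, res_b))
    # Assumption: an observation with no qualifying pair is treated as single-site.
    return max(len(eligible), 1)


# Hypothetical frame: residues 10 and 11 are neighbors, residue 50 is far away.
frame = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 50: (12.0, 0.0, 0.0)}
print(eligible_valency(frame, 0.0))   # filter off: all three residues count
print(eligible_valency(frame, 10.0))  # only the 10-50 pair is >= 10 A apart
```

Because the coordinates come from a single frame, calling the function on
every frame reproduces the frame-to-frame variation described above.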

### What Is "High Valency"?

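An observation is **high valency** when its eligible valency is 3 or more,
meaning the polymer simultaneously contacts at least three eligible protein
residues in that frame. With `min_ca_distance_angstrom > 0`, each of those
residues is at least the threshold distance from one or more of the others.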

## Primary Metrics

| Metric | Field name | Description |
|--------|-----------|-------------|
| **Mean Contacts / Oligomer** | `mean_contacts_per_contacting_oligomer` | Average number of distinct protein residues contacted per observation. |
| **Multisite Fraction** | `multisite_fraction` | Fraction of observations with eligible valency > 1. |
| **High-Valency Fraction** | `high_valency_fraction` | Fraction of observations with eligible valency >= 3. |
| **Valency Distribution** | `valency_probabilities` | Probability of 1-site, 2-site, and 3+-site attachment across all observations. |

These four metrics are used in the default cross-condition comparison pipeline
(t-tests, ANOVA, effect sizes, rankings).
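
As a rough illustration of how the four metrics relate to the observation
population, the sketch below computes them from a list of
`(n_contacted_residues, eligible_valency)` tuples. The function name, the
input layout, and the `"1"` / `"2"` / `"3+"` keys are assumptions made for
this example; the plugin's internal data structures may differ.

```python
def primary_metrics(observations):
    """observations: list of (n_contacted_residues, eligible_valency) tuples."""
    n = len(observations)
    if n == 0:
        raise ValueError("no observations to summarize")
    valencies = [valency for _, valency in observations]
    return {
        "mean_contacts_per_contacting_oligomer": sum(c for c, _ in observations) / n,
        "multisite_fraction": sum(v > 1 for v in valencies) / n,
        "high_valency_fraction": sum(v >= 3 for v in valencies) / n,
        "valency_probabilities": {
            "1": sum(v == 1 for v in valencies) / n,
            "2": sum(v == 2 for v in valencies) / n,
            "3+": sum(v >= 3 for v in valencies) / n,
        },
    }


# Four hypothetical observations; the last one contacts two residues that
# are too close together to count as bridging.
metrics = primary_metrics([(1, 1), (2, 2), (3, 3), (2, 1)])
print(metrics["multisite_fraction"], metrics["high_valency_fraction"])
```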

## Chemistry-Aware Outputs (Experimental)

```{important}
All chemistry-aware outputs described below are labeled
`polymer_bridging_chemistry` in the experimental feature system. They are
**descriptive probabilities** — observed frequencies over the observation
population — not normalized enrichments and not evidence of mechanism.

Probabilities reflect what was observed in the simulation. They do not
control for surface accessibility, polymer composition, or reference
expectations. Interpret them as a **starting point for hypothesis
generation**, not as proof of preferential interaction.
```

Chemistry-aware outputs are computed only from **multivalent observations**
(eligible valency > 1). They are reported per-replicate and aggregated
(mean +/- SEM) across replicates.
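
For reference, a mean +/- SEM aggregation over replicates can be sketched as
follows. The helper name `mean_and_sem` is hypothetical, and the use of the
sample standard deviation (n - 1 in the denominator) is an assumption about
the convention rather than a statement of what the plugin does.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, SEM) for one quantity measured in several replicates."""
    if len(values) < 2:
        return mean(values), float("nan")  # SEM is undefined for one replicate
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical per-replicate probabilities for one residue class.
m, sem = mean_and_sem([0.30, 0.32, 0.34])
print(f"{m:.3f} +/- {sem:.3f}")
```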

### Protein Residue Classification

All protein residue outputs use the `ProteinAAClassification` scheme from
`polyzymd.analyses.shared.groupings`:

| Class | Amino Acids |
|-------|-------------|
| aromatic | PHE, TRP, TYR, HIS |
| charged_positive | ARG, LYS |
| charged_negative | ASP, GLU |
| polar | ASN, CYS, GLN, SER, THR |
| nonpolar | ALA, GLY, ILE, LEU, MET, PRO, VAL |
| unknown | Non-standard residues |

Common protonation-state variants (HIE, HID, HIP, ASH, GLH, etc.) are
automatically mapped to their parent residue.
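
The table and the variant mapping can be pictured as a two-step lookup. The
sketch below is a standalone illustration and does not use the
`ProteinAAClassification` API; the names `AA_CLASSES`, `VARIANT_TO_PARENT`,
and `classify_residue` are hypothetical, and the variant map contains only
the variants named above.

```python
AA_CLASSES = {
    "aromatic": {"PHE", "TRP", "TYR", "HIS"},
    "charged_positive": {"ARG", "LYS"},
    "charged_negative": {"ASP", "GLU"},
    "polar": {"ASN", "CYS", "GLN", "SER", "THR"},
    "nonpolar": {"ALA", "GLY", "ILE", "LEU", "MET", "PRO", "VAL"},
}

# Protonation-state variants mentioned in the text, mapped to their parent.
VARIANT_TO_PARENT = {
    "HIE": "HIS",
    "HID": "HIS",
    "HIP": "HIS",
    "ASH": "ASP",
    "GLH": "GLU",
}


def classify_residue(resname):
    """Return the class label for a residue name, or 'unknown'."""
    name = resname.strip().upper()
    parent = VARIANT_TO_PARENT.get(name, name)
    for label, members in AA_CLASSES.items():
        if parent in members:
            return label
    return "unknown"


print(classify_residue("HIE"), classify_residue("LYS"), classify_residue("MSE"))
```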

### Anchor and Peripheral Residues

In each multivalent observation, the plugin identifies an **anchor** — the
protein residue with the closest atom-level distance to the polymer. All other
eligible contacted residues are **peripheral**.

- **Anchor protein class probabilities**: Frequency distribution of the amino
  acid class of the anchor residue across all multivalent observations.

- **Peripheral protein class probabilities**: Frequency distribution of the
  amino acid classes of non-anchor eligible residues.

- **Multivalent protein class probabilities**: Frequency distribution of the
  amino acid classes of *all* eligible residues in multivalent observations
  (anchor + peripheral combined).
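
A minimal sketch of the anchor/peripheral split and the resulting class
frequencies is shown below. It assumes each multivalent observation is
available as a mapping from a residue label to its class and its closest
atom-level distance to the polymer; that layout, the function names, and the
choice to pool peripheral residues across observations are assumptions for
illustration only.

```python
from collections import Counter


def split_anchor(observation):
    """observation: {residue_label: (aa_class, min_distance_to_polymer)}."""
    anchor = min(observation, key=lambda res: observation[res][1])
    peripheral = [res for res in observation if res != anchor]
    return anchor, peripheral


def probabilities(labels):
    """Turn a list of labels into a {label: frequency} dictionary."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Two hypothetical multivalent observations.
observations = [
    {"TRP45": ("aromatic", 2.9), "LYS80": ("charged_positive", 3.6)},
    {
        "TRP45": ("aromatic", 3.1),
        "ASP12": ("charged_negative", 3.0),
        "LYS80": ("charged_positive", 4.2),
    },
]

anchor_classes, peripheral_classes = [], []
for obs in observations:
    anchor, peripheral = split_anchor(obs)
    anchor_classes.append(obs[anchor][0])
    peripheral_classes.extend(obs[res][0] for res in peripheral)

anchor_probs = probabilities(anchor_classes)
peripheral_probs = probabilities(peripheral_classes)
print(anchor_probs)
print(peripheral_probs)
```

Note that `min` returns the first of several equally close residues, so ties
are resolved by input order in this sketch.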

### Polymer Monomer Probabilities

- **Polymer contact type probabilities**: Frequency of each polymer monomer
  type (by residue name, e.g. SBM, EGM) among all polymer residues that
  make protein contacts in multivalent observations.

- **Polymer anchor type probabilities**: Frequency of the polymer monomer
  type of the anchor (the polymer residue closest to the anchor protein
  residue) across multivalent observations.

### Cross-Classification Matrices

- **Anchor-to-peripheral class matrix**
  (`anchor_to_peripheral_group_matrix`): A row-normalized matrix where
  rows are the anchor protein class and columns are peripheral protein
  classes. Each row sums to 1.0. Answers: *"Given that the anchor is
  aromatic, what protein classes are the peripheral contacts?"*

- **Polymer-anchor to protein-anchor matrix**
  (`polymer_anchor_to_protein_anchor_matrix`): A row-normalized matrix
  where rows are polymer monomer types and columns are protein anchor
  classes. Each row sums to 1.0. Answers: *"Given that SBM is the polymer
  anchor monomer, which protein residue classes does it anchor to?"*
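
Row normalization means each row is divided by its own total, so every row
reads as a conditional distribution. The sketch below builds such a matrix
from `(row_label, column_label)` co-occurrences; the function name and input
format are hypothetical and only illustrate the normalization step.

```python
from collections import defaultdict


def row_normalized_matrix(pairs):
    """pairs: iterable of (row_label, column_label) co-occurrences."""
    counts = defaultdict(lambda: defaultdict(int))
    for row, col in pairs:
        counts[row][col] += 1
    matrix = {}
    for row, cols in counts.items():
        total = sum(cols.values())
        # Divide by the row total so the row sums to 1.0.
        matrix[row] = {col: n / total for col, n in cols.items()}
    return matrix


# Hypothetical (anchor class, peripheral class) pairs.
pairs = [
    ("aromatic", "charged_positive"),
    ("aromatic", "polar"),
    ("aromatic", "charged_positive"),
    ("polar", "nonpolar"),
]
matrix = row_normalized_matrix(pairs)
for row, cols in matrix.items():
    print(row, cols)
```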

### Fragment Signature Probabilities

Each polymer fragment has an **ordered 5-mer signature** — the sequence of
residue names along the fragment (e.g., `EGM-EGM-SBM-EGM-EGM`). The top-10
most frequent signatures across multivalent observations are reported as
probabilities. These may help identify whether specific polymer subsequences
are over-represented in bridging events.

```{note}
Fragment signatures depend on the topology's residue ordering. In practice
the fragment length equals the number of monomers in the polymer chain. The
"5-mer" label is for illustration — the actual signature length is the full
fragment.
```
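
Counting signatures amounts to joining the residue names of each fragment
and tallying the resulting strings. The following sketch assumes one list of
residue names per multivalent observation; the function name and the `-`
separator follow the example above but are otherwise illustrative.

```python
from collections import Counter


def top_signatures(fragments, top_n=10):
    """fragments: one list of residue names per multivalent observation."""
    counts = Counter("-".join(names) for names in fragments)
    total = sum(counts.values())
    return [(sig, n / total) for sig, n in counts.most_common(top_n)]


# Hypothetical population: three observations of one fragment, one of another.
fragments = [["EGM", "EGM", "SBM", "EGM", "EGM"]] * 3 + [
    ["SBM", "EGM", "EGM", "EGM", "EGM"]
]
for signature, probability in top_signatures(fragments):
    print(f"{signature}: {probability:.2f}")
```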

## Quick Start

### Step 1: Add bridging settings to your comparison YAML

```yaml
# comparison.yaml
plugins:
  polymer_bridging:
    cutoff: 4.5
    min_ca_distance_angstrom: 8.0   # Require contacted residues to be >= 8 A apart
    protein_selection: "protein"
    polymer_selection: "chainID C"  # Must match chain convention
```

Setting `min_ca_distance_angstrom: 0` disables the geometric filter entirely,
counting any 2+-residue observation as multisite.

### Step 2: Run the analysis

```bash
polyzymd compare run polymer_bridging -f comparison.yaml
```

The plugin automatically filters out conditions that have no polymer atoms
(e.g., a protein-only control).

### Step 3: Inspect results

The CLI prints a comparison table for the scalar primary metrics. Example:

```text
WARNING: Experimental analysis
Definitions and interpretation may change after the presentation release.
Affected: Polymer bridging chemistry profiling

Polymer Bridging Comparison
================================================================================
Multisite Fraction
  100% SBMA    : 0.312 +/- 0.018  (n=3)
  SBMA-EGPMA   : 0.487 +/- 0.025  (n=3)
  Pairwise: 100% SBMA -> SBMA-EGPMA  p=0.004 ** d=1.82  +56.1%  (more multisite)

Average Oligomer Valency
  100% SBMA    : 1.41 +/- 0.03  (n=3)
  SBMA-EGPMA   : 1.72 +/- 0.05  (n=3)
  Pairwise: 100% SBMA -> SBMA-EGPMA  p=0.008 ** d=1.54  +22.0%  (more bridging)

High-Valency Oligomers
  100% SBMA    : 0.051 +/- 0.009  (n=3)
  SBMA-EGPMA   : 0.128 +/- 0.014  (n=3)
  Pairwise: 100% SBMA -> SBMA-EGPMA  p=0.012 *  d=1.23  +150.8%  (more high-valency)
```

```{tip}
Use `--format json` to export the full `ComparisonResult` for downstream
analysis, or `--format markdown` for integration into reports.
```

### Step 4: Generate plots

```bash
polyzymd compare plot-all -f comparison.yaml
```

## Configuration Reference

### `plugins.polymer_bridging`

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cutoff` | float | `4.5` | Contact distance cutoff in Angstroms. Atom pairs within this distance are considered in contact. |
| `min_ca_distance_angstrom` | float | `0.0` | Minimum frame-wise C-alpha distance between contacted protein residues for an observation to count as multisite. Set > 0 to filter for geometrically significant bridging. Must be >= 0. |
| `protein_selection` | str | `"protein"` | MDAnalysis atom selection string for the protein. |
| `polymer_selection` | str | `"chainID C"` | MDAnalysis atom selection string for the polymer. Must match the chain convention (C = polymer). |

### Plot Settings

All plots are enabled by default and can be individually toggled in
`plot_settings.polymer_bridging`:

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `generate_multisite_bars` | bool | `true` | Bar chart of multisite fraction per condition. |
| `generate_mean_contacts_bars` | bool | `true` | Bar chart of mean contacted residues per oligomer. |
| `generate_valency_stack` | bool | `true` | Stacked bars showing 1 / 2 / 3+ valency distribution. |
| `generate_anchor_group_bars` | bool | `true` | Grouped bars of anchor protein residue class. |
| `generate_protein_group_stack` | bool | `true` | Stacked bars of protein classes in multivalent events. |
| `generate_anchor_peripheral_heatmap` | bool | `true` | Heatmap of anchor vs. peripheral protein class co-occurrence. |
| `generate_polymer_anchor_heatmap` | bool | `true` | Heatmap of polymer anchor monomer vs. protein anchor class. |
| `generate_fragment_signature_bars` | bool | `true` | Top-10 fragment signatures by frequency. |
| `figsize_bars` | (float, float) | `(9, 6)` | Figure size for bar charts. |
| `figsize_stack` | (float, float) | `(11, 6)` | Figure size for stacked charts. |
| `figsize_heatmap` | (float, float) | `(9, 7)` | Figure size for heatmaps. |

## Output Files

### Comparison cache

Per-replicate and aggregated results use **fingerprinted filenames** that
encode the analysis settings hash, ensuring that results computed with
different settings do not collide in the cache.

```text
comparison_workspace/
├── analysis/
│   ├── <condition_A>/
│   │   └── polymer_bridging/
│   │       ├── run_1/
│   │       │   └── polymer_bridging_<settings_tag>.json
│   │       ├── run_2/
│   │       │   └── polymer_bridging_<settings_tag>.json
│   │       └── aggregated/
│   │           └── polymer_bridging_<rep_range>_<settings_tag>.json
│   └── <condition_B>/
│       └── polymer_bridging/
│           └── ...
└── comparison/
    └── polymer_bridging/
        └── result.json                  # ComparisonResult
```

`<settings_tag>` is an 8-character hex fingerprint derived from the plugin
settings (cutoff, selections, etc.).  `<rep_range>` encodes the replicate
range (e.g., `r1-3` for replicates 1–3).
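
To give a feel for how a settings fingerprint keeps cache entries apart, the
sketch below hashes a canonical JSON rendering of the settings and keeps the
first 8 hex characters. The actual hashing scheme used by PolyzyMD is not
documented here, so the function name, the choice of SHA-256, and the JSON
canonicalization are all assumptions.

```python
import hashlib
import json


def settings_tag(settings):
    """Return an 8-character hex tag that changes when any setting changes."""
    # Sorting keys makes the tag independent of dictionary ordering.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


tag_filtered = settings_tag({"cutoff": 4.5, "min_ca_distance_angstrom": 8.0})
tag_unfiltered = settings_tag({"cutoff": 4.5, "min_ca_distance_angstrom": 0.0})
print(tag_filtered, tag_unfiltered)
```

Two runs that differ only in `min_ca_distance_angstrom` therefore end up in
different files instead of overwriting each other.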

### Generated Figures

All figures are stamped with an `EXPERIMENTAL` tag and saved to the configured
output directory (default: `figures/`).

| Filename | Content |
|----------|---------|
| `polymer_bridging_multisite_fraction.*` | Bar chart of multisite fraction per condition. |
| `polymer_bridging_mean_contacts.*` | Bar chart of mean contacted residues per oligomer. |
| `polymer_bridging_valency_distribution.*` | Stacked bars: 1-site, 2-site, 3+ site. |
| `polymer_bridging_anchor_groups.*` | Grouped bars of anchor protein residue class. |
| `polymer_bridging_protein_group_distribution.*` | Stacked bars of all protein classes in multivalent events. |
| `polymer_bridging_anchor_peripheral_heatmap.*` | Anchor class (row) vs. peripheral class (column). |
| `polymer_bridging_polymer_anchor_heatmap.*` | Polymer monomer type (row) vs. protein anchor class (column). |
| `polymer_bridging_fragment_signatures.*` | Top-10 fragment signature probabilities. |

## CLI Reference

```text
polyzymd compare run polymer_bridging [OPTIONS]
```

| Option | Description |
|--------|-------------|
| `-f, --file PATH` | Path to `comparison.yaml` (default: `comparison.yaml`) |
| `--recompute` | Force recompute even if cached results exist |
| `--format [table\|markdown\|json]` | Output format (default: `table`) |
| `-o, --output PATH` | Save output to file |
| `-q, --quiet` | Suppress INFO messages |
| `--debug` | Enable DEBUG logging |
| `--eq-time TEXT` | Override equilibration time |

Alias: `polyzymd compare run bridging` (resolves to `polymer_bridging`).

## Loading Results Programmatically

Result files use fingerprinted names (see [Output Files](#output-files)), so
use `glob` to locate each file and then load it with the matching result class:

```python
from pathlib import Path
from polyzymd.analyses.polymer_bridging import (
    PolymerBridgingAggregatedResult,
    PolymerBridgingReplicateResult,
)

# Load per-replicate result (fingerprinted filename)
run_dir = Path("analysis/<condition>/polymer_bridging/run_1")
rep_file = next(run_dir.glob("polymer_bridging_*.json"))
rep = PolymerBridgingReplicateResult.load(rep_file)
print(f"Multisite fraction: {rep.multisite_fraction:.3f}")
print(f"Anchor protein groups: {rep.anchor_protein_group_probabilities}")

# Load aggregated result (fingerprinted filename)
agg_dir = Path("analysis/<condition>/polymer_bridging/aggregated")
agg_file = next(agg_dir.glob("polymer_bridging_*.json"))
agg = PolymerBridgingAggregatedResult.load(agg_file)
print(f"Mean valency: {agg.mean_contacts_per_contacting_oligomer:.2f} "
      f"+/- {agg.mean_contacts_sem:.2f}")

# Inspect cross-classification matrices
for anchor_class, peripherals in agg.anchor_to_peripheral_group_matrix_mean.items():
    for peripheral_class, prob in peripherals.items():
        if prob > 0.05:
            print(f"  {anchor_class} -> {peripheral_class}: {prob:.2f}")
```
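
When a directory holds results from more than one settings fingerprint,
`next(...glob(...))` silently picks whichever file is listed first. A slightly
more defensive lookup, sketched here with the hypothetical helper
`find_result_file`, fails loudly instead. It relies only on `pathlib` and on
the filename pattern shown above.

```python
from pathlib import Path


def find_result_file(directory, pattern="polymer_bridging_*.json"):
    """Return the single file matching pattern, or raise if none or several."""
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    if len(matches) > 1:
        names = ", ".join(path.name for path in matches)
        raise RuntimeError(f"several cached results in {directory}: {names}")
    return matches[0]


# Usage (path placeholder as in the example above):
# rep_file = find_result_file("analysis/<condition>/polymer_bridging/run_1")
```

If several fingerprints are expected, narrowing the pattern to one
`<settings_tag>` is an alternative to raising.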

## Interpretation Caveats

```{warning}
These caveats are essential for responsible use of this analysis. Read them
before presenting or publishing polymer bridging results.
```

1. **Descriptive, not mechanistic.** All outputs are observed frequencies. A
   high anchor probability for aromatic residues does not prove that aromatic
   anchoring drives conjugate stability — it may reflect surface composition,
   polymer placement, or simulation artifacts.

2. **Not normalized enrichment.** Unlike binding preference analysis, polymer
   bridging probabilities are **raw frequencies over the observation
   population**, not enrichments normalized by surface availability. A protein
   with 30% aromatic surface and 30% aromatic anchors is not showing
   enrichment — it is showing baseline proportionality. Compare with surface
   composition before drawing conclusions.

3. **Frame-wise, not event-wise.** Observations are counted separately in
   every trajectory frame. A polymer that bridges two residues for 500
   consecutive frames counts as 500 observations, not one sustained bridging
   event, so long-lived contacts weigh more heavily in the frequencies.
   Residence-time analysis is not yet implemented.

4. **C-alpha distance is dynamic.** When `min_ca_distance_angstrom > 0`, the
   threshold is evaluated per frame against actual C-alpha coordinates. Protein
   breathing motions mean that the same pair of residues may qualify in some
   frames and not others. This is physically correct but can make results
   sensitive to protein flexibility.

5. **Anchor selection is heuristic.** The anchor is the protein residue in
   the polymer-protein atom pair with the minimum distance. When several
   contacts are equidistant, the choice is arbitrary. This affects
   anchor-specific outputs but not the primary multisite/valency metrics.

6. **Fragment signatures assume topology ordering.** The ordered monomer
   sequence comes directly from the topology. If the topology's residue
   ordering does not reflect the true polymer sequence, signatures will be
   misleading.

7. **Conditions without polymer are filtered.** The plugin automatically
   excludes conditions that lack polymer atoms (e.g., protein-only controls).
   This is correct behavior but means the control condition for pairwise
   statistics must itself contain polymer.

8. **Proof-of-concept status.** This plugin was contributed as an extensibility
   exercise. The analysis methodology has not been independently validated.
   Use it for internal hypothesis generation, not for publication-ready
   claims, until the methodology matures.
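
To make caveat 3 concrete, the sketch below collapses a per-frame multisite
flag for a single polymer chain into sustained events by measuring runs of
consecutive `True` frames. The plugin does not provide this; the function
name and the availability of a per-chain, per-frame flag are assumptions for
illustration.

```python
from itertools import groupby


def bridging_event_lengths(multisite_by_frame):
    """multisite_by_frame: one bool per frame for a single polymer chain."""
    return [
        sum(1 for _ in run)
        for is_multisite, run in groupby(multisite_by_frame)
        if is_multisite
    ]


# Seven hypothetical frames: five multisite observations, but only two events.
flags = [False, True, True, True, False, True, True]
lengths = bridging_event_lengths(flags)
print(f"{sum(flags)} observations, {len(lengths)} events, lengths {lengths}")
```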

## Relation to Other Analyses

| Analysis | What It Measures | Relation to Polymer Bridging |
|----------|-----------------|------------------------------|
| **Contacts** | Total contact counts and frequencies | Polymer bridging decomposes contacts per-chain, adding valency information. |
| **Binding Preference** | Enrichment by residue class | Provides surface-normalized context that bridging lacks. |
| **Polymer Affinity** | Total interaction strength (N x deltaG) | Complementary: affinity measures total adhesion; bridging measures spatial distribution of adhesion per chain. |
| **RMSF** | Structural flexibility | Complementary: does multisite bridging correlate with reduced flexibility? |
| **Catalytic Triad** | Active site geometry | Complementary: do bridging events coincide with triad perturbation? |
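
The difference between a raw bridging frequency and a surface-normalized
value can be shown with a simple ratio. This is a generic observed/expected
ratio used for illustration; it is not necessarily the formula used by the
Binding Preference analysis, and the function name is hypothetical.

```python
def enrichment_ratio(observed_fraction, surface_fraction):
    """Ratio above 1 suggests over-representation relative to the surface."""
    if surface_fraction <= 0:
        raise ValueError("surface_fraction must be positive")
    return observed_fraction / surface_fraction


# The example from the caveats: 30% aromatic anchors on a 30% aromatic surface.
print(enrichment_ratio(0.30, 0.30))
# The same anchor frequency on a surface that is only 10% aromatic.
print(enrichment_ratio(0.30, 0.10))
```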

## Troubleshooting

### "No conditions passed polymer filtering"

All conditions were filtered out because the plugin could not detect polymer
atoms. Check that your `polymer_selection` matches your topology (default is
`"chainID C"`, following the PolyzyMD chain convention).

### Very low multisite fraction

If multisite fraction is near zero:
- Check that `min_ca_distance_angstrom` is not too stringent. A value of 20+ A
  may filter out nearly all events for small proteins.
- Verify that the polymer fragments contain more than one monomer.
- Check that the contact cutoff is appropriate for your force field (4.5 A is
  standard for heavy-atom contacts).

### All probabilities are empty (`{}`)

Chemistry-aware outputs require multivalent observations. If no observation has
eligible valency > 1, all chemistry dictionaries will be empty. Lower
`min_ca_distance_angstrom` or verify that the polymer makes multi-residue
contacts.

### Heatmaps show only one condition

The anchor-peripheral and polymer-anchor heatmaps currently display data for
the **first condition only** (by label order). This is a known limitation of
the current plotting code. To compare matrices across conditions, load the
aggregated results programmatically (see
[Loading Results Programmatically](#loading-results-programmatically)).
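
Once two aggregated matrices are loaded as nested dictionaries, an
element-wise difference is one simple way to compare conditions. The helper
below is a hypothetical sketch operating on plain `{row: {column: value}}`
dictionaries; treating missing entries as 0.0 is an assumption, since a
missing row may also mean that no observation had that anchor class.

```python
def matrix_difference(matrix_a, matrix_b):
    """Return matrix_b - matrix_a for nested {row: {column: probability}} dicts."""
    diff = {}
    for row in sorted(set(matrix_a) | set(matrix_b)):
        row_a = matrix_a.get(row, {})
        row_b = matrix_b.get(row, {})
        diff[row] = {
            col: row_b.get(col, 0.0) - row_a.get(col, 0.0)
            for col in sorted(set(row_a) | set(row_b))
        }
    return diff


# Hypothetical anchor-to-peripheral matrices for two conditions.
condition_a = {"aromatic": {"polar": 0.25, "nonpolar": 0.75}}
condition_b = {
    "aromatic": {"polar": 0.5, "nonpolar": 0.5},
    "polar": {"polar": 1.0},
}
diff = matrix_difference(condition_a, condition_b)
for row, cols in diff.items():
    print(row, cols)
```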

## See Also

- [Contacts Analysis Quick Start](analysis_contacts_quickstart.md) — prerequisite contact computation
- [Binding Preference Analysis](analysis_binding_preference.md) — surface-normalized enrichment (complementary)
- [Polymer Affinity Analysis](analysis_polymer_affinity.md) — total interaction strength scoring
- [Statistics Best Practices](../explanation/analysis_statistics_best_practices.md) — replicate planning
- [Comparing Conditions](analysis_compare_conditions.md) — multi-condition workflows
- [Extending the Analysis Framework](../contributor_guide/extending_analyses.md) — contribute a new plugin