# Catalytic Triad Analysis: Statistical Best Practices

A comprehensive guide to analyzing catalytic triad geometry from MD simulations,
including statistical considerations, interpretation guidelines, and worked
examples from real enzyme-polymer systems.

```{note}
**Just need quick results?** See the [Quick Start Guide](analysis_triad_quickstart.md)
for copy-paste commands and minimal setup.
```

## Introduction

The catalytic triad is a fundamental structural motif in many enzyme families,
including serine proteases, lipases, and esterases. Maintaining proper triad
geometry is essential for catalytic activity. This guide explains:

- What catalytic triads are and why they matter
- How to measure triad integrity in MD simulations
- The "simultaneous contact fraction" metric
- Statistical considerations for robust analysis
- How to interpret results in the context of enzyme function

## What is a Catalytic Triad?

### The Ser-His-Asp/Glu Mechanism

The classic catalytic triad consists of three amino acids working together:

1. **Serine (Ser)** - The nucleophile that attacks the substrate
2. **Histidine (His)** - The general base/acid that shuttles protons
3. **Aspartate (Asp) or Glutamate (Glu)** - Orients and activates histidine

These residues form a hydrogen bonding network:

```
        O                         O
        ||                        ||
Substrate-C     ->    Substrate-C-O-Ser
        |                         |
       ...                       ...

    Asp/Glu --- His --- Ser --- Substrate
       |         |       |
      C=O      N-H     O-H
       |         |       |
      O-H .... N      [nucleophilic attack]
```

The proton relay works as follows:
1. Asp stabilizes the positively charged His through a hydrogen bond
2. His abstracts a proton from Ser's hydroxyl group
3. The activated Ser oxygen attacks the substrate carbonyl

### Critical Distances

For the triad to function, specific hydrogen bonds must be maintained:

| Pair | Donor | Acceptor | Typical Distance |
|------|-------|----------|------------------|
| Asp-His | His ND1 (N-H) | Asp OD1/OD2 | 2.7 - 3.5 A |
| His-Ser | Ser OG (O-H) | His NE2 | 2.7 - 3.5 A |

These distances are measured between heavy atoms (N, O). The actual H-bond
length (H...acceptor) is ~1.0 A shorter.

### Why Geometry Matters

If these distances increase beyond ~4 A:
- Hydrogen bonds break
- Proton relay cannot function
- Catalytic activity is lost or severely reduced

MD simulations can reveal whether polymer conjugation, mutations, or other
modifications disrupt triad geometry.

## The Simultaneous Contact Metric

### Definition

The **simultaneous contact fraction** is the percentage of frames where
**all** distance pairs in the triad are below the contact threshold at the
same time:

$$
f_{\text{contact}} = \frac{1}{N} \sum_{t=1}^{N} \prod_{i=1}^{M} \mathbb{1}[d_i(t) < \theta]
$$

Where:
- $N$ = number of frames
- $M$ = number of pairs (typically 2 for a triad)
- $d_i(t)$ = distance of pair $i$ at frame $t$
- $\theta$ = contact threshold (typically 3.5 A)
- $\mathbb{1}[\cdot]$ = indicator function (1 if true, 0 if false)

### Why Simultaneous Matters

Consider two scenarios with 50% individual contact per pair:

**Scenario A: Alternating**
```
Frame:    1  2  3  4  5  6  7  8
Asp-His:  +  -  +  -  +  -  +  -   (50% contact)
His-Ser:  -  +  -  +  -  +  -  +   (50% contact)
Both:     -  -  -  -  -  -  -  -   (0% simultaneous!)
```

**Scenario B: Correlated**
```
Frame:    1  2  3  4  5  6  7  8
Asp-His:  +  +  +  +  -  -  -  -   (50% contact)
His-Ser:  +  +  +  +  -  -  -  -   (50% contact)
Both:     +  +  +  +  -  -  -  -   (50% simultaneous)
```

Scenario A has the same per-pair statistics but **zero** catalytic competence
because the triad is never intact. The simultaneous contact fraction captures
this critical distinction.

### Interpretation Guidelines

| Simultaneous Contact | Interpretation |
|---------------------|----------------|
| > 80% | Excellent triad integrity |
| 50 - 80% | Good integrity, some flexibility |
| 20 - 50% | Moderate disruption |
| 5 - 20% | Significant disruption |
| < 5% | Severely disrupted triad |

```{warning}
These thresholds are guidelines, not absolute rules. The relationship between
simultaneous contact and actual catalytic activity depends on the enzyme and
should be validated against experimental data when possible.
```

## H-Bond Distance Thresholds

### Choosing a Threshold

The default threshold of **3.5 A** is based on typical H-bond geometry:

| Threshold | Rationale |
|-----------|-----------|
| 3.0 A | Strict - strong H-bonds only |
| 3.5 A | Standard - typical H-bond cutoff |
| 4.0 A | Relaxed - includes weaker interactions |

### When to Adjust

**Use a stricter threshold (3.0 A)** when:
- Comparing to crystal structures (often show shorter distances)
- Studying high-activity conformations
- The default gives near-100% contact (not discriminating)

**Use a relaxed threshold (4.0 A)** when:
- Studying dynamic enzymes with flexible active sites
- Default gives very low contact but enzyme is known to be active
- Accounting for force field limitations

### Heavy Atom vs Hydrogen Distances

PolyzyMD measures **heavy atom distances** (N-O, O-O), not hydrogen positions:

| Measurement | Typical Value |
|-------------|---------------|
| Heavy atom (N...O) | 2.7 - 3.5 A |
| H-bond (H...O) | 1.7 - 2.5 A |

This is more robust because hydrogen positions are often uncertain in MD
(fast motion, force field limitations).

## Autocorrelation Analysis

### The Correlation Problem

Contact states in MD are temporally correlated. If the triad is in contact at
frame 100, it's very likely to be in contact at frame 101. This affects
uncertainty estimation.

### How PolyzyMD Handles This

PolyzyMD computes the autocorrelation function of the simultaneous contact
timeseries and estimates:

1. **Correlation time (tau)** - characteristic decay time
2. **N_independent** - effective number of independent samples
3. **Corrected SEM** - uncertainty accounting for correlation

### SEM for Proportions

For a binary contact timeseries (1 = contact, 0 = no contact), the standard
error of the proportion is:

$$
\text{SEM} = \sqrt{\frac{p(1-p)}{N_{\text{ind}}}}
$$

Where:
- $p$ = simultaneous contact fraction
- $N_{\text{ind}}$ = number of independent samples (after autocorrelation correction)

This is the formula for the standard error of a binomial proportion.

### The Low Reliability Warning

When `N_independent < 10`, PolyzyMD warns about low statistical reliability.
This doesn't mean your results are wrong, but uncertainties may be
underestimated.

**Solutions:**
1. Use multiple independent replicates (recommended)
2. Run longer simulations
3. Interpret results with appropriate caution

## Worked Example: LipA Polymer Study

This example uses real data from simulations of Lipase A (LipA) from
*Bacillus subtilis* with various polymer conjugations.

### System Details

- **Enzyme**: Lipase A (181 residues)
- **Catalytic triad**: Ser77, His156, Asp133
- **Simulations**: 200 ns production, 100 ns equilibration discarded
- **Conditions**: 6 polymer compositions (3 replicates each)

### comparison.yaml Configuration

```yaml
name: "LipA_polymer_study"
description: "Effect of SBMA/EGMA polymer composition on LipA triad"
control: "No Polymer"

conditions:
  - label: "No Polymer"
    config: "../noPoly_LipA_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "100% SBMA"
    config: "../SBMA_100_DMSO/config.yaml"
    replicates: [1, 2]

  - label: "75% SBMA / 25% EGMA"
    config: "../SBMA_75_EGMA_25_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "50% SBMA / 50% EGMA"
    config: "../SBMA_50_EGMA_50_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "25% SBMA / 75% EGMA"
    config: "../SBMA_25_EGMA_75_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "100% EGMA"
    config: "../EGMA_100_DMSO/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "100ns"

catalytic_triad:
  name: "LipA_catalytic_triad"
  description: "Ser-His-Asp catalytic triad of Lipase A (Bacillus subtilis)"
  threshold: 3.5
  pairs:
    - label: "Asp133-His156"
      selection_a: "midpoint(resid 133 and name OD1 OD2)"
      selection_b: "resid 156 and name ND1"
    - label: "His156-Ser77"
      selection_a: "resid 156 and name NE2"
      selection_b: "resid 77 and name OG"
```

### Results Summary

| Condition | n | Asp133-His156 | His156-Ser77 | Simultaneous Contact |
|-----------|---|---------------|--------------|---------------------|
| No Polymer | 3 | 3.09 +/- 0.21 A | 4.03 +/- 1.07 A | 49.9 +/- 27.3% |
| 100% SBMA | 2 | 3.03 +/- 0.00 A | 3.75 +/- 0.86 A | 45.3 +/- 45.3% |
| 75% SBMA / 25% EGMA | 3 | 3.67 +/- 0.51 A | 4.92 +/- 0.50 A | 6.9 +/- 6.8% |
| 50% SBMA / 50% EGMA | 3 | 8.20 +/- 4.30 A | 4.23 +/- 0.20 A | 7.9 +/- 7.9% |
| 25% SBMA / 75% EGMA | 3 | 5.25 +/- 0.56 A | 5.50 +/- 1.07 A | 14.0 +/- 14.0% |
| **100% EGMA** | 3 | 3.29 +/- 0.30 A | 3.31 +/- 0.20 A | **50.6 +/- 17.0%** |

### Interpretation

1. **Best triad preservation**: 100% EGMA shows the highest simultaneous
   contact (50.6%) with the lowest variability (+/-17%), suggesting it
   provides consistent triad stabilization.

2. **No Polymer baseline**: 49.9% contact with high variability (+/-27.3%)
   indicates the unmodified enzyme has dynamic triad geometry.

3. **Mixed ratios disrupt triad**: 75% SBMA/25% EGMA, 50/50, and 25/75
   all show severely reduced contact (6.9-14.0%), suggesting these mixed
   polymer compositions interfere with triad geometry.

4. **Diagnostic per-pair analysis**: The 50/50 condition shows very high
   Asp133-His156 distance (8.20 A), indicating the Asp-His hydrogen bond
   is specifically disrupted. The His-Ser distance remains closer to
   normal (4.23 A).

### Conclusions

- Pure EGMA coating preserves triad geometry better than pure SBMA or no polymer
- Mixed polymer ratios severely disrupt the catalytic triad
- The Asp-His hydrogen bond appears most sensitive to disruption

These results suggest that polymer composition significantly affects enzyme
active site geometry, with implications for catalytic activity.

## Selection Syntax Details

### Standard MDAnalysis Selections

Most selections use standard MDAnalysis syntax:

```yaml
selection_a: "resid 77 and name OG"    # Serine OG oxygen
selection_b: "resid 156 and name NE2"  # Histidine NE2 nitrogen
```

Common patterns:
- `resid N` - residue number N
- `name X` - atom name X
- `resname ABC` - residue name ABC
- `and`, `or`, `not` - logical operators

### The midpoint() Function

For residues with two equivalent atoms (like Asp/Glu carboxylates), use
`midpoint()` to compute the geometric center:

```yaml
selection_a: "midpoint(resid 133 and name OD1 OD2)"
```

This computes:

$$
\mathbf{r}_{\text{mid}} = \frac{\mathbf{r}_{\text{OD1}} + \mathbf{r}_{\text{OD2}}}{2}
$$

**When to use midpoint():**
- Asp/Glu carboxylate groups (OD1/OD2 or OE1/OE2)
- Symmetric groups where either atom could participate

### The com() Function

For larger groups, use `com()` to compute the center of mass:

```yaml
selection_a: "com(resid 133)"  # Center of mass of entire Asp133
```

This weights atom positions by mass:

$$
\mathbf{r}_{\text{com}} = \frac{\sum_i m_i \mathbf{r}_i}{\sum_i m_i}
$$

**When to use com():**
- Tracking overall residue position
- Large binding site analysis
- When specific atoms aren't well-defined

## Multi-Replicate Analysis

### Why Replicates Matter

Single trajectories can be misleading due to:
- Rare conformational events
- Metastable states
- Stochastic variation

Multiple independent replicates provide:
- Reproducibility testing
- Robust uncertainty estimates
- Detection of outlier trajectories

### Recommended Practice

Run **at least 3 replicates** per condition. The aggregated analysis will:

1. Compute triad metrics for each replicate independently
2. Report mean and SEM across replicates
3. Show per-replicate values to assess variability

### Interpreting Replicate Variability

High replicate variability (large SEM) suggests:
- The system samples multiple conformational states
- Triad geometry is dynamic
- More replicates may be needed for statistical confidence

Low variability suggests:
- Consistent behavior across trajectories
- The observed state is representative

## Handling Incomplete Data

During active research, simulations are often in progress. You may want to analyze
available data without waiting for all replicates to complete. PolyzyMD handles
this gracefully by skipping missing or failed replicates with informative warnings.

### Missing Replicates

When a requested replicate's data cannot be found, PolyzyMD logs a warning and
continues with the remaining replicates:

```
WARNING: Skipping replicate 2: trajectory data not found.
Working directory not found: /scratch/user/project/run_2
Has replicate 2 been simulated?
```

### Failed Replicates

If a replicate exists but analysis fails (e.g., corrupted trajectory, missing
atoms), PolyzyMD logs the error and continues:

```
WARNING: Skipping replicate 3: analysis failed with error: 
Could not read DCD file: unexpected end of file
```

### Aggregation Summary

When aggregating with incomplete data, PolyzyMD reports which replicates
were successfully analyzed:

```
WARNING: Aggregating 2 of 3 requested replicates. Skipped: [2]
```

### Minimum Requirements

**At least 2 successful replicates are required for aggregation.** This is
because SEM calculation requires n ≥ 2. If fewer than 2 replicates succeed,
PolyzyMD raises an error:

```
ValueError: Aggregation requires at least 2 successful replicates, but only
1 succeeded. Failed replicates: [2, 3]
```

### Output File Naming

Output filenames reflect the actual replicates used, not the requested range.
This makes it clear which data contributed to the results:

| Requested | Successful | Filename |
|-----------|------------|----------|
| 1-3 | 1, 2, 3 | `triad_*_reps1-3_eq100ns.json` |
| 1-3 | 1, 3 | `triad_*_reps1_3_eq100ns.json` |
| 1-5 | 1, 2, 4 | `triad_*_reps1_2_4_eq100ns.json` |

Note: Contiguous ranges use a hyphen (`1-3`), non-contiguous use underscores (`1_3`).

### Best Practices for Incomplete Data

1. **Investigate missing data**: If replicates consistently fail, check:
   - Did the simulation complete? Check SLURM logs for timeouts or errors.
   - Are trajectory files in the expected location? Verify paths in config.yaml.
   - Is the trajectory corrupted? Try loading it manually with MDAnalysis.

2. **Document which replicates were used**: When publishing results from
   incomplete data, clearly state which replicates contributed to aggregated
   statistics. The JSON output includes a `replicates` field for this purpose.

3. **Re-run with complete data**: Once all simulations finish, re-run
   analysis with `--recompute` to include all replicates:
   ```bash
   polyzymd analyze triad -c comparison.yaml --recompute
   ```

4. **Consider statistical implications**: Results from 2 replicates have
   larger uncertainty than 3+. Be appropriately cautious when interpreting
   results with fewer replicates than planned.

### Example: Analyzing During Active Simulations

A common workflow when simulations are still running:

```bash
# Request all 5 planned replicates, but only 3 have completed
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns

# Output shows:
# Skipping replicate 4: trajectory data not found...
# Skipping replicate 5: trajectory data not found...
# Aggregating 3 of 5 requested replicates. Skipped: [4, 5]
#
# Results saved to: triad_LipA_reps1-3_eq100ns.json
```

Later, when all simulations complete:

```bash
# Re-run to include all replicates
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns --recompute

# Now all 5 are included:
# Results saved to: triad_LipA_reps1-5_eq100ns.json
```

## Comparing Across Conditions

### Statistical Considerations

When comparing triad metrics across conditions:

1. **Use replicate-based statistics**: Compare mean values across replicates,
   not frame-by-frame values

2. **Report effect sizes**: A large difference may be meaningful even if
   not statistically significant with few replicates

3. **Consider biological significance**: A 10% difference in contact fraction
   might be functionally important even if p > 0.05

### Example Comparison

From the LipA study:

```
Condition         Contact      vs No Polymer
-------------------------------------------------
No Polymer        49.9%        (control)
100% EGMA         50.6%        +0.7% (similar)
100% SBMA         45.3%        -4.6% (slightly lower)
75/25 SBMA/EGMA   6.9%         -43.0% (much lower)
```

The 75/25 mixture shows a dramatic 43 percentage point reduction in triad
contact compared to the control - a biologically meaningful difference
regardless of statistical significance.

## Common Pitfalls

### Pitfall 1: Wrong Residue Numbering

**Symptom**: Very high distances (> 10 A) or selection errors

**Cause**: PDB residue numbers don't match your selections

**Solution**: Check your topology file's residue numbering. Use VMD or
PyMOL to visualize and verify:

```bash
# In VMD console
atomselect top "resid 77 and name OG"
```

### Pitfall 2: Incorrect Atom Names

**Symptom**: "Selection returned 0 atoms" error

**Cause**: Atom names differ between force fields

**Solution**: Check your topology for exact atom names. Common variations:

| Residue | AMBER | CHARMM |
|---------|-------|--------|
| His (delta) | ND1 | ND1 |
| His (epsilon) | NE2 | NE2 |
| Ser | OG | OG |
| Asp | OD1, OD2 | OD1, OD2 |

### Pitfall 3: Threshold Too Strict

**Symptom**: Near-zero contact fraction for active enzyme

**Cause**: 3.5 A may be too strict for some systems

**Solution**: Try 4.0 A threshold and compare. If the enzyme is
experimentally active, some threshold should show reasonable contact.

### Pitfall 4: Insufficient Equilibration

**Symptom**: Contact fraction drifts over time; different equilibration
times give very different results

**Cause**: Including non-equilibrated frames

**Solution**: Use RMSD analysis to determine equilibration time. The
system should be equilibrated before triad analysis.

### Pitfall 5: Ignoring Per-Pair Diagnostics

**Symptom**: Low simultaneous contact but unclear why

**Cause**: Not examining which pair is disrupted

**Solution**: Always check per-pair distances. This reveals whether:
- Both pairs are moderately disrupted
- One pair is specifically broken
- The issue is distance vs. variability

## Python API

### Programmatic Analysis

```python
from polyzymd.config.loader import load_config
from polyzymd.compare.config import ComparisonConfig
from polyzymd.analysis.triad import CatalyticTriadAnalyzer

# Load configs
comp_config = ComparisonConfig.from_yaml("comparison.yaml")
sim_config = load_config(comp_config.conditions[0].config)

# Create analyzer
analyzer = CatalyticTriadAnalyzer(
    config=sim_config,
    triad_config=comp_config.catalytic_triad,
    equilibration="100ns",
)

# Single replicate
result = analyzer.compute(replicate=1)
print(f"Simultaneous contact: {result.simultaneous_contact_fraction * 100:.1f}%")

# Aggregated
agg_result = analyzer.compute_aggregated(replicates=[1, 2, 3])
print(f"Mean contact: {agg_result.overall_simultaneous_contact * 100:.1f} "
      f"+/- {agg_result.sem_simultaneous_contact * 100:.1f}%")
```

### Accessing Detailed Results

```python
# Per-pair statistics
for pair in result.pair_results:
    print(f"{pair.pair_label}:")
    print(f"  Mean: {pair.mean_distance:.2f} A")
    print(f"  Std:  {pair.std_distance:.2f} A")
    if pair.fraction_below_threshold:
        print(f"  Contact: {pair.fraction_below_threshold * 100:.1f}%")

# Autocorrelation info
if result.sim_contact_correlation_time:
    print(f"Correlation time: {result.sim_contact_correlation_time:.1f} "
          f"{result.sim_contact_correlation_time_unit}")
    print(f"N independent: {result.sim_contact_n_independent}")
```

### Loading Saved Results

```python
from polyzymd.analysis.results.triad import TriadResult, TriadAggregatedResult

# Load single replicate result
result = TriadResult.load("analysis/triad/run_1/triad_LipA_eq100ns.json")
print(result.summary())

# Load aggregated result
agg = TriadAggregatedResult.load("analysis/triad/aggregated/triad_LipA_reps1-3_eq100ns.json")
print(agg.summary())
```

## Plotting and Visualization

```{admonition} TODO: Plotting Documentation
:class: warning

Detailed documentation for plotting catalytic triad results is planned for a
future release. In the meantime, you can:

1. Load JSON results and create custom plots with matplotlib
2. Use VMD/PyMOL to visualize triad geometry along trajectories
3. Export distance timeseries for external analysis

See the [RMSF plotting documentation](analysis_rmsf_quickstart.md#comparing-two-conditions)
for general guidance on comparing analysis results across conditions.
```

## References

### Enzyme Mechanism

**Hedstrom L.** (2002) "Serine Protease Mechanism and Specificity."
*Chemical Reviews* 102:4501-4524.
https://doi.org/10.1021/cr000033x

Classic review of serine protease catalytic mechanism.

**Blow DM.** (1976) "Structure and Mechanism of Chymotrypsin."
*Accounts of Chemical Research* 9:145-152.

Foundational paper on catalytic triad structure.

### MD Analysis Best Practices

**Grossfield A, Patrone PN, Roe DR, Schultz AJ, Siderius DW, Zuckerman DM.**
(2018) "Best Practices for Quantification of Uncertainty and Sampling Quality
in Molecular Simulations." *Living Journal of Computational Molecular Science*
1(1):5067. https://doi.org/10.33011/livecoms.1.1.5067

Definitive guide to uncertainty quantification in MD - the basis for
PolyzyMD's autocorrelation analysis.

### H-Bond Geometry

**Jeffrey GA, Saenger W.** (1991) *Hydrogen Bonding in Biological Structures.*
Springer-Verlag.

Comprehensive reference for hydrogen bond geometry in biomolecules.

## See Also

- [Quick Start Guide](analysis_triad_quickstart.md) - Get results fast
- [RMSF Analysis](analysis_rmsf_quickstart.md) - Analyze flexibility
- [Comparing Conditions](analysis_compare_conditions.md) - Statistical comparisons
- [Statistical Best Practices](analysis_rmsf_best_practices.md) - Detailed statistics guide