Catalytic Triad Analysis: Statistical Best Practices

A comprehensive guide to analyzing catalytic triad geometry from MD simulations, including statistical considerations, interpretation guidelines, and worked examples from real enzyme-polymer systems.

Note

Just need quick results? See the Quick Start Guide for copy-paste commands and minimal setup.

Introduction

The catalytic triad is a fundamental structural motif in many enzyme families, including serine proteases, lipases, and esterases. Maintaining proper triad geometry is essential for catalytic activity. This guide explains:

  • What catalytic triads are and why they matter

  • How to measure triad integrity in MD simulations

  • The “simultaneous contact fraction” metric

  • Statistical considerations for robust analysis

  • How to interpret results in the context of enzyme function

What is a Catalytic Triad?

The Ser-His-Asp/Glu Mechanism

The classic catalytic triad consists of three amino acids working together:

  1. Serine (Ser) - The nucleophile that attacks the substrate

  2. Histidine (His) - The general base/acid that shuttles protons

  3. Aspartate (Asp) or Glutamate (Glu) - Orients and activates histidine

These residues form a hydrogen bonding network:

        O                         O
        ||                        ||
Substrate-C     ->    Substrate-C-O-Ser
        |                         |
       ...                       ...

    Asp/Glu --- His --- Ser --- Substrate
       |         |       |
      C=O      N-H     O-H
       |         |       |
      O-H .... N      [nucleophilic attack]

The proton relay works as follows:

  1. Asp stabilizes the positively charged His through a hydrogen bond

  2. His abstracts a proton from Ser’s hydroxyl group

  3. The activated Ser oxygen attacks the substrate carbonyl

Critical Distances

For the triad to function, specific hydrogen bonds must be maintained:

Pair

Donor

Acceptor

Typical Distance

Asp-His

His ND1 (N-H)

Asp OD1/OD2

2.7 - 3.5 A

His-Ser

Ser OG (O-H)

His NE2

2.7 - 3.5 A

These distances are measured between heavy atoms (N, O). The actual H-bond length (H…acceptor) is ~1.0 A shorter.

Why Geometry Matters

If these distances increase beyond ~4 A:

  • Hydrogen bonds break

  • Proton relay cannot function

  • Catalytic activity is lost or severely reduced

MD simulations can reveal whether polymer conjugation, mutations, or other modifications disrupt triad geometry.

The Simultaneous Contact Metric

Definition

The simultaneous contact fraction is the percentage of frames where all distance pairs in the triad are below the contact threshold at the same time:

\[ f_{\text{contact}} = \frac{1}{N} \sum_{t=1}^{N} \prod_{i=1}^{M} \mathbb{1}[d_i(t) < \theta] \]

Where:

  • \(N\) = number of frames

  • \(M\) = number of pairs (typically 2 for a triad)

  • \(d_i(t)\) = distance of pair \(i\) at frame \(t\)

  • \(\theta\) = contact threshold (typically 3.5 A)

  • \(\mathbb{1}[\cdot]\) = indicator function (1 if true, 0 if false)

Why Simultaneous Matters

Consider two scenarios with 50% individual contact per pair:

Scenario A: Alternating

Frame:    1  2  3  4  5  6  7  8
Asp-His:  +  -  +  -  +  -  +  -   (50% contact)
His-Ser:  -  +  -  +  -  +  -  +   (50% contact)
Both:     -  -  -  -  -  -  -  -   (0% simultaneous!)

Scenario B: Correlated

Frame:    1  2  3  4  5  6  7  8
Asp-His:  +  +  +  +  -  -  -  -   (50% contact)
His-Ser:  +  +  +  +  -  -  -  -   (50% contact)
Both:     +  +  +  +  -  -  -  -   (50% simultaneous)

Scenario A has the same per-pair statistics but zero catalytic competence because the triad is never intact. The simultaneous contact fraction captures this critical distinction.

Interpretation Guidelines

Simultaneous Contact

Interpretation

> 80%

Excellent triad integrity

50 - 80%

Good integrity, some flexibility

20 - 50%

Moderate disruption

5 - 20%

Significant disruption

< 5%

Severely disrupted triad

Warning

These thresholds are guidelines, not absolute rules. The relationship between simultaneous contact and actual catalytic activity depends on the enzyme and should be validated against experimental data when possible.

H-Bond Distance Thresholds

Choosing a Threshold

The default threshold of 3.5 A is based on typical H-bond geometry:

Threshold

Rationale

3.0 A

Strict - strong H-bonds only

3.5 A

Standard - typical H-bond cutoff

4.0 A

Relaxed - includes weaker interactions

When to Adjust

Use a stricter threshold (3.0 A) when:

  • Comparing to crystal structures (often show shorter distances)

  • Studying high-activity conformations

  • The default gives near-100% contact (not discriminating)

Use a relaxed threshold (4.0 A) when:

  • Studying dynamic enzymes with flexible active sites

  • Default gives very low contact but enzyme is known to be active

  • Accounting for force field limitations

Heavy Atom vs Hydrogen Distances

PolyzyMD measures heavy atom distances (N-O, O-O), not hydrogen positions:

Measurement

Typical Value

Heavy atom (N…O)

2.7 - 3.5 A

H-bond (H…O)

1.7 - 2.5 A

This is more robust because hydrogen positions are often uncertain in MD (fast motion, force field limitations).

Autocorrelation Analysis

The Correlation Problem

Contact states in MD are temporally correlated. If the triad is in contact at frame 100, it’s very likely to be in contact at frame 101. This affects uncertainty estimation.

How PolyzyMD Handles This

PolyzyMD computes the autocorrelation function of the simultaneous contact timeseries and estimates:

  1. Correlation time (tau) - characteristic decay time

  2. N_independent - effective number of independent samples

  3. Corrected SEM - uncertainty accounting for correlation

SEM for Proportions

For a binary contact timeseries (1 = contact, 0 = no contact), the standard error of the proportion is:

\[ \text{SEM} = \sqrt{\frac{p(1-p)}{N_{\text{ind}}}} \]

Where:

  • \(p\) = simultaneous contact fraction

  • \(N_{\text{ind}}\) = number of independent samples (after autocorrelation correction)

This is the formula for the standard error of a binomial proportion.

The Low Reliability Warning

When N_independent < 10, PolyzyMD warns about low statistical reliability. This doesn’t mean your results are wrong, but uncertainties may be underestimated.

Solutions:

  1. Use multiple independent replicates (recommended)

  2. Run longer simulations

  3. Interpret results with appropriate caution

Worked Example: LipA Polymer Study

This example uses real data from simulations of Lipase A (LipA) from Bacillus subtilis with various polymer conjugations.

System Details

  • Enzyme: Lipase A (181 residues)

  • Catalytic triad: Ser77, His156, Asp133

  • Simulations: 200 ns production, 100 ns equilibration discarded

  • Conditions: 6 polymer compositions (3 replicates each)

comparison.yaml Configuration

name: "LipA_polymer_study"
description: "Effect of SBMA/EGMA polymer composition on LipA triad"
control: "No Polymer"

conditions:
  - label: "No Polymer"
    config: "../noPoly_LipA_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "100% SBMA"
    config: "../SBMA_100_DMSO/config.yaml"
    replicates: [1, 2]

  - label: "75% SBMA / 25% EGMA"
    config: "../SBMA_75_EGMA_25_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "50% SBMA / 50% EGMA"
    config: "../SBMA_50_EGMA_50_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "25% SBMA / 75% EGMA"
    config: "../SBMA_25_EGMA_75_DMSO/config.yaml"
    replicates: [1, 2, 3]

  - label: "100% EGMA"
    config: "../EGMA_100_DMSO/config.yaml"
    replicates: [1, 2, 3]

defaults:
  equilibration_time: "100ns"

catalytic_triad:
  name: "LipA_catalytic_triad"
  description: "Ser-His-Asp catalytic triad of Lipase A (Bacillus subtilis)"
  threshold: 3.5
  pairs:
    - label: "Asp133-His156"
      selection_a: "midpoint(resid 133 and name OD1 OD2)"
      selection_b: "resid 156 and name ND1"
    - label: "His156-Ser77"
      selection_a: "resid 156 and name NE2"
      selection_b: "resid 77 and name OG"

Results Summary

Condition

n

Asp133-His156

His156-Ser77

Simultaneous Contact

No Polymer

3

3.09 +/- 0.21 A

4.03 +/- 1.07 A

49.9 +/- 27.3%

100% SBMA

2

3.03 +/- 0.00 A

3.75 +/- 0.86 A

45.3 +/- 45.3%

75% SBMA / 25% EGMA

3

3.67 +/- 0.51 A

4.92 +/- 0.50 A

6.9 +/- 6.8%

50% SBMA / 50% EGMA

3

8.20 +/- 4.30 A

4.23 +/- 0.20 A

7.9 +/- 7.9%

25% SBMA / 75% EGMA

3

5.25 +/- 0.56 A

5.50 +/- 1.07 A

14.0 +/- 14.0%

100% EGMA

3

3.29 +/- 0.30 A

3.31 +/- 0.20 A

50.6 +/- 17.0%

Interpretation

  1. Best triad preservation: 100% EGMA shows the highest simultaneous contact (50.6%) with the lowest variability (+/-17%), suggesting it provides consistent triad stabilization.

  2. No Polymer baseline: 49.9% contact with high variability (+/-27.3%) indicates the unmodified enzyme has dynamic triad geometry.

  3. Mixed ratios disrupt triad: 75% SBMA/25% EGMA, 50/50, and 25/75 all show severely reduced contact (6.9-14.0%), suggesting these mixed polymer compositions interfere with triad geometry.

  4. Diagnostic per-pair analysis: The 50/50 condition shows very high Asp133-His156 distance (8.20 A), indicating the Asp-His hydrogen bond is specifically disrupted. The His-Ser distance remains closer to normal (4.23 A).

Conclusions

  • Pure EGMA coating preserves triad geometry better than pure SBMA or no polymer

  • Mixed polymer ratios severely disrupt the catalytic triad

  • The Asp-His hydrogen bond appears most sensitive to disruption

These results suggest that polymer composition significantly affects enzyme active site geometry, with implications for catalytic activity.

Selection Syntax Details

Standard MDAnalysis Selections

Most selections use standard MDAnalysis syntax:

selection_a: "resid 77 and name OG"    # Serine OG oxygen
selection_b: "resid 156 and name NE2"  # Histidine NE2 nitrogen

Common patterns:

  • resid N - residue number N

  • name X - atom name X

  • resname ABC - residue name ABC

  • and, or, not - logical operators

The midpoint() Function

For residues with two equivalent atoms (like Asp/Glu carboxylates), use midpoint() to compute the geometric center:

selection_a: "midpoint(resid 133 and name OD1 OD2)"

This computes:

\[ \mathbf{r}_{\text{mid}} = \frac{\mathbf{r}_{\text{OD1}} + \mathbf{r}_{\text{OD2}}}{2} \]

When to use midpoint():

  • Asp/Glu carboxylate groups (OD1/OD2 or OE1/OE2)

  • Symmetric groups where either atom could participate

The com() Function

For larger groups, use com() to compute the center of mass:

selection_a: "com(resid 133)"  # Center of mass of entire Asp133

This weights atom positions by mass:

\[ \mathbf{r}_{\text{com}} = \frac{\sum_i m_i \mathbf{r}_i}{\sum_i m_i} \]

When to use com():

  • Tracking overall residue position

  • Large binding site analysis

  • When specific atoms aren’t well-defined

Multi-Replicate Analysis

Why Replicates Matter

Single trajectories can be misleading due to:

  • Rare conformational events

  • Metastable states

  • Stochastic variation

Multiple independent replicates provide:

  • Reproducibility testing

  • Robust uncertainty estimates

  • Detection of outlier trajectories

Interpreting Replicate Variability

High replicate variability (large SEM) suggests:

  • The system samples multiple conformational states

  • Triad geometry is dynamic

  • More replicates may be needed for statistical confidence

Low variability suggests:

  • Consistent behavior across trajectories

  • The observed state is representative

Handling Incomplete Data

During active research, simulations are often in progress. You may want to analyze available data without waiting for all replicates to complete. PolyzyMD handles this gracefully by skipping missing or failed replicates with informative warnings.

Missing Replicates

When a requested replicate’s data cannot be found, PolyzyMD logs a warning and continues with the remaining replicates:

WARNING: Skipping replicate 2: trajectory data not found.
Working directory not found: /scratch/user/project/run_2
Has replicate 2 been simulated?

Failed Replicates

If a replicate exists but analysis fails (e.g., corrupted trajectory, missing atoms), PolyzyMD logs the error and continues:

WARNING: Skipping replicate 3: analysis failed with error: 
Could not read DCD file: unexpected end of file

Aggregation Summary

When aggregating with incomplete data, PolyzyMD reports which replicates were successfully analyzed:

WARNING: Aggregating 2 of 3 requested replicates. Skipped: [2]

Minimum Requirements

At least 2 successful replicates are required for aggregation. This is because SEM calculation requires n ≥ 2. If fewer than 2 replicates succeed, PolyzyMD raises an error:

ValueError: Aggregation requires at least 2 successful replicates, but only
1 succeeded. Failed replicates: [2, 3]

Output File Naming

Output filenames reflect the actual replicates used, not the requested range. This makes it clear which data contributed to the results:

Requested

Successful

Filename

1-3

1, 2, 3

triad_*_reps1-3_eq100ns.json

1-3

1, 3

triad_*_reps1_3_eq100ns.json

1-5

1, 2, 4

triad_*_reps1_2_4_eq100ns.json

Note: Contiguous ranges use a hyphen (1-3), non-contiguous use underscores (1_3).

Best Practices for Incomplete Data

  1. Investigate missing data: If replicates consistently fail, check:

    • Did the simulation complete? Check SLURM logs for timeouts or errors.

    • Are trajectory files in the expected location? Verify paths in config.yaml.

    • Is the trajectory corrupted? Try loading it manually with MDAnalysis.

  2. Document which replicates were used: When publishing results from incomplete data, clearly state which replicates contributed to aggregated statistics. The JSON output includes a replicates field for this purpose.

  3. Re-run with complete data: Once all simulations finish, re-run analysis with --recompute to include all replicates:

    polyzymd analyze triad -c comparison.yaml --recompute
    
  4. Consider statistical implications: Results from 2 replicates have larger uncertainty than 3+. Be appropriately cautious when interpreting results with fewer replicates than planned.

Example: Analyzing During Active Simulations

A common workflow when simulations are still running:

# Request all 5 planned replicates, but only 3 have completed
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns

# Output shows:
# Skipping replicate 4: trajectory data not found...
# Skipping replicate 5: trajectory data not found...
# Aggregating 3 of 5 requested replicates. Skipped: [4, 5]
#
# Results saved to: triad_LipA_reps1-3_eq100ns.json

Later, when all simulations complete:

# Re-run to include all replicates
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns --recompute

# Now all 5 are included:
# Results saved to: triad_LipA_reps1-5_eq100ns.json

Comparing Across Conditions

Statistical Considerations

When comparing triad metrics across conditions:

  1. Use replicate-based statistics: Compare mean values across replicates, not frame-by-frame values

  2. Report effect sizes: A large difference may be meaningful even if not statistically significant with few replicates

  3. Consider biological significance: A 10% difference in contact fraction might be functionally important even if p > 0.05

Example Comparison

From the LipA study:

Condition         Contact      vs No Polymer
-------------------------------------------------
No Polymer        49.9%        (control)
100% EGMA         50.6%        +0.7% (similar)
100% SBMA         45.3%        -4.6% (slightly lower)
75/25 SBMA/EGMA   6.9%         -43.0% (much lower)

The 75/25 mixture shows a dramatic 43 percentage point reduction in triad contact compared to the control - a biologically meaningful difference regardless of statistical significance.

Common Pitfalls

Pitfall 1: Wrong Residue Numbering

Symptom: Very high distances (> 10 A) or selection errors

Cause: PDB residue numbers don’t match your selections

Solution: Check your topology file’s residue numbering. Use VMD or PyMOL to visualize and verify:

# In VMD console
atomselect top "resid 77 and name OG"

Pitfall 2: Incorrect Atom Names

Symptom: “Selection returned 0 atoms” error

Cause: Atom names differ between force fields

Solution: Check your topology for exact atom names. Common variations:

Residue

AMBER

CHARMM

His (delta)

ND1

ND1

His (epsilon)

NE2

NE2

Ser

OG

OG

Asp

OD1, OD2

OD1, OD2

Pitfall 3: Threshold Too Strict

Symptom: Near-zero contact fraction for active enzyme

Cause: 3.5 A may be too strict for some systems

Solution: Try 4.0 A threshold and compare. If the enzyme is experimentally active, some threshold should show reasonable contact.

Pitfall 4: Insufficient Equilibration

Symptom: Contact fraction drifts over time; different equilibration times give very different results

Cause: Including non-equilibrated frames

Solution: Use RMSD analysis to determine equilibration time. The system should be equilibrated before triad analysis.

Pitfall 5: Ignoring Per-Pair Diagnostics

Symptom: Low simultaneous contact but unclear why

Cause: Not examining which pair is disrupted

Solution: Always check per-pair distances. This reveals whether:

  • Both pairs are moderately disrupted

  • One pair is specifically broken

  • The issue is distance vs. variability

Python API

Programmatic Analysis

from polyzymd.config.loader import load_config
from polyzymd.compare.config import ComparisonConfig
from polyzymd.analysis.triad import CatalyticTriadAnalyzer

# Load configs
comp_config = ComparisonConfig.from_yaml("comparison.yaml")
sim_config = load_config(comp_config.conditions[0].config)

# Create analyzer
analyzer = CatalyticTriadAnalyzer(
    config=sim_config,
    triad_config=comp_config.catalytic_triad,
    equilibration="100ns",
)

# Single replicate
result = analyzer.compute(replicate=1)
print(f"Simultaneous contact: {result.simultaneous_contact_fraction * 100:.1f}%")

# Aggregated
agg_result = analyzer.compute_aggregated(replicates=[1, 2, 3])
print(f"Mean contact: {agg_result.overall_simultaneous_contact * 100:.1f} "
      f"+/- {agg_result.sem_simultaneous_contact * 100:.1f}%")

Accessing Detailed Results

# Per-pair statistics
for pair in result.pair_results:
    print(f"{pair.pair_label}:")
    print(f"  Mean: {pair.mean_distance:.2f} A")
    print(f"  Std:  {pair.std_distance:.2f} A")
    if pair.fraction_below_threshold:
        print(f"  Contact: {pair.fraction_below_threshold * 100:.1f}%")

# Autocorrelation info
if result.sim_contact_correlation_time:
    print(f"Correlation time: {result.sim_contact_correlation_time:.1f} "
          f"{result.sim_contact_correlation_time_unit}")
    print(f"N independent: {result.sim_contact_n_independent}")

Loading Saved Results

from polyzymd.analysis.results.triad import TriadResult, TriadAggregatedResult

# Load single replicate result
result = TriadResult.load("analysis/triad/run_1/triad_LipA_eq100ns.json")
print(result.summary())

# Load aggregated result
agg = TriadAggregatedResult.load("analysis/triad/aggregated/triad_LipA_reps1-3_eq100ns.json")
print(agg.summary())

Plotting and Visualization

TODO: Plotting Documentation

Detailed documentation for plotting catalytic triad results is planned for a future release. In the meantime, you can:

  1. Load JSON results and create custom plots with matplotlib

  2. Use VMD/PyMOL to visualize triad geometry along trajectories

  3. Export distance timeseries for external analysis

See the RMSF plotting documentation for general guidance on comparing analysis results across conditions.

References

Enzyme Mechanism

Hedstrom L. (2002) “Serine Protease Mechanism and Specificity.” Chemical Reviews 102:4501-4524. https://doi.org/10.1021/cr000033x

Classic review of serine protease catalytic mechanism.

Blow DM. (1976) “Structure and Mechanism of Chymotrypsin.” Accounts of Chemical Research 9:145-152.

Foundational paper on catalytic triad structure.

MD Analysis Best Practices

Grossfield A, Patrone PN, Roe DR, Schultz AJ, Siderius DW, Zuckerman DM. (2018) “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science 1(1):5067. https://doi.org/10.33011/livecoms.1.1.5067

Definitive guide to uncertainty quantification in MD - the basis for PolyzyMD’s autocorrelation analysis.

H-Bond Geometry

Jeffrey GA, Saenger W. (1991) Hydrogen Bonding in Biological Structures. Springer-Verlag.

Comprehensive reference for hydrogen bond geometry in biomolecules.

See Also