Catalytic Triad Analysis: Statistical Best Practices
A comprehensive guide to analyzing catalytic triad geometry from MD simulations, including statistical considerations, interpretation guidelines, and worked examples from real enzyme-polymer systems.
Note
Just need quick results? See the Quick Start Guide for copy-paste commands and minimal setup.
Introduction
The catalytic triad is a fundamental structural motif in many enzyme families, including serine proteases, lipases, and esterases. Maintaining proper triad geometry is essential for catalytic activity. This guide explains:
What catalytic triads are and why they matter
How to measure triad integrity in MD simulations
The “simultaneous contact fraction” metric
Statistical considerations for robust analysis
How to interpret results in the context of enzyme function
What is a Catalytic Triad?
The Ser-His-Asp/Glu Mechanism
The classic catalytic triad consists of three amino acids working together:
Serine (Ser) - The nucleophile that attacks the substrate
Histidine (His) - The general base/acid that shuttles protons
Aspartate (Asp) or Glutamate (Glu) - Orients and activates histidine
These residues form a hydrogen bonding network:
O O
|| ||
Substrate-C -> Substrate-C-O-Ser
| |
... ...
Asp/Glu --- His --- Ser --- Substrate
| | |
C=O N-H O-H
| | |
O-H .... N [nucleophilic attack]
The proton relay works as follows:
Asp stabilizes the positively charged His through a hydrogen bond
His abstracts a proton from Ser’s hydroxyl group
The activated Ser oxygen attacks the substrate carbonyl
Critical Distances
For the triad to function, specific hydrogen bonds must be maintained:
Pair |
Donor |
Acceptor |
Typical Distance |
|---|---|---|---|
Asp-His |
His ND1 (N-H) |
Asp OD1/OD2 |
2.7 - 3.5 A |
His-Ser |
Ser OG (O-H) |
His NE2 |
2.7 - 3.5 A |
These distances are measured between heavy atoms (N, O). The actual H-bond length (H…acceptor) is ~1.0 A shorter.
Why Geometry Matters
If these distances increase beyond ~4 A:
Hydrogen bonds break
Proton relay cannot function
Catalytic activity is lost or severely reduced
MD simulations can reveal whether polymer conjugation, mutations, or other modifications disrupt triad geometry.
The Simultaneous Contact Metric
Definition
The simultaneous contact fraction is the percentage of frames where all distance pairs in the triad are below the contact threshold at the same time:
Where:
\(N\) = number of frames
\(M\) = number of pairs (typically 2 for a triad)
\(d_i(t)\) = distance of pair \(i\) at frame \(t\)
\(\theta\) = contact threshold (typically 3.5 A)
\(\mathbb{1}[\cdot]\) = indicator function (1 if true, 0 if false)
Why Simultaneous Matters
Consider two scenarios with 50% individual contact per pair:
Scenario A: Alternating
Frame: 1 2 3 4 5 6 7 8
Asp-His: + - + - + - + - (50% contact)
His-Ser: - + - + - + - + (50% contact)
Both: - - - - - - - - (0% simultaneous!)
Scenario B: Correlated
Frame: 1 2 3 4 5 6 7 8
Asp-His: + + + + - - - - (50% contact)
His-Ser: + + + + - - - - (50% contact)
Both: + + + + - - - - (50% simultaneous)
Scenario A has the same per-pair statistics but zero catalytic competence because the triad is never intact. The simultaneous contact fraction captures this critical distinction.
Interpretation Guidelines
Simultaneous Contact |
Interpretation |
|---|---|
> 80% |
Excellent triad integrity |
50 - 80% |
Good integrity, some flexibility |
20 - 50% |
Moderate disruption |
5 - 20% |
Significant disruption |
< 5% |
Severely disrupted triad |
Warning
These thresholds are guidelines, not absolute rules. The relationship between simultaneous contact and actual catalytic activity depends on the enzyme and should be validated against experimental data when possible.
H-Bond Distance Thresholds
Choosing a Threshold
The default threshold of 3.5 A is based on typical H-bond geometry:
Threshold |
Rationale |
|---|---|
3.0 A |
Strict - strong H-bonds only |
3.5 A |
Standard - typical H-bond cutoff |
4.0 A |
Relaxed - includes weaker interactions |
When to Adjust
Use a stricter threshold (3.0 A) when:
Comparing to crystal structures (often show shorter distances)
Studying high-activity conformations
The default gives near-100% contact (not discriminating)
Use a relaxed threshold (4.0 A) when:
Studying dynamic enzymes with flexible active sites
Default gives very low contact but enzyme is known to be active
Accounting for force field limitations
Heavy Atom vs Hydrogen Distances
PolyzyMD measures heavy atom distances (N-O, O-O), not hydrogen positions:
Measurement |
Typical Value |
|---|---|
Heavy atom (N…O) |
2.7 - 3.5 A |
H-bond (H…O) |
1.7 - 2.5 A |
This is more robust because hydrogen positions are often uncertain in MD (fast motion, force field limitations).
Autocorrelation Analysis
The Correlation Problem
Contact states in MD are temporally correlated. If the triad is in contact at frame 100, it’s very likely to be in contact at frame 101. This affects uncertainty estimation.
How PolyzyMD Handles This
PolyzyMD computes the autocorrelation function of the simultaneous contact timeseries and estimates:
Correlation time (tau) - characteristic decay time
N_independent - effective number of independent samples
Corrected SEM - uncertainty accounting for correlation
SEM for Proportions
For a binary contact timeseries (1 = contact, 0 = no contact), the standard error of the proportion is:
Where:
\(p\) = simultaneous contact fraction
\(N_{\text{ind}}\) = number of independent samples (after autocorrelation correction)
This is the formula for the standard error of a binomial proportion.
The Low Reliability Warning
When N_independent < 10, PolyzyMD warns about low statistical reliability.
This doesn’t mean your results are wrong, but uncertainties may be
underestimated.
Solutions:
Use multiple independent replicates (recommended)
Run longer simulations
Interpret results with appropriate caution
Worked Example: LipA Polymer Study
This example uses real data from simulations of Lipase A (LipA) from Bacillus subtilis with various polymer conjugations.
System Details
Enzyme: Lipase A (181 residues)
Catalytic triad: Ser77, His156, Asp133
Simulations: 200 ns production, 100 ns equilibration discarded
Conditions: 6 polymer compositions (3 replicates each)
comparison.yaml Configuration
name: "LipA_polymer_study"
description: "Effect of SBMA/EGMA polymer composition on LipA triad"
control: "No Polymer"
conditions:
- label: "No Polymer"
config: "../noPoly_LipA_DMSO/config.yaml"
replicates: [1, 2, 3]
- label: "100% SBMA"
config: "../SBMA_100_DMSO/config.yaml"
replicates: [1, 2]
- label: "75% SBMA / 25% EGMA"
config: "../SBMA_75_EGMA_25_DMSO/config.yaml"
replicates: [1, 2, 3]
- label: "50% SBMA / 50% EGMA"
config: "../SBMA_50_EGMA_50_DMSO/config.yaml"
replicates: [1, 2, 3]
- label: "25% SBMA / 75% EGMA"
config: "../SBMA_25_EGMA_75_DMSO/config.yaml"
replicates: [1, 2, 3]
- label: "100% EGMA"
config: "../EGMA_100_DMSO/config.yaml"
replicates: [1, 2, 3]
defaults:
equilibration_time: "100ns"
catalytic_triad:
name: "LipA_catalytic_triad"
description: "Ser-His-Asp catalytic triad of Lipase A (Bacillus subtilis)"
threshold: 3.5
pairs:
- label: "Asp133-His156"
selection_a: "midpoint(resid 133 and name OD1 OD2)"
selection_b: "resid 156 and name ND1"
- label: "His156-Ser77"
selection_a: "resid 156 and name NE2"
selection_b: "resid 77 and name OG"
Results Summary
Condition |
n |
Asp133-His156 |
His156-Ser77 |
Simultaneous Contact |
|---|---|---|---|---|
No Polymer |
3 |
3.09 +/- 0.21 A |
4.03 +/- 1.07 A |
49.9 +/- 27.3% |
100% SBMA |
2 |
3.03 +/- 0.00 A |
3.75 +/- 0.86 A |
45.3 +/- 45.3% |
75% SBMA / 25% EGMA |
3 |
3.67 +/- 0.51 A |
4.92 +/- 0.50 A |
6.9 +/- 6.8% |
50% SBMA / 50% EGMA |
3 |
8.20 +/- 4.30 A |
4.23 +/- 0.20 A |
7.9 +/- 7.9% |
25% SBMA / 75% EGMA |
3 |
5.25 +/- 0.56 A |
5.50 +/- 1.07 A |
14.0 +/- 14.0% |
100% EGMA |
3 |
3.29 +/- 0.30 A |
3.31 +/- 0.20 A |
50.6 +/- 17.0% |
Interpretation
Best triad preservation: 100% EGMA shows the highest simultaneous contact (50.6%) with the lowest variability (+/-17%), suggesting it provides consistent triad stabilization.
No Polymer baseline: 49.9% contact with high variability (+/-27.3%) indicates the unmodified enzyme has dynamic triad geometry.
Mixed ratios disrupt triad: 75% SBMA/25% EGMA, 50/50, and 25/75 all show severely reduced contact (6.9-14.0%), suggesting these mixed polymer compositions interfere with triad geometry.
Diagnostic per-pair analysis: The 50/50 condition shows very high Asp133-His156 distance (8.20 A), indicating the Asp-His hydrogen bond is specifically disrupted. The His-Ser distance remains closer to normal (4.23 A).
Conclusions
Pure EGMA coating preserves triad geometry better than pure SBMA or no polymer
Mixed polymer ratios severely disrupt the catalytic triad
The Asp-His hydrogen bond appears most sensitive to disruption
These results suggest that polymer composition significantly affects enzyme active site geometry, with implications for catalytic activity.
Selection Syntax Details
Standard MDAnalysis Selections
Most selections use standard MDAnalysis syntax:
selection_a: "resid 77 and name OG" # Serine OG oxygen
selection_b: "resid 156 and name NE2" # Histidine NE2 nitrogen
Common patterns:
resid N- residue number Nname X- atom name Xresname ABC- residue name ABCand,or,not- logical operators
The midpoint() Function
For residues with two equivalent atoms (like Asp/Glu carboxylates), use
midpoint() to compute the geometric center:
selection_a: "midpoint(resid 133 and name OD1 OD2)"
This computes:
When to use midpoint():
Asp/Glu carboxylate groups (OD1/OD2 or OE1/OE2)
Symmetric groups where either atom could participate
The com() Function
For larger groups, use com() to compute the center of mass:
selection_a: "com(resid 133)" # Center of mass of entire Asp133
This weights atom positions by mass:
When to use com():
Tracking overall residue position
Large binding site analysis
When specific atoms aren’t well-defined
Multi-Replicate Analysis
Why Replicates Matter
Single trajectories can be misleading due to:
Rare conformational events
Metastable states
Stochastic variation
Multiple independent replicates provide:
Reproducibility testing
Robust uncertainty estimates
Detection of outlier trajectories
Recommended Practice
Run at least 3 replicates per condition. The aggregated analysis will:
Compute triad metrics for each replicate independently
Report mean and SEM across replicates
Show per-replicate values to assess variability
Interpreting Replicate Variability
High replicate variability (large SEM) suggests:
The system samples multiple conformational states
Triad geometry is dynamic
More replicates may be needed for statistical confidence
Low variability suggests:
Consistent behavior across trajectories
The observed state is representative
Handling Incomplete Data
During active research, simulations are often in progress. You may want to analyze available data without waiting for all replicates to complete. PolyzyMD handles this gracefully by skipping missing or failed replicates with informative warnings.
Missing Replicates
When a requested replicate’s data cannot be found, PolyzyMD logs a warning and continues with the remaining replicates:
WARNING: Skipping replicate 2: trajectory data not found.
Working directory not found: /scratch/user/project/run_2
Has replicate 2 been simulated?
Failed Replicates
If a replicate exists but analysis fails (e.g., corrupted trajectory, missing atoms), PolyzyMD logs the error and continues:
WARNING: Skipping replicate 3: analysis failed with error:
Could not read DCD file: unexpected end of file
Aggregation Summary
When aggregating with incomplete data, PolyzyMD reports which replicates were successfully analyzed:
WARNING: Aggregating 2 of 3 requested replicates. Skipped: [2]
Minimum Requirements
At least 2 successful replicates are required for aggregation. This is because SEM calculation requires n ≥ 2. If fewer than 2 replicates succeed, PolyzyMD raises an error:
ValueError: Aggregation requires at least 2 successful replicates, but only
1 succeeded. Failed replicates: [2, 3]
Output File Naming
Output filenames reflect the actual replicates used, not the requested range. This makes it clear which data contributed to the results:
Requested |
Successful |
Filename |
|---|---|---|
1-3 |
1, 2, 3 |
|
1-3 |
1, 3 |
|
1-5 |
1, 2, 4 |
|
Note: Contiguous ranges use a hyphen (1-3), non-contiguous use underscores (1_3).
Best Practices for Incomplete Data
Investigate missing data: If replicates consistently fail, check:
Did the simulation complete? Check SLURM logs for timeouts or errors.
Are trajectory files in the expected location? Verify paths in config.yaml.
Is the trajectory corrupted? Try loading it manually with MDAnalysis.
Document which replicates were used: When publishing results from incomplete data, clearly state which replicates contributed to aggregated statistics. The JSON output includes a
replicatesfield for this purpose.Re-run with complete data: Once all simulations finish, re-run analysis with
--recomputeto include all replicates:polyzymd analyze triad -c comparison.yaml --recompute
Consider statistical implications: Results from 2 replicates have larger uncertainty than 3+. Be appropriately cautious when interpreting results with fewer replicates than planned.
Example: Analyzing During Active Simulations
A common workflow when simulations are still running:
# Request all 5 planned replicates, but only 3 have completed
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns
# Output shows:
# Skipping replicate 4: trajectory data not found...
# Skipping replicate 5: trajectory data not found...
# Aggregating 3 of 5 requested replicates. Skipped: [4, 5]
#
# Results saved to: triad_LipA_reps1-3_eq100ns.json
Later, when all simulations complete:
# Re-run to include all replicates
polyzymd analyze triad -c comparison.yaml -r 1-5 --eq-time 100ns --recompute
# Now all 5 are included:
# Results saved to: triad_LipA_reps1-5_eq100ns.json
Comparing Across Conditions
Statistical Considerations
When comparing triad metrics across conditions:
Use replicate-based statistics: Compare mean values across replicates, not frame-by-frame values
Report effect sizes: A large difference may be meaningful even if not statistically significant with few replicates
Consider biological significance: A 10% difference in contact fraction might be functionally important even if p > 0.05
Example Comparison
From the LipA study:
Condition Contact vs No Polymer
-------------------------------------------------
No Polymer 49.9% (control)
100% EGMA 50.6% +0.7% (similar)
100% SBMA 45.3% -4.6% (slightly lower)
75/25 SBMA/EGMA 6.9% -43.0% (much lower)
The 75/25 mixture shows a dramatic 43 percentage point reduction in triad contact compared to the control - a biologically meaningful difference regardless of statistical significance.
Common Pitfalls
Pitfall 1: Wrong Residue Numbering
Symptom: Very high distances (> 10 A) or selection errors
Cause: PDB residue numbers don’t match your selections
Solution: Check your topology file’s residue numbering. Use VMD or PyMOL to visualize and verify:
# In VMD console
atomselect top "resid 77 and name OG"
Pitfall 2: Incorrect Atom Names
Symptom: “Selection returned 0 atoms” error
Cause: Atom names differ between force fields
Solution: Check your topology for exact atom names. Common variations:
Residue |
AMBER |
CHARMM |
|---|---|---|
His (delta) |
ND1 |
ND1 |
His (epsilon) |
NE2 |
NE2 |
Ser |
OG |
OG |
Asp |
OD1, OD2 |
OD1, OD2 |
Pitfall 3: Threshold Too Strict
Symptom: Near-zero contact fraction for active enzyme
Cause: 3.5 A may be too strict for some systems
Solution: Try 4.0 A threshold and compare. If the enzyme is experimentally active, some threshold should show reasonable contact.
Pitfall 4: Insufficient Equilibration
Symptom: Contact fraction drifts over time; different equilibration times give very different results
Cause: Including non-equilibrated frames
Solution: Use RMSD analysis to determine equilibration time. The system should be equilibrated before triad analysis.
Pitfall 5: Ignoring Per-Pair Diagnostics
Symptom: Low simultaneous contact but unclear why
Cause: Not examining which pair is disrupted
Solution: Always check per-pair distances. This reveals whether:
Both pairs are moderately disrupted
One pair is specifically broken
The issue is distance vs. variability
Python API
Programmatic Analysis
from polyzymd.config.loader import load_config
from polyzymd.compare.config import ComparisonConfig
from polyzymd.analysis.triad import CatalyticTriadAnalyzer
# Load configs
comp_config = ComparisonConfig.from_yaml("comparison.yaml")
sim_config = load_config(comp_config.conditions[0].config)
# Create analyzer
analyzer = CatalyticTriadAnalyzer(
config=sim_config,
triad_config=comp_config.catalytic_triad,
equilibration="100ns",
)
# Single replicate
result = analyzer.compute(replicate=1)
print(f"Simultaneous contact: {result.simultaneous_contact_fraction * 100:.1f}%")
# Aggregated
agg_result = analyzer.compute_aggregated(replicates=[1, 2, 3])
print(f"Mean contact: {agg_result.overall_simultaneous_contact * 100:.1f} "
f"+/- {agg_result.sem_simultaneous_contact * 100:.1f}%")
Accessing Detailed Results
# Per-pair statistics
for pair in result.pair_results:
print(f"{pair.pair_label}:")
print(f" Mean: {pair.mean_distance:.2f} A")
print(f" Std: {pair.std_distance:.2f} A")
if pair.fraction_below_threshold:
print(f" Contact: {pair.fraction_below_threshold * 100:.1f}%")
# Autocorrelation info
if result.sim_contact_correlation_time:
print(f"Correlation time: {result.sim_contact_correlation_time:.1f} "
f"{result.sim_contact_correlation_time_unit}")
print(f"N independent: {result.sim_contact_n_independent}")
Loading Saved Results
from polyzymd.analysis.results.triad import TriadResult, TriadAggregatedResult
# Load single replicate result
result = TriadResult.load("analysis/triad/run_1/triad_LipA_eq100ns.json")
print(result.summary())
# Load aggregated result
agg = TriadAggregatedResult.load("analysis/triad/aggregated/triad_LipA_reps1-3_eq100ns.json")
print(agg.summary())
Plotting and Visualization
TODO: Plotting Documentation
Detailed documentation for plotting catalytic triad results is planned for a future release. In the meantime, you can:
Load JSON results and create custom plots with matplotlib
Use VMD/PyMOL to visualize triad geometry along trajectories
Export distance timeseries for external analysis
See the RMSF plotting documentation for general guidance on comparing analysis results across conditions.
References
Enzyme Mechanism
Hedstrom L. (2002) “Serine Protease Mechanism and Specificity.” Chemical Reviews 102:4501-4524. https://doi.org/10.1021/cr000033x
Classic review of serine protease catalytic mechanism.
Blow DM. (1976) “Structure and Mechanism of Chymotrypsin.” Accounts of Chemical Research 9:145-152.
Foundational paper on catalytic triad structure.
MD Analysis Best Practices
Grossfield A, Patrone PN, Roe DR, Schultz AJ, Siderius DW, Zuckerman DM. (2018) “Best Practices for Quantification of Uncertainty and Sampling Quality in Molecular Simulations.” Living Journal of Computational Molecular Science 1(1):5067. https://doi.org/10.33011/livecoms.1.1.5067
Definitive guide to uncertainty quantification in MD - the basis for PolyzyMD’s autocorrelation analysis.
H-Bond Geometry
Jeffrey GA, Saenger W. (1991) Hydrogen Bonding in Biological Structures. Springer-Verlag.
Comprehensive reference for hydrogen bond geometry in biomolecules.
See Also
Quick Start Guide - Get results fast
RMSF Analysis - Analyze flexibility
Comparing Conditions - Statistical comparisons
Statistical Best Practices - Detailed statistics guide