Amino acid code
https://www.genscript.com/Amino_Acid_Code.html
Kullback–Leibler (KL) divergence
https://www.youtube.com/watch?v=q0AkK8aYbLY
https://www.youtube.com/watch?v=SxGYPqCgJWM
https://www.youtube.com/watch?v=jmXDQUxgzsU
Copilot:
What D-motifs Are
D-motifs are amino acid pairs (or triplets) whose observed frequency deviates significantly from what you'd expect if amino acids were distributed independently of one another. The deviation is quantified by a log-ratio of observed to expected probability.
Where:
\( p(b^{(n)}) \) is the observed probability of a motif.
\( q(b^{(n)}) \) is the expected probability assuming independence (from Eq. 2.8).
\( k(p, q, b^{(n)}) \) is the log-ratio that measures how much more or less frequent the motif is than expected.
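A minimal sketch of the log-ratio for amino acid pairs, under stated assumptions: k is taken here as the plain natural log of p/q, motifs are adjacent pairs, and q is the product of the single-residue frequencies. Eq. 2.8 is not reproduced in these notes, so the paper's exact definitions may differ (for example in weighting or in how pairs are counted). The function name `pair_log_ratios` and the toy sequences are made up for illustration.

```python
from collections import Counter
from math import log


def pair_log_ratios(sequences):
    """Return {pair: log(p / q)} for adjacent amino acid pairs.

    p is the observed pair frequency; q is the frequency expected if the
    two residues were independent (an assumption standing in for Eq. 2.8).
    """
    single = Counter()
    pairs = Counter()
    for seq in sequences:
        single.update(seq)
        pairs.update(seq[i:i + 2] for i in range(len(seq) - 1))

    n_single = sum(single.values())
    n_pairs = sum(pairs.values())

    ratios = {}
    for pair, count in pairs.items():
        p = count / n_pairs
        q = (single[pair[0]] / n_single) * (single[pair[1]] / n_single)
        ratios[pair] = log(p / q)
    return ratios


# Toy input, not real protein data.
for pair, k in sorted(pair_log_ratios(["MACCKCM", "ACCAAC"]).items()):
    print(pair, round(k, 3))
```

A positive value means the pair is seen more often than independence predicts, a negative value less often. Pairs that never occur are simply absent from the result, since their log-ratio would be undefined.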
The (D, σ)-Criterion
To decide whether a motif is statistically significant, they compare \( k(p, q, b^{(n)}) \) to the mean \( \mu \) and standard deviation \( \sigma \) of \( k \) over all such motifs:
D-positive: \( k(p, q, b^{(n)}) > \mu + D\sigma \)
→ Motif occurs more often than expected. Possibly due to mutation or functional selection.
D-negative: \( k(p, q, b^{(n)}) < \mu - D\sigma \)
→ Motif occurs less often than expected. Possibly suppressed or structurally unfavorable.
They use D = 2, a common statistical threshold: for a normal distribution, about 95% of values lie within two standard deviations of the mean.
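The criterion can be sketched in a few lines, assuming it flags a motif when its k value lies more than D standard deviations above or below the mean of all k values. The function name `classify_motifs`, the "neutral" label, and the sample values are hypothetical; the population standard deviation is used here, which may not match the paper's choice.

```python
from statistics import mean, pstdev


def classify_motifs(k_values, d=2.0):
    """Label each motif D-positive, D-negative, or neutral.

    k_values maps motif -> k; d is the threshold D (2 in the notes above).
    """
    values = list(k_values.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)

    labels = {}
    for motif, k in k_values.items():
        if k > mu + d * sigma:
            labels[motif] = "D-positive"
        elif k < mu - d * sigma:
            labels[motif] = "D-negative"
        else:
            labels[motif] = "neutral"
    return labels


# Illustrative values only, not taken from the paper.
sample = {f"motif{i}": 0.0 for i in range(9)}
sample["CC"] = 10.0
print(classify_motifs(sample)["CC"])
```

With only a handful of motifs, almost nothing can exceed two standard deviations, so the rule is meaningful mainly when applied across the full set of pairs or triplets.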
Mutation Implications
You're absolutely right in connecting this to mutation detection:
If a motif is D-positive, it might be the result of a mutation that increases its frequency. This could be biologically meaningful—like stabilizing a protein or forming a disulfide bridge (e.g., CC).
If a motif is D-negative, it is underrepresented in natural sequences. So if it does appear, it might signal a rare mutation or an error.
For example:
CC is D-positive: occurs more than expected → could be a stabilizing mutation.
CM is D-negative: occurs less than expected → unlikely to be a natural mutation unless something disrupts the usual pattern.
Practical Use
This framework lets you:
Detect mutations by scanning for motifs that cross the D-threshold.
Classify proteins based on motif content.
Design sensors or diagnostics that look for these motifs as biomarkers.
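As a rough illustration of the scanning step, the sketch below reports where a given set of flagged motifs occurs in a sequence. The function name `find_motif_positions`, the example sequence, and the motif set are made up; in practice the motif set would come from whatever D-positive or D-negative list you have computed.

```python
def find_motif_positions(sequence, motifs):
    """Return {motif: [0-based start indices]} for every motif found."""
    hits = {}
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.setdefault(motif, []).append(start)
            # Step forward by one so overlapping occurrences are counted.
            start = sequence.find(motif, start + 1)
    return hits


# Hypothetical sequence and motif set.
hits = find_motif_positions("MACCCKCM", {"CC", "CM"})
print(hits)
```

Motifs that never occur are left out of the result, so an empty dictionary means the sequence contains none of the flagged motifs.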
To connect motif density to actual shape, you'd need:
Structural databases like PDB to map motifs onto 3D folds.
Secondary structure prediction tools (e.g. PSIPRED, AlphaFold).
Domain annotation tools like InterPro or Pfam.
But the beauty of Shen et al.'s method is that it gives you a fast, sequence-only proxy for domain inference—especially useful when structure is unknown.
PROSITE – A curated database of protein domains and motifs
CryoDomain – Uses cryo-EM density maps to identify domains without sequence alignment
Nature study on essential domains – Uses transposon mutagenesis to identify essential domains based on density and disruption pattern
Common Methods to Extract and Read DNA/RNA:
PCR (Polymerase Chain Reaction): Amplifies specific DNA segments.
Sanger Sequencing: Classic method using chain termination.
Next-Generation Sequencing (NGS): High-throughput, massively parallel sequencing.
Nanopore Sequencing: Reads DNA/RNA by detecting changes in electrical current as molecules pass through a pore.
RNA-Seq: Specifically sequences RNA to study gene expression.
To understand the "how" of a mutation and not only the "what", researchers combine:
Genomic context (e.g. mutation hotspots)
Environmental data (e.g. carcinogen exposure)
Functional assays (e.g. protein activity tests)
Evolutionary analysis (e.g. selective pressure)
Based on the article on probabilistic analyses by Shiyi Shen, Bo Kai, Jishou Ruan, J. Torin Huzil, Eric Carpenter, and Jack A. Tuszynski.