Substitution matrices
25 important questions on Substitution matrices
What is the definition of a substitution matrix?
For nucleotide sequences, there is a simple substitution model and a more complicated one. Name both and explain.
- Simple: Jukes-Cantor
- Positive value for a match (e.g. 1), negative value for a mismatch (e.g. −1)
- Mutation frequencies are equal for all bases
- This leads to a matrix with crosses on the diagonal; all other positions have the same value (e.g. alpha)
- More complicated: Kimura
- Takes into account transitions and transversions
- This leads to a matrix with crosses on the diagonal; the other positions have either the value alpha (transition) or beta (transversion)
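The two matrix shapes can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the diagonal "crosses" are represented by None, and the values chosen for alpha and beta are arbitrary.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(x, y):
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def jukes_cantor_matrix(alpha):
    """All off-diagonal positions share one value; the diagonal is left empty (None)."""
    return [[None if x == y else alpha for y in BASES] for x in BASES]


def kimura_matrix(alpha, beta):
    """Off-diagonal positions are alpha for transitions and beta for transversions."""
    return [
        [None if x == y else (alpha if is_transition(x, y) else beta) for y in BASES]
        for x in BASES
    ]


# Arbitrary illustrative values: transitions weighted twice as heavily as transversions.
for row in kimura_matrix(alpha=2, beta=1):
    print(row)
```

Rows and columns follow the order A, C, G, T, so the A/G and C/T cells hold alpha and all other off-diagonal cells hold beta.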
There are various models to correct for the fact that the true rate of evolution cannot be observed through nucleotide (or amino acid) exchange patterns (e.g. due to back mutations). What is the saturation level for nucleotide sequences? And for proteins? Explain this concept.
- Nucleotide sequences: saturation level = ~75%
- Protein sequences: saturation level = ~94%
- This means that beyond the saturation level, additional real mutations are no longer observable
Genetic saturation is the result of multiple substitutions at the same site in a sequence, or identical substitutions in different sequences, such that the apparent sequence divergence rate is lower than the actual divergence that has occurred.
--> observed distance lower than true distance
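The ~75% saturation level for nucleotides can be illustrated numerically. The sketch below assumes the Jukes-Cantor model (equal base frequencies and equal rates), under which the expected observed fraction of differing sites is p = 3/4 (1 − exp(−4d/3)) for a true distance d. The function name is hypothetical.

```python
import math


def expected_observed_difference(d):
    """Expected fraction of differing sites for a true distance d (Jukes-Cantor assumption)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# The observed difference lags further behind the true distance as d grows
# and never exceeds 0.75.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true d = {d}: observed p = {expected_observed_difference(d):.3f}")
```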
What is the use of Jukes-Cantor and Kimura models in alignment?
- The main use in alignment is ‘correcting’ the alignment scores (Jukes-Cantor and Kimura models for DNA; (another) Kimura model for proteins).
- However, the majority of (multiple) alignment methods do not use this correction, since it is based upon crude sequence identity scores that only score residue matches.
- An important use of these methods, however, is in phylogeny, where attempts are made to reconstruct the most likely evolutionary descent of sequences, particularly in Maximum Likelihood methods (see later lecture).
The Jukes-Cantor model can be used to calculate the "true" distance from the observed distance. How do you calculate d?
d = −3/4 ln(1 − 4/3 p), where p is the proportion of sites that differ between two sequences. Here, d is measured in terms of the average number of mutations that have occurred per site (not the time since divergence).
Assumptions: all base frequencies = 1/4, and rates are equal as well.
Works only for closely related sequences.
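A minimal sketch of the Jukes-Cantor correction, d = −3/4 ln(1 − 4/3 p), is given below. The function name and the range check are illustrative choices, not part of the original notes.

```python
import math


def jukes_cantor_distance(p):
    """Corrected distance d from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches the 75% saturation level.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Example: 30% observed difference gives a corrected distance above 0.30.
print(round(jukes_cantor_distance(0.30), 3))
```

The corrected distance is always at least as large as p, and the gap widens as p approaches 0.75.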
Which theory is by Kimura?
- At molecular level most evolutionary changes and most of variation within and between species is not caused by natural selection but by genetic drift of mutant alleles that are neutral.
- A neutral mutation does not affect an organism's ability to survive and reproduce.
- Theory allows for possibility that most mutations are deleterious (Darwin), but since these are rapidly removed by natural selection, they do not make significant contributions to variation within and between species at the molecular level.
- Mutations that are not deleterious are assumed to be mostly neutral rather than beneficial.
How do you calculate true rate of evolution with the Kimura model?
- An example: let sequences A and B differ by 30%. If 20% of changes are a result of transitions (ti) and 10% of changes are a result of transversions (tv), the evolutionary distance can be calculated using
- dAB = −1/2 ln(1 − 2 × 0.2 − 0.1) − 1/4 ln(1 − 2 × 0.1) = 0.40
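The calculation above can be reproduced with a short function. This is a sketch using the Kimura two-parameter formula d = −1/2 ln(1 − 2P − Q) − 1/4 ln(1 − 2Q), where P is the observed fraction of transitions and Q the observed fraction of transversions. The function and parameter names are hypothetical.

```python
import math


def kimura_2p_distance(p_ti, p_tv):
    """Corrected distance from observed transition (p_ti) and transversion (p_tv) fractions."""
    a = 1.0 - 2.0 * p_ti - p_tv
    b = 1.0 - 2.0 * p_tv
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# The worked example from the text: 20% transitions, 10% transversions.
print(round(kimura_2p_distance(0.2, 0.1), 2))
```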
How do you calculate the true rate of evolution for protein sequences?
- Express sequence distance as (1 – fraction identity)
- Protein sequences: dAB = −ln(1.0 − pAB − (pAB)^2/5.0),
- where dAB is the corrected distance and pAB is the observed divergence (as a fraction, not a percentage) between the two aligned sequences A and B.
- This is a heuristic formula. It is only defined while the argument of the logarithm is positive, i.e. for p below about 0.85.
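The protein correction can be sketched the same way. The function name is hypothetical; rather than hard-coding a cut-off for p, the check tests directly whether the argument of the logarithm is still positive.

```python
import math


def kimura_protein_distance(p):
    """Heuristic corrected distance from the observed fraction p of differing residues."""
    argument = 1.0 - p - (p * p) / 5.0
    if p < 0.0 or argument <= 0.0:
        raise ValueError("p is outside the range where the formula is defined")
    return -math.log(argument)


# Example: 30% observed divergence (70% identity).
print(round(kimura_protein_distance(0.30), 3))
```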
What is better to align, DNA or protein?
- Many mutations within DNA are synonymous ⇒ divergence overestimation.
- Evolutionary relationships can be more accurately expressed using a 20×20 amino acid exchange table
- DNA sequences contain non-coding regions, which should be avoided in homology searches.
- Still an issue when translating into (six) protein sequences through a codon table.
- Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation or truncation. However, frameshifts normally result in stretches of highly unlikely amino acids.
Why are scoring matrices for amino acids more complicated than those for nucleotides?
- 20 vs 4
- Scoring has to reflect:
- Physicochemical properties of amino acids
- Likelihood of residues being substituted among truly homologous sequences
- Certain amino acids with similar properties can be more easily substituted: this preserves structure/function
- “Disruptive” substitution is less likely to be selected in evolution (non functional proteins)
What is the source of the target and background probabilities we use?
- The “evolutionarily true” alignments allow us to get statistics on biologically permissible amino acid mutations and derive the frequencies of observed pairs. These are the TARGET frequencies (20x20 combinations).
- The BACKGROUND frequencies are simply the frequencies at which each amino acid type is observed in these “trusted” data sets (20 values).
How are the scores in a substitution matrix calculated?
- Substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions
- The converted values are the so-called log-odds scores
- So they are simply the logarithm of the ratio between the observed mutation frequency and the frequency of substitution expected by random chance (target divided by background)
What do a positive, zero and negative score mean in a substitution matrix?
- a positive score means that the frequency of amino acid substitutions found in the high-confidence alignments is greater than would have occurred by random chance
- a zero score that the frequency is equal to that expected by chance
- a negative score that the frequency is less than that expected by chance
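The log-odds score and the meaning of its sign can be illustrated with a small sketch. The frequencies used here are made-up numbers, not values from any real matrix, and the function name is hypothetical. The chance expectation for a pair is taken as the product of the two background frequencies.

```python
import math


def log_odds(target_ab, background_a, background_b, base=2.0):
    """Log of the target pair frequency divided by the frequency expected by chance."""
    expected_by_chance = background_a * background_b
    return math.log(target_ab / expected_by_chance, base)


# Hypothetical background frequency of 0.1 for both residues (chance expectation 0.01).
print(log_odds(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
print(log_odds(0.01, 0.1, 0.1))   # pair seen as often as chance: about zero
print(log_odds(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
```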
What are empirical matrices based on? Which two are most used?
PAM and BLOSUM
What is the key idea of the PAM matrices?
How is a PAM matrix constructed? 4 steps.
- Dayhoff used 71 protein families (each family having closely related family members), made hypothetical phylogenetic trees and recorded the number of observed substitutions (along each branch of the tree) in a 20x20 target matrix.
- The target matrix was then converted to frequencies by dividing each cell (a,b) by the sum of all other substitutions of a.
- The target matrix was normalized so that the expected number of substitutions covered 1% of the protein (PAM-1).
- Determine the final substitution matrix.
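Matrices for larger PAM distances are commonly obtained by multiplying the PAM-1 mutation probability matrix by itself. The sketch below shows this on a toy three-letter alphabet. The matrix values are invented for illustration (each row sums to 1 and each letter has a 1% chance of changing), and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def pam_n(pam1, n):
    """Extrapolate a PAM-1 style probability matrix to PAM-n by repeated multiplication."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy "PAM-1": a 1% chance per letter of being replaced (values are hypothetical).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam2 = pam_n(TOY_PAM1, 2)
for row in toy_pam2:
    print([round(value, 5) for value in row])
```

Because the same small matrix is multiplied many times, any error in the PAM-1 estimates is carried through to the extrapolated matrices.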
What would a PAM2 matrix mean?
What are 4 assumptions of the PAM matrix?
- Likelihood of amino acid X replacing Y is the same as Y replacing X
- Very closely related proteins are used, to reduce the number of intermediate (multi-step) mutations such as X --> Y --> Z
- Replacement at any site depends only on the amino acid at that site and the probabilities of a given Markov model; all positions in proteins are equally mutable
- All sequences have average amino acid composition
What is a mistake in the PAM250 matrix we are aware of?
PAM matrices are derived from global alignments of closely related sequences. On which alignments are the BLOSUM matrices based?
The BLOSUM matrices are based on the BLOCKS database. What kind of database is this?
- The Blocks Database contains multiple alignments of conserved regions in protein families.
- Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.
- The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
BLOSUM: what does the number after the matrix refer to?
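- The number is the percent-identity threshold used when building the matrix from the blocks, e.g. 30% for BLOSUM30 (sequences within a block that are more similar than the threshold are clustered and weighted as a single sequence)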
- High number - closely related sequences
- Low number - distant sequences
- BLOSUM62 is the most popular exchange matrix: best for general alignment.
Give 5 steps to obtain a BLOSUM matrix.
- Counting mutations
- Tallying mutation frequencies
- Matrix of mutation probabilities.
- Calculate abundance of each residue (marginal probabilities).
- Obtain matrix
- S(ij) = 2 × log2( p(ij) / (p(i) × p(j)) )
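The five steps can be sketched on a toy ungapped block. This is a simplified illustration, not the full BLOSUM procedure: it skips the identity-based clustering of sequences, uses an invented two-letter block, and counts every pair in both orders so that p(ij) is symmetric and the formula S(ij) = 2 × log2( p(ij) / (p(i) × p(j)) ) applies directly. The function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Rounded log-odds scores from a list of equal-length, ungapped sequences."""
    # Steps 1-2: count and tally all residue pairs within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1  # count both orders to keep the table symmetric
    # Step 3: convert counts to pair probabilities p(ij).
    total = sum(pair_counts.values())
    pair_prob = {pair: count / total for pair, count in pair_counts.items()}
    # Step 4: marginal probability p(i) of each residue.
    marginal = Counter()
    for (x, _), prob in pair_prob.items():
        marginal[x] += prob
    # Step 5: log-odds score in half-bit units, rounded to an integer.
    return {
        (x, y): round(2 * math.log2(prob / (marginal[x] * marginal[y])))
        for (x, y), prob in pair_prob.items()
    }


toy_block = ["AAB", "AAB", "ABB"]  # hypothetical block with letters A and B
scores = blosum_like_scores(toy_block)
print(scores)
```

In this toy block the identical pairs score positive and the A/B mismatch scores negative, because mismatches occur less often than the marginal frequencies alone would predict.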
Give some summary points for PAM vs BLOSUM.
- PAM:
- Based on an explicit evolutionary model
- Derived from small, closely related proteins with ~15% divergence
- Higher PAM numbers to detect more remote sequence similarities
- Errors in PAM 1 are scaled 250X in PAM 250
- BLOSUM:
- Based on empirical frequencies
- Uses a much larger, more diverse set of protein sequences (30-90% ID)
- Lower BLOSUM numbers to detect more remote sequence similarities
- Errors in BLOSUM arise from errors in alignment
What is the twilight zone?
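- The range around 20% sequence identity (roughly 15-25%), where sequence similarity alone no longer reliably indicates homology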