Substitution matrices

25 important questions on Substitution matrices

What is the definition of a substitution matrix?

A two-dimensional matrix with score values describing the probability of one amino-acid/nucleotide being replaced by another during sequence evolution.

For nucleotide sequences, there is a simple substitution model and a more complicated one. Name both and explain.

  • Simple: Jukes-Cantor
    • positive value for a match (e.g. 1), negative value for a mismatch (e.g. -1)
    • Mutation frequencies are equal for all bases
      • leading to a matrix with crosses on the diagonal, in which all other positions have the same value (e.g. alpha)
  • More complicated: Kimura
    • takes into account transitions and transversions
      • leading to a matrix with crosses on the diagonal, in which every other position has one of two values: alpha (transition) or beta (transversion)
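
The two nucleotide models above can be sketched as scoring tables. This is a minimal illustration, not a reference implementation: the function names and the concrete score values (1, -1, -2) are assumptions chosen for the example. A transition is a change within the purines (A, G) or within the pyrimidines (C, T); any other change is a transversion.

```python
BASES = "ACGT"
PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(a, b):
    """True when a and b are different bases of the same chemical class."""
    return a != b and ((a in PURINES) == (b in PURINES))


def jukes_cantor_scores(match=1, mismatch=-1):
    """Jukes-Cantor style table: every mismatch gets the same score."""
    return {(a, b): (match if a == b else mismatch)
            for a in BASES for b in BASES}


def kimura_scores(match=1, transition=-1, transversion=-2):
    """Kimura style table: transitions and transversions score differently.

    The default values are illustrative assumptions only.
    """
    scores = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                scores[(a, b)] = match
            elif is_transition(a, b):
                scores[(a, b)] = transition
            else:
                scores[(a, b)] = transversion
    return scores
```

With these assumed values, A-G and C-T (transitions) are penalized less than A-C or G-T (transversions) in the Kimura table, while the Jukes-Cantor table treats all four the same.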

There are various models to correct for the fact that the true rate of evolution cannot be observed through nucleotide (or amino acid) exchange patterns (e.g. due to back mutations). What is the saturation level for nucleotide sequences? And for proteins? Explain this concept.

  • Nucleotide sequences: saturation level = ~75%
  • Protein sequences: saturation level = ~94%
  • So this means: beyond these levels, additional real mutations no longer increase the observed divergence


Genetic saturation is the result of multiple substitutions at the same site in a sequence, or identical substitutions in different sequences, such that the apparent sequence divergence rate is lower than the actual divergence that has occurred.


--> observed distance lower than true distance
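
Saturation can be illustrated with a small sketch. It assumes the simplest possible model, in which all states (4 bases or 20 amino acids) are equally frequent and equally exchangeable; the function name is hypothetical. Under that assumption the expected observed fraction of differing sites levels off at 1 - 1/n, i.e. 75% for nucleotides and 95% for amino acids. The slightly lower ~94% quoted above is consistent with amino acids not being equally frequent in real proteins.

```python
import math


def expected_observed_fraction(d, n_states=4):
    """Expected fraction of differing sites after d substitutions per site.

    Assumes equal frequencies and equal exchange rates for all states
    (the Jukes-Cantor assumption, generalized to n_states).
    """
    ceiling = 1.0 - 1.0 / n_states  # saturation level
    return ceiling * (1.0 - math.exp(-d / ceiling))


# The observed fraction flattens out while the true distance keeps growing.
for d in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(d,
          round(expected_observed_fraction(d, 4), 3),
          round(expected_observed_fraction(d, 20), 3))
```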

What is the use of Jukes-Cantor and Kimura models in alignment?

  • The main use in alignment is ‘correcting’ the alignment scores (Jukes-Cantor and Kimura models for DNA; (another) Kimura model for proteins).
  • However, the majority of (multiple) alignment methods do not use this correction, since it is based upon crude sequence identity scores that only score residue matches.
  • An important use of these methods, however, is in phylogeny, where attempts are made to reconstruct the most likely evolutionary descent of sequences, particularly in Maximum Likelihood methods (see later lecture).

The Jukes-Cantor model can be used to calculate the "true" distance from the observed distance. How do you calculate d?



d = −(3/4) ln(1 − (4/3) p)

where p is the proportion of sites that differ between two sequences. Here, d is measured in terms of the average number of mutations that have occurred per site (not the time since divergence).

Assumptions: all base frequencies = 1/4, and rates are equal as well.
Works only for closely related sequences.
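
As a minimal sketch, the standard Jukes-Cantor correction d = −(3/4) ln(1 − (4/3) p) can be written as a small function. The function name and the error handling are assumptions made for the example; the formula is undefined once p reaches the 75% saturation level.

```python
import math


def jukes_cantor_distance(p):
    """Corrected distance (substitutions per site) from observed fraction p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For small p the corrected distance stays close to p; as p approaches 0.75 the correction grows without bound, which is one reason the estimate becomes unreliable for distant sequences.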

Which theory is by Kimura?

Neutral theory of molecular evolution.
  • At molecular level most evolutionary changes and most of variation within and between species is not caused by natural selection but by genetic drift of mutant alleles that are neutral.
  • A neutral mutation does not affect an organism's ability to survive and reproduce.
  • Theory allows for possibility that most mutations are deleterious (Darwin), but since these are rapidly removed by natural selection, they do not make significant contributions to variation within and between species at the molecular level.
  • Mutations that are not deleterious are assumed to be mostly neutral rather than beneficial. 

How do you calculate true rate of evolution with the Kimura model?

dAB = −(1/2) ln(1 − 2pti − ptv) − (1/4) ln(1 − 2ptv)

  • An example: let sequences A and B differ at 30% of their sites. If transitions (ti) account for differences at 20% of sites and transversions (tv) for differences at 10% of sites, the evolutionary distance can be calculated using
  • dAB = −(1/2) ln(1 − 2 × 0.2 − 0.1) − (1/4) ln(1 − 2 × 0.1) = 0.40
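
The Kimura formula above can be checked numerically with a short sketch. The function and parameter names are assumptions for illustration; p_ti and p_tv are the observed fractions of sites that differ by a transition and by a transversion, respectively.

```python
import math


def kimura_2p_distance(p_ti, p_tv):
    """Kimura two-parameter distance from transition/transversion fractions."""
    a = 1.0 - 2.0 * p_ti - p_tv
    b = 1.0 - 2.0 * p_tv
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Worked example from the text: 20% transitions, 10% transversions.
d_ab = kimura_2p_distance(0.2, 0.1)
```

For the worked example this gives a value of about 0.40, somewhat larger than the observed 30% difference.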

How do you calculate the true rate of evolution for protein sequences?

Use sequence identity with the Kimura correction:
  • Express sequence distance as (1 – fraction identity)
  • Protein sequences: dAB = −ln(1.0 − pAB − (pAB)^2/5.0),
  • where dAB is the corrected distance and pAB is the observed fraction of differing positions between the two aligned sequences A and B.
  • This is a heuristic formula. It is only defined while the argument of the logarithm is positive, i.e. for p below roughly 0.85.
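
A minimal sketch of the protein correction follows, assuming p is given as a fraction (1 − fraction identity) rather than a percentage. The function name is hypothetical. The explicit check makes the domain limit visible: the logarithm's argument 1 − p − p^2/5 reaches zero at p of about 0.854.

```python
import math


def kimura_protein_distance(p):
    """Kimura's heuristic corrected distance for protein sequences."""
    arg = 1.0 - p - (p * p) / 5.0
    if arg <= 0.0:
        raise ValueError("p is too large; the correction is undefined")
    return -math.log(arg)
```

For example, an observed difference of p = 0.3 is corrected upward to roughly 0.38.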

What is better to align, DNA or protein?

If ORF exists, then align at protein level. Arguments:
  1. Many mutations within DNA are synonymous ⇒ divergence overestimation.
  2. Evolutionary relationships can be more accurately expressed using a 20×20 amino acid exchange table
  3. DNA sequences contain non-coding regions, which should be avoided in homology searches.
  4. Still an issue when translating into (six) protein sequences through a codon table.
  5. Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation or truncation. However, frameshifts normally result in stretches of highly unlikely amino acids.

Why are scoring matrices for amino acids more complicated than those for nucleotides?

  • 20 amino acids vs 4 nucleotides
  • Scoring has to reflect:
    • Physico-chemical properties of amino acids
    • Likelihood of residues being substituted among truly homologous sequences
  • Certain aa with similar properties can be more easily substituted: preserve structure/function
  • “Disruptive” substitution is less likely to be selected in evolution (non functional proteins)

What is the source of the target and background probabilities we use?

High confidence alignments.
  • The “evolutionary true” alignments allow us to get statistics on biologically permissible amino acid mutations and derive the frequencies of observed pairs. These are the TARGET frequencies (20x20 combinations). 
  • The BACKGROUND frequencies are simply the frequencies at which each amino acid type is observed in these “trusted” data sets (20 values).

How are the scores in a substitution matrix calculated?

  • Substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions
  • The converted values are the so-called log-odds scores
  • So they are simply the logarithm of the ratio of the observed mutation frequency to the frequency of substitution expected by random chance (target / background)

What do a positive, zero and negative score mean in a substitution matrix?

  • a positive score means that the frequency of amino acid substitutions found in the high-confidence alignments is greater than would have occurred by random chance
  • a zero score that the frequency is equal to that expected by chance
  • a negative score that the frequency is less than that expected by chance
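
The sign rules above follow directly from the log-odds definition, as this small sketch shows. The function name, the base-2 logarithm with a scale factor of 2, and the frequencies used in the examples are assumptions chosen for illustration, not values from any real matrix.

```python
import math


def log_odds(target_pair_freq, background_a, background_b, scale=2.0):
    """Log-odds score: observed (target) pair frequency vs chance expectation.

    The chance expectation is taken as the product of the two background
    frequencies; scale=2.0 with log base 2 is an assumed convention.
    """
    expected = background_a * background_b
    return scale * math.log2(target_pair_freq / expected)


# Hypothetical frequencies: both residues have background frequency 0.1,
# so a pair is expected by chance with frequency 0.01.
more_than_chance = log_odds(0.02, 0.1, 0.1)   # positive score
equal_to_chance = log_odds(0.01, 0.1, 0.1)    # score close to zero
less_than_chance = log_odds(0.005, 0.1, 0.1)  # negative score
```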

What are empirical matrices based on? Which two are most used?

They are based on surveys of actual amino acid substitutions among related proteins, as observed in sets of protein multiple sequence alignments.

PAM and BLOSUM

What is the key idea of the PAM matrices?

Trusted alignments of closely related sequences provide information about biologically permissible mutations.

How is a PAM matrix constructed? 4 steps.

  1. Dayhoff used 71 protein families (each family having closely related family members), made hypothetical phylogenetic trees and recorded the number of observed substitutions (along each branch of the tree) in a 20x20 target matrix.
  2. The target matrix was then converted to frequencies by dividing each cell (a,b) by the sum of all other substitutions of a.
  3. The target matrix was normalized so that the expected number of substitutions covered 1% of the protein (PAM-1).
  4. Determine the final substitution matrix.

What would a PAM2 matrix mean?

PAM2 corresponds to M^(2), the PAM1 mutation probability matrix M multiplied by itself: it describes the mutations that happen over twice the evolutionary period covered by PAM1.
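
Extrapolating to longer evolutionary periods is a matrix power of the mutation probability matrix (not of the final log-odds scores). The sketch below uses a made-up two-letter alphabet so the arithmetic is easy to follow; the toy matrix and helper names are assumptions, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, exponent):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue has a 1% chance of being replaced per period.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = mat_power(toy_pam1, 2)      # two periods
toy_pam250 = mat_power(toy_pam1, 250)  # 250 periods
```

In the toy PAM2 matrix the chance of seeing a replaced residue is slightly below 2%, because some double mutations return to the original residue. Each row still sums to 1.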

What are 4 assumptions of the PAM matrix?

  • Likelihood of amino acid X replacing Y is the same as Y replacing X
  • Very closely related proteins are used, to minimize multi-step mutations through an intermediate residue, such as X --> Y --> Z
  • Replacement at any site depends only on the amino acid currently at that site and on the substitution probabilities (a Markov model); all positions in a protein are equally mutable
  • All sequences have average amino acid composition

What is a mistake in the PAM250 matrix we are aware of?

The W-R exchange score is 2, which is far too high compared with the other exchange values for W (all zero or negative). This is due to a paucity of data: the PAM matrices were created from a small dataset.

PAM matrices are derived from global alignments of closely related sequences. On which alignments are the BLOSUM matrices based?

Derived from local, un-gapped alignments of distantly related sequences (the BLOCKS database)

The BLOSUM matrices are based on the BLOCKS database. What kind of database is this?

  • The Blocks Database contains multiple alignments of conserved regions in protein families.
  • Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.
  • The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

BLOSUM: what does the number after the matrix refer to?

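The number after the matrix (e.g. BLOSUM62) refers to the percent identity threshold applied to the blocks (in the BLOCKS database) when the matrix was constructed: for BLOSUM62, sequences that are >= 62% identical are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than 62% identical;
  • for BLOSUM30 the clustering threshold is 30% sequence identity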


  • High number - closely related sequences
  • Low number - distant sequences
  • BLOSUM62 is the most popular exchange matrix: best for general alignment.

Give 5 steps to obtain a BLOSUM matrix.

  1. Counting mutations
  2. Tallying mutation frequencies
  3. Matrix of mutation probabilities.
  4. Calculate abundance of each residue (marginal probabilities).
  5. Obtain matrix
    1. S(ij) = 2 * log2( p(ij) / (p(i) * p(j)) )
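
The five steps above can be sketched on a toy block. This is a simplified illustration under stated assumptions: the block is made up, pairs are counted in both orders so that p(ij) is a symmetric joint probability matching the formula above, and the identity-based clustering and rounding to integers used for real BLOSUM matrices are left out. The function name is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores from one ungapped block (list of equal-length strings)."""
    # Steps 1-2: count residue pairs in every column, in both orders.
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1

    # Step 3: pair probabilities p(ij).
    total = sum(counts.values())
    p_pair = {pair: n / total for pair, n in counts.items()}

    # Step 4: marginal probabilities p(i) = sum over j of p(ij).
    p_single = Counter()
    for (a, _), freq in p_pair.items():
        p_single[a] += freq

    # Step 5: S(ij) = 2 * log2(p(ij) / (p(i) * p(j))).
    return {(a, b): 2 * math.log2(freq / (p_single[a] * p_single[b]))
            for (a, b), freq in p_pair.items()}


# Made-up block of four sequences, two columns wide.
toy_block = ["AS", "AS", "AS", "SS"]
toy_scores = blosum_like_scores(toy_block)
```

In this toy block the identical pairs A-A and S-S receive positive scores and the A-S exchange a negative one, because A-S pairs occur less often than the residue abundances alone would predict.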

Give some summary points for PAM vs BLOSUM.

PAM:
  • based on explicit evolutionary model
  • Derived from small, closely related proteins with ~15% divergence
  • Higher PAM numbers to detect more remote sequence similarities
  • Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
  • Based on empirical frequencies
  • Uses much larger, more diverse set of protein sequences (30-90% ID)
  • Lower BLOSUM numbers to detect more remote sequence similarities
  • Errors in BLOSUM arise from errors in alignment

What is the twilight zone?

The 'twilight zone' of protein sequence comparison is the region in which sequence similarity does not suffice to conclude e.g. structural similarity.

It lies around 20% sequence identity (roughly 15-25%).
