Introduction to phylogenetic/phylogenomic concepts and methods

19 important questions on Introduction to phylogenetic/phylogenomic concepts and methods

What is ancestral sequence reconstruction?

  • Take a node of the tree that is the root of a subtree and try to determine what the ancestral sequence at that node would have been.
  • A simple example: take the most frequent amino acid at each position.
  • Synthesize (!) this protein, investigate its function and structure, and run some assays.
  • The goal is to obtain knowledge about the ancestral protein.
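
The "most frequent aa per position" idea can be sketched in a few lines of Python. This is only the simple majority-rule shortcut from the example above; real ancestral reconstruction usually works along the tree with likelihood or Bayesian methods. The function name and the toy alignment are made up for illustration.

```python
from collections import Counter

def consensus_sequence(msa, gap="-"):
    """Return the most frequent non-gap residue for each alignment column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != gap)
        # An all-gap column stays a gap; ties go to the residue seen first.
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)

# Hypothetical aligned sequences from the leaves of one subtree.
toy_msa = ["MKVLA", "MKILA", "MRVLA", "MKV-A"]
print(consensus_sequence(toy_msa))  # MKVLA
```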

What are 2 (virtually synonymous) traditional uses of phylogenetic trees? And newer applications?

  • Reconstructing species phylogenies
  • Input: MSA of a single gene family

New (omics era):
  • phylogenomic function prediction
  • exploring evolution of functional/structural domains
  • ortholog identification (reconcile a species tree and multi-gene tree)
  • constructing a guide tree for MSA
  • deriving sequence weights in profile and HMM construction
  • ancestral sequence reconstruction
  • functional site prediction
  • etc etc

Explain the difference between distance and character based phylogenetic trees.

Distance based:
  • Input is a matrix of distances between sequences
  • Distances can be computed in many ways (based on MSA)
  • Fundamental issue: once the distances are derived, the MSA data is set aside, so the characters are thrown away.
  • E.g., Neighbor-Joining, UPGMA, etc.
Character based:
  • Examine each character (= residue) separately
  • Retain character information at every stage in the tree estimation
  • Note that a “distance” between clades can be computed at each stage in the tree construction while still being a character-based method
  • E.g., Maximum Likelihood, Maximum Parsimony, MrBayes, SATCHMO
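
To make the distance-based input concrete, here is a minimal sketch that turns an MSA into a pairwise distance matrix. It uses the p-distance (fraction of differing positions), which is just one of the "many ways" to compute distances; the function names and toy alignment are assumptions for illustration. Note how the result no longer contains the characters themselves.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions, skipping columns with a gap in either sequence."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else float("nan")

def distance_matrix(msa):
    """Symmetric matrix of pairwise p-distances for a list of aligned sequences."""
    n = len(msa)
    return [[p_distance(msa[i], msa[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy alignment.
toy_msa = ["ACGT", "ACGA", "TCGA"]
for row in distance_matrix(toy_msa):
    print(row)
```

A method such as Neighbor-Joining or UPGMA would take only this matrix as input.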

Name 2 types of errors in trees.

  • In the branching order (topology)
    • Can be in the coarse branching order (close to the root):
      • E.g., relative branching order between taxonomic groups
      • or between clades representing different genes in a multi-gene tree, including duplication events etc.
    • In the fine branching order (closer to leaves):
      • E.g., relative order within Hominidae
  • In the branch lengths
    • In general, less of a problem for functional inferences in protein superfamily reconstruction
    • Mainly looked at for dating MRCAs
    • But for functional inference: after neofunctionalization, the new function is probably found in the clade with the long branch and the old function in the clade with the short branch

Name 5 sources of errors in phylogenetic trees.

  • Sparse taxon sampling
  • Lineage-specific rate variation
  • Site-specific rate variation
  • Sequence fragments (or gene model errors)
  • Insufficient site data (e.g., short MSA)

Why is sparse taxon sampling a problem?

In protein superfamily reconstruction, refers to the selection of proteins (multiple genes and multiple species)
  • Sampling is sparse if there are large gaps in the evolutionary space (e.g., by restricting sequences based on some criteria, e.g., to fully sequenced genomes, or manually curated databases)
  • Sparse taxon sampling has much more influence on your tree than which algorithm you use!
  • So here: being really strict on sequence information quality can lead to sparse taxon sampling...

Why is lineage-specific rate variation a problem?

  • Historically refers to species or clades that are evolving more rapidly than others
    • E.g., rat evolves faster than mouse, within the rodent group
  • In protein superfamily reconstruction, refers to subfamilies (a group of orthologs) that are evolving rapidly (perhaps due to neo-functionalization)
    • So imagine after a duplication event, the duplicate that conserves function evolves much slower

Why is site-specific rate variation a problem?

  • A site is a position (column) in the MSA
  • Less common in single gene trees (orthologs in different species) than among protein superfamilies
  • Very common in protein superfamilies due to diversification of function following gene duplication
  • Again you are dealing with different rates of evolution within a tree
  • This is one of the hardest things to account for in a simulation study for evaluating tree construction methods

Why are sequence fragments/gene model errors a problem?

  • Very common in protein sequence databases (especially in eukaryotic genomes)
  • E.g., a short exon next to a long intron is missed in the annotation
  • Global alignment then leads to misalignment
  • This is why many people believe in domain based methods!

Why is insufficient site data a problem?

  • Very common for trees based on single domains (esp. if <100 aa) or in highly divergent MSAs following stringent masking protocols
    • You need at least 100 columns
    • Strict masking --> loss of columns, e.g., those with many gaps
  • Gene matrix approaches (see Delsuc et al.) address this problem for species phylogeny estimation
  • Few informative sites (e.g., using protein sequences instead of DNA for closely related taxa)
    • Imagine you are left with very few columns, all with very high conservation: not very informative
    • This brings you back to the sparse sampling problem
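
One way to see how little signal an alignment carries is to count its informative columns. The notes do not define "informative"; the sketch below assumes the common parsimony-informative criterion (at least two different residues, each present in at least two sequences). Function names and the toy alignment are illustrative.

```python
from collections import Counter

def is_parsimony_informative(column, gap="-"):
    """True if at least two different residues each occur in at least two sequences."""
    counts = Counter(res for res in column if res != gap)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def count_informative_sites(msa):
    """Number of parsimony-informative columns in a list of aligned sequences."""
    return sum(is_parsimony_informative(column) for column in zip(*msa))

# Hypothetical toy alignment: column 1 is fully conserved, column 4 has a
# single odd residue, so only columns 2 and 3 are informative.
toy_msa = ["AAGT", "AAGT", "ATCT", "ATCA"]
print(count_informative_sites(toy_msa))  # 2
```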

What is a clade?

A group of leaves rooted by (i.e., descending from) a single internal node.

What is a bifurcating/binary tree?

A tree for which every internal node has valency/degree 3 (so 1 ancestor and two children).

Trees for which internal nodes can have >2 children are called "multifurcating trees".
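
A small sketch of the distinction, assuming a rooted tree written as nested tuples with leaf names as strings (a representation chosen here for illustration only). In this form "two children" is checked directly; the root has two children but no ancestor.

```python
def is_bifurcating(node):
    """True if every internal node (tuple) has exactly two children."""
    if isinstance(node, str):  # a leaf
        return True
    return len(node) == 2 and all(is_bifurcating(child) for child in node)

binary_tree = (("s1", "s2"), ("s3", "s4"))
multifurcating_tree = (("s1", "s2", "s3"), "s4")

print(is_bifurcating(binary_tree))          # True
print(is_bifurcating(multifurcating_tree))  # False
```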

What is the diameter of a tree?

  • The diameter of a tree is equal to the longest path between two leaves (including edge lengths, not simply number of edges)
  • Gives an idea of evolutionary distance described by the tree
  • Diameter of the tree is also examined in method studies
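
The diameter can be computed with two traversals: walk from any node to the farthest node, then from there to the farthest node again. This is a minimal sketch, assuming the tree is given as a list of (node, node, branch length) edges with positive lengths; all names and lengths below are made up.

```python
def farthest(adj, start):
    """Return (node, distance) for the node farthest from start, summing edge lengths."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if dist > best[1]:
            best = (node, dist)
        for neighbor, length in adj[node]:
            if neighbor != parent:
                stack.append((neighbor, node, dist + length))
    return best

def tree_diameter(edges):
    """Return (leaf_a, leaf_b, length) of the longest weighted path in the tree."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    start = next(iter(adj))
    end_a, _ = farthest(adj, start)          # one end of the longest path
    end_b, diameter = farthest(adj, end_a)   # the other end and the length
    return end_a, end_b, diameter

# Hypothetical tree with branch lengths.
toy_edges = [
    ("root", "x", 0.1), ("x", "s1", 0.2), ("x", "s2", 0.3),
    ("root", "y", 0.4), ("y", "s3", 0.1), ("y", "s4", 0.6),
]
print(tree_diameter(toy_edges))
```

Here every leaf-to-leaf path across the root has the same number of edges (4), but the path between s2 and s4 is the longest once branch lengths are counted.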

How are phylogenetic reconstruction algorithms usually validated?

Simulation studies
  • Given a “True” tree, generate data (multiple sequence alignments)
  • Limitation: the generated MSAs contain no gaps!
  • Compare estimated tree to true tree

How are false positive and false negatives determined in validation of trees?

From the true tree, remove an edge; this gives a bipartition of the leaves. E.g., removing the indicated edge puts s4 and s5 on one side. Then check the inferred tree: is there a branch whose removal also puts s4 and s5 on one side? No --> false negative.

Finding false positives: do the same in reverse! Remove an edge from the inferred tree and check whether the true tree contains the same bipartition.
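
The bipartition check can be sketched as a set comparison. This assumes both trees are written as nested tuples over the same leaf names (an illustrative representation); the two toy trees are made up, with the true tree grouping s4 and s5 as in the example above.

```python
def leaf_set(node):
    """All leaf names below a node (leaves are strings, internal nodes are tuples)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def bipartitions(tree):
    """Non-trivial splits, each stored as the side that lacks the smallest leaf name."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if anchor in side:
            side = all_leaves - side
        # Skip splits that only cut off a single leaf (or nothing at all).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def compare_trees(true_tree, inferred_tree):
    """False negatives: true splits that were missed. False positives: inferred splits that are wrong."""
    true_splits = bipartitions(true_tree)
    inferred_splits = bipartitions(inferred_tree)
    return {
        "false_negatives": true_splits - inferred_splits,
        "false_positives": inferred_splits - true_splits,
    }

true_tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
inferred_tree = ((("s1", "s2"), "s4"), ("s3", "s5"))
print(compare_trees(true_tree, inferred_tree))
```

In this toy case the split {s4, s5} is in the true tree but not the inferred one (false negative), and {s3, s5} is in the inferred tree only (false positive).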

What are the 5 basic steps of constructing a phylogenetic tree?

  1. Gather homologs
  2. Construct a multiple sequence alignment
  3. Examine/edit alignment
    1. maybe remove a row, e.g., a sequence consisting mostly of gaps
    2. or crop the alignment at the N and C termini to remove the gappy, uneven ends
  4. Alignment masking
    1. remove columns with too many gaps or too much variability in aa type (e.g., a standard choice: remove columns with negative BLOSUM scores)
  5. Construct phylogenetic tree
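
The masking step can be sketched for the gap criterion alone. The 50% threshold, the function name and the toy alignment are assumptions for illustration; the variability criterion (e.g., BLOSUM-based) is not shown.

```python
def mask_gappy_columns(msa, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose fraction of gaps exceeds max_gap_fraction."""
    keep = [
        i for i, column in enumerate(zip(*msa))
        if column.count(gap) / len(column) <= max_gap_fraction
    ]
    return ["".join(row[i] for i in keep) for row in msa]

# Hypothetical alignment: column 3 is 75% gaps and is removed,
# column 2 has a single gap and is kept.
toy_msa = ["MK-LA", "M--LA", "MR-LA", "MKVLA"]
for row in mask_gappy_columns(toy_msa):
    print(row)
```

A stricter threshold removes more columns, which is exactly how stringent masking can leave too little site data.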

Why do gene trees not always correspond with species trees? Why does it matter?

  • Common answer = incomplete lineage sorting: not enough time for evolution to genetically distinguish between groups for certain genes
  • We actually don't know why it happens.
  • It matters because many algorithms for finding orthologs assume that the gene tree will follow the species tree.
    • They climb up the species tree and stop when something doesn't match anymore.
    • So you stop too soon and don't get all orthologs in the tree

Name the 3 main ortholog prediction methods.

  • Phylogenomic approaches
    • Reconcile a protein superfamily phylogeny with a trusted reference species phylogeny
    • Advantages: high precision
    • Disadvantages: computationally intensive, requires handling fuzzy species phylogenies and noisy tree topologies
  • Graph-based approaches
    • Find clusters in graphs where edges are proportional to sequence distance
    • Advantages: fast, scalable
    • Disadvantages: less accurate
  • Reciprocal best BLAST
    • Find pairs of genes such that each is the other's top hit in the other genome.
    • Advantages: fairly accurate.
    • Disadvantages: requires whole genomes (with no missed genes)
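
The reciprocal-best-hit rule itself is simple once the search results are available. The sketch below assumes the hits have already been parsed into nested dictionaries of scores; the gene names and scores are invented, and no actual BLAST run or output parsing is shown.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's top hit and a is b's top hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores: genome A genes searched against genome B, and vice versa.
a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b2": 180, "b1": 60}, "a3": {"b1": 120}}
b_vs_a = {"b1": {"a1": 245, "a3": 110}, "b2": {"a2": 175}}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Gene a3 also hits b1, but b1's top hit is a1, so a3 gets no partner. If a gene is missing from one genome's annotation, its true partner can end up paired with the wrong gene or with nothing, which is why complete genomes are required.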

What are the 3 main tree reconstruction method types?

  • Distance based methods
    • These methods first convert the character matrix into a distance matrix that represents the evolutionary distances between all pairs of species. The phylogenetic tree is then inferred from this distance matrix using algorithms such as neighbour joining (NJ) or minimum evolution (ME).
  • Maximum parsimony
    • This method selects the tree that requires the minimum number of character changes to explain the observed data.
  • Likelihood methods
    • These methods are based on a function that calculates the probability that a given tree could have produced the observed data (that is, the likelihood).
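
For the maximum parsimony criterion, the number of changes a given tree needs can be counted with Fitch's algorithm. This is a minimal sketch for scoring a fixed rooted binary tree (nested tuples, an illustrative representation); it does not search over trees, and gaps would simply be treated as another state. The toy data are made up.

```python
def fitch(node, states):
    """Return (candidate state set, number of changes) for the subtree at node."""
    if isinstance(node, str):  # leaf: its observed state, no changes yet
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, msa):
    """Sum of Fitch change counts over all alignment columns."""
    length = len(next(iter(msa.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in msa.items()})[1]
        for i in range(length)
    )

# Hypothetical aligned sequences and two candidate topologies.
toy_msa = {"s1": "AAG", "s2": "AAG", "s3": "ATC", "s4": "ATC"}
tree_a = (("s1", "s2"), ("s3", "s4"))
tree_b = (("s1", "s3"), ("s2", "s4"))
print(parsimony_score(tree_a, toy_msa), parsimony_score(tree_b, toy_msa))  # 2 4
```

Maximum parsimony would prefer tree_a here, since it explains the data with fewer changes.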
