Introduction to phylogenetic/phylogenomic concepts and methods
19 important questions on Introduction to phylogenetic/phylogenomic concepts and methods
What is ancestral sequence reconstruction?
- Take a node from the tree that is the root of a subtree and try to determine what the ancestral sequence would have been.
- So for example: take the most frequent aa per position
- Synthesize (!) this protein, investigate its function and structure, do some assays.
- To obtain knowledge on the ancestral protein.
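The "most frequent aa per position" idea can be sketched in a few lines. This is only the simple majority rule mentioned above, applied to the aligned leaf sequences of one subtree; real ancestral reconstruction tools typically use the tree and a substitution model (parsimony or likelihood). The function name and the toy sequences are made up for illustration.

```python
from collections import Counter

def consensus_sequence(aligned_seqs, gap="-"):
    """Return the most frequent non-gap residue for each alignment column."""
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != gap)
        # All-gap columns stay a gap; ties go to the residue seen first.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)

# Hypothetical aligned leaves of one subtree
subtree_leaves = ["MKTAY", "MKSAY", "MRTAY", "MKT-F"]
ancestor_guess = consensus_sequence(subtree_leaves)
print(ancestor_guess)
```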
What are 2 (virtually synonymous) traditional uses of phylogenetic trees? And newer applications?
- Reconstructing species phylogenies
- Input: MSA of a single gene family
New (omics era):
- phylogenomic function prediction
- exploring evolution of functional/structural domain
- ortholog identification (reconcile a species tree and multi-gene tree)
- constructing a guide tree for MSA
- deriving sequence weight in profile and HMM construction
- ancestral sequence reconstruction
- functional site prediction
- etc etc
Explain the difference between distance and character based phylogenetic trees.
Distance-based:
- Input is a matrix of distances between sequences
- Distances can be computed in many ways (based on MSA)
- Fundamental issue: once the distance is derived, the MSA data is set aside, so the characters are thrown away.
- E.g., Neighbor-Joining, UPGMA, etc.
Character-based:
- Examine each character (= residue) separately
- Retain character information at every stage in the tree estimation
- Note that a “distance” between clades can be computed at each stage in the tree construction while still being a character-based method
- E.g., Maximum Likelihood, Maximum Parsimony, MrBayes, SATCHMO
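To see why a distance method "throws away" the characters, here is a minimal sketch that turns an MSA into a pairwise distance matrix. It assumes the simplest possible distance (the p-distance: fraction of differing residues, skipping gapped columns); the notes do not say which distance is used, and the sequence names are invented.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0

def distance_matrix(msa):
    """Map every ordered pair of sequence names to its p-distance."""
    names = list(msa)
    return {(x, y): p_distance(msa[x], msa[y]) for x in names for y in names}

# Hypothetical toy alignment
msa = {"seqA": "ACGTAC", "seqB": "ACGTTC", "seqC": "TCGAAC"}
dist = distance_matrix(msa)
for pair in [("seqA", "seqB"), ("seqA", "seqC"), ("seqB", "seqC")]:
    print(pair, round(dist[pair], 3))
```

After this step a method such as Neighbor-Joining only sees the numbers in `dist`; which columns differed is no longer available to it.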
Name 2 types of errors in trees.
- In the branching order (topology)
- Can be in the coarse branching order (close to the root):
- E.g., relative branching order between taxonomic groups
- or between clades representing different genes in multi-gene tree including duplication events etc
- In the fine branching order (closer to leaves):
- Eg relative order within hominidae
- In the branch lengths
- In general, less of a problem for functional inferences in protein superfamily reconstruction
- Mainly looked at for dating MRCAs
- But for functional inference: when you have neofunctionalization, the new function is probably in the clade with the long branch and the old function in the clade with the short branch
Name 5 sources of errors in phylogenetic trees.
- Sparse taxon sampling
- Lineage-specific rate variation
- Site-specific rate variation
- Sequence fragments (or gene model errors)
- Insufficient site data (e.g., short MSA)
Why is sparse taxon sampling a problem?
- Sampling is sparse if there are large gaps in the evolutionary space (e.g., by restricting sequences based on some criteria, e.g., to fully sequenced genomes, or manually curated databases)
- Sparse taxon sampling has much more influence on your tree than which algorithm you use!
- So here: being really strict on sequence information quality can lead to sparse taxon sampling...
Why is lineage-specific rate variation a problem?
- Historically refers to species or clades that are evolving more rapidly than others
- E.g., rat evolves faster than mouse, within the rodent group
- In protein superfamily reconstruction, refers to subfamilies (a group of orthologs) that are evolving rapidly (perhaps due to neo-functionalization)
- So imagine after a duplication event, the duplicate that conserves function evolves much slower
Why is site-specific rate variation a problem?
- A site is a position (column) in the MSA
- Less common in single gene trees (orthologs in different species) than among protein superfamilies
- Very common in protein superfamilies due to diversification of function following gene duplication
- Again you are dealing with different rates of evolution within a tree
- This is one of the hardest things to account for in a simulation study for evaluating tree construction methods
Why are sequence fragments/gene model errors a problem?
- Very common in protein sequence databases (especially in eukaryotic genomes)
- Eg you have long intron, short exon, not recognized in physical annotation
- Global alignment leads to misalignment
- This is why many people believe in domain based methods!
Why is insufficient site data a problem?
- Very common for trees based on single domains (esp. if <100aa) or in highly divergent MSAs following stringent masking protocols
- You need at least 100 columns
- strict masking --> loss of columns, eg with many gaps
- Gene matrix approaches (see Delsuc et al) address this problem for species phylogeny estimation
- Few informative sites (e.g., using protein sequences instead of DNA for closely related taxa)
- Imagine: you are left with very few columns with very high conservation, which is not very informative
- back at sparse sampling
What is a clade?
What is a bifurcating/binary tree?
Trees for which internal nodes can have >2 children are called “multifurcating trees”
What is the diameter of a tree?
- The diameter of a tree is equal to the longest path between two leaves (including edge lengths, not simply number of edges)
- Gives an idea of evolutionary distance described by the tree
- Diameter of the tree is also examined in method studies
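The diameter definition above can be computed with two traversals: from any node find the farthest node, then from that node find the farthest node again. This is a sketch assuming the tree is given as a list of (node, node, branch length) edges with non-negative lengths; the edge-list format and node names are illustrative only.

```python
def farthest_node(adjacency, start):
    """Return (node, distance) for the node farthest from start, summing edge lengths."""
    best_node, best_dist = start, 0.0
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if dist > best_dist:
            best_node, best_dist = node, dist
        for neighbor, length in adjacency[node]:
            if neighbor != parent:  # a tree has no cycles, so skipping the parent is enough
                stack.append((neighbor, node, dist + length))
    return best_node, best_dist

def tree_diameter(edges):
    """Longest leaf-to-leaf path length, plus the two end points of that path."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    any_node = next(iter(adjacency))
    end_a, _ = farthest_node(adjacency, any_node)
    end_b, diameter = farthest_node(adjacency, end_a)
    return diameter, (end_a, end_b)

# Hypothetical tree: ((A:0.2, B:0.3):0.1, C:0.5)
edges = [("root", "n1", 0.1), ("n1", "A", 0.2), ("n1", "B", 0.3), ("root", "C", 0.5)]
diameter, ends = tree_diameter(edges)
print(round(diameter, 3), ends)
```

Note that counting edges instead of summing lengths would give the same answer for A-C and B-C here (3 edges each), which is why the definition insists on edge lengths.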
How are phylogenetic reconstruction algorithms usually validated?
- Given a “True” tree, generate data (multiple sequence alignments)
- Limitation: the generated MSAs contain no gaps!
- Compare estimated tree to true tree
How are false positive and false negatives determined in validation of trees?
Finding false positives: do the same but reversed! Remove from inferred tree.
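One way to make the comparison between a true and an estimated tree concrete is to compare the groupings each tree implies. The sketch below assumes rooted trees written as nested tuples and compares their clades (leaf sets under internal nodes): a clade in the true tree that is missing from the estimate counts as a false negative, and the reverse as a false positive. The notes do not spell out the exact procedure, so treat this as one plausible reading rather than the method from the lecture.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):  # a leaf is just its name
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the clade containing every leaf is trivial
    return found

def compare_trees(true_tree, estimated_tree):
    """Return (false_positives, false_negatives) as sets of clades."""
    true_clades = clades(true_tree)
    est_clades = clades(estimated_tree)
    false_negatives = true_clades - est_clades  # in true tree, missed by the estimate
    false_positives = est_clades - true_clades  # in the estimate, absent from the true tree
    return false_positives, false_negatives

# Hypothetical example: the estimate wrongly groups A with C
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
fp, fn = compare_trees(true_tree, estimated_tree)
print("false positives:", [sorted(c) for c in fp])
print("false negatives:", [sorted(c) for c in fn])
```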
What are the 5 basic steps of constructing a phylogenetic tree?
- Gather homologs
- Construct a multiple sequence alignment
- Examine/edit alignment
- maybe remove a row: one sequence of mostly gaps
- or crop the alignment at N and C terminus, crop the gappy uneven stuff
- Alignment masking
- removing columns with too many gaps or too much variability in aa type (eg standard: remove neg blosum scores)
- Construct phylogenetic tree
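The masking step in the list above can be sketched as a column filter. This example only covers the "too many gaps" criterion; the 50% threshold is an arbitrary assumption, not a recommended value, and the variability criterion (e.g., negative BLOSUM scores) is left out.

```python
def mask_gappy_columns(aligned_seqs, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    kept_columns = [
        column
        for column in zip(*aligned_seqs)
        if column.count(gap) / len(column) <= max_gap_fraction
    ]
    if not kept_columns:
        return ["" for _ in aligned_seqs]
    # Transpose the kept columns back into rows (one string per sequence).
    return ["".join(row) for row in zip(*kept_columns)]

# Hypothetical alignment: the third column is 75% gaps and gets removed
alignment = ["MK-AY", "M--AY", "MKTA-", "M--AF"]
masked = mask_gappy_columns(alignment)
print(masked)
```

A stricter threshold removes more columns, which ties back to the "insufficient site data" problem: aggressive masking can leave too few columns to build a reliable tree.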
Why do gene trees not always correspond with species trees? Why does it matter?
- Common answer = Incomplete lineage sorting: not enough time for evolution to genetically distinguish between groups for certain genes
- We actually don't know why it happens.
- It matters because many algorithms for finding orthologs assume that the gene tree will follow the species tree.
- Climb up the species tree and stop when something doesn't match anymore.
- you stop too soon and don't get all orthologs in the tree
Name the 3 main ortholog prediction methods.
- Phylogenomic approaches
- Reconcile a protein superfamily phylogeny with a trusted reference species phylogeny
- Advantages: high precision
- Disadvantages: computationally intensive, requires handling fuzzy species phylogenies and noisy tree topologies
- Graph-based approaches
- Find clusters in graphs where edges are proportional to sequence distance
- Advantages: fast, scalable
- Disadvantages: less accurate
- Reciprocal best BLAST
- Find pairs of genes such that each is the other’s top hit in the other genome.
- Advantages: fairly accurate.
- Disadvantages: requires whole genomes (with no missed genes)
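The reciprocal best hit rule is easy to express once the search results are in hand. The sketch below assumes the hits have already been parsed into (query, subject, score) tuples for each search direction; it does not run BLAST, and the gene names and scores are invented.

```python
def best_hits(hits):
    """Keep the top-scoring subject for each query from (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (gene_a, gene_b) pairs where each gene is the other's top hit."""
    best_a = best_hits(hits_a_vs_b)
    best_b = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_a.items() if best_b.get(b) == a)

# Hypothetical parsed results for genome A searched against B, and B against A
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b2", 180.0), ("a3", "b2", 200.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a3", 205.0), ("b2", "a2", 175.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy data a2 has b2 as its best hit, but b2 prefers a3, so a2 gets no ortholog call. A gene missing from one genome has the same effect, which is why the method needs complete genomes.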
What are the 3 main tree reconstruction method types?
- Distance based methods
- These methods first convert the character matrix into a distance matrix that represents the evolutionary distances between all pairs of species. The phylogenetic tree is then inferred from this distance matrix using algorithms such as neighbour joining (NJ) or minimum evolution (ME).
- Maximum parsimony
- This method selects the tree that requires the minimum number of character changes to explain the observed data.
- Likelihood methods
- These methods are based on a function that calculates the probability that a given tree could have produced the observed data (that is, the likelihood).
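As an illustration of the maximum parsimony criterion, the sketch below counts the minimum number of character changes a given tree needs, using Fitch's algorithm on a rooted binary tree written as nested tuples. It only scores candidate trees; searching over all possible trees is a separate (and much harder) step that is not shown. Tree shapes and sequences are made up.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one character on a rooted binary tree (Fitch)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):  # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # children disagree, so one change is needed here
        return left | right

    visit(tree)
    return changes

def parsimony_score(tree, msa):
    """Sum the Fitch score over all alignment columns."""
    names = list(msa)
    n_columns = len(msa[names[0]])
    return sum(
        fitch_score(tree, {name: msa[name][i] for name in names})
        for i in range(n_columns)
    )

# Hypothetical alignment and two candidate topologies
msa = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, msa), parsimony_score(tree_2, msa))
```

Maximum parsimony would prefer `tree_1` here because it explains the same data with fewer changes.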