Sequencing-BWA
15 important questions on Sequencing-BWA
Name some key points about next generation sequencing (NGS).
- Massively parallel sequencing of millions to billions of short fragments
- Very fast
- For comparison, Sanger sequencing (used in the HGP) handles at most 384 DNA samples in a single batch (run), with up to 24 runs a day
- Huge amounts of data generated in single sequencing experiment (many TBs)
- Much reduced cost (1 human genome: HGP ~$3 billion versus NGS ~$10,000)
- Shorter fragments (reads) than with Sanger sequencing
- Many different techniques exist, but they are based on approximately the same principle. The differences lie mainly in the chemistry used and the way fragments are attached to the surface
What is the main problem in sequencing (for bioinformaticians)?
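The genome comes out as many short reads whose order and position are unknown, so they have to be pieced back together. In addition, the reads may contain (experimental) errors.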
What are the two main methods for genome assembly?
- De novo assembly of a genome
- Assembly using alignment onto a reference genome
How does de novo sequencing work?
De novo assembly relates to a known problem in computer science, the Shortest Superstring Problem (SSP): find the shortest string (i.e. the genome) that contains all fragments, by stringing them together along their overlaps.
- However, the shortest possible string is not an ideal criterion, because genomes contain many repeated fragments and the shortest superstring tends to merge those repeats
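The overlap idea behind the SSP can be illustrated with a greedy heuristic: repeatedly merge the two fragments with the longest suffix-prefix overlap. This is only a minimal sketch under simplifying assumptions (error-free reads, a single strand); the function names `overlap` and `greedy_superstring` are made up for this example, and real assemblers do not work this simply.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Greedy approximation of the shortest superstring (not guaranteed optimal)."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        # All-against-all comparison: find the ordered pair with the largest overlap.
        for i, j in permutations(range(len(frags)), 2):
            n = overlap(frags[i], frags[j])
            if n > best_n:
                best_n, best_i, best_j = n, i, j
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ATGGC", "GGCAT", "CATTA"]
print(greedy_superstring(reads))  # ATGGCATTA
```

Because the heuristic always prefers the shortest result, two copies of a repeat that share the same reads would be merged into one, which is the weakness noted above.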
What are two main problems for de novo assembly?
- Multiple contigs (due to lack of overlapping reads)
- Caused by lack of coverage: due to the randomness of the shearing process, there is a chance that some regions of the genome remain unsequenced
- Repeats
What is a contig (de novo assembly)?
A contig is a contiguous stretch of sequence reconstructed from overlapping reads. Computer programs have to check the overlap of all fragments against all other fragments to find their most likely order, ideally resulting in a completely reconstructed DNA sequence.
Why can repeats cause major problems for the assembler?
- Reads corresponding to two separate repeats may be collapsed in a single contig
- Repeats with large intervening regions or multiple repeat regions (e.g. >600 repeats) can no longer be resolved by mate pairs
What do depth and coverage mean?
- Sequencing depth: (average) number of reads per base (often over the entire sequencing sample)
- Coverage: average number of reads per base (often over a specific region)
What should a de novo assembly algorithm do?
PROCESS – Cut reads into k-mers and determine overlap through string matching (allowing for small variations) – No reference needed
OUTPUT – Alignment and sequence of new strain
How are de Bruijn graphs used for genome assembly?
- Nodes are k-mers
- Edges are (k+1)-mers, each connecting two nodes
- Two connected nodes must overlap by k-1 symbols
- Edges are directed such that the suffix of the node an edge leaves overlaps with the prefix of the node it enters
- Sequencing errors create tips and bubbles in the graph, which are removed before the sequence is read off
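A minimal sketch of this construction, using the same convention (k-mer nodes, (k+1)-mer edges), could look as follows. It assumes error-free reads and a graph without branches, so there is no tip or bubble removal; `de_bruijn_graph` and `walk_unbranched` are names invented for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each k-mer node to the k-mer nodes it points to."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(read[i:i + k + 1])  # every (k+1)-mer is one edge
    graph = defaultdict(list)
    for edge in sorted(edges):
        # Prefix and suffix k-mers of the edge overlap by k-1 symbols.
        graph[edge[:k]].append(edge[1:])
    return dict(graph)


def walk_unbranched(graph: dict[str, list[str]]) -> str:
    """Read off the sequence of a graph that forms one simple path."""
    targets = {t for ts in graph.values() for t in ts}
    starts = [n for n in graph if n not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node, seen = starts[0], {starts[0]}
    sequence = node
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in seen:  # stop instead of looping around a repeat
            break
        seen.add(node)
        sequence += node[-1]
    return sequence


graph = de_bruijn_graph(["ATGGC", "GGCAT", "CATTA"], k=3)
print(graph["ATG"])            # ['TGG']
print(walk_unbranched(graph))  # ATGGCATTA
```

With real data a node often has several outgoing edges (repeats, errors), which is exactly where the walk above stops and where graph clean-up becomes necessary.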
What should an algorithm do for alignment if a reference is available?
PROCESS – String matching of the sequence reads against the reference genome sequence (allowing for small variations) – Reference and sequenced organism need to be closely related (at least the same species)
OUTPUT – Alignment and sequence of newly sequenced genome
With a ref genome: How to search through millions of reads?
- Input: reference genome and a set of reads (typically a FASTQ file; the resulting alignments are stored as SAM/BAM)
- Need a fast way to look up where reads will potentially align well
- Most current methods use the "Burrows-Wheeler transform"
- Also used in data compression (like “zipping”)
- Here it helps aligning the reads against the genome in a fast way
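To give an idea of why the transform speeds up the lookup, the sketch below builds the Burrows-Wheeler transform of a tiny reference and counts exact occurrences of a read by backward search. It is a teaching sketch only: the functions `bwt` and `count_occurrences` are hypothetical names, the rotation sort and the `str.count` calls are far too slow for a real genome, and production aligners use precomputed index tables instead.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count exact matches of `pattern` using backward search on the BWT."""
    counts = Counter(bwt_text)
    first = {}  # first[c]: number of symbols in the text smaller than c
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_text)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + bwt_text[:lo].count(ch)
        hi = first[ch] + bwt_text[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


reference = "ACAACG"
transformed = bwt(reference)
print(transformed)                           # GC$AAAC
print(count_occurrences(transformed, "AC"))  # 2
print(count_occurrences(transformed, "GG"))  # 0
```

The search takes one step per symbol of the read, independent of how long the reference is, which is what makes the approach attractive for millions of reads.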
What is the problem of mismatches (ref alignment)?
- The reference genome will contain mismatches relative to the reads that should be aligned against it.
- There are various strategies to deal with matching fragments containing mismatches.
- However, compared to exact string matching, looking for alignments where symbols may differ will increase processing times.
--> BWA
What is the Burrows-Wheeler Aligner (BWA)?
- Fast aligner
- Based on an indexing technique (the Burrows-Wheeler transform), closely related to the suffix tree formalism
- Can handle inexact matches with a defined maximum number of differences (mismatches or gaps)
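What "a defined maximum number of differences" means can be shown with a deliberately naive scan that tolerates up to `max_mismatches` substitutions. This is not how BWA searches (it works on the index rather than scanning the reference), it ignores gaps, and `find_with_mismatches` is a name invented for this illustration.

```python
def find_with_mismatches(
    reference: str, read: str, max_mismatches: int
) -> list[tuple[int, int]]:
    """Return (position, mismatch count) for every acceptable placement."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = 0
        for ref_base, read_base in zip(reference[start:start + len(read)], read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this position
        else:
            hits.append((start, mismatches))
    return hits


reference = "ACAACGTACG"
print(find_with_mismatches(reference, "ACG", 0))  # [(3, 0), (7, 0)]
print(find_with_mismatches(reference, "ACG", 1))  # [(0, 1), (3, 0), (7, 0)]
```

Allowing one mismatch already adds a candidate position, which hints at why inexact matching costs more time than exact matching and why the number of allowed differences is capped.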
What are 4 major problems with NGS?
- Huge amounts of data to process
- High error rate
- Lack of coverage
- Repeat sequences - this is the largest general problem for (de novo) assembly