Sequencing-BWA

15 important questions on Sequencing-BWA

Name some key points about next generation sequencing (NGS).

  • Massively parallel sequencing of millions to billions of short fragments
  • Very fast
    • E.g. compared with Sanger sequencing, which was used in the Human Genome Project (HGP): at most 384 DNA samples in a single batch (run), with up to 24 runs a day
  • Huge amounts of data generated in single sequencing experiment (many TBs)
  • Much reduced cost (1 human genome: HGP 3 billion $ versus NGS ~10,000 $)
  • Shorter fragments (reads) than with Sanger sequencing
  • Many different techniques exist, but they are based on approximately the same principle. They differ mainly in the chemistry used and in the way fragments are attached to the surface

What is the main problem in sequencing (for bioinformaticians)?

The assembly! Reconstructing a DNA sequence from many randomly selected short fragments (reads).

And the reads may contain (experimental) errors.

What are the two main methods for genome assembly?

  • De novo assembly of a genome
  • Assembly using alignment onto a reference genome

How does de novo sequencing work?

Reconstructing a complete genome de novo requires testing possible overlaps between all reads and then building the whole genome together according to some criterion:

A known problem in computer science is the Shortest Superstring Problem (SSP), in which all fragments are merged into the shortest overall string that contains each of them (i.e. the genome).
  • However, the shortest possible string is not an ideal criterion, because genomes contain many repeated fragments and the shortest string tends to collapse them into a single copy
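
The overlap-and-merge idea behind the SSP can be illustrated with a small greedy sketch. This is an illustration only, not how production assemblers work: the function names `overlap` and `greedy_superstring` are made up for this example, the reads are assumed to be error-free, and the greedy strategy is a heuristic that does not guarantee the truly shortest superstring.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    if not reads:
        return ""
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        # All-against-all comparison: this is the expensive step.
        for a, b in permutations(reads, 2):
            n = overlap(a, b)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            # No overlaps left: the remaining pieces are separate contigs.
            return "".join(reads)
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads[0]


reads = ["ATTAGACCTG", "GACCTGCCGG", "TGCCGGAA"]
print(greedy_superstring(reads))  # ATTAGACCTGCCGGAA
```

Every merge step compares all remaining reads against each other, which is why this naive approach does not scale to millions of reads.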

What are two main problems for de novo assembly?

  • Multiple contigs (due to lack of overlapping reads)
    • caused by lack of coverage: due to randomness of shearing process there is a chance that some regions of the genome are unsequenced
  • Repeats

What is a contig (de novo assembly)?

A continuous set of overlapping sequences.

Computer programs have to check the overlap of all against all fragments to find the most likely order of the fragments, resulting in a completely reconstructed DNA sequence.

Why can repeats cause major problems for the assembler?

  • Reads corresponding to two separate repeats may be collapsed in a single contig
  • Repeats with large intervening regions or multiple repeat regions (e.g. >600 repeats) cannot be resolved by mate pairs anymore

What do depth and coverage mean?

  • sequencing depth: (average) number of reads per base (often on entire sequencing sample)
  • coverage: average number of reads per base (often on specific region)

What should a de novo assembly algorithm do?

INPUT – Millions of sequenced fragments

PROCESS – Cut reads in k-mers and determine overlap through string-matching (allow for small variations) – No reference needed

OUTPUT – Alignment and sequence of new strain

How are de Bruijn graphs used for genome assembly?

  • Nodes are k-mers
  • Edges are (k+1)-mers, each connecting two nodes
  • Two connected nodes need to have an overlap of k-1 symbols
  • Edges are directed: the last k-1 symbols (suffix) of the node an edge leaves equal the first k-1 symbols (prefix) of the node it enters


Sequencing errors create tips and bubbles in the graph --> these are removed
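
A minimal sketch of the graph construction described above, following the same convention (nodes are k-mers, edges are (k+1)-mers). The function names are invented for this example, the reads are assumed to be error-free, and no tip or bubble removal is attempted; the walk only follows a path for as long as it is unambiguous.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer node to the set of k-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]          # a (k+1)-mer
            graph[edge[:-1]].add(edge[1:])    # prefix k-mer -> suffix k-mer
    return dict(graph)


def walk_unbranched(graph: dict[str, set[str]], start: str) -> str:
    """Spell out a contig by following edges while there is exactly one choice."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:        # stop instead of looping around a cycle
            break
        contig += nxt[-1]      # each step adds one new symbol
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn(reads, k=3)
print(walk_unbranched(graph, "ATG"))  # ATGGCGTGCAAT
```

Note that no read is compared against any other read: overlaps fall out of shared k-mers, which is what makes this approach attractive for large read sets. A repeat longer than k would make a node branch, and the walk would stop there.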

What should an algorithm do for alignment if a reference is available?

INPUT – Millions of sequenced fragments – A reference genome sequence

PROCESS – String matching of the sequence reads against the reference genome sequence (allowing for small variations) – Reference and sequenced organism need to be closely related (at least the same species)

OUTPUT – Alignment and sequence of newly sequenced genome

With a ref genome: How to search through millions of reads?

  • Input: reference genome and a set of reads (typically a FASTQ file; the resulting alignments are stored in a SAM/BAM file)
  • Need a fast way to look up reads that will potentially align well
  • Most current methods use the “Burrows-Wheeler transform”
    • Also used in data compression (like “zipping”)
    • Here it helps aligning the reads against the genome in a fast way
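
To make the transform concrete, here is a naive sketch that builds it by sorting all rotations of the text. It is meant for short toy strings only; the function names and the `$` end marker are choices made for this example, and real aligners build the transform far more efficiently.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    text += terminator  # unique end marker that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

The transform is reversible and tends to group identical symbols together, which is why it is useful for compression; the same sorted structure is what allows fast lookups of a read in the reference.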

What is the problem of mismatches (ref alignment)?

  • The reference genome will contain mismatches relative to the reads that should be aligned against it.
  • There are various strategies to deal with matching fragments containing mismatches.
  • However, compared to exact string matching, looking for alignments where symbols may differ will increase processing times.
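
The cost of tolerating mismatches can be seen in a brute-force sketch that slides the read along the reference and counts differing symbols. This is only one simple strategy (substitutions only, no gaps), and the function name `hits_with_mismatches` is invented for this example.

```python
def hits_with_mismatches(reference: str, read: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (position, mismatch count) for every placement of the read
    on the reference with at most `max_mismatches` differing symbols."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = 0
        for ref_base, read_base in zip(reference[pos:pos + len(read)], read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this position
        else:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGC"
print(hits_with_mismatches(reference, "ACGA", 0))  # []
print(hits_with_mismatches(reference, "ACGA", 1))  # [(0, 1), (4, 1)]
```

With zero mismatches allowed, the read is not found at all; allowing one mismatch finds two candidate positions, at the price of examining every position in the reference symbol by symbol.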

--> BWA

What is the Burrows-Wheeler Aligner (BWA)?

  • Fast aligner
    • Based on an index of the reference genome built with the Burrows-Wheeler transform (closely related to suffix arrays and suffix trees)
  • Can handle inexact matches with a defined maximum number of differences (mismatches or gaps)
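
The indexing idea can be sketched with a toy index over the Burrows-Wheeler transform and a backward search for exact matches. This is not BWA's actual code: the function names and data layout are assumptions for this example, the index is built naively, and mismatches and gaps are not handled here.

```python
from collections import Counter


def build_index(reference: str):
    """Build a toy BWT-based index: suffix array, first-column offsets, counts."""
    text = reference + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of symbols in the text that sort before c
    counts, first, total = Counter(text), {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, symbol in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (symbol == c)
    return suffix_array, first, occ


def find_exact(read: str, index) -> list[int]:
    """Backward search: process the read from its last symbol to its first."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # current range of matching suffixes
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_index("ACGTACGTTAGC")
print(find_exact("ACGT", index))  # [0, 4]
print(find_exact("TAG", index))   # [8]
print(find_exact("GGG", index))   # []
```

Once the index exists, the search takes one step per symbol of the read, regardless of how long the reference is. That is the property that makes this kind of index suitable for aligning millions of reads.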

What are 4 major problems with NGS?

  • Huge amounts of data to process
  • High error rate
  • Lack of coverage
  • Repeat sequences - this is the largest general problem for (de novo) assembly
