
Sequencing-BWA

15 important questions on Sequencing-BWA

Name some key points about next generation sequencing (NGS).

Massively parallel sequencing of millions to billions of short fragments
Very fast

E.g. compared to Sanger sequencing (used in the Human Genome Project, HGP): at most 384 DNA samples in a single batch (run), with up to 24 runs a day

Huge amounts of data generated in a single sequencing experiment (many TBs)
Much reduced cost (1 human genome: HGP $3 billion versus NGS ~$10,000)
Shorter fragments (reads) than with Sanger sequencing
Many different techniques exist, but they are based on approximately the same principle. They differ mainly in the chemistry used and in the way fragments are attached to the surface

What is the main problem in sequencing (for bioinformaticians)?

The assembly! Reconstructing a DNA sequence from many randomly selected short fragments (reads).

And the reads may contain (experimental) errors.

What are the two main methods for genome assembly?

De novo assembly of a genome
Assembly using alignment onto a reference genome

How does de novo assembly work?

Reconstructing a complete genome de novo requires testing possible overlaps between all reads and then building the whole genome together according to some criterion:

A known problem in computer science is the Shortest Superstring Problem (SSP), where all fragments are strung together to produce the shortest overall string (i.e. the genome).

However, the shortest possible string is not an ideal criterion, because genomes contain many repeated fragments and the shortest superstring tends to merge them.
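
As a rough illustration of the SSP idea, the sketch below uses a simple greedy heuristic: it repeatedly merges the two fragments with the largest suffix-prefix overlap. This is not an exact SSP solver and not how any particular assembler works; the function names `overlap` and `greedy_superstring` are made up for this example, and reads are assumed to be error-free.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy heuristic: keep merging the pair with the largest overlap."""
    # Drop duplicates and reads that are fully contained in another read.
    frags = [r for r in set(reads) if not any(r != s and r in s for s in reads)]
    frags.sort()  # fixed order so that ties are broken reproducibly
    if not frags:
        return ""
    while len(frags) > 1:
        best_len, best_i, best_j = -1, 0, 1
        # All-against-all comparison of the remaining fragments.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                n = overlap(a, b)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


print(greedy_superstring(["ATGCG", "GCGTA", "GTACC"]))  # ATGCGTACC
```

Because the criterion rewards short output, reads that come from different copies of a repeat overlap perfectly and get merged, which is why the result can be shorter than the true genome.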

What are two main problems for de novo assembly?

Multiple contigs (due to lack of overlapping reads)

Caused by lack of coverage: because the shearing process is random, there is a chance that some regions of the genome remain unsequenced

Repeats
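
The lack-of-coverage problem above can be made roughly quantitative. The sketch below assumes that reads land uniformly at random on the genome, so the number of reads covering a base is approximately Poisson distributed with mean c = (number of reads × read length) / genome length. This model is an assumption added here for illustration and is not taken from the notes; the function names are hypothetical.

```python
import math


def expected_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of reads covering a base: c = N * L / G."""
    return n_reads * read_len / genome_len


def fraction_uncovered(coverage: float) -> float:
    """Poisson approximation: probability that a base is covered by zero reads."""
    return math.exp(-coverage)


# Hypothetical experiment: 1 million reads of 100 bases on a 10 Mb genome.
c = expected_coverage(1_000_000, 100, 10_000_000)
print(c)                                   # 10.0
print(fraction_uncovered(c) * 10_000_000)  # expected number of unsequenced bases
```

Even at an average coverage of 10, the model still predicts a few hundred uncovered bases in this example, and every uncovered stretch splits the assembly into separate contigs.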

What is a contig (de novo assembly)?

A continuous set of overlapping sequences.

Computer programs have to check the overlap of all against all fragments to find the most likely order of the fragments, resulting in a completely reconstructed DNA sequence.

Why can repeats cause major problems for the assembler?

Reads corresponding to two separate repeats may be collapsed in a single contig
Repeats with large intervening regions or multiple repeat regions (e.g. >600 repeats) cannot be resolved by mate pairs anymore

What do depth and coverage mean?

sequencing depth: (average) number of reads per base (often on entire sequencing sample)
coverage: average number of reads per base (often on specific region)

What should a de novo assembly algorithm do?

INPUT – Millions of sequenced fragments

PROCESS – Cut reads into k-mers and determine overlap through string matching (allow for small variations) – No reference needed

OUTPUT – Alignment and sequence of new strain

How are de Bruijn graphs used for genome assembly?

Nodes are k-mers
Edges are (k+1)-mers, each connecting two nodes
Two connected nodes must overlap by k-1 symbols
Edges are directed: the suffix of the node an edge leaves overlaps with the prefix of the node it enters

Sequencing errors create tips and bubbles in the graph; these are removed.
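
A minimal sketch of the graph construction described above, following the convention in these notes (nodes are k-mers, every (k+1)-mer in a read is one directed edge). The function names are made up for this example, and the branch check is only a crude indicator of where an error may have introduced a tip or bubble; it does not remove anything.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer node to the set of k-mer nodes it has an edge to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]          # one (k+1)-mer
            graph[edge[:-1]].add(edge[1:])    # prefix k-mer -> suffix k-mer
    return dict(graph)


def branch_nodes(graph: dict[str, set[str]]) -> list[str]:
    """Nodes with more than one outgoing edge (possible tip or bubble start)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


clean = de_bruijn(["ACGTA", "CGTAC"], k=3)
print(clean)  # {'ACG': {'CGT'}, 'CGT': {'GTA'}, 'GTA': {'TAC'}}

# Adding a read whose last base is wrong (ACGTT instead of ACGTA) creates a branch at CGT.
noisy = de_bruijn(["ACGTA", "CGTAC", "ACGTT"], k=3)
print(branch_nodes(noisy))  # ['CGT']
```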

What should an algorithm do for alignment if a reference is available?

INPUT – Millions of sequenced fragments – A reference genome sequence

PROCESS – String matching of the sequence reads against the reference genome sequence (allowing for small variations) – Reference and sequenced organism need to be closely related (at least the same species)

OUTPUT – Alignment and sequence of newly sequenced genome

With a ref genome: How to search through millions of reads?

Input: reference genome and a set of reads (typically a FASTQ file; the resulting alignments are stored as SAM/BAM)
Need a fast way to look up reads that will potentially align well
Most current methods use the Burrows-Wheeler transform (BWT)

Also used in data compression (like “zipping”)
Here it helps to align the reads against the genome quickly
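
To make the transform itself concrete, here is a naive sketch that builds the Burrows-Wheeler transform by sorting all rotations of the text, plus the matching naive inverse. It assumes `$` does not occur in the input and is used as an end marker that sorts before all letters. Real tools build the transform far more efficiently; this version is only meant for short strings.

```python
def bwt(text: str) -> str:
    """Last column of the sorted rotations of text + '$'."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str) -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row that ends with the end marker is the original text.
    return next(row for row in table if row.endswith("$"))[:-1]


print(bwt("banana"))                # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform tends to group identical characters together, which is what compression exploits, and it is reversible, so no information about the genome is lost.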

What is the problem of mismatches (ref alignment)?

The reference genome will contain mismatches relative to the reads that should be aligned against it.
There are various strategies to deal with matching fragments containing mismatches.
However, compared to exact string matching, looking for alignments where symbols may differ will increase processing times.
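
A naive sketch of what "allowing symbols to differ" means in practice: slide the read along the reference and accept every position with at most a given number of substitutions. It handles mismatches only (no gaps), and the function name and parameters are hypothetical. The work grows with reference length × read length per read, which is why this brute-force approach does not scale to millions of reads.

```python
def find_with_mismatches(reference: str, read: str, max_mismatches: int) -> list[tuple[int, int]]:
    """Return (start, mismatches) for every window within the mismatch budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = 0
        for ref_base, read_base in zip(reference[start:start + len(read)], read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # Inner loop finished without break: window is within budget.
            hits.append((start, mismatches))
    return hits


reference = "ACGTACGTTAGC"
print(find_with_mismatches(reference, "ACGA", 0))  # []
print(find_with_mismatches(reference, "ACGA", 1))  # [(0, 1), (4, 1)]
```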

--> BWA

What is the Burrows-Wheeler Aligner (BWA)?

Fast aligner

Based on an indexing technique (the Burrows-Wheeler transform of the reference) that uses the suffix tree formalism

Can handle inexact matches with a defined maximum number of differences (mismatches or gaps)
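
The indexing idea can be sketched with a toy FM-index and backward search over the Burrows-Wheeler transform of the reference. This is a simplified illustration for exact matches only, not BWA's actual implementation: the handling of mismatches and gaps is left out, the index is built naively, and all names are made up for this example. The reference is assumed not to contain `$`.

```python
def build_fm_index(reference: str):
    """Return (suffix_array, first, occ) for reference + '$'."""
    text = reference + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of characters in text that sort before c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first, occ


def backward_search(read: str, index) -> list[int]:
    """Positions where read occurs exactly, found right-to-left."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # current range of matching suffixes
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("ACGTACGTTAGC")
print(backward_search("ACG", index))  # [0, 4]
print(backward_search("TAG", index))  # [8]
print(backward_search("GGG", index))  # []
```

Once the index is built, each lookup takes a number of steps proportional to the read length rather than the genome length, which is what makes this kind of index attractive for millions of reads.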

What are 4 major problems with NGS?

Huge amounts of data to process
High error rate
Lack of coverage
Repeat sequences - this is the largest general problem for (de novo) assembly

The questions on this page originate from the summary of the following study material:

Algorithms in Sequence Analysis
