a generalized form of sequence (better search capability) probabilities instead of a letter

Home / Summaries / Class notes - Algorithms in Sequence Analysis / blast-sequence-sequences

Homology searching 1 u

18 important questions on Homology searching 1 u

What does BLAST mean? Name some basic features.

Basic Local Alignment Search Tool

Heuristic!

A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.

Basic idea:

Discard putatively unrelated sequences fast
High scoring segments have well conserved (almost identical) part
As well conserved segments are identified, extend these to the real alignment

What does well-conserved mean in BLAST?

BLAST works with k-words (words of length k)

k is a parameter
different for DNA (>10) and proteins (2..4), default k values are 11 and 3, resp.

word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

What are the three basic steps of BLAST?

Preprocessing: the query sequence: extract all the k-words
Scan: for T-similar matches in database (fast step)
Extend: these to final alignments (slow step)

Explain the principle of step 2 of BLAST.

Step 2 = Rapidly scanning the database with DFA (Deterministic Finite-state Automaton)

search database for all occurrences of query words
can be a massive task
approach:

build a DFA (deterministic finite-state automaton) that recognizes all query words
run DB sequences through DFA
remember hits

What is a DFA, finate state machine?

abstract machine
constant amount of memory (states)
used in computation and languages
recognizes regular expressions
cp dmt*.pdf /home/john

BLAST:

Use all the T-similar k-words to build the Finite State Machine
Scan for exact matches

What was problematic with classic BLAST? What is different in the current version?

Gapples alignments only! Bit of a problem: only close evolution.

Current BLAST: two-hit method and extention using DP (GappedBLAST).

In the old-BLAST, how dit the extension step work?

Having the list of matches (hits) we extend alignment in both directions. …till the sum of scores drops below some level X from the best known.
(Do'nt immediately stop at -5, because next could be 10). Takes a lot of time --> heuristic method: find diagonal, extend.

extend hits in both directions without allowing gaps (if score ≥ 13)
terminate extension in one direction when score falls certain distance below best score for shorter extensions
return segment pairs scoring at least a threshold score S

Explain the extending step of current BLAST.

two-hit method
gapped BLAST
old extension step typically accounts for 90% of BLASTʼs execution time
key idea: do extension only when there are two non-overlapping hits on the same diagonal within distance A of each other

to maintain sensitivity, lower T parameter (T ≥ 11 instead of T ≥ 13)
more single hits found
but only small fraction have associated 2nd hit

Contrast classic BLAST and current BLAST.

Before:

relatively high T threshold for 3-letter word (hashed) lists • two-way hit extension (see earlier slides)

Current BLAST: • Lower T: many more hits (more 3-letter words accepted as match)

Relatively few hits (diagonal elements) will be on same matrix diagonal within a given distance A
Perform 2-way local Dynamic Programming (gapped BLAST) only on ‘two-hits’ (preceding bullet)

The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!

What are 2 more recent BLAST developments?

Indexing (hashing) of the database
PSI-BLAST

How does an indexed database work?

Preprocess the database

Hash the database with k-words
For each k-word store in which sequences it appears

So when you have a query seq, if there is an exact match, you know immediately the position in the database.

Indexing take lots of time but you do it once (well, you do it again when the database is updated...)

So the speed of the scanning step is improved.

What is the concept of PSI-BLAST?

Position-Specific Iterated BLAST
A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)
Database is searched against the profile (instead of a sequence) in subsequent rounds
3 - 10 iterations are recommended

What is a profile?

a generalized form of sequence (better search capability)
probabilities instead of a letter

What is a sequence profile matrix?

A sequence profile is a frequency-based scoring matrix that approximates the likelihood of occurrence of an amino acid.

Profiles may have an extra column (21st position) to describe position-specific gap penalties.

In BLAST a profile is called Position-Specific Scoring Matrix(PSSM) and has only 20 columns (4 for DNA sequences)

How is a profile constructed? (in PSI-BLAST?)

Take significant BLAST hits
Make an alignment
Assign weights to sequences
Construct profile

In PSI-BLAST, pseudocounts are used in profile construction to make them more reliable.

useful when multiple alignment contains only few sequences so that there is no statistical sample per column yet
drawback is pulling all frequencies to prior frequencies, which reduces the differences between sequences

Why are Low-complexity sequences filtered out?

Low-complexity (sub)sequences have a biased composition and contain less information than high-complexity sequences
Because of the low information content, they often lead to spurious hits without a biological basis (for example, you can’t tell whether a poly-A sequence [AAAAA...] is more similar to a globin, an immunoglobulin or a kinase sequence)
That is why BLAST filters low-complexity regions in the query sequence out

What do BLAST and FASTA have in common?

Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already .

How can the first step of FASTA be made faster.

Rapid identical word search: searching for k-tuples of a certain size within a specified bandwith along search matrix diagonals.
+ hashing

in fasta with hashing: Algorithmic speed therefore is linear with (DB) sequence length or O(n). Compare this to finding all word match positions without a hash list (complexity is O(m*n)).

The question on the page originate from the summary of the following study material:

Algorithms in Sequence Analysis

View summary

A unique study and practice tool
Never study anything twice again
Get the grades you hope for
100% sure, 100% understanding

Remember faster, study better. Scientifically proven.

Homology searching 1 u

18 important questions on Homology searching 1 u

What does BLAST mean? Name some basic features.

What does well-conserved mean in BLAST?

What are the three basic steps of BLAST?

Explain the principle of step 2 of BLAST.

What is a DFA, finate state machine?

What was problematic with classic BLAST? What is different in the current version?

In the old-BLAST, how dit the extension step work?

Explain the extending step of current BLAST.

Contrast classic BLAST and current BLAST.

What are 2 more recent BLAST developments?

How does an indexed database work?

What is the concept of PSI-BLAST?

What is a profile?

What is a sequence profile matrix?

How is a profile constructed? (in PSI-BLAST?)

Why are Low-complexity sequences filtered out?

What do BLAST and FASTA have in common?

How can the first step of FASTA be made faster.

Summaries related to Intro + Pairwise alignment

Class notes - Algorithms in Sequence Analysis

Syllabus Introduction to systems biology

Structural Bioinformatics

Class notes - Biosystems Data Analysis

Indian Economics

Global politics

Essentials of international relations

Behavioral genetics

Management and organisational behaviour

Follow Up Engels idioom 4/5 H

International Business

Marketing fundamentals