Homology searching 1 u

18 important questions on Homology searching 1 u

What does BLAST mean? Name some basic features.

Basic Local Alignment Search Tool
  • Heuristic!
    • A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. 
  • Basic idea:
    • Discard putatively unrelated sequences fast
    • High scoring segments have well conserved (almost identical) part
    • As well conserved segments are identified, extend these to the real alignment

What does well-conserved mean in BLAST?

  • BLAST works with k-words (words of length k)
    • k is a parameter
    • different for DNA (>10) and proteins (2..4), default k values are 11 and 3, resp.
  • word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

What are the three basic steps of BLAST?

  1. Preprocessing: the query sequence: extract all the k-words
  2. Scan: for T-similar matches in database (fast step)
  3. Extend: these to final alignments (slow step)
  • Higher grades + faster learning
  • Never study anything twice
  • 100% sure, 100% understanding
Discover Study Smart

Explain the principle of step 2 of BLAST.

Step 2 = Rapidly scanning the database with DFA (Deterministic Finite-state Automaton)
  • search database for all occurrences of query words
  • can be a massive task
  • approach:
    • build a DFA (deterministic finite-state automaton) that recognizes all query words
    • run DB sequences through DFA
    • remember hits 

What is a DFA, finate state machine?

  • abstract machine
  • constant amount of memory (states)
  • used in computation and languages
  • recognizes regular expressions
  • cp dmt*.pdf /home/john

BLAST:
  • Use all the T-similar k-words to build the Finite State Machine
  • Scan for exact matches

What was problematic with classic BLAST? What is different in the current version?

Gapples alignments only! Bit of a problem: only close evolution.

Current BLAST: two-hit method and extention using DP (GappedBLAST).

In the old-BLAST, how dit the extension step work?

Having the list of matches (hits) we extend alignment in both directions. …till the sum of scores drops below some level X from the best known.
(Do'nt immediately stop at -5, because next could be 10). Takes a lot of time --> heuristic method: find diagonal, extend.


  • extend hits in both directions without allowing gaps (if score ≥ 13)
  • terminate extension in one direction when score falls certain distance below best score for shorter extensions   
  • return segment pairs scoring at least a threshold score S

Explain the extending step of current BLAST.

  • two-hit method
  • gapped BLAST
  • old extension step typically accounts for 90% of BLASTʼs execution time
  • key idea: do extension only when there are two non-overlapping hits on the same diagonal within distance A of each other
    • to maintain sensitivity, lower T parameter (T ≥ 11 instead of T ≥ 13)
    • more single hits found
    • but only small fraction have associated 2nd hit

Contrast classic BLAST and current BLAST.

  • Before:
    • relatively high T threshold for 3-letter word (hashed) lists • two-way hit extension (see earlier slides)
  • Current BLAST: • Lower T: many more hits (more 3-letter words accepted as match)
    • Relatively few hits (diagonal elements) will be on same matrix diagonal within a given distance A
    • Perform 2-way local Dynamic Programming (gapped BLAST) only on ‘two-hits’ (preceding bullet)

The new way is a bit faster on average and gives better (gapped) alignments and better alignment scores!

What are 2 more recent BLAST developments?

  • Indexing (hashing) of the database
  • PSI-BLAST

How does an indexed database work?

Preprocess the database
  • Hash the database with k-words
  • For each k-word store in which sequences it appears


So when you have a query seq, if there is an exact match, you know immediately the position in the database.

Indexing take lots of time but you do it once (well, you do it again when the database is updated...)

So the speed of the scanning step is improved.

What is the concept of PSI-BLAST?

  • Position-Specific Iterated BLAST
  • A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence)
  • Database is searched against the profile (instead of a sequence) in subsequent rounds 
  • 3 - 10 iterations are recommended

What is a profile?

  • a generalized form of sequence (better search capability)
  • probabilities instead of a letter

What is a sequence profile matrix?

A  sequence profile is a frequency-based scoring matrix that approximates the likelihood of occurrence of an amino acid.

Profiles may have an extra column (21st position) to describe position-specific gap penalties. 

In BLAST a profile is called Position-Specific Scoring Matrix(PSSM) and has only 20 columns (4 for DNA sequences)

How is a profile constructed? (in PSI-BLAST?)

  • Take significant BLAST hits
  • Make an alignment
  • Assign weights to sequences
  • Construct profile


In PSI-BLAST, pseudocounts are used in profile construction to make them more reliable.
  • useful when multiple alignment contains only few sequences so that there is no statistical sample per column yet
  • drawback is pulling all frequencies to prior frequencies, which reduces the differences between sequences 

Why are Low-complexity sequences filtered out?

  • Low-complexity (sub)sequences have a biased composition and contain less information than high-complexity sequences
  • Because of the low information content, they often lead to spurious hits without a biological basis (for example, you can’t tell whether a poly-A sequence [AAAAA...] is more similar to a globin, an immunoglobulin or a kinase sequence)
  • That is why BLAST filters low-complexity regions in the query sequence out

What do BLAST and FASTA have in common?

Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already .

How can the first step of FASTA be made faster.

Rapid identical word search: searching for k-tuples of a certain size within a specified bandwith along search matrix diagonals.
+ hashing

in fasta with hashing: Algorithmic speed therefore is linear with (DB) sequence length or O(n). Compare this to finding all word match positions without a hash list (complexity is O(m*n)).

The question on the page originate from the summary of the following study material:

  • A unique study and practice tool
  • Never study anything twice again
  • Get the grades you hope for
  • 100% sure, 100% understanding
Remember faster, study better. Scientifically proven.
Trustpilot Logo