About mpiBLAST

From BCCD 3.0

Revision as of 18:58, 2 November 2010 by Fitz (Talk | contribs)

mpiBLAST is an open-source parallelization of BLAST, the sequence search tool from the National Center for Biotechnology Information (NCBI). The official website is located at http://www.mpiblast.org/.

NCBI BLAST

According to the official BLAST website, the Basic Local Alignment Search Tool (BLAST)

finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

That's a mouthful! "Local alignment" refers to the way BLAST compares sequences. When dealing with potentially large sequences, rather than looking for a match across the entire sequence, it looks for smaller regions that match. When it finds a match, it estimates the probability that the match would occur by chance. If the match is highly likely to have occurred by chance, it is probably not statistically significant. (For instance, a short sequence such as "CATA" occurs many times in almost any genome. Precisely because it occurs so often, it doesn't represent something exciting - there's probably no functional similarity among all the CATAs.) These matches can be searched for in either a string of nucleotides (which make up DNA) or a string of amino acids (which make up proteins). When running BLAST, you specify which type of sequence you have and choose the databases of known sequences you want to compare it against.
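
To make the "CATA" point concrete, here is a minimal Python sketch. It is not how BLAST computes significance (BLAST reports its own statistics, such as E-values); it only assumes a random sequence in which all four bases are equally likely. The function names are made up for this illustration.

```python
def count_occurrences(sequence, word):
    """Count occurrences of word in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)
    return count


def expected_chance_matches(sequence_length, word_length, alphabet_size=4):
    """Expected number of exact matches to one fixed word in a random sequence.

    Simplifying assumption: every letter is independent and equally likely,
    which real genomes do not satisfy.
    """
    positions = max(sequence_length - word_length + 1, 0)
    return positions / alphabet_size ** word_length


if __name__ == "__main__":
    print(count_occurrences("GGCATACATATT", "CATA"))
    # A 4-letter word in a million random bases: thousands of chance matches.
    print(expected_chance_matches(1_000_000, 4))
    # A 20-letter word in the same sequence: far less than one chance match.
    print(expected_chance_matches(1_000_000, 20))
```

Under these assumptions, a short word matches so often by chance that a hit carries little information, while a long exact match is very unlikely to be a coincidence.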

mpiBLAST

mpiBLAST is based on BLAST and takes it one step further by running it in parallel. It is built on the Message Passing Interface (MPI), a standard for communication between processes. mpiBLAST makes it possible to compare one sequence to a database using multiple processes (on the same computer or on different computers) all running at once - in other words, in parallel. If the database is large, as databases of entire genomes tend to be, this can speed things up quite a bit. mpiBLAST does this by first breaking the database into chunks and handing them out to different processes. Each process can then search its own chunk for matches to the query without needing to worry about what the others are doing. This kind of problem is called "embarrassingly parallel", because the work divides so cleanly that we can achieve nearly linear speedup. (We're not embarrassed that it's parallel. We're embarrassed that more problems of this kind aren't parallelized, because they're so easy and rewarding to do.)
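
The split, search, and merge pattern described above can be sketched in a few lines of Python. This is not mpiBLAST's code: threads from the standard library stand in for MPI processes, an exact substring test stands in for a real BLAST search, and the function names and record format are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_database(records, num_chunks):
    """Deal the records round-robin into num_chunks smaller lists."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def search_chunk(chunk, query):
    """Search one chunk independently; return names of matching records.

    An exact substring test is used here as a stand-in for alignment.
    """
    return [name for name, sequence in chunk if query in sequence]


def parallel_search(records, query, num_workers=4):
    """Search every chunk at once, then merge the partial results."""
    chunks = split_database(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(
            pool.map(search_chunk, chunks, [query] * len(chunks))
        )
    return sorted(name for partial in partial_results for name in partial)


if __name__ == "__main__":
    # Made-up records: (name, sequence) pairs.
    database = [
        ("seq1", "GGCATAGG"),
        ("seq2", "TTTTTTTT"),
        ("seq3", "ACGTCATA"),
        ("seq4", "ACGTACGT"),
    ]
    print(parallel_search(database, "CATA", num_workers=2))
```

No worker needs anything from another worker until the final merge, which is what makes the problem embarrassingly parallel. Python threads will not show the speedup that separate processes on separate machines can; the sketch only shows the structure.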

To get some idea of the size of these databases, the uncompressed nucleotide database for Drosophila melanogaster, the common fruit fly, is over 118 megabytes (MB), and D. melanogaster has only four pairs of chromosomes!

Uses of mpiBLAST/BLAST

There are many reasons to compare sequences, and most of them come down to learning the history of species, genes, and gene families. Comparing genes from two different species can help identify how closely the species are related and, based on the number of differences, approximately how long ago they diverged from a common ancestor. An identified gene in one species can also be used to find the corresponding gene in a related species.
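
As a rough illustration of "the number of differences", the sketch below computes the fraction of positions at which two sequences differ. It assumes the sequences are already aligned and of equal length; the function name and the example sequences are made up. Turning such a fraction into an estimate of divergence time needs additional modelling that is not shown here.

```python
def fraction_different(seq_a, seq_b):
    """Fraction of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


if __name__ == "__main__":
    # Made-up fragments of the "same" gene from two hypothetical species.
    species_one = "ACGTACGT"
    species_two = "ACGTACGA"
    print(fraction_different(species_one, species_two))
```

A smaller fraction suggests the two sequences are more closely related.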

Alternatively, we might be interested in which parts of a gene have changed across two or more related species. If only part of a gene changes and the rest remains the same, we can infer that the unchanging part is likely necessary for the protein to function. We can then talk about which parts of the gene can change and which must stay the same. This can teach us what a protein actually does for an organism - in other words, which part of the protein does the real work. Knowing which parts of a gene must stay the same can also help us understand how mutations relate to certain diseases. Of course, we can also learn about this by directly comparing a mutated gene with a normal one.
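
Finding the parts that "stay the same" can be sketched as follows, assuming we already have the same gene from several species aligned to equal length. The function name and the sequences are invented for illustration; real analyses work from alignments produced by dedicated tools.

```python
def conserved_positions(aligned_sequences):
    """Return the indexes at which every aligned sequence has the same letter."""
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(sequence) != length for sequence in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*...) walks the alignment one column at a time.
    return [
        index
        for index, column in enumerate(zip(*aligned_sequences))
        if len(set(column)) == 1
    ]


if __name__ == "__main__":
    # Made-up aligned fragments from three hypothetical species.
    fragments = ["ACGTTA", "ACCTTA", "ACGATA"]
    print(conserved_positions(fragments))
```

Positions that appear in the result are candidates for the parts of the gene that must stay the same; positions that do not appear have changed in at least one species.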

Comparing sequences can also tell us a lot about the history of gene families, which are groups of genes that are related to each other. Gene families often arise by gene duplication. If an organism has an extra copy of a gene and only needs one working copy, then one copy must stay the same while the other is "free to evolve" and take on a different purpose. Hemoglobin genes are a famous example of a gene family. A hemoglobin protein consists of four protein chains combined. These chains are encoded by different genes; the best known are the alpha and beta subunits that make up most adult hemoglobin, but more types exist. Since these genes are related, if we have the sequence of one of them, we can run a search to find the other related genes in the family. As with genes in different species, the number and location of changes among the different genes can tell us about their history and evolution.

If you're interested in why we might search nucleotide sequences versus amino acids, please see the discussion at Genes versus Proteins and Introduction to DNA.

Running mpiBLAST

The great news is that any of the types of investigation covered above can be done using just the BCCD! Check out Running mpiBLAST to get started.
