Running mpiBLAST

From BCCD 3.0

Revision as of 18:56, 2 November 2010 by Fitz (Talk | contribs)

mpiBLAST is a parallel, MPI-based implementation of NCBI BLAST for searching large databases of nucleotides or proteins. (For more information on mpiBLAST, see About mpiBLAST.) This page walks through using the BCCD to perform an mpiBLAST search. It assumes that you have already booted the BCCD on one or more machines and that you are running OpenMPI. (OpenMPI is the default environment on the BCCD. To double-check this, or to set it up if you've switched to MPICH, see Running Open MPI.)

Using mpiBLAST

This section assumes that you'll be running mpiBLAST under OpenMPI. Unless you've specifically configured your BCCD to run MPICH instead, you're already running OpenMPI. If none of this sounds familiar, you can assume you're OK.

mpiBLAST is used in much the same way as NCBI BLAST. It reads the same configuration variables as NCBI BLAST, which means that you will need a .ncbirc file in your home directory. This file tells mpiBLAST where to find its databases (the Shared variable) and its workspace (the Local variable). To set this up, log in as user bccd with the password you specified when booting up.

The .ncbirc file used for this walkthrough looks like this:

  [mpiBLAST]
  Shared=/home/bccd/blastdb
  Local=/home/bccd/blastdb

If you don't have such a file in your home directory (which you don't if you haven't made one yourself), copy the above into the file ~/.ncbirc using nedit, nano, vi, or another text editor of your choice.
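If you would rather create the file from the command line than in an editor, a here-document is one way to do it. This is a minimal sketch and not part of the original walkthrough: the rc variable is only a local name chosen for illustration, and the existence check is an added precaution so that an existing ~/.ncbirc is never overwritten.

```shell
# Path of the configuration file described above.
rc="$HOME/.ncbirc"

if [ -e "$rc" ]; then
    # Added precaution (an assumption, not from the walkthrough):
    # never clobber a file that is already there.
    echo "$rc already exists; leaving it untouched" >&2
else
    # Write the three lines shown above, verbatim.
    cat > "$rc" <<'EOF'
[mpiBLAST]
Shared=/home/bccd/blastdb
Local=/home/bccd/blastdb
EOF
fi

# Show what mpiBLAST will read.
cat "$rc"
```

If you later move your databases, edit the Shared and Local paths in this file to match.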

After setting up your .ncbirc file, there are four steps to running mpiBLAST: download a database, format it, create a test sequence file, and run the search. To get started, make the blastdb directory and navigate there:

mkdir ~/blastdb
cd ~/blastdb

Download a database from NIH (National Institutes of Health)

In order to search a database using mpiBLAST, you first need a database. For this example we'll be using the Drosophila melanogaster (fruit fly) nucleotide database. You can download other databases (see the bottom of the page for links to additional databases) using the wget command. To get the Drosophila melanogaster database, for example, you would do something like this:

 wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
 --17:00:38--  ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz
            => `drosoph.nt.gz'
 Resolving ftp.ncbi.nlm.nih.gov... 165.112.7.10
 Connecting to ftp.ncbi.nlm.nih.gov|165.112.7.10|:21... connected.
 Logging in as anonymous ... Logged in!
 ==> SYST ... done.    ==> PWD ... done.
 ==> TYPE I ... done.  ==> CWD /blast/db/FASTA ... done.
 ==> PASV ... done.    ==> RETR drosoph.nt.gz ... done.
 Length: 36,924,008 (35M) (unauthoritative)
 
 100%[====================================>] 36,924,008   326.82K/s    ETA 00:00
 
 17:02:28 (338.88 KB/s) - `drosoph.nt.gz' saved [36924008]

After downloading, be sure to decompress the file using gunzip <database name> (here, gunzip drosoph.nt.gz).
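As a quick sanity check after decompressing, you can count the sequences in the FASTA file, since each record starts with a '>' header line. The sketch below is an illustration only: count_fasta_records is a hypothetical helper name, and the file checks are there so the snippet does nothing if the download is missing. For the snapshot used on this page, the formatting log in the next step reports 1170 sequence entries; a newer download may differ.

```shell
# Hypothetical helper: count FASTA records in a file.
# Each record begins with a header line starting with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Decompress only if the compressed download is present.
if [ -f drosoph.nt.gz ]; then
    gunzip drosoph.nt.gz
fi

# Report the number of sequences if the database file exists.
if [ -f drosoph.nt ]; then
    count_fasta_records drosoph.nt
fi
```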

Format the database using mpiformatdb

Next, we split the database into chunks that can be accessed by different processors. This is done with mpiformatdb. The --nfrags option specifies the number of fragments that the database should be divided into; you'll want the same number of fragments as processors you'll use for running mpiBLAST. The -pF option marks the input as nucleotide rather than protein data. Run the command from the directory that holds the decompressed database (~/blastdb in this walkthrough). In this instance, we're splitting the database four ways.

bccd@node000:~$ mpiformatdb --nfrags=4 -i ./drosoph.nt -pF --quiet 
Reading input file
Done, read 1534943 lines
Reordering 1170 sequence entries
Breaking drosoph.nt (122 MB) into 4 fragments
Executing: formatdb -p F -i /tmp/reorderoUDWYw -N 4 -n /home/bccd/blastdb/drosoph.nt -o T 
Removed /tmp/reorderoUDWYw
Created 4 fragments.
bccd@node000:~$ ls blastdb
drosoph.nt  formatdb.log

If you're using a different database, be sure to specify its path rather than ./drosoph.nt. The output, the individual chunks of the database, is written to the shared folder specified in the .ncbirc file. (If you used the settings above, this is ~/blastdb; verify with ls ~/blastdb.)

Errors?

If you see a long list of the phrase [formatdb] FATAL ERROR: File write error, you've run out of RAM. Oops! See Customization Tips and Tricks: Supplementing RAM.

Create a test sequence file

Finally, we're ready to run mpiBLAST against a test sequence. You can create your own query file by pasting a sequence in:

bccd@node000:~/blastdb$ cat > blast.in
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

(Remember to press Ctrl-D on a new line to end the input from stdin.)
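Before running the search, it can be worth confirming that the paste was not truncated. The sample results further down report a 560-letter query, which matches eight lines of 70 bases. The sketch below counts the bases in a query file while ignoring any '>' header line; count_query_letters is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: count sequence letters in a FASTA-style query,
# skipping '>' header lines and ignoring whitespace and newlines.
count_query_letters() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Check the query file from this walkthrough, if it exists here.
if [ -f blast.in ]; then
    count_query_letters blast.in
fi
```

For the sequence shown above, a count other than 560 suggests that a line was lost or duplicated while pasting.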

Then, from the directory containing blast.in, run mpiblast as follows:

bccd@node000:~$ mpirun -np 4 -machinefile ~/machines mpiblast -d drosoph.nt -i blast.in -p blastn -o results.txt
bccd@node000:~$ ls
[other stuff..]  results.txt
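Because this page recommends using the same number of database fragments as processors, it can help to derive both commands from a single value. The following dry-run sketch only prints the two commands used in this walkthrough so you can review them before running anything; print_mpiblast_commands and its three parameters are hypothetical names, not part of mpiBLAST.

```shell
# Hypothetical helper: print (but do not run) the format and search
# commands from this walkthrough, using one fragment/process count.
#   $1 = number of fragments and MPI processes
#   $2 = database file name
#   $3 = query file name
print_mpiblast_commands() {
    nfrags=$1
    db=$2
    query=$3
    echo "mpiformatdb --nfrags=$nfrags -i ./$db -pF --quiet"
    echo "mpirun -np $nfrags -machinefile ~/machines mpiblast -d $db -i $query -p blastn -o results.txt"
}

# Reproduce the commands used on this page.
print_mpiblast_commands 4 drosoph.nt blast.in
```

Printing rather than executing keeps the sketch safe to try on any machine; copy the printed lines into your shell once they look right.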

The results file should look similar to this:

BLASTN 2.2.10 [Oct-19-2004]


Reference: Aaron E. Darling, Lucas Carey, and Wu-chun Feng,
"The design, implementation, and evaluation of mpiBLAST."
In Proceedings of ClusterWorld 2003, June 24-26 2003, San Jose, CA


Query= Test
         (560 letters)

Database: /home/bccd/blastdb/drosoph.nt 
           1170 sequences; 122,655,632 total letters



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold ...    36   0.86 
gb|AE002615.2|AE002615 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003441.1|AE003441 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003525.2|AE003525 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003587.2|AE003587 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003673.2|AE003673 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003648.1|AE003648 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003628.1|AE003628 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003431.2|AE003431 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003484.1|AE003484 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003495.2|AE003495 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE002665.2|AE002665 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003740.2|AE003740 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003723.3|AE003723 Drosophila melanogaster genomic scaffold ...    34   3.4  
gb|AE003447.2|AE003447 Drosophila melanogaster genomic scaffold ...    34   3.4  

>gb|AE003681.2|AE003681 Drosophila melanogaster genomic scaffold 142000013386035 section 6 of
              105, complete sequence
          Length = 329362

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
Query: 96     taaattaaaattttattg 113
              ||||||||||||||||||
Sbjct: 111644 taaattaaaattttattg 111627


>gb|AE002936.2|AE002936 Drosophila melanogaster genomic scaffold 142000013385220, complete
             sequence
          Length = 48123

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                               
Query: 97    aaattaaaattttattga 114
             ||||||||||||||||||
Sbjct: 40704 aaattaaaattttattga 40687


>gb|AE003698.2|AE003698 Drosophila melanogaster genomic scaffold 142000013386035 section 23 of
              105, complete sequence
          Length = 225827

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
Query: 107    tttattgacttaggtcac 124
              ||||||||||||||||||
Sbjct: 151021 tttattgacttaggtcac 151004


>gb|AE003493.2|AE003493 Drosophila melanogaster genomic scaffold 142000013386053 section 10 of
              30, complete sequence
          Length = 308092

 Score = 36.2 bits (18), Expect = 0.86
 Identities = 18/18 (100%)
 Strand = Plus / Minus

                                
<<snipped>>


  Database: /home/bccd/blastdb/drosoph.nt
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,663,804
  Number of sequences in database:  292
  
  Database: /home/bccd/blastdb/drosoph.nt.001
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,664,011
  Number of sequences in database:  293
  
  Database: /home/bccd/blastdb/drosoph.nt.002
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,664,004
  Number of sequences in database:  293
  
  Database: /home/bccd/blastdb/drosoph.nt.003
    Posted date:  Dec 6, 2006  5:13 PM
  Number of letters in database: 30,663,813
  Number of sequences in database:  292
  
Lambda     K      H
    1.37    0.711     1.31 

Gapped
Lambda     K      H
    1.37    0.711     1.31 


Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 35,658
Number of Sequences: 1170
Number of extensions: 35658
Number of successful extensions: 72
Number of sequences better than 10.0: 18
Number of HSP's better than 10.0 without gapping: 18
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 53
Number of HSP's gapped (non-prelim): 19
length of query: 1122
length of database: 122,655,632
effective HSP length: 18
effective length of query: 542
effective length of database: 122,634,572
effective search space: 66467938024
effective search space used: 66467938024
T: 0
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 17 (34.2 bits)
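The statistics in the sample output can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the listing above: the four per-fragment totals should add up to the whole database, and the reported search space is consistent with (query length minus HSP length) multiplied by (database length minus number of sequences times HSP length). The variable names are illustrative, and the formula is stated as an observation about this sample rather than as a full description of how BLAST computes its statistics.

```shell
# Per-fragment letter and sequence counts, copied from the sample output.
frag_letters="30663804 30664011 30664004 30663813"
frag_seqs="292 293 293 292"

# Sum the fragments; they should match the whole-database totals.
total=0
for n in $frag_letters; do total=$((total + n)); done
total_seqs=0
for n in $frag_seqs; do total_seqs=$((total_seqs + n)); done
echo "total letters:   $total"        # 122655632
echo "total sequences: $total_seqs"   # 1170

# Effective lengths and search space, from the reported values.
query_len=560     # "(560 letters)"
hsp_len=18        # "effective HSP length: 18"
eff_query=$((query_len - hsp_len))
eff_db=$((total - total_seqs * hsp_len))
search_space=$((eff_query * eff_db))
echo "effective length of query:    $eff_query"     # 542
echo "effective length of database: $eff_db"        # 122634572
echo "effective search space:       $search_space"  # 66467938024
```

If the fragment totals in your own results do not add up to the database totals, the database was probably reformatted between runs or some fragments are missing.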

FMI

For more information...
