Basic Protocol 1: Installing and Running BLAST Locally under Unix‐Like Operating Systems such as Linux
Necessary Resources
Hardware
The hardware requirements for running BLAST locally are modest: any Intel or equivalent (e.g., AMD)–based architecture is adequate for this protocol. However, a few hardware considerations need to be addressed:
- BLAST can be computationally intensive. More CPU power helps when searching large databases, using CPU‐intensive BLAST algorithms (e.g., TBLASTX), or searching with many query sequences.
- BLAST can be memory‐hungry. For ideal performance, the machine should have enough memory to load the entire indexed database comfortably into RAM.
- Databases can be large and require large amounts of disk space. Researchers who take on ambitious projects, such as downloading the entire GenBank database for local searching, should keep this in mind.
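As a rough illustration of the memory and disk considerations above, the following Python sketch adds up the on-disk size of a formatted database and compares it with a given amount of RAM. The function names, the `headroom` fraction, and the idea of matching every file named `<base>.<extension>` are assumptions made for this example; they are not part of the BLAST package.

```python
import glob
import os


def database_size_bytes(db_base):
    """Sum the sizes of all files named <db_base>.<extension>.

    Assumption: the formatted database files share one base name and
    differ only in their extension.
    """
    pattern = glob.escape(db_base) + ".*"
    return sum(os.path.getsize(path) for path in glob.glob(pattern))


def fits_in_ram(db_base, ram_bytes, headroom=0.5):
    """Return True if the database occupies at most `headroom` of RAM.

    `headroom` is an arbitrary safety margin chosen for this sketch.
    """
    return database_size_bytes(db_base) <= ram_bytes * headroom


def physical_ram_bytes():
    """Total physical memory, using sysconf names available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

On a Linux machine, `fits_in_ram("/data/blast/mydb", physical_ram_bytes())` would give a first indication of whether the hypothetical database `/data/blast/mydb` can be held in memory with room to spare.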
BLAST's hardware needs scale easily: add more disk space or RAM, or move to a multiprocessor architecture. In addition, when a large number of sequences are queried against a common database, the work is easily parallelized: copy the indexed database, the relevant query sequences, and the blastall application to each machine on which the analyses are to be performed, and run the searches at the same time.
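One way to prepare the query sequences for such a parallel run is to divide a multi-sequence FASTA file into one chunk per machine. The Python sketch below deals FASTA records round-robin into a chosen number of chunks; the function names and the round-robin strategy are illustrative assumptions, not something prescribed by this protocol.

```python
def read_fasta_records(lines):
    """Yield each FASTA record as a list of lines, header line first."""
    record = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield record


def split_queries(lines, n_chunks):
    """Deal FASTA records round-robin into n_chunks lists of records."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(read_fasta_records(lines)):
        chunks[i % n_chunks].append(record)
    return chunks


def write_chunk(records, path):
    """Write one chunk of records to a FASTA file."""
    with open(path, "w") as handle:
        for record in records:
            handle.writelines(record)
```

Each chunk can then be written with `write_chunk` and copied, together with the indexed database, to a separate machine. Round-robin dealing keeps the number of sequences per chunk balanced but does not account for sequence length, so run times may still differ between machines.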
Software
There are several programs in the stand‐alone BLAST package. The main ones needed to run BLAST locally are formatdb, which creates BLASTable databases, and blastall, which queries these databases using any of the BLAST algorithms (blastn, blastp, blastx, tblastn, and tblastx).
formatdb formats FASTA‐formatted databases for searching with BLAST. Details on formatdb can be found in the file README.formatdb distributed with the BLAST package. The options for formatdb are listed in Table 3.11.1.
blastall is the main BLAST application. It runs queries against the indexed databases created with formatdb. Details on blastall can be found in the file README.bls distributed with the BLAST package. Some of the most commonly used blastall options are listed in Table 3.11.2.
Table 3.1.1 Options for formatdb [a]

| Option | Explanation |
|---|---|
| -t | Title for database file [String] (optional) |
| -i | Input file(s) for formatting [File In] (this parameter must be set) |
| -l | Logfile name [File Out] (optional); default = formatdb.log |
| -p | Type of file [T/F] (optional): T = protein; F = nucleotide; default = T |
| -o | Parse options [T/F] (optional): T (true) = parse SeqID and create indexes; F (false) = do not parse SeqID, do not create indexes; default = F |
| -a | Input file is database in ASN.1 format (otherwise FASTA is expected) [T/F] (optional): T = true; F = false; default = F |
| -b | ASN.1 database in binary mode [T/F] (optional): T = binary; F = text mode; default = F |
| -e | Input is a Seq entry [T/F] (optional); default = F |
| -n | Base name for BLAST files [String] (optional) |
| -v | Number of sequence bases to be created in the volume [Integer] (optional); default = 0 |
| -s | Create indexes limited only to accessions: sparse [T/F] (optional); default = F |
| -V | Verbose: check for nonunique string IDs in the database [T/F] (optional); default = F |
| -A | Create ASN.1 structured deflines [T/F] (optional); default = F |

[a] For an example of using these options, see protocol 1, step .
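To show how the formatdb options fit together on a command line, here is a minimal Python sketch that assembles an argument list from a few of them (-i, -p, -t, -n, and -o). The function name and the file names in the usage note are hypothetical, and the sketch only builds the command; it does not run formatdb.

```python
import shlex


def formatdb_command(fasta_path, protein, title=None, base_name=None,
                     parse_seqids=False):
    """Build a formatdb argument list from a subset of its options."""
    # -p defaults to T (protein), so the sequence type is always passed
    # explicitly to avoid formatting nucleotide data as protein.
    args = ["formatdb", "-i", fasta_path, "-p", "T" if protein else "F"]
    if title is not None:
        args += ["-t", title]
    if base_name is not None:
        args += ["-n", base_name]
    if parse_seqids:
        # -o T parses SeqIDs and creates indexes.
        args += ["-o", "T"]
    return args


def as_shell_string(args):
    """Render an argument list as a shell-quoted command string."""
    return shlex.join(args)
```

For a hypothetical nucleotide file `ecoli.nt`, `as_shell_string(formatdb_command("ecoli.nt", protein=False, parse_seqids=True))` yields `formatdb -i ecoli.nt -p F -o T`. The argument list can also be handed to `subprocess.run(args, check=True)` on a machine where formatdb is installed.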
Table 3.1.2 Options for blastall [b]

| Option | Explanation |
|---|---|
| -p | Program name [String]. Input should be one of blastp, blastn, blastx, tblastn, or tblastx |
| -d | Database [String]; default = nr. The database specified must first be formatted with formatdb. An example would be -d nr est, which will search both the nr and est databases, presenting the results as if one “virtual” database consisting of all the entries from both were searched. The statistics are based on the “virtual” database of nr and est. |
| -i | Query file [File In]; default = stdin. The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched. |
| -e | Expectation value (E) [Real]; default = 10.0 |
| -o | BLAST report output file [File Out] (optional); default = stdout |
| -F | Filter query sequence (dust with BLASTN, seg with others) [String]; default = T. BLAST 2.0 and 2.1 use the dust low‐complexity filter for BLASTN and seg for the other programs. Both dust and seg are integral parts of the NCBI Toolkit and are accessed automatically. If one uses -F T then normal filtering by seg or dust (for BLASTN) occurs (likewise -F F means no filtering whatsoever). This option also takes a string as an argument. One may use such a string to change the specific parameters of seg or invoke other filters. |
| -S | Query strands to search against database (for BLAST[NX], and TBLASTX). 3 is both, 1 is top, 2 is bottom [Integer]; default = 3 |
| -T | Produce HTML output [T/F]; default = F |
| -l | Restrict search of database to list of GI's [String] (optional). This option specifies that only a subset of the database should be searched, determined by the list of GI's (i.e., NCBI identifiers) in a file. One can obtain a list of gi's for a given Entrez query from http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should be in the same directory as the database, or in the directory from which BLAST is called. |
| -U | Use lowercase filtering of FASTA sequence [T/F] (optional). This option specifies that any lower‐case letters in the input FASTA file should be masked |

[b] For an example of using these options, see protocol 1, step .
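The blastall options can be combined in the same way. The Python sketch below assembles a blastall argument list from -p, -d, -i, -o, -e, -F, and -T; the function and helper names are assumptions made for illustration, and the file names in the usage note are hypothetical. When several databases are searched together, their names are joined into a single space-separated argument for -d, which on a shell command line means quoting them as one string.

```python
import shlex

BLAST_PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def tf(flag):
    """Render a Python bool as the T/F value used by blastall options."""
    return "T" if flag else "F"


def blastall_command(program, databases, query, output=None,
                     evalue=None, filter_query=None, html=None):
    """Build a blastall argument list from a subset of its options."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    # Several databases go to -d as one space-separated argument.
    args = ["blastall", "-p", program, "-d", " ".join(databases),
            "-i", query]
    if output is not None:
        args += ["-o", output]
    if evalue is not None:
        args += ["-e", str(evalue)]
    if filter_query is not None:
        args += ["-F", tf(filter_query)]
    if html is not None:
        args += ["-T", tf(html)]
    return args


def as_shell_string(args):
    """Render an argument list as a shell-quoted command string."""
    return shlex.join(args)
```

For example, `as_shell_string(blastall_command("blastn", ["nr", "est"], "query.fasta", output="out.txt", evalue=1e-5, filter_query=False))` gives `blastall -p blastn -d 'nr est' -i query.fasta -o out.txt -e 1e-05 -F F`. Options left as `None` are omitted so that blastall applies its own defaults.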