Installing, Maintaining, and Using a Local Copy of BLAST for Intranet and Workstation Use

Internet, 2013-12-31

Abstract

The Basic Local Alignment Search Tool (BLAST) is one of the most widely used and most useful applications in sequence‐based bioinformatics analysis. Frequently, it is not practical or possible to use remote BLAST services through the Internet, either because of security or technical restrictions or because high‐throughput analysis requires more processing power than remote services provide. This unit describes the steps involved in obtaining and installing a copy of the BLAST software for use on a local intranet or stand‐alone workstation. Once installed, the BLAST package can be used to create BLAST‐searchable nucleotide and protein sequence databanks. Various popular hardware (PPC, Intel) and operating system (Mac OS X, FreeBSD, and Linux) options for running and maintaining the software are discussed. Finally, steps for indexing proprietary and third‐party (publicly available) sequence databanks for use with BLAST, and for managing these resources, are discussed.

Keywords: BLAST; sequence similarity searching; Unix‐like operating systems

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Strategic Planning
Basic Protocol 1: Installing and Running BLAST Locally under Unix‐Like Operating Systems such as Linux
Alternate Protocol 1: Installing and Running BLAST Locally under Microsoft Windows
Commentary
Literature Cited
Figures
Tables

Materials

Basic Protocol 1: Installing and Running BLAST Locally under Unix‐Like Operating Systems such as Linux

Necessary Resources

Hardware
- The hardware requirements for running BLAST locally are modest: any Intel or equivalent (e.g., AMD)–based architecture will be adequate for running this protocol. However, a few hardware considerations need to be addressed:
  - BLAST can be computationally intensive. More CPU power helps when searching large databases, using CPU‐intensive BLAST programs (e.g., TBLASTX), or searching with many query sequences.
  - BLAST can be memory‐hungry; for ideal performance, one should have enough memory to load the entire indexed database comfortably into RAM.
  - Databases can be large and require large amounts of disk space. Researchers who decide to take on ambitious projects such as downloading the entire GenBank database for local searching should keep this in mind.
- The nice thing about BLAST is that the hardware needs can be scaled easily by adding more disk space or RAM or moving to multiprocessor architectures. In addition, if using BLAST to query a large number of sequences against a common database, this process can very easily be parallelized by copying the indexed database, the relevant query sequences, and the blastall application to any number of machines on which the analyses are to be performed, and running them at the same time.
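
The parallelization described above depends on dividing the query sequences among machines. The following is a minimal sketch of one way to do that, assuming the queries are held in a single multi‐FASTA file; the function name split_fasta_records and the example sequences are hypothetical and are not part of the BLAST package.

```python
def split_fasta_records(fasta_text, n_chunks):
    """Distribute the records of a multi-FASTA text over n_chunks pieces.

    Records are dealt out round-robin, so each piece receives a similar
    number of query sequences. Empty pieces are dropped.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    # Group lines into records; a record starts at each '>' definition line.
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records and line.strip():
            records[-1].append(line)

    # Deal the records out to the chunks in turn.
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append("\n".join(record))

    return ["\n".join(chunk) + "\n" for chunk in chunks if chunk]


# Hypothetical three-sequence query file, split for two machines.
example = ">q1\nMKV\n>q2\nMLL\nAAA\n>q3\nMTT\n"
parts = split_fasta_records(example, 2)
for number, part in enumerate(parts, start=1):
    print("chunk", number, "holds", part.count(">"), "record(s)")
```

Each piece can then be written to its own file and copied, together with the indexed database, to the machine that will search it.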

Software

There are several programs in the stand‐alone BLAST package. The main ones needed to run BLAST locally are formatdb, which creates BLAST‐searchable databases, and blastall, which queries these databases using any of the BLAST programs (blastn, blastp, blastx, tblastn, and tblastx). formatdb formats FASTA‐formatted databases for searching with BLAST. Details on formatdb can be found in the file README.formatdb distributed with the BLAST package. The options for formatdb are listed in Table 3.11.1. blastall is the main BLAST application. It runs queries against the indexed databases created with formatdb. Details on blastall can be found in the file README.bls distributed with the BLAST package. Some of the most commonly used blastall options are listed in Table 3.11.2.
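
To make the relationship between the two programs concrete, the sketch below assembles a formatdb command and the matching blastall command as argument lists, without executing them. It uses only the ‐i, ‐p, ‐d, and ‐o options described in this unit. The helper name local_blast_commands and the file name mydb.fasta are assumptions made for illustration; it also assumes that, when no base name is given to formatdb, the database is addressed by the name of the input file.

```python
import shlex


def local_blast_commands(db_fasta, query_fasta, report, program="blastp"):
    """Return (formatdb argv, blastall argv) for one local search.

    blastp and blastx search protein databases; blastn, tblastn, and
    tblastx search nucleotide databases, so formatdb -p is set to match.
    """
    protein_db = program in ("blastp", "blastx")
    format_cmd = ["formatdb", "-i", db_fasta, "-p", "T" if protein_db else "F"]
    search_cmd = ["blastall", "-p", program, "-d", db_fasta,
                  "-i", query_fasta, "-o", report]
    return format_cmd, search_cmd


# Hypothetical databank name; the query and report names are examples only.
format_cmd, search_cmd = local_blast_commands(
    "mydb.fasta", "fungus.fasta", "fungus.blastp")
for cmd in (format_cmd, search_cmd):
    # Print each command as it could be typed at a shell prompt.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The database must be formatted before it is searched, so the first printed command would be run once per databank and the second once per query file.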

Table 3.11.1 Options for formatdb^a

Option	Explanation
-t	Title for database file [String] (optional)
-i	Input file(s) for formatting [File In] (this parameter must be set)
-l	Logfile name [File Out] (optional); default = formatdb.log
-p	Type of file [T/F] (optional): T = protein, F = nucleotide; default = T
-o	Parse options [T/F] (optional): T (true) = parse SeqID and create indexes, F (false) = do not parse SeqID and do not create indexes; default = F
-a	Input file is database in ASN.1 format (otherwise FASTA is expected) [T/F] (optional): T = true, F = false; default = F
-b	ASN.1 database in binary mode [T/F] (optional): T = binary, F = text mode; default = F
-e	Input is a Seq entry [T/F] (optional); default = F
-n	Base name for BLAST files [String] (optional)
-v	Number of sequence bases to be created in the volume [Integer] (optional); default = 0
-s	Create indexes limited only to accessions (sparse) [T/F] (optional); default = F
-V	Verbose: check for nonunique string IDs in the database [T/F] (optional); default = F
-A	Create ASN.1 structured deflines [T/F] (optional); default = F

^a For an example of using these options, see protocol 1, step .
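
The ‐V option above asks formatdb to check for nonunique string IDs. A similar check can be made on a FASTA file before formatting. The sketch below is an independent illustration, not formatdb's own logic: it assumes that a sequence's identifier is the first whitespace‐delimited token on its definition line, and the function name duplicate_fasta_ids is hypothetical.

```python
from collections import Counter


def duplicate_fasta_ids(fasta_text):
    """Return a sorted list of identifiers that occur more than once.

    Assumption: the identifier is the first whitespace-delimited token
    after '>' on each definition line.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    counts = Counter(ids)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical databank in which the identifier "seqA" is used twice.
sample = ">seqA first copy\nACGT\n>seqB\nGGCC\n>seqA second copy\nTTAA\n"
print(duplicate_fasta_ids(sample))
```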

Table 3.11.2 Options for blastall^b

Option	Explanation
-p	Program name [String]; must be one of blastp, blastn, blastx, tblastn, or tblastx
-d	Database [String]; default = nr. The database specified must first be formatted with formatdb. An example would be -d "nr est", which will search both the nr and est databases, presenting the results as if one “virtual” database consisting of all the entries from both were searched. The statistics are based on the “virtual” database of nr and est.
-i	Query file [File In]; default = stdin. The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.
-e	Expectation value (E) [Real]; default = 10.0
-o	BLAST report output file [File Out] (optional); default = stdout
-F	Filter query sequence (dust with BLASTN, seg with others) [String]; default = T. BLAST 2.0 and 2.1 use the dust low‐complexity filter for BLASTN and seg for the other programs. Both dust and seg are integral parts of the NCBI Toolkit and are accessed automatically. With -F T, normal filtering by seg or dust (for BLASTN) occurs; with -F F, no filtering is done. This option also takes a string as an argument, which can be used to change the specific parameters of seg or to invoke other filters.
-S	Query strands to search against database (for BLASTN, BLASTX, and TBLASTX) [Integer]: 3 = both, 1 = top, 2 = bottom; default = 3
-T	Produce HTML output [T/F]; default = F
-l	Restrict search of database to list of GIs [String] (optional). This option specifies that only a subset of the database should be searched, determined by the list of GIs (i.e., NCBI identifiers) in a file. One can obtain a list of GIs for a given Entrez query from http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should be in the same directory as the database, or in the directory from which BLAST is called.
-U	Use lowercase filtering of FASTA sequence [T/F] (optional). This option specifies that any lowercase letters in the input FASTA file should be masked.

^b For an example of using these options, see protocol 1, step .
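
As an illustration of how several of these options combine, the sketch below builds a blastall argument list that searches more than one database at once. It only assembles the command and does not run it. The function name blastall_argv and the database and file names are hypothetical; the point it shows is that the database names are joined with spaces and passed to ‐d as a single argument.

```python
BLAST_PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_argv(program, databases, query, evalue=10.0,
                  filter_query=True, html=False, output=None):
    """Build a blastall argument list from a few commonly used options."""
    if program not in BLAST_PROGRAMS:
        raise ValueError("unknown BLAST program: " + program)
    argv = [
        "blastall",
        "-p", program,
        "-d", " ".join(databases),   # several databases form one argument
        "-i", query,
        "-e", str(evalue),
        "-F", "T" if filter_query else "F",
        "-T", "T" if html else "F",
    ]
    if output is not None:
        argv += ["-o", output]       # otherwise the report goes to stdout
    return argv


# Hypothetical protein databases, both already formatted with formatdb.
argv = blastall_argv("blastp", ["mydb_a", "mydb_b"], "fungus.fasta",
                     evalue=0.001, filter_query=False,
                     output="fungus.blastp")
print(argv)
```

Passing the list directly to a process launcher avoids shell quoting problems; at a shell prompt, the two database names would need to be enclosed in quotation marks.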

Files
- Input data files must be in FASTA format (see Appendix 1B).

Figures

Figure 3.11.1 Verbose listing of all the files in the BLAST distribution file for Linux as they are “un‐tarred” (see , step ).

Figure 3.11.2 The file fungus.fasta, used as an example query file (see , step ).

Figure 3.11.3 The file fungus.blastp, an example of output from BLAST (see , steps and ).

Literature Cited

	Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402.
Internet Resources
	http://www.ibiostation.com
	Web site of iBiostation, from which the book iBiostation Linux: Bioinformatics for Linux (2003), by M. Hobbs, T. G. Littlejohn and K. Castle (BioLateral Pty. Ltd., Sydney, Au.; ISBN 0‐9750583‐0‐4), may be purchased.