Mathematically Complete Nucleotide and Protein Sequence Searching Using Ssearch
Abstract
This unit describes a protocol for predicting the structure of simple transmembrane α‐helical bundles. The protocol is based on a global molecular dynamics search (GMDS) of the configuration space of the helical bundle, which yields several candidate structures. The correct structure is selected from these candidates using information from silent amino acid substitutions, on the premise that the correct structure must, by definition, accept all of the silent substitutions. Thus, the correct structure is found by repeating the GMDS for several close homologs and selecting the structure that persists in all of the trials.
Table of Contents
- Guidelines for Understanding Results
- Commentary
- Literature Cited
- Figures
Materials
Basic Protocol 1:
Necessary Resources
Figures
- Figure 3.10.1 Example of an external matrix file acceptable to the Ssearch program.
- Figure 3.10.2 Histogram of scores: The leftmost numeric column is the score. The center numeric column is the number of times that score occurred within the data search. The rightmost numeric column indicates the number of times that the score was expected to occur in the data library search, after correcting for library sequence length.
- Figure 3.10.3 List of high‐scoring sequences: Information describing the library sequence is presented first, followed by the length of the sequence in parentheses. The next column contains the score of the comparison with the query sequence. The next column presents the score using an information‐content approach. The final column contains the statistical estimate of the likelihood that the match has arisen by chance.
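The three‐column histogram described for Figure 3.10.2 can be read programmatically. The following is a minimal sketch, not part of the unit itself, that pulls the score, observed count, and expected count out of each histogram row. The exact row layout is an assumption: an optional "<" or ">" marker, three whitespace‐separated integers, and an optional ":" followed by a bar of characters. The names `HistogramRow`, `parse_histogram_row`, and `excess_rows`, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class HistogramRow(NamedTuple):
    score: int     # leftmost column: the score
    observed: int  # center column: times the score occurred in the search
    expected: int  # rightmost column: times the score was expected to occur


# ASSUMED row layout (not taken from the unit): optional "<" or ">" marker,
# three integers, then optionally ":" and a bar made of "=" or "*" characters.
_ROW = re.compile(r"^\s*[<>]?\s*(\d+)\s+(\d+)\s+(\d+)(?=\s|:|$)")


def parse_histogram_row(line: str) -> Optional[HistogramRow]:
    """Return the three numeric columns of a row, or None for non-data lines."""
    match = _ROW.match(line)
    if match is None:
        return None
    return HistogramRow(*(int(group) for group in match.groups()))


def excess_rows(lines: Iterable[str]) -> List[HistogramRow]:
    """Keep only rows whose observed count exceeds the expected count."""
    rows = (parse_histogram_row(line) for line in lines)
    return [row for row in rows if row is not None and row.observed > row.expected]


# Hypothetical sample rows, invented for illustration only.
sample = [
    "       opt      E()",
    "< 20    12     0:=",
    "  22     0     0:",
    "  26     3     7:*",
    "  40   150   142:==*",
]

for row in excess_rows(sample):
    print(row.score, row.observed, row.expected)
```

Rows in which the observed count exceeds the expected count are the ones worth a closer look when judging how well the statistical model fits a given search. Real output should be checked against the assumed layout before relying on a parser like this.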
Literature Cited
Aho, A.V., Hopcroft, J.E., and Ullman, J.D. 1983. Data Structures and Algorithms. Addison‐Wesley, Reading, Mass.
Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555‐565.
Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nature Genet. 6:119‐129.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
Agarwal, P. and States, D. 1998. Comparative accuracy of methods for protein sequence similarity search. Bioinformatics 14:40‐47.
Bork, P. and Gibson, T.J. 1996. Applying motif and profile searches. Methods Enzymol. 266:383‐402.
Brendel, V., Bucher, P., Nourbakhsh, I.R., Blaisdell, B.E., and Karlin, S. 1992. Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 89:2002‐2006.
Dayhoff, M., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3 (M. Dayhoff, ed.) pp. 345‐352. National Biomedical Research Foundation, Silver Spring, Md.
Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275‐282.
Kann, M., Qian, B., and Goldstein, R.A. 2000. Optimization of a new score function for the detection of remote homologs. Proteins 41:498‐503.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443‐453.
Nicholas, H.B., Deerfield, D.W. II, and Ropelewski, A.J. 2000. Strategies for searching sequence databases. BioTechniques 28:1174‐1191.
Nicholas, H.B., Ropelewski, A.J., and Deerfield, D.W. II. 2002. Strategies for multiple sequence alignment. BioTechniques 32:572‐591.
Pearson, W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4:1145‐1160.
Pearson, W.R. 1998. Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71‐84.
Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444‐2448.
Pizzi, E. and Frontali, C. 2001. Low‐complexity regions in Plasmodium falciparum proteins. Genome Res. 11:218‐229.
Ropelewski, A.J., Nicholas, H.B., and Deerfield, D.W. II. 2000. Selective and sensitive comparison of genetic sequence data. In Industrial Strength Parallel Computing (A. Koniges, ed.) pp. 453‐479. Morgan Kaufmann, San Francisco.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
Spang, R. and Vingron, M. 1998. Statistics of large‐scale sequence searching. Bioinformatics 14:279‐284.
Spang, R. and Vingron, M. 2001. Limits of homology detection by pairwise sequence comparison. Bioinformatics 17:338‐342.
States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database searches using application‐specific scoring matrices. Methods Enzymol. 3:66‐77.
Vingron, M. and Waterman, M.S. 1994. Sequence alignment and penalty choice: Review of concepts, case studies and implications. J. Mol. Biol. 235:1‐12.
Waterman, M.S. and Eggert, M. 1987. A new algorithm for subsequence alignments with application to tRNA‐rRNA comparisons. J. Mol. Biol. 197:723‐728.
Waterman, M.S. and Vingron, M. 1994. Rapid and accurate estimates of statistical significance for sequence database searches. Proc. Natl. Acad. Sci. U.S.A. 91:4625‐4628.
Webber, C. and Barton, G.J. 2003. Increased coverage obtained by combination of methods for protein sequence database searching. Bioinformatics 19:1397‐1403.
Li, W., Pio, F., Pawlowski, K., and Godzik, A. 2000. Saturated BLAST: An automated multiple intermediate sequence search used to detect distant homology. Bioinformatics 16:1105‐1110.
Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149‐163.
Key References
Altschul et al., 1994. See above.
This review provides detailed information about local alignment statistics, extreme value distributions, scoring matrices, and low‐complexity regions.
Agarwal and States, 1998. See above.
This article compares the Smith‐Waterman, FASTA, original BLAST, WU‐BLAST2, and Probabilistic Smith‐Waterman codes.
Nicholas et al., 2000. See above.
This review discusses the advantages and disadvantages of the BLAST, FASTA, and Smith‐Waterman search algorithms, how to select appropriate scoring matrices, how to score insertions and deletions, and several methods by which statistical significance can be computed.
Pearson, 1995. See above.
This article compares the FASTA, Smith‐Waterman, and original BLAST algorithms by how well each finds members of 67 different protein superfamilies.
Internet Resources
ftp://ftp.virginia.edu/pub/fasta/
The FASTA package (in which the Ssearch code is included) can be obtained here.
ftp://ftp.ncbi.nih.gov/blast/matrices/
A variety of protein scoring matrices that can be used with the Ssearch code can be obtained here.