Phylogenomic Inference of Protein Molecular Function

Source: Internet, 2013-12-31

Abstract

With the explosion in sequence data, accurate prediction of protein function has become a vital task in prioritizing experimental investigation. While computationally efficient methods for homology-based function prediction have been developed to make this approach feasible in high-throughput mode, it is not without its dangers. Biological processes such as gene duplication, domain shuffling, and speciation produce families of related genes whose gene products can have vastly different molecular functions. Standard sequence-comparison approaches may not discriminate effectively among these candidate homologs, leading to errors in database annotations. In this unit, we describe phylogenomic approaches to reduce the error rate in function prediction. Phylogenomic inference of protein molecular function consists of a series of subtasks. Once a cluster of homologs is identified, a multiple sequence alignment and phylogenetic tree are constructed. Finally, the phylogenetic tree is overlaid with experimental data culled for the members of the family, and changes in biochemical function can be traced along the evolutionary tree.

Keywords: Evolution; Homolog; Ortholog; Paralog; Function prediction; Phylogenomic; Subfamily; Phylogenetic

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Basic Protocol 1: Identifying Homologs and Constructing a Multiple Sequence Alignment Using FlowerPower and MUSCLE
Basic Protocol 2: Multiple Sequence Alignment Analysis and Editing Using Belvu
Support Protocol 1: Downloading and Installing the Belvu Software
Basic Protocol 3: Constructing a Phylogenetic Tree using Bete
Basic Protocol 4: Phylogenomic Inference of Molecular Function using TreeNotator
Commentary
Literature Cited
Figures

Figures

Figure 6.9.1 Workflow for phylogenomic analysis. For every step in the workflow, the basic protocol in this unit that describes it is given in parentheses.

Figure 6.9.2 FlowerPower submission page. Paste the seed sequence in the box provided, in FASTA format. Type a valid E‐mail address in the box provided. Results will be sent to that address.

Figure 6.9.3 FlowerPower results page. The FlowerPower and MUSCLE alignments can be downloaded by clicking on the hyperlinks at bottom left. The FlowerPower cluster (unaligned sequences) can be downloaded using the second hyperlink from the bottom. The MUSCLE and FlowerPower alignments can be viewed directly on the Web page by clicking on the JalView buttons at right.

Figure 6.9.4 Aligned FASTA format. The aligned FASTA format displays aligned residues in uppercase and gaps as dashes.
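
The aligned FASTA layout described in this caption can be read with a few lines of Python. The following is a minimal sketch, not part of the protocol: the function name `read_aligned_fasta` and the two example sequences are made up for illustration, and the only check performed is that all sequences have the same aligned length.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a dict of {header: aligned sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    alignment = {name: "".join(parts) for name, parts in records.items()}
    # In an alignment every row must span the same number of columns.
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; input is not aligned")
    return alignment


# Hypothetical two-sequence alignment: uppercase residues, dashes for gaps.
EXAMPLE = """>seq1
MKT-AYIAK
>seq2
MK--AYLAK
"""
alignment = read_aligned_fasta(EXAMPLE)
```

Sequences wrapped over several lines are joined before the length check, so line wrapping in a downloaded file does not affect the result.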

Figure 6.9.5 Belvu display of an MSA. The alignment is displayed using the Belvu wrap‐around mode option. This displays a region of variability at the amino‐terminus, and another region (at bottom) where only one of the sequences aligns, causing the remaining sequences to have gap characters at these positions. The variable amino‐terminus and gappy columns can be deleted during alignment masking, prior to use of the alignment as the basis for phylogenetic tree construction.
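
Belvu performs alignment masking interactively, but the idea of dropping gappy columns can be sketched in Python. This is an illustration under stated assumptions, not Belvu's algorithm: the function name `mask_gappy_columns`, the 50% gap threshold, and the three-sequence example are all hypothetical.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict of {name: aligned sequence}, all of equal length.
    The threshold is an illustrative choice, not a Belvu default.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    n_cols = len(seqs[0])
    keep = []
    for col in range(n_cols):
        gaps = sum(1 for seq in seqs if seq[col] in "-.")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(col)
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


# Column 0 and column 5 are gaps in two of three sequences and are removed.
msa = {"a": "MKTAY-LL", "b": "-KTAY-LL", "c": "-KSAYWLL"}
masked = mask_gappy_columns(msa)
```

A column-wise threshold like this removes the positions where only one sequence aligns. Trimming a variable amino-terminus usually also needs a judgment about conservation, which this sketch does not attempt.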

Figure 6.9.6 UCSC A2M format. The A2M (for “align2model”, i.e., the alignment of a sequence to a hidden Markov model) format is designed to indicate the states used by an HMM to generate a sequence. HMM states include Match states (representing the consensus structure for a family), Delete states (used to skip over a consensus position), and Insert states (used to generate additional amino acids not contained in the consensus). The A2M format displays aligned residues in uppercase and “unaligned” characters (emitted in HMM insert states) in lowercase. Gaps in the aligned regions are indicated as dashes (‐) and are emitted in HMM skip/delete states. By contrast dots (.) are inserted post hoc in columns containing lowercase letters emitted in Insert state, so that all sequences have the same length.
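
Because A2M marks insert-state characters with lowercase letters and dots, the match-state columns can be recovered by keeping only uppercase letters and dashes. The sketch below shows this on two made-up sequences; `a2m_match_columns` is a hypothetical helper name, not a function from the UCSC tools.

```python
def a2m_match_columns(seq):
    """Keep only HMM match/delete columns: uppercase residues and dashes.

    Lowercase letters (insert-state residues) and dots (padding added
    to insert columns) are discarded.
    """
    return "".join(ch for ch in seq if ch.isupper() or ch == "-")


# Hypothetical A2M rows: 't' and 'g' are inserts, '.' pads insert columns.
a2m = {"seq1": "MKt.AY-L", "seq2": "MK.gAYWL"}
match_only = {name: a2m_match_columns(seq) for name, seq in a2m.items()}

# After stripping inserts, all rows should have one character per match state.
row_lengths = {len(seq) for seq in match_only.values()}
```

The resulting rows contain one character per consensus position, which is the form most tree-building programs expect.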

Figure 6.9.7 BETE Web server submission page. The multiple sequence alignment is pasted into the box provided. Results are sent by E‐mail.

Figure 6.9.8 BETE results page. Download the Newick‐format tree for annotation by the TreeNotator software. To view the BETE subfamily decomposition, download the “Alignment organized by subfamily” file.

Figure 6.9.9 Newick file format. The Newick format is a standard tree file format readable by most tree viewers. The format uses nested parentheses to indicate the join order of the nodes. In the above example, the sequences in the tree are represented by their GenBank identifier, and the bootstrap values are indicated at each node. For example ((401124:100.0,401125:100.0):78.0,121071:100.0) means that the sequences 401124 and 401125 are joined at a node with bootstrap value of 78.0, and this node is joined to sequence 121071 with a bootstrap value of 100.
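
The nested-parentheses structure can be made concrete with a small recursive parser. This is a minimal sketch that handles only the subset of Newick used in the caption's example (no quoted labels or comments). The names `parse_newick` and `leaf_names` are illustrative. The number after each colon is stored as-is under `value`; the parser does not decide whether it is a bootstrap value, as in the caption, or a branch length, as in many other Newick files.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "value": float or None, "children": list}.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, read up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ":,()":
            pos += 1
        name = text[start:pos]
        # Optional numeric value after a colon.
        value = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            value = float(text[start:pos])
        return {"name": name, "value": value, "children": children}

    root = parse_node()
    if pos != len(text):
        raise ValueError("trailing characters after tree")
    return root


def leaf_names(node):
    """Return leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("((401124:100.0,401125:100.0):78.0,121071:100.0)")
inner = tree["children"][0]  # the node joining 401124 and 401125
```

Here `inner["value"]` holds the 78.0 attached to the node that joins sequences 401124 and 401125, matching the reading given in the caption.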

Figure 6.9.10 TreeNotator submission page. The tree is pasted in the input box in Newick format. Results are returned by E‐mail to the address provided.

Figure 6.9.11 TreeNotator results page. Click on the “View tree” button to display the annotated tree using the ATV software, or download the annotated tree (immediately under “File for download”) for display using other phylogenetic tree visualization software.

Figure 6.9.12 Annotated tree displayed with the ATV viewer. For each sequence, the following data are retrieved from the corresponding database (UniProt or GenBank) and displayed at the leaves: the species, the GenBank or UniProt identifier, and the annotation. The bootstrap values are displayed in green at internal nodes.

Figure 6.9.13 In the tree shown above, the root is labeled D to indicate a duplication event, joining two subtrees, each of which contains a group of putative orthologs. All other internal nodes are labeled S, to indicate speciation events. Nodes for subtrees containing only one representative of a species can safely be labeled with an S. While labeling nodes as indicative of speciation is straightforward, labeling nodes as indicative of duplication is less so, since multiple protein sequences encoded by the same gene can be present in a dataset (e.g., due to splice variants or simple database duplicates).
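
The labeling rule in this caption can be sketched as follows. A subtree in which every species appears at most once is labeled S, and any other subtree is flagged only as a candidate duplication (`D?`), since splice variants or database duplicates can produce the same pattern. The tree encoding (nested lists with `(gene_id, species)` tuples as leaves), the function names, and the example identifiers are assumptions made for illustration.

```python
from collections import Counter


def species_in(node):
    """Count species over all leaves under a node.

    Leaves are (gene_id, species) tuples; internal nodes are lists.
    """
    if isinstance(node, tuple):
        return Counter([node[1]])
    total = Counter()
    for child in node:
        total += species_in(child)
    return total


def label_nodes(node, path="root", labels=None):
    """Label internal nodes 'S' or 'D?' by species representation."""
    if labels is None:
        labels = {}
    if isinstance(node, tuple):
        return labels
    counts = species_in(node)
    # One representative per species: safe to call a speciation node.
    labels[path] = "S" if max(counts.values()) == 1 else "D?"
    for i, child in enumerate(node):
        label_nodes(child, f"{path}.{i}", labels)
    return labels


# Hypothetical family: two groups, each with one gene per species.
tree = [
    [("A1", "S1"), ("A2", "S2")],
    [("B1", "S1"), ("B2", "S2")],
]
labels = label_nodes(tree)
```

In this example the root is flagged `D?` and both subtrees are labeled `S`. Confirming a `D?` node as a true duplication still requires checking that the repeated species entries come from distinct genes.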

Figure 6.9.14 Discriminating orthologs and paralogs. This figure represents the evolution of a protein family through the joint processes of gene duplication and speciation. The ancestral gene, shown at top, undergoes duplication in the ancestral genome (at the node labeled D), producing two paralogous genes, A and B. A subsequent speciation event produces two species, S1 and S2, each having a copy of the A and B genes. A1 and B1 are clearly paralogs, as are A2 and B2. If the definition of ortholog used is “the same gene in different species,” then according to this tree, A1 and A2 are clearly orthologs, as are B1 and B2. By contrast, it cannot be asserted that A1 and B2 are orthologs, nor can it be asserted that B1 and A2 are orthologs. The A and B genes are related by a duplication event in the ancestral genome where they are paralogous. This figure is adapted from Koonin (2001).
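
The reasoning in this caption amounts to classifying each pair of genes by the event at their last common ancestor in the tree. The sketch below applies that idea to the four genes named in the caption. The tree encoding (dicts with an `event` key and string leaves) and the function names are assumptions made for illustration, not the output format of any tool in this unit.

```python
from itertools import combinations


def leaves(node):
    """Return all leaf names under a node (leaves are plain strings)."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node["children"]:
        out.extend(leaves(child))
    return out


def classify_pairs(node, result=None):
    """Classify each leaf pair by the event at its last common ancestor."""
    if result is None:
        result = {}
    if isinstance(node, str):
        return result
    kind = "orthologs" if node["event"] == "S" else "related by duplication"
    # Pairs drawn from different children have this node as their
    # last common ancestor.
    for left, right in combinations(node["children"], 2):
        for a in leaves(left):
            for b in leaves(right):
                result[frozenset((a, b))] = kind
    for child in node["children"]:
        classify_pairs(child, result)
    return result


# Tree from the caption: duplication (D) first, then speciation (S).
tree = {"event": "D", "children": [
    {"event": "S", "children": ["A1", "A2"]},
    {"event": "S", "children": ["B1", "B2"]},
]}
pairs = classify_pairs(tree)
```

A1/A2 and B1/B2 come out as orthologs, while every A/B pair is reported as related by duplication. Of those, the same-species pairs (A1/B1 and A2/B2) are the paralogs named in the caption.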

Literature Cited

	Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
	Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
	Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., and Yeh, L.S. 2004. UniProt: The Universal Protein knowledgebase. Nucl. Acids Res. 32:D115‐D119.
	Benson, D.A., Karsch‐Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2004. GenBank: Update. Nucl. Acids Res. 32:D23‐D26.
	Bork, P. and Koonin, E.V. 1998. Predicting functions from protein sequences—where are the bottlenecks? Nature Genet. 18:313‐318.
	Brenner, S.E. 1999. Errors in genome annotation. Trends Genet. 15:132‐133.
	Brown, D., Krishnamurthy, N., Dale, J.M., Christopher, W., and Sjölander, K. 2005. Subfamily HMMs in functional genomics. Pac. Symp. Biocomput. 322‐333.
	Citerne, H.L., Luo, D., Pennington, R.T., Coen, E., and Cronk, Q.C. 2003. A phylogenomic investigation of CYCLOIDEA‐like TCP genes in the Leguminosae. Plant Physiol. 131:1042‐1053.
	Devos, D. and Valencia, A. 2001. Intrinsic errors in genome annotation. Trends Genet. 17:429‐431.
	Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287‐314.
	Doolittle, R.F. and Bork, P. 1993. Evolutionarily mobile modules in proteins. Sci. Am. 269:50‐56.
	Edgar, R.C. 2004. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.
	Eisen, J.A. 1998. Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163‐167.
	Eisen, J.A. and Wu, M. 2002. Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theor. Popul. Biol. 61:481‐487.
	Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22:521‐565.
	Fitch, W.M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99‐113.
	Galperin, M.Y. and Koonin, E.V. 1998. Sources of systematic error in functional annotation of genomes: Domain rearrangement, non‐orthologous gene displacement and operon disruption. In Silico Biol. 1:55‐67.
	Gerlt, J.A. and Babbitt, P.C. 2001. Divergent evolution of enzymatic function: Mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem. 70:209‐246.
	Gilks, W.R., Audit, B., De Angelis, D., Tsoka, S., and Ouzounis, C.A. 2002. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 18:1641‐1649.
	Hasegawa, M. and Fujiwara, M. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor‐joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1‐5.
	Hollich, V., Storm, C.E., and Sonnhammer, E.L. 2002. OrthoGUI: Graphical presentation of Orthostrapper results. Bioinformatics 18:1272‐1273.
	Huelsenbeck, J.P. and Ronquist, F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754‐755.
	Koonin, E.V. 2001. An apology for orthologs—or brave new memes. Genome Biol. 2:COMMENT1005.
	Kuhner, M.K. and Felsenstein, J. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459‐468.
	McClure, M.A., Vasi, T.K., and Fitch, W.M. 1994. Comparative analysis of multiple protein‐sequence alignment methods. Mol. Biol. Evol. 11:571‐592.
	Sander, C. and Schneider, R. 1991. Database of homology‐derived protein structures and the structural meaning of sequence alignment. Proteins 9:56‐68.
	Saitou, N. and Nei, M. 1987. The neighbor‐joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406‐425.
	Sjölander, K. 1998. Phylogenetic inference in protein superfamilies: Analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6:165‐174.
	Sjölander, K. 2004. Phylogenomic inference of protein molecular function: Advances and challenges. Bioinformatics 20:170‐179.
	Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. 1996. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12:327‐345.
	Storm, C.E. and Sonnhammer, E.L. 2002. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18:92‐99.
	Swofford, D. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Mass.
	Thompson, J.D., Plewniak, F., and Poch, O. 1999. A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27:2682‐2690.
	Wheeler, W.C., Gatesy, J., and DeSalle, R. 1995. Elision: A method for accommodating multiple molecular sequence alignments with alignment‐ambiguous sites. Mol. Phylogenet. Evol. 4:1‐9.
	Zmasek, C.M. and Eddy, S.R. 2001. ATV: Display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383‐384.
	Zmasek, C.M. and Eddy, S.R. 2002. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3:14.
Key References
	Bork and Koonin, 1998. See above.
	The authors of this paper identify common problems associated with function prediction by homology and present ways to avoid these errors.
	Eisen, 1998. See above.
	Jonathan Eisen's cogent presentation of the raison d'etre behind phylogenomic analysis for improving prediction of gene function.
	Sjölander, 2004. See above.
	A detailed view of the challenges in phylogenomic analysis, with a description of new methods for key tasks in a phylogenomic pipeline.
Internet Resources
	http://phylogenomics.berkeley.edu/resources
	The BPG resources Web site includes a variety of user‐friendly resources for phylogenomic inference of protein molecular function. A description of all the available tools can also be found on the Web site.