DRAGON and DRAGON View: Information Annotation and Visualization Tools for Large‐Scale Expression Data

Internet, 2013-12-31

Abstract

The Database Referencing of Array Genes ONline (DRAGON) database system consists of information derived from publicly available databases including UniGene, SWISS‐PROT, Pfam, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Through a Web‐accessible interface, the DRAGON Annotate tool rapidly supplies information pertaining to a range of biological characteristics of all the genes in any large‐scale gene expression data set. The subsequent inclusion of this information during data analysis and visualization allows for deeper insight into gene expression patterns. The set of DRAGON View tools provides methods for the analysis and visualization of expression patterns in relation to annotated information. Instead of incorporating the standard set of clustering and graphing tools available in many large‐scale expression data analysis software packages, DRAGON View has been specifically designed to allow for the analysis of expression data in relation to the biological characteristics of gene sets.

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Basic Protocol 1: Preparing Data for Use with the DRAGON Database and Analyzing Data with DRAGON View
Support Protocol 1: Analyzing Data with the DRAGON Families Tool
Guidelines for Understanding Results
Commentary
Literature Cited
Figures

Materials

Basic Protocol 1: Preparing Data for Use with the DRAGON Database and Analyzing Data with DRAGON View

Necessary Resources

Windows, Linux, Unix, or Macintosh computer with Internet connection (preferably broadband connection, e.g., T1, T3, cable, or DSL service)

Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher) on Windows or Macintosh systems; Opera, Netscape 6 (or higher), or Mozilla on Linux‐based systems. Internet Explorer 5 or higher and Netscape 6 or higher are preferred, because Netscape 4.x is not capable of supporting all of the functionality provided in the DRAGON Paths tool.
Also required:
Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems or Sun Microsystems Star Office suite on Linux systems.
Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows systems; XEmacs (http://www.xemacs.org) on Linux systems.
Finally, for advanced users who want more flexibility in manipulating their text files, the Perl programming language is powerful and easy to use, and it allows the user to automate the text‐formatting, file‐creation, and file‐alteration tasks that are useful when analyzing large data sets. ActiveState (http://www.activestate.com) has developed a version of Perl for Windows computers (http://www.activestate.com/Products/ActivePerl/). Otherwise, the http://www.perl.com Web site provides downloads of Perl for Linux, Unix, and Macintosh computers.
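
The kind of automated text formatting described above can be sketched in a few lines. The example below is only an illustration, written in Python rather than Perl; the column headers (ACCESSION, NAME, RATIO) and the accession values are hypothetical placeholders, not the layout of the demonstration files or a format required by DRAGON. It copies selected columns from a tab‐delimited file and trims stray whitespace from each field.

```python
import csv
import io


def extract_columns(src, dst, keep):
    """Copy only the named columns from a tab-delimited source to a destination."""
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(keep)
    for row in reader:
        # Trim stray whitespace so identifiers compare cleanly later on.
        writer.writerow([(row[name] or "").strip() for name in keep])


# Hypothetical three-column input; real files will use different headers.
sample = "ACCESSION\tNAME\tRATIO\nACC0001 \tgene A\t2.5\nACC0002\tgene B\t0.4\n"
out = io.StringIO()
extract_columns(io.StringIO(sample), out, ["ACCESSION", "RATIO"])
print(out.getvalue())
```

For files on disk, the same function can be given handles from open(path, newline="") in place of the in‐memory strings used here.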

The Iyer et al. (1999) example data files were obtained from the Stanford Microarray data Web site (http://genome-www.stanford.edu/serum/data.html). The two files used for demonstration purposes in this unit may be downloaded from the following URLs, respectively:
- http://genome-www.stanford.edu/serum/fig2data.txt
- http://genome-www.stanford.edu/serum/data/fig2clusterdata.txt
Both files are also available at the Current Protocols Web site:
- http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm
Optional: The DRAGON database is generated through the automated parsing of flat files provided by publicly available databases (see the DRAGON Web site for a list of the database flat files used by DRAGON). The information in these files is then loaded into a back‐end MySQL (http://www.mysql.com; unit 9.2) relational database for use by DRAGON (Fig. 7.4.6). Although it may be easier and more intuitive for most users to access the information in these files via the DRAGON Web site, some readers may want to use this information in their own relational databases. For this purpose, all of the tables used in the DRAGON database are provided for download on the DRAGON Web site (http://pevsnerlab.kennedykrieger.org/download.htm) or can be ordered on CD if desired (http://pevsnerlab.kennedykrieger.org/order.htm).

Support Protocol 1: Analyzing Data with the DRAGON Families Tool

Necessary Resources

Windows, Linux, Unix, or Macintosh computer with an Internet connection (preferably broadband connection, e.g., T1, T3, cable, or DSL service)

Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher) on Windows or Macintosh systems; Opera, Netscape 6 (or higher) or Mozilla on Linux‐based systems. Internet Explorer 5 or higher and Netscape 6 or higher are preferred, because Netscape 4.x is not capable of supporting all of the functionality provided in the DRAGON Paths tool.
Also required:
Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems or Sun Microsystems Star Office suite on Linux systems.
Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows systems; XEmacs (http://www.xemacs.org) on Linux systems.

An annotated master matrix file created by running the DRAGON Annotate tool (figure2_combined_data_KWS.txt; see Basic Protocol 1)

Figures

Figure 7.4.1 The DRAGON home page provides links to all available tools and data sources contained in DRAGON and DRAGON View. The page also contains links to all of the public data files that are used by DRAGON to generate its database.

Figure 7.4.2 The DRAGON Annotate page. (A) The user is allowed to input data into a dialog box, or a tab‐delimited text file can be uploaded from a local file. (B) The user selects options, then sends a request for annotation to the DRAGON database. Results may be returned as an HTML table, as a tab‐delimited text file (suitable for import into a spreadsheet such as Microsoft Excel), or as an E‐mail.

Figure 7.4.3 The DRAGON Families page.

Figure 7.4.4 As the final step in the analysis of the demonstration data, each time point contained in the Iyer et al. (1999) data set, after having been associated with SWISS‐PROT keyword information by DRAGON Annotate, is analyzed using the DRAGON Families tool. The most coordinately up‐regulated gene families are shown here for three time points (15 min, 6 hr, and 24 hr). Each gene is represented in its corresponding family as a box that is clickable and hyperlinked to the NCBI LocusLink entry for that gene. Across each row, all the boxes correspond to genes in a given family. Each box is also color‐coded on a scale from red (up‐regulated) to green (down‐regulated). A scale at the top of the analysis page (not shown) gives the association of colors with ratio values. For all the functional families that are annotated, the program returns the families ranked according to the average ratio expression value of all the genes in that group. Note that overall there is less differential regulation at the 15‐min time point, since no bright red squares are present. By 6 hr, certain gene families, particularly those associated with inflammatory responses, are coordinately up‐regulated. Finally, by 24 hr, cell cycle and mitotic gene families are coordinately differentially regulated, indicating that the cells are progressing through the cell cycle.
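
The ranking described in this caption (families ordered by the average ratio expression value of their genes) can be illustrated with a minimal sketch. This is not the DRAGON Families implementation: it assumes a plain arithmetic mean of the ratios, and the gene identifiers, ratios, and keyword names below are invented for the example.

```python
from collections import defaultdict


def rank_families(genes):
    """Return (family, mean ratio, gene count) tuples, most up-regulated first.

    `genes` is an iterable of (gene_id, ratio, keywords) tuples.
    """
    ratios = defaultdict(list)
    for _gene_id, ratio, keywords in genes:
        # A gene annotated with several keywords counts toward each family.
        for keyword in keywords:
            ratios[keyword].append(ratio)
    ranked = [(kw, sum(vals) / len(vals), len(vals)) for kw, vals in ratios.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical annotated genes: (identifier, expression ratio, keywords).
demo = [
    ("g1", 4.0, ["Inflammatory response"]),
    ("g2", 2.0, ["Inflammatory response", "Cytokine"]),
    ("g3", 0.5, ["Cell cycle"]),
]
for family, mean_ratio, count in rank_families(demo):
    print(f"{family}\t{mean_ratio:.2f}\t{count}")
```

Whether the tool averages raw or log‐transformed ratios is not stated here; with log ratios, the same grouping and sorting steps would apply unchanged.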

Figure 7.4.5 Examples of the graphical outputs of the three types of DRAGON View tools. (A) DRAGON Families produces rows of green (down‐regulated), red (up‐regulated), and gray (unchanged) boxes (see scale for the range of ratio values represented by each color). Each box represents one gene and is hyperlinked to its corresponding UniGene entry. Each row has a type identifier to its right that is hyperlinked to its description. To the far right is the average ratio expression value for all of the genes in that family. All rows are sorted from the most up‐regulated family to the most down‐regulated family. (B) DRAGON Order produces rows of black lines. Each line represents one gene, and its location in the row represents its position on a gene list sorted by ratio expression values. Lines at the far left represent the most up‐regulated genes (+) and lines at the far right represent the most down‐regulated genes (–). Each row's type (e.g., SWISS‐PROT keywords) is listed to the right. (C) DRAGON Paths maps the location and ratio expression value of genes from the submitted gene list onto KEGG cellular pathway diagrams. A green (down‐regulated), red (up‐regulated), or gray (unchanged) circle followed by the ratio expression value is mapped to the upper left corner of each corresponding protein box. Each protein box is hyperlinked to its corresponding LocusLink entry.
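
The layout described for DRAGON Order in panel B can be sketched as follows. This is an illustrative assumption about how line positions might be computed, not the tool's actual code: genes are sorted by ratio, each gene receives a fractional position along the row (0.0 at the far left, 1.0 at the far right), and positions are grouped by annotation type. The gene identifiers and the type labels "A" and "B" are hypothetical.

```python
def order_positions(genes):
    """Map each annotation type to the fractional list positions of its genes.

    `genes` is an iterable of (gene_id, ratio, keywords) tuples.
    0.0 is the far left (most up-regulated); 1.0 is the far right.
    """
    ranked = sorted(genes, key=lambda gene: gene[1], reverse=True)
    last = max(len(ranked) - 1, 1)  # avoid dividing by zero for a single gene
    rows = {}
    for index, (_gene_id, _ratio, keywords) in enumerate(ranked):
        for keyword in keywords:
            rows.setdefault(keyword, []).append(index / last)
    return rows


# Hypothetical annotated genes: (identifier, expression ratio, types).
demo = [
    ("g1", 4.0, ["A"]),
    ("g2", 0.5, ["B"]),
    ("g3", 2.0, ["A", "B"]),
]
print(order_positions(demo))
```

A row whose positions cluster near 0.0 corresponds to a type whose genes sit mostly at the up‐regulated end of the sorted list.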

Figure 7.4.6 Database architecture for DRAGON. The data contained in the DRAGON database is derived from Web‐accessible databases that are downloaded by FTP, parsed using Perl scripts, and stored in tables in the MySQL relational database management system. The DRAGON database is housed on a Dell PowerEdge 6300 dual processor server. The front end consists of a Web site that is searched using Perl (.cgi) scripts to allow for user‐defined queries of the database.

Figure 7.4.7 Overview of the information in DRAGON. This diagram represents a subset of the tables now available in DRAGON and the possible connections between them. Depending upon the type of information desired, different sets of tables are joined with the tables containing the microarray gene expression data (in this diagram, for example, “Incyte Array Data” and “Incyte Numbers”). Two “UniGene Human Numbers” tables are used to expand the “GenBank #s” from the “Incyte Numbers” table into all “GenBank #s” associated with each “UniGene ID”, thereby providing a bridge between “GenBank #s” from the “Incyte Numbers” table and the “Swissprot Numbers”, “TrEMBL Numbers”, “Transfac Factors”, and “Transfac Sites” tables. Further characterization of the proteins encoded by genes on the microarray occurs by joining with tables derived from the SWISS‐PROT, Pfam, Interpro, and OMIM databases.
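
The bridging step in this caption, in which two copies of the UniGene table expand one GenBank number into every GenBank number in the same UniGene cluster, amounts to a self‐join. The sketch below illustrates the idea under stated assumptions: it uses an in‐memory SQLite database as a stand‐in for MySQL, and the table names, column names, and identifiers are hypothetical simplifications rather than the actual DRAGON schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables; identifiers are placeholders.
conn.executescript("""
CREATE TABLE incyte_numbers (incyte_id TEXT, genbank TEXT);
CREATE TABLE unigene_numbers (unigene_id TEXT, genbank TEXT);
CREATE TABLE swissprot_numbers (genbank TEXT, swissprot TEXT);
INSERT INTO incyte_numbers VALUES ('INC1', 'GB_A');
INSERT INTO unigene_numbers VALUES ('Hs.X', 'GB_A'), ('Hs.X', 'GB_B');
INSERT INTO swissprot_numbers VALUES ('GB_B', 'SP_1');
""")

# u1 finds the UniGene cluster of the array's GenBank number;
# u2 expands that cluster into all of its GenBank numbers,
# which are then matched against the protein table.
query = """
SELECT i.incyte_id, u1.unigene_id, s.swissprot
FROM incyte_numbers AS i
JOIN unigene_numbers AS u1 ON u1.genbank = i.genbank
JOIN unigene_numbers AS u2 ON u2.unigene_id = u1.unigene_id
JOIN swissprot_numbers AS s ON s.genbank = u2.genbank
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In this toy data, joining the array table directly to the protein table on the GenBank number would return nothing, because only the sibling accession in the same cluster carries a protein identifier; the self‐join is what recovers the link.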

Figure 7.4.8 DRAGON uses accession numbers to define biological characteristics of genes and proteins. A microarray is a regular array of thousands of unique cDNAs or oligonucleotides spotted on a solid support. Each spot contains cDNA corresponding to a specific gene that encodes a protein. Accession numbers derived from publicly available databases provide information about the biological characteristics of both the gene and its corresponding protein. At the gene level, “Transfac Site” and “Transfac Factor” numbers indicate the presence of promoter regions on the gene and factors that bind to those promoter regions respectively. The “GenBank no.” and “UniGene ID” refer to EST sequences corresponding to fragments of the gene and a cluster of those EST sequences respectively. The “UniGene Cytoband” indicates the chromosomal location of the gene. The “UniGene Name” is the name of the gene. The “OMIM no.” indicates whether the gene is known to be involved in any human diseases. At the protein level, “Pfam no.” and “Interpro no.” indicate which functional domains the protein contains. The “SWISS‐PROT no.” is a unique identifier for the protein and can be derived from either the SWISS‐PROT or TrEMBL databases. “SWISS‐PROT Keywords” are derived from a controlled vocabulary of 827 words that are assigned to proteins in the SWISS‐PROT database according to their function(s). “SWISS‐PROT Sequence” is the amino acid sequence for the protein. “SWISS‐PROT Name” is the SWISS‐PROT database name for the protein.

Literature Cited

	Bailey, S.N., Wu, R.Z., and Sabatini, D.M. 2002. Applications of transfected cell microarrays in high‐throughput drug discovery. Drug Discov. Today 7:S113‐S118.
	Bouton, C.M. and Pevsner, J. 2001. DRAGON: Database Referencing of Array Genes Online. Bioinformatics 16:1038‐1039.
	Bouton, C.M. and Pevsner, J. 2002. DRAGON View: Information visualization for annotated microarray data. Bioinformatics 18:323‐324.
	Bowtell, D.D.L. 1999. Options available‐from start to finish‐for obtaining expression data by microarray. Nat. Genet. Suppl. 21:25‐32.
	Cheung, V.G., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., and Childs, G. 1999. Making and reading microarrays. Nat. Genet. Suppl. 21:15‐19.
	Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282:699‐705.
	Colantuoni, C., Henry, G., Zeger, S., and Pevsner, J. 2002. SNOMAD (Standardization and NOrmalization of MicroArray Data): Web‐accessible gene expression data analysis. Bioinformatics 18:1540‐1541.
	Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P., and Trent, J.M. 1999. Expression profiling using cDNA microarrays. Nat. Genet. Suppl. 21:10‐14.
	Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome‐wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863‐14868.
	Frishman, D., Heumann, K., Lesk, A., and Mewes, H‐W. 1998. Comprehensive, comprehensible, distributed and intelligent databases: Current status. Bioinformatics 14:551‐561.
	Gawantka, V., Pollet, N., Delius, H., Vingron, M., Pfister, R., Nitsch, R., Blumenstock, C., and Niehrs, C. 1998. Gene expression screening in Xenopus identifies molecular pathways, predicts gene function and provides a global view of embryonic gene expression. Mech. Dev. 77:95‐141.
	Gibbons, F.D. and Roth, F.P. 2002. Judging the quality of gene expression‐based clustering methods using gene annotation. Genome Res. 12:1574‐1581.
	Heyer, L.J., Kruglyak, S., and Yooseph, S. 1999. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res. 9:1106‐1115.
	Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83‐87.
	Kanehisa, M. and Goto, S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28:27‐30.
	Kanehisa, M. et al. 2002. The KEGG databases at GenomeNet. Nucleic Acids Res. 30:42‐46.
	Lal, S.P., Christopherson, R.I., and dos Remedios, C.G. 2002. Antibody arrays: An embryonic but rapidly growing technology. Drug Discov. Today 7:S143‐S149.
	Liang, S., Fuhrman, S., and Somogyi, R. 1998. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac. Symp. Biocomput. 3:18‐29.
	Lipshutz, R.J., Fodor, S.P.A., Gingeras, T.R., and Lockhart, D.J. 1999. High density synthetic oligonucleotide arrays. Nat. Genet. Suppl. 21:20‐24.
	Macauley, J., Wang, H., and Goodman, N. 1998. A model system for studying the integration of molecular biology databases. Bioinformatics 14:575‐582.
	Michaels, G.S., Carr, D.B., Askenazi, M., Fuhrman, S., Wen, X., and Somogyi, R. 1998. Cluster analysis and data visualization of large‐scale gene expression data. Pac. Symp. Biocomput. 3:42‐53.
	Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A. 1997. The gene expression matrix: Towards the extraction of genetic network architectures. Proc. Second World Cong. Nonlinear Analysts 1996. 30:1815‐1824.
	Spellman, P.T. and Rubin, G.M. 2002. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1:5.1‐5.8.
	Szallasi, Z. 1999. Genetic network analysis in light of massively parallel biological data acquisition. Pac. Symp. Biocomput. 4:5‐16.
	Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting patterns of gene expression with self‐organizing maps: Methods and applications to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96:2907‐2912.
	Toronen, P., Kolehmainen, M., Wong, G., and Castren, E. 1999. Analysis of gene expression data using self‐organizing maps. FEBS Lett. 451:142‐146.
	Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. 1995. Serial analysis of gene expression. Science 270:484‐487.
	Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R. 1998. Large‐scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. U.S.A. 95:334‐339.
	Zhang, M.Q. 1999. Large‐scale gene expression data analysis: A new challenge to computational biologists. Genome Res. 9:681‐688.
Key References
	Bouton and Pevsner, 2001. See above.
	Original publication concerning the DRAGON database.
	Bouton and Pevsner, 2002. See above.
	Original publication concerning the DRAGON View visualization tools.
	Bouton, C.M., Hossain, M.A., Frelin, L.P., Laterra, J., and Pevsner, J. 2001. Microarray analysis of differential gene expression in lead‐exposed astrocytes. Toxicol. Appl. Pharmacol. 176:34‐53.
	Research publication that reports use of DRAGON and DRAGON View in the context of a toxicogenomic microarray study.
	Iyer et al. 1999. See above.
	Reports the microarray study from which the example data sets for this unit were derived.