PubSearch and PubFetch: A Simple Management System for Semiautomated Retrieval and Annotation of Biological Information from the Literature

Internet, 2013-12-31

Abstract

For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.

Keywords: literature curation; ontology; controlled vocabulary; annotation; genes; relational database; web application

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Basic Protocol 1: Populating PubSearch
Support Protocol 1: Installing PubSearch
Support Protocol 2: Installing PubFetch for Use Outside of PubSearch
Alternate Protocol 1: Other Ways to Populate PubSearch
Basic Protocol 2: Setting up a PDF Repository for Full‐Text Indexing
Basic Protocol 3: Using PubSearch to Search Data
Basic Protocol 4: Using PubSearch to Add and Update Data
Basic Protocol 5: Using PubSearch to Make Gene Ontology Annotations
Basic Protocol 6: Generating and Loading InterProToGo Annotations
Commentary
Literature Cited
Figures
Tables

Figures

Figure 9.7.1 An overview of the PubSearch workflow illustrating how published articles, genes, and key biological terms are brought together and integrated within PubSearch. Manual review and annotation of these data creates an annotated database of literature, genes, and related information that can be used within PubSearch alone or exported to other applications.

Figure 9.7.2 A screenshot of the Navigation toolbar of the PubSearch Web user interface. It lists the different types of user functions, links to the usage guide, and text boxes for logging in.

Figure 9.7.3 A screenshot of the Search All function's result page showing the first page of results from a search with “water channel” (including quotes) as the query string. Results are displayed in order of the “density” of the match, a measure of the frequency of the matching string relative to the length of the entry. Underlined text (also shown in blue on screen) indicates a hyperlink to more information.
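
The “density” ordering described in this caption can be illustrated with a minimal sketch. The caption does not give the exact formula PubSearch uses, so the definition below (case-insensitive occurrences of the query divided by the entry length in characters) is an assumption, and the function names and sample entries are hypothetical.

```python
def match_density(entry, query):
    """Occurrences of `query` in `entry` divided by the entry length in characters.

    Assumed definition for illustration; PubSearch's real measure may differ.
    """
    text = entry.lower()
    needle = query.lower()
    if not text or not needle:
        return 0.0
    return text.count(needle) / len(text)


def rank_by_density(entries, query):
    """Return only the matching entries, highest density first."""
    hits = [e for e in entries if match_density(e, query) > 0]
    return sorted(hits, key=lambda e: match_density(e, query), reverse=True)


# Hypothetical entries, not taken from a PubSearch database.
entries = [
    "water channel",
    "A water channel protein is expressed in roots and forms a water channel",
    "Potassium transport in guard cells",
]
ranked = rank_by_density(entries, "water channel")
```

Under this definition a short entry consisting only of the phrase outranks a longer entry that mentions it twice, because the measure is normalized by entry length.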

Figure 9.7.4 A screenshot of the Search Hits function. Users can restrict the search by terms (lower left box), articles (lower right box), and validation status of the automated hits between terms and articles (upper left). Options for displaying the results are listed in the upper right corner.

Figure 9.7.5 A screenshot of the Search Hits result page. Results are grouped by article. The first column shows article information, the second column shows matching genes, the third column gives information about the match, and the fourth column displays the options for validating the matches between the papers and genes. Underlined text (also shown in blue on screen) indicates hyperlinks to more information.

Figure 9.7.6 A screenshot of the Article Search form.

Figure 9.7.7 A screenshot of the Article Detail page. Logged-in users can update the fields on this page.

Figure 9.7.8 A screenshot of the Add an Article form. This form allows users to insert an individual article. Entering the PubMed ID will retrieve all the article information from PubMed automatically using the PubFetch software, check for duplicates against articles already in the PubSearch database, insert the article if it does not yet exist in the database, and allow users to update the retrieved fields if necessary. If the PubMed ID is not known, users can enter the article's fields manually.
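
The retrieve, check, and insert sequence in this caption can be sketched as follows. This is not PubSearch or PubFetch code: the `article` table layout, the `fetch_article` stand-in (which returns canned metadata instead of contacting PubMed), and the status strings are all assumptions made for illustration.

```python
import sqlite3


def fetch_article(pubmed_id):
    """Hypothetical stand-in for a PubFetch lookup; returns canned metadata or None."""
    canned = {
        "12345": {"pubmed_id": "12345", "title": "An example article", "year": 2003},
    }
    return canned.get(pubmed_id)


def add_article(conn, pubmed_id):
    """Retrieve metadata, check for a duplicate, and insert only if the article is new."""
    record = fetch_article(pubmed_id)
    if record is None:
        return "not_found"
    duplicate = conn.execute(
        "SELECT 1 FROM article WHERE pubmed_id = ?", (pubmed_id,)
    ).fetchone()
    if duplicate:
        return "duplicate"
    conn.execute(
        "INSERT INTO article (pubmed_id, title, year) "
        "VALUES (:pubmed_id, :title, :year)",
        record,
    )
    conn.commit()
    return "inserted"


# In-memory database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (pubmed_id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
)
first = add_article(conn, "12345")    # new article
second = add_article(conn, "12345")   # same ID submitted again
missing = add_article(conn, "99999")  # ID unknown to the stand-in lookup
```

Checking for a duplicate before inserting keeps a repeated submission of the same PubMed ID from creating a second row; the manual-entry path mentioned in the caption is not modeled here.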

Figure 9.7.9 A screenshot of the Add an Article function's preview page. If the fields of the new article have been entered manually, the preview page allows users to choose the correct publication source using a drop‐down menu, or to add a new publication source.

Figure 9.7.10 The PubSearch database is the central component of the PubSearch system. The following operations are performed during PubSearch use. First, the PubSearch database is loaded in batch mode using input from other databases: for example, articles from literature databases such as PubMed and Agricola (using PubFetch software); biological data such as gene, allele, and germplasm information from TAIR (an example model organism database); and ontologies from ontology databases such as Gene Ontology and Plant Ontology. Second, the PubSearch database indexes the information by populating the Hit table using the Lucene engine. Third, through the Java API and a set of Web user interfaces, curators search, browse, edit, and add data, relying on the indexed data in the database. Finally, the edited biological, literature, and annotation data are exported to the TAIR production database and other databases such as Gene Ontology and Plant Ontology.
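
The indexing operation in this caption, in which term-to-article matches are stored in a Hit table, can be sketched with a toy relational model. The three-table schema, the `validated` column, and the sample rows are assumptions, and naive case-insensitive substring matching is swapped in for the Lucene engine that PubSearch actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE article (id INTEGER PRIMARY KEY, abstract TEXT);
    CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE hit (
        article_id INTEGER,
        term_id INTEGER,
        validated INTEGER DEFAULT 0  -- assumed flag for later curator review
    );
    """
)
# Hypothetical sample data standing in for loaded articles, gene names, and keywords.
conn.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [
        (1, "GENE1 encodes a water channel expressed in roots."),
        (2, "GENE2 regulates flowering time."),
    ],
)
conn.executemany(
    "INSERT INTO term VALUES (?, ?)",
    [(1, "GENE1"), (2, "water channel"), (3, "GENE2")],
)


def build_hits(conn):
    """Rebuild the hit table from scratch; returns the number of hits stored."""
    conn.execute("DELETE FROM hit")
    articles = conn.execute("SELECT id, abstract FROM article").fetchall()
    terms = conn.execute("SELECT id, name FROM term").fetchall()
    rows = [
        (article_id, term_id)
        for article_id, text in articles
        for term_id, name in terms
        if name.lower() in text.lower()  # naive match, not Lucene scoring
    ]
    conn.executemany("INSERT INTO hit (article_id, term_id) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


hit_count = build_hits(conn)
hits = conn.execute(
    "SELECT article_id, term_id FROM hit ORDER BY article_id, term_id"
).fetchall()
```

Storing matches as rows, rather than recomputing them per query, is what lets a curator interface filter by term, by article, and by validation status with ordinary SQL.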

Literature Cited

	The Gene Ontology Consortium. 2001. Creating the gene ontology resource: Design and implementation. Genome Res. 11:1425‐1433.
	Müller, H., Kenny, E.E., and Sternberg, P.W. 2004. Textpresso: An ontology‐based information retrieval and extraction system for biological literature. PLoS Biol. 2:e309.
	Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A., Garcia‐Hernandez, M., Huala, E., Lander, G., Montoya, M., Miller, N., Mueller, L.A., Mundodi, S., Reiser, L., Tacklind, J., Weems, D.C., Wu, Y., Xu, I., Yoo, D., Yoon, J., and Zhang, P. 2003. The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucl. Acids Res. 31:224‐228.

Internet Resources

	http://sourceforge.net/projects/geneontology
	Gene Ontology's SourceForge repository.
	http://pubsearch.org
	PubSearch homepage.
	http://tesuque.stanford.edu:9999/pubdemo
	PubSearch demo version.
	http://lists.sourceforge.net/lists/listinfo/gmod-pubsearch-dv
	PubSearch support mailing list.
	http://www.gmod.org
	Generic Model Organism Database project home page.