我要登录|
免费注册
|
我的丁香通
- 企业机构：
- 成为企业机构
- 个人用户：
- 个人中心
移动端

大家都在搜

0 人通过求购买到了急需的产品

免费发布求购

发布求购

Using Relational Databases for Improved Sequence Similarity Searching and Large‐Scale Genomic Analyses

互联网2013-12-31

651

Abstract
Table of Contents
Figures
Literature Cited

Abstract

Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome?scale analyses. Relational databases are essential for management and analysis of large?scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large?scale genomic analyses of homology?related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.

Keywords: relational database; sequence similarity; comparative genomic analysis; homology

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Basic Protocol 1: Installing and Populating the seqdb_demo Relational Database
Basic Protocol 2: Extracting Sequences from seqdb_demo for Similarity Searching to Improve Homolog Searching
Basic Protocol 3: Storing Similarity Search Results in seqdb_demo
Basic Protocol 4: Analyzing Similarity Search Results: Identifying Ancient Proteins
Basic Protocol 5: Analyzing Similarity Search Results: Taxonomic Subsets
Basic Protocol 6: Analyzing Similarity Search Results: Intergenic Regions
Commentary
Literature Cited
Figures
Tables

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Materials

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Figures

Figure 9.4.1 A schema for protein sequence data. Each of the boxes represents one of the tables in the seqdb_demo database. Sequences are stored in the protein table, their descriptions and accession information are stored in the annot table, and taxonomic information is stored in the taxon and taxon_name tables. The links between the tables are shown with dashed lines. The symbols at the ends of the line indicate the type of relationship; e.g., the protein:annot relationship is a one‐to‐many relationship; each protein sequence can have many descriptions or annotations but an annotation refers to only one protein sequence. The abbreviations to the left of the table entry names indicate whether the entry is a primary key (PK) or foreign key (FK, a foreign key in one table is a primary key in another, and allows the information in the two tables to be “joined”), or if the entry is indexed (IX) for rapid lookup.

View Image

Figure 9.4.2 (A ) List of tables in the database created in , step , retrieved via the command. (B ) Description of columns in the database, retrieved via the command.

View Image

Figure 9.4.3 (A ) Number of protein sequences loaded into database from the nr sequence library, retrieved via the command. (B ) Number of different descriptions loaded into the database from nr, retrieved via the command. (C ) Information on a single protein, retrieved via the command. (D ) All annotations of a protein, retrieved via the command.

View Image

Figure 9.4.4 (A ) Total number of taxa loaded from the NCBI Taxonomy database, retrieved via the command. (B ) NCBI's ID for human, retrieved via the command.

View Image

Figure 9.4.5 SQL code used to generate a library of human sequences (note the space following the fourth “ ” symbol).

View Image

Figure 9.4.6 FASTA‐formatted human sequences, printed to human.lib.

View Image
Figure 9.4.7 SQL code used to generate a library of mammalian sequences from seqdb_demo.

View Image

Figure 9.4.8 A schema for similarity search results. Each of the boxes represents one of the tables used to collect alignment data in the seqdb_demo database. The search table records the parameters of the search; search_query and search_lib record information about the query and library sequences used for the search, and the search_hit table records the scores and boundaries of alignments between query and library sequences. The links between tables, primary keys (PK), and foreign keys (FK) are indicated as in Figure .

View Image

Figure 9.4.9 SQL statements to confirm successful importing of results. Bold text represents input; lightface text represents output.

View Image

Figure 9.4.10 List of highest‐scoring E. coli homologs to human sequences, obtained via the commands shown in step of .

View Image

Figure 9.4.11 SQL statement to identify human sequences involved in alignments from from step of , for a database system that allows subselects (see step of ).

View Image

Figure 9.4.12 SQL statement to identify human sequences involved in alignments from step of , for versions of MySQL that do not allow subselects (see step of ).

View Image

Figure 9.4.13 Intermediate besthits table produced by SQL from Figure .

View Image

Figure 9.4.14 Script used to produce a list of the ten best matches between E. coli and human proteins from the intermediate besthits table shown in Figure .

View Image

Figure 9.4.15 SQL statement used to create a temporary intermediate results table to store the taxon_id of all species in which a homolog to each query was found (see step of ). Bold text represents input; lightface text represents output.

View Image

Figure 9.4.16 SQL statement used to generate the temporary excludes table (see step of ). Bold text represents input; lightface text represents output.

View Image

Figure 9.4.17 SQL statement used to select the count of rows in temp_results where the query_id appears, given the desired taxonomic subsets.

View Image

Figure 9.4.18 Table returned by the query in Figure . The number of rows that this query returns (661) is the number of genes that have hits against proteins in both Bacteria and Eukaryota species, but have no significant hits against proteins from Archaea species.

View Image

Figure 9.4.19 SQL statement used to build a temporary table that contains the ranges of the successful hits, used in step of .

View Image

Figure 9.4.20 SQL statement used in step of , which contains an initial set of intergenic ranges for each query_id.

View Image

Figure 9.4.21 Two SQL statements used for adding the missed classes of beginning and ending “intergenic” DNA sequence to the igranges table (see step of ).

View Image

Figure 9.4.22 SQL statement to add any DNA sequence queries that did not match against anything and that have no rows in the hitranges table (see step of ).

View Image

Figure 9.4.23 Schematic illustration of one possible source of artifactually overlapping ranges; the collection of hits in (A ) lead to two igrange's as shown in (B ). Only the lowermost igrange, marked by the caret, is desired. See step of .

View Image

Figure 9.4.24 Schematic illustration of a second possible source of artifactually overlapping ranges: (A ) the begin and end of two small hits are spanned by a third, larger hit, leading to the ranges shown in (B ).

View Image

Figure 9.4.25 SQL statement for eliminating unwanted ranges from the final set of intergenic ranges.

View Image

Videos

Literature Cited

	Celko, J. 1999. Joe Celko's SQL for Smarties. Morgan Kaufmann, San Francisco.
Internet Resources
	ftp://ftp.ncbi.nih.gov/pub/blast/db/FASTA/nr.gz
	Comprehensive nr database (flat file protein sequence database).
	ftp://ftp.ncbi.nih.gov/pub/blast/db/FASTA/ swissprot.gz
	SwissProt protein database (flat file protein sequence database).
	ftp://ftp.pir.georgetown.edu.pir_databases/ psd/mysql/
	The Protein Identification Resource (PIR) at Georgetown University, which distributes the PIR protein database in relational format for the MySQL database program.