Using Relational Databases for Improved Sequence Similarity Searching and Large‐Scale Genomic Analyses
互联网
- Abstract
- Table of Contents
- Figures
- Literature Cited
Abstract
Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome?scale analyses. Relational databases are essential for management and analysis of large?scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large?scale genomic analyses of homology?related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.
Keywords: relational database; sequence similarity; comparative genomic analysis; homology
Table of Contents
- Basic Protocol 1: Installing and Populating the seqdb_demo Relational Database
- Basic Protocol 2: Extracting Sequences from seqdb_demo for Similarity Searching to Improve Homolog Searching
- Basic Protocol 3: Storing Similarity Search Results in seqdb_demo
- Basic Protocol 4: Analyzing Similarity Search Results: Identifying Ancient Proteins
- Basic Protocol 5: Analyzing Similarity Search Results: Taxonomic Subsets
- Basic Protocol 6: Analyzing Similarity Search Results: Intergenic Regions
- Commentary
- Literature Cited
- Figures
- Tables
Materials
Figures
-
Figure 9.4.1 A schema for protein sequence data. Each of the boxes represents one of the tables in the seqdb_demo database. Sequences are stored in the protein table, their descriptions and accession information are stored in the annot table, and taxonomic information is stored in the taxon and taxon_name tables. The links between the tables are shown with dashed lines. The symbols at the ends of the line indicate the type of relationship; e.g., the protein:annot relationship is a one‐to‐many relationship; each protein sequence can have many descriptions or annotations but an annotation refers to only one protein sequence. The abbreviations to the left of the table entry names indicate whether the entry is a primary key (PK) or foreign key (FK, a foreign key in one table is a primary key in another, and allows the information in the two tables to be “joined”), or if the entry is indexed (IX) for rapid lookup. View Image -
Figure 9.4.2 (A ) List of tables in the database created in , step , retrieved via the command. (B ) Description of columns in the database, retrieved via the command. View Image -
Figure 9.4.3 (A ) Number of protein sequences loaded into database from the nr sequence library, retrieved via the command. (B ) Number of different descriptions loaded into the database from nr, retrieved via the command. (C ) Information on a single protein, retrieved via the command. (D ) All annotations of a protein, retrieved via the command. View Image -
Figure 9.4.4 (A ) Total number of taxa loaded from the NCBI Taxonomy database, retrieved via the command. (B ) NCBI's ID for human, retrieved via the command. View Image -
Figure 9.4.5 SQL code used to generate a library of human sequences (note the space following the fourth “ ” symbol). View Image -
Figure 9.4.6 FASTA‐formatted human sequences, printed to human.lib. View Image -
Figure 9.4.7 SQL code used to generate a library of mammalian sequences from seqdb_demo. View Image -
Figure 9.4.8 A schema for similarity search results. Each of the boxes represents one of the tables used to collect alignment data in the seqdb_demo database. The search table records the parameters of the search; search_query and search_lib record information about the query and library sequences used for the search, and the search_hit table records the scores and boundaries of alignments between query and library sequences. The links between tables, primary keys (PK), and foreign keys (FK) are indicated as in Figure . View Image -
Figure 9.4.9 SQL statements to confirm successful importing of results. Bold text represents input; lightface text represents output. View Image -
Figure 9.4.10 List of highest‐scoring E. coli homologs to human sequences, obtained via the commands shown in step of . View Image -
Figure 9.4.11 SQL statement to identify human sequences involved in alignments from from step of , for a database system that allows subselects (see step of ). View Image -
Figure 9.4.12 SQL statement to identify human sequences involved in alignments from step of , for versions of MySQL that do not allow subselects (see step of ). View Image -
Figure 9.4.13 Intermediate besthits table produced by SQL from Figure . View Image -
Figure 9.4.14 Script used to produce a list of the ten best matches between E. coli and human proteins from the intermediate besthits table shown in Figure . View Image -
Figure 9.4.15 SQL statement used to create a temporary intermediate results table to store the taxon_id of all species in which a homolog to each query was found (see step of ). Bold text represents input; lightface text represents output. View Image -
Figure 9.4.16 SQL statement used to generate the temporary excludes table (see step of ). Bold text represents input; lightface text represents output. View Image -
Figure 9.4.17 SQL statement used to select the count of rows in temp_results where the query_id appears, given the desired taxonomic subsets. View Image -
Figure 9.4.18 Table returned by the query in Figure . The number of rows that this query returns (661) is the number of genes that have hits against proteins in both Bacteria and Eukaryota species, but have no significant hits against proteins from Archaea species. View Image -
Figure 9.4.19 SQL statement used to build a temporary table that contains the ranges of the successful hits, used in step of . View Image -
Figure 9.4.20 SQL statement used in step of , which contains an initial set of intergenic ranges for each query_id. View Image -
Figure 9.4.21 Two SQL statements used for adding the missed classes of beginning and ending “intergenic” DNA sequence to the igranges table (see step of ). View Image -
Figure 9.4.22 SQL statement to add any DNA sequence queries that did not match against anything and that have no rows in the hitranges table (see step of ). View Image -
Figure 9.4.23 Schematic illustration of one possible source of artifactually overlapping ranges; the collection of hits in (A ) lead to two igrange's as shown in (B ). Only the lowermost igrange, marked by the caret, is desired. See step of . View Image -
Figure 9.4.24 Schematic illustration of a second possible source of artifactually overlapping ranges: (A ) the begin and end of two small hits are spanned by a third, larger hit, leading to the ranges shown in (B ). View Image -
Figure 9.4.25 SQL statement for eliminating unwanted ranges from the final set of intergenic ranges. View Image
Videos
Literature Cited
Celko, J. 1999. Joe Celko's SQL for Smarties. Morgan Kaufmann, San Francisco. | |
Internet Resources | |
ftp://ftp.ncbi.nih.gov/pub/blast/db/FASTA/nr.gz | |
Comprehensive nr database (flat file protein sequence database). | |
ftp://ftp.ncbi.nih.gov/pub/blast/db/FASTA/ swissprot.gz | |
SwissProt protein database (flat file protein sequence database). | |
ftp://ftp.pir.georgetown.edu.pir_databases/ psd/mysql/ | |
The Protein Identification Resource (PIR) at Georgetown University, which distributes the PIR protein database in relational format for the MySQL database program. |