Modeling Biology Using Relational Databases
互联网
- Abstract
- Table of Contents
- Materials
- Figures
- Literature Cited
Abstract
There are several different methodologies that can be used for designing a database schema; no one is the best for all occasions. This unit demonstrates two different techniques for designing relational tables and discusses when each should be used. These two techniques presented are (1) traditional Entity?Relationship (E?R) modeling and (2) a hybrid method that combines aspects of data warehousing and E?R modeling. The method of choice depends on (1) how well the information and all its inherent relationships are understood, (2) what types of questions will be asked, (3) how many different types of data will be included, and (4) how much data exists.
Table of Contents
- Basic Protocol 1: Using Entity‐Relationship Modeling to Design a Database
- Alternate Protocol 1: Using the Hybrid Method to Design a Database
- Commentary
- Literature Cited
- Figures
Materials
Basic Protocol 1: Using Entity‐Relationship Modeling to Design a Database
Necessary Resources
Alternate Protocol 1: Using the Hybrid Method to Design a Database
Necessary Resources
|
Figures
-
Figure 9.3.1 Flowchart outlining steps for designing table schema in Entity‐Relationship modeling, as described in the . View Image -
Figure 9.3.2 A small portion of data typically obtained from gene expression studies. The full dataset contains the name, an identifier (accession number), and a description for every probe on the chip; the organism and tissue from which the sample was obtained; how the sample was treated; and the fold change in the gene expression versus the normal tissue. The identifier uniquely specifies a gene, even across organisms. View Image -
Figure 9.3.3 A single table is created containing all of the different types of information obtained from the gene expression study. Often, this is done as a mental exercise. In this and subsequent figures showing tables, the table name is shown in the gray area at the top of the box representing a table. In the area below the gray area are the names of the columns used in the table's primary key, these will have a PK preceding them. In some of the figures, this will be blank depending on the step in the protocol. The names of all other columns in the table will be found in the bottom portion. Columns involved in foreign keys are indicated by a FK# where the # is a number that differentiates between the different foreign keys. Note that the naming scheme used in these figures does not correspond to any database product. Please refer to individual database product documentation for the proper naming of tables and columns. It is standard practice to use the singular form of a word when naming tables and columns. View Image -
Figure 9.3.4 Together, the columns gene id , tissue , and stimulus uniquely identify a row in the table. Accordingly, they are designated the primary key as indicated by the PK to the left of the column names. View Image -
Figure 9.3.5 A related set of columns that are attributes of only part of the compound (multiple column) primary key have been removed from the original table and placed in a new table gene . The prefix gene has been dropped from the column names for brevity. View Image -
Figure 9.3.6 The portion of the original table's primary key, gene id , described by the columns in the new table becomes the primary key of the new table. View Image -
Figure 9.3.7 A foreign key is set up between the gene id column in the expression table (the child table) and the gene id column in the gene table (the parent table). A foreign key is used to ensure data integrity between two tables; the only values allowed in the gene id column in the expression table are those that are found in the gene id column in the gene table. The arrow between the two tables indicates the parent‐child relationship between the two tables; the arrow points from the child to the parent. The FK1 in the expression table indicates the child column in the child table. View Image -
Figure 9.3.8 The columns organism scientific name and organism common name in the table expression are interdependent: they describe each other. Organism common name is moved to a new table organism . Arbitrarily, organism scientific name is selected to be the primary key in the new table organism and remains in the table expression . The organism prefix is removed from the column names for brevity. View Image -
Figure 9.3.9 A foreign key is set up from scientific name in the expression table to scientific name in the organism table as indicated by the arrow between the two tables and the FK2 in the expression table. View Image -
Figure 9.3.10 The SQL statements to create the tables in MySQL for the schema shown in Figure . To implement this schema in a different DBMS, the column types (e.g., text ) will need to be adapted appropriately. Please refer to individual DBMS documentation for supported data types. View Image -
Figure 9.3.11 The SQL statement to reproduce the spreadsheet shown in Figure based on the schema shown in Figure . View Image -
Figure 9.3.12 The SQL query to answer the question, “What genes have an expression level greater or equal to two‐fold in humans?” View Image -
Figure 9.3.13 Results to the SQL query shown in Figure obtained from a database implemented in MySQL. View Image -
Figure 9.3.14 The SQL query to answer the question, “What genes are differentially expressed in human tissues and what are the stimuli that caused the differential expression?” The order by clause uses position numbers since the select clause has an expression in it. The position numbers correspond to the columns named in the select clause: g.defn is in position 1 and e2.stimulus is in position 5. View Image -
Figure 9.3.15 Results to the SQL query shown in Figure . View Image -
Figure 9.3.16 Flowchart outlining steps for designing table schema for hybrid method modeling, as described in . View Image -
Figure 9.3.17 A small portion of typical sequence annotation obtained from public data sources. The full dataset contains the gene name, a sequence identifier (accession number), a description of the gene, the name of the repository where the identifier can be used to retrieve the sequence, and the tissue and organism from which the sequence was putatively obtained. View Image -
Figure 9.3.18 A single table is created containing all of different types of information obtained from the gene expression study. Often, this is done as a mental exercise. The abbreviation seq is used for the table name since sequence is a reserved word in many implementations of SQL. View Image -
Figure 9.3.19 Related sets of repeating data (entities) have been identified and moved into their own entity tables. View Image -
Figure 9.3.20 (A ) The general form of an entity table used in the hybrid method. It consists of a numeric primary key and 1 to N attributes. (B ) Two examples of entity tables. Redundancy in the table is either accepted or reduced through the liberal use of unique indices other than the primary key. View Image -
Figure 9.3.21 Each table is given a numeric primary key. This numeric identifier uniquely identifies a row in the table and is used to relate information in the table to information in a different table. View Image -
Figure 9.3.22 Data redundancy within a table is eliminated by identifying a column or a set of columns that uniquely identifies a row and then using them in a unique index, not in the primary key. A table will often have more than one unique index. View Image -
Figure 9.3.23 The figure shows the two different ways that a sequence can be linked to an organism and tissue. (A ) A sequence can be linked independently an organism and a tissue. This can lead to combinations of organism and tissue that might not have been found experimentally. (B ) A sequence is linked to an organism and a tissue. View Image -
Figure 9.3.24 (A ) The general form of a relationship table used in the hybrid method is shown. It consists of numeric columns corresponding to the primary keys of the entity tables in the relationship. All columns in a relationship table comprise its primary key and are linked back to their respective parent tables by foreign keys. (B ) An example of relationship tables is shown. View Image -
Figure 9.3.25 Relationship tables are created by linking related entities. In this schema, it is assumed that there is no direct relationship between organism and tissue . The primary key in each relationship table is comprised of all of the columns in the table. Foreign keys are set up between each relationship table and the entity tables involved in the relationship. View Image -
Figure 9.3.26 Relationship tables are created by linking related entities. In this schema, it is assumed that the organism and tissue tables are directly related. This is shown in the relationship table seq2organismtissue which contains three columns, one for each entity in the relationship. A comparison of this schema to the one in Figure shows its flexibility: the base tables seq , seq source , tissue and organism and the relationship table seq2seq source remain untouched. View Image -
Figure 9.3.27 This figure shows a second way of setting up the sequence‐organism‐tissue relationship. This form sets up a relationship between organism and tissue called library with its own primary key. This new entity is then related to sequence in seq2library and can be used in other relationships. View Image -
Figure 9.3.28 The SQL statements to create the base tables seq , seq_source , tissue , and organism in MySQL for the schemas shown in Figures , , and . To implement this schema in a different DBMS, the column types (e.g., bigint ) will need to be adapted appropriately. Please refer to specific DBMS documentation for supported data types. View Image -
Figure 9.3.29 The SQL statements to create the relationship tables in the schema where organism and tissue are related as shown in Figure . View Image -
Figure 9.3.30 The SQL statements to create the relationship tables in the schema where organism and tissue are related as shown in Figure . View Image -
Figure 9.3.31 The SQL statement to reproduce the spreadsheet shown in Figure . View Image -
Figure 9.3.32 The SQL query to return only those genes found in humans based on the schema shown in Figure . View Image -
Figure 9.3.33 Results to the SQL query shown in Figure obtained from a database implemented in MySQL. Name is shown as a column header since it is found in two tables and both are listed in the select clause of the query. View Image -
Figure 9.3.34 The SQL query to show all of the annotation in the database for the sequence with the identifier (accession number) nm_005229. Note that a DBMS query engine will consider nm_005229 and NM_005229 to be different entries. Care must be taken prior to loading the data to ensure that all identifiers are either all uppercase or all lowercase. View Image -
Figure 9.3.35 Results to the SQL query shown in Figure . View Image -
Figure 9.3.36 The SQL query to retrieve information on the gene with the name pepck regardless of the organism. As mentioned previously, the DBMS query engine will consider PEPCK or any variation with different upper and lowercase letters to be different from the all lowercase pepck . View Image -
Figure 9.3.37 Results to the SQL query shown in Figure . View Image -
Figure 9.3.38 The SQL query to find all genes in humans that are not found in mice. This query uses the group function minus which returns all entries in the first query that are not found in the second query. This function along with intersect and subselects are scheduled to be implemented in version 4.1 of MySQL. View Image -
Figure 9.3.39 Results to the SQL query shown in Figure obtained from a database implemented in Oracle. View Image -
Figure 9.3.40 The four types of relationships: 1‐to‐1, 1‐to‐N, M‐to‐1, and M‐to‐N. The crows‐feet indicate the item with the multiplicity. There are many more symbols used, these are the most common. View Image
Videos
Literature Cited
Literature Cited | |
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410. | |
Codd, E.F. 1990. The Relational Model for Database Management; Version 2. Addison‐Wesley Publishing, New York. | |
Date, C.J. 1995. An Introduction to Database Systems, 6th ed. Addison‐Wesley Publishing, New York. | |
Dodge, G. and Gorman, T. 1998. Oracle8 Data Warehousing. John Wiley & Sons, New York. | |
The Gene Ontology Consortium. 2000. Gene ontology: Tool for the unification of biology. Nature Gen. 25:25‐29. | |
Inmon, W.H., Imhoff, C., and Battas, G. 1995. Building the Operational Data Store. John Wiley & Sons, New York. | |
Kimball, R. 1996. The Data Warehouse Toolkit. John Wiley & Sons, New York. | |
Yarger, R.J., Reese, G., and King, T. 1999. MySQL & mSQL. O'Reilly & Associates, CA. |