Using BLAT to Find Sequence Similarity in Closely Related Genomes
Abstract
The BLAST-Like Alignment Tool (BLAT) finds genomic sequences that match a protein or DNA sequence submitted by the user. BLAT is typically used to search for similar sequences within the same or closely related species. It was developed to align millions of expressed sequence tags and mouse whole-genome random reads to the human genome at high speed. It is freely available either on the Web or as a downloadable stand-alone program. BLAT search results provide a link for visualization in the University of California, Santa Cruz (UCSC) Genome Browser, where associated biological information may be obtained. Three example protocols are given: using an mRNA sequence to identify the exon-intron locations and associated gene in the genomic sequence of the same species, using a protein sequence to identify the coding regions in a genomic sequence and to search for gene family members in the same species, and using a protein sequence to find homologs in another species. Curr. Protoc. Bioinform. 37:10.8.1-10.8.24. © 2012 by John Wiley & Sons, Inc.
Keywords: sequence similarity; alignment; homology; BLAT; UCSC Genome Browser
Table of Contents
- Introduction
- Basic Protocol 1: Finding the Exon‐Intron Structure of a Gene
- Basic Protocol 2: Mapping a Protein Sequence to the Genome
- Support Protocol 1: Viewing All BLAT Matches on the Same Chromosome Simultaneously in the UCSC Genome Browser
- Basic Protocol 3: Finding a Gene Homolog in the Genome of Another Organism
- Commentary
- Literature Cited
- Figures
Figures
Figure 10.8.1 Web BLAT search screen for . The interface allows the user to specify (from left to right) the genome, assembly, mode of search, desired sorting of the results, and output format. The Genome pull-down menu provides a choice of over 50 species from mammals, fish, invertebrates, yeast, and others. Some, such as human, have more than one genome assembly in the Assembly pull-down menu. The “Query type” pull-down menu selects the mode of the search. The DNA query type used in this protocol searches a DNA query against a DNA database. Additional available options are described in the text. The “Sort output” pull-down menu sorts the results table. The options are “query, score”; “query, start”; “chrom, start”; “chrom, score”; and “score”. The “query, score” option first sorts by query ID (if multiple sequences are pasted into the input box) and then by score. Finally, the “Output type” pull-down menu provides three options for presenting the results: “hyperlink,” “psl,” and “psl no header.” The “hyperlink” output type yields a table with a link (browser) to display each alignment in the UCSC Genome Browser and a link (details) to the details of the alignment. The “psl” output type provides details about mismatches, gaps, and blocks in a tabular format, and does not provide links to alignments or the genome browser. The PSL output format is described in detail in the text. The large text box is for the input sequence. The sequence needs to be in FASTA format, as shown here for the query NM_000531.5, which is used in . Clicking on the “submit” button generates the table of alignments. The “I'm feeling lucky” button goes directly to the genome browser to display the best-scoring alignment of the first input sequence. The “clear” button resets the input text box. The “Browse” and “submit file” buttons are for uploading sequences from a file instead of copying them into the text box. In this protocol, the selected options are “Human” genome, “Feb 2009 GRCh37/hg19” assembly, “DNA” as query type, “query, score” for sorting the results, and “hyperlink” as the output type.
Figure 10.8.2 Results table for . The BLAT search results provide the following columns: (1) ACTIONS, links for visualization of the alignment in the UCSC Genome Browser (browser link) and a more detailed alignment text view (details link); (2) QUERY, an identifier for the query sequence; (3) SCORE, the number of matches with a penalty for mismatches and gaps (see subsection “Score calculation” in the Commentary); (4) START, the location of the beginning of the alignment in the query sequence; (5) END, the location of the end of the alignment in the query sequence; (6) QSIZE, the length of the query sequence; (7) IDENTITY, an indication of the number of matching bases and gaps (see subsection “Percent identity calculation” in the Commentary); (8) CHRO, the chromosome; (9) STRAND, the strand of the alignment; both query strands (‘+’ and ‘-’) are checked, and in the translated alignment mode a second ‘+’ or ‘-’ is provided for the genomic strand; (10) START, the location of the beginning of the alignment in the genome sequence; (11) END, the location of the end of the alignment in the genome sequence; and (12) SPAN, the number of bases on the genome covered by the alignment. The information in the table, such as score, span, and identity, indicates the extent of the match. The results in this protocol are sorted by score; the top result has a much higher score than the others. The first row in the results shows that the QUERY NM_000531.5 matches the human genome with a score of 1638 (SCORE column) from its nucleotide 1 to 1647 (START and END columns next to the SCORE column). The query size is 1647 (QSIZE column). Thus, the entire query is covered in the human genome with 100% identity (IDENTITY column). This alignment is on chromosome X (CHRO column), on the plus/forward strand (STRAND column) from nucleotide 38211736 to 38280703 (START and END columns next to the STRAND column), covering a range of 68968 nucleotides (SPAN column).
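The arithmetic behind the SPAN column and the query coverage in this legend can be checked with a short Python sketch. The `BlatHit` class and its field names are hypothetical and only mirror the column labels above; they are not part of any BLAT or UCSC software. The sketch assumes that START and END are 1-based inclusive coordinates, which is consistent with the numbers quoted for the first row.

```python
from dataclasses import dataclass


@dataclass
class BlatHit:
    """One row of the results table (hypothetical container, not a BLAT API)."""
    score: int
    q_start: int   # START in the query
    q_end: int     # END in the query
    q_size: int    # QSIZE
    chrom: str     # CHRO
    strand: str    # STRAND
    t_start: int   # START in the genome
    t_end: int     # END in the genome

    def span(self) -> int:
        # Number of genomic bases covered, assuming inclusive coordinates.
        return self.t_end - self.t_start + 1

    def query_coverage(self) -> float:
        # Percentage of the query that takes part in the alignment.
        return 100.0 * (self.q_end - self.q_start + 1) / self.q_size


# Values quoted in the legend for the first row of the table.
top_hit = BlatHit(score=1638, q_start=1, q_end=1647, q_size=1647,
                  chrom="X", strand="+", t_start=38211736, t_end=38280703)

print(top_hit.span())            # 68968
print(top_hit.query_coverage())  # 100.0
```

The span is far larger than the query size because the genomic interval includes introns as well as the aligned exons.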
Figure 10.8.3 Detailed alignment information for a part of the query cDNA sequence in . Capital blue letters indicate nucleotides in the cDNA sequence NM_000531.5 that match the human genomic sequence. Since, in this alignment, the query is fully covered with 100% identity, each nucleotide in the cDNA is capitalized. Light blue letters indicate where the blocks of the query sequence begin and end on the aligned genomic sequence, thus indicating the start and end positions of exons. The query is aligned to the genome in 10 blocks, as listed on the left bar of the page. Each block represents an exon; thus, there are 10 exons in the cDNA NM_000531.5.
Figure 10.8.4 Detailed alignment information for a part of the aligned target genome sequence (chromosome X) in . Upstream non-aligned bases are shown in lowercase black letters; the bases of the first block of aligned sequence (an exon) are shown in uppercase blue letters, followed by another non-aligned block in lowercase black letters, an intron. The start and end of the exon are shown in light blue. Details can be found in the text.
Figure 10.8.5 Detailed side-by-side alignment information for the query cDNA and target genome sequence for the first match in . The figure shows the first block of the first alignment between the query and target genome divided into sections of 50 nucleotides. In each section, the top line represents the query cDNA NM_000531.5, and the line beneath it represents the human genomic DNA, chromosome X. In this block, nucleotides 1 to 291 of the NM_000531.5 cDNA query align with human chromosome X nucleotides 38211736 to 38212026 (in the Feb 2009 GRCh37/hg19 assembly). The identity, shown by a vertical bar between the query and genome nucleotides, is 100%. Details can be found in the text.
Figure 10.8.6 UCSC Genome Browser display of the first match in , with the default gene annotation track added. The position/search box lists the displayed region coordinates, chromosome X: 38211736 to 38280703, corresponding to the first match. This region is depicted by a red rectangle in the p arm of the chromosome X ideogram. The box below the ideogram displays two tracks. The top track, labeled “Your Sequence from BLAT Search”, shows the alignment of the query NM_000531.5. Each thick bar represents an alignment block identified in the “details” view in Figure . The second track is the UCSC gene track (labeled “UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics”). This track identifies the gene, OTC (the symbol written on the left side of the track), associated with the query transcript NM_000531.5. The exons in the gene, indicated by thick bars on the gene line, match the thick bars on the query line above. The gene is annotated on the plus strand, as indicated by the >>> symbols on the line representing it. Information about the gene can be obtained by clicking on the gene name, OTC. Details can be found in the text.
Figure 10.8.7 The BLAT search screen for . Note that the option “protein” is selected for the Query type. Refer to the legend for Figure for additional information on the content of each search menu option.
Figure 10.8.8 The results of the BLAT search, , shown in Figure . The first hit has 100% identity (IDENTITY column) on chromosome 11. Other hits also have high identity and long alignments. All of the hits in this example are on chromosome 11. Refer to the legend for Figure and text for additional information on the content of each column of the display.
Figure 10.8.9 Detailed alignment information for the first match of the query protein and chromosome 11 sequences in . The protein query NP_000509.1 sequence is listed at the top, and the region of the chromosome 11 sequence that aligned to the query is shown below it. In this match, all of the query sequence was aligned to the (translated) genome, so all of the letters in the query sequence are capitals and colored blue. In the genomic DNA sequence below, the nucleotides whose translation aligned to the protein sequence are shown in capital letters and colored blue. Refer to the legends for Figures and and to the text for additional information on the content of the display.
Figure 10.8.10 Detailed side-by-side alignment information for the query protein and genome sequences in . The result shows that the alignment is in three blocks, indicating that the gene has three coding exons. Each alignment block is separated by a horizontal line and is divided into sections of 60 coordinates. In each section, the first sequence line provides the query amino acid sequence, and the second sequence line gives the nucleotide sequence of the aligning genome. For both lines, the starting and ending coordinates are at the nucleotide level. Thus, in the top section of the first block, the first line shows the sequence of the first 20 amino acids in the query. The coordinates for this line are 1 to 60, corresponding to the codons for the first 20 amino acids. The second sequence line shows the nucleotides with starting and ending coordinates (5248251 to 5248192) on the aligning chromosome, 11. The second section of this block shows the alignment of query amino acids 21 to 30 (again, the coordinates for the top line, 61 to 90, are for the nucleotides in the codons) to nucleotides 5248191 to 5248162 on chromosome 11. See the text for details.
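The coordinate bookkeeping in this legend (amino acid positions reported as nucleotide-level coordinates, and genome coordinates that decrease along the block) can be reproduced with a minimal Python sketch. Both function names are hypothetical. The second function assumes a gap-free block in which the genome coordinate decreases by one per query nucleotide, which is what the quoted numbers imply for this block; it is not a general description of how BLAT computes coordinates.

```python
def aa_to_query_nt(aa_first: int, aa_last: int) -> tuple[int, int]:
    """Convert a 1-based inclusive amino acid range to the nucleotide-level
    coordinates of its codons (three nucleotides per amino acid)."""
    return 3 * (aa_first - 1) + 1, 3 * aa_last


def descending_genome_range(block_genome_start: int,
                            nt_first: int, nt_last: int) -> tuple[int, int]:
    """Genome coordinates for a query nucleotide range inside one gap-free
    block whose genome coordinates run downward from block_genome_start
    (assumption: block_genome_start aligns to query nucleotide 1)."""
    return (block_genome_start - (nt_first - 1),
            block_genome_start - (nt_last - 1))


# First section of the first block: amino acids 1 to 20.
nt_a = aa_to_query_nt(1, 20)
print(nt_a)                                     # (1, 60)
print(descending_genome_range(5248251, *nt_a))  # (5248251, 5248192)

# Second section: amino acids 21 to 30.
nt_b = aa_to_query_nt(21, 30)
print(nt_b)                                     # (61, 90)
print(descending_genome_range(5248251, *nt_b))  # (5248191, 5248162)
```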
Figure 10.8.11 UCSC Genome Browser display of the aligned genome region for the first match in . The region displayed in this view comprises nucleotides 5246831 to 5248251 of chromosome 11. The query sequence alignment is represented in the top track and the UCSC gene track is represented in the second line. A label to the left of this track indicates the symbol for the gene, HBB, represented in that track. Refer to the legend for Figure and the text for additional information on the content of the display.
Figure 10.8.12 The BLAT search screen for the . Note that the “sort output” selection is “chrom, start”. Refer to the legend for Figure for additional information on the content of each search menu option.
Figure 10.8.13 Results of the BLAT search in Figure . Since the sorting option “chrom, start” was used, the results are sorted by their position on the chromosome, as opposed to the “query, score” sorting in Figure . In this case, “start” refers to the column START after the STRAND column. Note that the hits are on chromosome 11 between positions 5246831 and 5290908. Refer to the legend for Figure and the text for additional information on the content of each column of the display.
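The difference between the “query, score” and “chrom, start” orderings, and the way a single browser window covering all hits can be derived from the sorted table, can be illustrated with a small Python sketch. The hit records below are invented toy values, not results from this search, and the dictionary keys are hypothetical names for the table columns. The sketch also assumes that “query, score” lists the highest score first.

```python
from operator import itemgetter

# Toy hits with made-up coordinates (illustration only, not BLAT output).
hits = [
    {"query": "q1", "score": 310, "chrom": "chr2", "start": 9000, "end": 9800},
    {"query": "q1", "score": 420, "chrom": "chr1", "start": 5200, "end": 6400},
    {"query": "q1", "score": 150, "chrom": "chr1", "start": 1000, "end": 1900},
]

# "query, score": group by query ID, then highest score first (assumed).
by_query_score = sorted(hits, key=lambda h: (h["query"], -h["score"]))

# "chrom, start": order by chromosome, then by genomic START.
# Chromosome names are compared as plain strings in this sketch.
by_chrom_start = sorted(hits, key=itemgetter("chrom", "start"))

# Smallest window on one chromosome that contains all of its hits.
chr1_hits = [h for h in by_chrom_start if h["chrom"] == "chr1"]
window = (min(h["start"] for h in chr1_hits),
          max(h["end"] for h in chr1_hits))

print([h["score"] for h in by_query_score])  # [420, 310, 150]
print([h["start"] for h in by_chrom_start])  # [1000, 5200, 9000]
print(window)                                # (1000, 6400)
```

Sorting by genomic position makes the lowest START and the highest END easy to read off the table, which is how the range of positions quoted in the legend would be obtained.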
Figure 10.8.14 UCSC Genome Browser display of the aligned genome region in the search described in the . After changing the view in the genome browser to include positions 5246831 and 5290908, the range identified in the previous figure, the browser shows genes corresponding to all six matches: HBB, HBD, HBBP1, HBG1, HBG2, and HBE1. Refer to the legend for Figure and the text for additional information on the content of the display.
Figure 10.8.15 The BLAT search screen for . When the Genome was changed to Chimp, the Assembly changed automatically. Refer to the legend for Figure for additional information on the content of each search menu option.
Figure 10.8.16 The results of the BLAT search shown in Figure . The first match has a much higher score than the second, and it matches over a longer span. Based on the columns CHRO, STRAND, START, and END, the second match lies entirely within the span of the first. Refer to the legend for Figure and the text for additional information on the content of each column of the display.
Figure 10.8.17 Detailed alignment information for the first match of the query protein and the chimp genome sequences in . The top section of the details page shows, in capital blue letters, the portion of the queried human protein that matched the chimp genome. Below that is the section for the chimp chromosome X sequence, again showing the aligning sequence in capital blue letters. The lowercase black letters indicate regions that are not aligned. The red arrow points to the amino acid threonine, shown as a lowercase black letter “t” at the 125th position of the query NP_000522.3, indicating a mismatch at that position with respect to the translation of the genome sequence. Refer to the legends for Figures and , and to the text, for additional information on the content of the display.
Figure 10.8.18 Detailed side-by-side alignment information for a portion of the first match of the query protein and chimp genome sequences in . Exact matches between the query and genome sequences are shown by a vertical line. Mismatches are shown by the letter code for the amino acid encoded by the aligned genomic sequence. To illustrate, the mismatch between the 125th amino acid threonine in the human query protein and methionine encoded by the chimp genome at the corresponding position is highlighted by a red rectangle. Refer to the legend for Figure and the text for additional information on the content of the display.
Figure 10.8.19 Results of the first match of displayed as a custom track in the UCSC Genome Browser. The track is labeled “User Supplied Track”. Instructions to generate a custom track using the PSL output format are described in the text.
Literature Cited
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
Bina, M. 2006. Identification and mapping of paralogous genes on a known genomic DNA sequence. In Methods in Molecular Biology, Vol. 338: Gene Mapping, Discovery, and Expression (M. Bina, ed.) pp. 21–29. Humana Press, Totowa, N.J.
Harper, C., Huang, C., Stryke, D., Kawamoto, M., Ferrin, T., and Babbitt, P. 2006. Comparison of methods for genomic localization of gene trap sequences. BMC Genomics 7:236.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Res. 12:656–664.
Morgulis, A., Coulouris, G., Raytselis, Y., Madden, T.L., Agarwala, R., and Schäffer, A.A. 2008. Database indexing for production MegaBLAST searches. Bioinformatics 24:1757–1764.
Yavatkar, A., Lin, Y., Ross, J., Fann, Y., Brody, T., and Odenwald, W. 2008. Rapid detection and curation of conserved DNA via enhanced-BLAT and EvoPrinterHD analysis. BMC Genomics 9:106.
Key Reference
Kent, W.J. 2002. See above.
The original article by the author of BLAT discusses the rationale and algorithms used in its development.
Internet Resources
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#BLATAlign
UCSC online documentation for BLAT.
http://genome.ucsc.edu/goldenPath/help/blatSpec.html
BLAT program specifications and user guide.
http://genome.ucsc.edu/FAQ/FAQblat.html
FAQ about BLAT.
http://www.genome.gov/EdKit/pdfs/glossary_final-linked.pdf
A glossary of useful terms for understanding genomic research.
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST Web site.