
RNA‐Seq Read Alignments with PALMapper


Abstract

 

Next‐generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA‐Seq experiments are able to generate huge amounts of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error‐prone. To utilize such reads for transcriptome reconstruction and gene‐structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. In this unit, we describe PALMapper, a fast and easy‐to‐use tool that is designed to accurately compute both unspliced and spliced alignments for millions of RNA‐Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read‐quality information and predictions of splice sites to improve the alignment accuracy. The PALMapper package is available as a command‐line tool running on Unix or Mac OS X systems or through a Web interface based on Galaxy tools. Curr. Protoc. Bioinform. 32:11.6.1‐11.6.37. © 2010 by John Wiley & Sons, Inc.

Keywords: RNA‐Seq; sequence alignment; splice‐site prediction; PALMapper; QPALMA; GenomeMapper; Galaxy

     
 
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Table of Contents

  • Introduction
  • Basic Protocol 1: Training QPALMA on the Command Line
  • Support Protocol 1: Installing Software
  • Alternate Protocol 1: Training QPALMA Using Galaxy Tools
  • Basic Protocol 2: Generating Alignments with PALMapper on the Command Line
  • Support Protocol 2: Predicting Splice Sites with mGene Using Galaxy Tools
  • Alternate Protocol 2: Generating Alignments with PALMapper Using Galaxy Tools
  • Basic Protocol 3: Evaluating Alignments on the Command Line
  • Alternate Protocol 3: Evaluating Alignments Using Galaxy Tools
  • Basic Protocol 4: Visualizing Results on GBrowse
  • Alternate Protocol 4: Visualizing Results on Galaxy Trackster
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 


Figures

  •   Figure 11.6.1 Example of the SRX001872_read_sample.fastq input file. Each read to align is described on four rows in the Sanger FASTQ format.
  •   Figure 11.6.2 Example of a genome annotation file in GFF3 format. See Table for details about GFF3 format.
  •   Figure 11.6.3 Example of the donor splice‐site prediction file in SPF format. Each line gives the chromosome name, the name of the signal (here don for donor splice site), the type of score, the nucleotide position in the chromosome, the strand, and the score.
  •   Figure 11.6.4 Example of a script to generate QPALMA training parameters. These parameters are detailed in Table . To train QPALMA without splice‐site predictions, the user has to replace the two lines concerning splice‐site prediction files with the following one: export QPALMA_enable_splice_scores=False # no splice site predictions.
  •   Figure 11.6.5 Example of a QPALMA parameter file generated during the training phase of QPALMA. The first lines, starting with ## characters, describe the parameters used for generating this file. The following lines represent the piece‐wise linear functions for intron length (h), donor splice sites (d), acceptor splice sites (a), and edit operations for match/mismatch/gap on read (q). Each piece‐wise linear function is defined by the range of possible x values (length for h, splice‐site predictions for d and a, quality for q), which is followed by the values for all support points (first the x values, separated by commas, then, after a space, the y values, also separated by commas). mmatrix defines fixed scores for a gap on DNA according to the possible aligned base in the read. Finally, prb_offset is the quality offset for determining the quality value from the ASCII quality character.
  •   Figure 11.6.6 Example of an alignment file in SAM format. The read string in column 10 and the quality string in column 11 have been shortened for reasons of presentation; in the actual file, these columns contain the full strings. A full description of the format is given in Li et al. (2009).
  •   Figure 11.6.7 Wiggle (variableStep) files are plain text files. They always begin with a track header line containing the required attribute type, which must have the value wiggle_0. On the same line, two optional attributes are suggested, “name” and “description”; these give the name and description of the data, respectively, and become the “Name” and “Note” fields of the generated GFF3 feature. The data are introduced by a line beginning with the keyword variableStep and the argument chrom, which indicates the chromosome on which the features are located. This is followed by a series of two‐element lines giving the start position of each feature and its coverage score.
  •   Figure 11.6.8 Example of output alignment file generated by PALMapper in BED format.
  •   Figure 11.6.9 Graphical evaluation of comparison to the optimal alignment.
  •   Figure 11.6.10 Text output for evaluation of comparison to the optimal alignment.
  •   Figure 11.6.11 Graphical evaluation of alignment error distribution.
  •   Figure 11.6.12 Screenshot showing read coverage (xy plot) on chromosome II:1778600..1788500 of the Caenorhabditis elegans genome with annotated gene models in GBrowse. In the xy plot, the x and y axes represent the genomic position and read coverage score, respectively.
  •   Figure 11.6.13 Screenshot showing read coverage (xy plot) on chromosome II:7991695..8557597 of the Caenorhabditis elegans genome with annotated transcript models in Galaxy Trackster.
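
The figure captions above describe two simple text formats: four‐line Sanger FASTQ records, whose ASCII quality characters are converted to quality values by subtracting an offset (33 for Sanger encoding), and variableStep Wiggle files with position/score lines. The following Python sketch shows one way such files might be read. It is not part of the PALMapper package: the function names read_fastq and read_variable_step are hypothetical, the example records are made up, and the code assumes well‐formed, unwrapped input.

```python
import io


def read_fastq(handle, quality_offset=33):
    """Yield (read_id, sequence, qualities) from a four-line FASTQ stream.

    quality_offset=33 is the Sanger convention; other encodings differ.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (a blank line also stops the sketch)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # third row: separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        # Convert each ASCII quality character to a numeric quality value.
        yield header[1:], sequence, [ord(c) - quality_offset for c in quality]


def read_variable_step(handle):
    """Yield (chrom, position, score) from a variableStep Wiggle stream."""
    chrom = None
    for line in handle:
        line = line.strip()
        if not line or line.startswith("track") or line.startswith("#"):
            continue  # skip blank lines, the track header, and comments
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=II"
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            continue
        position, score = line.split()
        yield chrom, int(position), float(score)


if __name__ == "__main__":
    # Made-up records for illustration only.
    fastq_text = "@read1\nACGT\n+\nIIII\n"
    wiggle_text = (
        'track type=wiggle_0 name="coverage" description="read coverage"\n'
        "variableStep chrom=II\n"
        "1778600 3\n"
        "1778601 5\n"
    )
    for record in read_fastq(io.StringIO(fastq_text)):
        print(record)
    for entry in read_variable_step(io.StringIO(wiggle_text)):
        print(entry)
```

Both readers are generators, so large read or coverage files can be processed one record at a time without loading them fully into memory.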


Literature Cited

   Blankenberg, D., Taylor, J., Schenck, I., He, J., Zhang, Y., Ghent, M., Veeraraghavan, N., Albert, I., Miller, W., Makova, K., Hardison, R., and Nekrutenko, A. 2007. A framework for collaborative analysis of ENCODE data: Making large‐scale analyses biologist‐friendly. Genome Res. 17:960‐964.
   Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., and Taylor, J. 2010. Galaxy: A Web‐based genome analysis tool for experimentalists. Curr. Protoc. Mol. Biol. 89:19.10.1‐19.10.21.
   De Bona, F., Ossowski, S., Schneeberger, K., and Rätsch, G. 2008. Optimal spliced alignments of short sequence reads. Bioinformatics 24:i174‐i180.
   Hillier, L.W., Reinke, V., Green, P., Hirst, M., Marra, M.A., and Waterston, R.H. 2009. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 19:657‐666.
   Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25:2078‐2079.
   Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., and Weigel, D. 2009. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10:R98.
   Schweikert, G., Zeller, G., Zien, A., Behr, J., Ong, C.‐S., Philips, P., Bohlen, A., Sonnenburg, S., and Rätsch, G. 2009a. mGene: A novel discriminative gene finding system. Genome Res. 19:2133‐2143.
   Schweikert, G., Behr, J., Zien, A., Zeller, G., Sonnenburg, S., and Rätsch, G. 2009b. mGene.web: A web service for accurate computational gene finding. Nucleic Acids Res. 37:W312‐W316.
   Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., and Rätsch, G. 2007. Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(Suppl. 10):S7.
   Trapnell, C., Pachter, L., and Salzberg, S.L. 2009. TopHat: Discovering splice junctions with RNA‐Seq. Bioinformatics 25:1105‐1111.
Internet Resources
   http://www.fml.mpg.de/raetsch/suppl/palmapper/tutorial
   Supporting Web page for the tutorial on the material in this unit.
   http://www.fml.mpg.de/raetsch/suppl/palmapper
   PALMapper project Web page.
   http://www.fml.mpg.de/raetsch/suppl/qpalma
   QPALMA project Web page.
   http://www.fml.mpg.de/raetsch/suppl/mgene
   mGene project Web page.
   http://www.fml.mpg.de/raetsch/suppl/splice
   ASP project Web page.
   http://ftp.tuebingen.mpg.de/pub/fml/raetsch-lab/software/
   HTTP server for downloading PALMapper, QPALMA, mGene, and ASP.
   http://galaxy.fml.mpg.de/
   Galaxy server.
   http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
   General Feature Format (GFF) specification: Get detailed information about the GFF and download scripts for converting various computational analyses to GFF format.
 