RNA‐Seq Read Alignments with PALMapper

Internet, 2013-12-31

Abstract

Next‐generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA‐Seq experiments can generate huge numbers of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error prone. To use such reads for transcriptome reconstruction and gene‐structure identification, one must be able to align them accurately across intron boundaries. In this unit, we describe PALMapper, a fast and easy‐to‐use tool designed to accurately compute both unspliced and spliced alignments for millions of RNA‐Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read‐quality information and splice‐site predictions to improve alignment accuracy. The PALMapper package is available as a command‐line tool running on Unix or Mac OS X systems, or through a Web interface based on Galaxy tools. Curr. Protoc. Bioinform. 32:11.6.1‐11.6.37. © 2010 by John Wiley & Sons, Inc.

Keywords: RNA‐Seq; sequence alignment; splice‐site prediction; PALMapper; QPALMA; GenomeMapper; Galaxy

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Introduction
Basic Protocol 1: Training QPALMA on the Command Line
Support Protocol 1: Installing Software
Alternate Protocol 1: Training QPALMA Using Galaxy Tools
Basic Protocol 2: Generating Alignments with PALMapper on the Command Line
Support Protocol 2: Predicting Splice Sites with mGene Using Galaxy Tools
Alternate Protocol 2: Generating Alignments with PALMapper Using Galaxy Tools
Basic Protocol 3: Evaluating Alignments on the Command Line
Alternate Protocol 3: Evaluating Alignments Using Galaxy Tools
Basic Protocol 4: Visualizing Results on GBrowse
Alternate Protocol 4: Visualizing Results on Galaxy Trackster
Guidelines for Understanding Results
Commentary
Literature Cited
Figures
Tables

Materials

Figures

Figure 11.6.1 Example of the SRX001872_read_sample.fastq input file. Each read to be aligned is described on four lines in Sanger FASTQ format.
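
The four-line layout described in this caption can be read with a few lines of Python. The following is only a minimal sketch, not part of PALMapper: the names `FastqRead`, `read_fastq`, and `phred_scores` are hypothetical, and it assumes unwrapped records (exactly four lines per read) and the Sanger quality offset of 33.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRead(NamedTuple):
    name: str      # identifier taken from the "@" line
    sequence: str  # nucleotide string
    quality: str   # ASCII-encoded quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRead]:
    """Yield one FastqRead per four-line record (assumes unwrapped FASTQ)."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRead(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer scores (Sanger offset 33)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    # Illustrative record only; not taken from the SRX001872 sample file.
    demo = io.StringIO("@read1\nACGT\n+\nII!I\n")
    for read in read_fastq(demo):
        print(read.name, read.sequence, phred_scores(read.quality))
```

To read a real file, pass an open file handle instead of the in-memory `StringIO` object.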

Figure 11.6.2 Example of a genome annotation file in GFF3 format. See Table for details about GFF3 format.

Figure 11.6.3 Example of the donor splice‐site prediction file in SPF format. Each line gives the chromosome name, the name of the signal (here, don for donor splice site), the type of score, the nucleotide position in the chromosome, the strand, and the score.
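
As an illustration of the six fields listed in this caption, the sketch below parses one SPF line into a typed record. It is not PALMapper code: the names `SpliceSitePrediction` and `parse_spf_line` are hypothetical, whitespace-separated columns are an assumption, and the example values are invented.

```python
from typing import NamedTuple


class SpliceSitePrediction(NamedTuple):
    chromosome: str   # chromosome name
    signal: str       # signal name, e.g. "don" for a donor splice site
    score_type: str   # type of score
    position: int     # nucleotide position on the chromosome
    strand: str       # "+" or "-"
    score: float      # prediction score


def parse_spf_line(line: str) -> SpliceSitePrediction:
    """Parse one SPF line, assuming six whitespace-separated columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}: {line!r}")
    chromosome, signal, score_type, position, strand, score = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand {strand!r}")
    return SpliceSitePrediction(
        chromosome, signal, score_type, int(position), strand, float(score)
    )


if __name__ == "__main__":
    # Hypothetical line; the score-type label and values are made up.
    print(parse_spf_line("II\tdon\texample_score\t1778650\t+\t0.73"))
```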

Figure 11.6.4 Example of a script to generate QPALMA training parameters. These parameters are detailed in Table . To train QPALMA without splice‐site predictions, replace the two lines concerning splice‐site prediction files with the following line: export QPALMA_enable_splice_scores=False # no splice site predictions.

Figure 11.6.5 Example of a QPALMA parameter file generated during the training phase of QPALMA. The first lines, starting with ## characters, describe the parameters used for generating this file. The following lines represent the piecewise linear functions for intron length (h), donor splice sites (d), acceptor splice sites (a), and edit operations for match/mismatch/gap on read (q). Each piecewise linear function is defined by the range of possible x values (length for h, splice‐site predictions for d and a, quality for q), followed by the values for all support points (first the x values, separated by commas; then, after a space, the y values, also separated by commas). mmatrix defines fixed scores for a gap on DNA according to the possible aligned base in the read. Finally, prb_offset is the quality offset used to determine the quality value from the ASCII quality character.
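
To make the idea of a piecewise linear function defined by support points concrete, here is a minimal sketch of parsing and evaluating one. It is not QPALMA's implementation: the function names are hypothetical, the support points are invented, x values are assumed to be strictly increasing, and clamping outside the support range is an assumption rather than documented behavior.

```python
from bisect import bisect_right
from typing import List, Tuple


def parse_support_points(text: str) -> Tuple[List[float], List[float]]:
    """Parse 'x1,x2,... y1,y2,...' (x values, a space, then y values)."""
    x_part, y_part = text.split()
    xs = [float(value) for value in x_part.split(",")]
    ys = [float(value) for value in y_part.split(",")]
    if len(xs) != len(ys) or not xs:
        raise ValueError("need the same non-zero number of x and y values")
    return xs, ys


def piecewise_linear(x: float, xs: List[float], ys: List[float]) -> float:
    """Linearly interpolate between support points.

    Assumes xs is strictly increasing. Values outside the support range
    are clamped to the first or last y value (an assumption of this sketch).
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    upper = bisect_right(xs, x)          # xs[upper - 1] <= x < xs[upper]
    x0, x1 = xs[upper - 1], xs[upper]
    y0, y1 = ys[upper - 1], ys[upper]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    # Invented support points, e.g. for an intron-length function h.
    xs, ys = parse_support_points("10,100,1000 0.0,1.0,0.5")
    print(piecewise_linear(55, xs, ys))  # halfway between the first two points
```

In the same spirit, a quality value would be obtained from an ASCII quality character as `ord(character) - prb_offset`.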

Figure 11.6.6 Example of an alignment file in SAM format. The read string in column 10 and the quality string in column 11 have been shortened here for presentation; in the actual file, these columns contain the full strings. A full description of the format is given in Li et al. (2009).
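
The column layout mentioned in this caption can be illustrated with a small parser that extracts the mandatory SAM fields, including the read string (column 10) and quality string (column 11), and reports intron lengths from `N` operations in the CIGAR string. This is a sketch only: `SamRecord`, `parse_sam_line`, and `intron_lengths` are hypothetical names, and the example record is invented rather than PALMapper output.

```python
import re
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # column 1: read name
    flag: int    # column 2: bitwise flag
    rname: str   # column 3: reference (chromosome) name
    pos: int     # column 4: 1-based leftmost mapping position
    mapq: int    # column 5: mapping quality
    cigar: str   # column 6: CIGAR string
    seq: str     # column 10: read string
    qual: str    # column 11: quality string


CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line (header lines excluded)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, found {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]),
        int(fields[4]), fields[5], fields[9], fields[10],
    )


def intron_lengths(cigar: str) -> List[int]:
    """Return lengths of skipped reference regions (CIGAR 'N' operations)."""
    return [int(length) for length, op in CIGAR_OPERATION.findall(cigar) if op == "N"]


if __name__ == "__main__":
    # Invented spliced alignment: 20 matched bases, a 150-nt intron, 16 more.
    columns = ["read1", "0", "II", "1778650", "255", "20M150N16M",
               "*", "0", "0", "A" * 36, "I" * 36]
    record = parse_sam_line("\t".join(columns))
    print(record.rname, record.pos, intron_lengths(record.cigar))
```

An alignment whose CIGAR string contains no `N` operation is unspliced, so `intron_lengths` returns an empty list for it.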

Figure 11.6.7 Wiggle (variableStep) files are plain‐text files. They always begin with a track header line, which carries the required attribute type; its value must be wiggle_0. On the same line, two optional attributes are suggested: “name” and “description”, which give the name and description of the data and become the “Name” and “Note” fields of the generated GFF3 feature. The data are introduced by a line beginning with the keyword variableStep and the argument chrom, which indicates the chromosome on which the features are located. This line is followed by a series of two‐element lines, each giving the start position of a feature and its coverage score.
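
The layout described in this caption (track header, variableStep declaration, then position and score pairs) can be produced with a short function. The sketch below is illustrative only: `format_wiggle` is a hypothetical name, the coverage values are invented, and a single chromosome per call is an assumption.

```python
from typing import Dict


def format_wiggle(chrom: str, coverage: Dict[int, int],
                  name: str, description: str) -> str:
    """Render per-position coverage for one chromosome as variableStep Wiggle."""
    lines = [
        # Track header: type is required and must be wiggle_0;
        # name and description are the optional attributes.
        f'track type=wiggle_0 name="{name}" description="{description}"',
        # Declaration line naming the chromosome for the data that follow.
        f"variableStep chrom={chrom}",
    ]
    # One two-element line per position: start position, then coverage score.
    for position in sorted(coverage):
        lines.append(f"{position} {coverage[position]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented coverage values for illustration.
    example = {1778602: 5, 1778600: 3, 1778601: 4}
    print(format_wiggle("II", example, "coverage", "read coverage"), end="")
```

Sorting positions before writing keeps the output in ascending genomic order regardless of how the coverage dictionary was filled.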

Figure 11.6.8 Example of an output alignment file generated by PALMapper in BED format.

Figure 11.6.9 Graphical evaluation of comparison to the optimal alignment.

Figure 11.6.10 Text output for evaluation of comparison to the optimal alignment.

Figure 11.6.11 Graphical evaluation of alignment error distribution.

Figure 11.6.12 A GBrowse screenshot showing read coverage (xy plot) on chromosome II:1778600..1788500 of the Caenorhabditis elegans genome, together with annotated gene models. In the xy plot, the x and y axes represent the genomic position and the read coverage score, respectively.

Figure 11.6.13 A Galaxy Trackster screenshot showing read coverage (xy plot) on chromosome II:7991695..8557597 of the Caenorhabditis elegans genome, together with annotated transcript models.

Literature Cited

	Blankenberg, D., Taylor, J., Schenck, I., He, J., Zhang, Y., Ghent, M., Veeraraghavan, N., Albert, I., Miller, W., Makova, K., Hardison, R., and Nekrutenko, A. 2007. A framework for collaborative analysis of ENCODE data: Making large‐scale analyses biologist‐friendly. Genome Res. 17:960‐964.
	Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., and Taylor, J. 2010. Galaxy: A Web‐based genome analysis tool for experimentalists. Curr. Protoc. Mol. Biol. 89:19.10.1‐19.10.21.
	De Bona, F., Ossowski, S., Schneeberger, K., and Rätsch, G. 2008. Optimal spliced alignments of short sequence reads. Bioinformatics 24:i174‐i180.
	Hillier, L.W., Reinke, V., Green, P., Hirst, M., Marra, M.A., and Waterston, R.H. 2009. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 19:657‐666.
	Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup. 2009. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078‐2079.
	Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., and Weigel, D. 2009. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10:R98.
	Schweikert, G., Zeller, G., Zien, A., Behr, J., Ong, C.‐S., Philips, P., Bohlen, A., Sonnenburg, S., and Rätsch, G. 2009a. mGene: A novel discriminative gene finding system. Genome Res. 19:2133‐2143.
	Schweikert, G., Behr, J., Zien, A., Zeller, G., Sonnenburg, S., and Rätsch, G. 2009b. mGene.web: A web service for accurate computational gene finding. Nucleic Acids Res. 37:W312‐W316.
	Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., and Rätsch, G. 2007. Accurate splice site prediction using support vector machines. BMC Bioinformatics 10:S7.
	Trapnell, C., Pachter, L., and Salzberg, S.L. 2009. TopHat: Discovering splice junctions with RNA‐Seq. Bioinformatics 25:1105‐1111.
Internet Resources
	http://www.fml.mpg.de/raetsch/suppl/palmapper/tutorial
	Supporting Web page for the tutorial on the material in this unit.
	http://www.fml.mpg.de/raetsch/suppl/palmapper
	PALMapper project Web page.
	http://www.fml.mpg.de/raetsch/suppl/qpalma
	QPALMA project Web page.
	http://www.fml.mpg.de/raetsch/suppl/mgene
	mGene project Web page.
	http://www.fml.mpg.de/raetsch/suppl/splice
	ASP project Web page.
	http://ftp.tuebingen.mpg.de/pub/fml/raetsch‐lab/software/
	HTTP server for downloading PALMapper, QPALMA, mGene, and ASP.
	http://galaxy.fml.mpg.de/
	Galaxy server.
	http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
	General Feature Format (GFF) specification: Get detailed information about the GFF and download scripts for converting various computational analyses to GFF format.