Gene Identification in Prokaryotic Genomes, Phages, Metagenomes, and EST Sequences with GeneMarkS Suite

Abstract

This unit describes how to use several gene-finding programs from the GeneMark line, which were developed for finding protein-coding ORFs in genomic DNA of prokaryotic species, in genomic DNA of eukaryotic species with intronless genes, in genomes of viruses and phages, and in prokaryotic metagenomic sequences, as well as in EST sequences with spliced-out introns. These bioinformatics tools have been shown to have state-of-the-art accuracy and are frequently used for gene annotation in novel nucleotide sequences. An additional advantage of these sequence-analysis tools is that algorithm parameterization is handled automatically: parameters are estimated by iterative self-training (unsupervised training). Curr. Protoc. Bioinform. 35:4.5.1-4.5.17. © 2011 by John Wiley & Sons, Inc.

Keywords: gene finding; hidden Markov model; unsupervised parameter estimation
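
The idea of iterative self-training mentioned in the abstract can be illustrated with a toy loop. The sketch below is not the GeneMarkS algorithm, which trains a hidden Markov model and runs GeneMark.hmm. It assumes a much simpler setting: a list of in-frame candidate ORFs (uppercase A/C/G/T), a codon-frequency model for each of two classes, and a mean log-odds score. All function names are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons_of(seq):
    """Split an in-frame sequence into consecutive codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(seqs):
    """Estimate codon frequencies, adding a pseudocount of 1 per codon."""
    counts = Counter(c for s in seqs for c in codons_of(s))
    total = sum(counts[c] for c in CODONS) + len(CODONS)
    return {c: (counts[c] + 1) / total for c in CODONS}


def score(seq, coding, noncoding):
    """Mean log-odds per codon (coding model versus non-coding model)."""
    cods = [c for c in codons_of(seq) if c in coding]  # skip codons with N etc.
    if not cods:
        return 0.0
    return sum(math.log(coding[c] / noncoding[c]) for c in cods) / len(cods)


def self_train(candidates, max_iter=20):
    """Alternate between re-estimating both models and relabelling candidates.

    Returns the indices labelled as coding and the number of iterations run.
    """
    # Assumed starting guess: the longer half of the candidates is coding.
    by_length = sorted(range(len(candidates)), key=lambda i: -len(candidates[i]))
    labels = set(by_length[: len(candidates) // 2])
    iteration = 0
    for iteration in range(1, max_iter + 1):
        coding = train_codon_model(
            [s for i, s in enumerate(candidates) if i in labels])
        noncoding = train_codon_model(
            [s for i, s in enumerate(candidates) if i not in labels])
        new_labels = {i for i, s in enumerate(candidates)
                      if score(s, coding, noncoding) > 0}
        if new_labels == labels:  # labels stopped changing: converged
            break
        labels = new_labels
    return labels, iteration


if __name__ == "__main__":
    toy = ["GCTGAAAAA" * 10, "GCTGAAAAA" * 9, "GAAGCTAAA" * 8,
           "TTTCCCGGG" * 2, "CCCTTTGGG" * 2, "GCTGAAAAA" * 3]
    print(self_train(toy))
```

In this toy input the short sixth candidate starts out labelled as non-coding because of its length, and the second pass relabels it once the codon model has been estimated from the longer candidates. The result depends on the starting guess, which is why a sensible initial model matters in any self-training scheme.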

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Table of Contents

Introduction
Basic Protocol 1: Using GeneMarkS
Basic Protocol 2: Using GeneMark.hmm for Prokaryotic Gene Prediction
Basic Protocol 3: Using GeneMark for Prokaryotic Gene Prediction
Basic Protocol 4: Using the Heuristic Approach for Prokaryotic Model Building
Basic Protocol 5: Using MetaGeneMark for Finding Genes in Metagenomes
Guidelines for Understanding Results
Commentary
Literature Cited
Figures

Figures

Figure 4.5.1 User interface for the GeneMarkS gene‐finding program. Required input includes a DNA sequence in FASTA format. The algorithm (1) estimates the HMM model parameters from the input sequence via unsupervised iterative training, and (2) makes a final run of GeneMark.hmm to get gene predictions.

Figure 4.5.2 User interface for the GeneMark.hmm program. Required input includes a DNA sequence in FASTA format and the set of species‐specific parameters for the model.

Figure 4.5.3 The text output of GeneMark.hmm using both the Atypical and Typical models. The predicted gene coordinates, the strand, and the model type are listed for each gene.

Figure 4.5.4 The graphical output from GeneMark combined with the output of GeneMark.hmm. The genes predicted by the Typical model are shown in black (solid line); the genes predicted by the Atypical model are shown in red (dashed line). The black horizontal bars indicate protein coding regions predicted by GeneMark.hmm.

Figure 4.5.5 The user interface of GeneMark. Required input includes a DNA sequence in FASTA format and the set of species‐specific parameters of the model.

Figure 4.5.6 The text output of the GeneMark program. Open reading frames predicted as genes are listed, along with the average coding potential of each ORF and the probability that each alternative ATG triplet is a translation start.

Figure 4.5.7 The graphical output of GeneMark. The six panels represent the six possible reading frames, three on each of the direct and reverse strands.
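
The six reading frames referred to in this caption can be made concrete with a short sketch. It only shows the frame bookkeeping (three offsets on the direct strand, three on the reverse complement) by scanning for ATG-to-stop ORFs. It does not compute coding potential as GeneMark does, and the function names, the frame numbering (1 to 3 per strand), and the minimum-length parameter are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) 0-based half-open spans of ATG...stop ORFs."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first ATG after the previous stop opens an ORF
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:  # length includes the stop
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) tuples in direct-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    rc = reverse_complement(seq)
    found = []
    for frame in range(3):
        for s, e in orfs_in_frame(seq, frame, min_codons):
            found.append(("+", frame + 1, s, e))
        for s, e in orfs_in_frame(rc, frame, min_codons):
            # Map reverse-strand positions back onto the direct strand.
            found.append(("-", frame + 1, n - e, n - s))
    return found


if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATAGCC", min_codons=1))
```

Reporting reverse-strand ORFs in direct-strand coordinates keeps all six frames comparable on one axis, which is how the panels of a six-frame plot are usually aligned.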

Figure 4.5.8 The user interface for the Heuristic approach. Required input includes a DNA sequence in FASTA format. This method finds model parameters for the gene‐prediction algorithm from the given (short) sequence. GeneMark.hmm runs with these parameters to predict genes.

Figure 4.5.9 The user interface for the MetaGeneMark program. Required input includes a set of DNA sequences in multi‐FASTA format.
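
Since every program in this unit takes FASTA input, and MetaGeneMark takes multi-FASTA, a minimal reader may help when preparing or checking input files. The sketch below is a generic parser written for illustration. It is not part of the GeneMark suite, and the function name is hypothetical.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())  # sequence may wrap over many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">read1 sample\nACGT\nacgt\n\n>read2\nTTGA\n"
    for name, seq in parse_multi_fasta(example):
        print(name, len(seq))
```

A quick pass like this can catch missing headers or empty records before a file is submitted.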

Literature Cited
	Besemer, J. and Borodovsky, M. 1999. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 27:3911‐3920.
	Besemer, J., Lomsadze, A., and Borodovsky, M. 2001. GeneMarkS: A self‐training method for prediction of gene starts in microbial genomes: Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29:2607‐2618.
	Borodovsky, M. and McIninch, J. 1993. GENMARK: Parallel gene recognition for both DNA strands. Comput. Chem. 17:123‐133.
	Borodovsky, M., Sprizhitsky, Yu., Golovanov, E., and Alexandrov, A. 1986a. Statistical patterns in primary structures of functional regions in the E. coli genome: I. Oligonucleotide frequencies analysis. Mol. Biol. 20:826‐833.
	Borodovsky, M., Sprizhitsky, Y., Golovanov, E., and Alexandrov, A. 1986b. Statistical patterns in primary structures of functional regions in the E. coli genome: II. Non‐homogeneous Markov models. Mol. Biol. 20:833‐840.
	Borodovsky, M., Sprizhitsky, Y., Golovanov, E., and Alexandrov, A. 1986c. Statistical patterns in primary structures of functional regions in the E. coli genome: III. Computer recognition of coding regions. Mol. Biol. 20:1145‐1150.
	Borodovsky, M., Rudd, K., and Koonin, Eu. 1994a. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 22:4756‐4767.
	Borodovsky, M., Koonin, Eu., and Rudd, K. 1994b. New genes in old sequences: A strategy for finding genes in a bacterial genome. Trends Biochem. Sci. 19:309‐313.
	Borodovsky, M., McIninch, J., Koonin, E., Rudd, K., Medigue, C., and Danchin, A. 1995. Detection of new genes in the bacterial genome using Markov models for three gene classes. Nucleic Acids Res. 23:3554‐3562.
	Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D., Kerlavage, A.R., Dougherty, B.A., Tomb, J.‐F., Adams, M.D., Reich, C.I., Overbeek, R., Kirkness, E.F., Weinstock, K.G., Merrick, J.M., Glodek, A., Scott, J.L., Geoghagen, N.S.M., Weidman, J.F., Fuhrmann, J.L., Nguyen, D., Utterback, T.R., Kelley, J.M., Peterson, J.D., Sadow, P.W., Hanna, M.C., Cotton, M.D., Roberts, K.M., Hurst, M.A., Kaine, B.P., Borodovsky, M., Klenk, H.‐P., Fraser, C.M., Smith, H.O., Woese, C.R., and Venter, J.C. 1996. Complete genome sequence of the methanogenic archaeon Methanococcus jannaschii. Science 273:1058‐1073.
	Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, U.K.
	Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.‐F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C.A., Gocayne, J.D., Scott, J.D., Shirley, R., Liu, L.‐I., Glodek, A., Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D., Utterback, T.R., Hanna, M.C., Nguyen, D.T., Saudek, D.M., Brandon, R.C., Fine, L.D., Fritchman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A., Small, K.V., Fraser, C.M., Smith, H.O., and Venter, J.C. 1995. Whole‐genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496‐512.
	Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Fleischmann, R.D., Bult, C.J., Kerlavage, A.R., Sutton, G., Kelley, J.M., Fritchman, J.L., Weidman, J.F., Small, K.V., Sandusky, M., Fuhrmann, J.L., Nguyen, D.T., Utterback, T.R., Saudek, D.M., Phillips, C.A., Merrick, J.M., Tomb, J.‐F., Dougherty, B.A., Bott, K.F., Hu, P.‐C., Lucier, T.S., Peterson, S.N., Smith, H.O., Hutchison, C.A. III, and Venter, J.C. 1995. The minimal gene complement of Mycoplasma genitalium. Science 270:397‐403.
	Hayes, W. and Borodovsky, M. 1998. How to interpret an anonymous bacterial genome: Machine learning approach to gene identification. Genome Res. 8:1154‐1171.
	Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262:208‐214.
	Lukashin, A.V. and Borodovsky, M. 1998. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 26:1107‐1115.
	Mills, R., Rozanov, M., Lomsadze, A., Tatusova, T., and Borodovsky, M. 2003. Improving gene annotation in complete viral genomes. Nucleic Acids Res. 31:7041‐7055.
	Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W., Borodovsky, M., Rudd, K.E., and Koonin, E.V. 1996. Metabolism and evolution of H. influenzae deduced from whole genome comparison to E. coli. Curr. Biol. 6:279‐291.
	Tomb, J., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S., Dougherty, B.A., Nelson, K., Quackenbush, J., Zhou, L., Kirkness, E.F., Peterson, S., Loftus, B., Richardson, D., Dodson, R., Khalak, H.G., Glodek, A., McKenney, K., Fitzegerald, L.M., Lee, N., Adams, M.D., Hickey, E.K., Berg, D.E., Gocayne, J.D., Utterback, T.R., Peterson, J.D., Kelley, J.M., Cotton, M.D., Weidman, J.M., Fujii, C., Bowman, C., Watthey, L., Wallin, E., Hayes, W.S., Borodovsky, M., Karp, P.D., Smith, H.O., Fraser, C.M., and Venter, J.C. 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539‐547.
	Zhu, W., Lomsadze, A., and Borodovsky, M. 2010. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 38:e132.