Analyzing Copy Number Variation Using SNP Array Data: Protocols for Calling CNV and Association Tests

Source: Internet, 2013-12-31

Abstract

High-density SNP genotyping technology provides a low-cost, effective tool for conducting Genome Wide Association (GWA) studies. The wide adoption of GWA studies has indeed led to discoveries of disease- or trait-associated SNPs, some of which were subsequently shown to be causal. However, the nearly universal shortcoming of many GWA studies, missing heritability, has prompted great interest in searching for other types of genetic variation, such as copy number variation (CNV). Certain CNVs have been reported to alter disease susceptibility. Algorithms and tools have been developed to identify CNVs using SNP array hybridization intensity data. Such an approach provides an additional source of data with almost no extra cost. In this unit, we demonstrate the steps for calling CNVs from Illumina SNP array data using PennCNV and performing association analysis using R and PLINK. Curr. Protoc. Hum. Genet. 79:1.27.1-1.27.15. © 2013 by John Wiley & Sons, Inc.

Keywords: copy number variations (CNV); CNV calling; genome-wide association studies; SNP genotyping array; association study; burden analysis

GO TO THE FULL PROTOCOL:

PDF or HTML at Wiley Online Library

Introduction
Basic Protocol 1: Detect CNVs from Illumina Whole‐Genome Genotyping Array Data Using PennCNV
Basic Protocol 2: Use of R to Perform Association Tests for Common CNVs
Basic Protocol 3: Use of PLINK to Perform Burden Tests for Rare or Non‐Overlapping CNVs
Support Protocol 1: Visually Inspect CNVs on the UCSC Genome Browser
Commentary
Literature Cited
Figures

Materials

Basic Protocol 1: Detect CNVs from Illumina Whole‐Genome Genotyping Array Data Using PennCNV

Materials

Signal intensity data: LRR (Log R Ratio) and BAF (B Allele Frequency) of each individual and each probe
Additional input files for PennCNV as described in its manual: PFB (Population Frequency of B allele), HMM, and GCModel files
Linux environment with PennCNV installed: we assume the user has PennCNV installed or knows how to obtain and install the software; more information is available on the PennCNV Web site (http://www.openbioinformatics.org/penncnv/penncnv_installation.html)
GenomeStudio or BeadStudio (Illumina) for exporting signal intensity files from Illumina SNP array project files
We encourage the reader to browse the respective software package Web sites (provided at the end of this unit) to find out more details on hardware requirements. In general, modern PCs and Linux servers with 2 to 4 GB RAM should be sufficient for running the programs we use in this unit. Analysis of larger datasets (on the scale of thousands of subjects) may require more storage space and memory.
Illumina recommends running GenomeStudio (formerly BeadStudio) on a Windows (XP or later) computer with Intel Celeron Duo or faster 64‐bit CPU, at least 8 GB memory, at least 100 GB storage space, and 1280 × 1024 screen resolution for better viewing.
PennCNV runs on Linux systems. Both source code and pre‐compiled executables are available. Instructions for installation on Windows systems with Cygwin or ActivePerl are provided on the PennCNV Web site. See Time Considerations for processing large datasets.
Precompiled executables for Windows and Linux systems are available for both R and PLINK.

Basic Protocol 2: Use of R to Perform Association Tests for Common CNVs

Materials

Output file from the PennCNV software (see Basic Protocol 1) that contains all called CNVs
Individuals' case/control status and phenotypes or factors that may confound the relation between CNVs and the disease state; this information is equivalent to that in PLINK FAM and covariate files
Linux environment with R installed. We assume the user has installed R or knows how to obtain and install the software from the R Web site (http://www.r-project.org/). Comprehensive documentation is available there.
R script (penncnv2cnpr.r, which can be downloaded at http://www.currentprotocols.com/protocol/hg0127)
For hardware requirements, see the Basic Protocol 1 materials list

Basic Protocol 3: Use of PLINK to Perform Burden Tests for Rare or Non‐Overlapping CNVs

Materials

Output file from the PennCNV software (see Basic Protocol 1) that contains all called CNVs
PLINK FAM file from the Genome‐Wide Association Study (GWAS) SNP data
Optional files describing user‐specified genomic regions for burden tests, for example, a file containing the coordinates of all known genes on the human genome. Each row specifies one genomic region (chromosome, start, and end positions).
Linux environment with PLINK installed. We assume the user has installed PLINK or knows how to obtain and install it (http://pngu.mgh.harvard.edu/~purcell/plink/). Comprehensive documentation is available there.
For hardware requirements, see the Basic Protocol 1 materials list
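
The two required inputs above have to be combined before PLINK can read the CNV calls. The following Python sketch shows one way this might be done; it is an illustration under stated assumptions, not the procedure of the full protocol. It assumes PennCNV call lines of the form "chr:start-end numsnp=N length=L stateS,cn=K samplefile ...", assumes that the individual ID is the sample file name without directory and extension, and writes rows in the PLINK CNV list layout (FID IID CHR BP1 BP2 TYPE SCORE SITES) with a placeholder score of 0. The function names read_fam and penncnv_to_plink are hypothetical.

```python
import os
import re

# chr:start-end, with an optional "chr" prefix that is dropped for PLINK
CALL = re.compile(r"^(?:chr)?(\w+):(\d+)-(\d+)$")

def read_fam(lines):
    """Map individual ID -> family ID from PLINK FAM lines (FID IID ...)."""
    fam = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            fam[fields[1]] = fields[0]
    return fam

def penncnv_to_plink(lines, fam):
    """Convert PennCNV call lines into rows of a PLINK CNV list."""
    rows = ["FID IID CHR BP1 BP2 TYPE SCORE SITES"]
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # not a CNV call line
        match = CALL.match(fields[0])
        if not match:
            continue
        chrom, start, end = match.groups()
        sites = fields[1].split("=")[1]   # numsnp=6 -> 6
        cn = fields[3].split("cn=")[1]    # state2,cn=1 -> 1
        # Assumption: the sample file name (minus extension) is the IID.
        iid = os.path.splitext(os.path.basename(fields[4]))[0]
        if iid not in fam:
            raise ValueError(f"sample {iid} is not in the FAM file")
        # SCORE is a placeholder (0) because no confidence value is parsed here.
        rows.append(" ".join([fam[iid], iid, chrom, start, end, cn, "0", sites]))
    return rows

# Toy input for illustration only.
fam = read_fam(["FAM1 father 0 0 1 2"])
calls = ["chr1:156783977-156788016 numsnp=6 length=4,040 state2,cn=1 "
         "father.txt startsnp=rs16840314 endsnp=rs10489835"]
plink_rows = penncnv_to_plink(calls, fam)
print("\n".join(plink_rows))
```

Raising an error for samples missing from the FAM file is a deliberate choice in this sketch, so that ID mismatches surface before any burden test is run.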

Support Protocol 1: Visually Inspect CNVs on the UCSC Genome Browser

Materials

Output file from the PennCNV software (see Basic Protocol 1) that contains all called CNVs
A Web browser that is compatible with the UCSC Genome Browser.
Format CNV files into the BED format. A BED file is a tab‐delimited file that represents genomic features, such as genes or CNVs, as integer intervals, one interval per line, and describes how these intervals are to be displayed on the UCSC browser as a custom track (see unit 18.6). Only the first three fields (chromosome/scaffold name, start position, and end position), which describe the genomic location, are required, but the optional fields, such as name, strand, etc., and the “track line” make the visualization more informative. Please refer to http://genome.ucsc.edu/FAQ/FAQformat.html#format1 for more details.

For this protocol, we put eight fields (chromosome, start, end, name, score, strand, thickStart, and thickEnd) and a track line in one BED file. Because BED fields are positional, the score field must be present for the strand to be read from the sixth column; it is set to 0 and is not used for display (useScore=0). An example track line may look like this:

    track name=test1 description=test1 visibility=3 colorByStrand="255,0,0 0,0,255" useScore=0

The track line is one single line, even where it wraps in print. Setting both of the last two fields to the value of the second one, i.e., the start position, makes the bars representing the CNVs thinner so as to accommodate more CNVs in a given space.

For a small number of items, a generic text editor or Excel is sufficient to do the conversion manually. For a large number, however, a program is usually needed to do it efficiently and correctly. An example Perl script that reads a PennCNV output file and converts it into a BED file is given below. The script shows the contrast between deletions and duplications by assigning a strand to each CNV (‘+’ when CN < 2 and ‘-’ when CN > 2) and using the “colorByStrand” attribute in the track line. The two colors for the two strands are specified by RGB color codes and separated by a space. To visualize the contrast between cases and controls instead, the strand should be used to encode disease status, and duplications and deletions should then be separated into two tracks. The script, as written, encodes the CNV type:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    ## This script prints the output to STDOUT. Use redirect to output the results to a file.
    # check if track name and input filename are provided
    die "Usage: $0 trackname infile\n" if scalar @ARGV < 2;
    my ($track, $infile) = @ARGV;
    # print the track line
    print "track name=$track description=$track visibility=3 colorByStrand=\"255,0,0 0,0,255\" useScore=0\n";
    # open the input file and start processing line by line
    open(my $fin, '<', $infile) or die "cannot open $infile\n";
    while (my $line = <$fin>) {
        # split one line into fields (the delimiter can be one or multiple spaces)
        my @arr = split(/\s+/, $line);
        # skip lines that are not CNV calls (chr:start-end followed by at least four fields)
        next unless @arr >= 5 && $arr[0] =~ /^[^:]+:\d+-\d+$/;
        # further split the first field into chr and positions
        my @ele = split(/[:-]/, $arr[0]);
        # convert to 0-based position
        my $start = $ele[1] - 1;
        # split the copy number field (the fourth field, e.g., state2,cn=1)
        my @cn = split(/[,=]/, $arr[3]);
        # assign deletion (CN<2) to positive strand '+' and duplication '-'; the score is a placeholder 0
        printf("%s\t%d\t%d\t%s\t%d\t%s\t%d\t%d\n", $ele[0], $start, $ele[2], $arr[4], 0, $cn[2] < 2 ? '+' : '-', $start, $start);
    }
    close $fin;
For hardware requirements, see the Basic Protocol 1 materials list
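
The case/control variant mentioned above (strand encodes disease status, with deletions and duplications in two separate tracks) might be sketched as follows. This Python example is a minimal illustration under assumptions, not part of the published protocol: the function name penncnv_to_case_control_bed, the track names, and the status dictionary keyed by the sample field are hypothetical, and rows follow the standard BED column order (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd) with a placeholder score of 0.

```python
def penncnv_to_case_control_bed(lines, status, colors="255,0,0 0,0,255"):
    """Return BED text with one track for deletions and one for duplications.

    lines:  PennCNV call lines (chr:start-end numsnp=N length=L stateS,cn=K sample ...)
    status: dict mapping the sample field to "case" or "control" (assumed input)
    """
    tracks = {"deletions": [], "duplications": []}
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[4] not in status:
            continue  # skip non-call lines and samples without a known status
        chrom, span = fields[0].split(":")
        start, end = (int(x) for x in span.split("-"))
        cn = int(fields[3].split("cn=")[1])
        # CN < 2 is a deletion; anything else is treated as a duplication.
        kind = "deletions" if cn < 2 else "duplications"
        # Strand now encodes disease status: '+' for cases, '-' for controls.
        strand = "+" if status[fields[4]] == "case" else "-"
        start0 = start - 1  # BED start positions are 0-based
        row = [chrom, start0, end, fields[4], 0, strand, start0, start0]
        tracks[kind].append("\t".join(str(x) for x in row))
    out = []
    for name, rows in tracks.items():
        out.append(f'track name={name} description={name} visibility=3 '
                   f'colorByStrand="{colors}" useScore=0')
        out.extend(rows)
    return "\n".join(out) + "\n"

# Toy input for illustration only.
example_calls = [
    "chr1:1001-2000 numsnp=6 length=1,000 state2,cn=1 s1.txt startsnp=a endsnp=b",
    "chr2:501-900 numsnp=4 length=400 state5,cn=3 s2.txt startsnp=c endsnp=d",
]
example_status = {"s1.txt": "case", "s2.txt": "control"}
bed_text = penncnv_to_case_control_bed(example_calls, example_status)
print(bed_text, end="")
```

With the default colors, cases are drawn in red and controls in blue in both tracks, so the two groups can be compared within deletions and within duplications separately.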

Figures

Figure 1.27.1 (A) Calling SNP genotypes by the ratio of probe intensities (allele frequencies) on hybridization arrays. (B) Examples where copy number variations alter total intensities and allele frequencies.
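
The effect summarized in panel B can be made concrete with a small calculation. The Python sketch below assumes an idealized array on which the B Allele Frequency equals the fraction of B alleles among all copies and the total intensity scales linearly with copy number; real arrays show noisier and more compressed values. The function names are hypothetical.

```python
from fractions import Fraction
from math import log2

def expected_baf_clusters(copy_number):
    """Idealized BAF values possible at a SNP with the given total copy number."""
    if copy_number == 0:
        return []  # no allele present, so BAF carries no genotype information
    # With n copies, the number of B alleles can be 0, 1, ..., n.
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]

def expected_lrr(copy_number, normal=2):
    """Idealized Log R Ratio: log2 of observed over expected total intensity."""
    if copy_number == 0:
        return float("-inf")
    return log2(copy_number / normal)

for cn in (1, 2, 3):
    clusters = ", ".join(str(f) for f in expected_baf_clusters(cn))
    print(f"CN={cn}: BAF clusters at {clusters}; LRR {expected_lrr(cn):+.2f}")
```

Under these assumptions, a single-copy deletion loses the heterozygous cluster at 1/2 and lowers the LRR, whereas a single-copy duplication splits the heterozygous cluster into 1/3 and 2/3 and raises the LRR.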

Figure 1.27.2 A section of a chromosome to demonstrate how Copy Number Polymorphic Regions (CNPRs) are constructed. In this example, PennCNV has been run to call CNVs from SNP array data of six individuals (indiv1 through 6). All called CNVs from all individuals were pooled together. All non‐redundant end points of the CNVs become break points that are used to partition the chromosome. Each pair of adjacent break points forms a CNPR. Every CNV is then decomposed into multiple consecutive CNPRs. Red: Copy Number (CN) = 1; Blue: CN=3; Black: CN=4. Based on the type of CNV (CN=1 or CN=3) each individual has in a CNPR (CN=2 if no CNV was called), a matrix can be generated.
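
The construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the penncnv2cnpr.r script distributed with the unit: the function name build_cnpr_matrix, the tuple layout (individual, start, end, copy number), the half-open treatment of intervals, and the toy coordinates are all assumptions made for the example.

```python
from bisect import bisect_left

def build_cnpr_matrix(cnvs, individuals):
    """Partition one chromosome into CNPRs and tabulate copy numbers.

    cnvs: list of (individual, start, end, copy_number) on one chromosome,
          treated as half-open intervals [start, end) (an assumption).
    Returns (cnprs, matrix), where matrix[individual][k] is the copy number
    of that individual in cnprs[k], or 2 when no CNV was called there.
    """
    # All non-redundant CNV end points become break points.
    breaks = sorted({pos for _, start, end, _ in cnvs for pos in (start, end)})
    # Each pair of adjacent break points forms one CNPR.
    cnprs = list(zip(breaks[:-1], breaks[1:]))
    # Start from the normal diploid state for everyone.
    matrix = {ind: [2] * len(cnprs) for ind in individuals}
    # Decompose every CNV into the consecutive CNPRs it spans.
    for ind, start, end, cn in cnvs:
        for k in range(bisect_left(breaks, start), bisect_left(breaks, end)):
            matrix[ind][k] = cn
    return cnprs, matrix

# Toy calls for illustration only.
calls = [
    ("indiv1", 100, 400, 1),
    ("indiv2", 200, 500, 3),
    ("indiv3", 200, 400, 1),
]
cnprs, matrix = build_cnpr_matrix(calls, ["indiv1", "indiv2", "indiv3", "indiv4"])
print(cnprs)
for ind, row in matrix.items():
    print(ind, row)
```

In this sketch, an interval that lies between two non-overlapping CNVs also appears as a CNPR in which every individual has CN=2; such columns carry no information and can be dropped before association testing.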

Literature Cited

	Barnes, C., Plagnol, V., Fitzgerald, T., Redon, R., Marchini, J., Clayton, D., and Hurles, M.E. 2008. A robust statistical method for case‐control association testing with copy number variation. Nat. Genet. 40:1245‐1252.
	Bochukova, E.G., Huang, N., Keogh, J., Henning, E., Purmann, C., Blaszczyk, K., Saeed, S., Hamilton‐Shield, J., Clayton‐Smith, J., O'Rahilly, S., Hurles, M.E., and Farooqi, I.S. 2010. Large, rare chromosomal deletions associated with severe early‐onset obesity. Nature 463:666‐670.
	Colella, S., Yau, C., Taylor, J.M., Mirza, G., Butler, H., Clouston, P., Bassett, A.S., Seller, A., Holmes, C.C., and Ragoussis, J. 2007. QuantiSNP: An Objective Bayes Hidden‐Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 35:2013‐2025.
	Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C., Campbell, P., Fitzgerald, T., Hum, M., Ihm, C.H., Kristiansson, K., Macarthur, D.G., Macdonald, J.R., Onyiah, I., Pang, A.W., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J.; Wellcome Trust Case Control Consortium, Tyler‐Smith, C., Carter, N.P., Lee, C., Scherer, S.W., and Hurles, M.E. 2010. Origins and functional impact of copy number variation in the human genome. Nature 464:704‐712.
	Diskin, S.J., Li, M., Hou, C., Yang, S., Glessner, J., Hakonarson, H., Bucan, M., Maris, J.M., and Wang, K. 2008. Adjustment of genomic waves in signal intensities from whole‐genome SNP genotyping platforms. Nucleic Acids Res. 36:e126.
	Kidd, J.M., Cooper, G.M., Donahue, W.F., Hayden, H.S., Sampas, N., Graves, T., Hansen, N., Teague, B., Alkan, C., Antonacci, F., Haugen, E., Zerr, T., Yamada, N.A., Tsang, P., Newman, T.L., Tüzün, E., Cheng, Z., Ebling, H.M., Tusneem, N., David, R., Gillett, W., Phelps, K.A., Weaver, M., Saranga, D., Brand, A., Tao, W., Gustafson, E., McKernan, K., Chen, L., Malig, M., Smith, J.D., Korn, J.M., McCarroll, S.A., Altshuler, D.A., Peiffer, D.A., Dorschner, M., Stamatoyannopoulos, J., Schwartz, D., Nickerson, D.A., Mullikin, J.C., Wilson, R.K., Bruhn, L., Olson, M.V., Kaul, R., Smith, D.R., and Eichler, E.E. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453:56‐64.
	Korn, J.M., Kuruvilla, F.G., McCarroll, S.A., Wysoker, A., Nemesh, J., Cawley, S., Hubbell, E., Veitch, J., Collins, P.J., Darvishi, K., Lee, C., Nizzari, M.M., Gabriel, S.B., Purcell, S., Daly, M.J., and Altshuler, D. 2008. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 40:1253‐1260.
	Merikangas, A.K., Corvin, A.P., and Gallagher, L. 2009. Copy‐number variants in neurodevelopmental disorders: Promises and challenges. Trends Genet. 25:536‐544.
	Purcell, S., Neale, B., Todd‐Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., and Sham, P.C. 2007. PLINK: A tool set for whole‐genome association and population‐based linkage analyses. Am. J. Hum. Genet. 81:559‐575.
	Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W., Cho, E.K., Dallaire, S., Freeman, J.L., González, J.R., Gratacòs, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J.R., Marshall, C.R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M.J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D.F., Estivill, X., Tyler‐Smith, C., Carter, N.P., Aburatani, H., Lee, C., Jones, K.W., Scherer, S.W., and Hurles, M.E. 2006. Global variation in copy number in the human genome. Nature 444:444‐454.
	Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S.F., Hakonarson, H., and Bucan, M. 2007. PennCNV: An integrated hidden Markov model designed for high‐resolution copy number variation detection in whole‐genome SNP genotyping data. Genome Res. 17:1665‐1674.
	Wellcome Trust Case Control Consortium, Craddock, N., Hurles, M. E., Cardin, N., Pearson, R. D., Plagnol, V., Robson, S., Vukcevic, D., Barnes, C., Conrad, D.F., Giannoulatou, E., Holmes, C., Marchini, J.L., Stirrups, K., Tobin, M.D., Wain, L.V., Yau, C., Aerts, J., Ahmad, T., Andrews, T.D., Arbury, H., Attwood, A., Auton, A., Ball, S.G., Balmforth, A.J., Barrett, J.C., Barroso, I., Barton, A., Bennett, A.J., Bhaskar, S., Blaszczyk, K., Bowes, J., Brand, O.J., Braund, P.S., Bredin, F., Breen, G., Brown, M.J., Bruce, I.N., Bull, J., Burren, O.S., Burton, J., Byrnes, J., Caesar, S., Clee, C.M., Coffey, A.J., Connell, J.M., Cooper, J.D., Dominiczak, A.F., Downes, K., Drummond, H.E., Dudakia, D., Dunham, A., Ebbs, B., Eccles, D., Edkins, S., Edwards, C., Elliot, A., Emery, P., Evans, D.M., Evans, G., Eyre, S., Farmer, A., Ferrier, I.N., Feuk, L., Fitzgerald, T., Flynn, E., Forbes, A., Forty, L., Franklyn, J.A., Freathy, R.M., Gibbs, P., Gilbert, P., Gokumen, O., Gordon‐Smith, K., Gray, E., Green, E., Groves, C.J., Grozeva, D., Gwilliam, R., Hall, A., Hammond, N., Hardy, M., Harrison, P., Hassanali, N., Hebaishi, H., Hines, S., Hinks, A., Hitman, G.A., Hocking, L., Howard, E., Howard, P., Howson, J.M., Hughes, D., Hunt, S., Isaacs, J.D., Jain, M., Jewell, D.P., Johnson, T., Jolley, J.D., Jones, I.R., Jones, L.A., Kirov, G., Langford, C.F., Lango‐Allen, H., Lathrop, G.M., Lee, J., Lee, K.L., Lees, C., Lewis, K., Lindgren, C.M., Maisuria‐Armer, M., Maller, J., Mansfield, J., Martin, P., Massey, D.C., McArdle, W.L., McGuffin, P., McLay, K.E., Mentzer, A., Mimmack, M.L., Morgan, A.E., Morris, A.P., Mowat, C., Myers, S., Newman, W., Nimmo, E.R., O'Donovan, M.C., Onipinla, A., Onyiah, I., Ovington, N.R., Owen, M.J., Palin, K., Parnell, K., Pernet, D., Perry, J.R., Phillips, A., Pinto, D., Prescott, N.J., Prokopenko, I., Quail, M.A., Rafelt, S., Rayner, N.W., Redon, R., Reid, D.M., Ring, S.M., Robertson, N., Russell, E., St Clair, D., Sambrook, J.G., Sanderson, J.D., Schuilenburg, H., Scott, C.E., Scott, R., Seal, S., Shaw‐Hawkins, S., Shields, B.M., Simmonds, M.J., Smyth, D.J., Somaskantharajah, E., Spanova, K., Steer, S., Stephens, J., Stevens, H.E., Stone, M.A., Su, Z., Symmons, D.P., Thompson, J.R., Thomson, W., Travers, M.E., Turnbull, C., Valsesia, A., Walker, M., Walker, N.M., Wallace, C., Warren‐Perry, M., Watkins, N.A., Webster, J., Weedon, M.N., Wilson, A.G., Woodburn, M., Wordsworth, B.P., Young, A.H., Zeggini, E., Carter, N.P., Frayling, T.M., Lee, C., McVean, G., Munroe, P.B., Palotie, A., Sawcer, S.J., Scherer, S.W., Strachan, D.P., Tyler‐Smith, C., Brown, M.A., Burton, P.R., Caulfield, M.J., Compston, A., Farrall, M., Gough, S.C., Hall, A.S., Hattersley, A.T., Hill, A.V., Mathew, C.G., Pembrey, M., Satsangi, J., Stratton, M.R., Worthington, J., Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W., Parkes, M., Rahman, N., Todd, J.A., Samani, N.J., and Donnelly, P. 2010. Genome‐wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464:713‐720.

Internet Resources
	http://www.openbioinformatics.org/penncnv/
	PennCNV Web site. Users can download the PennCNV source code, compile, and install on their own computers. The Web site also contains a wealth of information including program manual, annotation files, tutorials for the PennCNV software, and other useful tips such as visualization and quality control recommendations.
	http://www.r-project.org/
	R Web site. R is a free program for statistical computing and visualization. Users can download the compiled R package for their specific computing platforms. The Web site also lists URLs to the Comprehensive R Archive Network (CRAN). CRAN hosts user‐contributed packages that provide additional analysis capabilities.
	http://www.illumina.com/software/genomestudio_software.ilmn
	Illumina GenomeStudio Web site: The Web site contains instructions and FAQs for the GenomeStudio software, which is required to export SNP intensities from Illumina Chip projects for CNV calling. Illumina customers can obtain the software for free.
	http://pngu.mgh.harvard.edu/~purcell/plink/
	PLINK Web site. PLINK is developed by Shaun Purcell at Harvard University. The free, open‐source program is widely used by the research community to process and analyze genome‐wide association studies (GWAS). Users can download the source code or obtain pre‐compiled binaries for installation from this Web site. This Web site also contains very detailed instructions on how to use the program.
	http://genome.ucsc.edu/
	UCSC Genome Browser. Users can go to UCSC Genome Browser to download genomic annotations, or visualize CNV calls on the reference genome as outlined in the Support Protocol.
	http://www.humgen.nl/SNP_databases.html
	List of Genetic variation databases. The Center for Human and Clinical Genetics at Leiden University Medical Center maintains a comprehensive list of genetic variation databases, including CNV databases.
	http://hgsv.washington.edu/
	The Human Genome Structural Variation Project. This Web site, maintained by the Eichler lab at the University of Washington, provides a detailed map of CNVs and large structural variants.
	http://www.sanger.ac.uk/research/areas/humangenetics/cnv/
	The Copy Number Variation (CNV) Project. The database is maintained by the Wellcome Trust Sanger Institute. It hosts CNVs identified through a variety of genotyping and hybridization approaches and provides extensive information on known CNV/phenotype associations.
	http://projects.tcag.ca/variation/project.html
	The Database of Genomic Variants. This database is maintained by the University of Toronto Centre for Applied Genomics. It is a comprehensive catalog of structural variants in the human genome, compiled from published reports on healthy controls in the literature. It can be used as a control reference in studies that correlate CNVs with diseases and traits.