08/15/13

SNP calling

Identifying single nucleotide polymorphisms (SNPs) is an important step in the AdapTree project. We are currently receiving exome sequence data from ~1300 individuals from two species, interior spruce and lodgepole pine. We need to identify variant sites accurately so that we can determine which sites are potentially responsible for climatic adaptation (e.g. SNPs associated with phenotypes and climatic variables). We are also using these SNPs to develop a 50k SNP array for both species. SNP quality is important, as we do not want to waste space on our array with falsely identified SNPs. On the other hand, we do not want to miss important genes that might be under selection because our SNP calling criteria are too strict.

I have tested various SNP calling methods using exome re-sequencing data from 12 interior spruce samples. For alignment I tried Bowtie2 and BWA (mem), followed by Picard (MarkDuplicates) to mark PCR duplicates and GATK for indel realignment and base quality score recalibration. For SNP calling I used samtools mpileup, with and without BAQ, as well as the UnifiedGenotyper from GATK. My results are in a series of posts on the Rieseberg lab blog, listed below, and I hope you find them useful. Please let me know if you have any suggestions for SNP calling. We want to do the alignments and SNP calling only once for the entire set of samples, because it is going to take a long time!

SNP calling I – alignment programs and PCR duplicates

SNP calling II – Creating a reference for GATK and Picard

SNP calling III – The indel problem

SNP calling IV – Base quality score recalibration

SNP calling V – SNP and genotype calling

SNP calling VI – Variant filtering