Identifying single nucleotide polymorphisms (SNPs) is an important step in the AdapTree project. We are currently receiving exome sequence data from ~1300 individuals in two species, interior spruce and lodgepole pine. We need to identify variant sites accurately so that we can determine which sites are potentially responsible for climatic adaptation (e.g. SNPs associated with phenotypes and climatic variables). We are also using these SNPs to develop a 50k SNP array for both species. SNP quality matters, because we do not want to waste space on the array with falsely identified SNPs. On the other hand, we also do not want to miss important genes that might be under selection because our SNP calling criteria are too strict.
I have tested various SNP calling methods using exome re-sequencing data from 12 interior spruce samples. For alignment I tried Bowtie2 and BWA (mem). I then used Picard to mark duplicates, and GATK for indel realignment and base quality recalibration. For SNP calling I used mpileup, with and without BAQ, as well as the Unified Genotyper from GATK. My results are in a series of posts on the Rieseberg lab blog, listed below, and I hope you find them useful. Please let me know if you have any suggestions for SNP calling. We want to do the alignments and SNP calling only once for the entire set of samples, because it is going to take a long time!
SNP calling I – alignment programs and PCR duplicates
SNP calling II – Creating a reference for GATK and Picard
SNP calling III – The indel problem
SNP calling IV – Base quality score recalibration
SNP calling V – SNP and genotype calling
SNP calling VI – Variant filtering
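
As a rough illustration of how the call sets from two of the methods mentioned above (for example mpileup and the Unified Genotyper) might be compared, here is a minimal Python sketch. It is not the analysis used in these posts. It assumes each caller has produced tab-separated VCF-style records, and it compares sites by chromosome and position only, ignoring alleles and genotypes. The function names and the example records are hypothetical.

```python
def read_sites(vcf_lines):
    """Collect (chrom, pos) pairs from VCF-style lines, skipping headers."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or VCF header/meta line
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def compare_call_sets(sites_a, sites_b):
    """Count shared and caller-specific sites, plus the Jaccard index."""
    shared = sites_a & sites_b
    union = sites_a | sites_b
    return {
        "shared": len(shared),
        "only_a": len(sites_a - sites_b),
        "only_b": len(sites_b - sites_a),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical records, made up for illustration only.
caller_a = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t101\t.\tA\tG\t50\tPASS\t.",
    "contig1\t250\t.\tC\tT\t12\tPASS\t.",
]
caller_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t101\t.\tA\tG\t60\tPASS\t.",
    "contig2\t77\t.\tG\tA\t35\tPASS\t.",
]

summary = compare_call_sets(read_sites(caller_a), read_sites(caller_b))
print(summary)
```

Sites found by only one caller are natural candidates for a closer look when deciding how strict the filtering criteria should be.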