This tutorial provides a step-by-step guide for detecting Single Nucleotide Polymorphisms (SNPs) and Insertions/Deletions (INDELs) from high-throughput sequencing data. The workflow involves individual-level and population-level variant calling, followed by filtering to retain high-confidence variants. The tools used include GATK, Bcftools, Vcftools, and Bedtools.
Required Software
- GATK (Genome Analysis Toolkit) – for variant calling
- Bcftools – for variant annotation and filtering
- Vcftools – for additional variant filtering
- Bedtools – for variant region manipulation
Required Files
- Reference Genome: reference.fasta (FASTA format)
- Indexed BAM Files: Sorted and duplicate-removed BAM files (*.bam and *.bai)
Step 1: Individual-Level Variant Calling
1.1 Variant Calling with GATK HaplotypeCaller
GATK’s HaplotypeCaller is used for detecting SNPs and INDELs at the individual level. The command below generates a GVCF file for a single sample:
gatk --java-options "-Xmx36g" HaplotypeCaller \
-R reference.fasta \ # Reference genome
-I out.sort.rmdup.bam \ # Input BAM file
-ERC GVCF \ # Generate a GVCF file
--heterozygosity 0.01 \ # Expected heterozygosity (adjustable based on species)
-O individual.g.vcf.gz # Output GVCF file
Step 2: Population-Level Variant Calling
2.1 Generate GVCF Files for Each Sample
For multiple individuals, repeat HaplotypeCaller for each sample:
gatk --java-options "-Xmx36g" HaplotypeCaller \
--native-pair-hmm-threads 6 \ # Multi-threading for efficiency
-R reference.fasta \
-I $Input_bam/$species.rmdup.sort.bam \ # Input BAM file
-ERC GVCF \ # Generate GVCF output
--heterozygosity 0.01 \ # Expected heterozygosity
-O $species.g.vcf.gz # Output GVCF file
2.2 Combine GVCFs from Multiple Samples
Combine all individual GVCF files into a single file using CombineGVCFs:
gatk –java-options “-Xmx36g” CombineGVCFs \
-R reference.fasta \ # Reference genome
–variant sample1.g.vcf.gz \ # Input GVCF file 1
–variant sample2.g.vcf.gz \ # Input GVCF file 2
-O combined.g.vcf.gz # Output combined GVCF
2.3 Genotype GVCFs
Convert the combined GVCF into a final VCF file containing genotype information:
gatk --java-options "-Xmx36g" GenotypeGVCFs \
-R reference.fasta \ # Reference genome
-V combined.g.vcf.gz \ # Input combined GVCF
-O combined.genotyped.vcf.gz \ # Output final VCF file
--heterozygosity 0.01 \ # SNP heterozygosity
--indel-heterozygosity 0.0025 \ # INDEL heterozygosity
--all-sites # Include all sites for completeness
Step 3: Variant Filtering
3.1 Annotate Variants
To ensure each variant has a unique identifier:
bcftools annotate --set-id +'%CHROM\:%POS' -O z -o variant.annote.vcf.gz variant.vcf.gz
3.2 Remove SNPs Near INDELs (5bp Window)
3.2.1 Separate SNPs and INDELs
vcftools --gzvcf variant.annote.vcf.gz --remove-indels --maf 0.0000001 --recode --recode-INFO-all --out variant.annote.snp
bgzip variant.annote.snp.recode.vcf
vcftools --gzvcf variant.annote.vcf.gz --keep-only-indels --maf 0.0000001 --recode --recode-INFO-all --out variant.annote.indel
bgzip variant.annote.indel.recode.vcf
3.2.2 Filter SNPs Near INDELs
bcftools view -h variant.annote.vcf.gz > header.txt
bedtools window -a variant.annote.snp.recode.vcf -b variant.annote.indel.recode.vcf -w 5 > variant.rm_indel_mark.vcf
cat header.txt variant.rm_indel_mark.vcf | bgzip > variant.rm_indel_mark.vcf.gz
3.3 Retain Only Biallelic SNPs
Biallelic SNPs (only two alleles per site) are commonly used in downstream analyses:
vcftools --gzvcf variant.rm_indel_mark.vcf.gz --max-alleles 2 --min-alleles 2 --recode --recode-INFO-all --out variant.biallelic
bgzip variant.biallelic.recode.vcf
3.4 Filter by Read Depth (DP > 5)
To remove low-depth variants:
vcftools --gzvcf variant.biallelic.recode.vcf.gz --minDP 5 --recode --recode-INFO-all --out variant.DP5
bgzip variant.DP5.recode.vcf
3.5 Filter by Genotyping Quality (GQ > 10)
To retain high-confidence genotypes:
vcftools --gzvcf variant.DP5.recode.vcf.gz --minGQ 10 --recode --recode-INFO-all --out variant.GQ10
bgzip variant.GQ10.recode.vcf
3.6 Filter by Missing Data and Minor Allele Frequency
vcftools --gzvcf variant.GQ10.recode.vcf.gz --max-missing 0.8 --maf 0.01 --recode --recode-INFO-all --out variant.filtered
bgzip variant.filtered.recode.vcf
License and Copyright
Copyright (C) 2024 Zhiqin Long, Xingwan Yi.