Population Genetic Analysis

This tutorial provides a step-by-step guide for filtering SNP data, constructing phylogenetic trees, applying principal component analysis (PCA) and analyzing population structure by using various methods.

Required Tools

  • Plink
  • VCFtools
  • FastTree
  • IQ-TREE
  • RAxML-NG
  • admixture

Required Files

  • Input VCF file: raw_variants.vcf (containing the SNP data)
  • Reference Genome: Used implicitly in SNP calling and alignment

Step 1: Filtering SNP Data

Before conducting phylogenetic tree construction, we filter the SNP dataset by removing missing data, low MAF variants, and ensuring only biallelic sites remain.

plink \
--vcf raw_variants.vcf \
--geno 0.1 \ # Remove SNPs with more than 10% missing data
--maf 0.01 \ # Remove SNPs with minor allele frequency < 0.01
--biallelic-only strict \ # Keep only biallelic SNPs
--out filtered_variants \ # Output file prefix
--recode vcf-iid \ # Output in VCF format
--allow-extra-chr \ # Allow non-standard chromosome names
--set-missing-var-ids @:# \ # Standardize SNP IDs
--keep-allele-order

Step 2: Linkage Disequilibrium (LD) Pruning

To avoid bias due to highly correlated SNPs, LD pruning is performed.

plink --vcf filtered_variants.vcf --indep-pairwise 50 10 0.2 --out ld_pruned_snps --allow-extra-chr

plink --vcf filtered_variants.vcf --extract ld_pruned_snps.prune.in --recode vcf-iid --out final_filtered_variants 

Step 3: Converting VCF to Phylip Format

run_pipeline.pl -Xms1G -Xmx5G -SortGenotypeFilePlugin -inputFile final_filtered_variants.vcf -outputFile sorted_variants.vcf -fileType VCF

run_pipeline.pl -Xms1G -Xmx5G -importGuess sorted_variants.vcf -ExportPlugin -saveAs aligned_sequences.phy -format Phylip_Inter

Step 4: Phylogenetic Tree Construction

1. FastTree

FastTree \
-nt \ # Nucleotide sequences
-gtr \ # Use the GTR model
aligned_sequences.phy > fasttree_output.nwk

2. IQ-TREE

iqtree \
-s aligned_sequences.phy \
-nt 4 \ # Use 4 threads
-m MFP+ASC \ # ModelFinder Plus with ascertainment bias correction
-bb 1000 \ # Bootstrap analysis
-pre iqtree_output

Specifying a Fixed Model in IQ-TREE

iqtree \
-s aligned_sequences.phy \
-nt 4 \
-m GTR+ASC \ # Use GTR model
-bb 1000 \
-pre iqtree_output

3. RAxML-NG

raxml-ng --msa aligned_sequences.phy --model GTR+G --prefix raxml_tree --threads 5 --seed 2 > raxml.log 2> raxml.err

RAxML-NG with Bootstrap Analysis

raxml-ng --all --msa aligned_sequences.phy --model GTR+G --bs-trees 100 –prefix raxml_bootstrap --threads 5 --seed 2 > raxml_bs.log 2> raxml_bs.err

Step 5: Principal Component Analysis

plink --vcf final_filtered_variants.vcf --pca 15 --out PCA_out --allow-extra-chr --set-missing-var-ids @:#

# –pca 15 indicates that PLINK should compute the first 15 principal components.

Step 6: Population Structure Analysis

Convert a VCF file to PLINK binary format

plink --vcf final_filtered_variants.vcf --make-bed --out all --allow-extra-chr --keep-allele-order --set-missing-var-ids @:#

Run Admixture for multiple k values to perform population structure analysis, using cross-validation to assess the model’s fit.

for k in {2..10}; do
admixture --cv -j2 all.bed $k 1>admix.${k}.log 2>&1
done

Extracts the “CV” (cross-validation) values from the log files generated by the admixture tool for each k value to compare model performance and identify the optimal number of populations.

grep "CV" admix.*.log

License and Copyright

Copyright (C) 2025 Xingwan Yi

Spam prevention powered by Akismet