Population Genetic Analysis – Applications of Digital Research Infrastructure (DRI) in Evolutionary Biology

This tutorial provides a step-by-step guide for filtering SNP data, constructing phylogenetic trees, applying principal component analysis (PCA) and analyzing population structure by using various methods.

Required Tools

Plink
VCFtools
FastTree
IQ-TREE
RAxML-NG
admixture

Required Files

Input VCF file: raw_variants.vcf (containing the SNP data)
Reference Genome: Used implicitly in SNP calling and alignment

Step 1: Filtering SNP Data

Before conducting phylogenetic tree construction, we filter the SNP dataset by removing missing data, low MAF variants, and ensuring only biallelic sites remain.

plink \ --vcf raw_variants.vcf \ --geno 0.1 \ # Remove SNPs with more than 10% missing data --maf 0.01 \ # Remove SNPs with minor allele frequency < 0.01 --biallelic-only strict \ # Keep only biallelic SNPs --out filtered_variants \ # Output file prefix --recode vcf-iid \ # Output in VCF format --allow-extra-chr \ # Allow non-standard chromosome names --set-missing-var-ids @:# \ # Standardize SNP IDs --keep-allele-order

Step 2: Linkage Disequilibrium (LD) Pruning

To avoid bias due to highly correlated SNPs, LD pruning is performed.

plink --vcf filtered_variants.vcf --indep-pairwise 50 10 0.2 --out ld_pruned_snps --allow-extra-chr

plink --vcf filtered_variants.vcf --extract ld_pruned_snps.prune.in --recode vcf-iid --out final_filtered_variants

Step 3: Converting VCF to Phylip Format

run_pipeline.pl -Xms1G -Xmx5G -SortGenotypeFilePlugin -inputFile final_filtered_variants.vcf -outputFile sorted_variants.vcf -fileType VCF

run_pipeline.pl -Xms1G -Xmx5G -importGuess sorted_variants.vcf -ExportPlugin -saveAs aligned_sequences.phy -format Phylip_Inter

Step 4: Phylogenetic Tree Construction

1. FastTree

FastTree \ -nt \ # Nucleotide sequences -gtr \ # Use the GTR model aligned_sequences.phy > fasttree_output.nwk

2. IQ-TREE

iqtree \ -s aligned_sequences.phy \ -nt 4 \ # Use 4 threads -m MFP+ASC \ # ModelFinder Plus with ascertainment bias correction -bb 1000 \ # Bootstrap analysis -pre iqtree_output

Specifying a Fixed Model in IQ-TREE

iqtree \ -s aligned_sequences.phy \ -nt 4 \ -m GTR+ASC \ # Use GTR model -bb 1000 \ -pre iqtree_output

3. RAxML-NG

raxml-ng --msa aligned_sequences.phy --model GTR+G --prefix raxml_tree --threads 5 --seed 2 > raxml.log 2> raxml.err

RAxML-NG with Bootstrap Analysis

raxml-ng --all --msa aligned_sequences.phy --model GTR+G --bs-trees 100 –prefix raxml_bootstrap --threads 5 --seed 2 > raxml_bs.log 2> raxml_bs.err

Step 5: Principal Component Analysis

plink --vcf final_filtered_variants.vcf --pca 15 --out PCA_out --allow-extra-chr --set-missing-var-ids @:#

# –pca 15 indicates that PLINK should compute the first 15 principal components.

Step 6: Population Structure Analysis

Convert a VCF file to PLINK binary format

plink --vcf final_filtered_variants.vcf --make-bed --out all --allow-extra-chr --keep-allele-order --set-missing-var-ids @:#

Run Admixture for multiple k values to perform population structure analysis, using cross-validation to assess the model’s fit.

for k in {2..10}; do admixture --cv -j2 all.bed $k 1>admix.${k}.log 2>&1 done

Extracts the “CV” (cross-validation) values from the log files generated by the admixture tool for each k value to compare model performance and identify the optimal number of populations.

grep "CV" admix.*.log