This tutorial provides a step-by-step guide for filtering SNP data, constructing phylogenetic trees, applying principal component analysis (PCA) and analyzing population structure by using various methods.
Required Tools
- Plink
- VCFtools
- FastTree
- IQ-TREE
- RAxML-NG
- admixture
Required Files
- Input VCF file: raw_variants.vcf (containing the SNP data)
- Reference Genome: Used implicitly in SNP calling and alignment
Step 1: Filtering SNP Data
Before conducting phylogenetic tree construction, we filter the SNP dataset by removing missing data, low MAF variants, and ensuring only biallelic sites remain.
plink \
--vcf raw_variants.vcf \
--geno 0.1 \ # Remove SNPs with more than 10% missing data
--maf 0.01 \ # Remove SNPs with minor allele frequency < 0.01
--biallelic-only strict \ # Keep only biallelic SNPs
--out filtered_variants \ # Output file prefix
--recode vcf-iid \ # Output in VCF format
--allow-extra-chr \ # Allow non-standard chromosome names
--set-missing-var-ids @:# \ # Standardize SNP IDs
--keep-allele-order
Step 2: Linkage Disequilibrium (LD) Pruning
To avoid bias due to highly correlated SNPs, LD pruning is performed.
plink --vcf filtered_variants.vcf --indep-pairwise 50 10 0.2 --out ld_pruned_snps --allow-extra-chr
plink --vcf filtered_variants.vcf --extract ld_pruned_snps.prune.in --recode vcf-iid --out final_filtered_variantsÂ
Step 3: Converting VCF to Phylip Format
run_pipeline.pl -Xms1G -Xmx5G -SortGenotypeFilePlugin -inputFile final_filtered_variants.vcf -outputFile sorted_variants.vcf -fileType VCF
run_pipeline.pl -Xms1G -Xmx5G -importGuess sorted_variants.vcf -ExportPlugin -saveAs aligned_sequences.phy -format Phylip_Inter
Step 4: Phylogenetic Tree Construction
1. FastTree
FastTree \
-nt \ # Nucleotide sequences
-gtr \ # Use the GTR model
aligned_sequences.phy > fasttree_output.nwk
2. IQ-TREE
iqtree \
-s aligned_sequences.phy \
-nt 4 \ # Use 4 threads
-m MFP+ASC \ # ModelFinder Plus with ascertainment bias correction
-bb 1000 \ # Bootstrap analysis
-pre iqtree_output
Specifying a Fixed Model in IQ-TREE
iqtree \
-s aligned_sequences.phy \
-nt 4 \
-m GTR+ASC \ # Use GTR model
-bb 1000 \
-pre iqtree_output
3. RAxML-NG
raxml-ng --msa aligned_sequences.phy --model GTR+G --prefix raxml_tree --threads 5 --seed 2 > raxml.log 2> raxml.err
RAxML-NG with Bootstrap Analysis
raxml-ng --all --msa aligned_sequences.phy --model GTR+G --bs-trees 100 –prefix raxml_bootstrap --threads 5 --seed 2 > raxml_bs.log 2> raxml_bs.err
Step 5: Principal Component Analysis
plink --vcf final_filtered_variants.vcf --pca 15 --out PCA_out --allow-extra-chr --set-missing-var-ids @:#
# –pca 15 indicates that PLINK should compute the first 15 principal components.
Step 6: Population Structure Analysis
Convert a VCF file to PLINK binary format
plink --vcf final_filtered_variants.vcf --make-bed --out all --allow-extra-chr --keep-allele-order --set-missing-var-ids @:#
Run Admixture for multiple k values to perform population structure analysis, using cross-validation to assess the model’s fit.
for k in {2..10}; do
admixture --cv -j2 all.bed $k 1>admix.${k}.log 2>&1
done
Extracts the “CV” (cross-validation) values from the log files generated by the admixture tool for each k value to compare model performance and identify the optimal number of populations.
grep "CV" admix.*.log
License and Copyright
Copyright (C) 2025 Xingwan Yi