Genome-Wide Association (GWA) Study

This tutorial provides a complete guide to conducting GWAS using SNP and INDEL data with EMMAX. The workflow includes VCF-to-TPED conversion, phenotype data preparation, kinship matrix computation, and GWAS execution. The final output can be used for visualization through Manhattan plots to identify significant genomic associations.

Required Tools and Files

Software

  • EMMAX: GWAS analysis software
  • PLINK: Genotype data processing tool

Required Files

  • SNP VCF File: genotype_snp.vcf
  • INDEL VCF File: genotype_indel.vcf
  • Phenotype Data File: phenotype_data.txt
  • Population Structure (Q Matrix): population_structure.Q

1. GWAS Analysis with SNP Data

Step 1: Convert VCF to TPED Format (Required by EMMAX)

plink --vcf genotype_snp.vcf \
--maf 0.05 --geno 0.1 \ # Filter out SNPs with MAF < 0.05 and missing rate > 10%
--recode 12 transpose \ # Convert genotype to 12 format (TPED format)
--output-missing-genotype 0 \ # Represent missing data as 0
--out snp_filtered \ # Output file prefix
--set-missing-var-ids @:# \ # Set SNP names
--allow-extra-chr \
--keep-allele-order

Output Files

snp_filtered.log
snp_filtered.nosex
snp_filtered.tfam
snp_filtered.tped

Step 2: Sort Phenotype Data

perl sort_phenotype.pl snp_filtered.tfam phenotype_data.txt > phenotype_sorted.txt

Step 3: Compute Kinship Matrix

emmax-kin-intel64 -v -d 10 -o kinship_matrix snp_filtered

Step 4: Perform GWAS Analysis

emmax-intel64 -v -d 10 -t snp_filtered -p phenotype_sorted.txt -k kinship_matrix -o gwas_results

Step 5: Process Results for Visualization

awk '{print $1"\t"$2"\t"$3"\t"$4"\t"}' snp_filtered.tped | \
paste - gwas_results.ps | \
awk 'BEGIN{print "SNP\tCHR\tBP\tP"}{if($2==$5){print $2"\t"$1"\t"$4"\t"$NF}}' > gwas_results_processed.txt

In R

# Install required R packages
install.packages("qqman")
install.packages("ggplot2")
library(qqman)
library(ggplot2)


# Load data
snp_res <- read.table("gwas_results_processed.txt", header=TRUE)

# Calculate Bonferroni threshold
total_snps <- nrow(snp_res)
bonferroni_thresh <- 0.05 / total_snps
cat("SNP Bonferroni threshold:", bonferroni_thresh, "\n")

# Manhattan Plot
png("manhattan_snp.png", width=12, height=6, units="in", res=300)
manhattan(snp_res,
chr="CHR",
bp="BP",
snp="SNP",
p="P",
main="SNP Manhattan Plot",
annotatePval = bonferroni_thresh,
annotateTop = FALSE,
col = c("skyblue3", "deepskyblue4"),
suggestiveline = -log10(bonferroni_thresh),
genomewideline = -log10(1e-6))
dev.off()

# Q-Q Plot
png("qqplot_snp.png", width=8, height=8, units="in", res=300)
qq(snp_res$P, main="Q-Q Plot: SNP GWAS")
dev.off()

Step 6: SNP GWAS with Q Matrix (Population Structure as Covariate)

awk '{print $1}' snp_filtered.tfam | paste - population_structure.Q | awk '{print $1" "$1" 1 "$2" "$3}' > Q_matrix_formatted.txt

emmax-intel64 -v -d 10 -t snp_filtered -p phenotype_sorted.txt -k kinship_matrix -c Q_matrix_formatted.txt -o gwas_results_Q

2. GWAS Analysis with INDEL Data

Step 1: Convert VCF to TPED Format

plink --vcf genotype_indel.vcf --maf 0.01 --geno 0.5 --recode 12 transpose --out indel_filtered --set-missing-var-ids @:# --allow-extra-chr --keep-allele-order

Step 2: Sort Phenotype Data

perl sort_phenotype.pl indel_filtered.tfam phenotype_data.txt > phenotype_sorted.txt

Step 3: Compute Kinship Matrix

emmax-kin-intel64 -v -d 10 -o kinship_matrix_indel indel_filtered

Step 4: Run GWAS Analysis

emmax-intel64 -v -d 10 -t indel_filtered -p phenotype_sorted.txt -k kinship_matrix_indel -o gwas_results_indel

Step 5: Process Results for Visualization

awk '{print $1"\t"$2"\t"$3"\t"$4"\t"}' indel_filtered.tped | \
paste - gwas_results_indel.ps | \
awk 'BEGIN{print "SNP\tCHR\tBP\tP"}{if($2==$5){print $2"\t"$1"\t"$4"\t"$NF}}' > gwas_results_indel_processed.txt

In R


library(qqman)
library(ggplot2)

# Load data
indel_res <- read.table("gwas_results_indel_processed.txt", header=TRUE)

# Calculate INDEL threshold
total_indels <- nrow(indel_res)
indel_thresh <- 0.05 / total_indels
cat("INDEL Bonferroni threshold:", indel_thresh, "\n")

# Manhattan Plot
png("manhattan_indel.png", width=12, height=6, units="in", res=300)
manhattan(indel_res,
chr="CHR",
bp="BP",
snp="SNP",
p="P",
main="INDEL Manhattan Plot",
annotatePval = indel_thresh,
annotateTop = FALSE,
col = c("salmon", "firebrick4"),
suggestiveline = -log10(indel_thresh),
genomewideline = -log10(1e-5))
dev.off()

# Q-Q Plot
png("qqplot_indel.png", width=8, height=8, units="in", res=300)
qq(indel_res$P, main="Q-Q Plot: INDEL GWAS")
dev.off()

License and Copyright

Copyright (C) 2025 Xingwan Yi

Spam prevention powered by Akismet