Genome-Environment Association (GEA) analysis identifies genetic variants associated with environmental variables. LFMM (Latent Factor Mixed Model) is a widely used method for GEA, accounting for population structure and cryptic relatedness. This tutorial provides a step-by-step guide to performing GEA using LFMM.
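For orientation, the model that LFMM fits is usually written in the form below. This is a sketch of the standard formulation; the symbol names are chosen here for illustration and are not defined elsewhere in this tutorial.

```latex
Y = X B^{\top} + U V^{\top} + E
```

Here Y is the n x p genotype matrix (n samples, p SNPs), X is the n x d matrix of environmental variables, B is the p x d matrix of environmental effect sizes, U (n x K) and V (p x K) are the latent factor scores and loadings that absorb population structure, and E is the residual error. K is the number of latent factors determined in Step 3.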
Required Tools
- LEA and lfmm: R packages for ancestry estimation (sNMF) and genome-environment association analysis
- PLINK: For genotype data formatting
- R: For statistical analysis and visualization
Required Files
- Genotype Data (VCF format):
population_genotypes.vcf
- Environmental Data (CSV format):
environmental_data.csv
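The R code in this tutorial reads the environmental file with row.names=1, so the first column is expected to hold sample IDs and the remaining columns numeric variables. The sketch below builds a tiny example file to show that layout; the sample IDs, the values, and the Precipitation column are made up for illustration.

```r
# Hypothetical example: three samples, two environmental variables.
# Sample IDs and values are made up for illustration.
env_example <- data.frame(
  Sample = c("ind01", "ind02", "ind03"),
  Temperature = c(12.4, 15.1, 9.8),
  Precipitation = c(810, 640, 1020)
)
csv_path <- file.path(tempdir(), "environmental_data_example.csv")
write.csv(env_example, csv_path, row.names = FALSE)

# Reading with row.names = 1 turns the first column into row names,
# leaving only the environmental variables as columns.
env_check <- read.csv(csv_path, row.names = 1)
print(env_check)

# Basic checks worth running on real data before the analysis
stopifnot(!anyDuplicated(rownames(env_check)))   # one row per sample
stopifnot(all(sapply(env_check, is.numeric)))    # numeric variables only
stopifnot(!anyNA(env_check))                     # no missing values
```

Running the same three checks on the real environmental_data.csv catches the most common formatting problems before any model is fitted.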
Step 1: Convert VCF to LFMM Format
Use PLINK to filter variants and export additive coding (0/1/2) to genotype_data.raw. Missing genotypes are written as NA in the .raw file; --output-missing-genotype only affects PED-style output, so it is not needed here:
plink --vcf population_genotypes.vcf \
--maf 0.05 --geno 0.1 \
--recode A \
--out genotype_data \
--allow-extra-chr \
--double-id
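Before going further it can help to inspect the exported genotypes. PLINK's .raw output has six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by one column per SNP. The sketch below is a minimal check under that assumption; it writes a tiny made-up .raw file so that it runs on its own, and the variable names are illustrative.

```r
# Tiny made-up .raw file so the sketch runs on its own; with real data,
# point raw_path at genotype_data.raw instead.
raw_path <- file.path(tempdir(), "example.raw")
writeLines(c(
  "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G snp3_T",
  "ind01 ind01 0 0 0 -9 0 1 2",
  "ind02 ind02 0 0 0 -9 1 NA 2",
  "ind03 ind03 0 0 0 -9 2 1 0"
), raw_path)

raw <- read.table(raw_path, header = TRUE, check.names = FALSE)
geno <- as.matrix(raw[, -(1:6)])   # drop the six leading metadata columns

cat("Samples:", nrow(geno), " SNPs:", ncol(geno), "\n")

# Every genotype should be 0, 1, 2, or NA (missing)
stopifnot(all(geno %in% c(0, 1, 2, NA)))

# Fraction of missing calls per SNP; compare against the --geno threshold
missing_rate <- colMeans(is.na(geno))
print(round(missing_rate, 3))
```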
Step 2: Match Sample Ordering
Load the required R packages and align the environmental data with the genotype sample order:
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("LEA") # If not installed; LEA is distributed through Bioconductor
library(LEA)
# Read genotype sample IDs from the PLINK .raw file (column 2, IID)
raw <- read.table("genotype_data.raw", header=TRUE, check.names=FALSE)
sample_ids <- as.character(raw$IID)
# Load and align environmental data
env_data <- read.csv("environmental_data.csv", row.names=1)
env_data <- env_data[match(sample_ids, rownames(env_data)), , drop=FALSE]
# Verify alignment
stopifnot(all(rownames(env_data) == sample_ids)) # Critical check
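When the alignment check fails, the usual cause is a sample that appears in one file but not the other. The following sketch shows one way to find such samples and align on the shared set; all IDs and values are made up, and the object names are illustrative rather than part of the workflow above.

```r
# Made-up IDs for illustration
sample_ids_demo <- c("ind01", "ind02", "ind03", "ind04")
env_demo <- data.frame(
  Temperature = c(9.8, 12.4, 15.1),
  row.names = c("ind03", "ind01", "ind02")   # different order, ind04 absent
)

# Samples present in one table but not the other
missing_env  <- setdiff(sample_ids_demo, rownames(env_demo))
missing_geno <- setdiff(rownames(env_demo), sample_ids_demo)
cat("No environmental record for:", missing_env, "\n")
cat("No genotype record for:", missing_geno, "\n")

# Reorder using only the shared samples, keeping the genotype order
shared <- sample_ids_demo[sample_ids_demo %in% rownames(env_demo)]
env_aligned <- env_demo[shared, , drop = FALSE]
stopifnot(identical(rownames(env_aligned), shared))
```

If any samples are dropped this way, the genotype matrix has to be subset to the same shared samples, in the same order, before running the association test.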
Step 3: Determine Optimal Latent Factors (K)
In R:
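# lfmm2geno() expects an LFMM-format matrix, not a PLINK .raw file:
# keep the genotype columns only and code missing values as 9
raw <- read.table("genotype_data.raw", header=TRUE, check.names=FALSE)
geno <- as.matrix(raw[, -(1:6)]) # Drop FID, IID, PAT, MAT, SEX, PHENOTYPE
geno[is.na(geno)] <- 9
write.lfmm(geno, "genotype_data.lfmm")
lfmm2geno("genotype_data.lfmm", "genotype_data.geno")
project <- snmf("genotype_data.geno",
K=1:5,
entropy=TRUE,
repetitions=5,
CPU=4)
# Select K with minimum mean cross-entropy across repetitions
cross_ent <- sapply(1:5, function(k) mean(cross.entropy(project, K=k)))
best_k <- which.min(cross_ent) # Index equals K because K starts at 1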
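The selection rule above amounts to averaging the cross-entropy over repetitions for each K and taking the smallest value. The sketch below illustrates that rule on a made-up matrix of cross-entropy values (rows are repetitions, columns are K = 1 to 5); the numbers are invented and do not come from any real run.

```r
# Made-up cross-entropy values: rows are repetitions, columns are K = 1..5
ce <- matrix(c(
  0.72, 0.65, 0.61, 0.62, 0.64,
  0.73, 0.66, 0.60, 0.63, 0.65,
  0.71, 0.64, 0.62, 0.62, 0.66
), nrow = 3, byrow = TRUE, dimnames = list(NULL, paste0("K", 1:5)))

mean_ce <- colMeans(ce)          # average over repetitions
k_demo  <- which.min(mean_ce)    # column index equals K because K starts at 1

# Plot the curve and mark the selected K
plot(1:5, mean_ce, type = "b", xlab = "K", ylab = "Mean cross-entropy")
points(k_demo, mean_ce[k_demo], col = "red", pch = 16)
```

In practice the curve often flattens rather than showing a sharp minimum, so it is worth looking at the plot instead of relying on which.min() alone.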
Step 4: Single-Variable LFMM Analysis
In R:
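library(lfmm) # lfmm_ridge() and lfmm_test() come from the lfmm package, not LEA
# Standardize environmental variable
env_var <- scale(env_data$Temperature) # Example: Temperature
# Load genotypes as a numeric matrix (samples x SNPs); these functions take a matrix, not a file path
Y <- as.matrix(read.table("genotype_data.raw", header=TRUE, check.names=FALSE)[, -(1:6)])
# Fill missing genotypes with the per-SNP mean so that Y is a complete numeric matrix
for (j in seq_len(ncol(Y))) Y[is.na(Y[, j]), j] <- mean(Y[, j], na.rm=TRUE)
# Run LFMM with optimal K
mod_lfmm <- lfmm_ridge(Y = Y,
X = env_var,
K = best_k,
lambda = 1e-5)
# Association testing, calibrated with the genomic inflation factor
pv <- lfmm_test(Y = Y,
X = env_var,
lfmm = mod_lfmm,
calibrate = "gif")
pvals <- pv$calibrated.pvalue[,1] # Extract calibrated p-values for current variable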
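Before correcting for multiple testing, it is worth checking that the p-values are well calibrated. A common diagnostic is the genomic inflation factor (GIF): the median of the squared z-scores divided by the median of a chi-square distribution with one degree of freedom. The sketch below shows that calculation on simulated, deliberately inflated z-scores; it illustrates the general genomic-control idea, and the calibration performed inside the lfmm package may differ in detail.

```r
set.seed(1)
# Simulated z-scores, deliberately inflated (sd 1.3 instead of 1)
z <- rnorm(5000, mean = 0, sd = 1.3)

# Genomic inflation factor: median chi-square statistic over its null median
gif <- median(z^2) / qchisq(0.5, df = 1)

# Uncalibrated and calibrated p-values
p_raw <- pchisq(z^2, df = 1, lower.tail = FALSE)
p_cal <- pchisq(z^2 / gif, df = 1, lower.tail = FALSE)

cat("GIF:", round(gif, 2), "\n")
# Under the null, calibrated p-values should look roughly uniform
hist(p_cal, breaks = 20, main = "Calibrated p-values", xlab = "p")
```

A GIF well above 1 suggests too few latent factors (too liberal a test), and a GIF well below 1 suggests too many.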
Step 5: Environment-Specific Multiple Testing
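Apply the Benjamini-Hochberg FDR correction to the p-values for this environmental variable and keep SNPs with an adjusted p-value below 0.05: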
pvals_adj <- p.adjust(pvals, method="fdr")
significant_snps <- which(pvals_adj < 0.05)
# Output results
cat("Significant SNPs for Temperature:", length(significant_snps), "\n")
write.csv(data.frame(SNP=significant_snps,
p.adj=pvals_adj[significant_snps]),
"Temperature_associations.csv")
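To make the correction less of a black box, the sketch below reproduces the Benjamini-Hochberg adjustment by hand on five made-up p-values and checks it against p.adjust(). In p.adjust(), method="fdr" is an alias for method="BH".

```r
# Made-up p-values for five SNPs
p <- c(0.001, 0.008, 0.039, 0.041, 0.60)
n <- length(p)

# Benjamini-Hochberg by hand:
# sort descending, scale each p by n / rank, then take a running minimum
o  <- order(p, decreasing = TRUE)
ro <- order(o)                       # maps back to the original order
bh <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

print(round(bh, 4))
# The manual result matches R's built-in correction
stopifnot(isTRUE(all.equal(bh, p.adjust(p, method = "BH"))))
```

In this example the third and fourth p-values are below 0.05 before adjustment but not after it, which is the behaviour the correction is meant to produce.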
Step 6: Visualization
plot(-log10(pvals_adj),
main = "Temperature Associations (FDR < 0.05)",
xlab = "SNP Index",
ylab = "-log10(adjusted p)",
col = ifelse(pvals_adj < 0.05, "red", "black"),
pch = 16)
abline(h = -log10(0.05), col="blue", lty=2)
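A quantile-quantile (QQ) plot of the unadjusted p-values is a useful companion to the plot above, because it shows whether the bulk of the p-values follows the null distribution. The sketch below uses simulated uniform p-values as a stand-in; with real results, p_demo would be replaced by the raw p-values from the association test.

```r
set.seed(42)
# Simulated p-values standing in for the raw LFMM p-values
p_demo <- runif(2000)

observed <- -log10(sort(p_demo))
expected <- -log10(ppoints(length(p_demo)))

plot(expected, observed,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)",
     main = "QQ plot (simulated null p-values)", pch = 16, cex = 0.6)
abline(0, 1, col = "blue", lty = 2)   # points on this line match the null
```

Points that rise above the diagonal only in the upper tail are consistent with a few true associations; a curve that departs from the diagonal early suggests residual confounding.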
Other method: Genome-environment association analysis with BayPass
License and Copyright
Copyright (C) 2025 Xingwan Yi