Genome-Environment Association (GEA) Analysis

Genome-environment association (GEA) analysis identifies genetic variants whose genotypes covary with environmental variables. The latent factor mixed model (LFMM) is a widely used GEA method that corrects for confounding from population structure and cryptic relatedness by fitting K latent factors alongside the environmental effect. This tutorial walks through a GEA analysis with LFMM step by step.

Required Tools

  • LEA and lfmm: R packages for population structure (sNMF) and LFMM-based genome-environment association analysis
  • PLINK: For genotype filtering and data formatting
  • R: For statistical analysis and visualization

Required Files

  • Genotype Data (VCF format): population_genotypes.vcf
  • Environmental Data (CSV format): environmental_data.csv

Step 1: Convert VCF to LFMM Format

Use PLINK to filter variants and export additive genotype coding (0/1/2). The --recode A option writes genotype_data.raw, in which missing genotypes appear as NA; they are recoded for LFMM later in R:

plink --vcf population_genotypes.vcf \
  --maf 0.05 --geno 0.1 \
  --recode A \
  --out genotype_data \
  --allow-extra-chr \
  --double-id
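To make the two filters concrete, here is a minimal base-R sketch of what --maf 0.05 and --geno 0.1 do, applied to a made-up genotype matrix. The matrix, its names, and the variable names are assumptions for illustration only; PLINK's own implementation is not shown.

```r
# Toy genotype matrix: 4 individuals (rows) x 3 SNPs (columns), NA = missing
geno <- matrix(c(0, 1, 2, 1,
                 0, 0, 0, 0,
                 1, NA, NA, 2),
               nrow = 4,
               dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Per-SNP missing rate (the quantity thresholded by --geno)
miss_rate <- colMeans(is.na(geno))

# Per-SNP minor allele frequency (the quantity thresholded by --maf);
# each diploid genotype carries two alleles, hence the division by 2
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)

# Keep SNPs with at most 10% missing calls and MAF of at least 0.05
keep <- miss_rate <= 0.1 & maf >= 0.05
filtered <- geno[, keep, drop = FALSE]
print(colnames(filtered))
```

In this toy example snp2 is monomorphic and snp3 has half of its calls missing, so only snp1 passes both filters.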

Step 2: Match Sample Ordering

Load the required R packages:

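# LEA is distributed through Bioconductor rather than CRAN
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("LEA") # If not installed
library(LEA)

# Read genotype sample IDs from the PLINK .raw file written in Step 1
raw <- read.table("genotype_data.raw", header=TRUE, check.names=FALSE)
sample_ids <- as.character(raw$IID) # Column IID of the .raw file holds the sample IDs

# Load and align environmental data (drop=FALSE keeps a data frame even with one variable)
env_data <- read.csv("environmental_data.csv", row.names=1)
env_data <- env_data[match(sample_ids, rownames(env_data)), , drop=FALSE]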

# Verify alignment
stopifnot(all(rownames(env_data) == sample_ids)) # Critical check
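The alignment step can be illustrated on a toy example. The sketch below uses made-up sample IDs and temperatures (all names and values are assumptions) to show how match() reorders the environmental table and how unmatched samples can be detected before they cause trouble.

```r
# Hypothetical sample order in the genotype file
sample_ids <- c("S3", "S1", "S2")

# Hypothetical environmental table, stored in a different order
env_data <- data.frame(Temperature = c(10.5, 12.1, 9.8),
                       row.names = c("S1", "S2", "S3"))

# Position of each genotyped sample in the environmental table
idx <- match(sample_ids, rownames(env_data))

# An NA in idx means a genotyped sample has no environmental record
missing_ids <- sample_ids[is.na(idx)]
stopifnot(length(missing_ids) == 0)

# Reorder rows; drop = FALSE keeps a data frame with a single column
aligned <- env_data[idx, , drop = FALSE]
print(aligned)
```

Checking for NA in the match() result first gives a clearer error than discovering rows of missing values later in the analysis.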

Step 3: Determine Optimal Latent Factors (K)

In R:

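# Build an LFMM-format matrix from the PLINK .raw file:
# drop the six leading ID columns and code missing genotypes as 9
raw <- read.table("genotype_data.raw", header=TRUE, check.names=FALSE)
geno <- as.matrix(raw[, -(1:6)])
geno[is.na(geno)] <- 9
write.lfmm(geno, "genotype_data.lfmm")
lfmm2geno("genotype_data.lfmm", "genotype_data.geno")

project <- snmf("genotype_data.geno",
                K=1:5,
                entropy=TRUE,
                repetitions=5,
                CPU=4)

# Select K with the lowest mean cross-entropy across repetitions
# (cross.entropy() is queried one K at a time)
mean_ce <- sapply(1:5, function(k) mean(cross.entropy(project, K=k)))
best_k <- which.min(mean_ce)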
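The selection rule itself is simple to illustrate without running sNMF. The sketch below uses a made-up cross-entropy table (rows are repetitions, columns are values of K; every number is hypothetical) and picks the K with the lowest mean.

```r
# Hypothetical cross-entropy values: rows = repetitions, columns = K
ce <- matrix(c(0.62, 0.55, 0.51, 0.52, 0.54,
               0.63, 0.54, 0.50, 0.53, 0.55,
               0.61, 0.56, 0.52, 0.51, 0.53),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("run", 1:3), paste0("K", 1:5)))

# Average over repetitions, then take the K with the smallest mean
mean_ce <- colMeans(ce)
best_k <- unname(which.min(mean_ce))
print(round(mean_ce, 3))
print(best_k)
```

With real data the minimum is not always sharp; when the curve flattens, the K at which it starts to plateau is a common alternative choice.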

Step 4: Single-Variable LFMM Analysis

In R:

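# lfmm_ridge() and lfmm_test() are provided by the lfmm package, not LEA
# (if not installed: remotes::install_github("bcm-uga/lfmm"))
library(lfmm)

# Load genotypes as a numeric matrix (individuals x SNPs); LEA codes missing as 9
Y <- read.geno("genotype_data.geno")
Y[Y == 9] <- NA
# Simple imputation: replace missing genotypes with the per-SNP mean
na_idx <- which(is.na(Y), arr.ind=TRUE)
Y[na_idx] <- colMeans(Y, na.rm=TRUE)[na_idx[, 2]]

# Standardize environmental variable
env_var <- scale(env_data$Temperature) # Example: Temperature

# Run LFMM with optimal K
mod_lfmm <- lfmm_ridge(Y = Y,
                       X = env_var,
                       K = best_k,
                       lambda = 1e-5)

# Association testing (p-values calibrated with the genomic inflation factor)
pv <- lfmm_test(Y = Y,
                X = env_var,
                lfmm = mod_lfmm,
                calibrate = "gif")
pvals <- pv$calibrated.pvalue[,1] # Extract p-values for current variable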
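Test statistics from latent factor models are often recalibrated with a genomic inflation factor (GIF) before the p-values are interpreted. The sketch below illustrates that general idea in base R on simulated z-scores; the inflation of 1.5 and all variable names are assumptions for illustration, and this is not the internal code of any package.

```r
set.seed(1)

# Hypothetical z-scores under the null, inflated by a factor of 1.5
z <- rnorm(1000) * 1.5

# Genomic inflation factor: median of the squared z-scores divided by
# the median of a chi-square distribution with 1 degree of freedom
gif <- median(z^2) / qchisq(0.5, df = 1)

# Calibrated p-values from the rescaled chi-square statistics
p_cal <- pchisq(z^2 / gif, df = 1, lower.tail = FALSE)

print(gif)
hist(p_cal, breaks = 20, main = "Calibrated p-values", xlab = "p-value")
```

A GIF well above 1 suggests residual confounding (too few latent factors), while a value well below 1 suggests an overly conservative test. After calibration the p-value histogram should look roughly flat, with a peak near zero only if true associations are present.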

Step 5: Environment-Specific Multiple Testing

Benjamini-Hochberg FDR correction:

pvals_adj <- p.adjust(pvals, method="fdr")
significant_snps <- which(pvals_adj < 0.05)

# Output results (the SNP column holds column indices of the genotype matrix)
cat("Significant SNPs for Temperature:", length(significant_snps), "\n")
write.csv(data.frame(SNP=significant_snps,
                     p.adj=pvals_adj[significant_snps]),
          "Temperature_associations.csv",
          row.names=FALSE)

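For readers who want to see what the Benjamini-Hochberg adjustment does, the following sketch computes it by hand on a handful of made-up p-values and checks the result against p.adjust(). The p-values and variable names are assumptions for illustration.

```r
# Hypothetical p-values for six SNPs
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)
m <- length(pvals)
ord <- order(pvals)

# Scale the i-th smallest p-value by m / i, then enforce monotonicity
# by taking running minima from the largest p-value downwards
scaled <- pvals[ord] * m / seq_len(m)
adj_sorted <- pmin(1, rev(cummin(rev(scaled))))

# Put the adjusted values back in the original SNP order
manual_bh <- numeric(m)
manual_bh[ord] <- adj_sorted

# "BH" and "fdr" are two names for the same method in p.adjust()
builtin_bh <- p.adjust(pvals, method = "BH")
print(cbind(pvals, manual_bh, builtin_bh))
n_sig <- sum(manual_bh < 0.05)
```

Here four raw p-values fall below 0.05, but only two remain significant after adjustment, which is the behaviour the correction is meant to produce.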
Step 6: Visualization

plot(-log10(pvals_adj),
main = "Temperature Associations (FDR < 0.05)",
xlab = "SNP Index",
ylab = "-log10(adjusted p)",
col = ifelse(pvals_adj < 0.05, "red", "black"),
pch = 16)
abline(h = -log10(0.05), col="blue", lty=2)

Other method: Genome-environment association analysis with BayPass

License and Copyright

Copyright (C) 2025 Xingwan Yi
