Genome-Environment Association (GEA) Analysis – Applications of Digital Research Infrastructure (DRI) in Evolutionary Biology

Genome-Environment Association (GEA) analysis identifies genetic variants associated with environmental variables. LFMM (Latent Factor Mixed Model) is a widely used method for GEA, accounting for population structure and cryptic relatedness. This tutorial provides a step-by-step guide to performing GEA using LFMM.
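For orientation, the model that LFMM fits can be summarized as below. This is a sketch of the standard latent factor mixed model with a ridge penalty, not a derivation. The symbols are introduced here for illustration and are not used elsewhere in this tutorial.

```latex
% Y: n x p genotype matrix (n individuals, p SNPs)
% X: n x d matrix of environmental variables
% B: p x d matrix of environmental effect sizes (the quantities being tested)
% U: n x K latent factor scores, V: p x K loadings, E: residual noise
\[
  Y = X B^{\top} + U V^{\top} + E
\]
% Ridge estimate, with regularization parameter \lambda > 0:
\[
  (\hat{U}, \hat{V}, \hat{B}) = \arg\min_{U, V, B}
  \; \lVert Y - U V^{\top} - X B^{\top} \rVert_F^{2} + \lambda \lVert B \rVert_2^{2}
\]
```

The K latent factors in U absorb population structure and other unobserved confounders, which is why K has to be chosen before the association step.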

Required Tools

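LEA and lfmm: R packages for genome-environment association analysis (LEA provides snmf() and genotype format conversion; lfmm provides lfmm_ridge() and lfmm_test())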
PLINK: For genotype data formatting
R: For statistical analysis and visualization

Required Files

Genotype Data (VCF format): population_genotypes.vcf
Environmental Data (CSV format): environmental_data.csv

Step 1: Convert VCF to LFMM Format

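Use PLINK to filter variants (minor allele frequency of at least 0.05, at most 10% missing calls per variant) and export additive coding (0/1/2 copies of the counted allele). The output, genotype_data.raw, has a header row and six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE). PLINK writes missing genotypes in this file as NA, so they must be recoded to 9, the LFMM missing-value code, before conversion:

    plink --vcf population_genotypes.vcf \
      --maf 0.05 --geno 0.1 \
      --recode A \
      --out genotype_data \
      --allow-extra-chr \
      --double-id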
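To make the two filters concrete, the following base-R sketch applies the same thresholds to a small additive-coded matrix. The matrix toy_geno and its values are made up for illustration. PLINK does the real filtering, and its internal order of operations is not reproduced here.

```r
# Toy additive-coded genotypes: rows = individuals, columns = SNPs (hypothetical values)
toy_geno <- cbind(
  snp_ok      = c(0, 1, 2, 1, 0, 1, 2, 0, 1, 1),
  snp_rare    = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  snp_missing = c(NA, NA, 1, 2, 0, 1, 2, 0, 1, 1)
)

# Per-SNP missing-call rate (the quantity thresholded by --geno 0.1)
miss_rate <- colMeans(is.na(toy_geno))

# Per-SNP minor allele frequency from non-missing calls (thresholded by --maf 0.05)
alt_freq <- colMeans(toy_geno, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)

# Keep SNPs that pass both filters
keep <- miss_rate <= 0.1 & maf >= 0.05
filtered <- toy_geno[, keep, drop = FALSE]
print(colnames(filtered))
```

Here snp_rare is dropped because it is monomorphic, and snp_missing is dropped because 20% of its calls are missing.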

Step 2: Match Sample Ordering

Load the required R packages:

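    # LEA is distributed through Bioconductor; install once if needed
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    if (!requireNamespace("LEA", quietly = TRUE)) BiocManager::install("LEA")
    library(LEA)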

    # Read genotype sample IDs from the PLINK .raw file (--recode A does not write a .tfam file)
    raw_ids <- read.table("genotype_data.raw", header = TRUE)[, c("FID", "IID")]
    sample_ids <- as.character(raw_ids$IID)  # the IID column holds the sample IDs

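    # Load and align environmental data (first CSV column = sample IDs)
    env_data <- read.csv("environmental_data.csv", row.names = 1)
    idx <- match(sample_ids, rownames(env_data))
    stopifnot(!anyNA(idx))  # every genotyped sample needs an environmental record
    env_data <- env_data[idx, , drop = FALSE]

    # Verify alignment (critical check)
    stopifnot(identical(rownames(env_data), as.character(sample_ids)))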
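The alignment step is easy to get subtly wrong, so here is a small self-contained illustration of how match() reorders a table and how missing samples show up. The sample IDs and temperatures are hypothetical.

```r
# Hypothetical sample IDs in genotype-file order
sample_ids <- c("ind3", "ind1", "ind2")

# Hypothetical environmental table in a different order, with one extra sample
env_toy <- data.frame(
  Temperature = c(10.5, 12.0, 8.2, 15.1),
  row.names = c("ind1", "ind2", "ind3", "ind4")
)

idx <- match(sample_ids, rownames(env_toy))

# match() returns NA for IDs absent from the table, so check before subsetting
if (anyNA(idx)) {
  stop("Missing environmental data for: ",
       paste(sample_ids[is.na(idx)], collapse = ", "))
}

# Reorder rows to genotype order; drop = FALSE keeps a data frame even with one column
aligned <- env_toy[idx, , drop = FALSE]
print(aligned)
```

The extra sample ind4 is silently dropped, while a genotyped sample with no environmental record would stop the script instead of producing a row of NA values.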

Step 3: Determine Optimal Latent Factors (K)

In R:

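    # lfmm2geno() expects an LFMM-format matrix (no header, missing = 9), not a PLINK .raw file
    geno <- as.matrix(read.table("genotype_data.raw", header = TRUE)[, -(1:6)])
    geno[is.na(geno)] <- 9
    write.lfmm(geno, "genotype_data.lfmm")
    lfmm2geno("genotype_data.lfmm", "genotype_data.geno")

    project <- snmf("genotype_data.geno", K = 1:5, entropy = TRUE, repetitions = 5, CPU = 4)

    # Select the K with the lowest mean cross-entropy across repetitions
    mean_ce <- sapply(1:5, function(k) mean(cross.entropy(project, K = k)))
    best_k <- which.min(mean_ce)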
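The selection rule itself can be illustrated without running snmf. The sketch below uses a made-up matrix of cross-entropy values (rows are repetitions, columns are K = 1 to 5) and picks the K with the lowest mean. The numbers are hypothetical and only show the arithmetic.

```r
# Hypothetical cross-entropy values: rows = repetitions, columns = K = 1..5
ce_toy <- matrix(
  c(0.82, 0.71, 0.66, 0.67, 0.69,
    0.83, 0.70, 0.65, 0.68, 0.70,
    0.81, 0.72, 0.66, 0.66, 0.71),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("run", 1:3), paste0("K", 1:5))
)

# Average over repetitions, then take the K with the smallest mean
mean_ce_toy <- colMeans(ce_toy)
best_k_toy <- unname(which.min(mean_ce_toy))
print(round(mean_ce_toy, 3))
print(best_k_toy)
```

When the curve flattens rather than showing a clear minimum, plotting the means against K is usually more informative than relying on which.min() alone.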

Step 4: Single-Variable LFMM Analysis

In R:

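    # lfmm_ridge() and lfmm_test() come from the lfmm package (github.com/bcm-uga/lfmm), not LEA
    library(lfmm)  # install once if needed, e.g. remotes::install_github("bcm-uga/lfmm")

    # Standardize the environmental variable (example: Temperature)
    env_var <- scale(env_data$Temperature)

    # lfmm_ridge() needs a numeric matrix, not a file name; LEA codes missing genotypes as 9
    Y <- read.geno("genotype_data.geno")
    Y[Y == 9] <- NA
    # Simple placeholder imputation: replace missing calls with the per-SNP mean
    for (j in which(colSums(is.na(Y)) > 0)) Y[is.na(Y[, j]), j] <- mean(Y[, j], na.rm = TRUE)

    # Run LFMM with the K chosen in Step 3
    mod_lfmm <- lfmm_ridge(Y = Y, X = env_var, K = best_k, lambda = 1e-5)

    # Association testing, calibrated with the genomic inflation factor
    pv <- lfmm_test(Y = Y, X = env_var, lfmm = mod_lfmm, calibrate = "gif")
    pvals <- pv$calibrated.pvalue[, 1]  # p-values for the current variable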
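Association p-values from latent factor models are commonly calibrated with a genomic inflation factor (GIF) before multiple-testing correction. The base-R sketch below shows that calibration on simulated z-scores. The scores and the inflation of 1.3 are hypothetical, and the lfmm package's internals may differ in detail.

```r
set.seed(1)

# Hypothetical z-scores for 1000 SNPs, deliberately inflated by a factor of 1.3
z <- rnorm(1000) * 1.3

# Genomic inflation factor: median squared score over the chi-square(1) median
gif <- median(z^2) / qchisq(0.5, df = 1)

# Uncalibrated p-values, and p-values after dividing squared scores by the GIF
p_raw <- pchisq(z^2, df = 1, lower.tail = FALSE)
p_cal <- pchisq(z^2 / gif, df = 1, lower.tail = FALSE)

print(gif)
print(c(raw = mean(p_raw < 0.05), calibrated = mean(p_cal < 0.05)))
```

A GIF well above 1 suggests residual confounding (K may be too small), while a GIF well below 1 suggests overcorrection.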

Step 5: Environment-Specific Multiple Testing

Benjamini-Hochberg FDR correction:

    pvals_adj <- p.adjust(pvals, method = "fdr")  # "fdr" is p.adjust's alias for "BH"
    significant_snps <- which(pvals_adj < 0.05)

    # Output results
    cat("Significant SNPs for Temperature:", length(significant_snps), "\n")
    write.csv(data.frame(SNP = significant_snps, p.adj = pvals_adj[significant_snps]),
              "Temperature_associations.csv")
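To show what the Benjamini-Hochberg adjustment does, the sketch below computes it by hand on six made-up p-values and compares the result with p.adjust(). The p-values are hypothetical.

```r
# Hypothetical raw p-values for six SNPs
p <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)
m <- length(p)

# The BH adjusted value for the i-th smallest p-value is p_(i) * m / i,
# made monotone by a running minimum taken from the largest p-value downwards
ord <- order(p)
adj_sorted <- rev(cummin(rev(p[ord] * m / seq_len(m))))
adj_sorted <- pmin(adj_sorted, 1)

# Put the adjusted values back in the original SNP order
p_bh <- numeric(m)
p_bh[ord] <- adj_sorted

print(round(p_bh, 4))
print(which(p_bh < 0.05))
```

The third and fourth SNPs have raw p-values below 0.05 but adjusted values of 0.0615, so only the first two remain significant at an FDR of 0.05.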

Step 6: Visualization

    plot(-log10(pvals_adj),
         main = "Temperature Associations (FDR < 0.05)",
         xlab = "SNP Index",
         ylab = "-log10(adjusted p)",
         col = ifelse(pvals_adj < 0.05, "red", "black"),
         pch = 16)
    abline(h = -log10(0.05), col = "blue", lty = 2)

Other method: Genome-environment association analysis with BayPass