High-Throughput Sequencing Data Preprocessing – Applications of Digital Research Infrastructure (DRI) in Evolutionary Biology

This tutorial provides a step-by-step guide for preprocessing high-throughput sequencing data using tools such as Trimmomatic, BWA-MEM2, Samtools, and Picard. The workflow includes quality trimming of sequencing reads, aligning them to the reference genome, sorting the resulting BAM files, and removing duplicate reads to ensure accurate variant calling.

Required Tools and Files:

Trimmomatic (for quality trimming)
BWA-MEM2 (for read alignment)
Samtools (for BAM file sorting and statistics)
Picard (for duplicate removal)

Required Files:

Reference Genome File: genome.fasta (FASTA format containing the reference genome sequences)
Sequencing Data Files: Paired-end FASTQ files, such as S1_1.fq.gz and S1_2.fq.gz.

Step 1: Quality Trimming with Trimmomatic

java -jar trimmomatic-0.36.jar PE S1_1.fq.gz S1_2.fq.gz --baseout S1_out.fq.gz LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36 -threads 10

#PE: Paired-end sequencing data.

#LEADING:20: Trim bases with quality below 20 from the start of the read.

#TRAILING:20: Trim bases with quality below 20 from the end of the read.

#SLIDINGWINDOW:4:20: Trim when the average quality within a sliding window of 4 bases falls below 20.

#MINLEN:36: Discard reads shorter than 36 bases after trimming.

#-threads 10: Use 10 threads to speed up the processing.

Step 2: Build Genome Index

samtools faidx genome.fasta

bwa-mem2 index genome.fasta

java -jar picard.jar CreateSequenceDictionary R=genome.fastajava -jar picard.jar CreateSequenceDictionary R=genome.fasta

Step 3: Align Reads to the Reference Genome

Once the genome index is built, the next step is to align the sequencing reads to the reference genome using BWA-MEM2. After alignment, we will use Samtools to convert the resulting SAM file to BAM format and sort it.

bwa-mem2 mem -t 2 -R '@RG\tID:S1\tSM:S1\tPL:illumina' genome.fasta S1_out_1P.fq.gz S1_out_2P.fq.gz > S1.sam

samtools view -bS S1.sam > S1.bam

samtools sort -@ 2 -m 1G -o S1.sorted.bam S1.bam

Step 4: Remove Duplicates

Duplicate reads may arise due to PCR amplification and should be removed to avoid bias in variant calling.

java -jar picard.jar MarkDuplicates I=S1.sorted.bam O=S1.sorted.markdup.bam CREATE_INDEX=true REMOVE_DUPLICATES=true M=S1.marked_dup_metrics.txt

Step 5: Quality Control and Statistics

After removing duplicates, it’s important to check the quality of the BAM files. This includes alignment statistics and coverage information.

samtools flagstat S1.sorted.bam > S1.sorted.bam.flagstat

samtools depth S1.sorted.bam > S1.sorted.bam.depth

These steps prepare your data for downstream variant calling and ensure that only high-quality data is used for further analysis.