This tutorial provides a step-by-step guide for preprocessing high-throughput sequencing data using tools such as Trimmomatic, BWA-MEM2, Samtools, and Picard. The workflow includes quality trimming of sequencing reads, aligning them to the reference genome, sorting the resulting BAM files, and removing duplicate reads to ensure accurate variant calling.
Required Tools and Files:
- Trimmomatic (for quality trimming)
- BWA-MEM2 (for read alignment)
- Samtools (for BAM file sorting and statistics)
- Picard (for duplicate removal)
Required Files:
- Reference Genome File: genome.fasta (FASTA format containing the reference genome sequences)
- Sequencing Data Files: Paired-end FASTQ files, such as S1_1.fq.gz and S1_2.fq.gz.
Step 1: Quality Trimming with Trimmomatic
java -jar trimmomatic-0.36.jar PE S1_1.fq.gz S1_2.fq.gz --baseout S1_out.fq.gz LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36 -threads 10
#PE: Paired-end sequencing data.
#LEADING:20: Trim bases with quality below 20 from the start of the read.
#TRAILING:20: Trim bases with quality below 20 from the end of the read.
#SLIDINGWINDOW:4:20: Trim when the average quality within a sliding window of 4 bases falls below 20.
#MINLEN:36: Discard reads shorter than 36 bases after trimming.
#-threads 10: Use 10 threads to speed up the processing.
Step 2: Build Genome Index
samtools faidx genome.fasta
bwa-mem2 index genome.fasta
java -jar picard.jar CreateSequenceDictionary R=genome.fasta
java -jar picard.jar CreateSequenceDictionary R=genome.fasta
Step 3: Align Reads to the Reference Genome
Once the genome index is built, the next step is to align the sequencing reads to the reference genome using BWA-MEM2. After alignment, we will use Samtools to convert the resulting SAM file to BAM format and sort it.
bwa-mem2 mem -t 2 -R '@RG\tID:S1\tSM:S1\tPL:illumina' genome.fasta S1_out_1P.fq.gz S1_out_2P.fq.gz > S1.sam
samtools view -bS S1.sam > S1.bam
samtools sort -@ 2 -m 1G -o S1.sorted.bam S1.bam
Step 4: Remove Duplicates
Duplicate reads may arise due to PCR amplification and should be removed to avoid bias in variant calling.
java -jar picard.jar MarkDuplicates I=S1.sorted.bam O=S1.sorted.markdup.bam CREATE_INDEX=true REMOVE_DUPLICATES=true M=S1.marked_dup_metrics.txt
Step 5: Quality Control and Statistics
After removing duplicates, it’s important to check the quality of the BAM files. This includes alignment statistics and coverage information.
samtools flagstat S1.sorted.bam > S1.sorted.bam.flagstat
samtools depth S1.sorted.bam > S1.sorted.bam.depth
These steps prepare your data for downstream variant calling and ensure that only high-quality data is used for further analysis.
License and Copyright
Copyright (C) 2024 Xingwan Yi, Zhiqin Long.