QC of top off RNAseq files for Montipora capitata larval thermal tolerance 2023 project

This post details QC of the Montipora capitata 2023 larval thermal tolerance project RNAseq files from second delivery of top off sequencing. See my notebook posts and my GitHub repo for information on this project.

Also see my notebook post on sequence download and SRA upload for these files.

1. Run MultiQC on raw data from second sequencing (untrimmed and unfiltered)

cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts

nano qc_raw_second_sequencing.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --output="qc_raw_second_seq-%j.out"
#SBATCH --error="qc_raw_second_seq-%j.err"
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/raw-sequences/second_sequencing

# load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

# fastqc of raw reads
fastqc *.fastq.gz

#generate multiqc report
multiqc ./ --filename multiqc_report_raw_second_sequencing.html 

echo "Raw MultiQC report generated." $(date)

Run the script.

sbatch qc_raw_second_sequencing.sh

Move .md5 files to md5_files folder.

mv ./*.md5 md5_files

Run as Job ID 312074 on 13 April 2024

Move .fastqc files to a QC folder to keep folder organized within the second_sequencing directory.

mkdir raw_fastqc
mv *fastqc.html raw_fastqc
mv *fastqc.zip raw_fastqc
mv multiqc* raw_fastqc

Download the multiQC report to my desktop to view (in a new terminal not logged into Andromeda).

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/raw-sequences/second_sequencing/raw_fastqc/multiqc_report_raw_second_sequencing.html ~/MyProjects/larval_symbiont_TPC/data/rna_seq/QC

Move individual fastqc files to local machine. fastqc_second_sequencing_raw

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/raw-sequences/second_sequencing/raw_fastqc/\*fastqc.html  ~/MyProjects/larval_symbiont_TPC/data/rna_seq/QC/fastqc_second_sequencing_raw

The raw sequence MultiQC report can be found on GitHub here.

Here are the things I noticed from the QC report:

See this previous post with statistics of samples provided by Azenta.

  • Top off sequencing
    • Samples that had the lowest read depth in the first round of sequencing have the highest read depth in this top off sequencing. Azenta sequenced samples to bring all samples to a total of ~18M reads. This is why there are variable read depths in these top off sequences.
Sample Name % Dups % GC M Seqs
R107s_R1_001 53.2% 43% 17.5
R107s_R2_001 50.7% 44% 17.5
R55s_R1_001 34.0% 43% 2.9
R55s_R2_001 32.5% 44% 2.9
R56s_R1_001 29.5% 44% 1.7
R56s_R2_001 27.9% 45% 1.7
R57s_R1_001 37.2% 43% 3.3
R57s_R2_001 35.4% 44% 3.3
R58s_R1_001 30.8% 43% 1.4
R58s_R2_001 29.1% 44% 1.4
R59s_R1_001 56.8% 44% 33.7
R59s_R2_001 54.6% 44% 33.7
R60s_R1_001 33.0% 44% 2.5
R60s_R2_001 31.5% 44% 2.5
R62s_R1_001 34.7% 43% 2.3
R62s_R2_001 32.9% 44% 2.3
R67s_R1_001 48.2% 43% 17.4
R67s_R2_001 46.2% 44% 17.4
R75s_R1_001 46.0% 43% 11.1
R75s_R2_001 44.2% 44% 11.1
R83s_R1_001 52.9% 43% 17.6
R83s_R2_001 49.4% 44% 17.6
R91s_R1_001 54.1% 43% 17.5
R91s_R2_001 52.5% 44% 17.5
R99s_R1_001 51.1% 44% 16.2
R99s_R2_001 46.8% 44% 16.2

Statistics are similar to the values from the original sequencing for these samples.

  • Adapter content present in sequences, as expected, because we have not yet removed adapters.

  • Some samples have warnings for GC content. We will revisit this after trimming.

  • There are a high proportion of overrepresented sequences. This is expected with RNAseq data as there are likely genes that are highly and consistently expressed.

  • All reads are the same length (150 bp).

  • High quality scores.

  • Low N content

2. Trimming adapters

I ran trimming steps using the script from trimming original sequences, detailed in my previous post.

I first moved the .md5 files to an md5 folder to keep only .fastq files in raw-sequences.

Make a new folder for trimmed sequences.

cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/
mkdir trimmed-sequences-second-seq
cd scripts
nano trim_adapters_second_seq.sh

I will use the following settings for trimming in fastp. Fastp documentation can be found here.

  • detect_adapter_for_pe \
    • This enables auto detection of adapters for paired end data
  • qualified_quality_phred 30 \
    • Filters reads based on phred score >=30
  • unqualified_percent_limit 10 \
    • percents of bases are allowed to be unqualified, set here as 10%
  • length_required 100 \
    • Removes small reads shorter than 100 bp. We have read lengths of 150 bp.
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts           
#SBATCH -o adapter-trim-second-seq-%j.out
#SBATCH -e adapter-trim-second-seq-%j.error

# Load modules needed 
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

# Make an array of sequences to trim in raw data directory 

cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/raw-sequences/second_sequencing
array1=($(ls *R1_001.fastq.gz))

echo "Read trimming of adapters started." $(date)

# fastp and fastqc loop 
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --in2 $(echo ${i}|sed s/_R1/_R2/)\
        --out1 /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq/trim.${i} \
        --out2 /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq/trim.$(echo ${i}|sed s/_R1/_R2/) \
        --detect_adapter_for_pe \
        --qualified_quality_phred 30 \
        --unqualified_percent_limit 10 \
        --length_required 100 

done

echo "Read trimming of adapters completed." $(date)
sbatch trim_adapters_second_seq.sh

Job ID 312075
Ran on April 13 2024, completed in about 1.5 hours.

An example output from the error file looks like this:

Detecting adapter sequence for read1...
Illumina TruSeq Adapter Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 17533977
total bases: 2630096550
Q20 bases: 2576383048(97.9577%)
Q30 bases: 2477902823(94.2134%)

Read1 after filtering:
total reads: 11457851
total bases: 1693380571
Q20 bases: 1685110826(99.5116%)
Q30 bases: 1663210624(98.2184%)

Read2 before filtering:
total reads: 17533977
total bases: 2630096550
Q20 bases: 2486634287(94.5454%)
Q30 bases: 2272601196(86.4075%)

Read2 aftering filtering:
total reads: 11457851
total bases: 1693487809
Q20 bases: 1677451996(99.0531%)
Q30 bases: 1639980332(96.8404%)

Filtering result:
reads passed filter: 22915702
reads failed due to low quality: 11923964
reads failed due to too many N: 146
reads failed due to too short: 228142
reads with adapter trimmed: 3211274
bases trimmed due to adapters: 77884301

Duplication rate: 34.8162%

Insert size peak (evaluated by paired-end reads): 186

JSON report: fastp.json
HTML report: fastp.html

Move script out and error files and fastp files to the trimmed sequence folder to keep things organized.

cd raw-sequences/second_sequencing
mv fastp* ../../trimmed-sequences-second-seq/

cd ../../trimmed-sequences-second-seq/
mv fastp.html fastp_second_seq.html

I then downloaded the fastp.html report to look at trimming information.

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq/fastp_second_seq.html ~/MyProjects/larval_symbiont_TPC/data/rna_seq/QC/

This file can be found on GitHub here.

Here are the results:

General statistics

fastp version: 0.19.7 (https://github.com/OpenGene/fastp)
sequencing: paired end (150 cycles + 150 cycles)
mean length before filtering: 150bp, 150bp
mean length after filtering: 147bp, 147bp
duplication rate: 30.411524%
Insert size peak: 169
Detected read1 adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Before filtering

total reads: 32.443162 M
total bases: 4.866474 G
Q20 bases: 4.629020 G (95.120605%)
Q30 bases: 4.274311 G (87.831783%)
GC content: 44.578400%

After filtering

total reads: 18.461102 M
total bases: 2.725187 G
Q20 bases: 2.703615 G (99.208398%)
Q30 bases: 2.651870 G (97.309640%)
GC content: 43.742790%

Filtering results

reads passed filters: 18.461102 M (56.902906%)
reads with low quality: 13.784328 M (42.487622%)
reads with too many N: 72 (0.000222%)
reads too short: 197.660000 K (0.609250%)

These results show that filtering improved quality of reads and removed about 45% of reads due to length (a small proportion) and quality (most of reads removed). Average Q30 bases improved from 87% to 97%. Adapters were removed.

Next, I’ll run another round of fastqc and multiqc to see how this changed or improved our qc results.

3. FastQC and MultiQC on trimmed sequences

Make a script for running QC on trimmed sequences.

cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts

nano qc_trimmed_second_seq.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq           
#SBATCH -o trimmed-qc-second-seq-%j.out
#SBATCH -e trimmed-qc-second-seq-%j.error

# load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

# fastqc of raw reads
fastqc *.fastq.gz 

#generate multiqc report
multiqc ./ --filename multiqc_report_trimmed_second_seq.html 

echo "Trimmed MultiQC report generated." $(date)

Run the script.

sbatch qc_trimmed_second_seq.sh

Job 312079 started on 13 April, ended on 13 April.

Make folders for QC files to go into.

cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq
mkdir trimmed_qc_files 

mv *fastqc.html trimmed_qc_files
mv *fastqc.zip trimmed_qc_files
mv multiqc* trimmed_qc_files

Copy the multiqc html file to my computer.

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq/trimmed_qc_files/multiqc_report_trimmed_second_seq.html ~/MyProjects/larval_symbiont_TPC/data/rna_seq/QC

Here are some of the main results:

Sample Name % Duplication GC content % PF % Adapter % Dups % GC M Seqs
fastp 30.4% 43.7% 56.9% 8.9%      
trim.R107s_R1_001         51.3% 43% 11.5
trim.R107s_R2_001         51.1% 43% 11.5
trim.R55s_R1_001         32.7% 43% 1.9
trim.R55s_R2_001         32.6% 43% 1.9
trim.R56s_R1_001         28.0% 44% 1.1
trim.R56s_R2_001         27.8% 44% 1.1
trim.R57s_R1_001         35.7% 43% 2.2
trim.R57s_R2_001         35.6% 43% 2.2
trim.R58s_R1_001         29.5% 43% 0.9
trim.R58s_R2_001         29.3% 43% 0.9
trim.R59s_R1_001         54.8% 43% 22.1
trim.R59s_R2_001         54.9% 43% 22.1
trim.R60s_R1_001         32.0% 43% 1.7
trim.R60s_R2_001         31.9% 43% 1.7
trim.R62s_R1_001         33.3% 43% 1.5
trim.R62s_R2_001         33.2% 43% 1.5
trim.R67s_R1_001         46.2% 43% 11.3
trim.R67s_R2_001         46.3% 43% 11.3
trim.R75s_R1_001         44.1% 43% 7.3
trim.R75s_R2_001         44.2% 43% 7.3
trim.R83s_R1_001         50.0% 43% 10.7
trim.R83s_R2_001         49.7% 43% 10.7
trim.R91s_R1_001         52.5% 43% 11.6
trim.R91s_R2_001         52.9% 43% 11.6
trim.R99s_R1_001         48.2% 43% 9.2
trim.R99s_R2_001         47.7% 43% 9.2
  • Fastp filtering: most reads filtered were due to low quality

  • Average insert size is 133 bp and within expected range (150 with total overlap to <300 with less overlap).

  • Read depth is highest for samples with lowest read depth in first round of sequencing.

  • Duplication passes QC.

  • GC content passes QC.

  • Quality is very high

  • Low N

No samples found with any adapter contamination > 0.1%.

Samples had less than 1% of reads made up of overrepresented sequences.

Copy the individual fastqc files to my computer.

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences-second-seq/trimmed_qc_files/\*fastqc.html ~/MyProjects/larval_symbiont_TPC/data/rna_seq/QC/fastqc_second_sequencing_trimmed 

Raw FastQC files can be found on GitHub here and trimmed FastQC files can be found on GitHub here.

Written on April 13, 2024