Hawaii 2023 Larval Tolerance project ITS2 Analysis Part 1
This post details the download and QC steps of the ITS2 analysis pipeline for the 2023 Hawaii Larval Thermal Tolerance project.
Overview
The preparation of these samples is detailed in posts for PCR and prep, bead clean up, and preparation for sequencing.
Samples were sequenced with 2 x 300 bp reads on a MiSeq M00763 at the URI RI-INBRE Molecular Informatics Core facility.
The plate maps and metadata were detailed in Jill’s notebook and have been uploaded to the project GitHub here.
We will analyze data using SymPortal. The wiki for SymPortal can be found here. SymPortal will perform all quality control filtering of the sequence data and convert the raw sequence data into database objects. There is no need for pre-filtering or trimming.
1. Move sequence files from storage on Andromeda
Log into Andromeda
cd /data/putnamlab/ashuffmyer
Make a new directory
mkdir hawaii_2023_its2
cd hawaii_2023_its2
mkdir raw-sequences
The directory I want the files to be in is now /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences.
Navigate to where the files are stored on Andromeda and show directory location.
/data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey
Copy files into my directory along with .md5 files from the original download.
cd /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*md5 /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
Now, the file URI_download.md5 is in my directory.
Plate | Well | Sample ID | Sequencing ID | Volume (uL) | Type | Project |
---|---|---|---|---|---|---|
1 | A1 | R55 | HP801 | 20 | ITS2 amplicon | A Huffmyer |
1 | B1 | R56 | HP802 | 20 | ITS2 amplicon | A Huffmyer |
1 | C1 | R57 | HP803 | 20 | ITS2 amplicon | A Huffmyer |
1 | D1 | R58 | HP804 | 20 | ITS2 amplicon | A Huffmyer |
1 | E1 | R59 | HP805 | 20 | ITS2 amplicon | A Huffmyer |
1 | F1 | R60 | HP806 | 20 | ITS2 amplicon | A Huffmyer |
1 | G1 | R61 | HP807 | 20 | ITS2 amplicon | A Huffmyer |
1 | H1 | R62 | HP808 | 20 | ITS2 amplicon | A Huffmyer |
1 | A2 | R63 | HP809 | 20 | ITS2 amplicon | A Huffmyer |
1 | B2 | R64 | HP810 | 20 | ITS2 amplicon | A Huffmyer |
1 | C2 | R65 | HP811 | 20 | ITS2 amplicon | A Huffmyer |
1 | D2 | R66 | HP812 | 20 | ITS2 amplicon | A Huffmyer |
1 | E2 | R67 | HP813 | 20 | ITS2 amplicon | A Huffmyer |
1 | F2 | R68 | HP814 | 20 | ITS2 amplicon | A Huffmyer |
1 | G2 | R69 | HP815 | 20 | ITS2 amplicon | A Huffmyer |
1 | H2 | R70 | HP816 | 20 | ITS2 amplicon | A Huffmyer |
1 | A3 | R71 | HP817 | 20 | ITS2 amplicon | A Huffmyer |
1 | B3 | R72 | HP818 | 20 | ITS2 amplicon | A Huffmyer |
Next, copy all files that have sequencing ID HP801-HP818.
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP801* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP802* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP803* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP804* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP805* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP806* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP807* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP808* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP809* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP810* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP811* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP812* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP813* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP814* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP815* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP816* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP817* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
cp /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey/*HP818* /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
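The 18 copy commands above could also be written as a loop over the sequencing ID range. This is only a sketch: `copy_sample_range` is a hypothetical helper name, and it assumes the IDs are consecutive integers after the `HP` prefix, as they are in the plate map above. It uses the same `*HP<ID>*` pattern as the individual commands.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original workflow):
# copy every file whose name contains HP<first>..HP<last> from src to dest.
copy_sample_range() {
  local src="$1" dest="$2" first="$3" last="$4"
  local i f
  for i in $(seq "$first" "$last"); do
    for f in "$src"/*HP"$i"*; do
      # Skip the literal pattern when no file matches this ID
      [ -e "$f" ] || continue
      cp "$f" "$dest"/
    done
  done
}

# Example call with the paths used in this post:
# copy_sample_range /data/putnamlab/KITT/hputnam/20240603_ITS2_Ashey \
#   /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences 801 818
```

Writing the range once also makes it harder to skip or repeat an ID when typing the commands by hand.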
Generate a new md5 file.
md5sum *.fastq.gz > checkmd5_20240610.md5
md5sum -c checkmd5_20240610.md5
Output was as follows:
HP801_S1_L001_R1_001.fastq.gz: OK
HP801_S1_L001_R2_001.fastq.gz: OK
HP802_S13_L001_R1_001.fastq.gz: OK
HP802_S13_L001_R2_001.fastq.gz: OK
HP803_S25_L001_R1_001.fastq.gz: OK
HP803_S25_L001_R2_001.fastq.gz: OK
HP804_S37_L001_R1_001.fastq.gz: OK
HP804_S37_L001_R2_001.fastq.gz: OK
HP805_S49_L001_R1_001.fastq.gz: OK
HP805_S49_L001_R2_001.fastq.gz: OK
HP806_S61_L001_R1_001.fastq.gz: OK
HP806_S61_L001_R2_001.fastq.gz: OK
HP807_S73_L001_R1_001.fastq.gz: OK
HP807_S73_L001_R2_001.fastq.gz: OK
HP808_S85_L001_R1_001.fastq.gz: OK
HP808_S85_L001_R2_001.fastq.gz: OK
HP809_S2_L001_R1_001.fastq.gz: OK
HP809_S2_L001_R2_001.fastq.gz: OK
HP810_S14_L001_R1_001.fastq.gz: OK
HP810_S14_L001_R2_001.fastq.gz: OK
HP811_S26_L001_R1_001.fastq.gz: OK
HP811_S26_L001_R2_001.fastq.gz: OK
HP812_S38_L001_R1_001.fastq.gz: OK
HP812_S38_L001_R2_001.fastq.gz: OK
HP813_S50_L001_R1_001.fastq.gz: OK
HP813_S50_L001_R2_001.fastq.gz: OK
HP814_S62_L001_R1_001.fastq.gz: OK
HP814_S62_L001_R2_001.fastq.gz: OK
HP815_S74_L001_R1_001.fastq.gz: OK
HP815_S74_L001_R2_001.fastq.gz: OK
HP816_S86_L001_R1_001.fastq.gz: OK
HP816_S86_L001_R2_001.fastq.gz: OK
HP817_S3_L001_R1_001.fastq.gz: OK
HP817_S3_L001_R2_001.fastq.gz: OK
HP818_S15_L001_R1_001.fastq.gz: OK
HP818_S15_L001_R2_001.fastq.gz: OK
I then compared the checksums between the original download and this transfer.
All values match.
| sample | transfer | original | match |
|--------------------------------|-----------------------------------|-----------------------------------|-------|
| HP801_S1_L001_R1_001.fastq.gz | 155fe1769c46c61cb44e941f477f5138 | 155fe1769c46c61cb44e941f477f5138 | TRUE |
| HP801_S1_L001_R2_001.fastq.gz | c6e77ed27b3ce14f3259f3fd5ec68a7a | c6e77ed27b3ce14f3259f3fd5ec68a7a | TRUE |
| HP802_S13_L001_R1_001.fastq.gz | 54540f695e4fcb75e75d17a3812df244 | 54540f695e4fcb75e75d17a3812df244 | TRUE |
| HP802_S13_L001_R2_001.fastq.gz | e48b14bc7a860eaa8a0f87996c228e99 | e48b14bc7a860eaa8a0f87996c228e99 | TRUE |
| HP803_S25_L001_R1_001.fastq.gz | b2733dfae02af0b37fe33da110d5dc94 | b2733dfae02af0b37fe33da110d5dc94 | TRUE |
| HP803_S25_L001_R2_001.fastq.gz | 6e4a04b4f6a34320254e034c07c151e3 | 6e4a04b4f6a34320254e034c07c151e3 | TRUE |
| HP804_S37_L001_R1_001.fastq.gz | eeb8f912a37ff7960fc24ea54f1b5814 | eeb8f912a37ff7960fc24ea54f1b5814 | TRUE |
| HP804_S37_L001_R2_001.fastq.gz | 48e3931dd08218037139794a7fa7946f | 48e3931dd08218037139794a7fa7946f | TRUE |
| HP805_S49_L001_R1_001.fastq.gz | c4eb066550ceaaf319694afcb60c9085 | c4eb066550ceaaf319694afcb60c9085 | TRUE |
| HP805_S49_L001_R2_001.fastq.gz | 21fbe1a83419571f87d4239001f78bed | 21fbe1a83419571f87d4239001f78bed | TRUE |
| HP806_S61_L001_R1_001.fastq.gz | 8b3fc5db863500e5a9d5db1e923dc8df | 8b3fc5db863500e5a9d5db1e923dc8df | TRUE |
| HP806_S61_L001_R2_001.fastq.gz | 53ea67cc508fa18e59cb14a23d5716b5 | 53ea67cc508fa18e59cb14a23d5716b5 | TRUE |
| HP807_S73_L001_R1_001.fastq.gz | f1d5968f749ab5f043c90859d6644cf5 | f1d5968f749ab5f043c90859d6644cf5 | TRUE |
| HP807_S73_L001_R2_001.fastq.gz | 78cc72eaf72c57350f364f1b74c724a4 | 78cc72eaf72c57350f364f1b74c724a4 | TRUE |
| HP808_S85_L001_R1_001.fastq.gz | c358d7ff4fd35369dffdc5c6571e1c3a | c358d7ff4fd35369dffdc5c6571e1c3a | TRUE |
| HP808_S85_L001_R2_001.fastq.gz | ab11a1f0f93585a957e4b44138fb1827 | ab11a1f0f93585a957e4b44138fb1827 | TRUE |
| HP809_S2_L001_R1_001.fastq.gz | 62ae0e1e58c036eb024efeeb0acf833b | 62ae0e1e58c036eb024efeeb0acf833b | TRUE |
| HP809_S2_L001_R2_001.fastq.gz | 799cc7eaef0e7469a578a28c06855ede | 799cc7eaef0e7469a578a28c06855ede | TRUE |
| HP810_S14_L001_R1_001.fastq.gz | a124d8b34d7bf30d2a31fef416fb2810 | a124d8b34d7bf30d2a31fef416fb2810 | TRUE |
| HP810_S14_L001_R2_001.fastq.gz | 25b464be803665601477d3a68d1f27c3 | 25b464be803665601477d3a68d1f27c3 | TRUE |
| HP811_S26_L001_R1_001.fastq.gz | ba32acae384a652aa183cef2d4f48836 | ba32acae384a652aa183cef2d4f48836 | TRUE |
| HP811_S26_L001_R2_001.fastq.gz | 27ab28dcadca2620cb804c668b0fa59a | 27ab28dcadca2620cb804c668b0fa59a | TRUE |
| HP812_S38_L001_R1_001.fastq.gz | 2a3e366c5d5213e6097b8f027ba9a944 | 2a3e366c5d5213e6097b8f027ba9a944 | TRUE |
| HP812_S38_L001_R2_001.fastq.gz | c5d2c4980d2c59b8dda2ee7209439c48 | c5d2c4980d2c59b8dda2ee7209439c48 | TRUE |
| HP813_S50_L001_R1_001.fastq.gz | ee918481267951be50e7cbada194f7e1 | ee918481267951be50e7cbada194f7e1 | TRUE |
| HP813_S50_L001_R2_001.fastq.gz | d8c0aabe67e0745d67e52814a0c5a035 | d8c0aabe67e0745d67e52814a0c5a035 | TRUE |
| HP814_S62_L001_R1_001.fastq.gz | 40e99e87485b650ca25660cbdeb44e87 | 40e99e87485b650ca25660cbdeb44e87 | TRUE |
| HP814_S62_L001_R2_001.fastq.gz | 150603951008ef7d831069cd38d8bc75 | 150603951008ef7d831069cd38d8bc75 | TRUE |
| HP815_S74_L001_R1_001.fastq.gz | d2b55ec408e2eeea3fb9aa93262587fa | d2b55ec408e2eeea3fb9aa93262587fa | TRUE |
| HP815_S74_L001_R2_001.fastq.gz | f790fc8ab4bfc50fa716d53c95c77191 | f790fc8ab4bfc50fa716d53c95c77191 | TRUE |
| HP816_S86_L001_R1_001.fastq.gz | 28b2c59d33942b1795156f70258ce89c | 28b2c59d33942b1795156f70258ce89c | TRUE |
| HP816_S86_L001_R2_001.fastq.gz | 52ae315a36c5afde7cb88192f5976fbb | 52ae315a36c5afde7cb88192f5976fbb | TRUE |
| HP817_S3_L001_R1_001.fastq.gz | da59bc2cb23d998246d397708f045ec6 | da59bc2cb23d998246d397708f045ec6 | TRUE |
| HP817_S3_L001_R2_001.fastq.gz | 441c4daed60182ee7c15c6779fc0e4ab | 441c4daed60182ee7c15c6779fc0e4ab | TRUE |
| HP818_S15_L001_R1_001.fastq.gz | 79c5722cd6e797af0ed840e4e4576fb7 | 79c5722cd6e797af0ed840e4e4576fb7 | TRUE |
| HP818_S15_L001_R2_001.fastq.gz | 0df639fb9362f29484040d51bf4917e3 | 0df639fb9362f29484040d51bf4917e3 | TRUE |
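A comparison like the table above can be produced on the command line rather than by hand. The sketch below is one possible way to do it, not necessarily how the table was made. `compare_md5` is a hypothetical helper, and it assumes both files use the plain `md5sum` text format (`checksum  filename`), with file names written identically in each and containing no spaces.

```shell
#!/bin/bash
# Hypothetical helper: compare checksums in a transfer .md5 file against
# the original .md5 file. Prints: file, transfer checksum, original checksum, match.
compare_md5() {
  local original="$1" transfer="$2"
  awk '
    # First file: remember each original checksum, keyed by file name
    NR == FNR { orig[$2] = $1; next }
    # Second file: look up the original checksum and compare as strings
    {
      found = ($2 in orig)
      flag = (found && (orig[$2] "") == ($1 "")) ? "TRUE" : "FALSE"
      print $2, $1, (found ? orig[$2] : "NA"), flag
    }
  ' "$original" "$transfer"
}

# Example call with the file names used in this post:
# compare_md5 URI_download.md5 checkmd5_20240610.md5
```

Files listed only in the original download (for example, samples from other projects) are ignored, and a transferred file with no entry in the original file is reported as `NA` and `FALSE`.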
Files are now ready for QC.
2. Run FastQC and MultiQC on raw sequence files
Write a script to conduct QC.
cd /data/putnamlab/ashuffmyer/hawaii_2023_its2/
mkdir scripts
cd scripts
nano raw-qc.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --mem=100GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --output="qc_raw-%j.out"
#SBATCH --error="qc_raw-%j.err"
# load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
# fastqc of raw reads
fastqc /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences/*.fastq.gz
#generate multiqc report
multiqc /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences --filename /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences/multiqc_report_raw.html
echo "Raw MultiQC report generated." $(date)
sbatch raw-qc.sh
Job 320310 started at 10:20am on 10 June 2024 and finished after about 10 minutes.
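Once the job has finished, it can be worth confirming that every input file has a FastQC report before copying anything. The sketch below assumes FastQC's default output naming (`<name>_fastqc.html` next to each `<name>.fastq.gz`). `missing_fastqc_reports` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: list the base name of every .fastq.gz file in a
# directory that has no matching <name>_fastqc.html report beside it.
missing_fastqc_reports() {
  local dir="$1" fq base
  for fq in "$dir"/*.fastq.gz; do
    # Skip the literal pattern when the directory has no .fastq.gz files
    [ -e "$fq" ] || continue
    base=$(basename "$fq" .fastq.gz)
    if [ ! -e "$dir/${base}_fastqc.html" ]; then
      echo "$base"
    fi
  done
}

# Example call with the directory used in this post:
# missing_fastqc_reports /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences
```

No output means every file has a report.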
Move the MultiQC file to my computer. This file and all individual QC files are on GitHub here.
scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences/multiqc_report_raw.html ~/MyProjects/larval_symbiont_TPC/data/its2/raw_QC
scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences/\*fastqc.html ~/MyProjects/larval_symbiont_TPC/data/its2/raw_QC
View the results in the MultiQC file.
There is low adapter content, but not zero adapter content.
There are many overrepresented sequences, which we expect from ITS2 in this species. We expect a dominance of few symbiont sequences.
There is a high percentage of N content in the start of the sequences.
There are quality issues, particularly at the start and end of the sequences.
There is a bit of a weird distribution in GC content. We have seen this in past QC of ITS2 data.
There are many sequences with low quality scores.
There are many duplicate reads, as we expect.
There is high duplication of sequences, again as we expect.
There are many short sequences (<60 bp). There are also many sequences at about 300 bp, which is our sequencing length (2x300 bp).
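The read length observation can also be checked directly from a raw file, independently of the MultiQC plot. This is a minimal sketch that assumes standard four-line FASTQ records. `read_length_counts` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: tabulate read lengths in a gzipped FASTQ file.
# Prints "length count" pairs sorted by length.
read_length_counts() {
  gzip -dc "$1" |
    awk 'NR % 4 == 2 { n[length($0)]++ }      # sequence is line 2 of each record
         END { for (len in n) print len, n[len] }' |
    sort -n
}

# Example call with one of the files from this post:
# read_length_counts /data/putnamlab/ashuffmyer/hawaii_2023_its2/raw-sequences/HP801_S1_L001_R1_001.fastq.gz
```

The output shows how many reads fall at each length, so the reads below 60 bp and those near 300 bp can be counted for any one sample.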
We will next work on SymPortal submission of these raw files.