COTS Project Bioinformatic Analysis
This post details scripts for bioinformatics processing of COTS Project RNAseq data.
Lucy’s RNAseq COTS data bioinformatics
Set up
Made a folder in Unity at the following location:
/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman
Made a raw_data
, scripts
, and trimmed_data
folder.
Sym link data
Sym link the raw data to the new location.
ln -s /project/pi_hputnam_uri_edu/20250107_COTS_LG/*fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data
Run multiQC on raw data
Create a script
cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts
nano raw_qc.sh
Make a script
#!/bin/bash
#SBATCH --job-name=fastqc_raw
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH --time=24:00:00 # Job time limit
#SBATCH -o slurm-fastqc_raw.out # %j = job ID
#SBATCH -e slurm-fastqc_raw.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data
#load modules
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b
#run fastqc on raw data
fastqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/*.fastq.gz -o /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_multiqc/
#generate multiqc report
multiqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_multiqc/ --filename multiqc_report_raw.html
echo "Initial QC of raw seq data complete." $(date)
Run the job
sbatch raw_qc.sh
Started at 15:00 on 2/20/2025
Completed after approx. 24 hours.
Transfer raw multiQC report to desktop to view.
The file can be viewed on GitHub here.
In window outside Unity (logged onto my own computer):
scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/multiqc_report_raw.html /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/seq_qc
We have some adapter content and lots of variation in GC histograms.
Trim raw sequences
I made a folder called trimmed_data
for the trimmed sequence files to go into.
I will use the following settings for trimming in fastp
. Fastp documentation can be found here.
detect_adapter_for_pe \
- This enables auto detection of adapters for paired end data
qualified_quality_phred 30 \
- Filters reads based on phred score >=30
unqualified_percent_limit 10 \
- percents of bases are allowed to be unqualified, set here as 10%
length_required 100 \
- Removes reads shorter than 100 bp. We have read lengths of all 150 bp, so this is not a stringent filtration.
cut_right cut_right_window_size 5 cut_right_mean_quality 20
- Jill has used this sliding cut window in her script. I am going to leave it out for our trimming and we can re-evaluate the QC to see if we need to implement cutting.
Write a script.
cd scripts
nano trimming.sh
Search for fastp module.
module --show_hidden spider fastp
Fastp module is fastp/0.23.2-GCC-11.2.0
on Unity.
#!/bin/bash
#SBATCH --job-name=trimming
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH --time=10:00:00 # Job time limit
#SBATCH -o slurm-trimming.out # %j = job ID
#SBATCH -e slurm-trimming.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data
#load modules
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b
module load fastp/0.23.2-GCC-11.2.0
# Make an array of sequences to trim in raw data directory
cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/
array1=($(ls *R1_001.fastq.gz))
echo "Read trimming of adapters started." $(date)
# fastp and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_R1/_R2/)\
--out1 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.${i} \
--out2 //work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.$(echo ${i}|sed s/_R1/_R2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--unqualified_percent_limit 10 \
--length_required 100
done
echo "Read trimming of adapters completed." $(date)
Run the script.
sbatch trimming.sh
Job started at 14:00 on 22 February 2025.
Job completed at 21:00 on 22 Feb 2025 (~7 hours).
MultiQC on trimmed sequences
Run FastQC and generate a MultiQC report on the trimmed sequences.
Make a script.
mkdir trimmed_multiqc
cd scripts
nano trimmed_qc.sh
#!/bin/bash
#SBATCH --job-name=trimmed_qc
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH --time=24:00:00 # Job time limit
#SBATCH -o slurm-trim_qc.out # %j = job ID
#SBATCH -e slurm-trim_qc.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc
#load modules
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b
#run fastqc on trimmed data
fastqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/*.fastq.gz -o /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/
#generate multiqc report
multiqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/ --filename multiqc_report_trimmed.html
echo "Initial QC of trimmed seq data complete." $(date)
Run the script.
sbatch trimmed_qc.sh
Job started at 11:15 on 23 Feb 2025, ended at about 24 h.
Copy file to computer.
scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/multiqc_report_trimmed.html /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/seq_qc
Adapters are removed, quality scores are high. QC content is weird, but likely due to symbiont and microbial reads seen in Hollie’s tests of alignment previously. We will see what happens at the alignment steps.
The file can be viewed on GitHub here.
Make species specific lists of samples
Make a list of Acropora and Porites files.
nano acr-list.txt
497R_R1_001.fastq.gz
568R_R1_001.fastq.gz
410R_R1_001.fastq.gz
549R_R1_001.fastq.gz
414R_R1_001.fastq.gz
321RA_R1_001.fastq.gz
581R_R1_001.fastq.gz
370R_R1_001.fastq.gz
419R_R1_001.fastq.gz
331RA_R1_001.fastq.gz
336R_R1_001.fastq.gz
512R_R1_001.fastq.gz
380R_R1_001.fastq.gz
571R_R1_001.fastq.gz
586R_R1_001.fastq.gz
468R_R1_001.fastq.gz
497R_R2_001.fastq.gz
568R_R2_001.fastq.gz
410R_R2_001.fastq.gz
549R_R2_001.fastq.gz
414R_R2_001.fastq.gz
321RA_R2_001.fastq.gz
581R_R2_001.fastq.gz
370R_R2_001.fastq.gz
419R_R2_001.fastq.gz
331RA_R2_001.fastq.gz
336R_R2_001.fastq.gz
512R_R2_001.fastq.gz
380R_R2_001.fastq.gz
571R_R2_001.fastq.gz
586R_R2_001.fastq.gz
468R_R2_001.fastq.gz
nano por-list.txt
225R_R1_001.fastq.gz
235R_R1_001.fastq.gz
43R_R1_001.fastq.gz
211R_R1_001.fastq.gz
218R_R1_001.fastq.gz
34R_R1_001.fastq.gz
227R_R1_001.fastq.gz
16R_R1_001.fastq.gz
61R_R1_001.fastq.gz
86R_R1_001.fastq.gz
236R_R1_001.fastq.gz
244R_R1_001.fastq.gz
82R_R1_001.fastq.gz
71R_R1_001.fastq.gz
253R_R1_001.fastq.gz
76R_R1_001.fastq.gz
225R_R2_001.fastq.gz
235R_R2_001.fastq.gz
43R_R2_001.fastq.gz
211R_R2_001.fastq.gz
218R_R2_001.fastq.gz
34R_R2_001.fastq.gz
227R_R2_001.fastq.gz
16R_R2_001.fastq.gz
61R_R2_001.fastq.gz
86R_R2_001.fastq.gz
236R_R2_001.fastq.gz
244R_R2_001.fastq.gz
82R_R2_001.fastq.gz
71R_R2_001.fastq.gz
253R_R2_001.fastq.gz
76R_R2_001.fastq.gz
Make a folder for Acropora and Porites files.
mkdir acr
mkdir por
Sym link Acropora files to the acr
folder.
while IFS= read -r filename; do
trimmed_file="trimmed_data/trim.$filename"
if [[ -f "$trimmed_file" ]]; then
ln -s "$(realpath "$trimmed_file")" "acr/trim.$filename"
else
echo "Warning: $trimmed_file not found" >&2
fi
done < acr-list.txt
Sym link Porites files to the por
folder.
while IFS= read -r filename; do
trimmed_file="trimmed_data/trim.$filename"
if [[ -f "$trimmed_file" ]]; then
ln -s "$(realpath "$trimmed_file")" "por/trim.$filename"
else
echo "Warning: $trimmed_file not found" >&2
fi
done < por-list.txt
Alignment to genomes
Index genomes
Copy genome fasta files to my directory to prevent permission issues.
mkdir refs
mkdir por
mkdir acr
cp /work/pi_hputnam_uri_edu/refs/Plutea_genome/plut_final_2.1.fasta /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por
cp /work/pi_hputnam_uri_edu/refs/Plutea_genome/plut2v1.1.genes.gff3 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por
cp /work/pi_hputnam_uri_edu/refs/Ahyacinthus_genome/Ahyacinthus_genome_V1/Ahyacinthus.chrsV1.fasta /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr
cp /work/pi_hputnam_uri_edu/refs/Ahyacinthus_genome/Ahyacinthus_genome_V1/Ahyacinthus.coding.gff3 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr
Index Porites genome
cd scripts
nano por-genome.sh
#!/bin/bash
#SBATCH --job-name=STAR_genome_index_Plutea
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=100G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH --time=24:00:00 # Job time limit
#SBATCH -o slurm-por-genome.out # %j = job ID
#SBATCH -e slurm-por-genome.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por
#load modules
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0
STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir Plutea_index \
--genomeFastaFiles /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/plut_final_2.1.fasta \
--sjdbGTFfile /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/plut2v1.1.genes.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99 \
--genomeSAindexNbases 13
sbatch por-genome.sh
Index Acropora genome
cd scripts
nano acr-genome.sh
#!/bin/bash
#SBATCH --job-name=STAR_genome_index_Ahya
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=100G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH --time=24:00:00 # Job time limit
#SBATCH -o slurm-acr-genome.out # %j = job ID
#SBATCH -e slurm-acr-genome.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0
STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir Ahya_index \
--genomeFastaFiles /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.chrsV1.fasta \
--sjdbGTFfile /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99
sbatch acr-genome.sh
Jobs started at 12:00 24 February 2025. Jobs finished after about 10-30 min.
Acropora
Start a scratch directory to temporarily store the output .bam files. Call the directory cots
and use for 30 days. It was created on Feb 25th, so we should be prepared to extend or move files we need to keep by March 25th ish.
ws_allocate cots 30
Info: creating workspace.
/scratch3/workspace/ashuffmyer_uri_edu-cots
remaining extensions : 5
remaining time in days: 30
Align all ACR files in the relevant folder to generated index files.
cd scripts
nano acr-align.sh
#!/bin/bash
#SBATCH --job-name=acr-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-acr-align.out # %j = job ID
#SBATCH -e slurm-acr-align.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/
#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0
echo "Starting read alignment." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/*_R1_001.fastq.gz; do
# Define the corresponding R2 file by replacing _R1_ with _R2_
r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
# Define output prefix
output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/acr/$(basename "${i%_R1_001.fastq.gz}")_"
# Run STAR alignment
STAR --runMode alignReads \
--genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahya_index \
--runThreadN 10 \
--readFilesCommand zcat \
--readFilesIn "$i" "$r2_file" \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard \
--genomeSAindexNbases 13 \
--outFileNamePrefix "$output_prefix"
done
echo "Alignment of Trimmed Seq data complete." $(date)
sbatch acr-align.sh
This script outputs files in the scratch acr
directory that I made.
Porites
Align all POR files in the relevant folder using generated genome index.
cd scripts
nano por-align.sh
#!/bin/bash
#SBATCH --job-name=por-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-por-align.out # %j = job ID
#SBATCH -e slurm-por-align.err # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/
#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0
echo "Starting read alignment." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/*_R1_001.fastq.gz; do
# Define the corresponding R2 file by replacing _R1_ with _R2_
r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
# Define output prefix
output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/por/$(basename "${i%_R1_001.fastq.gz}")_"
# Run STAR alignment
STAR --runMode alignReads \
--genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/Plutea_index \
--runThreadN 10 \
--readFilesCommand zcat \
--readFilesIn "$i" "$r2_file" \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard \
--outFileNamePrefix "$output_prefix"
done
echo "Alignment of Trimmed Seq data complete." $(date)
sbatch por-align
This script outputs files in the scratch acr
directory that I made.
Both jobs started on 27 Feb 2025 at 07:00.
Look at mapping results
Obtain mapping percentages after job was completed. Use the samtools flagstat
function that does a full pass through the input file to calculate and print statistics to stdout. Provides counts for each of 13 categories based primarily on bit flags in the FLAG field. Each category in the output is broken down into QC pass and QC fail, which is presented as “#PASS + #FAIL” followed by a description of the category.
Porites
interactive
cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/
module load SAMtools/1.9-foss-2018b
for i in *.bam; do
echo "${i}" >> mapped_reads_counts
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts
done
Acropora
interactive
cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/
module load SAMtools/1.9-foss-2018b
for i in *.bam; do
echo "${i}" >> mapped_reads_counts
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts
done