E5 Deep Dive RNAseq Count Matrix Analysis

This post details analysis and bioinformatic steps to generate a gene count matrix from RNAseq data for the E5 Deep Dive Project.

More information on this project can be found on the GitHub repo.

1. Obtain RNAseq files from NCBI SRA

Obtain SRR numbers

First, we needed to obtain RNAseq files from NCBI from the project PRJNA744403. I first obtained a list of SRR numbers that identify the specific sequence files for RNAseq. I did this by going to all SRA links for the project and selecting all the RNAseq files (32 total). In the upper right hand corner, I selected “Send to” and “File”. This outputs a .txt file with a list of all SRR numbers.

Download RNAseq files

I then logged into Andromeda and created a folder for these files at /data/putnamlab/ashuffmyer/e5-deepdive with raw, scripts, and refs folders. Jill Ashey is also working on this project.

I then copied the SRR identifiers from the .txt file produced above into a file in the raw folder using nano SraAccList.txt and pasting the identifiers. The file looks like this:

SRR15101688
SRR15101689
SRR15101690
SRR15101691
SRR15101692
SRR15101693
SRR15101694
SRR15101695
SRR15101696
SRR15101697
SRR15101699
SRR15101700
SRR15101701
SRR15101702
SRR15101703
SRR15101704
SRR15101705
SRR15101706
SRR15101707
SRR15101708
SRR15101710
SRR15101711
SRR15101712
SRR15101713
SRR15101715
SRR15101718
SRR15101719
SRR15101720
SRR15101721
SRR15101722
SRR15101723
SRR15101724

Next, I downloaded the files using the NCBI SRA Toolkit which is installed on Andromeda as module SRA-Toolkit/2.10.9-gompi-2020b.

With the help of Sam and Steven in the Roberts Lab, I wrote a script to download the files.

nano download_sra.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab                  
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file

module load SRA-Toolkit/2.10.9-gompi-2020b

prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder  

# Enable recursive globbing
shopt -s globstar

# Run fasterq-dump on any SRA file in any directory and split into read 1 and 2 files and put in raw folder 
for file in ../raw/**/*.sra
do
  fasterq-dump --outdir ../raw --split-files "${file}"
done

#Remove the SRR directories that are no longer needed
rm -r ../raw/SRR*/

#zip all files 
gzip ../raw/*.fastq

#Generate checksums 
md5sum  ../raw/*.fastq.gz > ../raw/md5.original.download.20230315

sbatch download_sra.sh

This script uses prefetch to retrieve the data files for each SRR number and stores them in the raw data folder with a directory for each file.

The fasterq-dump command then takes the .sra file from each directory and converts them to read 1 and read 2 fastq files.

Finally, the SRR directories are removed that are no longer needed and we generate md5 checksums.

This script can be used for any SRA download with a custom list of SRR numbers.

Download reference files

The genome is stored on NCBI.

We downloaded the reference files from the Reef Genomics Database.

cd refs

#genome scaffolds
wget http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.fasta.gz

#gene models CDS
wget http://pver.reefgenomics.org/download/Pver_genes_names_v1.0.fna.gz

#gene models proteins
wget http://pver.reefgenomics.org/download/Pver_proteins_names_v1.0.faa.gz

#gene models GFF
wget http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.gff3.gz

#Generate checksums 
md5sum  *.gz > md5.original.refs.download.20230315

Finally, I updated permissions after all files were downloaded so Jill can collaborate on this project.

chmod u=rwx,g=rwx,o=rwx,a=rwx -R e5-deepdive

Additional steps will be added to this post as we move forward.

2. QC using FastQC and MultiQC

Make directories for fastqc data

mkdir fastqc_raw

Write script for raw QC

nano fastqc_raw.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts              
#SBATCH --error="fastqc_raw_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_raw_output" #once your job is completed, any final job report comments will be put in this file

# Load modules needed 
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

echo "Start raw QC" $(date)

cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq

for file in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/*fastq.gz
do 
fastqc $file --outdir /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_raw
done

echo "End raw QC" $(date)

# Compile MultiQC report from fastQC files 
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_raw
multiqc --interactive ./

echo "Cleaned MultiQC report generated." $(date)

Submitted batch job 243914. Script ran in ~4.5 hours.

Raw sequence MultiQC summary:

The raw MultiQC report can be found on the E5 Deep Dive GitHub repo here.

There was high adapter content in these raw sequences. Overall, the quality score was high and there were no red flags in other metrics.

fastqc_sequence_counts_plot
fastqc_per_base_sequence_quality_plot
fastqc_per_sequence_quality_scores_plot
fastqc_per_sequence_gc_content_plot
fastqc_per_base_n_content_plot
fastqc_sequence_duplication_levels_plot
fastqc_overrepresented_sequencesi_plot
fastqc_adapter_content_plot

3. Trimming sequences with FastP

Make a script to trim sequences with FastP. We will be removing adapters, trimming poly-g, and trimming by quality phred score with a cut off of 30.

nano fastp.sh

#!/bin/bash

#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts              
#SBATCH --error="fastp_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastp_output" #once your job is completed, any final job report comments will be put in this file

# Load modules needed 
module load fastp/0.19.7-foss-2018b

echo "Start trimming with fastp" $(date)

# Make array of sequences to trim 
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw
array1=($(ls *1.fastq.gz))

# fastq and fastqc loop
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --in2 $(echo ${i}|sed s/_1/_2/)\
        --out1 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim.${i} \
        --out2 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim.$(echo ${i}|sed s/_1/_2/) \        
        --detect_adapter_for_pe \
        --qualified_quality_phred 30 \
        --trim_poly_g 
done

echo "Read trimming of adapters complete." $(date)

Submitted batch job 244006. Took about 8 hrs to run

4. QC trimmed sequences

Now we will QC the trimmed sequences. Make a script.

nano fastqc_trim.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts              
#SBATCH --error="fastqc_trim_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_trim_output" #once your job is completed, any final job report comments will be put in this file

# Load modules needed 
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2

echo "Start trimmed QC" $(date)

cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq

for file in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*fastq.gz
do 
fastqc $file --outdir /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_trim
done

echo "End trimmed QC" $(date)

# Compile MultiQC report from fastQC files 
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_trim
multiqc --interactive ./

echo "Trimmed MultiQC report generated." $(date)

Submitted batch job 244018.

Trimmed sequence MultiQC summary:

The trimmed MultiQC report can be found on the E5 Deep Dive GitHub repo here.

Trimming removed the adapter content in these sequences. The quality scores are very high across the sequence length, so we are not going to trim by length in this iteration. We can return to the trimming step in the future if we need to make changes.

fastqc_adapter_content_plot fastqc_sequence_duplication_levels_plot fastqc_sequence_length_distribution_plot fastqc_per_base_n_content_plot fastqc_per_sequence_gc_content_plot fastqc_per_sequence_quality_scores_plot fastqc_per_base_sequence_quality_plot fastqc_sequence_counts_plot

Calculate sequence length

We also ran a script to calculate sequence length of raw and trimmed sequences.

Make a script.

nano length.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts              
#SBATCH --error="seq_length_error" #if your job fails, the error report will be put in this file
#SBATCH --output="seq_length_output" #once your job is completed, any final job report comments will be put in this file

echo "Start sequence length counts" $(date)

cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq

zgrep -c "@SRR" /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/*.gz > /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/raw_seq_counts.txt

zgrep -c "@SRR" /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*.gz > /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim_seq_counts.txt

echo "End sequence length counts" $(date)

Submitted batch job 244020.

The results are below. There are about 20-25 million sequences in each file.

/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101688_1.fastq.gz:23396439
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101688_2.fastq.gz:23396439
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101689_1.fastq.gz:24990687
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101689_2.fastq.gz:24990687
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101690_1.fastq.gz:23267238
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101690_2.fastq.gz:23267238
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101691_1.fastq.gz:29417106
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101691_2.fastq.gz:29417106
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101692_1.fastq.gz:22578178
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101692_2.fastq.gz:22578178
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101693_1.fastq.gz:20905338
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101693_2.fastq.gz:20905338
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101694_1.fastq.gz:22990550
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101694_2.fastq.gz:22990550
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101695_1.fastq.gz:16484060
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101695_2.fastq.gz:16484060
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101696_1.fastq.gz:24030431
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101696_2.fastq.gz:24030431
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101697_1.fastq.gz:24846067
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101697_2.fastq.gz:24846067
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101699_1.fastq.gz:22733345
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101699_2.fastq.gz:22733345
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101700_1.fastq.gz:25158606
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101700_2.fastq.gz:25158606
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101701_1.fastq.gz:29523059
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101701_2.fastq.gz:29523059
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101702_1.fastq.gz:23796710
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101702_2.fastq.gz:23796710
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101703_1.fastq.gz:22630044
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101703_2.fastq.gz:22630044
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101704_1.fastq.gz:24455094
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101704_2.fastq.gz:24455094
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101705_1.fastq.gz:27805087
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101705_2.fastq.gz:27805087
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101706_1.fastq.gz:18568153
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101706_2.fastq.gz:18568153
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101707_1.fastq.gz:29131006
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101707_2.fastq.gz:29131006
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101708_1.fastq.gz:25068848
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101708_2.fastq.gz:25068848
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101710_1.fastq.gz:23675842
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101710_2.fastq.gz:23675842
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101711_1.fastq.gz:24020166
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101711_2.fastq.gz:24020166
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101712_1.fastq.gz:24937261
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101712_2.fastq.gz:24937261
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101713_1.fastq.gz:16381619
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101713_2.fastq.gz:16381619
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101715_1.fastq.gz:21751368
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101715_2.fastq.gz:21751368
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101718_1.fastq.gz:27076800
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101718_2.fastq.gz:27076800
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101719_1.fastq.gz:24642131
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101719_2.fastq.gz:24642131
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101720_1.fastq.gz:26361873
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101720_2.fastq.gz:26361873
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101721_1.fastq.gz:17900262
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101721_2.fastq.gz:17900262
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101722_1.fastq.gz:25641209
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101722_2.fastq.gz:25641209
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101723_1.fastq.gz:23770610
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101723_2.fastq.gz:23770610
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101724_1.fastq.gz:25368402
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101724_2.fastq.gz:25368402

5. Align sequences to the reference genome

We are using HISAT2 and samtools to build the reference genome and align our sequences to this reference. See Step 1 above for reference genome information.

Build reference genome

Create a script.

nano build.sh

#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts              
#SBATCH --error="build_error" #if your job fails, the error report will be put in this file
#SBATCH --output="build_output" #once your job is completed, any final job report comments will be put in this file

# load modules needed
module load HISAT2/2.2.1-gompi-2021b #Alignment to reference genome: HISAT2

# Unzip reference genome 
#gunzip /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1.0.fasta.gz

# Index reference genome 
hisat2-build -f /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1.0.fasta Pver_ref
echo "Reference genome indexed." $(date)

This built the reference genome we need to do the alignment.

Align sequences

Create script.

nano align.sh

#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab  
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications        
#SBATCH --error="align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="align_output" #once your job is completed, any final job report comments will be put in this file

# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools

## Genome already referenced 

echo "Start alignment" $(date)

# Alignment of clean reads to the reference genome
# Make array of sequences to trim 
array1=($(ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*_1.fastq.gz | xargs -n 1 basename))

for i in ${array1[@]}; do
    hisat2 -p 8 --rna-strandness RF --dta -q -x /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_ref -1 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/${i} -2 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/$(echo ${i}|sed s/_1/_2/) -S /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
    samtools sort -@ 8 -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.bam /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
    echo "${i} bam-ified!"
    rm /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
done

echo "Alignment completed" $(date)

This script will align the sequences to the references, generate .bam and .sam files, and then remove the unneeded and large .sam files.

This was started on 20230323 at 09:45 Pacific Time job 244260.

Alignment rates range from 69-75% as expected.

Calculate the alignment rates

Next, run a script to output mapping and alignment statistics.

nano mapping_rate.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab  
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications        
#SBATCH --error="mapping_rate_error" #if your job fails, the error report will be put in this file
#SBATCH --output="mapping_rate_output" #once your job is completed, any final job report comments will be put in this file

module load SAMtools/1.9-foss-2018b 

echo "Start mapping rate calculations from .bam files" $(date)

for i in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/*.bam; do
    echo "${i}" >> /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/mapped_reads_counts_Pverr
    samtools flagstat ${i} | grep "mapped (" >> /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/mapped_reads_counts_Pverr
done

echo "Mapping rates completed" $(date)

6. Assemble and map sequences to the reference

Fix GFF file formats

Before assembling reads, we must fix the GFF format to include transcript_id= and gene_id= in the information column using an R script. The R script is below and available on GitHub here.

#This script add transcript and gene id into GFF file for alignment.  

#Here, I'll be adding transcript_id= and gene_id= to 'gene' column in order to properly assemble our aligned data  

#Load libraries and data. 

#Load libraries
library(tidyverse)
library(R.utils)

#Load gene gff file 

gff <- read.csv(file = "~/Desktop/GFFs/pverr/Pver_genome_assembly_v1.0.gff3", header = F, sep = "\t", skip = 1)

#Rename columns 

colnames(gff) <- c("scaffold", "Gene.Predict", "id", "gene.start","gene.stop", "pos1", "pos2","pos3", "gene")

# Remove all rows with "#" character - this gff has # denoting protein sequences. Since we don't need the protein sequences right now, I'm going to remove them 

gff <- gff[!grepl("#", gff$scaffold),]

#Create transcript ID

gff$transcript_id <- sub(";.*", "", gff$gene)
gff$transcript_id <- gsub("ID=", "", gff$transcript_id) #remove ID= 
head(gff)

#Create Parent ID

gff$parent_id <- sub(".*Parent=", "", gff$gene)
gff$parent_id <- sub(";.*", "", gff$parent_id)
gff$parent_id <- gsub("ID=", "", gff$parent_id) #remove ID= 
head(gff)

#Add these values back into the gene column separated by semicolons

gff <- gff %>% 
  mutate(gene = ifelse(id != "gene", paste0(gene, ";transcript_id=", gff$transcript_id, ";gene_id=", gff$parent_id),  paste0(gene)))
head(gff)

## if this gff does not work in the stringtie script, go back and edit so that transcript_id=Pver_g1.t2 instead of transcript_id=Pver_g1.t2.utr5p1 or transcript_id=Pver_g1.t2.exon1, etc

#Remove parent and transcript columns

gff <- gff %>%
  select(!transcript_id)%>%
  select(!parent_id)

#Save file 

write.table(gff, file = "~/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3", sep="\t", col.names = FALSE, row.names=FALSE, quote=FALSE)

Zip file and upload to HPC for bioinformatic use

gzip /Users/jillashey/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3

scp /Users/jillashey/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3.gz jillashey@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs

Now we can assemble the reads via stringtie!

Assemble reads with StringTie

Prepare the directories.

cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs
gunzip Pver_genome_assembly_v1_fixed.gff3
cd ../
mkdir assembled
cd scripts 

Make a new script.

nano assemble.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab  
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications        
#SBATCH --error="assemble_error" #if your job fails, the error report will be put in this file
#SBATCH --output="assemble_output" #once your job is completed, any final job report comments will be put in this file

module load StringTie/2.2.1-GCC-11.2.0

echo "StringTie assembly start" $(date)

array=($(ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/*.bam)) #Make an array of sequences to assemble

for i in ${array[@]}; do
  stringtie -p 8 -e -B -G /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -A /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/${i}.gene_abund.tab -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/${i}.gtf ${i}
  echo "StringTie assembly for seq file ${i}" $(date)
done

echo "StringTie assembly complete, starting assembly analysis" $(date)

We ran this script and it successfully generated .gtf files as expected. However, there was an error that we have not seen before. We will look at the data and see if it caused an issue in file formats. Preliminary investigations do not show any issues with the files.

head assemble_error 

Warning: invalid start coordinate at line:
3_prime_partial true			NA	NA				;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
5_prime_partial true			NA	NA				;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
3_prime_partial true			NA	NA				;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
5_prime_partial true			NA	NA				;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
3_prime_partial true			NA	NA				;transcript_id=;gene_id=

Merge GTF files

nano merge.sh

#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab  
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications        
#SBATCH --error="merge_error" #if your job fails, the error report will be put in this file
#SBATCH --output="merge_output" #once your job is completed, any final job report comments will be put in this file

# Load modules 
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0 
module load Python/3.9.6-GCCcore-11.2.0

echo "StringTie merge start" $(date)

# Make list of gtfs
ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/*.gtf > gtf_list.txt 

# Merge gtfs
stringtie --merge -e -p 8 -G /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_merged.gtf gtf_list.txt #Merge GTFs 
echo "Stringtie merge complete" $(date)

# Compare gtfs and compute accuracy 
gffcompare -r /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -G -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/merged /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_merged.gtf
echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)

## Note from ZD code: 
#Note: the merged part is actually redundant and unnecessary unless we perform the original stringtie step without the -e function and perform
#re-estimation with -e after stringtie --merge, but will redo the pipeline later and confirm that I get equal results.

Assembly complete!

7. Generate gene count matrix with PrepDE

Copy the prepDE.py into the scripts folder from recent project. This file can be directly downloaded from StringTie GitHub here.

cp /data/putnamlab/ashuffmyer/pairs-rnaseq/prepDE.py /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts  

Compile the gene count matrix by first generating a list of files and then applying the prepDE.py script. This was run in an interactive session.

cd assembled/ 

for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt

interactive

module load Python/2.7.15-foss-2018b

python prepDE.py -g Pverr_gene_count_matrix.csv -i ./listGTF.txt

Note that we had an error running this script with Python v3. We instead loaded a previous Python v2. This may not be an issue with more recent verions of the prepDE.py script.

Finally, add gene count matrix to E5 repository on GitHub. It is available at this link here.

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_gene_count_matrix.csv ~/MyProjects/E5/deep-dive/A-Pver/data/rna-seq

The next step will be to assemble metadata and run preliminary analyses.

Written on March 15, 2023