E5 Deep Dive RNAseq Count Matrix Analysis
This post details analysis and bioinformatic steps to generate a gene count matrix from RNAseq data for the E5 Deep Dive Project.
More information on this project can be found on the GitHub repo.
1. Obtain RNAseq files from NCBI SRA
Obtain SRR numbers
First, we needed to obtain RNAseq files from NCBI from the project PRJNA744403. I first obtained a list of SRR numbers that identify the specific sequence files for RNAseq. I did this by going to all SRA links for the project and selecting all the RNAseq files (32 total). In the upper right hand corner, I selected “Send to” and “File”. This outputs a .txt file with a list of all SRR numbers.
Download RNAseq files
I then logged into Andromeda and created a folder for these files at /data/putnamlab/ashuffmyer/e5-deepdive
with raw
, scripts
, and refs
folders. Jill Ashey is also working on this project.
I then copied the SRR identifiers from the .txt file produced above into a file in the raw
folder using nano SraAccList.txt
and pasting the identifiers. The file looks like this:
SRR15101688
SRR15101689
SRR15101690
SRR15101691
SRR15101692
SRR15101693
SRR15101694
SRR15101695
SRR15101696
SRR15101697
SRR15101699
SRR15101700
SRR15101701
SRR15101702
SRR15101703
SRR15101704
SRR15101705
SRR15101706
SRR15101707
SRR15101708
SRR15101710
SRR15101711
SRR15101712
SRR15101713
SRR15101715
SRR15101718
SRR15101719
SRR15101720
SRR15101721
SRR15101722
SRR15101723
SRR15101724
Next, I downloaded the files using the NCBI SRA Toolkit which is installed on Andromeda as module SRA-Toolkit/2.10.9-gompi-2020b
.
With the help of Sam and Steven in the Roberts Lab, I wrote a script to download the files.
nano download_sra.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --error="download_sra_error" #if your job fails, the error report will be put in this file
#SBATCH --output="download_sra_output" #once your job is completed, any final job report comments will be put in this file
module load SRA-Toolkit/2.10.9-gompi-2020b
prefetch --option-file ../raw/SraAccList.txt -O ../raw #this creates a folder for each SRR in the .txt list and outputs in the raw data folder
# Enable recursive globbing
shopt -s globstar
# Run fasterq-dump on any SRA file in any directory and split into read 1 and 2 files and put in raw folder
for file in ../raw/**/*.sra
do
fasterq-dump --outdir ../raw --split-files "${file}"
done
#Remove the SRR directories that are no longer needed
rm -r ../raw/SRR*/
#zip all files
gzip ../raw/*.fastq
#Generate checksums
md5sum ../raw/*.fastq.gz > ../raw/md5.original.download.20230315
sbatch download_sra.sh
This script uses prefetch
to retrieve the data files for each SRR number and stores them in the raw
data folder with a directory for each file.
The fasterq-dump
command then takes the .sra
file from each directory and converts them to read 1 and read 2 fastq files.
Finally, the SRR directories are removed that are no longer needed and we generate md5 checksums.
This script can be used for any SRA download with a custom list of SRR numbers.
Download reference files
The genome is stored on NCBI.
We downloaded the reference files from the Reef Genomics Database.
cd refs
#genome scaffolds
wget http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.fasta.gz
#gene models CDS
wget http://pver.reefgenomics.org/download/Pver_genes_names_v1.0.fna.gz
#gene models proteins
wget http://pver.reefgenomics.org/download/Pver_proteins_names_v1.0.faa.gz
#gene models GFF
wget http://pver.reefgenomics.org/download/Pver_genome_assembly_v1.0.gff3.gz
#Generate checksums
md5sum *.gz > md5.original.refs.download.20230315
Finally, I updated permissions after all files were downloaded so Jill can collaborate on this project.
chmod u=rwx,g=rwx,o=rwx,a=rwx -R e5-deepdive
Additional steps will be added to this post as we move forward.
2. QC using FastQC and MultiQC
Make directories for fastqc data
mkdir fastqc_raw
Write script for raw QC
nano fastqc_raw.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
#SBATCH --error="fastqc_raw_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_raw_output" #once your job is completed, any final job report comments will be put in this file
# Load modules needed
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start raw QC" $(date)
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq
for file in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_raw
done
echo "End raw QC" $(date)
# Compile MultiQC report from fastQC files
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_raw
multiqc --interactive ./
echo "Cleaned MultiQC report generated." $(date)
Submitted batch job 243914. Script ran in ~4.5 hours.
Raw sequence MultiQC summary:
The raw MultiQC report can be found on the E5 Deep Dive GitHub repo here.
There was high adapter content in these raw sequences. Overall, the quality score was high and there were no red flags in other metrics.
3. Trimming sequences with FastP
Make a script to trim sequences with FastP. We will be removing adapters, trimming poly-g, and trimming by quality phred score with a cut off of 30.
nano fastp.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
#SBATCH --error="fastp_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastp_output" #once your job is completed, any final job report comments will be put in this file
# Load modules needed
module load fastp/0.19.7-foss-2018b
echo "Start trimming with fastp" $(date)
# Make array of sequences to trim
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw
array1=($(ls *1.fastq.gz))
# fastq and fastqc loop
for i in ${array1[@]}; do
fastp --in1 ${i} \
--in2 $(echo ${i}|sed s/_1/_2/)\
--out1 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim.${i} \
--out2 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim.$(echo ${i}|sed s/_1/_2/) \
--detect_adapter_for_pe \
--qualified_quality_phred 30 \
--trim_poly_g
done
echo "Read trimming of adapters complete." $(date)
Submitted batch job 244006. Took about 8 hrs to run
4. QC trimmed sequences
Now we will QC the trimmed sequences. Make a script.
nano fastqc_trim.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
#SBATCH --error="fastqc_trim_error" #if your job fails, the error report will be put in this file
#SBATCH --output="fastqc_trim_output" #once your job is completed, any final job report comments will be put in this file
# Load modules needed
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.9-intel-2020a-Python-3.8.2
echo "Start trimmed QC" $(date)
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq
for file in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*fastq.gz
do
fastqc $file --outdir /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_trim
done
echo "End trimmed QC" $(date)
# Compile MultiQC report from fastQC files
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/fastqc_trim
multiqc --interactive ./
echo "Trimmed MultiQC report generated." $(date)
Submitted batch job 244018.
Trimmed sequence MultiQC summary:
The trimmed MultiQC report can be found on the E5 Deep Dive GitHub repo here.
Trimming removed the adapter content in these sequences. The quality scores are very high across the sequence length, so we are not going to trim by length in this iteration. We can return to the trimming step in the future if we need to make changes.
Calculate sequence length
We also ran a script to calculate sequence length of raw and trimmed sequences.
Make a script.
nano length.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
#SBATCH --error="seq_length_error" #if your job fails, the error report will be put in this file
#SBATCH --output="seq_length_output" #once your job is completed, any final job report comments will be put in this file
echo "Start sequence length counts" $(date)
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq
zgrep -c "@SRR" /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/*.gz > /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/raw_seq_counts.txt
zgrep -c "@SRR" /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*.gz > /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/trim_seq_counts.txt
echo "End sequence length counts" $(date)
Submitted batch job 244020.
The results are below. There are about 20-25 million sequences in each file.
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101688_1.fastq.gz:23396439
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101688_2.fastq.gz:23396439
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101689_1.fastq.gz:24990687
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101689_2.fastq.gz:24990687
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101690_1.fastq.gz:23267238
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101690_2.fastq.gz:23267238
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101691_1.fastq.gz:29417106
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101691_2.fastq.gz:29417106
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101692_1.fastq.gz:22578178
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101692_2.fastq.gz:22578178
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101693_1.fastq.gz:20905338
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101693_2.fastq.gz:20905338
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101694_1.fastq.gz:22990550
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101694_2.fastq.gz:22990550
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101695_1.fastq.gz:16484060
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101695_2.fastq.gz:16484060
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101696_1.fastq.gz:24030431
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101696_2.fastq.gz:24030431
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101697_1.fastq.gz:24846067
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101697_2.fastq.gz:24846067
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101699_1.fastq.gz:22733345
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101699_2.fastq.gz:22733345
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101700_1.fastq.gz:25158606
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101700_2.fastq.gz:25158606
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101701_1.fastq.gz:29523059
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101701_2.fastq.gz:29523059
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101702_1.fastq.gz:23796710
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101702_2.fastq.gz:23796710
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101703_1.fastq.gz:22630044
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101703_2.fastq.gz:22630044
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101704_1.fastq.gz:24455094
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101704_2.fastq.gz:24455094
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101705_1.fastq.gz:27805087
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101705_2.fastq.gz:27805087
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101706_1.fastq.gz:18568153
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101706_2.fastq.gz:18568153
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101707_1.fastq.gz:29131006
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101707_2.fastq.gz:29131006
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101708_1.fastq.gz:25068848
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101708_2.fastq.gz:25068848
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101710_1.fastq.gz:23675842
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101710_2.fastq.gz:23675842
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101711_1.fastq.gz:24020166
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101711_2.fastq.gz:24020166
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101712_1.fastq.gz:24937261
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101712_2.fastq.gz:24937261
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101713_1.fastq.gz:16381619
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101713_2.fastq.gz:16381619
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101715_1.fastq.gz:21751368
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101715_2.fastq.gz:21751368
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101718_1.fastq.gz:27076800
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101718_2.fastq.gz:27076800
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101719_1.fastq.gz:24642131
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101719_2.fastq.gz:24642131
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101720_1.fastq.gz:26361873
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101720_2.fastq.gz:26361873
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101721_1.fastq.gz:17900262
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101721_2.fastq.gz:17900262
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101722_1.fastq.gz:25641209
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101722_2.fastq.gz:25641209
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101723_1.fastq.gz:23770610
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101723_2.fastq.gz:23770610
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101724_1.fastq.gz:25368402
/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/raw/SRR15101724_2.fastq.gz:25368402
5. Align sequences to the reference genome
We are using HISAT2
and samtools
to build the reference genome and align our sequences to this reference. See Step 1 above for reference genome information.
Build reference genome
Create a script.
nano build.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
#SBATCH --error="build_error" #if your job fails, the error report will be put in this file
#SBATCH --output="build_output" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load HISAT2/2.2.1-gompi-2021b #Alignment to reference genome: HISAT2
# Unzip reference genome
#gunzip /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1.0.fasta.gz
# Index reference genome
hisat2-build -f /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1.0.fasta Pver_ref
echo "Reference genome indexed." $(date)
This built the reference genome we need to do the alignment.
Align sequences
Create script.
nano align.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --error="align_error" #if your job fails, the error report will be put in this file
#SBATCH --output="align_output" #once your job is completed, any final job report comments will be put in this file
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
## Genome already referenced
echo "Start alignment" $(date)
# Alignment of clean reads to the reference genome
# Make array of sequences to trim
array1=($(ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/*_1.fastq.gz | xargs -n 1 basename))
for i in ${array1[@]}; do
hisat2 -p 8 --rna-strandness RF --dta -q -x /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_ref -1 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/${i} -2 /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/trim/$(echo ${i}|sed s/_1/_2/) -S /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
samtools sort -@ 8 -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.bam /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
echo "${i} bam-ified!"
rm /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/${i}.sam
done
echo "Alignment completed" $(date)
This script will align the sequences to the references, generate .bam
and .sam
files, and then remove the unneeded and large .sam
files.
This was started on 20230323 at 09:45 Pacific Time job 244260.
Alignment rates range from 69-75% as expected.
Calculate the alignment rates
Next, run a script to output mapping and alignment statistics.
nano mapping_rate.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --error="mapping_rate_error" #if your job fails, the error report will be put in this file
#SBATCH --output="mapping_rate_output" #once your job is completed, any final job report comments will be put in this file
module load SAMtools/1.9-foss-2018b
echo "Start mapping rate calculations from .bam files" $(date)
for i in /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/*.bam; do
echo "${i}" >> /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/mapped_reads_counts_Pverr
samtools flagstat ${i} | grep "mapped (" >> /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/mapped_reads_counts_Pverr
done
echo "Mapping rates completed" $(date)
6. Assemble and map sequences to the reference
Fix GFF file formats
Before assembling reads, we must fix the GFF format to include transcript_id= and gene_id= in the information column using an R script. The R script is below and available on GitHub here.
#This script add transcript and gene id into GFF file for alignment.
#Here, I'll be adding transcript_id= and gene_id= to 'gene' column in order to properly assemble our aligned data
#Load libraries and data.
#Load libraries
library(tidyverse)
library(R.utils)
#Load gene gff file
gff <- read.csv(file = "~/Desktop/GFFs/pverr/Pver_genome_assembly_v1.0.gff3", header = F, sep = "\t", skip = 1)
#Rename columns
colnames(gff) <- c("scaffold", "Gene.Predict", "id", "gene.start","gene.stop", "pos1", "pos2","pos3", "gene")
# Remove all rows with "#" character - this gff has # denoting protein sequences. Since we don't need the protein sequences right now, I'm going to remove them
gff <- gff[!grepl("#", gff$scaffold),]
#Create transcript ID
gff$transcript_id <- sub(";.*", "", gff$gene)
gff$transcript_id <- gsub("ID=", "", gff$transcript_id) #remove ID=
head(gff)
#Create Parent ID
gff$parent_id <- sub(".*Parent=", "", gff$gene)
gff$parent_id <- sub(";.*", "", gff$parent_id)
gff$parent_id <- gsub("ID=", "", gff$parent_id) #remove ID=
head(gff)
#Add these values back into the gene column separated by semicolons
gff <- gff %>%
mutate(gene = ifelse(id != "gene", paste0(gene, ";transcript_id=", gff$transcript_id, ";gene_id=", gff$parent_id), paste0(gene)))
head(gff)
## if this gff does not work in the stringtie script, go back and edit so that transcript_id=Pver_g1.t2 instead of transcript_id=Pver_g1.t2.utr5p1 or transcript_id=Pver_g1.t2.exon1, etc
#Remove parent and transcript columns
gff <- gff %>%
select(!transcript_id)%>%
select(!parent_id)
#Save file
write.table(gff, file = "~/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3", sep="\t", col.names = FALSE, row.names=FALSE, quote=FALSE)
Zip file and upload to HPC for bioinformatic use
gzip /Users/jillashey/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3
scp /Users/jillashey/Desktop/GFFs/pverr/Pver_genome_assembly_v1_fixed.gff3.gz jillashey@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs
Now we can assemble the reads via stringtie!
Assemble reads with StringTie
Prepare the directories.
cd /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs
gunzip Pver_genome_assembly_v1_fixed.gff3
cd ../
mkdir assembled
cd scripts
Make a new script.
nano assemble.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --error="assemble_error" #if your job fails, the error report will be put in this file
#SBATCH --output="assemble_output" #once your job is completed, any final job report comments will be put in this file
module load StringTie/2.2.1-GCC-11.2.0
echo "StringTie assembly start" $(date)
array=($(ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/mapped/*.bam)) #Make an array of sequences to assemble
for i in ${array[@]}; do
stringtie -p 8 -e -B -G /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -A /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/${i}.gene_abund.tab -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/${i}.gtf ${i}
echo "StringTie assembly for seq file ${i}" $(date)
done
echo "StringTie assembly complete, starting assembly analysis" $(date)
We ran this script and it successfully generated .gtf files as expected. However, there was an error that we have not seen before. We will look at the data and see if it caused an issue in file formats. Preliminary investigations do not show any issues with the files.
head assemble_error
Warning: invalid start coordinate at line:
3_prime_partial true NA NA ;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
5_prime_partial true NA NA ;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
3_prime_partial true NA NA ;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
5_prime_partial true NA NA ;transcript_id=;gene_id=
Warning: invalid start coordinate at line:
3_prime_partial true NA NA ;transcript_id=;gene_id=
Merge GTF files
nano merge.sh
#!/bin/bash
#SBATCH -t 100:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=jillashey@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts/
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --error="merge_error" #if your job fails, the error report will be put in this file
#SBATCH --output="merge_output" #once your job is completed, any final job report comments will be put in this file
# Load modules
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0
module load Python/3.9.6-GCCcore-11.2.0
echo "StringTie merge start" $(date)
# Make list of gtfs
ls /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/*.gtf > gtf_list.txt
# Merge gtfs
stringtie --merge -e -p 8 -G /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_merged.gtf gtf_list.txt #Merge GTFs
echo "Stringtie merge complete" $(date)
# Compare gtfs and compute accuracy
gffcompare -r /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/refs/Pver_genome_assembly_v1_fixed.gff3 -G -o /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/merged /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_merged.gtf
echo "GFFcompare complete, Starting gene count matrix assembly..." $(date)
## Note from ZD code:
#Note: the merged part is actually redundant and unnecessary unless we perform the original stringtie step without the -e function and perform
#re-estimation with -e after stringtie --merge, but will redo the pipeline later and confirm that I get equal results.
Assembly complete!
7. Generate gene count matrix with PrepDE
Copy the prepDE.py into the scripts folder from recent project. This file can be directly downloaded from StringTie GitHub here.
cp /data/putnamlab/ashuffmyer/pairs-rnaseq/prepDE.py /data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/scripts
Compile the gene count matrix by first generating a list of files and then applying the prepDE.py script. This was run in an interactive session.
cd assembled/
for filename in *bam.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
interactive
module load Python/2.7.15-foss-2018b
python prepDE.py -g Pverr_gene_count_matrix.csv -i ./listGTF.txt
Note that we had an error running this script with Python v3. We instead loaded a previous Python v2. This may not be an issue with more recent verions of the prepDE.py script.
Finally, add gene count matrix to E5 repository on GitHub. It is available at this link here.
scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/e5-deepdive/rna-seq/assembled/Pverr_gene_count_matrix.csv ~/MyProjects/E5/deep-dive/A-Pver/data/rna-seq
The next step will be to assemble metadata and run preliminary analyses.