Alignment and assembly of RNAseq reads for Montipora capitata larval thermal tolerance 2023 project
This post details alignment and assembly steps of the bioinformatic pipeline for the Montipora capitata 2023 larval thermal tolerance project RNAseq. See my notebook posts and my GitHub repo for information on this project.
You can find QC for these files in this post for original data delivery and this post for second data delivery for additional top off sequencing.
Files for this project include RNAseq files for 54 samples. Of these, 13 samples required additional sequencing to meet project deliverables for read depth. These files are denoted with an “s” after the sample prefix in the file name.
Samples with two rounds of sequencing have been QC’d individually and all files pass QC checks. The files for each sample will be combined/concatenated prior to alignment and assembly.
1. Concatenate files for samples with multiple files
First, sym link all second sequencing (“s”) trimmed files into the same directory as our original trimmed files.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/
mv trimmed-sequences-second-seq trimmed-sequences
cd trimmed-sequences/
ln -s /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/trimmed-sequences-second-seq/*fastq.gz /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences
Next, generate a file that has a list of the number of reads in each file. We will use this to verify that the concatenate functions work correctly.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/
interactive
for file in $(ls *.fastq.gz); do echo $file && echo $(($(zcat $file | wc -l) / 4)); done > fastq_reads.txt
The output is as follows.
File | Trimmed Reads |
---|---|
trim.R100_R1_001.fastq.gz |
22,620,372 |
trim.R100_R2_001.fastq.gz |
22,620,372 |
trim.R101_R1_001.fastq.gz |
24,721,395 |
trim.R101_R2_001.fastq.gz |
24,721,395 |
trim.R102_R1_001.fastq.gz |
21,719,378 |
trim.R102_R2_001.fastq.gz |
21,719,378 |
trim.R103_R1_001.fastq.gz |
15,045,349 |
trim.R103_R2_001.fastq.gz |
15,045,349 |
trim.R104_R1_001.fastq.gz |
20,320,917 |
trim.R104_R2_001.fastq.gz |
20,320,917 |
trim.R105_R1_001.fastq.gz |
19,295,017 |
trim.R105_R2_001.fastq.gz |
19,295,017 |
trim.R106_R1_001.fastq.gz |
23,294,462 |
trim.R106_R2_001.fastq.gz |
23,294,462 |
trim.R107_R1_001.fastq.gz |
4,822,107 |
trim.R107_R2_001.fastq.gz |
4,822,107 |
trim.R107s_R1_001.fastq.gz |
11,457,851 |
trim.R107s_R2_001.fastq.gz |
11,457,851 |
trim.R108_R1_001.fastq.gz |
19,644,360 |
trim.R108_R2_001.fastq.gz |
19,644,360 |
trim.R55_R1_001.fastq.gz |
10,429,847 |
trim.R55_R2_001.fastq.gz |
10,429,847 |
trim.R55s_R1_001.fastq.gz |
1,920,344 |
trim.R55s_R2_001.fastq.gz |
1,920,344 |
trim.R56_R1_001.fastq.gz |
10,950,200 |
trim.R56_R2_001.fastq.gz |
10,950,200 |
trim.R56s_R1_001.fastq.gz |
1,081,171 |
trim.R56s_R2_001.fastq.gz |
1,081,171 |
trim.R57_R1_001.fastq.gz |
9,818,404 |
trim.R57_R2_001.fastq.gz |
9,818,404 |
trim.R57s_R1_001.fastq.gz |
2,218,538 |
trim.R57s_R2_001.fastq.gz |
2,218,538 |
trim.R58_R1_001.fastq.gz |
11,599,234 |
trim.R58_R2_001.fastq.gz |
11,599,234 |
trim.R58s_R1_001.fastq.gz |
922,553 |
trim.R58s_R2_001.fastq.gz |
922,553 |
trim.R59_R1_001.fastq.gz |
4,181,880 |
trim.R59_R2_001.fastq.gz |
4,181,880 |
trim.R59s_R1_001.fastq.gz |
22,099,783 |
trim.R59s_R2_001.fastq.gz |
22,099,783 |
trim.R60_R1_001.fastq.gz |
10,371,860 |
trim.R60_R2_001.fastq.gz |
10,371,860 |
trim.R60s_R1_001.fastq.gz |
1,683,270 |
trim.R60s_R2_001.fastq.gz |
1,683,270 |
trim.R61_R1_001.fastq.gz |
12,840,605 |
trim.R61_R2_001.fastq.gz |
12,840,605 |
trim.R62_R1_001.fastq.gz |
10,586,054 |
trim.R62_R2_001.fastq.gz |
10,586,054 |
trim.R62s_R1_001.fastq.gz |
1,533,496 |
trim.R62s_R2_001.fastq.gz |
1,533,496 |
trim.R63_R1_001.fastq.gz |
16,366,473 |
trim.R63_R2_001.fastq.gz |
16,366,473 |
trim.R64_R1_001.fastq.gz |
21,582,713 |
trim.R64_R2_001.fastq.gz |
21,582,713 |
trim.R65_R1_001.fastq.gz |
20,881,937 |
trim.R65_R2_001.fastq.gz |
20,881,937 |
trim.R66_R1_001.fastq.gz |
21,685,720 |
trim.R66_R2_001.fastq.gz |
21,685,720 |
trim.R67_R1_001.fastq.gz |
5,427,517 |
trim.R67_R2_001.fastq.gz |
5,427,517 |
trim.R67s_R1_001.fastq.gz |
11,256,254 |
trim.R67s_R2_001.fastq.gz |
11,256,254 |
trim.R68_R1_001.fastq.gz |
20,670,181 |
trim.R68_R2_001.fastq.gz |
20,670,181 |
trim.R69_R1_001.fastq.gz |
26,128,049 |
trim.R69_R2_001.fastq.gz |
26,128,049 |
trim.R70_R1_001.fastq.gz |
22,666,926 |
trim.R70_R2_001.fastq.gz |
22,666,926 |
trim.R71_R1_001.fastq.gz |
19,026,441 |
trim.R71_R2_001.fastq.gz |
19,026,441 |
trim.R72_R1_001.fastq.gz |
22,047,415 |
trim.R72_R2_001.fastq.gz |
22,047,415 |
trim.R73_R1_001.fastq.gz |
23,215,360 |
trim.R73_R2_001.fastq.gz |
23,215,360 |
trim.R74_R1_001.fastq.gz |
23,372,268 |
trim.R74_R2_001.fastq.gz |
23,372,268 |
trim.R75_R1_001.fastq.gz |
7,020,548 |
trim.R75_R2_001.fastq.gz |
7,020,548 |
trim.R75s_R1_001.fastq.gz |
7,259,253 |
trim.R75s_R2_001.fastq.gz |
7,259,253 |
trim.R76_R1_001.fastq.gz |
22,091,635 |
trim.R76_R2_001.fastq.gz |
22,091,635 |
trim.R77_R1_001.fastq.gz |
26,679,183 |
trim.R77_R2_001.fastq.gz |
26,679,183 |
trim.R78_R1_001.fastq.gz |
23,235,087 |
trim.R78_R2_001.fastq.gz |
23,235,087 |
trim.R79_R1_001.fastq.gz |
16,737,093 |
trim.R79_R2_001.fastq.gz |
16,737,093 |
trim.R80_R1_001.fastq.gz |
23,504,936 |
trim.R80_R2_001.fastq.gz |
23,504,936 |
trim.R81_R1_001.fastq.gz |
21,501,629 |
trim.R81_R2_001.fastq.gz |
21,501,629 |
trim.R82_R1_001.fastq.gz |
23,350,580 |
trim.R82_R2_001.fastq.gz |
23,350,580 |
trim.R83_R1_001.fastq.gz |
5,263,676 |
trim.R83_R2_001.fastq.gz |
5,263,676 |
trim.R83s_R1_001.fastq.gz |
10,665,955 |
trim.R83s_R2_001.fastq.gz |
10,665,955 |
trim.R84_R1_001.fastq.gz |
21,296,217 |
trim.R84_R2_001.fastq.gz |
21,296,217 |
trim.R85_R1_001.fastq.gz |
23,839,002 |
trim.R85_R2_001.fastq.gz |
23,839,002 |
trim.R86_R1_001.fastq.gz |
21,015,291 |
trim.R86_R2_001.fastq.gz |
21,015,291 |
trim.R87_R1_001.fastq.gz |
16,971,668 |
trim.R87_R2_001.fastq.gz |
16,971,668 |
trim.R88_R1_001.fastq.gz |
19,964,497 |
trim.R88_R2_001.fastq.gz |
19,964,497 |
trim.R89_R1_001.fastq.gz |
21,793,515 |
trim.R89_R2_001.fastq.gz |
21,793,515 |
trim.R90_R1_001.fastq.gz |
21,601,078 |
trim.R90_R2_001.fastq.gz |
21,601,078 |
trim.R91_R1_001.fastq.gz |
4,829,073 |
trim.R91_R2_001.fastq.gz |
4,829,073 |
trim.R91s_R1_001.fastq.gz |
11,581,372 |
trim.R91s_R2_001.fastq.gz |
11,581,372 |
trim.R92_R1_001.fastq.gz |
18,998,179 |
trim.R92_R2_001.fastq.gz |
18,998,179 |
trim.R93_R1_001.fastq.gz |
21,278,345 |
trim.R93_R2_001.fastq.gz |
21,278,345 |
trim.R94_R1_001.fastq.gz |
21,321,504 |
trim.R94_R2_001.fastq.gz |
21,321,504 |
trim.R95_R1_001.fastq.gz |
18,342,977 |
trim.R95_R2_001.fastq.gz |
18,342,977 |
trim.R96_R1_001.fastq.gz |
23,025,861 |
trim.R96_R2_001.fastq.gz |
23,025,861 |
trim.R97_R1_001.fastq.gz |
21,450,165 |
trim.R97_R2_001.fastq.gz |
21,450,165 |
trim.R98_R1_001.fastq.gz |
23,491,288 |
trim.R98_R2_001.fastq.gz |
23,491,288 |
trim.R99_R1_001.fastq.gz |
5,743,661 |
trim.R99_R2_001.fastq.gz |
5,743,661 |
trim.R99s_R1_001.fastq.gz |
9,230,551 |
trim.R99s_R2_001.fastq.gz |
9,230,551 |
The samples with “s” indicate second data delivery. Notice that samples with lowest read depth in the original data delivery have much greater read depth in the second round of sequencing to meet project deliverables at Azenta.
Now, concatenate the files for each sample by concatenating R1 files together and R2 files together for samples that had additional sequencing. Concatenate files manually since we have a small number and I want to check each one. Name the concatenated files with “c” after the sample prefix.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences
interactive
cat trim.R55*R1*.fastq.gz >> trim.R55c_R1_001.fastq.gz #done
cat trim.R55*R2*.fastq.gz >> trim.R55c_R2_001.fastq.gz #done
cat trim.R56*R1*.fastq.gz >> trim.R56c_R1_001.fastq.gz #done
cat trim.R56*R2*.fastq.gz >> trim.R56c_R2_001.fastq.gz #done
cat trim.R57*R1*.fastq.gz >> trim.R57c_R1_001.fastq.gz #done
cat trim.R57*R2*.fastq.gz >> trim.R57c_R2_001.fastq.gz #done
cat trim.R58*R1*.fastq.gz >> trim.R58c_R1_001.fastq.gz #done
cat trim.R58*R2*.fastq.gz >> trim.R58c_R2_001.fastq.gz #done
cat trim.R59*R1*.fastq.gz >> trim.R59c_R1_001.fastq.gz #done
cat trim.R59*R2*.fastq.gz >> trim.R59c_R2_001.fastq.gz #done
cat trim.R60*R1*.fastq.gz >> trim.R60c_R1_001.fastq.gz #done
cat trim.R60*R2*.fastq.gz >> trim.R60c_R2_001.fastq.gz #done
cat trim.R62*R1*.fastq.gz >> trim.R62c_R1_001.fastq.gz #done
cat trim.R62*R2*.fastq.gz >> trim.R62c_R2_001.fastq.gz #done
cat trim.R67*R1*.fastq.gz >> trim.R67c_R1_001.fastq.gz #done
cat trim.R67*R2*.fastq.gz >> trim.R67c_R2_001.fastq.gz #done
cat trim.R75*R1*.fastq.gz >> trim.R75c_R1_001.fastq.gz #done
cat trim.R75*R2*.fastq.gz >> trim.R75c_R2_001.fastq.gz #done
cat trim.R83*R1*.fastq.gz >> trim.R83c_R1_001.fastq.gz #done
cat trim.R83*R2*.fastq.gz >> trim.R83c_R2_001.fastq.gz #done
cat trim.R91*R1*.fastq.gz >> trim.R91c_R1_001.fastq.gz #done
cat trim.R91*R2*.fastq.gz >> trim.R91c_R2_001.fastq.gz #done
cat trim.R99*R1*.fastq.gz >> trim.R99c_R1_001.fastq.gz #done
cat trim.R99*R2*.fastq.gz >> trim.R99c_R2_001.fastq.gz #done
cat trim.R107*R1*.fastq.gz >> trim.R107c_R1_001.fastq.gz #done
cat trim.R107*R2*.fastq.gz >> trim.R107c_R2_001.fastq.gz #done
exit
The notation for these is to concatenate the first and second data delivery together for R1 and then for R2.
ls trim.R55*R1*.fastq.gz
generates trim.R55_R1_001.fastq.gz trim.R55s_R1_001.fastq.gz
and
ls trim.R55*R2*.fastq.gz
generates trim.R55_R2_001.fastq.gz trim.R55s_R2_001.fastq.gz
Test that the number of reads increases for one file.
cat trim.R55*R1*.fastq.gz >> trim.R55c_R1_001.fastq.gz
$(($(zcat trim.R55c_R1_001.fastq.gz | wc -l) / 4))
R55 R1 original was 10,429,847 reads and R55 R1 second delivery was 1,920,344 reads. It is now 12,350,191 reads. 12,350,191 reads is expected. This code worked, proceed with all files.
Now, move any files from the concatenation out of this directory.
For files that were concatenated, view the number of reads against the expected value (original + second).
interactive
# view concatenated reads
for file in $(ls *c*.fastq.gz); do echo $file && echo $(($(zcat $file | wc -l) / 4)); done > fastq_cat_reads.txt
exit
All concatenated files have the expected number of reads.
Sample | Expected Total | Cancatenated Total |
---|---|---|
R55 | 12,350,191 | 12,350,191 |
R56 | 12,031,371 | 12,031,371 |
R57 | 12,036,942 | 12,036,942 |
R58 | 12,521,787 | 12,521,787 |
R59 | 26,281,663 | 26,281,663 |
R60 | 12,055,130 | 12,055,130 |
R62 | 12,119,550 | 12,119,550 |
R67 | 16,683,771 | 16,683,771 |
R75 | 14,279,801 | 14,279,801 |
R83 | 15,929,631 | 15,929,631 |
R91 | 16,410,445 | 16,410,445 |
R99 | 14,974,,212 | 14,974,212 |
R107 | 16,279,958 | 16,279,958 |
Finally, make a new directory with sym links to the files that will go forward for mapping and alignment (concatenated files for those with multiple data deliveries).
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences
mkdir cat-sequences
cd cat-sequences
ln -s /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/*fastq.gz /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences
Remove files that will not be used.
#remove the second data delivery files
rm *s*.fastq.gz
#manually remove the first data delivery files
rm trim.R55_R*fastq.gz
rm trim.R56_R*fastq.gz
rm trim.R57_R*fastq.gz
rm trim.R58_R*fastq.gz
rm trim.R59_R*fastq.gz
rm trim.R60_R*fastq.gz
rm trim.R62_R*fastq.gz
rm trim.R67_R*fastq.gz
rm trim.R75_R*fastq.gz
rm trim.R83_R*fastq.gz
rm trim.R91_R*fastq.gz
rm trim.R99_R*fastq.gz
rm trim.R107_R*fastq.gz
The files we want to proceed with are now in /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences
and are as follows:
trim.R100_R1_001.fastq.gz trim.R57c_R1_001.fastq.gz trim.R68_R1_001.fastq.gz trim.R79_R1_001.fastq.gz trim.R90_R1_001.fastq.gz
trim.R100_R2_001.fastq.gz trim.R57c_R2_001.fastq.gz trim.R68_R2_001.fastq.gz trim.R79_R2_001.fastq.gz trim.R90_R2_001.fastq.gz
trim.R101_R1_001.fastq.gz trim.R58c_R1_001.fastq.gz trim.R69_R1_001.fastq.gz trim.R80_R1_001.fastq.gz trim.R91c_R1_001.fastq.gz
trim.R101_R2_001.fastq.gz trim.R58c_R2_001.fastq.gz trim.R69_R2_001.fastq.gz trim.R80_R2_001.fastq.gz trim.R91c_R2_001.fastq.gz
trim.R102_R1_001.fastq.gz trim.R59c_R1_001.fastq.gz trim.R70_R1_001.fastq.gz trim.R81_R1_001.fastq.gz trim.R92_R1_001.fastq.gz
trim.R102_R2_001.fastq.gz trim.R59c_R2_001.fastq.gz trim.R70_R2_001.fastq.gz trim.R81_R2_001.fastq.gz trim.R92_R2_001.fastq.gz
trim.R103_R1_001.fastq.gz trim.R60c_R1_001.fastq.gz trim.R71_R1_001.fastq.gz trim.R82_R1_001.fastq.gz trim.R93_R1_001.fastq.gz
trim.R103_R2_001.fastq.gz trim.R60c_R2_001.fastq.gz trim.R71_R2_001.fastq.gz trim.R82_R2_001.fastq.gz trim.R93_R2_001.fastq.gz
trim.R104_R1_001.fastq.gz trim.R61_R1_001.fastq.gz trim.R72_R1_001.fastq.gz trim.R83c_R1_001.fastq.gz trim.R94_R1_001.fastq.gz
trim.R104_R2_001.fastq.gz trim.R61_R2_001.fastq.gz trim.R72_R2_001.fastq.gz trim.R83c_R2_001.fastq.gz trim.R94_R2_001.fastq.gz
trim.R105_R1_001.fastq.gz trim.R62c_R1_001.fastq.gz trim.R73_R1_001.fastq.gz trim.R84_R1_001.fastq.gz trim.R95_R1_001.fastq.gz
trim.R105_R2_001.fastq.gz trim.R62c_R2_001.fastq.gz trim.R73_R2_001.fastq.gz trim.R84_R2_001.fastq.gz trim.R95_R2_001.fastq.gz
trim.R106_R1_001.fastq.gz trim.R63_R1_001.fastq.gz trim.R74_R1_001.fastq.gz trim.R85_R1_001.fastq.gz trim.R96_R1_001.fastq.gz
trim.R106_R2_001.fastq.gz trim.R63_R2_001.fastq.gz trim.R74_R2_001.fastq.gz trim.R85_R2_001.fastq.gz trim.R96_R2_001.fastq.gz
trim.R107c_R1_001.fastq.gz trim.R64_R1_001.fastq.gz trim.R75c_R1_001.fastq.gz trim.R86_R1_001.fastq.gz trim.R97_R1_001.fastq.gz
trim.R107c_R2_001.fastq.gz trim.R64_R2_001.fastq.gz trim.R75c_R2_001.fastq.gz trim.R86_R2_001.fastq.gz trim.R97_R2_001.fastq.gz
trim.R108_R1_001.fastq.gz trim.R65_R1_001.fastq.gz trim.R76_R1_001.fastq.gz trim.R87_R1_001.fastq.gz trim.R98_R1_001.fastq.gz
trim.R108_R2_001.fastq.gz trim.R65_R2_001.fastq.gz trim.R76_R2_001.fastq.gz trim.R87_R2_001.fastq.gz trim.R98_R2_001.fastq.gz
trim.R55c_R1_001.fastq.gz trim.R66_R1_001.fastq.gz trim.R77_R1_001.fastq.gz trim.R88_R1_001.fastq.gz trim.R99c_R1_001.fastq.gz
trim.R55c_R2_001.fastq.gz trim.R66_R2_001.fastq.gz trim.R77_R2_001.fastq.gz trim.R88_R2_001.fastq.gz trim.R99c_R2_001.fastq.gz
trim.R56c_R1_001.fastq.gz trim.R67c_R1_001.fastq.gz trim.R78_R1_001.fastq.gz trim.R89_R1_001.fastq.gz
trim.R56c_R2_001.fastq.gz trim.R67c_R2_001.fastq.gz trim.R78_R2_001.fastq.gz trim.R89_R2_001.fastq.gz
2. Align reads to M. capitata genome using hisat2
Download reference genome and functional annotations
I first downloaded the Montipora capitata reference genome details in Stephens et al. 2022 and available as Version 3 online here.
The URL’s of the files we need to download are:
-
Genome assembly:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz
-
GFF:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.gff3.gz
Other files available include the following. We will use the functional annotation files in downstream steps.
-
Functional annotation:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.EggNog_results.txt.gz
-
Protein:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.pep.faa.gz
-
Nucleotide CDS:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.cds.fna.gz
-
CD Search Functional Annotation:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.Conserved_Domain_Search_results.txt.gz
-
KAAS Functional Annotation:
http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.genes.KEGG_results.txt.gz
Fix GFF3 format
Note that I have to first correct the format of the .gff3 file. I had to do this before in a previous analysis because it generated many unknown gene names even though the sequences mapped to the reference. We have to correct the format of column 9.
Download gff to local computer a the website listed above to ~/MyProjects/larval_symbiont_TPC/scripts/bioinformatics
.
cd ~/MyProjects/larval_symbiont_TPC/data/rna_seq
gunzip Montipora_capitata_HIv3.genes.gff3.gz
I next ran this R script in the GitHub R project for this data.
#This script add transcript and gene id into GFF file for alignment.
#Here, I'll be adding transcript_id= and gene_id= to 'gene' column because we needs that label to map our RNAseq data
#Load libraries and data.
library(tidyverse)
library(R.utils)
#Load gene gff file
gff <- read.csv(file="data/rna_seq/Montipora_capitata_HIv3.genes.gff3", header=FALSE, sep="\t")
#Rename columns
colnames(gff) <- c("scaffold", "Gene.Predict", "id", "gene.start","gene.stop", "pos1", "pos2","pos3", "gene")
head(gff)
#Create transcript ID
gff$transcript_id <- sub(";.*", "", gff$gene)
gff$transcript_id <- gsub("ID=", "", gff$transcript_id) #remove ID=
gff$transcript_id <- gsub("Parent=", "", gff$transcript_id) #remove ID=
head(gff)
#Create Parent ID
gff$parent_id <- sub(".*Parent=", "", gff$gene)
gff$parent_id <- sub(";.*", "", gff$parent_id)
gff$parent_id <- gsub("ID=", "", gff$parent_id) #remove ID=
head(gff)
#Now add these values into the gene column separated by semicolons.
gff <- gff %>%
mutate(gene = ifelse(id != "gene", paste0(gene, ";transcript_id=", gff$transcript_id, ";gene_id=", gff$parent_id), paste0(gene)))
head(gff)
#Now remove the transcript and parent ID separate columns.
gff<-gff %>%
select(!transcript_id)%>%
select(!parent_id)
head(gff)
#Save file. Then upload this to Andromeda for use in bioinformatic steps.
write.table(gff, file="data/rna_seq/Montipora_capitata_HIv3.genes_fixed.gff3", sep="\t", col.names = FALSE, row.names=FALSE, quote=FALSE)
Upload file to Andromeda.
scp ~/MyProjects/larval_symbiont_TPC/data/rna_seq/Montipora_capitata_HIv3.genes_fixed.gff3 ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/
Transfer files to Andromeda
Download reference genome files to Andromeda folder.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/
mkdir references
cd references
wget http://cyanophora.rutgers.edu/montipora/Montipora_capitata_HIv3.assembly.fasta.gz
gunzip Montipora_capitata_HIv3.assembly.fasta.gz
These files are now available at /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references
as Montipora_capitata_HIv3.assembly.fasta
and Montipora_capitata_HIv3.genes_fixed.gff3
.
Alignment with hisat2
Write a script for alignment. These are based on Jill Ashey’s scripts.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts
nano align.sh
The manual for hisat2 can be found here. Here, we will align trimmed and concatenated sequences generated above. We will use multiple processors (-p 8
) and --rna-strandness RF
to specify strand-specific information. We will keep the .sam file. The options used are below:
-f
Reads are FASTA files - used for building reference inhisat2-build
x
The basename of the index for the reference genome-p 8
Use multiple processors (8)--rna-strandness RF
Specify strand-specific information: the default is unstranded. For paired-end reads, use either FR or RF. With this option being used, every read alignment will have an XS attribute tag: ’+’ means a read belongs to a transcript on ‘+’ strand of genome. ‘-‘ means a read belongs to a transcript on ‘-‘ strand of genome.--dta
Downstream transcriptome assembly. Report alignments tailored for transcript assemblers including StringTie. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short-anchors, which helps transcript assemblers improve significantly in computation and memory usage. We are using StringTie below.-1
Comma-separated list of files containing mate 1s (filename usually includes _1) (R1 files)-2
Comma-separated list of files containing mate 2s (filename usually includes _2) (R2 files)sed s/_R1/_R2/
: This is a sed command that performs a search and replace operation. It replaces_R1
with_R2
. This allows the R1 file to be specified in the input array and the program will read in the corresponding R2 file.-S
File to write SAM alignments to.- In the
samtools sort
function,-@
sets number of threads for parallel processing and-o
sets the output file.
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences/
#SBATCH -o align-out.out
#SBATCH -e align-error.error
# load modules needed
module load HISAT2/2.2.1-foss-2019b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
#echo "Building genome reference" $(date)
# index the reference genome for Pacuta output index to working directory
hisat2-build -f /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.assembly.fasta /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Mcapitata_ref
echo "Referece genome indexed. Starting alingment" $(date)
# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed
array=($(ls /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences/*_R1_001.fastq.gz)) # call the clean sequences - make an array to align
for i in ${array[@]}; do
sample_name=`echo $i| awk -F [.] '{print $2}'`
hisat2 -p 8 --rna-strandness RF --dta -x /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Mcapitata_ref -1 ${i} -2 $(echo ${i}|sed s/_R1/_R2/) -S ${sample_name}.sam
samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
echo "${i} bam-ified!"
rm ${sample_name}.sam
done
echo "Alignment complete!" $(date)
sbatch align.sh
Job ID 312488 started at 11:00 on 19 April 2024.
Job finished at 03:00 on 20 April 2024.
Obtain mapping percentages after job was completed. Use the samtools flagstat
function that does a full pass through the input file to calculate and print statistics to stdout. Provides counts for each of 13 categories based primarily on bit flags in the FLAG field. Each category in the output is broken down into QC pass and QC fail, which is presented as “#PASS + #FAIL” followed by a description of the category.
interactive
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences/
module load SAMtools/1.9-foss-2018b
for i in *.bam; do
echo "${i}" >> mapped_reads_counts
samtools flagstat ${i} | grep "mapped (" >> mapped_reads_counts
done
View the results.
less mapped_reads_counts
R100_R1_001.bam
48546456 + 0 mapped (88.52% : N/A)
R101_R1_001.bam
50823099 + 0 mapped (88.61% : N/A)
R102_R1_001.bam
47629227 + 0 mapped (88.93% : N/A)
R103_R1_001.bam
34061219 + 0 mapped (88.57% : N/A)
R104_R1_001.bam
38331360 + 0 mapped (83.35% : N/A)
R105_R1_001.bam
38291367 + 0 mapped (86.94% : N/A)
R106_R1_001.bam
44981515 + 0 mapped (85.40% : N/A)
R107c_R1_001.bam
34154018 + 0 mapped (88.21% : N/A)
R108_R1_001.bam
39118124 + 0 mapped (86.82% : N/A)
R55c_R1_001.bam
24191832 + 0 mapped (85.93% : N/A)
R56c_R1_001.bam
21720710 + 0 mapped (80.27% : N/A)
R57c_R1_001.bam
22995053 + 0 mapped (84.52% : N/A)
R58c_R1_001.bam
24584697 + 0 mapped (85.47% : N/A)
R59c_R1_001.bam
50409476 + 0 mapped (84.59% : N/A)
R60c_R1_001.bam
23258702 + 0 mapped (84.56% : N/A)
R61_R1_001.bam
28733929 + 0 mapped (89.69% : N/A)
R62c_R1_001.bam
25793167 + 0 mapped (87.32% : N/A)
R63_R1_001.bam
31334562 + 0 mapped (84.81% : N/A)
R64_R1_001.bam
42529670 + 0 mapped (86.47% : N/A)
R65_R1_001.bam
41257485 + 0 mapped (86.56% : N/A)
R66_R1_001.bam
43227744 + 0 mapped (87.96% : N/A)
R67c_R1_001.bam
32687009 + 0 mapped (87.56% : N/A)
R68_R1_001.bam
40790317 + 0 mapped (86.25% : N/A)
R69_R1_001.bam
56513901 + 0 mapped (87.15% : N/A)
R70_R1_001.bam
44931912 + 0 mapped (87.46% : N/A)
R71_R1_001.bam
37858692 + 0 mapped (88.31% : N/A)
R72_R1_001.bam
44050034 + 0 mapped (87.44% : N/A)
R73_R1_001.bam
46100943 + 0 mapped (85.56% : N/A)
R74_R1_001.bam
45238135 + 0 mapped (83.95% : N/A)
R75c_R1_001.bam
27623064 + 0 mapped (84.72% : N/A)
R76_R1_001.bam
42983597 + 0 mapped (84.87% : N/A)
R77_R1_001.bam
50734492 + 0 mapped (83.71% : N/A)
R78_R1_001.bam
45822615 + 0 mapped (86.63% : N/A)
R79_R1_001.bam
37810491 + 0 mapped (87.37% : N/A)
R80_R1_001.bam
47352909 + 0 mapped (87.42% : N/A)
R81_R1_001.bam
41085082 + 0 mapped (83.82% : N/A)
R82_R1_001.bam
46121413 + 0 mapped (86.04% : N/A)
R83c_R1_001.bam
32597478 + 0 mapped (87.63% : N/A)
R84_R1_001.bam
43999246 + 0 mapped (87.44% : N/A)
R85_R1_001.bam
47890197 + 0 mapped (87.61% : N/A)
R86_R1_001.bam
41182836 + 0 mapped (87.11% : N/A)
R87_R1_001.bam
33690883 + 0 mapped (87.49% : N/A)
R88_R1_001.bam
46415439 + 0 mapped (88.28% : N/A)
R89_R1_001.bam
43717178 + 0 mapped (87.52% : N/A)
R90_R1_001.bam
42916789 + 0 mapped (87.57% : N/A)
R91c_R1_001.bam
34549408 + 0 mapped (88.12% : N/A)
R92_R1_001.bam
41520832 + 0 mapped (88.47% : N/A)
R93_R1_001.bam
67123654 + 0 mapped (92.22% : N/A)
R94_R1_001.bam
43946968 + 0 mapped (86.17% : N/A)
R95_R1_001.bam
36354614 + 0 mapped (87.17% : N/A)
R96_R1_001.bam
52937351 + 0 mapped (87.08% : N/A)
R97_R1_001.bam
45017808 + 0 mapped (88.38% : N/A)
R98_R1_001.bam
45239251 + 0 mapped (84.92% : N/A)
R99c_R1_001.bam
35015923 + 0 mapped (90.36% : N/A)
All mapping percentages >80%! This are great for coral alignment and are the upper range of what I have seen for this species.
Note that the output files are named with “R1” because that is the name of files in the input array. R1 and R2 reads were used in the mapping step. If you want, you could remove R1 from the names.
3. Assemble and quantify read counts with stringtie
First, sym link .bam files to a new bam-files directory. I am using sym links throughout this pipeline so that the original scripts will overwrite existing files if something needs to be re run. If desired, you can keep all files in the same directory. I like to keep things in discrete directories to keep it organized.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq
mkdir bam-files
cd bam-files
ln -s /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/trimmed-sequences/cat-sequences/*.bam /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/bam-files
Write a script for assembly.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts
nano assembly.sh
We will use stringtie for assembly with the following options:
-p 8
for using multiple processors-e
this option directs StringTie to operate in expression estimation mode; this limits the processing of read alignments to estimating the coverage of the transcripts given with the -G option (hence this option requires -G).-B
This switch enables the output of Ballgown input table files (.ctab) containing coverage data for the reference transcripts given with the -G option. (See the Ballgown documentation for a description of these files.) With this option StringTie can be used as a direct replacement of the tablemaker program included with the Ballgown distribution. If the option -o is given as a full path to the output transcript file, StringTie will write the .ctab files in the same directory as the output GTF.-G
Use a reference annotation file (in GTF or GFF3 format) to guide the assembly process. The output will include expressed reference transcripts as well as any novel transcripts that are assembled. This option is required by options -B, -b, -e, -C.-A
Gene abundances will be reported (tab delimited format) in the output file with the given name.-o
output file name for the merged transcripts GTF (default: stdout)
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=10
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ashuffmyer@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/bam-files
#SBATCH -o assemble-out.out
#SBATCH -e assemble-error.error
module load StringTie/2.2.1-GCC-11.2.0
echo "Assembling transcripts using stringtie" $(date)
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/bam-files
array=($(ls *.bam)) #Make an array of sequences to assemble
for i in ${array[@]}; do
sample_name=`echo $i| awk -F [_] '{print $1}'`
stringtie -p 8 -e -B -G /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -A ${sample_name}.gene_abund.tab -o ${sample_name}.gtf ${i}
echo "StringTie assembly for seq file ${i}" $(date)
done
echo "Assembly for each sample complete " $(date)
sbatch assembly.sh
Job ID 312546 started at 14:00 on 20 April 2024. Completed 16:00 on April 2024.
Sym link .gtf files to new directory.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/
mkdir gtf-files
cd gtf-files
ln -s /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/bam-files/*.gtf /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/gtf-files
ln -s /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/bam-files/*.tab /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/gtf-files
4. Prepare .gtf files and generate gene count matrix
Make list of .gtf files. I am doing these steps individually, but can be added to a bash script if desired.
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/gtf-files
ls *.gtf > gtf_list.txt
Merge .gtf’s. Use same stringtie settings as above and generate a merged gtf file.
interactive
module load StringTie/2.1.4-GCC-9.3.0
stringtie --merge -e -p 8 -G /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 -o Mcapitata_merged.gtf gtf_list.txt
Assess the accuracy of the merged assembly with gffcompare
.
interactive
module purge
module load GffCompare/0.12.1-GCCcore-8.3.0
gffcompare -r /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 Mcapitata_merged.gtf
54384 reference transcripts loaded.
2 duplicate reference transcripts discarded.
54384 query transfrags loaded.
View the merged file.
less gffcmp.stats
# gffcompare v0.12.1 | Command line was:
#gffcompare -r /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/references/Montipora_capitata_HIv3.genes_fixed.gff3 Mcapitata_merged.gtf
#
#= Summary for dataset: Mcapitata_merged.gtf
# Query mRNAs : 54384 in 54185 loci (36023 multi-exon transcripts)
# (141 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 54382 in 54185 loci (36023 multi-exon)
# Super-loci w/ reference transcripts: 54185
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 36023
Matching transcripts: 54358
Matching loci: 54183
Missed exons: 0/256029 ( 0.0%)
Novel exons: 0/256028 ( 0.0%)
Missed introns: 0/201643 ( 0.0%)
Novel introns: 0/201643 ( 0.0%)
Missed loci: 0/54185 ( 0.0%)
Novel loci: 0/54185 ( 0.0%)
Total union super-loci across all input datasets: 54185
54384 out of 54384 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)
Make gtf list text file for gene count matrix creation; still in interactive mode.
for filename in R*.gtf; do echo $filename $PWD/$filename; done > listGTF.txt
Download the prepDE.py script from here and put it in the stringtie output folder.
In a terminal not logged into Andromeda:
scp ~/MyProjects/larval_symbiont_TPC/scripts/bioinformatics/prepDE.py ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts/
Load python and compile the gene count matrix
interactive
cd /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/gtf-files
module purge
module load Python/2.7.18-GCCcore-9.3.0
python /data/putnamlab/ashuffmyer/mcap-2023-rnaseq/scripts/prepDE.py -g Mcapitata2023_gene_count_matrix.csv -i listGTF.txt
We do not have any STRG gene names, which means that fixing the GFF file allowed for correct identification of annotated genes/transcripts.
Transfer gene count matrix to desktop.
scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/mcap-2023-rnaseq/gtf-files/Mcapitata2023_gene_count_matrix.csv ~/MyProjects/larval_symbiont_TPC/data/rna_seq