COTS Project Bioinformatic Analysis

This post details scripts for bioinformatics processing of COTS Project RNAseq data.

Lucy’s RNAseq COTS data bioinformatics

Set up

Made a folder in Unity at the following location:

/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman

Made a raw_data, scripts, and trimmed_data folder.

Sym link data

Sym link the raw data to the new location.

ln -s /project/pi_hputnam_uri_edu/20250107_COTS_LG/*fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data

Run multiQC on raw data

Create a script

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts

nano raw_qc.sh

Make a script

#!/bin/bash
#SBATCH --job-name=fastqc_raw
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-fastqc_raw.out  # %j = job ID
#SBATCH -e slurm-fastqc_raw.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data

#load modules 
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b

#run fastqc on raw data
fastqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/*.fastq.gz -o /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_multiqc/

#generate multiqc report
multiqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_multiqc/ --filename multiqc_report_raw.html 

echo "Initial QC of raw seq data complete." $(date)

Run the job

sbatch raw_qc.sh

Started at 15:00 on 2/20/2025

Completed after approx. 24 hours.

Transfer raw multiQC report to desktop to view.

The file can be viewed on GitHub here.

In window outside Unity (logged onto my own computer):

scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/multiqc_report_raw.html /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/seq_qc

We have some adapter content and lots of variation in GC histograms.

Trim raw sequences

I made a folder called trimmed_data for the trimmed sequence files to go into.

I will use the following settings for trimming in fastp. Fastp documentation can be found here.

detect_adapter_for_pe \
- This enables auto detection of adapters for paired end data
qualified_quality_phred 30 \
- Filters reads based on phred score >=30
unqualified_percent_limit 10 \
- percents of bases are allowed to be unqualified, set here as 10%
length_required 100 \
- Removes reads shorter than 100 bp. We have read lengths of all 150 bp, so this is not a stringent filtration.
cut_right cut_right_window_size 5 cut_right_mean_quality 20
- Jill has used this sliding cut window in her script. I am going to leave it out for our trimming and we can re-evaluate the QC to see if we need to implement cutting.

Write a script.

cd scripts 
nano trimming.sh

Search for fastp module.

module --show_hidden spider fastp

Fastp module is fastp/0.23.2-GCC-11.2.0 on Unity.

#!/bin/bash
#SBATCH --job-name=trimming
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=10:00:00  # Job time limit
#SBATCH -o slurm-trimming.out  # %j = job ID
#SBATCH -e slurm-trimming.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data

#load modules 
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b
module load fastp/0.23.2-GCC-11.2.0

# Make an array of sequences to trim in raw data directory 

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/raw_data/

array1=($(ls *R1_001.fastq.gz))

echo "Read trimming of adapters started." $(date)

# fastp and fastqc loop 
for i in ${array1[@]}; do
    fastp --in1 ${i} \
        --in2 $(echo ${i}|sed s/_R1/_R2/)\
        --out1 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.${i} \
        --out2 //work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.$(echo ${i}|sed s/_R1/_R2/) \
        --detect_adapter_for_pe \
        --qualified_quality_phred 30 \
        --unqualified_percent_limit 10 \
        --length_required 100 

done

echo "Read trimming of adapters completed." $(date)

Run the script.

sbatch trimming.sh

Job started at 14:00 on 22 February 2025.

Job completed at 21:00 on 22 Feb 2025 (~7 hours).

MultiQC on trimmed sequences

Run FastQC and generate a MultiQC report on the trimmed sequences.

Make a script.

mkdir trimmed_multiqc 

cd scripts 
nano trimmed_qc.sh

#!/bin/bash
#SBATCH --job-name=trimmed_qc
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=250G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-trim_qc.out  # %j = job ID
#SBATCH -e slurm-trim_qc.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc

#load modules 
module load uri/main
module load fastqc/0.12.1
module load MultiQC/1.12-foss-2021b

#run fastqc on trimmed data
fastqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/*.fastq.gz -o /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/

#generate multiqc report
multiqc /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/ --filename multiqc_report_trimmed.html 

echo "Initial QC of trimmed seq data complete." $(date)

Run the script.

sbatch trimmed_qc.sh

Job started at 11:15 on 23 Feb 2025, ended at about 24 h.

Copy file to computer.

scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_multiqc/multiqc_report_trimmed.html /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/seq_qc

Adapters are removed, quality scores are high. QC content is weird, but likely due to symbiont and microbial reads seen in Hollie’s tests of alignment previously. We will see what happens at the alignment steps.

The file can be viewed on GitHub here.

Make species specific lists of samples

Make a list of Acropora and Porites files.

nano acr-list.txt

497R_R1_001.fastq.gz
568R_R1_001.fastq.gz
410R_R1_001.fastq.gz
549R_R1_001.fastq.gz
414R_R1_001.fastq.gz
321RA_R1_001.fastq.gz
581R_R1_001.fastq.gz
370R_R1_001.fastq.gz
419R_R1_001.fastq.gz
331RA_R1_001.fastq.gz
336R_R1_001.fastq.gz
512R_R1_001.fastq.gz
380R_R1_001.fastq.gz
571R_R1_001.fastq.gz
586R_R1_001.fastq.gz
468R_R1_001.fastq.gz
497R_R2_001.fastq.gz
568R_R2_001.fastq.gz
410R_R2_001.fastq.gz
549R_R2_001.fastq.gz
414R_R2_001.fastq.gz
321RA_R2_001.fastq.gz
581R_R2_001.fastq.gz
370R_R2_001.fastq.gz
419R_R2_001.fastq.gz
331RA_R2_001.fastq.gz
336R_R2_001.fastq.gz
512R_R2_001.fastq.gz
380R_R2_001.fastq.gz
571R_R2_001.fastq.gz
586R_R2_001.fastq.gz
468R_R2_001.fastq.gz

nano por-list.txt

225R_R1_001.fastq.gz
235R_R1_001.fastq.gz
43R_R1_001.fastq.gz
211R_R1_001.fastq.gz
218R_R1_001.fastq.gz
34R_R1_001.fastq.gz
227R_R1_001.fastq.gz
16R_R1_001.fastq.gz
61R_R1_001.fastq.gz
86R_R1_001.fastq.gz
236R_R1_001.fastq.gz
244R_R1_001.fastq.gz
82R_R1_001.fastq.gz
71R_R1_001.fastq.gz
253R_R1_001.fastq.gz
76R_R1_001.fastq.gz
225R_R2_001.fastq.gz
235R_R2_001.fastq.gz
43R_R2_001.fastq.gz
211R_R2_001.fastq.gz
218R_R2_001.fastq.gz
34R_R2_001.fastq.gz
227R_R2_001.fastq.gz
16R_R2_001.fastq.gz
61R_R2_001.fastq.gz
86R_R2_001.fastq.gz
236R_R2_001.fastq.gz
244R_R2_001.fastq.gz
82R_R2_001.fastq.gz
71R_R2_001.fastq.gz
253R_R2_001.fastq.gz
76R_R2_001.fastq.gz

Make a folder for Acropora and Porites files.

mkdir acr
mkdir por

Sym link Acropora files to the acr folder.

while IFS= read -r filename; do
    trimmed_file="trimmed_data/trim.$filename"
    if [[ -f "$trimmed_file" ]]; then
        ln -s "$(realpath "$trimmed_file")" "acr/trim.$filename"
    else
        echo "Warning: $trimmed_file not found" >&2
    fi
done < acr-list.txt

Sym link Porites files to the por folder.

while IFS= read -r filename; do
    trimmed_file="trimmed_data/trim.$filename"
    if [[ -f "$trimmed_file" ]]; then
        ln -s "$(realpath "$trimmed_file")" "por/trim.$filename"
    else
        echo "Warning: $trimmed_file not found" >&2
    fi
done < por-list.txt

Alignment to genomes

Index genomes

Copy genome fasta files to my directory to prevent permission issues.

mkdir refs 
mkdir por
mkdir acr

cp /work/pi_hputnam_uri_edu/refs/Plutea_genome/plut_final_2.1.fasta /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por

cp /work/pi_hputnam_uri_edu/refs/Plutea_genome/plut2v1.1.genes.gff3 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por

cp /work/pi_hputnam_uri_edu/refs/Ahyacinthus_genome/Ahyacinthus_genome_V1/Ahyacinthus.chrsV1.fasta /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr

cp /work/pi_hputnam_uri_edu/refs/Ahyacinthus_genome/Ahyacinthus_genome_V1/Ahyacinthus.coding.gff3 /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr

Index Porites genome

cd scripts 
nano por-genome.sh

#!/bin/bash
#SBATCH --job-name=STAR_genome_index_Plutea
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-por-genome.out  # %j = job ID
#SBATCH -e slurm-por-genome.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por

#load modules
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir Plutea_index \
--genomeFastaFiles /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/plut_final_2.1.fasta \
--sjdbGTFfile /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/plut2v1.1.genes.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99 \
--genomeSAindexNbases 13

sbatch por-genome.sh

Index Acropora genome

cd scripts 
nano acr-genome.sh

#!/bin/bash
#SBATCH --job-name=STAR_genome_index_Ahya
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-acr-genome.out  # %j = job ID
#SBATCH -e slurm-acr-genome.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr

module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir Ahya_index \
--genomeFastaFiles /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.chrsV1.fasta \
--sjdbGTFfile /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99

sbatch acr-genome.sh

Jobs started at 12:00 24 February 2025. Jobs finished after about 10-30 min.

Acropora

Start a scratch directory to temporarily store the output .bam files. Call the directory cots and use for 30 days. It was created on Feb 25th, so we should be prepared to extend or move files we need to keep by March 25th ish.

ws_allocate cots 30

Info: creating workspace.
/scratch3/workspace/ashuffmyer_uri_edu-cots
remaining extensions  : 5
remaining time in days: 30

Align all ACR files in the relevant folder to generated index files.

cd scripts 
nano acr-align.sh

#!/bin/bash
#SBATCH --job-name=acr-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-acr-align.out  # %j = job ID
#SBATCH -e slurm-acr-align.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/

#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

echo "Starting read alignment." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/*_R1_001.fastq.gz; do

# Define the corresponding R2 file by replacing _R1_ with _R2_
    r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
    
# Define output prefix
    output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/acr/$(basename "${i%_R1_001.fastq.gz}")_"
    
 # Run STAR alignment
    STAR --runMode alignReads \
        --genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahya_index \
        --runThreadN 10 \
        --readFilesCommand zcat \
        --readFilesIn "$i" "$r2_file" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard \
        --genomeSAindexNbases 13 \
        --outFileNamePrefix "$output_prefix"
done

echo "Alignment of Trimmed Seq data complete." $(date)

sbatch acr-align.sh

This script outputs files in the scratch acr directory that I made.

Porites

Align all POR files in the relevant folder using generated genome index.

cd scripts 
nano por-align.sh

#!/bin/bash
#SBATCH --job-name=por-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-por-align.out  # %j = job ID
#SBATCH -e slurm-por-align.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/

#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

echo "Starting read alignment." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/*_R1_001.fastq.gz; do

# Define the corresponding R2 file by replacing _R1_ with _R2_
    r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
    
# Define output prefix
    output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/por/$(basename "${i%_R1_001.fastq.gz}")_"
    
# Run STAR alignment
    STAR --runMode alignReads \
        --genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por/Plutea_index \
        --runThreadN 10 \
        --readFilesCommand zcat \
        --readFilesIn "$i" "$r2_file" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard \
        --outFileNamePrefix "$output_prefix"
done

echo "Alignment of Trimmed Seq data complete." $(date)

sbatch por-align

This script outputs files in the scratch por directory that I made.

Both jobs started on 27 Feb 2025 at 07:00. Jobs finished on 1 March.

Look at mapping results

Obtain mapping percentages after job was completed. Use the samtools flagstat function that does a full pass through the input file to calculate and print statistics to stdout. Provides counts for each of 13 categories based primarily on bit flags in the FLAG field. Each category in the output is broken down into QC pass and QC fail, which is presented as “#PASS + #FAIL” followed by a description of the category.

Porites

Write a script to generate alignment stats.

cd scripts
nano por-stats.sh

#!/bin/bash
#SBATCH --job-name=por-stats
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 24:00:00
#SBATCH -o slurm-por-stats.out  # %j = job ID
#SBATCH -e slurm-por-stats.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/

module load uri/main
module load SAMtools/1.18-GCC-12.3.0

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/por/

# Define output file
output_file="por_alignment_stats.txt"

# Clear the output file if it exists
> "$output_file"

for i in *.bam; do
    echo "${i}" >> "$output_file"
    samtools flagstat "${i}" | grep "mapped (" >> "$output_file"
done

sbatch por-stats.sh

Move file to user directory

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/por/por_alignment_stats.txt /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/

The output looks like this:

trim.16R_Aligned.sortedByCoord.out.bam
3797576 + 0 mapped (2.36% : N/A)
3314780 + 0 primary mapped (2.07% : N/A)
trim.211R_Aligned.sortedByCoord.out.bam
500889612 + 0 mapped (48.39% : N/A)
470004384 + 0 primary mapped (46.80% : N/A)
trim.218R_Aligned.sortedByCoord.out.bam
41218742 + 0 mapped (30.63% : N/A)
39657624 + 0 primary mapped (29.82% : N/A)
trim.225R_Aligned.sortedByCoord.out.bam
866866 + 0 mapped (1.33% : N/A)
753202 + 0 primary mapped (1.15% : N/A)
trim.227R_Aligned.sortedByCoord.out.bam
94051372 + 0 mapped (56.82% : N/A)
87824202 + 0 primary mapped (55.13% : N/A)
trim.235R_Aligned.sortedByCoord.out.bam
40775646 + 0 mapped (26.80% : N/A)
38140810 + 0 primary mapped (25.51% : N/A)
trim.236R_Aligned.sortedByCoord.out.bam
57274150 + 0 mapped (34.73% : N/A)
52577952 + 0 primary mapped (32.82% : N/A)
trim.244R_Aligned.sortedByCoord.out.bam
42584786 + 0 mapped (29.85% : N/A)
40319002 + 0 primary mapped (28.72% : N/A)
trim.253R_Aligned.sortedByCoord.out.bam
58346292 + 0 mapped (38.89% : N/A)
54584348 + 0 primary mapped (37.31% : N/A)
trim.34R_Aligned.sortedByCoord.out.bam
510168 + 0 mapped (0.68% : N/A)
456620 + 0 primary mapped (0.61% : N/A)
trim.43R_Aligned.sortedByCoord.out.bam
2473270 + 0 mapped (1.97% : N/A)
2222166 + 0 primary mapped (1.77% : N/A)
trim.61R_Aligned.sortedByCoord.out.bam
48904270 + 0 mapped (37.72% : N/A)
45761608 + 0 primary mapped (36.17% : N/A)
trim.71R_Aligned.sortedByCoord.out.bam
39363360 + 0 mapped (28.30% : N/A)
36603192 + 0 primary mapped (26.84% : N/A)
trim.76R_Aligned.sortedByCoord.out.bam
49636416 + 0 mapped (32.19% : N/A)
45848520 + 0 primary mapped (30.48% : N/A)
trim.82R_Aligned.sortedByCoord.out.bam
39928236 + 0 mapped (29.10% : N/A)
36333724 + 0 primary mapped (27.19% : N/A)
trim.86R_Aligned.sortedByCoord.out.bam
40120754 + 0 mapped (31.95% : N/A)
38270032 + 0 primary mapped (30.93% : N/A)

There is a wide range in mapping. There is a group of samples (n=5) with mapping of 0.5-3%. There are 12 samples mapping at 25% or higher. We need to see if the samples with low mapping are from the “eaten” treatment. We will also compare mapping to the P. evermanni genome next.

Acropora

cd scripts
nano acr-stats.sh

#!/bin/bash
#SBATCH --job-name=acr-stats
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 24:00:00
#SBATCH -o slurm-acr-stats.out  # %j = job ID
#SBATCH -e slurm-acr-stats.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/

module load uri/main
module load SAMtools/1.18-GCC-12.3.0

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/acr/

# Define output file
output_file="acr_alignment_stats.txt"

# Clear the output file if it exists
> "$output_file"

for i in *.bam; do
    echo "${i}" >> "$output_file"
    samtools flagstat "${i}" | grep "mapped (" >> "$output_file"
done

sbatch acr-stats.sh

Move file to user directory

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/acr/acr_alignment_stats.txt /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/acr/

The output looks like this:

trim.321RA_Aligned.sortedByCoord.out.bam
101145264 + 0 mapped (75.26% : N/A)
91192744 + 0 primary mapped (73.29% : N/A)
trim.331RA_Aligned.sortedByCoord.out.bam
99763802 + 0 mapped (75.29% : N/A)
88927416 + 0 primary mapped (73.09% : N/A)
trim.336R_Aligned.sortedByCoord.out.bam
110716976 + 0 mapped (61.04% : N/A)
100464150 + 0 primary mapped (58.71% : N/A)
trim.370R_Aligned.sortedByCoord.out.bam
34239394 + 0 mapped (24.31% : N/A)
30624808 + 0 primary mapped (22.32% : N/A)
trim.380R_Aligned.sortedByCoord.out.bam
99627838 + 0 mapped (74.16% : N/A)
89986620 + 0 primary mapped (72.16% : N/A)
trim.410R_Aligned.sortedByCoord.out.bam
80526182 + 0 mapped (46.17% : N/A)
73098402 + 0 primary mapped (43.77% : N/A)
trim.414R_Aligned.sortedByCoord.out.bam
87333162 + 0 mapped (38.67% : N/A)
78162862 + 0 primary mapped (36.07% : N/A)
trim.419R_Aligned.sortedByCoord.out.bam
106151016 + 0 mapped (68.00% : N/A)
94939130 + 0 primary mapped (65.52% : N/A)
trim.468R_Aligned.sortedByCoord.out.bam
100108574 + 0 mapped (67.37% : N/A)
90603050 + 0 primary mapped (65.14% : N/A)
trim.497R_Aligned.sortedByCoord.out.bam
3090986 + 0 mapped (2.22% : N/A)
2760422 + 0 primary mapped (1.98% : N/A)
trim.512R_Aligned.sortedByCoord.out.bam
134343958 + 0 mapped (74.83% : N/A)
121579012 + 0 primary mapped (72.91% : N/A)
trim.549R_Aligned.sortedByCoord.out.bam
77332112 + 0 mapped (66.06% : N/A)
69416060 + 0 primary mapped (63.60% : N/A)
trim.568R_Aligned.sortedByCoord.out.bam
74142102 + 0 mapped (53.01% : N/A)
66507248 + 0 primary mapped (50.29% : N/A)
trim.571R_Aligned.sortedByCoord.out.bam
76962776 + 0 mapped (47.52% : N/A)
69663792 + 0 primary mapped (45.04% : N/A)
trim.581R_Aligned.sortedByCoord.out.bam
23750278 + 0 mapped (19.23% : N/A)
21582796 + 0 primary mapped (17.78% : N/A)
trim.586R_Aligned.sortedByCoord.out.bam
87532254 + 0 mapped (68.31% : N/A)
79263634 + 0 primary mapped (66.12% : N/A)

There is a pretty wide range in mapping for Acropora as well, but the lowest value is 17%.

Run alignment of POR files to the P. lobata/lutea genome

Index the genome

Download the files using the script written by Lucy.

cd refs
mkdir por-ever 
cd por-ever

wget https://www.genoscope.cns.fr/corals/data/Porites_evermanni_v1.annot.gff

wget https://www.genoscope.cns.fr/corals/data/Porites_evermanni_v1.fa

Now available as Porites_evermanni_v1.annot.fa and Porites_evermanni_v1.annot.gff in the refs/por-ever folder.

cd scripts 
nano por-ever-genome.sh

module --show_hidden spider gffread

#!/bin/bash
#SBATCH --job-name=Pevermanni_index
#SBATCH --nodes=1 --cpus-per-task=10
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-por-ever-genome.out  # %j = job ID
#SBATCH -e slurm-por-ever-genome.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever

#load modules
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0
module load gffread/0.12.7

#convert gff to gtf 
gffread Porites_evermanni_v1.annot.gff -T -o Porites_evermanni_v1.annot.gtf

STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir Pevermanni_index \
--genomeFastaFiles /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.fa \
--sjdbGTFfile /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.annot.gtf \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 99 \
--genomeSAindexNbases 13

sbatch por-ever-genome.sh

Submitted job on 3 March at 08:00. Finished in about 5 min.

Run alignment to the genome to compare alignment rates with P. evermanni genome

cd cots-gorman 
mkdir por-ever 

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/
mkdir por-ever 

Align all POR files in the relevant folder using generated P. evermanni genome index.

cd scripts 
nano por-ever-align.sh

#!/bin/bash
#SBATCH --job-name=por-ever-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-por-ever-align.out  # %j = job ID
#SBATCH -e slurm-por-ever-align.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por-ever/

#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

echo "Starting read alignment." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por/*_R1_001.fastq.gz; do

# Define the corresponding R2 file by replacing _R1_ with _R2_
    r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
    
# Define output prefix
    output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/$(basename "${i%_R1_001.fastq.gz}")_"
    
# Run STAR alignment
    STAR --runMode alignReads \
        --genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Pevermanni_index \
        --runThreadN 10 \
        --readFilesCommand zcat \
        --readFilesIn "$i" "$r2_file" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard \
        --outFileNamePrefix "$output_prefix"
done

echo "Alignment of Trimmed Seq data complete." $(date)

sbatch por-ever-align.sh

This script outputs files in the scratch por-ever directory that I made.

Submitted job at 08:00 on March 3, finished on March 4.

Write a script to generate alignment stats.

cd scripts
nano por-ever-stats.sh

#!/bin/bash
#SBATCH --job-name=por-ever-stats
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 24:00:00
#SBATCH -o slurm-por-ever-stats.out  # %j = job ID
#SBATCH -e slurm-por-ever-stats.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por-ever/

module load uri/main
module load SAMtools/1.18-GCC-12.3.0

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/

# Define output file
output_file="por_ever_alignment_stats.txt"

# Clear the output file if it exists
> "$output_file"

for i in *.bam; do
    echo "${i}" >> "$output_file"
    samtools flagstat "${i}" | grep "mapped (" >> "$output_file"
done

sbatch por-ever-stats.sh

Submitted job on March 4 at 19:00.

Move file to user directory

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/por_ever_alignment_stats.txt /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por-ever/

The output looks like this:

trim.16R_Aligned.sortedByCoord.out.bam
9932412 + 0 mapped (6.14% : N/A)
8301790 + 0 primary mapped (5.18% : N/A)
trim.211R_Aligned.sortedByCoord.out.bam
trim.218R_Aligned.sortedByCoord.out.bam
183856720 + 0 mapped (96.73% : N/A)
126776572 + 0 primary mapped (95.33% : N/A)
trim.225R_Aligned.sortedByCoord.out.bam
2907570 + 0 mapped (4.43% : N/A)
2534398 + 0 primary mapped (3.88% : N/A)
trim.227R_Aligned.sortedByCoord.out.bam
171417626 + 0 mapped (84.19% : N/A)
127096866 + 0 primary mapped (79.79% : N/A)
trim.235R_Aligned.sortedByCoord.out.bam
211878608 + 0 mapped (97.26% : N/A)
143543886 + 0 primary mapped (96.01% : N/A)
trim.236R_Aligned.sortedByCoord.out.bam
212095574 + 0 mapped (96.43% : N/A)
152349478 + 0 primary mapped (95.10% : N/A)
trim.244R_Aligned.sortedByCoord.out.bam
188686986 + 0 mapped (94.76% : N/A)
129969490 + 0 primary mapped (92.57% : N/A)
trim.253R_Aligned.sortedByCoord.out.bam
195777140 + 0 mapped (97.66% : N/A)
141597122 + 0 primary mapped (96.80% : N/A)
trim.34R_Aligned.sortedByCoord.out.bam
1022532 + 0 mapped (1.37% : N/A)
783544 + 0 primary mapped (1.05% : N/A)
trim.43R_Aligned.sortedByCoord.out.bam
12644158 + 0 mapped (9.78% : N/A)
8762710 + 0 primary mapped (6.99% : N/A)
trim.61R_Aligned.sortedByCoord.out.bam
156888590 + 0 mapped (91.65% : N/A)
112220878 + 0 primary mapped (88.70% : N/A)
trim.71R_Aligned.sortedByCoord.out.bam
190552052 + 0 mapped (95.40% : N/A)
127166612 + 0 primary mapped (93.26% : N/A)
trim.76R_Aligned.sortedByCoord.out.bam
207617046 + 0 mapped (95.93% : N/A)
141609686 + 0 primary mapped (94.14% : N/A)
trim.82R_Aligned.sortedByCoord.out.bam
198766054 + 0 mapped (98.85% : N/A)
131310576 + 0 primary mapped (98.26% : N/A)
trim.86R_Aligned.sortedByCoord.out.bam
172716586 + 0 mapped (97.33% : N/A)
119000396 + 0 primary mapped (96.17% : N/A)

Next, we need to compare alignment to P. evermanni vs P. lutea for each Porites sample. Sample 211R did not map, so we need to subsample this file and then re run.

Subsample 211R

We need to make a script to:

Subset sequences from trimmed file for sample 211
Re run alignment for 211 trimmed files
Generate stats for all samples again

Write a script to subsample 211 trimmed data file

cd scripts

nano seqtk_211R.sh

Write the script

#!/bin/bash
#SBATCH --job-name=seqtk_211R
#SBATCH --nodes=1 --cpus-per-task=16
#SBATCH --mem=300G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH --time=24:00:00  # Job time limit
#SBATCH -o slurm-seqtk_211R.out  # %j = job ID
#SBATCH -e slurm-seqtk_211R.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data

#load modules 

module load uri/main
module load seqtk/1.4-GCC-12.3.0

#move to scratch directory for temporary work with large files 

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/

cp /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.211R_R1_001.fastq.gz /scratch3/workspace/ashuffmyer_uri_edu-cots/
cp /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim.211R_R2_001.fastq.gz /scratch3/workspace/ashuffmyer_uri_edu-cots/

echo "Files copied to scratch directory" $(date)

#unzip files
gunzip trim.211R_R1_001.fastq.gz
gunzip trim.211R_R2_001.fastq.gz 

echo "Unzipping complete" $(date)

echo "Starting subsetting" $(date)

## Subsample 90,000,000 paired reads from the 211R sample
seqtk sample -s100 trim.211R_R1_001.fastq 90000000 > trim_sub.211R_R1_001.fastq
seqtk sample -s100 trim.211R_R2_001.fastq 90000000 > trim_sub.211R_R2_001.fastq

echo "Finished subsetting" $(date)

echo "Starting file zipping" $(date)

gzip trim_sub.211R_R1_001.fastq 
gzip trim_sub.211R_R2_001.fastq 

echo "Finished file zipping" $(date)

#copy files back into trimmed sequences directory 
cp /scratch3/workspace/ashuffmyer_uri_edu-cots/trim_sub.211R_R1_001.fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/trim_sub.211R_R2_001.fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/

echo "Files copied. Complete." $(date)

Run script.

sbatch seqtk_211R.sh

Job started on 10 March at 07:00. I had to resubmit this job several times - I ran into errors with file copying and symlinking.

Copy files back to working directory (this didn’t work in the script).

#copy files back into trimmed sequences directory 
cp /scratch3/workspace/ashuffmyer_uri_edu-cots/trim_sub.211R_R1_001.fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/trim_sub.211R_R2_001.fastq.gz /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/

Check file size of new files.

cd trimmed_data

ls -lh 

File size is now 4.5-4.7GB. Zipped trimmed files were originally 22 GB (R1) and 24 GB (R2) before trimming. This size matches the other files.

Check number of reads. Do this in the scratch directory.

cd /scratch3/workspace/ashuffmyer_uri_edu-cots/

gunzip -c trim_sub.211R_R1_001.fastq.gz | wc -l | awk '{print $1/4}'

This file has 90,000,000 reads, which is what we expected.

Write a script to align 211 subsetting trimmed data to P. evermanni genome

The files for subset trimmed data now exist in the trimmed_data directory as:

trim_sub.211R_R1_001.fastq.gz
trim_sub.211R_R2_001.fastq.gz

Write a script for alignment to the P. evermanni genome.

cd scripts

nano align_211R.sh

Write the script.

#!/bin/bash
#SBATCH --job-name=211R-por-ever-align
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-211R-por-ever-align.out  # %j = job ID
#SBATCH -e slurm-211R-por-ever-align.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por-ever/

#load modules
echo "Loading programs" $(date)
module load uri/main
module load STAR/2.7.11b-GCC-12.3.0

echo "Starting read alignment for 211R." $(date)
#loop through all files to align them to genome
for i in /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/trimmed_data/trim_sub*_R1_001.fastq.gz; do

# Define the corresponding R2 file by replacing _R1_ with _R2_
    r2_file="${i/_R1_001.fastq.gz/_R2_001.fastq.gz}"
    
# Define output prefix
    output_prefix="/scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/$(basename "${i%_R1_001.fastq.gz}")_"
    
# Run STAR alignment
    STAR --runMode alignReads \
        --genomeDir /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Pevermanni_index \
        --runThreadN 10 \
        --readFilesCommand zcat \
        --readFilesIn "$i" "$r2_file" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard \
        --outFileNamePrefix "$output_prefix"
done

echo "Alignment of 211R Trimmed Seq data complete." $(date)

sbatch align_211R.sh

Job started on 11 March at 8:00, finished after 1 h.

.bam files are now in the scratch directory at /scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/.

Then generate alignment stats for all files again.

cd scripts 

sbatch por-ever-stats.sh

Move file to user directory

cp /scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/por_ever_alignment_stats.txt /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/por-ever/

The output looks like this:

trim.16R_Aligned.sortedByCoord.out.bam
9932412 + 0 mapped (6.14% : N/A)
8301790 + 0 primary mapped (5.18% : N/A)
trim.218R_Aligned.sortedByCoord.out.bam
183856720 + 0 mapped (96.73% : N/A)
126776572 + 0 primary mapped (95.33% : N/A)
trim.225R_Aligned.sortedByCoord.out.bam
2907570 + 0 mapped (4.43% : N/A)
2534398 + 0 primary mapped (3.88% : N/A)
trim.227R_Aligned.sortedByCoord.out.bam
171417626 + 0 mapped (84.19% : N/A)
127096866 + 0 primary mapped (79.79% : N/A)
trim.235R_Aligned.sortedByCoord.out.bam
211878608 + 0 mapped (97.26% : N/A)
143543886 + 0 primary mapped (96.01% : N/A)
trim.236R_Aligned.sortedByCoord.out.bam
212095574 + 0 mapped (96.43% : N/A)
152349478 + 0 primary mapped (95.10% : N/A)
trim.244R_Aligned.sortedByCoord.out.bam
188686986 + 0 mapped (94.76% : N/A)
129969490 + 0 primary mapped (92.57% : N/A)
trim.253R_Aligned.sortedByCoord.out.bam
195777140 + 0 mapped (97.66% : N/A)
141597122 + 0 primary mapped (96.80% : N/A)
trim.34R_Aligned.sortedByCoord.out.bam
1022532 + 0 mapped (1.37% : N/A)
783544 + 0 primary mapped (1.05% : N/A)
trim.43R_Aligned.sortedByCoord.out.bam
12644158 + 0 mapped (9.78% : N/A)
8762710 + 0 primary mapped (6.99% : N/A)
trim.61R_Aligned.sortedByCoord.out.bam
156888590 + 0 mapped (91.65% : N/A)
112220878 + 0 primary mapped (88.70% : N/A)
trim.71R_Aligned.sortedByCoord.out.bam
190552052 + 0 mapped (95.40% : N/A)
127166612 + 0 primary mapped (93.26% : N/A)
trim.76R_Aligned.sortedByCoord.out.bam
207617046 + 0 mapped (95.93% : N/A)
141609686 + 0 primary mapped (94.14% : N/A)
trim.82R_Aligned.sortedByCoord.out.bam
198766054 + 0 mapped (98.85% : N/A)
131310576 + 0 primary mapped (98.26% : N/A)
trim.86R_Aligned.sortedByCoord.out.bam
172716586 + 0 mapped (97.33% : N/A)
119000396 + 0 primary mapped (96.17% : N/A)
trim_sub.211R_Aligned.sortedByCoord.out.bam
219954702 + 0 mapped (89.86% : N/A)
155192096 + 0 primary mapped (86.22% : N/A)

The 211R file that was subset now has mapping information with mapping at ~89%.

We now need to proceed with assembly and generating a gene count matrix.

Assemble and quantify read counts with stringtie

First, sym link .bam files to a new bam-files directory. I am using sym links throughout this pipeline so that the original scripts will overwrite existing files if something needs to be re run. If desired, you can keep all files in the same directory. I like to keep things in discrete directories to keep it organized.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/

mkdir bam-files-por

cd bam-files-por

ln -s /scratch3/workspace/ashuffmyer_uri_edu-cots/por-ever/*.bam /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/

mkdir bam-files-acr

cd bam-files-acr

ln -s /scratch3/workspace/ashuffmyer_uri_edu-cots/acr/*.bam /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr

Files are now sym linked.

Porites assembly

Make a script.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts

nano assembly-por.sh

Rename the subsetted 211R file back to the original nomenclature.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por
mv trim_sub.211R_Aligned.sortedByCoord.out.bam trim.211R_Aligned.sortedByCoord.out.bam

We will use stringtie for assembly with the following options:

-p 8 for using multiple processors
-e this option directs StringTie to operate in expression estimation mode; this limits the processing of read alignments to estimating the coverage of the transcripts given with the -G option (hence this option requires -G).
-B This switch enables the output of Ballgown input table files (.ctab) containing coverage data for the reference transcripts given with the -G option. (See the Ballgown documentation for a description of these files.) With this option StringTie can be used as a direct replacement of the tablemaker program included with the Ballgown distribution. If the option -o is given as a full path to the output transcript file, StringTie will write the .ctab files in the same directory as the output GTF.
-G Use a reference annotation file (in GTF or GFF3 format) to guide the assembly process. The output will include expressed reference transcripts as well as any novel transcripts that are assembled. This option is required by options -B, -b, -e, -C.
-A Gene abundances will be reported (tab delimited format) in the output file with the given name.
-o output file name for the merged transcripts GTF (default: stdout)

#!/bin/bash
#SBATCH --job-name=por-assembly
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-por-assembly.out  # %j = job ID
#SBATCH -e slurm-por-assembly.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/

module load uri/main
module load StringTie/2.2.1-GCC-11.2.0

echo "Assembling transcripts using stringtie" $(date)

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/

array=($(ls *.bam)) #Make an array of sequences to assemble

for i in ${array[@]}; do 
        sample_name=`echo $i| awk -F [_] '{print $1}'`
	stringtie -p 8 -e -B -G /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.annot.gtf -A ${sample_name}.gene_abund.tab -o ${sample_name}.gtf ${i}
        echo "StringTie assembly for seq file ${i}" $(date)
done

echo "Assembly for each Porites sample complete" $(date)

sbatch assembly-por.sh

Acropora assembly

Make a script.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts

nano assembly-acr.sh

We will use stringtie for assembly with the following options:

-p 8 for using multiple processors
-e this option directs StringTie to operate in expression estimation mode; this limits the processing of read alignments to estimating the coverage of the transcripts given with the -G option (hence this option requires -G).
-B This switch enables the output of Ballgown input table files (.ctab) containing coverage data for the reference transcripts given with the -G option. (See the Ballgown documentation for a description of these files.) With this option StringTie can be used as a direct replacement of the tablemaker program included with the Ballgown distribution. If the option -o is given as a full path to the output transcript file, StringTie will write the .ctab files in the same directory as the output GTF.
-G Use a reference annotation file (in GTF or GFF3 format) to guide the assembly process. The output will include expressed reference transcripts as well as any novel transcripts that are assembled. This option is required by options -B, -b, -e, -C.
-A Gene abundances will be reported (tab delimited format) in the output file with the given name.
-o output file name for the merged transcripts GTF (default: stdout)

#!/bin/bash
#SBATCH --job-name=acr-assembly
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=200G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 7-24:00:00
#SBATCH -q long #job lasting over 2 days
#SBATCH -o slurm-acr-assembly.out  # %j = job ID
#SBATCH -e slurm-acr-assembly.err  # %j = job ID
#SBATCH -D /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/

module load uri/main
module load StringTie/2.2.1-GCC-11.2.0

echo "Assembling transcripts using stringtie" $(date)

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/

array=($(ls *.bam)) #Make an array of sequences to assemble

for i in ${array[@]}; do 
        sample_name=`echo $i| awk -F [_] '{print $1}'`
	stringtie -p 8 -e -B -G /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 -A ${sample_name}.gene_abund.tab -o ${sample_name}.gtf ${i}
        echo "StringTie assembly for seq file ${i}" $(date)
done

echo "Assembly for each Acropora sample complete" $(date)

sbatch assembly-acr.sh

To check disk space/folder storage used:

#for example 
du -hs /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman

Both the acr and por jobs were started at 09:45 on 12 March and finished by 14:00.

Prepare .gtf files and generate gene count matrix

Make a list of .gtf files in each Acropora and Porites directory.

cd bam-files-acr
ls *R.gtf > acr_gtf_list.txt

cd bam-files-por
ls *R.gtf > por_gtf_list.txt

Write a script to merge the .gtf’s together and evaluate with GFF compare.

cd scripts
nano gtf-merge.sh

#!/bin/bash
#SBATCH --job-name=gtf_merge
#SBATCH --nodes=1 --cpus-per-task=8
#SBATCH --mem=150G  # Requested Memory
#SBATCH -p gpu  # Partition
#SBATCH -G 1  # Number of GPUs
#SBATCH -t 24:00:00
#SBATCH -o slurm-gtf-merge.out  # %j = job ID
#SBATCH -e slurm-gtf-merge.err  # %j = job ID

module load uri/main
module load StringTie/2.2.1-GCC-11.2.0
module load GffCompare/0.12.6-GCC-11.2.0

echo "Merging ACR gtf files" $(date)

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/

stringtie --merge -e -p 8 -G /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 -o Acropora_merged.gtf acr_gtf_list.txt 

echo "ACR complete" $(date)

echo "Merging POR gtf files" $(date)

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/

stringtie --merge -e -p 8 -G /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.annot.gtf -o Porites_merged.gtf por_gtf_list.txt 

echo "POR complete" $(date)

echo "Starting GFF compare for Acropora" $(date) 

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/

gffcompare -r /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 Acropora_merged.gtf 

echo "ACR complete" $(date)

echo "Starting GFF compare for Porites" $(date) 

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/

gffcompare -r /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.annot.gtf Porites_merged.gtf 

echo "POR complete" $(date)

sbatch gtf-merge.sh

Job started at 14:00 on 12 March, ran after about 15 min.

The output for Acropora looks like this (gffcmp.stats file in the bam-files-acr folder):

27110 reference transcripts loaded.
  27110 query transfrags loaded.
  
less gffcmp.stats

# gffcompare v0.12.6 | Command line was:
#gffcompare -r /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/acr/Ahyacinthus.coding.gff3 Acropora_merged.gtf
#

#= Summary for dataset: Acropora_merged.gtf 
#     Query mRNAs :   27110 in   27110 loci  (24082 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :   27110 in   27110 loci  (24082 multi-exon)
# Super-loci w/ reference transcripts:    27110
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:   24082
       Matching transcripts:   27110
              Matching loci:   27110

          Missed exons:       0/173641  (  0.0%)
           Novel exons:       0/173641  (  0.0%)
        Missed introns:       0/146531  (  0.0%)
         Novel introns:       0/146531  (  0.0%)
           Missed loci:       0/27110   (  0.0%)
            Novel loci:       0/27110   (  0.0%)

 Total union super-loci across all input datasets: 27110 
27110 out of 27110 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

The output for Porites looks like this (gffcmp.stats file in the bam-files-por folder):

40389 reference transcripts loaded.
40389 query transfrags loaded

gffcmp.stats

#gffcompare -r /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/refs/por-ever/Porites_evermanni_v1.annot.gtf Porites_merged.gtf
#

#= Summary for dataset: Porites_merged.gtf 
#     Query mRNAs :   40389 in   40240 loci  (32889 multi-exon transcripts)
#            (147 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :   40389 in   40240 loci  (32889 multi-exon)
# Super-loci w/ reference transcripts:    40240
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |   100.0    |
        Exon level:   100.0     |   100.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:   32889
       Matching transcripts:   40387
              Matching loci:   40240

          Missed exons:       0/235837  (  0.0%)
           Novel exons:       0/235836  (  0.0%)
        Missed introns:       0/195463  (  0.0%)
         Novel introns:       0/195463  (  0.0%)
           Missed loci:       0/40240   (  0.0%)
            Novel loci:       0/40240   (  0.0%)

 Total union super-loci across all input datasets: 40240 
40389 out of 40389 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

Everything looks good.

Make gtf list text file for gene count matrix creation for ACR and por files.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/

for filename in *R*.gtf; do echo $filename $PWD/$filename; done > listGTF_acr.txt

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/

for filename in *R*.gtf; do echo $filename $PWD/$filename; done > listGTF_por.txt

Download the prepDE.py script from here and put it in the scripts folder

cd scripts 

wget https://raw.githubusercontent.com/gpertea/stringtie/refs/heads/master/prepDE.py3

Start an interactive session.

salloc -c 1

Run the prepDE script to obtain gene count matrices. Note that you must load Python 2.7 for this script to work.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr

module load uri/main
module load Python/2.7.18-GCCcore-9.3.0

python /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts/prepDE.py -g /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/Acropora_gene_count_matrix.csv -i listGTF_acr.txt

less Acropora_gene_count_matrix.csv

The file looks good and I do not see any obvious presence of STRG gene names - appears to have correctly identified gene IDs from the reference.

cd /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por

python /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/scripts/prepDE.py -g /work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/Porites_gene_count_matrix.csv -i listGTF_por.txt

less Porites_gene_count_matrix.csv

The file looks good and I do not see any obvious presence of STRG gene names - appears to have correctly identified gene IDs from the reference.

Transfer gene count matrix to desktop for upload to GitHub.

scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-acr/Acropora_gene_count_matrix.csv /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/gene-count-matrix/

scp ashuffmyer_uri_edu@unity.rc.umass.edu:/work/pi_hputnam_uri_edu/ashuffmyer/cots-gorman/bam-files-por/Porites_gene_count_matrix.csv /Users/ashuffmyer/MyProjects/CoTS-RNAseq/output/gene-count-matrix/

Done!

Written on March 13, 2025