Mcapitata Development ITS2 Analysis Part 2

This post details SymPortal analysis for the ITS2 analysis pipeline modified from Kevin Wong pipeline.

ITS2 analysis in SymPortal

5. Prepare Symportal settings and files

More information on Symportal here.

The first step of analysing an ITS2 dataset is to load it into the SymPortal framework’s database. In this step, SymPortal will perform all quality control filtering of the sequence data and convert the raw sequence data into database objects. There is no need for pre filtering or trimming in this case.

Copy SymPortal into home directory for permission access.

Navigate to home directory and download SymPortal.

ssh -l ashuffmyer ssh3.hac.uri.edu
cd /data/putnamlab/ashuffmyer/

rsync -rl /opt/software/SymPortal/0.3.21-foss-2019b/ ./SymPortal/

This step takes a few minutes to complete. There is now a SymPortal directory.

Modify settings and config files.

cd SymPortal
module load SymPortal
mv settings_blank.py ./settings.py
mv sp_config_blank.py ./sp_config.py
base64 /dev/urandom | head -c50

The last command will display a secret key in the header. Copy this key. Open settings.py.

nano settings.py

Paste secret key in settings.py file in SECRET_KEY = ‘qxwykujthzogaKEyHZSFCjmZlTGrt/ureW3Ppw4wQdhPAuCeEn’ and save it.

Open the config file and edit username and email.

nano sp_config.py

user_name = "undefined" --> user_name = "ashuffmyer"

user_email = "undefined" --> user_email = "ashuffmyer@uri.edu"

Create the SymPortal conda environment

Next, create a conda environment to work in. You can do this in interactive mode, but there is a long install.

interactive
module load Miniconda3/4.9.2
module load SymPortal

conda env create -f $EBROOTSYMPORTAL/symportal_env.yml 

exit

Download often failed, so downloaded in several sessions. It appears that it recognizes what has already been downloaded and doesn’t seem to repeat what was downloaded in a previous session.

Internet connection didn’t allow for the download easily. Trying to download in a bash script overnight.

nano install.sh

#!/bin/bash
#SBATCH --job-name="install"
#SBATCH -t 500:00:00
#SBATCH --mem=250GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ashuffmyer@uri.edu
#SBATCH --exclusive

module load Miniconda3/4.9.2
module load SymPortal

conda env create -f $EBROOTSYMPORTAL/symportal_env.yml 

echo "Success!" $(date) 

sbatch install.sh

Ran overnight and completed, now I will try to run analyses to see if the downloads worked correctly.

I got these error codes for each package:

CondaVerificationError: The package for django located at /home/ashuffmyer/.con$
appears to be corrupted. The path 'site-packages/django/views/generic/base.py'
specified in the package manifest cannot be found.

However, when I tried to initiate download again it said that the environment already existed. Proceeding with analysis.

Create a reference database

Create a reference data base in our SymPortal environment.

cd SymPortal 
interactive

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b

python3 manage.py migrate

##### Populating reference_sequences

python3 populate_db_ref_seqs.py

module unload SymPortal

exit

I had an issue with djano here (error loading django) when entering the python3 manage.py migrate function.

The issue is likely that it was trying to run the SymPortal within the data directory and it should be installed in the home ‘ashuffmyer’ directory.

Trying instead to run as a bash script using the conda paths from Kevin Wong’s scripts that direct SymPortal to the home directory.

cd scripts
nano reference.sh

#!/bin/bash
#SBATCH --job-name="SP_reference"
#SBATCH -t 500:00:00
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ashuffmyer@uri.edu
#SBATCH -D /data/putnamlab/ashuffmyer/SymPortal
#SBATCH --exclusive

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload

export PYTHONPATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/bin:$PATH

python3 manage.py migrate

python3 populate_db_ref_seqs.py

module unload SymPortal/0.3.21-foss-2020b

echo "Mission Complete!" $(date)

This works as a script! The export paths seem to fix the problem.

Test installation and reference database

SymPortal was not built to run on a cluster, so permission access was tricky as you cannot write into the master installation module. Therefore, we need to change the python and SymPortal paths to the ones we created in our own directory. Additionally, some of the dependencies needed by SymPortal are only in the master installation module. To use these dependencies but write into our own SymPortal framework, we must load then unload the master SymPortal module in our script.

cd scripts
nano symportal_setup.sh

#!/bin/bash
#SBATCH --job-name="SP_setup"
#SBATCH -t 500:00:00
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ashuffmyer@uri.edu
#SBATCH -D /data/putnamlab/ashuffmyer/SymPortal
#SBATCH --exclusive

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload

export PYTHONPATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/bin:$PATH

##### Checking installation
python3 -m tests.tests

echo "Mission Complete!" $(date)

Job completed.

6. Load experimental data

Create metadata file

First, create a metadata file from this template from SymPortal.

Export a list of file names that will be manually added to this sheet for file names for forward and reverse files.

cd raw_data
find $PWD -type f ! -name filenames.csv ! -name filepath.csv -printf "%P\n" >filenames.csv

mv filenames.csv ../metadata/

Outside of Andromeda:

scp ashuffmyer@bluewaves.uri.edu:/data/putnamlab/ashuffmyer/AH_MCAP_ITS2/metadata/filenames.csv ~/MyProjects/EarlyLifeHistory_Energetics/Mcap2020/Data/ITS2

Manually create the metadata file then add back to Andromeda as a .csv of the first page of the template.

Platemaps with sample names spreadsheet here.

mkdir metadata

Outside of Andromeda:

scp ~/MyProjects/EarlyLifeHistory_Energetics/Mcap2020/Data/ITS2/AH_MCAP_ITS2_meta.csv ashuffmyer@bluewaves.uri.edu:/data/putnamlab/ashuffmyer/AH_MCAP_ITS2/metadata/ 

Loading the data

Create a script in the scripts directory.

cd ../scripts
nano symportal_load.sh

#!/bin/bash
#SBATCH --job-name="SP_load"
#SBATCH -t 500:00:00
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ashuffmyer@uri.edu
#SBATCH -D /data/putnamlab/ashuffmyer/SymPortal
#SBATCH --exclusive

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload 

export PYTHONPATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/bin:$PATH

main.py --load /data/putnamlab/ashuffmyer/AH_MCAP_ITS2/raw_data \
--name Mcap_Development \
--num_proc $SLURM_CPUS_ON_NODE \
--data_sheet /data/putnamlab/ashuffmyer/AH_MCAP_ITS2/metadata/AH_MCAP_ITS2_meta.csv

Script output indicates 0 sequences “thrown out” due to being too divergent from references. Output file will have information on number of sequences and QC steps.

7. Run analysis in SymPortal

Create a script within the scripts directory.

nano sp_analysis.sh

#!/bin/bash
#SBATCH --job-name="SP_analysis"
#SBATCH -t 500:00:00
#SBATCH --mem=120GB
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ashuffmyer@uri.edu
#SBATCH -D /data/putnamlab/ashuffmyer/SymPortal
#SBATCH --exclusive

module load Miniconda3/4.9.2

eval "$(conda shell.bash hook)"
conda activate symportal_env

module load SymPortal/0.3.21-foss-2020b
module unload 

export PYTHONPATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/lib/python3.7/site-packages:$PYTHONPATH

export PATH=/data/putnamlab/ashuffmyer/SymPortal/:/data/putnamlab/ashuffmyer/SymPortal/bin:$PATH

# Checking dataset number
./main.py --display_data_sets

# Running analysis
./main.py --analyse 5 --name MCAPdevelop_analysis --num_proc $SLURM_CPUS_ON_NODE

# Checking data analysis instances
./main.py --display_analyses

At first this script did not work, so I ran the script only with the #checking dataset number part first. In the slurm output file, this showed that I had 5 datasets. I then replaced the number in the #running analysis step with 5 and the script then worked. Then I ran the whole script again for the analysis.

8. Export data

Export the data to your computer.

Save and transfer the output files.

All data can be found in GitHub here.

scp -r ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/SymPortal/outputs ~/MyProjects/EarlyLifeHistory_Energetics/Mcap2020/Data/ITS2

9. Explore data

View the type_abundance_stacked_bar_plot.png and other .png files to view results. In this study, we saw a mix of C and D across life stages, which is expected for M. capitata. We can now do further exploration in R.

its2

Next run the data in R to view profiles ordered by lifestage. Script located in the ITS2.Rmd file here.

ITS2 clade C = C31 ITS2 clade D = D1, associated species = S. glynnii and S. trenchii

Relative abundance by samples:
its2

Relative abundance by lifestage:
lifestages

Anova analyses indicate significant differences in relative abundance of C and D between lifestages. It appears that C increases during development and D decreases.

Finally, move sequences to personal computer.

scp ashuffmyer@ssh3.hac.uri.edu:/data/putnamlab/ashuffmyer/AH_MCAP_ITS2/raw_data/*.gz ~/MyProjects/EarlyLifeHistory_Energetics/Mcap2020/Data/ITS2/Sequences

Written on December 29, 2021