2023 UW Software Carpentry Workshop in Terminal Git and R
Workshop: Software Carpentry in R - Terminal, GitHub, and R
Workshop lessons for Terminal, GitHub, and R from UW Data Science Software Carpentry workshop.
Day 1: Terminal
Links for today and set up
Live terminal review
Download, unzip, and put on desktop
Today we are learning the basics of the terminal and Git; we will get into R specifics on Wednesday and Thursday. I will include the major points in this notebook post. Most of this is review information.
Note: On Mac, you may be using the `zsh` shell rather than `bash`. To start bash, enter `bash` in Terminal.
Using Terminal
We can use Terminal to do work on our computer that can be automated, including organizing or renaming files, running scripts, and searching/finding things.
Navigating your computer:
- Commands are case sensitive, and you can't use your mouse to move around the text
- You can recall previous commands and move through your command history with the up and down arrows
- Run a command by pressing `enter`
- Use `ls` to list directory contents; `ls Desktop` lists the Desktop's contents without changing the directory
- What if you have spaces in your directory name? You need to escape them with backslashes: `ls Desktop/Folder\ With\ Spaces`
- List all files in subfolders recursively with `ls -R Desktop`
- List help for a command with `man command`; press `q` to exit this window. Use these help pages to help with commands.
- Use `ls -a` to list all/hidden files
- Use `ls -l -h` to print in long format with human-readable file sizes (e.g., print in MB or GB)
- Use tab to auto-complete file/folder names. Tip: press tab twice to have it show you all the possible options for autocomplete!
- `pwd` prints the present working directory. To move directories, use `cd Folder`.
- Use `clear` to clear the terminal
- A `~` is shorthand for your home directory, e.g., `cd ~` moves you to your home directory. On Mac, you can also just do `cd` to get back to home.
- `cd ..` moves back one directory
- `cd -` moves you to the directory you were in just before the one you are in now, which can help you bounce back and forth between two directories
- Absolute path = starting from the root directory of your computer (e.g., /Users/ashuffmyer/Desktop/shell-lesson-data). Relative path = starting from the current working directory (e.g., Desktop/shell-lesson-data). Absolute paths always start with a slash.
- A dot refers to the current directory, so `cd .` doesn't move you anywhere new. For example, `cd north-pacific-gyre` and `cd ./north-pacific-gyre` are the same.
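The navigation commands above can be strung together in a quick practice session (the `demo` folder here is just a scratch directory made up for the example):

```shell
mkdir -p demo/data        # create a nested scratch directory to explore
cd demo                   # relative path: move into demo
pwd                       # prints the absolute path of the current directory
ls                        # lists the contents: data
cd ..                     # move back up one directory
cd demo/data              # relative paths can span several levels at once
cd -                      # bounce back to the previous directory
```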
Working with files
- `cat file.txt` displays the contents of a file, with only as much as will fit on the screen
- `mkdir` makes a new folder. You can create multiple directories at the same time: `mkdir project data thesis` creates three folders (project, data, and thesis).
- To create a new file and edit text, use `nano file`
- In nano, you can move through the file like in a text editor, but it is limited
- To exit, use control+x and then save the file
- You can also use `less` to scroll through a file in the terminal, with `q` to exit
- To move and rename a file, you can use `mv file.txt otherfolder/newname.txt`
- To copy a file, use `cp file.txt otherfolder/newname.txt`. This will keep the original file in the original location. You can do this with multiple files in the same command, separated by spaces, with the last argument specifying the destination.
- To copy a folder, you have to use an extra option: `cp -R folder newlocation`
- Use `rm file.txt` to delete a file and `rm -R folder` to remove a folder
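A short session tying the file commands together (all file and folder names here are made up for the example):

```shell
mkdir project data thesis                 # create three folders at once
echo "draft text" > project/notes.txt     # make a small file to practice on
mv project/notes.txt project/draft.txt    # rename (move) the file in place
cp project/draft.txt data/draft-copy.txt  # copy it; the original stays put
cp -R project project-backup              # -R is required to copy a folder
rm data/draft-copy.txt                    # delete a single file
rm -R project-backup                      # delete a folder and its contents
```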
Batching multiple actions together: pipes and loops
- We can use piping and filtering to take the output of one command right into another.
- To get the length/number of lines of a file, use `wc file`. Use `man wc` for more information. You can list multiple files separated by spaces in this command.
- To select files without listing them manually, use a "wildcard". E.g., `wc -l *.pdb` selects all files matching a pattern, and `wc -l *th*.pdb` selects any file with "th" in the name and a .pdb file extension.
- You can use loops to repeat functions across many files. More info in the lesson plans.
- If you want to output information into a file, use "output redirection" to take the output of a command and save it into a text file. E.g., `wc -l *.pdb > lengths.txt` puts the line counts for the .pdb files into a new .txt file.
- Use `sort -n file.txt` to sort a file by numeric value. There are other flags you can use for dates, reverse order, etc. Add `> new.txt` to save the sorted output to a new file.
- Use `head -n 1 file.txt` to print the first line of a file, and `tail` to view the end of a file in the same way.
- In more complex cases, you would be creating intermediate files that feed into a final output. "Piping" avoids this: it doesn't save intermediate files. The pipe character is `|`. For example, `wc -l *.txt | sort -n | head -n 1` counts the number of lines in each file, sorts the counts numerically, and displays the first (smallest) line.
- Obtain unique lines with `uniq -c file.txt`. Before this you need to `sort`, because `uniq` only detects duplicates on adjacent lines.
- `echo` prints the specified output to the terminal. You can use `>` after echo to put the output into a file. To add something new to a file without overwriting it, use `echo new text >> file.txt`; otherwise, whatever you echo into a file will overwrite its contents.
- The `cut` command can separate parts of a file. For example, `cut -d , -f 2 text.csv` will split on a comma delimiter and select only the second column. This can be piped to a new file if you want. `cut -d , -f 2 animals.csv | sort | uniq` will cut by commas, keep the second column, sort, and keep only unique lines. `cut -d , -f 2 text.csv | sort | uniq -c | wc -l` will cut by commas, keep the second column of text.csv, sort and find unique entries, then print the number of unique entries in the terminal.
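The `cut`/`sort`/`uniq` pipeline can be tried on a tiny CSV you build yourself (this `animals.csv` is fabricated for the example, not the workshop file):

```shell
# build a small two-column CSV: date,species
printf '2011-04-22,gazelle\n2011-04-23,rabbit\n2011-04-23,rabbit\n' > animals.csv
cut -d , -f 2 animals.csv                            # second column only
cut -d , -f 2 animals.csv | sort | uniq              # unique species
cut -d , -f 2 animals.csv | sort | uniq -c | wc -l   # number of unique species
```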
Example from workshop data (download link above):
cd Desktop/shell-lesson-data/exercise-data/proteins
cat cubane.pdb
#shows output of file
wc cubane.pdb
#shows number of lines and words in file
wc -l cubane.pdb
#shows only number of lines
wc -l *.pdb
#shows line numbers for all files with .pdb
wc -l *.pdb > lengths.txt
#put file lengths in a new file
sort -n lengths.txt
#display sort by file length (numeric)
sort -n lengths.txt > sorted.txt
#save sorted list
head -n 3 sorted.txt
#view top 3 lines
tail -n 3 sorted.txt
#view bottom 3 lines
sort -n lengths.txt | head -n 1
#output sort contents into head command, doesn't save an intermediate sorted file!
#put it all together with pipes!
wc -l *.pdb | sort -n | head -n 1
echo Hello > greeting.txt
#new file that says "Hello"
echo Goodbye > greeting.txt
#file now only contains "Goodbye"
echo Hello Again >> greeting.txt
#file now contains "Goodbye" and "Hello Again"
Some example exercises are found here using other datasets and pipes.
Loops perform actions over multiple files. An example structure of a loop looks like this:
for thing in list_of_things
do
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
As an example,
$ for filename in basilisk.dat minotaur.dat unicorn.dat
> do
> echo $filename
> head -n 2 $filename | tail -n 1
> done
This returns the following output:
basilisk.dat
CLASSIFICATION: basiliscus vulgaris
minotaur.dat
CLASSIFICATION: bos hominus
unicorn.dat
CLASSIFICATION: equus monoceros
In a loop, `$` designates the variable's value: it tells the shell to treat the variable as a name and substitute its value in place, rather than as text or a command. For example, `$filename` in the loop above substitutes in the file names from the list we created.
Also, `$filename` is equivalent to `${filename}`.
You can call your variable anything! Here is another example calling the variable “x” rather than “filename”.
$ for x in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $x | tail -n 1
> done
This loop below would list all files ending with .pdb and add them into a new file called all.pdb:
for datafile in *.pdb
do
cat $datafile >> all.pdb
done
If you want to create a back up copy for lots of files, for example, you can loop the cp
command:
$ for filename in *.dat
> do
> cp $filename original-$filename
> done
This creates new files that have “original” in the name.
As a tip, use `echo` to make sure it's doing what you think it is!
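For example, a dry run of the backup loop above, with `echo` printing each command instead of executing it (the `.dat` files are created here just for the demo):

```shell
touch basilisk.dat unicorn.dat     # sample files so the wildcard matches
for filename in *.dat
do
    echo cp "$filename" "original-$filename"   # prints the command, copies nothing
done
```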
You can separate commands by semicolons in a loop as well.
If you want to add a shell script as the command for a loop, it would look like this:
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile;
bash goostats.sh $datafile stats-$datafile; done
In this above example, it is taking datafiles that start with NENE and end with either A or B, echoing the name of that file, and running a bash script. This script takes two inputs: the file input ($datafile
) and an output file (stats-$datafile
).
To look at the history of commands and tasks, you can run `history | tail -n 5`. This shows the recent history. You can then re-run a task by specifying `!` followed by the task number from the history list.
Nested loops are also possible and would look something like this:
for species in cubane ethane methane
do
for temperature in 25 30 37 40
do
mkdir $species-$temperature
done
done
To enter these in shell, use the above notation and copy paste.
Shell Scripts
A shell script can be used to store commands in a file and then run in an easy command.
Make a shell script by running nano script.sh
, paste/write in your script (best practice is to prepare script in text editor outside terminal then paste in), and then run.
For example:
nano script.sh
head -n 15 "$1" | tail -n 5
bash script.sh file.txt
This will take the first input (`$1`, here file.txt), keep its first 15 lines, and then return the last 5 of those (i.e., lines 11 through 15).
You can specify as many arguments as you want. For example, here is a shell script that takes three inputs:
# Select lines from the middle of a file.
# Usage: bash script.sh filename end_line num_lines
nano script.sh
head -n "$2" "$1" | tail -n "$3"
bash script.sh file.txt 15 5
This will return the last 5 of the first 15 lines of file.txt (lines 11 through 15): `$1` is the file, `$2` the end line, and `$3` the number of lines.
If you have lots of files to run a script over, you can use the $@
to indicate a list of files. For example,
nano do-stats.sh
#put this in the nano editor, note the script here refers to some other script to calculate stats
for datafile in "$@"
do
echo $datafile
bash script.sh $datafile stats-$datafile
done
#then run this script
bash do-stats.sh *.txt
To streamline this a bit, you can also do:
nano do-stats.sh
#put this in the nano editor, note the script here refers to some other script to calculate stats
for datafile in *.txt
do
echo $datafile
bash script.sh $datafile stats-$datafile
done
#then run this script
bash do-stats.sh
The trade off is that you have a simpler command to run this script, but to change the files you want to run this over, you have to edit the .sh file. If you use the $@
operator instead, you can just specify the files when you run the script in the command line.
Finding things in Terminal
The `grep` command helps you find things. Its name is a contraction of "global regular expression print". It will find and print lines in files that match a pattern.
Use the command like this:
grep not haiku.txt
This will return all lines with “not” in the document haiku.txt.
To limit searches to word boundaries (for example, searching for "the" could otherwise return lines with both "the" and "thesis"), use `grep -w pattern file.txt`.
You can also search for a phrase with `grep -w "is not" haiku.txt`.
Adding `-n` will return line numbers with the matching phrases, which can be helpful to subset data.
Adding `-i` will make the search case insensitive.
Adding `-v` will reverse the search! So it will look for lines that do NOT contain the phrase of interest.
As done with other commands, adding -R or -r will search through a directory of files.
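The flags above, tried on a small file (these three lines are a stand-in for the lesson's haiku.txt, recreated here so the example is self-contained):

```shell
printf 'Is not the true Tao, until\nThe Tao that is seen\nToday it is not working\n' > haiku.txt
grep not haiku.txt      # lines containing "not" anywhere
grep -w not haiku.txt   # "not" as a whole word only
grep -n not haiku.txt   # matches printed with their line numbers
grep -i the haiku.txt   # case insensitive: matches "The" and "the"
grep -v the haiku.txt   # inverted: lines that do NOT contain "the"
```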
Using regular expressions:
Regular expressions are complex and powerful! There is a full lesson on this available here.
Here is an example:
grep -E "^.o" haiku.txt
In this example, `-E` tells the computer that the quoted pattern is a regular expression. The `^` anchors the match to the start of the line, the `.` matches any single character, and alphanumeric characters match themselves.
So here, we are looking to return any lines with an "o" in the second position of the line.
Using find function:
- `find` will search for files and directories
- To show files in the current directory, use `find .`, which will return everything in the directory
- To find directories, we can use `find . -type d`, which finds all directories in the current location. If we use `-type f`, it will return all files instead.
- You can also search for files that match by name. For example, `find . -name "*.txt"` will return all files in the current location that end with `.txt`. You want to put the search pattern in quotes, because otherwise the shell will expand it to the first matching file name it comes across.
- `ls` and `find` are different: ls lists everything it can, while find searches for things with certain properties
We can combine this with other commands. For example:
wc -l $(find . -name "*.txt")
This lists the line counts for all files that have .txt as an extension. Cool!
It’s very common to use find and grep together. The first finds files that match a pattern; the second looks for lines inside those files that match another pattern. Here, for example, we can find txt files that contain the word “searching” by looking for the string ‘searching’ in all the .txt files in the current directory:
grep "searching" $(find . -name "*.txt")
This will return lines containing the word “searching” in each file.
Day 2: Git
Links for today and set up
We will also use GitHub today, which I already have installed and set up on my computer and online account.
What is GitHub and Git?
- Git is a program for automated version control, and GitHub is a platform/server that hosts Git version-controlled repositories.
- Provides permanent backup and storage location for files that are openly accessible
- GitHub is the most commonly used Git server
Configuring and Setting Up Git
Note: I’m not doing these steps because it’s already set up on my computer
Set up Git configurations on your computer:
- First, configure Git with `git config --global user.name "My Name"`, followed by `git config --global user.email "email@email.com"`.
- The double-dash flag `--global` indicates to Git that you are making these settings for every time you use Git.
- Then, use `git config --global core.autocrlf input`. This setting has to do with line endings, which differ between Mac and Windows.
- Use `git config --global core.editor "nano -w"` to set nano as the text editor.
- Finally, `git config --global init.defaultBranch main` sets the default branch name to "main".
- You only need to run these commands once on your computer.
- Use `git config --help` for more information.
We next need to tell Git how to communicate with GitHub. Passwords aren’t very secure, so we can set up SSH authentication. Set this up with the following commands:
- Create a public and private key pair: `ssh-keygen -t ed25519 -C "email@email.com"`. The `-t` flag specifies the type of key to create; in this case we are using ed25519.
- Save the file by pressing `enter`.
- Leave the passphrase empty. As long as you are the only one accessing the computer, this file will only be readable by you anyway. You can use a passphrase if someone else uses the computer.
- Basically, messages are encrypted with the public key and can only be decrypted with your private key.
- Then navigate with `cd ~/.ssh` and use `ls` to view the files. Run `cat id_ed25519.pub` and copy the entire line that comes up.
- In your GitHub account, go to Settings and then SSH and GPG keys. Create a new key with whatever title you want, then paste in the line that you copied. Now the key is stored in GitHub.
- Now you can log into GitHub when you need to push and pull.
Set up a Repository
- Create a repository in GitHub.com.
- You can now copy the SSH repository link.
- In Terminal, navigate to the directory you want the repository to go into.
- Type `git clone <insert link>` to clone the repository to the current directory on your computer.
- Use `ls -a` to view all files, including hidden ones that start with a `.`.
- You can now add files to your repository.
Using Git to track and commit files
- `git status` shows the status of your repository and any changes made that have not been committed.
- `git add <file>` adds a specific file or directory; `git add -A` adds all changes. The file is now added to the queue to be committed.
- A commit takes the work you have done and permanently records it in the Git repository, which will track these changes.
- `git commit -m "<message or description>"` will commit the added files/changes.
- `git log` will show you recent activity with the author, date, and message of each commit.
- `git diff <file>` will show the changes in one file, comparing your local copy to the last committed version. `git diff` without a file will show all changes.
- `git restore <file>` will restore the original file from the repository (we will talk about this more later).
- You can add as many changes as you want into a single commit.
- `git log --oneline` shows you the history with each commit on a single line! Super useful.
- Git will not track a new folder/directory until there is something in it. It tracks files, not directories.
- Use `git diff <commit> <file>` to compare two states. For example, `git diff HEAD file.txt` will compare the file to the current state of the repository. You can also use `git diff <commit#> file.txt` to compare the file to a previous commit state. `git diff HEAD~1 file.txt` compares the file to the previous commit: `HEAD` just means the current repository state, and `HEAD~1` is one commit back.
- Commit IDs are long; you will often see shorter alias commit IDs, and you can use these when comparing commits.
What if we make a mistake?
- If we save a file with a mistake, we can use `git checkout` to replace a local file with the most recent one from the repository. This is useful to pull whatever is current in the repository and replace the local version. E.g., `git checkout file.txt`.
- To go back to the state of a previous commit (2 commits ago in this case), you would use `git checkout HEAD~2 file.txt`. This will replace the file with the version from 2 commits ago.
- You will then need to commit this change, since it is different from the current state.
How do we ignore files?
- It's OK to have files that Git isn't tracking, if they're not needed
- You can tell Git that it's OK to ignore files
- You create a new file (in nano or another editor) called `.gitignore`
- Each line of this file is a file or pattern (e.g., *.csv) you want Git to ignore. You MUST do this before you commit and track a file. If you want to ignore files that have already been tracked or added, you have to delete those files, add them to the ignore list, and then add them back (e.g., .DS_Store objects are notorious for this).
- Commit and push this file.
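A minimal `.gitignore` might look like this (the patterns are just illustrations; comment lines start with `#`):

```
# macOS folder metadata files
.DS_Store
# ignore every csv file
*.csv
# ignore an entire folder
results/
```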
Sending files to/from GitHub
- Once you have committed changes, use `git push` to send them up to GitHub. This will now be reflected on your GitHub webpage.
- Use `git pull` to pull any changes that have been sent to GitHub, either from the webpage or from a collaborator.
- Commit and push at the end of major sessions, or whenever you accomplish a task/step that you want recorded.
Collaborating on GitHub - Forking and Pull Requests
- You can add collaborators to allow them to push and pull to your repository. They will then accept an invitation and clone the repository to their computer.
- Another way is for the other user to go to your GitHub repository on GitHub.com. They can see the files since it's public, but they won't have permission to edit and commit.
- In order to contribute, the user can “fork” the repository.
- This will copy the repository to your own account.
- Then use `git clone <link to your forked repo>` to clone this repository fork.
- The user can then edit the contents of the repository in their forked repo.
- You can see the number of forks on the original repository.
- When changes are pushed, they are pushed to your forked repository.
- You can then have your collaborator with the original repository pull changes from your fork.
- You can do this with “Pull Requests” and “new pull request”.
- At the top there is a bar with the arrow showing the direction of the changes to be made. For example, you would want the change from the forked repo to be pulled into the original repo.
- At the bottom, it will show you the difference between the repositories.
- You can then add a more detailed comment and create the request.
- Now, in the original repo, the original user will then go to Pull Requests and you will see the active requests.
- The user can then confirm the request, which will pull those changes into the repository.
- But in some cases if there are multiple edits to the same file, it will create a conflict.
- You can use the “resolve conflicts” button to resolve them. You can then delete anything you don’t want, or edit the file and then commit the merge.
- It is called a pull request because the owner is pulling from your repo into theirs.
- Users can then pull these changes down into their fork by using “Sync Fork” button. You will then update the branch and it will sync all changes.
Day 3 & 4: R
Basics of using R
Lesson plans for today are here and here.
- String = anything enclosed in quotes
- `print("string of things")` prints the string to the output
- To glue strings together, you can use `paste("string1", "string2")`
- Use hashtags to make comments, as in Terminal
- In `paste`, you can also set the `sep` argument to paste things together with anything other than a space. E.g., `paste("banana", "apple", "orange", sep="/")` returns `banana/apple/orange`.
- A variable is anything assigned a value with `<-`. You can also use `=`, but this can be tricky because we often use `=` for other tasks.
- You can use the `args` function to display the arguments that a function requires, for example `args(round)`. The `round` function is used to round numbers (`round(1.2345, digits=2)` returns 1.23).
- Create a vector of multiple things with `vector <- c(1,2,3,4)`. You can add new things to the front (`vector <- c(5, vector)`) or end (`vector <- c(vector, 10)`) of a vector as well.
- Boolean logical values are `TRUE` and `FALSE`. These can be used to filter data later on.
- Summarize variables and view structure with `str()`.
- `vector[2]` pulls out the second value of the vector, and `vector[c(3,2)]` pulls out the third and second values.
- Square brackets = retrieve something; parentheses = do something.
- All values are coerced to strings if there is at least one string in the vector.
- You can also subset with a logical condition, e.g., `vector_new <- vector[vector < 5]`. A `==` will evaluate whether two things are equal, `!=` means "is not equal to", and `<=` or `>=` mean less than/greater than or equal to. Combine these with `|` to mean "or", so you can filter by `vector <= 5 | vector == 10`, while `&` will look for both conditions to be met.
- To subset by things in a list, you would filter by `vector %in% list`, where you create the list with `list <- c(1,2,3)`.
- NAs can make things tricky. For example, when calculating a mean you have to use `mean(data, na.rm=TRUE)` if you have NAs. We can use `!is.na(vector)` to remove them.
Working with data
Below, I’ll paste in the code from our practice session. You can copy and paste this code into an R script and follow along!
To download data from a URL and store it on your computer, use this function:
download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "data_raw/portal_data_joined.csv")
This is a helpful function!
We are going to use `tidyverse` for the data today: run `library(tidyverse)` after you `install.packages("tidyverse")`.
To read in a file after downloading, use `read_csv("path/file.csv")`.
Set up
library(tidyverse)
Download data
download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "data_raw/portal_data_joined.csv")
Read data
surveys<-read_csv("data_raw/portal_data_joined.csv")
Examine and practice subsetting data
str(surveys) #view structure of dataframe
names(surveys) #column names
summary(surveys) #summary stats
head(surveys,3) #view first 3 rows
surveys[1,] #gives first row
surveys[,2] #gives second column
surveys[1:3, c("month", "day", "year")] #show month/day/year columns for rows 1 to 3
surveys[,-1] #select all rows but not the first column
surveys$species #view all species column entries
surveys[surveys["sex"]=="F",] #select all rows where sex is female
surveys[!is.na(surveys["sex"]) & surveys["sex"]=="F",] #select rows where there isn't an NA and sex is female
na.omit(surveys[surveys["sex"]=="F",]) #omit NAs and return all rows where sex is female
Summarizing data and using factors
summary(surveys) #this doesn't work to summarize any non-numeric/integer variable
#factors are categorical and R has knowledge that data points belong to these categories
surveys$sex #as character strings
factor(surveys$sex) #as factors
surveys$sex<-factor(surveys$sex) #save as factor
levels(surveys$sex) #shows the levels/order of the factor
nlevels(surveys$sex) #shows number of categories
surveys$sex<-factor(surveys$sex, levels=c("M", "F")) # order levels manually
levels(surveys$sex) #M is now first
Plot data
plot(surveys$sex) #shows counts of observations in each level of factor
surveys$sex<-addNA(surveys$sex) #include NA's in factor levels
plot(surveys$sex) #shows the count of NA's
levels(surveys$sex)
levels(surveys$sex)[3]<-"undetermined" #rename NA's as "undetermined" - a good way to rename a level of a factor
levels(surveys$sex) #shows this change
plot(surveys$sex) #now the plot shows M and F and undetermined
Formatting dates
We will use the lubridate package for formatting dates.
library(lubridate)
Working with dates
my_date<-ymd("2015-01-01") #specifies string as date from YYYY-MM-DD format
str(my_date)
my_date<-ymd(paste("2015", "1", "1", sep="-")) #this is another way we could format a date string into an R date object
str(my_date) #same result as above
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-")) #we can format a new column as dates by pasting together separate year month day columns
str(surveys)
summary(surveys$date) #we can see there are 129 NA's from the date we made above
#investigate the missing dates
missing_dates <- surveys[is.na(surveys$date), c("year", "month", "day")] #extract NAs and only show year month day columns
head(missing_dates)
#the reason these dates are NA is because September and April do not have 31 days, only 30. This is an example of QC'ing your data
Manipulating data
Cheat sheets for tidyverse can be found here and here.
Manipulating data
select(surveys, plot_id, species_id, weight) #keeps only plot, species, and weight columns
select(surveys, -record_id, -species_id) #keeps all variables except record and species
filter(surveys, year == 1995) #this will filter the dataset by evaluating a condition
Using pipes to do multiple actions at the same time
#here is an example of nesting functions
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight) #keeps species sex and weight columns from the dataframe for rows with weight less than 5
#the same way to do this with a pipe is:
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight); surveys_sml
Hint: A shortcut for putting in a pipe is to use command + shift + m!
Create new columns with mutate
# filter out na's and create a new column and display the top 5 rows
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
#you can create multiple columns in one mutate command like this
surveys %>%
mutate(weight_kg = weight / 1000,
weight_lb = weight_kg * 2.2)
Summarise data with summarize
#summarize mean weights for each sex and display
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
#summarize by two groups and remove na's before the calculation
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
#summarize and then print 15 lines, use this to customize the output
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
#summarize multiple metrics and arrange them in ascending order by weight
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight)
#now arrange in descending order
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(mean_weight))
Counting data
#we can use the count function to calculate the number of observations in our data
surveys %>%
count(sex)
#this is also equivalent (but more streamlined) to:
surveys %>%
group_by(sex) %>%
summarise(count = n())
#you can also sort within the count function
surveys %>%
count(sex, sort = TRUE)
#now count observations for a combination of two factors, arranged alphabetically by species and then by descending count
surveys %>%
count(sex, species)%>%
arrange(species, desc(n))
#example: what was the heaviest animal measured in each year?
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
filter(weight == max(weight)) %>%
select(year, genus, species, weight) %>%
arrange(year)
Reshaping data with pivot_longer
and pivot_wider
(previously gather
and spread
).
What are the 4 rules of a “tidy” dataframe?
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
- Each type of observational unit forms a table
Pivoting tables helps with rule #4!
Pivoting from long to wide
#pivot_wider needs three things:
# data
# names from (values will become new columns)
# values from (values will fill the new columns)
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
str(surveys_gw)
#this data frame has observations for each plot across multiple rows
surveys_wide <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight) #convert from long to wide format
head(surveys_wide)
str(surveys_wide)
#now each row is a plot and each column is a species populated by mean weight
#for missing values, add a weight of 0 rather than NA
surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>%
head()
Pivoting from wide to long format
#we would do this if we want to have the values in columns as a variable (e.g., genus rather than a column for each genus)
#pivot_longer needs 4 things
# the data
# the names_to column variable we wish to create from column names.
# the values_to column variable we wish to create and fill with values.
# cols are the name of the columns we use to make this pivot (or to drop).
surveys_long <- surveys_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) #column names now go into a new column called genus, the data goes into a new column called "mean weight", and the function ignores the plot column
str(surveys_long)
head(surveys_long) #now we are back to the long version
#another example:
surveys_wide_genera <- surveys %>%
group_by(plot_id, year) %>%
summarize(n_genera = n_distinct(genus)) %>%
pivot_wider(names_from = year, values_from = n_genera) #summarize the number of unique genera for each plot and year, then make into a wide format
head(surveys_wide_genera)
#now pivot so that each row is a unique plot and year combo in long format
surveys_wide_genera %>%
pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)
#here is an example of how you would pivot with multiple data columns and put the measurement type as a new column
surveys_long2 <- surveys %>%
pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
head(surveys_long2)
Exporting data
#obtain dataset with no NAs
surveys_complete <- surveys %>%
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
# Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>%
filter(n >= 50)
# Only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
#write to csv
write_csv(surveys_complete, file = "data/surveys_complete.csv")
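One caveat worth noting (not covered in the lesson): write_csv() will error if the destination folder does not exist. A small guard, assuming the tidyverse is already loaded and using the same data folder as above:

```r
#create the data folder first if it is missing, then write
if (!dir.exists("data")) dir.create("data")
write_csv(surveys_complete, file = "data/surveys_complete.csv")
```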
Visualizing data
ggplot2 is the main package we will use for visualizations. It uses the following structure:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
Build a plot with points and try a hex plot
#full plot
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
# the + sign adds layers to the plot; anything set in the main ggplot() call (data and aes) applies to all subsequent layers unless otherwise specified
#you can also assign the plot to a variable and then add layers to that variable
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete,
mapping = aes(x = weight, y = hindfoot_length))
# Draw the plot with points
#geom_point, geom_line, geom_boxplot are common geoms to add
surveys_plot +
geom_point()
# Draw with hex
install.packages("hexbin") #only needed once
library(hexbin)
#draw a hex plot
surveys_plot +
geom_hex()
#the advantage of a hex plot is that you can see count frequency in addition to distribution, cool!
#add point transparency
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1)
#add color
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
#vary color by species
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, aes(color = species_id))
Build a boxplot
#basic boxplot
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot()
#add data points over the box plot
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.3, color = "tomato")
#try a violin plot
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_violin()
#scale by log 10
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_violin()+
scale_y_log10()
Plotting time series data
#calculate the number of counts per year per genus
yearly_counts <- surveys_complete %>%
count(year, genus)
#plot this data
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line()
#tell the plot to group by genus
ggplot(data = yearly_counts, aes(x = year, y = n, group = genus)) +
geom_line()
#now add color
ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line()
Piping in from other functions
#for simplicity, you can pipe in from other tidyverse functions
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
#link data manipulation functions
yearly_counts_graph <- surveys_complete %>%
count(year, genus) %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line(); yearly_counts_graph
Faceting
#facet to show multiple plots side by side, split by categories of your choosing
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
#now split line in each plot by sex
yearly_sex_counts <- surveys_complete %>%
count(year, genus, sex)
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
#set facets by rows and columns manually
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
# One column, facet by rows
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(genus))
# One row, facet by column
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(cols = vars(genus))
Plot themes
#black and white theme
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_bw()
#minimal
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_minimal()
#light
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_light()
#void
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_void()
#classic
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_classic()
Customizing plots
#change axis titles
ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw()
#change text size
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(text=element_text(size = 16))
#change text orientation and face
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
strip.text = element_text(face = "italic"),
text = element_text(size = 16))
#create your own custom theme to reuse!
grey_theme <- theme(axis.text.x = element_text(colour="grey20", size = 12,
angle = 90, hjust = 0.5,
vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text=element_text(size = 16))
ggplot(surveys_complete, aes(x = species_id, y = hindfoot_length)) +
geom_boxplot() +
grey_theme
Arranging plots with patchwork
We can combine plots with the patchwork package. In patchwork, you use + to put plots next to each other, / to arrange them vertically, and plot_layout() to control how much space each plot occupies. More examples are here: https://patchwork.data-imaginist.com/. Patchwork also lets you add annotations, customize layouts, and add labels.
install.packages("patchwork")
library(patchwork)
plot_weight <- ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot() +
labs(x = "Species", y = expression(log[10](Weight))) +
scale_y_log10()
plot_count <- ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year", y = "Abundance")
plot_weight / plot_count + plot_layout(heights = c(3, 2))
#examples of other layouts
#(p1 | p2 | p3) /
# p4
#this would create a plot with three plots on top and one plot on the bottom - super cool and easier than cowplot!
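As a concrete sketch of that layout (the plot names p1–p4 are hypothetical, built here from surveys_complete; assumes ggplot2 and patchwork are loaded):

```r
#four quick plots to demonstrate the layout (names p1-p4 are arbitrary)
p1 <- ggplot(surveys_complete, aes(x = weight)) + geom_histogram()
p2 <- ggplot(surveys_complete, aes(x = hindfoot_length)) + geom_histogram()
p3 <- ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) + geom_point()
p4 <- ggplot(surveys_complete, aes(x = species_id, y = weight)) + geom_boxplot()
#three plots on top, one spanning the bottom row
(p1 | p2 | p3) /
  p4
```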
Note: check out patchwork for arranging plots, it seems really useful!
Exporting plots
my_plot <- ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90,
hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16))
ggsave("fig/plot.png", my_plot, width = 15, height = 10) #save to the fig folder (the folder must already exist)
## This also works for plots combined with patchwork
plot_combined <- plot_weight / plot_count + plot_layout(heights = c(3, 2))
ggsave("fig/plot_combined.png", plot_combined, width = 10, dpi = 300)