lab01_Intro_to_R.Rmd
R
?
R
is a free, open-source programming language and
software environment for statistical computing, bioinformatics,
visualization and general computing.
It is based on an ever-expanding set of analytical packages that perform specific analytical, plotting, and other programming tasks.
R
?
R
is free(!), runs on pretty much every operating
system, and has a huge user base.
R
is far from the only programming language for working
with data. But it is the most widely used language in the fields of
ecology, evolution, and wildlife sciences. If you plan to pursue a
career in any of these fields, proficiency in R
is quickly
becoming a prerequisite for many jobs.
Even if you don’t pursue a career in one of these fields, the ability to manipulate, analyze, and visualize data (otherwise known as data science) is an extremely marketable skill in many professions right now.
We will go over the basics of using R
during lab
sessions but there are many good online resources for learning
R
and getting help. A few of my favorites include:
Tom Edward’s online Learning R course (from which much of this material is borrowed)
Data Analysis and Visualization in R for Ecologists (also from which much of this material is borrowed)
Of course, if you encounter error messages you don’t understand or
need help figuring out how to accomplish something in R
,
google is your best friend (even the most experienced R
users use google on a daily basis). The key to finding answers on google
is asking the right questions. Because we will not spend much time on
this topic in lab, please refer to these links for advice on formulating
R
-related questions:
Seeking help from Data Analysis and Visualization in R for Ecologists
R
- the very basics
As a statistical programming tool, one thing R is very good at is
doing math. So as a starting point, let’s treat R
like a
fancy calculator.
We interact with this calculator by typing numbers and operators (+,
-, *, /) into the Console
window.
Let’s try it - in the bottom left window (the Console), write the
R
code required to add two plus two and then press
enter:
2+2
When you run the code, you should see the answer printed below the window. Play with your code a bit - try changing the number and the operators and then run the code again.
We can run R
like a calculator by typing equations
directly into the console and then printing the answer. But usually we
don’t want to just do a calculation and see the answer. Instead, we
assign values to objects. That object is then saved in
R
’s memory which allows us to use that object later in our
analysis.
This might seem a bit confusing if you are new to programming so
let’s try it. The following code creates an object called
x
and assigns it a value of 3
:
x <- 3
The operator <-
is how we do assignments in
R
. Whatever is to the left of <-
is the
object’s name and whatever is to the right is the value. As we will see
later, objects can be much more complex than simply a number but for
now, we’ll keep it simple.
You try it - change the code to create an object called
new.x
. Instead of assigning new.x
a number,
give it a calculation, for example 25/5
. What do you think
the value of new.x
is?
In the exercise above, you may have noticed that after running the
code, R
did not print anything. That is because we simply
told R
to create the object (in the top right window, if
you click on the Environment
tab, you should see
x
and new.x
). Now that it is stored in
R
’s memory, we can do a lot of things with it. For one, we
can print it to see the value. To do that, we simply type the name of
the object and run the code:
new.x <- 25/5
new.x
#> [1] 5
We can also use objects to create new objects. What do you think the following code does?
x <- 3
y <- x*4
After running it, print the new object y
to see its
value. Were you right?
It’s a good idea to give objects names that tell you something about
what the object represents. Names can be as long as you want them to be
but should not have spaces (also remember long names require more typing
so brevity is a good rule of thumb). Names also cannot start with a
number and R
is case-sensitive so, for example,
Apple
is not the same as
apple
.
The console is useful for doing simple tasks but as our analyses become more complicated, the console is not very efficient. What if you need to go back and change a line of code? What if you want to show your code to someone else to get help?
Instead of using the console, most of our work will be done using scripts. Scripts are special files that us to write, save, and run many lines of code. Scripts can be saved so you can work on them later or send them to collaborators.
To create a script, click
File -> New File -> R Script
. This new file should
show up in a new window.
R
will ignore any code that follows a #
.
This is very useful for making your code more readable for both
yourself and others. Use comments to remind yourself what a newly
created object is, to explain what a line of code does, to leave
yourself a reminder for later, etc. For example, in the previous code,
it might be a good idea to use comments to define what each object
represents:
n1 <- 44 # Number of individuals captured on first occasion
n2 <- 32 # Number of individuals captured on second occasion
m2 <- 15 # Number of previously marked individuals captured on second occasion
Notice that when you run this code, R
ignores the
comments.
So far, we have only been working with objects that store a single
number. However, often it is more convenient to store a string of
numbers as a single object. In R
, these strings are called
vectors and they are usually created by enclosing the string
between c(
and )
:
x <- c(3,5,2,5)
x
#> [1] 3 5 2 5
You can also sequences of consecutive numbers in a few different ways:
x <- 1:10
x
#> [1] 1 2 3 4 5 6 7 8 9 10
x2 <- seq(from = 1, to = 10, by = 1)
x2
#> [1] 1 2 3 4 5 6 7 8 9 10
The seq()
function is very flexible and useful so if you
are not familiar with it, be sure to look at the help page to better
understand how to use it.
Another useful function for creating vectors is rep()
,
which repeats values of a vector:
rep(x2, times = 2)
#> [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
or:
rep(x2, each = 2)
#> [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
Be sure you notice the difference between using the
times
argument vs the each
argument!
A vector can also contain characters (though you cannot mix numbers and characters in the same vector!):
occasions <- c("Occasion1", "Occasion2", "Occasion3")
occasions
#> [1] "Occasion1" "Occasion2" "Occasion3"
The quotes around “Occasion1”, “Occasion2”, and “Occasion3” are
critical. Without the quotes, R
will assume there are
objects called Occasion1
, Occasion2
and
Occasion3
. As these objects don’t exist in R
’s
memory, there will be an error message.
Vectors can be any length (including 1. In fact, the numeric objects
we’ve been working with are just vectors with length 1). The function
length()
tells you how long a vector is:
length(x)
#> [1] 10
The function class()
indicates the class (the type of
element) of an object:
What is the class of a vector with both numeric and characters entries? Hint:
mixed <- c(1, 2, "3", "4")
You can also use the c()
function to add other elements
to your vector:
y <- c(x, 4,8,3)
Vectors are one of the many data structures that R
uses.
Other important ones are lists (list()
), matrices
(matrix()
), data frames (data.frame()
),
factors (factor()
) and arrays (array()
). We
will learn about each of those data structures as we encounter them in
lab exercises.
One of the most useful properties of vectors in R
is
that we can use them to simplify basic arithmetic operations that need
to be done on multiple observations. For example, consider the following
data on wing chord (a measure of wing length) and body mass of
Swainson’s thrushes (Catharus ustulatus):
Individual | Mass (g) | Wing chord (mm) |
---|---|---|
1 | 36.2 | 95.1 |
2 | 34.6 | 88.4 |
3 | 31.0 | 97.9 |
4 | 31.8 | 96.8 |
5 | 29.4 | 92.3 |
6 | 32.0 | 90.6 |
Perhaps we want to derive the body condition of each individual based on these measures. One common metric of body condition used by ornithologists is \(\frac{mass}{size}\), where wing chord is used as a proxy for body size. We could calculate body condition for each individual:
cond1 <- 36.2/95.1 # Body condition of the first individual
cond2 <- 34.6/88.4 # Body condition of the second individual
But that is time consuming and error prone. Luckily, R
will vectorize basic arithmetic:
mass <- c(36.2, 34.6, 31.0, 31.8, 29.4, 32.0)
wing <- c(95.1, 88.4, 97.9, 96.8, 92.3, 90.6)
cond <- mass/wing
cond
#> [1] 0.3807 0.3914 0.3166 0.3285 0.3185 0.3532
As you can see, when we divide one vector by another, R
divides the first element of the first vector by the first element of
the second vector, etc. and returns a vector.
The power of R
is most apparent in the large number of
built-in functions that are available for users.
Functions are small bits of code that perform a specific task. Most functions accept one or more inputs called arguments and return a value or a new object.
Let’s say we have the following data on the number of ticks recorded on 5 dogs:
Individual | Ticks |
---|---|
1 | 4 |
2 | 7 |
3 | 2 |
4 | 3 |
5 | 150 |
What is the total number of ticks recorded in the study? For that, we
can use the built-in sum()
function:
What is the mean number of ticks per dog?
mean(ticks)
#> [1] 33.2
And the variance?
var(ticks)
#> [1] 4267
Every function takes a different set of arguments and in many cases
you will need to look up what those arguments are. The best way to get
help for a specific function is to type a question mark followed by the
function name, which will bring up a help page in the bottom right
panel. For example, the round
function rounds a number to a
specified number of decimal places. This is a useful function when we
don’t want to print a really large number of digits:
?round
So we see round
takes an argument called x
,
which is the number we want to round, and the number of
digits
we want to round to. If you provide the arguments in
the exact same order as they are defined you don’t have to name them.
For example, :
If you do name the arguments, you can switch their order:
round(digits = 0, x = y)
#> [1] 33
Although you don’t have to name arguments, it’s a good idea to get in the habit of naming them. This will make you code easier to read, will help avoid mistakes that can occur when you don’t put the arguments in the correct order, and makes it easier to trouble shoot code that doesn’t do what you expect it to do.
Often you will need to work with just a subset of a vector. For example, maybe you have a vector of plant biomass measured along transects but you only need
Notice that to index certain elements of the vector y
,
we use square brackets. Inside those brackets, we provided an
integer vector, where each integer refers to the position of
elements in the first vector. The indexing vector can be any length
(including 1).
We can also index vectors using a logical vector. A logical
vector is a special type of object that contains values of
TRUE
or FALSE
. When using a logical vector for
indexing, the logical vector indicates which elements to keep
(TRUE
) or remove (FALSE
) from the original
vector. For this reason, the indexing vector must be same length as the
focal vector; i.e., length(a) == length(v)
# Logical vector (which elements of y are greater than 4?)
y > 4
#> [1] FALSE FALSE TRUE FALSE TRUE
# Indexing using a logical vector (keep elements 3 and 5)
y[y > 4]
#> [1] 8 25
We can also use indexing to remove elements from a vector:
# Remove the second element
y[-2]
#> [1] 2 8 4 25
or to rearrange the order of a vector
y[c(5,4,3,2,1)]
#> [1] 25 4 8 4 2
One of R
’s primary strengths is the large number of
packages available to users. Packages are units of shareable
code and data that have been created by other R
users. We
have already seen the built-in functions that R
comes with.
Packages allow users to share lots and lots of other functions that
serve specific purposes. Packages also allow users to share data
sets. There are packages for cleaning data, visualizing data, making
maps, fitting specialized models, and basically anything else you can
think of.
Accessing the code in a package first requires installing the
package. This only needs to be done once per computer and is usually
done using the install.packages()
function:
install.packages("devtools")
Note that the name of the package (in this case
devtools
) must be in quotation marks. Packages installed
using install.packages()
are stored in a centralized
repository called CRAN (Comprehensive R Archive Network). Once
devtools
(or any package) is installed on your computer,
you do not need to re-run the install.packages()
function
unless you re-install/update R
or need to update the
package to a newer version.
Installing a package does not automatically make the functions from
that package available in a given R
session. To tell
R
where the functions come from, you must load the
package using the library()
function:
Unlike install.packages()
, library()
must
be re-run each time you open R
. Most people include a few
calls to library()
at the beginning of each script so that
all packages needed to run the code are loaded at the beginning of the
script.
Occasionally, some packages are stored in other places (e.g., github). These packages can be installed using different functions. For example, I created a package for this course that contains small data sets we will use in labs throughout the semester. The package is stored on github and can be installed by running:
install_github("RushingLab/FANR6750")
Note that the install_github()
function is from the
devtools
package so you need to run
library(devtools)
before you install the package. Make sure
you install the FANR6750
package now so you have access to
the data sets.
So far, we have only discussed one particular class of R
object - vectors. Vectors hold a string of values as a single object.
Although useful for many applications, vectors are limited in their
ability to store multiple types of data (numeric and character).
This is where data frames become useful. Perhaps the most common type
data object you will use in R
is the data frame. Data
frames are tabular objects (rows and columns) similar in structure to
spreadsheets (think Excel or GoogleSheets). In effect, data frames store
multiple vectors - each column of the data frame is a vector. As such,
each column can be a different class (numeric, character, etc.) but all
values within a column must be the same class. Just as the first row of
an Excel spreadsheet can be a list of column names, each column in a
data frame has a name that (hopefully) provides information about what
the values in that column represent.
To see how data frames work, let’s load a data frame called
jayData
that comes with the FANR6750
package.
Note - As discussed above, if you want to access function or data
sets that come with packages, you first need to load the
package in your current working environment. To do that, use the
library()
function, with the unquoted package name as the
argument. Once loaded, all of the package’s functions are available to
use.
Alternatively, you can access functions from a given package without
loading the package using package.name::function.name()
.
For example, if you want to use the filter()
function from
the dplyr
package, you could type
dplyr::filter()
. Although less commonly used, this method
has a few advantages:
Sometimes different packages have functions with the same names.
R
will default to using the function from the package that
was loaded last. For example, the raster
package also has a
function called filter()
so if you load dplyr
first (using library()
and then raster
,
R
will default to using raster
’s
filter()
function, which could cause problems.
If you share your code with others, the ::
method
makes it clear which packages are being use for which functions. That
additional clarity is often helpful and is the reason I will often use
::
in this course.
To get a quick idea of what information this data frame contains, we
can use the head()
and tail()
functions, which
will print the first and last 6 rows of the data frame:
library(FANR6750)
data("jayData") # the data() function loads data sets the come with packages
head(jayData)
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 4 239437 3763524 1508 0.02 0.03 Pine Med 43
#> 5 239437 3767724 483 0.26 0.37 Oak Med 36
#> 6 236437 3769524 830 0.00 0.01 Oak Low 39
tail(jayData)
#> x y elevation forest chaparral habitat seeds jays
#> 95 258937 3767124 804 0.19 0.68 Oak Med 40
#> 96 259837 3768024 210 0.00 0.00 Oak Low 33
#> 97 249337 3769524 467 0.70 0.09 Pine Med 36
#> 98 262237 3767424 1318 0.02 0.23 Oak Med 44
#> 99 261937 3770124 354 0.00 0.05 Bare Low 33
#> 100 247837 3769524 686 0.10 0.32 Oak Med 40
We can see that jayData
contains eight columns:
x
, y
, elevation
,
forest
, chaparral
, habitat
,
seeds
, and jays
. We’ll learn more about what
each of these columns represents later in the semester, though just like
functions, many data sets have help pages also and you can access those
help pages using ?jayData
. Several other useful functions
for investigating the structure of data frames are str()
and summary()
str(jayData)
#> 'data.frame': 100 obs. of 8 variables:
#> $ x : num 258637 261937 246337 239437 239437 ...
#> $ y : num 3764124 3769224 3764124 3763524 3767724 ...
#> $ elevation: int 423 506 859 1508 483 830 457 304 834 164 ...
#> $ forest : num 0 0.1 0 0.02 0.26 0 0.02 0 0.54 0 ...
#> $ chaparral: num 0.02 0.45 0.26 0.03 0.37 0.01 0.22 0.09 0.21 0.11 ...
#> $ habitat : chr "Oak" "Oak" "Oak" "Pine" ...
#> $ seeds : chr "Med" "Med" "High" "Med" ...
#> $ jays : int 34 38 40 43 36 39 38 35 41 33 ...
summary(jayData)
#> x y elevation forest
#> Min. :230737 Min. :3761424 Min. : 12 Min. :0.0000
#> 1st Qu.:238762 1st Qu.:3765324 1st Qu.: 365 1st Qu.:0.0000
#> Median :245587 Median :3766824 Median : 548 Median :0.0000
#> Mean :246949 Mean :3767130 Mean : 659 Mean :0.0553
#> 3rd Qu.:254662 3rd Qu.:3768699 3rd Qu.: 929 3rd Qu.:0.0300
#> Max. :266137 Max. :3773724 Max. :1537 Max. :0.7000
#> chaparral habitat seeds jays
#> Min. :0.000 Length:100 Length:100 Min. :30.0
#> 1st Qu.:0.080 Class :character Class :character 1st Qu.:36.0
#> Median :0.210 Mode :character Mode :character Median :38.0
#> Mean :0.241 Mean :38.6
#> 3rd Qu.:0.370 3rd Qu.:41.0
#> Max. :0.850 Max. :48.0
str()
tells us about the structure of the data frame,
for example x
and y
are numeric columns and
habitat
contains character strings. summary()
provides some simple summary statistics for each variable.
Another useful function is nrow()
, which tells us now
many rows are in the data frame (similar to length()
for
vectors):
nrow(jayData)
#> [1] 100
As you will see shortly, one of the most common tasks when working
with data frames is creating new objects from parts of the full
data frame. This task involves subsetting the data frame - selecting
specific rows and columns. There are many ways of
subsetting data frames in R
, too many to discuss so we will
only learn about a few.
First, we may want to select a subset of all of the columns in a big
data frame. Data frames are essentially tables, which means we can
reference both rows and columns by their number:
data.frame[row#, column#]
. The row and column numbers have
to put inside of square brackets following the name of the data frame
object. The row number always comes first and the column number second.
If you want to select all rows of a specific column, you just leave the
row#
blank. For example, if we wanted a vector containing
the number of jays at each survey location:
jayData[,8]
#> [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#> [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#> [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#> [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40
We can also select columns using data.frame$column
(where data.frame
is the name of the data frame object and
column
is the name of the column). For example,
jayData$jays
#> [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#> [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#> [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#> [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40
Notice that if you hit tab
after you type the
$
, RStudio will bring up all of the columns and you can use
the up or down buttons to find the one you want.
Sometimes you may want to select more than one column. The easiest
way to do that is to use the select()
function in the
dplyr
package:
library(dplyr)
head(select(.data = jayData, x, y, jays))
#> x y jays
#> 1 258637 3764124 34
#> 2 261937 3769224 38
#> 3 246337 3764124 40
#> 4 239437 3763524 43
#> 5 239437 3767724 36
#> 6 236437 3769524 39
Notice that select requires us to first provide the data frame object
(.data = jayData
) and then we provide the column names
(unquoted!) we want to select. You can also use select to remove
columns:
head(select(.data = jayData, -seeds))
#> x y elevation forest chaparral habitat jays
#> 1 258637 3764124 423 0.00 0.02 Oak 34
#> 2 261937 3769224 506 0.10 0.45 Oak 38
#> 3 246337 3764124 859 0.00 0.26 Oak 40
#> 4 239437 3763524 1508 0.02 0.03 Pine 43
#> 5 239437 3767724 483 0.26 0.37 Oak 36
#> 6 236437 3769524 830 0.00 0.01 Oak 39
It’s important to realize that even though there are different ways to accomplish the same general task (in this case, “subset columns of a data frame”), those methods will sometimes differ in subtle but important ways.
For example, what type of object did the
data.frame[,col#]
and data.frame$column.name
options return? What type of object did
select(data.frame, col1, col2)
return? Those are very
different results and knowing what the output will be can be useful when
deciding the best way to accomplish your task.
To select specific rows, we can use the row#
method we
learned above, this time leaving the columns blank:
jayData[1,]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0 0.02 Oak Med 34
If we want more than one row, we just put in a vector with all of the rows we want:
jayData[1:2,]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.0 0.02 Oak Med 34
#> 2 261937 3769224 506 0.1 0.45 Oak Med 38
jayData[c(1,30),]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0 0.02 Oak Med 34
#> 30 259537 3765924 1419 0 0.07 Pine Med 43
Note that we can use the square brackets to also subset vectors, in
which case we don’t need the comma as long as you tell R
which column you want first:
jayData$jays[1]
#> [1] 34
Sometimes, we may not know the specific row number(s) we want but we
do know the value of one of the columns we want to keep. Using the
filter()
function in the dplyr
package allows
us to filter rows based on the value of one of the variables. For
example, if we want just surveys that were conducted in oak habitat, we
use:
head(filter(jayData, habitat == "Oak"))
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 4 239437 3767724 483 0.26 0.37 Oak Med 36
#> 5 236437 3769524 830 0.00 0.01 Oak Low 39
#> 6 263737 3766524 457 0.02 0.22 Oak Med 38
Notice the need for two equals signs (==
) when telling
R
we want the row where habitat
equals
Oak
. Filter makes it very easy to select multiple rows
using operators like greater than, less than, etc.
head(filter(jayData, elevation > 1000))
#> x y elevation forest chaparral habitat seeds jays
#> 1 239437 3763524 1508 0.02 0.03 Pine Med 43
#> 2 261637 3768324 1276 0.02 0.36 Oak High 44
#> 3 248737 3766524 1024 0.03 0.41 Pine Low 45
#> 4 255937 3765024 1400 0.02 0.45 Oak High 48
#> 5 259537 3765924 1419 0.00 0.07 Pine Med 43
#> 6 245737 3762924 1004 0.02 0.32 Oak Low 41
or a slightly more complicated example:
head(filter(jayData, elevation < 1000 & habitat == "Oak"))
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 4 239437 3767724 483 0.26 0.37 Oak Med 36
#> 5 236437 3769524 830 0.00 0.01 Oak Low 39
#> 6 263737 3766524 457 0.02 0.22 Oak Med 38