What is R?

R is a free, open-source programming language and software environment for statistical computing, bioinformatics, visualization and general computing.

It is based on an ever-expanding set of analytical packages that perform specific analytical, plotting, and other programming tasks.

Why R?

R is free(!), runs on pretty much every operating system, and has a huge user base.

R is far from the only programming language for working with data. But it is the most widely used language in the fields of ecology, evolution, and wildlife sciences. If you plan to pursue a career in any of these fields, proficiency in Ris quickly becoming a prerequisite for many jobs.

Even if you don’t pursue a career in one of these fields, the ability to manipulate, analyze, and visualize data (otherwise known as data science) is an extremely marketable skill in many professions right now.

Additional resources and where to get help

We will go over the basics of using R during lab sessions but there are many good online resources for learning R and getting help. A few of my favorites include:

Of course, if you encounter error messages you don’t understand or need help figuring out how to accomplish something in R, google is your best friend (even the most experienced R users use google on a daily basis). The key to finding answers on google is asking the right questions. Because we will not spend much time on this topic in lab, please refer to these links for advice on formulating R-related questions:

Using R- the very basics

As a statistical programming tool, one thing R is very good at is doing math. So as a starting point, let’s treat R like a fancy calculator.

We interact with this calculator by typing numbers and operators (+, -, *, /) into the Console window.

Let’s try it - in the bottom left window (the Console), write the Rcode required to add two plus two and then press enter:

2+2

When you run the code, you should see the answer printed below the window. Play with your code a bit - try changing the number and the operators and then run the code again.

Creating objects

We can run R like a calculator by typing equations directly into the console and then printing the answer. But usually we don’t want to just do a calculation and see the answer. Instead, we assign values to objects. That object is then saved in R’s memory which allows us to use that object later in our analysis.

This might seem a bit confusing if you are new to programming so let’s try it. The following code creates an object called x and assigns it a value of 3:

x <- 3

The operator <- is how we do assignments in R. Whatever is to the left of <- is the object’s name and whatever is to the right is the value. As we will see later, objects can be much more complex than simply a number but for now, we’ll keep it simple.


You try it - change the code to create an object called new.x. Instead of assigning new.x a number, give it a calculation, for example 25/5. What do you think the value of new.x is?


Working with objects

In the exercise above, you may have noticed that after running the code, R did not print anything. That is because we simply told R to create the object (in the top right window, if you click on the Environment tab, you should see x and new.x). Now that it is stored in R’s memory, we can do a lot of things with it. For one, we can print it to see the value. To do that, we simply type the name of the object and run the code:

new.x <- 25/5
new.x
#> [1] 5

We can also use objects to create new objects. What do you think the following code does?

x <- 3
y <- x*4

After running it, print the new object y to see its value. Were you right?

Naming objects

It’s a good idea to give objects names that tell you something about what the object represents. Names can be as long as you want them to be but should not have spaces (also remember long names require more typing so brevity is a good rule of thumb). Names also cannot start with a number and R is case-sensitive so, for example, Apple is not the same as apple.

Using scripts instead of the console

The console is useful for doing simple tasks but as our analyses become more complicated, the console is not very efficient. What if you need to go back and change a line of code? What if you want to show your code to someone else to get help?

Instead of using the console, most of our work will be done using scripts. Scripts are special files that us to write, save, and run many lines of code. Scripts can be saved so you can work on them later or send them to collaborators.

To create a script, click File -> New File -> R Script. This new file should show up in a new window.

Commenting your code

R will ignore any code that follows a #. This is very useful for making your code more readable for both yourself and others. Use comments to remind yourself what a newly created object is, to explain what a line of code does, to leave yourself a reminder for later, etc. For example, in the previous code, it might be a good idea to use comments to define what each object represents:

n1 <- 44     # Number of individuals captured on first occasion

n2 <- 32     # Number of individuals captured on second occasion
  
m2 <- 15     # Number of previously marked individuals captured on second occasion

Notice that when you run this code, R ignores the comments.

Vectors

So far, we have only been working with objects that store a single number. However, often it is more convenient to store a string of numbers as a single object. In R, these strings are called vectors and they are usually created by enclosing the string between c( and ):

x <- c(3,5,2,5)
x
#> [1] 3 5 2 5

You can also sequences of consecutive numbers in a few different ways:

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

x2 <- seq(from = 1, to = 10, by = 1)
x2
#>  [1]  1  2  3  4  5  6  7  8  9 10

The seq() function is very flexible and useful so if you are not familiar with it, be sure to look at the help page to better understand how to use it.

Another useful function for creating vectors is rep(), which repeats values of a vector:

rep(x2, times = 2)
#>  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

or:

rep(x2, each = 2)
#>  [1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

Be sure you notice the difference between using the times argument vs the each argument!

A vector can also contain characters (though you cannot mix numbers and characters in the same vector!):

occasions <- c("Occasion1", "Occasion2", "Occasion3")
occasions
#> [1] "Occasion1" "Occasion2" "Occasion3"

The quotes around “Occasion1”, “Occasion2”, and “Occasion3” are critical. Without the quotes, R will assume there are objects called Occasion1, Occasion2 and Occasion3. As these objects don’t exist in R’s memory, there will be an error message.

Vectors can be any length (including 1. In fact, the numeric objects we’ve been working with are just vectors with length 1). The function length() tells you how long a vector is:

length(x)
#> [1] 10

The function class() indicates the class (the type of element) of an object:

class(x)
#> [1] "integer"
class(occasions)
#> [1] "character"

What is the class of a vector with both numeric and characters entries? Hint:

mixed <- c(1, 2, "3", "4")

You can also use the c() function to add other elements to your vector:

y <- c(x, 4,8,3)

Vectors are one of the many data structures that R uses. Other important ones are lists (list()), matrices (matrix()), data frames (data.frame()), factors (factor()) and arrays (array()). We will learn about each of those data structures as we encounter them in lab exercises.

Vectorized arithmetic

One of the most useful properties of vectors in R is that we can use them to simplify basic arithmetic operations that need to be done on multiple observations. For example, consider the following data on wing chord (a measure of wing length) and body mass of Swainson’s thrushes (Catharus ustulatus):

Swainson's Thrush. Image courtesy of VJAnderson via Wikicommons

Swainson’s Thrush. Image courtesy of VJAnderson via Wikicommons

Individual Mass (g) Wing chord (mm)
1 36.2 95.1
2 34.6 88.4
3 31.0 97.9
4 31.8 96.8
5 29.4 92.3
6 32.0 90.6

Perhaps we want to derive the body condition of each individual based on these measures. One common metric of body condition used by ornithologists is \(\frac{mass}{size}\), where wing chord is used as a proxy for body size. We could calculate body condition for each individual:

cond1 <- 36.2/95.1 # Body condition of the first individual

cond2 <- 34.6/88.4 # Body condition of the second individual

But that is time consuming and error prone. Luckily, R will vectorize basic arithmetic:

mass <- c(36.2, 34.6, 31.0, 31.8, 29.4, 32.0)
wing <- c(95.1, 88.4, 97.9, 96.8, 92.3, 90.6)

cond <- mass/wing
cond
#> [1] 0.3807 0.3914 0.3166 0.3285 0.3185 0.3532

As you can see, when we divide one vector by another, R divides the first element of the first vector by the first element of the second vector, etc. and returns a vector.

Built-in functions

The power of R is most apparent in the large number of built-in functions that are available for users.

Functions are small bits of code that perform a specific task. Most functions accept one or more inputs called arguments and return a value or a new object.

Let’s say we have the following data on the number of ticks recorded on 5 dogs:

Individual Ticks
1 4
2 7
3 2
4 3
5 150

What is the total number of ticks recorded in the study? For that, we can use the built-in sum() function:

ticks <- c(4,7,2,3,150)

sum(ticks)
#> [1] 166

What is the mean number of ticks per dog?

mean(ticks)
#> [1] 33.2

And the variance?

var(ticks)
#> [1] 4267

Arguments

Every function takes a different set of arguments and in many cases you will need to look up what those arguments are. The best way to get help for a specific function is to type a question mark followed by the function name, which will bring up a help page in the bottom right panel. For example, the round function rounds a number to a specified number of decimal places. This is a useful function when we don’t want to print a really large number of digits:

?round

So we see round takes an argument called x, which is the number we want to round, and the number of digits we want to round to. If you provide the arguments in the exact same order as they are defined you don’t have to name them. For example, :

y <- mean(ticks)
y
#> [1] 33.2

round(y, 0)
#> [1] 33

If you do name the arguments, you can switch their order:

round(digits = 0, x = y)
#> [1] 33

Although you don’t have to name arguments, it’s a good idea to get in the habit of naming them. This will make you code easier to read, will help avoid mistakes that can occur when you don’t put the arguments in the correct order, and makes it easier to trouble shoot code that doesn’t do what you expect it to do.

Indexing vectors

Often you will need to work with just a subset of a vector. For example, maybe you have a vector of plant biomass measured along transects but you only need

y <- c(2, 4, 8, 4, 25)
y[c(1,3)]
#> [1] 2 8

Notice that to index certain elements of the vector y, we use square brackets. Inside those brackets, we provided an integer vector, where each integer refers to the position of elements in the first vector. The indexing vector can be any length (including 1).

We can also index vectors using a logical vector. A logical vector is a special type of object that contains values of TRUE or FALSE. When using a logical vector for indexing, the logical vector indicates which elements to keep (TRUE) or remove (FALSE) from the original vector. For this reason, the indexing vector must be same length as the focal vector; i.e., length(a) == length(v)

# Logical vector (which elements of y are greater than 4?)
y > 4
#> [1] FALSE FALSE  TRUE FALSE  TRUE
# Indexing using a logical vector (keep elements 3 and 5)
y[y > 4]
#> [1]  8 25

We can also use indexing to remove elements from a vector:

# Remove the second element
y[-2]
#> [1]  2  8  4 25

or to rearrange the order of a vector

y[c(5,4,3,2,1)]
#> [1] 25  4  8  4  2

Packages

One of R’s primary strengths is the large number of packages available to users. Packages are units of shareable code and data that have been created by other R users. We have already seen the built-in functions that R comes with. Packages allow users to share lots and lots of other functions that serve specific purposes. Packages also allow users to share data sets. There are packages for cleaning data, visualizing data, making maps, fitting specialized models, and basically anything else you can think of.

Accessing the code in a package first requires installing the package. This only needs to be done once per computer and is usually done using the install.packages() function:

install.packages("devtools")

Note that the name of the package (in this case devtools) must be in quotation marks. Packages installed using install.packages() are stored in a centralized repository called CRAN (Comprehensive R Archive Network). Once devtools (or any package) is installed on your computer, you do not need to re-run the install.packages() function unless you re-install/update R or need to update the package to a newer version.

Installing a package does not automatically make the functions from that package available in a given R session. To tell R where the functions come from, you must load the package using the library() function:

Unlike install.packages(), library() must be re-run each time you open R. Most people include a few calls to library() at the beginning of each script so that all packages needed to run the code are loaded at the beginning of the script.

Occasionally, some packages are stored in other places (e.g., github). These packages can be installed using different functions. For example, I created a package for this course that contains small data sets we will use in labs throughout the semester. The package is stored on github and can be installed by running:

install_github("RushingLab/FANR6750")

Note that the install_github() function is from the devtools package so you need to run library(devtools) before you install the package. Make sure you install the FANR6750 package now so you have access to the data sets.

Data frames

So far, we have only discussed one particular class of R object - vectors. Vectors hold a string of values as a single object. Although useful for many applications, vectors are limited in their ability to store multiple types of data (numeric and character).

This is where data frames become useful. Perhaps the most common type data object you will use in R is the data frame. Data frames are tabular objects (rows and columns) similar in structure to spreadsheets (think Excel or GoogleSheets). In effect, data frames store multiple vectors - each column of the data frame is a vector. As such, each column can be a different class (numeric, character, etc.) but all values within a column must be the same class. Just as the first row of an Excel spreadsheet can be a list of column names, each column in a data frame has a name that (hopefully) provides information about what the values in that column represent.

To see how data frames work, let’s load a data frame called jayData that comes with the FANR6750 package.


Note - As discussed above, if you want to access function or data sets that come with packages, you first need to load the package in your current working environment. To do that, use the library() function, with the unquoted package name as the argument. Once loaded, all of the package’s functions are available to use.

Alternatively, you can access functions from a given package without loading the package using package.name::function.name(). For example, if you want to use the filter() function from the dplyr package, you could type dplyr::filter(). Although less commonly used, this method has a few advantages:

  1. Sometimes different packages have functions with the same names. R will default to using the function from the package that was loaded last. For example, the raster package also has a function called filter() so if you load dplyr first (using library()and then raster, R will default to using raster’s filter() function, which could cause problems.

  2. If you share your code with others, the :: method makes it clear which packages are being use for which functions. That additional clarity is often helpful and is the reason I will often use :: in this course.


To get a quick idea of what information this data frame contains, we can use the head() and tail() functions, which will print the first and last 6 rows of the data frame:

library(FANR6750)
data("jayData") # the data() function loads data sets the come with packages

head(jayData)
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124       423   0.00      0.02     Oak   Med   34
#> 2 261937 3769224       506   0.10      0.45     Oak   Med   38
#> 3 246337 3764124       859   0.00      0.26     Oak  High   40
#> 4 239437 3763524      1508   0.02      0.03    Pine   Med   43
#> 5 239437 3767724       483   0.26      0.37     Oak   Med   36
#> 6 236437 3769524       830   0.00      0.01     Oak   Low   39

tail(jayData)
#>          x       y elevation forest chaparral habitat seeds jays
#> 95  258937 3767124       804   0.19      0.68     Oak   Med   40
#> 96  259837 3768024       210   0.00      0.00     Oak   Low   33
#> 97  249337 3769524       467   0.70      0.09    Pine   Med   36
#> 98  262237 3767424      1318   0.02      0.23     Oak   Med   44
#> 99  261937 3770124       354   0.00      0.05    Bare   Low   33
#> 100 247837 3769524       686   0.10      0.32     Oak   Med   40

We can see that jayData contains eight columns: x, y, elevation, forest, chaparral, habitat, seeds, and jays. We’ll learn more about what each of these columns represents later in the semester, though just like functions, many data sets have help pages also and you can access those help pages using ?jayData. Several other useful functions for investigating the structure of data frames are str() and summary()


str(jayData)
#> 'data.frame':    100 obs. of  8 variables:
#>  $ x        : num  258637 261937 246337 239437 239437 ...
#>  $ y        : num  3764124 3769224 3764124 3763524 3767724 ...
#>  $ elevation: int  423 506 859 1508 483 830 457 304 834 164 ...
#>  $ forest   : num  0 0.1 0 0.02 0.26 0 0.02 0 0.54 0 ...
#>  $ chaparral: num  0.02 0.45 0.26 0.03 0.37 0.01 0.22 0.09 0.21 0.11 ...
#>  $ habitat  : chr  "Oak" "Oak" "Oak" "Pine" ...
#>  $ seeds    : chr  "Med" "Med" "High" "Med" ...
#>  $ jays     : int  34 38 40 43 36 39 38 35 41 33 ...

summary(jayData)
#>        x                y             elevation        forest      
#>  Min.   :230737   Min.   :3761424   Min.   :  12   Min.   :0.0000  
#>  1st Qu.:238762   1st Qu.:3765324   1st Qu.: 365   1st Qu.:0.0000  
#>  Median :245587   Median :3766824   Median : 548   Median :0.0000  
#>  Mean   :246949   Mean   :3767130   Mean   : 659   Mean   :0.0553  
#>  3rd Qu.:254662   3rd Qu.:3768699   3rd Qu.: 929   3rd Qu.:0.0300  
#>  Max.   :266137   Max.   :3773724   Max.   :1537   Max.   :0.7000  
#>    chaparral       habitat             seeds                jays     
#>  Min.   :0.000   Length:100         Length:100         Min.   :30.0  
#>  1st Qu.:0.080   Class :character   Class :character   1st Qu.:36.0  
#>  Median :0.210   Mode  :character   Mode  :character   Median :38.0  
#>  Mean   :0.241                                         Mean   :38.6  
#>  3rd Qu.:0.370                                         3rd Qu.:41.0  
#>  Max.   :0.850                                         Max.   :48.0

str() tells us about the structure of the data frame, for example x and y are numeric columns and habitat contains character strings. summary() provides some simple summary statistics for each variable.

Another useful function is nrow(), which tells us now many rows are in the data frame (similar to length() for vectors):

nrow(jayData)
#> [1] 100

Subsetting data frames

As you will see shortly, one of the most common tasks when working with data frames is creating new objects from parts of the full data frame. This task involves subsetting the data frame - selecting specific rows and columns. There are many ways of subsetting data frames in R, too many to discuss so we will only learn about a few.

Selecting columns

First, we may want to select a subset of all of the columns in a big data frame. Data frames are essentially tables, which means we can reference both rows and columns by their number: data.frame[row#, column#]. The row and column numbers have to put inside of square brackets following the name of the data frame object. The row number always comes first and the column number second. If you want to select all rows of a specific column, you just leave the row# blank. For example, if we wanted a vector containing the number of jays at each survey location:

jayData[,8]
#>   [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#>  [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#>  [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#>  [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40

We can also select columns using data.frame$column (where data.frame is the name of the data frame object and column is the name of the column). For example,

jayData$jays
#>   [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#>  [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#>  [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#>  [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40

Notice that if you hit tab after you type the $, RStudio will bring up all of the columns and you can use the up or down buttons to find the one you want.

Sometimes you may want to select more than one column. The easiest way to do that is to use the select() function in the dplyr package:

library(dplyr)

head(select(.data = jayData, x, y, jays))
#>        x       y jays
#> 1 258637 3764124   34
#> 2 261937 3769224   38
#> 3 246337 3764124   40
#> 4 239437 3763524   43
#> 5 239437 3767724   36
#> 6 236437 3769524   39

Notice that select requires us to first provide the data frame object (.data = jayData) and then we provide the column names (unquoted!) we want to select. You can also use select to remove columns:

head(select(.data = jayData, -seeds))
#>        x       y elevation forest chaparral habitat jays
#> 1 258637 3764124       423   0.00      0.02     Oak   34
#> 2 261937 3769224       506   0.10      0.45     Oak   38
#> 3 246337 3764124       859   0.00      0.26     Oak   40
#> 4 239437 3763524      1508   0.02      0.03    Pine   43
#> 5 239437 3767724       483   0.26      0.37     Oak   36
#> 6 236437 3769524       830   0.00      0.01     Oak   39

It’s important to realize that even though there are different ways to accomplish the same general task (in this case, “subset columns of a data frame”), those methods will sometimes differ in subtle but important ways.

For example, what type of object did the data.frame[,col#] and data.frame$column.name options return? What type of object did select(data.frame, col1, col2) return? Those are very different results and knowing what the output will be can be useful when deciding the best way to accomplish your task.


Filtering rows

To select specific rows, we can use the row# method we learned above, this time leaving the columns blank:

jayData[1,]
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124       423      0      0.02     Oak   Med   34

If we want more than one row, we just put in a vector with all of the rows we want:

jayData[1:2,]
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124       423    0.0      0.02     Oak   Med   34
#> 2 261937 3769224       506    0.1      0.45     Oak   Med   38

jayData[c(1,30),]
#>         x       y elevation forest chaparral habitat seeds jays
#> 1  258637 3764124       423      0      0.02     Oak   Med   34
#> 30 259537 3765924      1419      0      0.07    Pine   Med   43

Note that we can use the square brackets to also subset vectors, in which case we don’t need the comma as long as you tell R which column you want first:

jayData$jays[1]
#> [1] 34

Sometimes, we may not know the specific row number(s) we want but we do know the value of one of the columns we want to keep. Using the filter() function in the dplyr package allows us to filter rows based on the value of one of the variables. For example, if we want just surveys that were conducted in oak habitat, we use:

head(filter(jayData, habitat == "Oak"))
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124       423   0.00      0.02     Oak   Med   34
#> 2 261937 3769224       506   0.10      0.45     Oak   Med   38
#> 3 246337 3764124       859   0.00      0.26     Oak  High   40
#> 4 239437 3767724       483   0.26      0.37     Oak   Med   36
#> 5 236437 3769524       830   0.00      0.01     Oak   Low   39
#> 6 263737 3766524       457   0.02      0.22     Oak   Med   38

Notice the need for two equals signs (==) when telling R we want the row where habitat equals Oak. Filter makes it very easy to select multiple rows using operators like greater than, less than, etc.

head(filter(jayData, elevation > 1000))
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 239437 3763524      1508   0.02      0.03    Pine   Med   43
#> 2 261637 3768324      1276   0.02      0.36     Oak  High   44
#> 3 248737 3766524      1024   0.03      0.41    Pine   Low   45
#> 4 255937 3765024      1400   0.02      0.45     Oak  High   48
#> 5 259537 3765924      1419   0.00      0.07    Pine   Med   43
#> 6 245737 3762924      1004   0.02      0.32     Oak   Low   41

or a slightly more complicated example:

head(filter(jayData, elevation < 1000 & habitat == "Oak"))
#>        x       y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124       423   0.00      0.02     Oak   Med   34
#> 2 261937 3769224       506   0.10      0.45     Oak   Med   38
#> 3 246337 3764124       859   0.00      0.26     Oak  High   40
#> 4 239437 3767724       483   0.26      0.37     Oak   Med   36
#> 5 236437 3769524       830   0.00      0.01     Oak   Low   39
#> 6 263737 3766524       457   0.02      0.22     Oak   Med   38