lab01_intro_to_R.Rmd
R
?
R is a free, open-source programming language and software environment for statistical computing, bioinformatics, visualization, and general computing.
It is based on an ever-expanding set of analytical packages that perform specific analytical, plotting, and other programming tasks.
R
?
R is free(!), runs on pretty much every operating system, and has a huge user base.
R is far from the only programming language for working with data, but it is the most widely used language in the fields of ecology, evolution, and wildlife sciences. If you plan to pursue a career in any of these fields, proficiency in R is quickly becoming a prerequisite for many jobs.
Even if you don’t pursue a career in one of these fields, the ability to manipulate, analyze, and visualize data (otherwise known as data science) is an extremely marketable skill in many professions right now.
We will go over the basics of using R during lab sessions, but there are many good online resources for learning R and getting help. A few of my favorites (from which some of this material is developed) include:
Tom Edward’s online Learning R course
John Fieberg’s online Statistics for Ecologists book
Of course, if you encounter error messages you don’t understand or need help figuring out how to accomplish something in R, Google is your best friend (even the most experienced R users use Google on a daily basis). The key to finding answers on Google is asking the right questions. Because we will not spend much time on this topic in lab, please refer to these links for advice on formulating R-related questions:
Seeking help from Data Analysis and Visualization in R for Ecologists
R - the very basics
Although users can work directly in R, most choose to use RStudio, which is an IDE (Integrated Development Environment) for R. To use RStudio, you must first have R installed.
After opening RStudio, you will see 3 panes (footnote 1).
Console: The console will appear on the left side of your screen. You can type code directly into the console (also known as the command line) and it will be executed immediately. The console is also where output from the tasks you execute in R will be shown.
Environment pane: The environment pane will appear in the top right of your screen. Here, you can see objects that you have created in R as well as the values of those objects and how R interprets them (more on this later). The environment pane also includes a few other tabs but we will not require their use in this class.
Plot pane: The plot pane will appear in the bottom right of your screen. As you might imagine, this is where graphics that you have created in R will be displayed. This pane also includes several other useful tabs including the Files tab (which allows you to navigate and manage files), the Packages tab (where you can install and manage additional R packages), and the Help tab (where you can search R documentation pages).
R as a calculator
As a statistical programming tool, one thing R is very good at is doing math. So as a starting point, let’s treat R like a fancy calculator.
We interact with this calculator by typing numbers and operators (+, -, *, /) into the Console window.
Let’s try it - in the bottom left window (the Console), write the R code required to add two plus two and then press enter:
2+2
When you run the code, you should see the answer printed in the console directly below your code. Play with your code a bit - try changing the numbers and the operators and then run the code again.
We can run R like a calculator by typing equations directly into the console and then printing the answer. But usually we don’t want to just do a calculation and see the answer. Instead, we assign values to objects. That object is then saved in R’s memory, which allows us to use that object later in our analysis.
This might seem a bit confusing if you are new to programming so let’s try it. The following code creates an object called x and assigns it a value of 3:
x <- 3
The operator <- (footnotes 2 and 3) is how we do assignments in R. Whatever is to the left of <- is the object’s name and whatever is to the right is the value. As we will see later, objects can be much more complex than simply a number but for now, we’ll keep it simple.
You try it - change the code to create an object called new.x. Instead of assigning new.x a number, give it a calculation, for example 25/5. What do you think the value of new.x is?
It’s a good idea to give objects names that tell you something about what the object represents. Names can be as long as you want them to be but should not have spaces (also remember that long names require more typing, so brevity is a good rule of thumb). Names can contain both numbers and letters but cannot begin with a number. R is also case-sensitive so, for example, Apple is not the same as apple. When creating object names, it is also a good idea to avoid words that already exist in R as function names. While R is generally smart enough to distinguish between you attempting to create an object vs use a function, avoiding such names will save you headaches when interpreting your code (especially code that you have not looked at in a while).
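The naming rules above can be sketched with a few assignments. The object names and values here (apple, Apple, mass_g, site2) are made up for illustration only.

```r
# Object names are case-sensitive: these are two different objects
apple <- 3
Apple <- 10
apple == Apple
#> [1] FALSE

# Names can mix letters and numbers, but cannot begin with a number
mass_g <- 36.2      # valid name
site2  <- "Oak"     # valid name (the number is not the first character)
# 2site <- "Oak"    # invalid name: uncommenting this line produces a syntax error
```

Running the commented-out line is a quick way to see what R's error message for an invalid name looks like.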
In the exercise above, you may have noticed that after running the code, R did not print anything. That is because we simply told R to create the object (in the top right window, if you click on the Environment tab, you should see x and new.x). Now that it is stored in R’s memory, we can do a lot of things with it. For one, we can print it to see the value. To do that, we simply type the name of the object and run the code (footnote 4):
new.x <- 25/5
new.x
#> [1] 5
We can also use objects to create new objects. What do you think the following code does?
x <- 3
y <- x*4
After running it, print the new object y to see its value. Were you right?
The console is useful for doing simple tasks but as our analyses become more complicated, the console is not very efficient. What if you need to go back and change a line of code? What if you want to show your code to someone else to get help?
Instead of using the console, most of our work will be done using scripts (the source editor pane). Scripts are special files that allow us to write, save, and run many lines of code. Scripts can be saved so you can work on them later or send them to collaborators.
To create a script, click File -> New File -> R Script. This new file should show up in a new window.
R will ignore any text that follows a # on the same line.
This is very useful for making your code more readable for both
yourself and others. Use comments to remind yourself what a newly
created object is, to explain what a line of code does, to leave
yourself a reminder for later, etc. For example, in the previous code,
it might be a good idea to use comments to define what each object
represents:
n1 <- 44 # Number of individuals captured on first occasion
n2 <- 32 # Number of individuals captured on second occasion
m2 <- 15 # Number of previously marked individuals captured on second occasion
Notice that when you run this code, R ignores the comments.
Now that you have created a new R script, you need to be able to save this file somewhere on your computer. To do this, we can set up a working directory. In addition to providing a place to save your script, setting up the working directory also tells R where you would like to put files that come from your data management and analyses (e.g. spreadsheets or graphics) as well as where to find source data that you plan to use for this particular project.
There are two ways to set up a working directory within R.
You can choose to set up a working directory by clicking Session –> Set working directory –> Choose directory and navigating to the folder where you would like to store the files. If you have opened an R script and are unsure where the current working directory is located, you can run getwd() to see the current working directory.
You can set the working directory directly in the R script using the setwd() function. For example, to set my working directory in a folder called Lab_1 on my desktop, I would run the following line of code:
setwd("C:/Users/mab46065/Desktop/Lab_1")
Notice that although your computer will probably create the folder pathway using the backslash (\), R will require forward slashes (/) instead. Also, if you are using a Mac, you will omit C: from the directory name.
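As a minimal sketch of how getwd() and setwd() work together, the code below switches to a temporary folder and then switches back. The temporary folder from tempdir() is only a stand-in (an assumption for illustration) for whatever lab folder you actually use.

```r
old_wd <- getwd()     # remember the current working directory
new_wd <- tempdir()   # a folder that always exists; stand-in for your own lab folder

setwd(new_wd)         # change the working directory
getwd()               # confirm that the working directory has changed

setwd(old_wd)         # switch back to the original working directory
```

Storing the original directory first makes it easy to undo the change if you set the wrong folder.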
Up to this point, we have only briefly talked about creating objects in R. Here, we will discuss different object types in R. It is important to know what types of objects (e.g. vectors, lists, matrices, factors, data frames, arrays) you are working with because R will interpret them differently and different object types will be required to perform certain tasks. We will learn about each of those data structures as we encounter them in lab exercises.
So far, we have only been working with objects that store a single number. However, often it is more convenient to store a string of numbers as a single object. In R, these strings are called vectors and they are usually created by enclosing the comma-separated values between c( and ):
x <- c(3,5,2,5)
x
#> [1] 3 5 2 5
You can also create sequences of consecutive numbers in a few different ways:
x <- 1:10
x
#> [1] 1 2 3 4 5 6 7 8 9 10
x2 <- seq(from = 1, to = 10, by = 1)
x2
#> [1] 1 2 3 4 5 6 7 8 9 10
The seq() function is very flexible and useful so if you are not familiar with it, be sure to look at the help page to better understand how to use it.
Another useful function for creating vectors is rep(), which repeats values of a vector:
rep(x2, times = 2)
#> [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
or:
rep(x2, each = 2)
#> [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
Be sure you notice the difference between using the times argument vs the each argument!
The function class() indicates the class (the type of element) of an object:
class(x)
#> [1] "integer"
A vector can also contain characters (though you cannot mix numbers and characters in the same vector!):
occasions <- c("Occasion1", "Occasion2", "Occasion3")
occasions
#> [1] "Occasion1" "Occasion2" "Occasion3"
class(occasions)
#> [1] "character"
The quotes around “Occasion1”, “Occasion2”, and “Occasion3” are critical. Without the quotes, R will assume there are objects called Occasion1, Occasion2 and Occasion3. As these objects don’t exist in R’s memory, there will be an error message.
Vectors can be any length (including 1; in fact, the numeric objects we’ve been working with are just vectors with length 1). The function length() tells you how long a vector is:
length(x)
#> [1] 10
What is the class of a vector with both numeric and characters entries? Hint:
mixed <- c(1, 2, "3", "4")
You can also use the c() function to add other elements to your vector:
y <- c(x, 4,8,3)
Another class of vectors is referred to as factors. Factors are similar to character vectors in that R interprets them as text strings and you cannot perform math on them. The difference, however, is that R sees factors as grouping variables. Each category within the factor is referred to as a ‘level’ (footnote 5).
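A small sketch of creating a factor, assuming a made-up vector of site IDs (the object names site and site_f and their values are hypothetical):

```r
site <- c(1, 2, 2, 3, 1, 3)    # hypothetical site IDs stored as numbers
site_f <- as.factor(site)      # convert the numbers into a grouping variable

class(site_f)
#> [1] "factor"
levels(site_f)                 # the categories (levels) of the factor
#> [1] "1" "2" "3"
nlevels(site_f)                # how many levels there are
#> [1] 3
```

Although the levels look like numbers, R now treats them as category labels rather than values to do arithmetic on.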
One of the most useful properties of vectors in R is that we can use them to simplify basic arithmetic operations that need to be done on multiple observations. For example, consider the following data on wing chord (a measure of wing length) and body mass of Swainson’s thrushes (Catharus ustulatus):
Individual | Mass (g) | Wing chord (mm) |
---|---|---|
1 | 36.2 | 95.1 |
2 | 34.6 | 88.4 |
3 | 31.0 | 97.9 |
4 | 31.8 | 96.8 |
5 | 29.4 | 92.3 |
6 | 32.0 | 90.6 |
Perhaps we want to derive the body condition of each individual based on these measures. One common metric of body condition used by ornithologists is \(\frac{mass}{size}\), where wing chord is used as a proxy for body size. We could calculate body condition for each individual:
cond1 <- 36.2/95.1 # Body condition of the first individual
cond2 <- 34.6/88.4 # Body condition of the second individual
But that is time consuming and error prone. Luckily, R will vectorize basic arithmetic:
mass <- c(36.2, 34.6, 31.0, 31.8, 29.4, 32.0)
wing <- c(95.1, 88.4, 97.9, 96.8, 92.3, 90.6)
cond <- mass/wing
cond
#> [1] 0.3807 0.3914 0.3166 0.3285 0.3185 0.3532
As you can see, when we divide one vector by another, R divides the first element of the first vector by the first element of the second vector, etc. and returns a vector. Vectorized arithmetic works well when the vectors that we are using are of the same length. What would happen though if you were to perform arithmetic on vectors that were of different lengths? Try running the following code and seeing what R is doing with these vectors.
a <- c(1,10,100,1000)
b <- c(1,2,3,4,5)
c <- a/b
c
#> [1] 1.00 5.00 33.33 250.00 0.20
x <- c(1,10,100,1000, 10000)
y <- c(1,2,3,4)
z <- x/y
z
#> [1] 1.00 5.00 33.33 250.00 10000.00
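Notice that when the vectors differ in length, R recycles the shorter vector, reusing its elements from the beginning until it matches the length of the longer one. In both examples above, the first element of the shorter vector is reused for the fifth calculation.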
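Recycling is easiest to see when the longer vector's length is an exact multiple of the shorter one's. The sketch below uses a two-element vector b chosen only for illustration.

```r
a <- c(1, 10, 100, 1000)
b <- c(1, 2)

# b is recycled as 1, 2, 1, 2 to match the length of a
a / b
#> [1]   1   5 100 500

# A single number is a vector of length 1, so it is recycled for every element
a * 2
#> [1]    2   20  200 2000
```

When the longer length is not a multiple of the shorter one, as in the two examples above, R still returns a result but also issues a warning that the longer object length is not a multiple of the shorter object length.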
Often you will need to work with just a subset of a vector. For example, maybe you have a vector of plant biomass measured along transects but you only need the first and third observations.
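The code for this step is not shown in this section, so the following is a sketch. The values of y are an assumption inferred from the results printed in the rest of this section.

```r
y <- c(2, 4, 8, 4, 25)   # assumed values, consistent with the output shown below

# Keep only the first and third elements
y[c(1, 3)]
#> [1] 2 8
```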
Notice that to index certain elements of the vector y, we use square brackets. Inside those brackets, we provide an integer vector, where each integer refers to the position of an element in the first vector. The indexing vector can be any length (including 1).
We can also index vectors using a logical vector. A logical vector is a special type of object that contains values of TRUE or FALSE. When using a logical vector for indexing, the logical vector indicates which elements to keep (TRUE) or remove (FALSE) from the original vector. For this reason, the indexing vector should be the same length as the focal vector; i.e., for an indexing vector a and a focal vector v, length(a) == length(v).
# Logical vector (which elements of y are greater than 4?)
y > 4
#> [1] FALSE FALSE TRUE FALSE TRUE
# Indexing using a logical vector (keep elements 3 and 5)
y[y > 4]
#> [1] 8 25
We can also use indexing to remove elements from a vector:
# Remove the second element
y[-2]
#> [1] 2 8 4 25
or to rearrange the order of a vector:
y[c(5,4,3,2,1)]
#> [1] 25 4 8 4 2
The power of R is most apparent in the large number of built-in functions that are available for users.
Functions are small bits of code that perform a specific task. Most functions accept one or more inputs called arguments and return a value or a new object.
Let’s say we have the following data on the number of ticks recorded on 5 dogs:
Individual | Ticks |
---|---|
1 | 4 |
2 | 7 |
3 | 2 |
4 | 3 |
5 | 150 |
What is the total number of ticks recorded in the study? For that, we can use the built-in sum() function:
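The code for this step is not shown in this section. A minimal sketch, assuming the counts from the table above are stored in a vector called ticks (the name used in the code that follows):

```r
ticks <- c(4, 7, 2, 3, 150)   # tick counts for the 5 dogs in the table above

sum(ticks)                    # total number of ticks
#> [1] 166
```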
What is the mean number of ticks per dog?
mean(ticks)
#> [1] 33.2
And the variance?
var(ticks)
#> [1] 4267
Every function takes a different set of arguments and in many cases you will need to look up what those arguments are. The best way to get help for a specific function is to type a question mark followed by the function name, which will bring up a help page in the bottom right panel. For example, the round function rounds a number to a specified number of decimal places. This is a useful function when we don’t want to print a really large number of digits:
?round
So we see round takes an argument called x, which is the number we want to round, and an argument called digits, which is the number of decimal places we want to round to. If you provide the arguments in the exact same order as they are defined, you don’t have to name them.
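A sketch of matching arguments by position. It assumes that y holds the mean tick count; the definition of y is not shown in this section, so this is an inference from the result printed below.

```r
ticks <- c(4, 7, 2, 3, 150)
y <- mean(ticks)        # assumed definition of y (33.2)

# Unnamed arguments are matched by position: x first, then digits
round(y, 0)
#> [1] 33

# Naming only the second argument also works
round(y, digits = 1)
#> [1] 33.2
```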
If you do name the arguments, you can switch their order:
round(digits = 0, x = y)
#> [1] 33
Although you don’t have to name arguments, it’s a good idea to get in the habit of naming them. This will make your code easier to read, will help avoid mistakes that can occur when you don’t put the arguments in the correct order, and will make it easier to troubleshoot code that doesn’t do what you expect it to do.
Matrices are similar to vectors but have two dimensions. The first dimension shows the number of rows in the matrix and the second shows the number of columns. Here, we have combined multiple vectors to create a matrix. Notice that the vectors will need to be the same length.
Site <- c(1,2,3,4,5)
Species <- c('Alasmidonta varicosa',
'Alasmidonta varicosa',
'Alasmidonta varicosa',
'Lasmigona decorata',
'Lasmigona decorata')
Year <- c(rep(2023,5))
mymatrix <- cbind(Site, Species, Year)
mymatrix
#> Site Species Year
#> [1,] "1" "Alasmidonta varicosa" "2023"
#> [2,] "2" "Alasmidonta varicosa" "2023"
#> [3,] "3" "Alasmidonta varicosa" "2023"
#> [4,] "4" "Lasmigona decorata" "2023"
#> [5,] "5" "Lasmigona decorata" "2023"
Notice that because matrices can only contain one data class, all of the numeric vectors have been coerced to be characters. While matrices have many uses in R, this is one drawback which will lead us directly to our next object type.
Although useful for many applications, vectors and matrices are limited in their ability to store multiple types of data (numeric and character).
This is where data frames become useful. Perhaps the most common type of data object you will use in R is the data frame. Data frames are tabular objects (rows and columns) similar in structure to spreadsheets (think Excel or Google Sheets). In effect, data frames store multiple vectors - each column of the data frame is a vector. The advantage they have over matrices is that each column can be a different class (numeric, character, etc.) but all values within a column must be the same class. Just as the first row of an Excel spreadsheet can be a list of column names, each column in a data frame has a name that (hopefully) provides information about what the values in that column represent.
To see how data frames work, let’s load a data frame called jaydata that comes with the FANR6750 package.
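To see the contrast with matrices, here is a minimal sketch that combines the same three vectors from the matrix example with data.frame() instead of cbind(). The object name mydata is made up for illustration.

```r
Site <- c(1, 2, 3, 4, 5)
Species <- c('Alasmidonta varicosa', 'Alasmidonta varicosa', 'Alasmidonta varicosa',
             'Lasmigona decorata', 'Lasmigona decorata')
Year <- rep(2023, 5)

# stringsAsFactors = FALSE keeps Species as character in every version of R
mydata <- data.frame(Site, Species, Year, stringsAsFactors = FALSE)

class(mydata$Site)      # the numeric column stays numeric
#> [1] "numeric"
class(mydata$Species)   # the character column stays character
#> [1] "character"
```

Unlike the matrix, the data frame keeps each column's original class.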
One of R’s primary strengths is the large number of packages available to users. Packages are units of shareable code and data that have been created by other R users. We have already seen the built-in functions that R comes with. Packages allow users to share lots and lots of other functions that serve specific purposes. Packages also allow users to share data sets. There are packages for cleaning data, visualizing data, making maps, fitting specialized models, and basically anything else you can think of.
Accessing the code in a package first requires installing the package. This only needs to be done once per computer and is usually done using the install.packages() function:
install.packages("devtools")
Note that the name of the package (in this case devtools) must be in quotation marks. By default, install.packages() downloads packages from a centralized repository called CRAN (Comprehensive R Archive Network). Once devtools (or any package) is installed on your computer, you do not need to re-run the install.packages() function unless you re-install/update R or need to update the package to a newer version.
Installing a package does not automatically make the functions from that package available in a given R session. To tell R where the functions come from, you must load the package using the library() function (footnote 6), for example library(devtools).
Unlike install.packages(), library() must be re-run each time you open R. Most people include a few calls to library() at the beginning of each script so that all packages needed to run the code are loaded at the beginning of the script.
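To see what loading a package does, the sketch below uses splines, a package that ships with R (so nothing needs to be installed first), as a stand-in for any installed package. The search() function lists the packages currently attached to the session.

```r
# Is the splines package attached to the current session yet?
"package:splines" %in% search()

library(splines)   # load the package; the package name is unquoted

# After library(), the package appears in the search path
"package:splines" %in% search()
#> [1] TRUE
```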
Occasionally, some packages are stored in other places (e.g., github). These packages can be installed using different functions. For example, I created a package for this course that contains small data sets we will use in labs throughout the semester. The package is stored on github and can be installed by running:
install_github("RushingLab/FANR6750")
Note that the install_github() function is from the devtools package so you need to run library(devtools) before you install the package. Make sure you install the FANR6750 package now so you have access to the data sets.
Note - As discussed above, if you want to access functions or data sets that come with packages, you first need to load the package in your current working environment. To do that, use the library() function, with the unquoted package name as the argument. Once loaded, all of the package’s functions are available to use.
Alternatively, you can access functions from a given package without loading the package using package.name::function.name(). For example, if you want to use the filter() function from the dplyr package, you could type dplyr::filter(). Although less commonly used, this method has a few advantages:
Sometimes different packages have functions with the same names. R will default to using the function from the package that was loaded last. For example, the raster package also has a function called filter() so if you load dplyr first (using library()) and then raster, R will default to using raster’s filter() function, which could cause problems.
If you share your code with others, the :: method makes it clear which packages are being used for which functions. That additional clarity is often helpful and is the reason I will often use :: in this course.
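A small sketch of the :: syntax. It uses the stats and utils packages, which ship with R, so it runs without installing anything; the same pattern applies to dplyr::filter() once dplyr is installed.

```r
x <- c(3, 5, 2, 5)

# Call median() and state explicitly that it comes from the stats package
stats::median(x)
#> [1] 4

# Call head() from the utils package to keep the first two elements
utils::head(x, 2)
#> [1] 3 5
```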
To get a quick idea of what information this data frame contains, we can use the head() and tail() functions, which will print the first and last 6 rows of the data frame:
library(FANR6750)
data("jaydata") # the data() function loads data sets that come with packages
head(jaydata)
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 4 239437 3763524 1508 0.02 0.03 Pine Med 43
#> 5 239437 3767724 483 0.26 0.37 Oak Med 36
#> 6 236437 3769524 830 0.00 0.01 Oak Low 39
tail(jaydata)
#> x y elevation forest chaparral habitat seeds jays
#> 95 258937 3767124 804 0.19 0.68 Oak Med 40
#> 96 259837 3768024 210 0.00 0.00 Oak Low 33
#> 97 249337 3769524 467 0.70 0.09 Pine Med 36
#> 98 262237 3767424 1318 0.02 0.23 Oak Med 44
#> 99 261937 3770124 354 0.00 0.05 Bare Low 33
#> 100 247837 3769524 686 0.10 0.32 Oak Med 40
We can see that jaydata contains eight columns: x, y, elevation, forest, chaparral, habitat, seeds, and jays. We’ll learn more about what each of these columns represents later in the semester, though just like functions, many data sets also have help pages, which you can access using ?jaydata. Two other useful functions for investigating the structure of data frames are str() and summary():
str(jaydata)
#> 'data.frame': 100 obs. of 8 variables:
#> $ x : num 258637 261937 246337 239437 239437 ...
#> $ y : num 3764124 3769224 3764124 3763524 3767724 ...
#> $ elevation: int 423 506 859 1508 483 830 457 304 834 164 ...
#> $ forest : num 0 0.1 0 0.02 0.26 0 0.02 0 0.54 0 ...
#> $ chaparral: num 0.02 0.45 0.26 0.03 0.37 0.01 0.22 0.09 0.21 0.11 ...
#> $ habitat : chr "Oak" "Oak" "Oak" "Pine" ...
#> $ seeds : chr "Med" "Med" "High" "Med" ...
#> $ jays : int 34 38 40 43 36 39 38 35 41 33 ...
summary(jaydata)
#> x y elevation forest
#> Min. :230737 Min. :3761424 Min. : 12 Min. :0.0000
#> 1st Qu.:238762 1st Qu.:3765324 1st Qu.: 365 1st Qu.:0.0000
#> Median :245587 Median :3766824 Median : 548 Median :0.0000
#> Mean :246949 Mean :3767130 Mean : 659 Mean :0.0553
#> 3rd Qu.:254662 3rd Qu.:3768699 3rd Qu.: 929 3rd Qu.:0.0300
#> Max. :266137 Max. :3773724 Max. :1537 Max. :0.7000
#> chaparral habitat seeds jays
#> Min. :0.000 Length:100 Length:100 Min. :30.0
#> 1st Qu.:0.080 Class :character Class :character 1st Qu.:36.0
#> Median :0.210 Mode :character Mode :character Median :38.0
#> Mean :0.241 Mean :38.6
#> 3rd Qu.:0.370 3rd Qu.:41.0
#> Max. :0.850 Max. :48.0
str() tells us about the structure of the data frame; for example, x and y are numeric columns and habitat contains character strings. summary() provides some simple summary statistics for each variable.
Another useful function is nrow(), which tells us how many rows are in the data frame (similar to length() for vectors):
nrow(jaydata)
#> [1] 100
As you will see shortly, one of the most common tasks when working with data frames is creating new objects from parts of the full data frame. This task involves subsetting the data frame - selecting specific rows and columns. There are many ways of subsetting data frames in R, too many to discuss, so we will only learn about a few.
First, we may want to select a subset of all of the columns in a big data frame. Data frames are essentially tables, which means we can reference both rows and columns by their number: data.frame[row#, column#]. The row and column numbers have to be put inside of square brackets following the name of the data frame object. The row number always comes first and the column number second. If you want to select all rows of a specific column, you just leave the row# blank. For example, if we wanted a vector containing the number of jays at each survey location:
jaydata[,8]
#> [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#> [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#> [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#> [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40
We can also select columns using data.frame$column (where data.frame is the name of the data frame object and column is the name of the column). For example,
jaydata$jays
#> [1] 34 38 40 43 36 39 38 35 41 33 34 37 37 38 42 43 39 37 38 40 37 35 37 44 45
#> [26] 37 36 34 48 43 39 41 45 38 35 38 39 38 41 38 36 43 38 36 33 41 38 30 39 36
#> [51] 39 36 34 30 38 37 44 36 36 40 44 48 37 41 42 30 41 39 43 30 42 42 41 38 36
#> [76] 37 33 44 38 35 45 41 35 38 37 45 33 42 34 45 40 42 40 44 40 33 36 44 33 40
Notice that if you hit tab after you type the $, RStudio will bring up all of the columns and you can use the up or down arrow keys to find the one you want.
Sometimes you may want to select more than one column. One way to do that is by indexing using the column names (footnote 7):
head(jaydata[, c('x', 'y', 'jays')])
#> x y jays
#> 1 258637 3764124 34
#> 2 261937 3769224 38
#> 3 246337 3764124 40
#> 4 239437 3763524 43
#> 5 239437 3767724 36
#> 6 236437 3769524 39
You can also use the select argument of the subset() function to remove columns:
head(subset(jaydata, select= -c(seeds)))
#> x y elevation forest chaparral habitat jays
#> 1 258637 3764124 423 0.00 0.02 Oak 34
#> 2 261937 3769224 506 0.10 0.45 Oak 38
#> 3 246337 3764124 859 0.00 0.26 Oak 40
#> 4 239437 3763524 1508 0.02 0.03 Pine 43
#> 5 239437 3767724 483 0.26 0.37 Oak 36
#> 6 236437 3769524 830 0.00 0.01 Oak 39
To select specific rows, we can use the row# method we learned above, this time leaving the columns blank:
jaydata[1,]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0 0.02 Oak Med 34
If we want more than one row, we just put in a vector with all of the rows we want:
jaydata[1:2,]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.0 0.02 Oak Med 34
#> 2 261937 3769224 506 0.1 0.45 Oak Med 38
jaydata[c(1,30),]
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0 0.02 Oak Med 34
#> 30 259537 3765924 1419 0 0.07 Pine Med 43
Note that we can also use square brackets to subset vectors, in which case we don’t need the comma. Here we first tell R which column we want using $ and then select the first element of that column:
jaydata$jays[1]
#> [1] 34
Sometimes, we may not know the specific row number(s) we want but we do know the value of one of the columns we want to keep. We can do this in R by indexing using logical subsetting. For example, if we want just surveys that were conducted in oak habitat, we use:
head(jaydata[jaydata$habitat == "Oak",])
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 5 239437 3767724 483 0.26 0.37 Oak Med 36
#> 6 236437 3769524 830 0.00 0.01 Oak Low 39
#> 7 263737 3766524 457 0.02 0.22 Oak Med 38
Notice the need for two equals signs (==) when telling R we want the rows where habitat equals Oak. We could also select multiple rows using operators like greater than, less than, etc.
head(jaydata[jaydata$elevation > 1000,])
#> x y elevation forest chaparral habitat seeds jays
#> 4 239437 3763524 1508 0.02 0.03 Pine Med 43
#> 24 261637 3768324 1276 0.02 0.36 Oak High 44
#> 25 248737 3766524 1024 0.03 0.41 Pine Low 45
#> 29 255937 3765024 1400 0.02 0.45 Oak High 48
#> 30 259537 3765924 1419 0.00 0.07 Pine Med 43
#> 32 245737 3762924 1004 0.02 0.32 Oak Low 41
or a slightly more complicated example:
head(jaydata[jaydata$elevation < 1000 & jaydata$habitat == "Oak",])
#> x y elevation forest chaparral habitat seeds jays
#> 1 258637 3764124 423 0.00 0.02 Oak Med 34
#> 2 261937 3769224 506 0.10 0.45 Oak Med 38
#> 3 246337 3764124 859 0.00 0.26 Oak High 40
#> 5 239437 3767724 483 0.26 0.37 Oak Med 36
#> 6 236437 3769524 830 0.00 0.01 Oak Low 39
#> 7 263737 3766524 457 0.02 0.22 Oak Med 38
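The same logical subsetting can be tried on a tiny, made-up data frame, where it is easy to check by eye which rows are kept. The object name surveys and its values are hypothetical and are not taken from jaydata.

```r
surveys <- data.frame(
  elevation = c(423, 1508, 859, 1276),
  habitat   = c("Oak", "Pine", "Oak", "Oak"),
  jays      = c(34, 43, 40, 44),
  stringsAsFactors = FALSE
)

# & keeps rows where BOTH conditions are TRUE (rows 1 and 3 here)
low_oak <- surveys[surveys$elevation < 1000 & surveys$habitat == "Oak", ]
low_oak

# | keeps rows where AT LEAST ONE condition is TRUE (rows 2 and 4 here)
high_or_pine <- surveys[surveys$elevation > 1000 | surveys$habitat == "Pine", ]
high_or_pine

# TRUE counts as 1, so summing a logical vector counts the matching rows
sum(surveys$habitat == "Oak")
#> [1] 3
```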
1. The location of these panes on the screen can be adjusted by clicking View –> Panes –> Pane Layout.
2. In most scenarios, the <- operator and = will produce equivalent results. However, there are some instances where = has a different meaning in R and this difference will not always be obvious to the user. To avoid issues with this, it is best to use the <- operator when assigning values to objects.
3. Because R is an object-based language, the assignment arrow is one of the most common operators that you will type. Although you can choose to literally type the less than sign and dash to create the assignment operator, some people find it more convenient to set up a keyboard shortcut. To do this, click Tools –> Modify keyboard shortcuts and search in the Filter box for the assignment operator. Keyboard shortcuts can be used for a variety of tasks in RStudio.
4. This method works well to see object values but it requires an additional line of code. Another method that allows you to see the value of an object directly after creating it is to surround the assignment with parentheses. For example, the line of code (x <- 3) will not only create the object x and assign it a value of 3, but it will also display the value without having to retype the object name.
5. Note the use of as.factor() to convert what looks like an integer vector into a factor. Functions such as as.factor(), as.character(), and as.numeric() are useful to coerce vectors into different object classes.
6. For the purposes of this class, using library() or using require() will produce equivalent results. These two functions, however, are designed for different purposes. While library() is designed to be a stand-alone function which loads packages, require() is designed to be nested within a larger function statement. If the package you are attempting to load is not installed, library() will produce an error message immediately, while require() will not. As a result, using library() will make it easier for you to diagnose where in your script the error is coming from.
7. You may begin to notice that as we stack functions inside of each other, it becomes difficult to keep track of matching parentheses and brackets. Newer versions of RStudio have resolved this problem using a handy tool. Click Code –> Rainbow parentheses to make color-matched parentheses and brackets for easier code reading.