projects_and_directories.Rmd
Everything that follows in this tutorial should be viewed as suggestions for how to organize your data, code, and results. These suggestions are based on hard-won lessons from my own work, as well as advice from people who spend a lot of time thinking about efficient and reproducible workflows. The lessons below are unabashedly R/RStudio focused. However, they are just suggestions. There are many other ways of accomplishing these objectives, so you are free to use whatever workflow works best for you. Even if you choose to use different workflows or tools that are not as R-focused, hopefully the concepts presented here will help make your process more efficient and organized.
A common workflow used by many novice R users goes something like this:

1. Open base R or maybe RStudio

2. Create a script and save it as `analysis_for_my_project.R`

3. Start the script with:

    ```r
    rm(list = ls())
    setwd("C:\\Users\\crushing\\path\\that\\only\\I\\have")
    library(package1)
    library(package2)

    x <- 1:10
    y <- "blah"
    ```

4. Realize you need to be working on your dissertation instead of this side project, so open another script you started earlier

5. Get confused about which objects in your current environment were created by which script

6. `rm(list = ls())` to start from scratch

7. Rinse and repeat
Remember that when you open R (or RStudio), you create a global workspace. This is where all of the objects you create get stored. If you type `ls()` you can see everything that is currently stored in your workspace. If you are working on different analyses in the same global environment, it is very easy to accidentally create different objects with the same name, to write code that depends on other code that happens to have been run previously during one R session (for example, you loaded a needed package in one script so didn't add `library(package)` to your next script. What happens when you try to run that code the next time? This is often called a "hidden dependency" and can make your life pretty miserable), and to run into many other confusing behaviors.
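To make "hidden dependency" concrete, here is a minimal sketch (the file and column names are made up):

```r
# script_A.R --- run earlier in the same R session
library(dplyr)                      # dplyr gets loaded here
surveys <- read.csv("surveys.csv")  # hypothetical data file

# script_B.R --- written later, run in the same session
# mutate() only works because script_A.R already loaded dplyr.
# In a fresh session, this line fails with
# 'could not find function "mutate"' -- a hidden dependency.
surveys <- mutate(surveys, weight_kg = weight_g / 1000)
```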
This workflow also makes it very difficult to share your work. Remember that R will default to looking in (and saving to) the current working directory. To control this behavior, most people are taught to use `setwd()`. The problem with this is that the path you set using `setwd()` is almost certainly specific to the computer you used to write that code. If you use a different computer later or send code to your adviser for help, they will not be able to run it without changing the working directory. If you move the files to a new place on your computer, all of your code will break. Got a new computer? Oops, none of your code will work anymore. If you are running multiple analyses in the same R session, you'll also have to constantly change your working directory, which eventually will lead to confusion and problems rerunning code. Manually setting the working directory seems like a minor annoyance, but if you have to do it constantly (for example, an instructor who has to grade a bunch of homework assignments), these minor annoyances add up to a big headache.
The good news is, there’s an easy solution that will make your life much easier.
Experienced programmers learned about the problems outlined above long ago. To solve them, they typically use a self-contained directory (i.e., folder) for each project. Each directory contains all of the data and code needed to create the research output for that project. Notably, the code should not reference any data or objects found in another directory. This helps ensure portability: you can move the entire project and still be able to rerun all of the code.
RStudio makes it very easy to create Projects (for clarity, I will use "Project" when referring specifically to RStudio Projects and "project" to refer to generic research projects, e.g., a chapter of your dissertation). I highly recommend that you create a new Project for each project that you are working on. There are several ways to do this, but the easiest is to:
1. Open RStudio

2. Click on `File -> New Project`, then click `New Directory` (if you already have a folder associated with the project, you can click `Existing Directory` and then navigate to that folder)

3. Click `New Project`, then choose a name and a location for the Project (tip: don't include spaces in the Project name). For now, we will not worry about creating a git repository or using `packrat`

4. Click `Create Project` at the bottom. RStudio will create a new directory with the same name as your Project and a file called `project_name.Rproj` inside of that directory.
A new RStudio window should open when you create the new Project. In the future, when you want to work on this Project, just double-click the `project_name.Rproj` file and it will open a new instance of RStudio.
Projects solve two of the big problems we discussed earlier. First, each Project opens a fresh instance of R, so the global environment of each Project is self-contained. That means you can have multiple Projects open at once, but each one will behave independently. This means no more `rm(list = ls())`, because the only objects in your environment should be the ones associated with that Project (this doesn't totally resolve the problem of hidden dependencies, but we'll talk more about that below).
Second, RStudio will automatically set the working directory to the root directory associated with the Project. This means that whatever folder the `.Rproj` file is in will automatically be set as the working directory when you open the Project. You can check this by typing `getwd()` in the console of your new Project. This setting avoids the need to use `setwd()`, which makes your code much more portable. You can move or send the entire directory, double-click the `.Rproj` file, and pick up right where you left off.
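For example, in a Project called `my_proj` you should see something like this (the exact path will, of course, depend on where you created the Project):

```r
getwd()
#> [1] "/Users/yourname/projects/my_proj"
```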
.RData
Another common mistake made by many novice (and not so novice) R users is to save your current workspace to `.RData` each time you quit R or RStudio. On the surface, this seems like a really convenient shortcut. The next time you open R, all of the objects you created previously are right there waiting for you to keep going!

The problem is that, again, this creates hidden dependencies. The `.RData` file does not re-load packages that you were using in your previous session. It does not re-set options that you may have set in your previous session (`setwd()`, `stringsAsFactors = FALSE`). Some objects needed for, e.g., making a figure may be loaded but others may not be. This means that even if you saved your workspace, you will still end up needing to re-run some chunks of code to get back to where you left off previously. Which chunks? Who knows. You probably think you'll remember, but you won't.
A better workflow is to start every R session with an empty workspace. This means you go into each session with the mindset that you will need to re-run all code from scratch every time. Although this seems like a pain, I promise it will make your life easier. It is a way of ensuring that, at a minimum, you can reproduce all of your analyses anytime you need to.

In RStudio, you can enforce this workflow by going to `RStudio -> Preferences` and then, in the `General` tab, setting `Save workspace to .RData on exit` to `Never`. Do it. Trust me.
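If you use the `usethis` package, I believe you can also set this preference from the console instead of clicking through the menus; a quick sketch:

```r
# install.packages("usethis")   # if needed
# Sets the RStudio preference so the workspace is never saved to (or
# restored from) .RData; "user" applies it to all of your Projects
usethis::use_blank_slate("user")
```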
What about re-running code that is very time consuming? Do you really need to re-run the code that cleans your raw data every time you need to reanalyze it? Of course not. You can always save those objects (along with the scripts you used to create them!) so that the next time you need them, they can just be read back into R and you can do everything downstream of those steps without a problem.
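A common pattern is to save those slow-to-create objects with `saveRDS()` and read them back with `readRDS()`. A minimal sketch (the object and file names are placeholders):

```r
# In the data cleaning script: run the slow step once and cache the result
clean_data <- slow_cleaning_step(raw_data)   # placeholder for your own cleaning code
saveRDS(clean_data, "data/clean_data.rds")

# In the analysis script: skip the slow step and read the cached object back in
clean_data <- readRDS("data/clean_data.rds")
```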
One of the benefits of this workflow is that it treats your raw data and code as the only real objects in your workflow. These should be the only objects that, if you lose them, leave you in big trouble (your computer is backed up, right?). Cleaned data that you created from your raw data isn't real. You can just re-run the code and create those objects again (this is why we use R scripts instead of manipulating raw data in Excel; see below for more details). Your workspace isn't real: as we just discussed, it can (and should) disappear at any time. Results and figures are not real because you can re-create them from the code whenever you need to. Once you get into this mindset of treating raw data and code as the only real objects, your workflow will automatically become more organized and reproducible. (Manuscripts are also real, so save them and back them up!)
As we already saw, the `.Rproj` file treats the directory that it is in as the working directory for that project. This means that you can read files from this location and write files to this location without changing any settings. For example, you can stick a file called `data.csv` in this directory and the following code will work:
```r
data <- read.csv("data.csv")
```
A natural instinct for those of us who are inclined to be less-than-organized might be to put all of the files associated with the project directly in this directory:
```
my_proj/
|─── data.csv
|─── script_that_does_everything.R
|─── test_script_that_does_something_else.R
|─── Figure1.png
|─── manuscript_v1.doc
|─── manuscript_v2.doc
|─── ...
|─── manuscript_final_v2-4.doc
```
There is nothing inherently wrong with this but for most projects, you will end up with a very big directory that is hard to navigate when you need to find specific files. A better idea is to put files in intuitive subdirectories. You can pick whatever structure makes most sense to you, but I generally use something like:
```
my_proj/
|─── data-raw/
|      |─── data.csv
|      |─── data_cleaning_script.R
|─── data/
|      |─── clean_data.rds
|─── R/
|      |─── function1.R
|      |─── function2.R
|─── figs/
|      |─── fig1.png
|      |─── fig2.png
|─── doc/
|      |─── manuscript.Rmd
|─── output/
|      |─── model_results.rds
|─── scripts/
|      |─── analysis.R
|      |─── figures.R
```
- `data-raw/` contains my, you guessed it, raw data files and an R script that reads the raw data in, cleans it, and saves the clean data to `data/`

- `data/` contains the cleaned data files that are ready to go straight into the analysis

- `R/` contains scripts with custom functions that I will use for data cleaning, analysis, or making figures (we'll talk more about this later)

- `figs/` contains figure files

- `doc/` contains files for reports or manuscripts related to the project

- `output/` contains output from the analysis, for example results from fitting a model

- `scripts/` contains R scripts used to analyze data, make figures, whatever. These could be put in the subdirectory related to the function of the script (as I did for `data_cleaning_script.R`) if that makes more sense to you. Or you can lump everything into a single script if you prefer. That's up to you.
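To see how the pieces fit together, here is a rough sketch of what `scripts/analysis.R` might look like under this layout (the objects, model, and custom functions are purely illustrative):

```r
# scripts/analysis.R --- illustrative sketch
source("R/function1.R")                       # load custom functions from R/

clean_data <- readRDS("data/clean_data.rds")  # cleaned data created by data-raw/data_cleaning_script.R

fit <- lm(y ~ x, data = clean_data)           # placeholder analysis
saveRDS(fit, "output/model_results.rds")      # model results go in output/

png("figs/fig1.png")                          # figures go in figs/
plot(clean_data$x, clean_data$y)
abline(fit)
dev.off()
```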
Because RStudio treats the Project root directory as the working directory, you cannot do the following if, for example, your data is stored in a subdirectory:
```r
data <- read.csv("data.csv")
```
Instead, you need to use relative paths. Remember that everything is relative to your working directory:
```r
data <- read.csv("data-raw/data.csv")
```
If you prefer, you can have as many layers of subdirectories as you want. The important thing is that as long as you move the entire project directory, these relative paths will still work.
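Relatedly, some people like to build these relative paths with the `here` package, which constructs paths from the Project root (the folder containing the `.Rproj` file) no matter which subdirectory a script happens to be run from. This is optional, but it plays nicely with the Project workflow:

```r
# install.packages("here")   # if needed
library(here)

# Equivalent to "data-raw/data.csv", but built from the Project root
data <- read.csv(here("data-raw", "data.csv"))
```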
For example, the directory for this course (FANR6750) might be organized like this:

```
FANR6750/
|─── scripts/
|      |─── lab1.R
|      |─── lab2.R
|─── homework/
|      |─── LastnameFirstname-homework1/
|      |      |─── LastnameFirstname-homework1.Rmd
|      |      |─── LastnameFirstname-homework1.html
|      |─── LastnameFirstname-homework2/
|      |      |─── LastnameFirstname-homework2.Rmd
|      |      |─── LastnameFirstname-homework2.html
|─── exams/
|      |─── exam1.Rmd
|      |─── exam2.Rmd
```
- `scripts/` contains R scripts created during lab.

- `homework/` contains sub-directories for the files associated with each homework assignment.

- `exams/` contains RMarkdown files for each exam (yes, exams will be completed in RMarkdown).
Raw data is sacred. If you purposely or accidentally change your raw data, everything that comes downstream (analysis, results, reports) will also change. To maintain the integrity of your work, get in the habit of never making changes to your raw data files after you have entered the data.
One of the great things about using scripts to manipulate your data is that you can leave your raw data untouched and have a paper trail of every change you made to prepare your data for analysis. That means adding new variables that are composites of the raw data (e.g., `temp_c <- (temp_f - 32) * (5/9)`), removing outliers, joining different data sets together, whatever. All of this is done in R after you've read in the raw data, rather than in Excel!
Just as important, you don't save these new data objects over the raw data, or even in the same directory. Those files go in `data/`. This means you should only read from `data-raw/` and never write to it. R will, of course, allow you to write to `data-raw/`; this is just a suggestion to help you maintain a firewall that will keep your raw data in its raw form.
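Putting this together, a data cleaning script that respects this read-only convention might look something like the sketch below (the cleaning steps and column names are just examples):

```r
# data-raw/data_cleaning_script.R --- illustrative sketch
raw <- read.csv("data-raw/data.csv")       # read from data-raw/ (never write here)

raw$temp_c <- (raw$temp_f - 32) * (5/9)    # composite variable created from the raw data
clean <- raw[!is.na(raw$temp_c), ]         # e.g., drop records with missing temperatures

saveRDS(clean, "data/clean_data.rds")      # cleaned data goes in data/, not data-raw/
```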