lab08_model_selection.Rmd
Review of linear model assumptions
Evaluating model assumptions
Visual assessments
Formal tests
Outliers and influential points
Model selection approaches
Purpose -> Method
Exploration
Variable screening
Stepwise Procedures
Prediction
Model selection is the formal process of choosing a model (out of possibly many) that best describes the system, according to one or more defined performance criteria. As we have seen in lecture, many approaches to model selection are available. Importantly, however, the way we go about model selection should be directly informed by the purpose of our research study. Models built for exploration should not be used for prediction, and those built for prediction may or may not be helpful in the context of inference. Below, we will look at several model selection approaches within the context of study purpose.
When a biological system is not yet well understood, exploration is often the purpose of a research study. Exploratory analyses help identify potential relationships between variables and can motivate the development of future hypotheses.
It is important to remember that the results of an exploratory study should not be viewed as proof of a hypothesis. They only provide evidence for a potential hypothesis that will need future testing with new data.
Several model selection approaches are commonly used during exploratory studies. Below, we will talk about two broad approaches: variable screening and stepwise procedures.
Genetic modification allows for the production of food crops which have increased usefulness to people. These modifications may be directly related to the nutritional content of the crop, such as increasing digestibility, or they may broadly affect the process of crop production, such as decreasing the time required before harvest can occur. In either case, knowledge of the relationships between genetic variability and phenotypic response is a necessary step in modifying a species’ genetic structure.
The soybean (Glycine max) is a species of legume native to East Asia that is used worldwide for a variety of food products as well as feed for livestock. In this dataset, a researcher is studying the relationship between several genetic markers and the flowering time of soybean plants. Each row of the dataset records the average flowering time of 10 soybean plants of a particular variety, as well as the parental type (A or H) of each of 141 genetic markers.
Import the data into R and examine the dataset.
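A minimal sketch of this step, assuming (as with grasslanddata later in this lab) that the soybeandata dataframe is provided by the course's data package; the column selection in head() is just for illustration:

data('soybeandata')
# Peek at the first few rows and a handful of columns
head(soybeandata[, 1:6])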
Having examined the dataset, it is clear that there are many more possible predictor variables than we would like to put into one model. Determining which to include and which to leave out, however, can be a tedious process (especially with this many). We will begin with variable screening: determining a subset of the possible predictors that have the strongest relationships with the response variable.
Note: The code used to set up this analysis is complex. Do not get too bogged down in the details of how this data management is done; the main principles demonstrated here can be seen with other, simpler datasets as well.
library(magicfor)
library(MASS)
library(dplyr)   # needed for %>% and mutate_if() below
# Create a dataframe which contains only the genetic markers
genetic_matrix <- soybeandata[,-c(1:3)]
genetic_matrix <- genetic_matrix %>% mutate_if(is.character, as.factor)
# Tell magicfor to capture everything passed to print() inside the loop below
magic_for(print, silent = TRUE)
for (i in 1:ncol(genetic_matrix)){
  # t-test of flowering time between the two parental types (A vs H) at marker i
  output1 <- t.test(soybeandata$Flower09 ~ genetic_matrix[,i])
  # Keep the test statistic, standard error, and p-value, rounded to 4 digits
  output2 <- round(c(output1$statistic, output1$stderr, output1$p.value), 4)
  # Label the results with the marker's name
  output3 <- c(paste("marker", i, sep=""), output2)
  print(output3)
}
Above, we see our first example of a for loop in this course. For loops allow us to execute a set of statements iteratively, which can greatly reduce the number of lines of code we have to write.
The above loop works by iterating through each column of the genetic_matrix dataframe and performing multiple tasks. First, it conducts a t-test using the two levels (A and H) within a particular genetic marker as the grouping variable and flowering time as the response; it does this for each of the 141 genetic markers. Next, it stores the t-test output, rounds those values, and labels each set of results with its marker.
# magic_result_as_dataframe() retrieves the values stored by magic_for() as a dataframe
flower_output1 <- magic_result_as_dataframe()
Above, we have saved the output from our for loop into a dataframe. The format, however, is not very convenient: all of the information has been piled into one column. Below, we spread the values out into a matrix, which will make them easier to view and work with.
# Reshape the stored values into 141 rows (one per marker), four values each
flower_output <- as.data.frame(matrix(flower_output1[,2], nrow = 141, byrow = TRUE))
# Creates more informative column names
colnames(flower_output) <- c("marker", "t.stat", "Std.Error", "pval")
# The p-values were stored as text, so convert them back to numeric
flower_output$pval <- as.numeric(as.character(flower_output$pval))
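As an aside, the same table can be built without the magicfor package using base R alone. Here is a minimal sketch (the object names results and flower_output_alt are our own):

# Collect each marker's t-test results in a list, then bind them into a dataframe
results <- vector("list", ncol(genetic_matrix))
for (i in seq_len(ncol(genetic_matrix))) {
  tt <- t.test(soybeandata$Flower09 ~ genetic_matrix[, i])
  results[[i]] <- data.frame(marker = paste0("marker", i),
                             t.stat = round(unname(tt$statistic), 4),
                             Std.Error = round(tt$stderr, 4),
                             pval = round(tt$p.value, 4))
}
flower_output_alt <- do.call(rbind, results)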
Now that we have conducted all of these t-tests, we need a way of deciding which genetic markers to consider. In this case, it is wise to be somewhat liberal: we don't want to miss any potential relationships, so we will retain the markers for which the t-test was anywhere near significant. We can do this by keeping only the t-tests with a p-value < 0.1.
flower_marker1 <- filter(flower_output, pval < 0.10)
head(flower_marker1)
#> marker t.stat Std.Error pval
#> 1 marker12 -1.9443 3.3881 0.0599
#> 2 marker13 -2.4719 2.9354 0.0159
#> 3 marker14 -2.7228 2.9659 0.0081
#> 4 marker15 -2.4083 3.1559 0.0191
#> 5 marker16 -2.6763 2.9922 0.0093
#> 6 marker17 -1.7569 2.9337 0.0831
From the above dataframe, we can see that there are 36 genetic markers that may be somewhat related to flowering time.
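We can confirm that count directly:

# Number of markers that passed the 0.10 screen
nrow(flower_marker1)
#> [1] 36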
Below we can take a subset of our original dataset, selecting only the columns that correspond to the potentially relevant genetic markers.
# Keep the first two columns of the original data plus only the screened markers
flower_data2 <- cbind(soybeandata[,c(1,2)], genetic_matrix[,which(colnames(genetic_matrix) %in% flower_marker1$marker)])
# Drop the first column so that only Flower09 and the retained markers remain
flower_data2 <- flower_data2[-1]
Note: In reality, if our purpose in this analysis was simply to determine which genetic markers may have some relationship with soybean flowering time, we would be done after this step. This information could now be used to further genetic research with the goal of minimizing flowering time. We will continue to use this dataset below, however, to demonstrate how we might construct a parsimonious model estimating flowering time from a large number of predictor variables.
Now that we have screened our set of predictor variables, we can attempt to construct a parsimonious model. The problem, however, is that we still have 36 potential predictors. How do we decide which to include in the model?
Three stepwise procedures (forward selection, backward elimination, and stepwise regression) could be used to resolve this issue. Below we will see all three in action.
Forward selection starts with the simplest possible model (the intercept-only model) and adds one predictor variable at a time. The computer iteratively creates a single-variable model for each predictor and chooses the predictor with the smallest p-value below alpha (the most significant). Having chosen the first predictor, it then creates a set of two-variable models (each containing that first predictor plus one of the remaining predictors) and selects the best two-variable model. Larger and larger models are constructed until either the pool of predictors is exhausted or the addition of another predictor is no longer significant (its p-value is greater than alpha).
Backward elimination works in exactly the opposite way. The most complex model is specified first, and predictors are dropped one at a time (least significant first) until all of the remaining predictors in the model are significant (p-value less than alpha).
Stepwise regression can be thought of as doing both of these processes at once. Working from both ends (the intercept-only model and the full model), a set of predictors somewhere in the middle is determined to be appropriate (there are multiple variations of exactly how this is performed).
Some things to consider:
Because these procedures add or remove one variable at a time, it is possible to miss the 'optimal' model. This is especially true with correlated predictors, where including or removing one predictor may change how meaningful it is to include another.
Stepwise procedures tend to select smaller models than might be helpful for prediction.
Different procedures can arrive at different 'optimal' models. In that case, other metrics (adjusted \(R^2\), root mean square error, the predicted residual sum of squares (PRESS), or Mallows' \(C_p\)) should be considered; a small sketch of PRESS follows this list.
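For reference, PRESS for a linear model can be computed from a single fit using the model's leverage (hat) values, with no refitting required. A minimal sketch (the function name press is our own):

# PRESS: sum of squared leave-one-out prediction errors for an lm fit
press <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}
# Example usage once the stepwise models below have been fit: press(forward.model)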
We can begin by specifying the intercept-only model and the full model. Next, we tell R which stepwise procedure to perform; in this case, we will try all three. Note that R's step() function adds or drops terms to minimize AIC rather than comparing individual p-values, but the overall add/drop logic is the same as described above.
# Fitting the two extreme models: intercept only, and all 36 screened markers
intercept_only <- lm(Flower09~ 1, data= flower_data2)
full.model <- lm(Flower09~., data= flower_data2)
# Run all three procedures (add trace = 0 to suppress the step-by-step printout)
forward.model <- step(intercept_only, direction= 'forward', scope= formula(full.model))
backward.model <- step(full.model, direction= 'backward', scope= formula(full.model))
both.model <- step(intercept_only, direction= 'both', scope= formula(full.model))
summary(forward.model)
#>
#> Call:
#> lm(formula = Flower09 ~ marker125 + marker110 + marker139 + marker14 +
#> marker66 + marker19 + marker92 + marker34 + marker47 + marker112 +
#> marker117, data = flower_data2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -22.561 -8.201 -1.903 5.712 39.848
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 65.178 2.596 25.106 < 2e-16 ***
#> marker125H 7.617 2.295 3.318 0.00115 **
#> marker110H -16.185 5.165 -3.133 0.00210 **
#> marker139H 9.909 2.815 3.520 0.00058 ***
#> marker14H 6.173 2.396 2.577 0.01098 *
#> marker66H 6.281 2.162 2.905 0.00426 **
#> marker19H 6.047 2.246 2.693 0.00793 **
#> marker92H -5.943 2.207 -2.693 0.00792 **
#> marker34H 4.931 2.549 1.935 0.05500 .
#> marker47H 4.226 2.183 1.936 0.05484 .
#> marker112H 10.321 5.403 1.910 0.05808 .
#> marker117H -4.234 2.389 -1.772 0.07852 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13.06 on 143 degrees of freedom
#> Multiple R-squared: 0.4004, Adjusted R-squared: 0.3542
#> F-statistic: 8.68 on 11 and 143 DF, p-value: 1.175e-11
summary(backward.model)
#>
#> Call:
#> lm(formula = Flower09 ~ marker15 + marker19 + marker34 + marker47 +
#> marker66 + marker92 + marker110 + marker112 + marker116 +
#> marker125 + marker139, data = flower_data2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -22.618 -7.923 -2.788 6.597 40.008
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 65.309 2.599 25.126 < 2e-16 ***
#> marker15H 6.219 2.469 2.519 0.012877 *
#> marker19H 5.426 2.294 2.365 0.019365 *
#> marker34H 5.363 2.558 2.097 0.037793 *
#> marker47H 4.456 2.180 2.044 0.042772 *
#> marker66H 6.306 2.177 2.896 0.004368 **
#> marker92H -6.121 2.214 -2.764 0.006459 **
#> marker110H -15.154 5.168 -2.932 0.003919 **
#> marker112H 9.298 5.448 1.707 0.090046 .
#> marker116H -4.015 2.533 -1.585 0.115144
#> marker125H 7.894 2.304 3.426 0.000800 ***
#> marker139H 9.550 2.832 3.372 0.000961 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13.12 on 143 degrees of freedom
#> Multiple R-squared: 0.3952, Adjusted R-squared: 0.3486
#> F-statistic: 8.493 on 11 and 143 DF, p-value: 2.061e-11
summary(both.model)
#>
#> Call:
#> lm(formula = Flower09 ~ marker125 + marker110 + marker139 + marker14 +
#> marker66 + marker19 + marker92 + marker34 + marker47 + marker112 +
#> marker117, data = flower_data2)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -22.561 -8.201 -1.903 5.712 39.848
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 65.178 2.596 25.106 < 2e-16 ***
#> marker125H 7.617 2.295 3.318 0.00115 **
#> marker110H -16.185 5.165 -3.133 0.00210 **
#> marker139H 9.909 2.815 3.520 0.00058 ***
#> marker14H 6.173 2.396 2.577 0.01098 *
#> marker66H 6.281 2.162 2.905 0.00426 **
#> marker19H 6.047 2.246 2.693 0.00793 **
#> marker92H -5.943 2.207 -2.693 0.00792 **
#> marker34H 4.931 2.549 1.935 0.05500 .
#> marker47H 4.226 2.183 1.936 0.05484 .
#> marker112H 10.321 5.403 1.910 0.05808 .
#> marker117H -4.234 2.389 -1.772 0.07852 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 13.06 on 143 degrees of freedom
#> Multiple R-squared: 0.4004, Adjusted R-squared: 0.3542
#> F-statistic: 8.68 on 11 and 143 DF, p-value: 1.175e-11
AIC(forward.model)
#> [1] 1250.027
AIC(backward.model)
#> [1] 1251.367
AIC(both.model)
#> [1] 1250.027
Notice the differences in the results of these three approaches. Not all of the predictors are shared across the three models, and the AIC values differ slightly. What does this tell us? Additionally, notice the adjusted \(R^2\) value, which is the proportion of the variability in the response explained by the predictors, adjusted for the number of predictors. While slightly different across the models, none of the values is particularly high.
Why is this?
This model was designed for exploration, not prediction. Given how many genetic markers there were in total, it makes sense that this handful of them will not explain most of the variation in flowering time. What this model does do, though, is provide an opportunity to explore hypotheses in future studies.
Now we will see an example of how we can use model selection for the purpose of prediction. Here, we would like to select a model which not only performs well for the initial dataset, but can also be used to create accurate predictions when presented with new data.
One way we can do this without needing two completely separate datasets is by using cross validation. In this example, we will use a particular kind of cross validation known as k-fold. The conceptual steps of k-fold cross validation for model selection are as follows (a hand-coded sketch of these steps appears below, after the candidate models are fit):
Randomly divide the dataset into k groups (preferably of equal size)
Fit a model to all but one of the groups
Calculate a metric such as MSE using the observations from the held-out group that was not used to train the model
Repeat this process k times, holding out each group exactly once
Calculate the overall MSE as the average of the k values calculated above
Repeat this process for each model you wish to compare
Compare the estimated metrics across the candidate models to select the 'best' model
Growth of grasses on ranches can be affected by many factors. In this example, researchers have developed three models which they would like to use to predict grass growth (kg dry matter per hectare). One of these models contains abiotic factors (gallons of water, soil salinity in ppm, and % soil nitrogen), another contains biotic factors (% pest cover and the number of grazers on the plot), and the third contains both.
Load the data set and examine the data
data('grasslanddata')
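One quick way to examine the data is str(), which lists each variable and its type (output not shown here):

# Examine the structure of the dataframe
str(grasslanddata)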
Below, we specify our three models:
mod1 <- lm(KgDMHA~ Water + Salinity + Nitrogen, data= grasslanddata)
summary(mod1)
#>
#> Call:
#> lm(formula = KgDMHA ~ Water + Salinity + Nitrogen, data = grasslanddata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2991.99 -652.64 27.78 654.49 3105.87
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.373e+02 3.162e+02 1.383 0.167
#> Water 7.154e-03 3.805e-04 18.802 <2e-16 ***
#> Salinity -1.189e+02 9.003e+00 -13.207 <2e-16 ***
#> Nitrogen 4.556e+02 3.969e+01 11.479 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 958.3 on 996 degrees of freedom
#> Multiple R-squared: 0.395, Adjusted R-squared: 0.3931
#> F-statistic: 216.7 on 3 and 996 DF, p-value: < 2.2e-16
mod2 <- lm(KgDMHA~ Pests + Graze, data= grasslanddata)
summary(mod2)
#>
#> Call:
#> lm(formula = KgDMHA ~ Pests + Graze, data = grasslanddata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2659.53 -571.16 4.26 601.87 2453.67
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 10115.034 134.935 74.96 <2e-16 ***
#> Pests -40.570 1.964 -20.66 <2e-16 ***
#> Graze -57.924 2.349 -24.66 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 873 on 997 degrees of freedom
#> Multiple R-squared: 0.4974, Adjusted R-squared: 0.4964
#> F-statistic: 493.4 on 2 and 997 DF, p-value: < 2.2e-16
mod3 <- lm(KgDMHA~ Water + Salinity + Nitrogen + Pests + Graze, data= grasslanddata)
summary(mod3)
#>
#> Call:
#> lm(formula = KgDMHA ~ Water + Salinity + Nitrogen + Pests + Graze,
#> data = grasslanddata)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -829.03 -181.67 0.55 172.43 839.83
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.076e+03 9.471e+01 43.04 <2e-16 ***
#> Water 7.921e-03 1.067e-04 74.27 <2e-16 ***
#> Salinity -1.161e+02 2.518e+00 -46.09 <2e-16 ***
#> Nitrogen 5.087e+02 1.113e+01 45.72 <2e-16 ***
#> Pests -4.224e+01 6.037e-01 -69.96 <2e-16 ***
#> Graze -6.237e+01 7.238e-01 -86.17 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 268 on 994 degrees of freedom
#> Multiple R-squared: 0.9528, Adjusted R-squared: 0.9525
#> F-statistic: 4010 on 5 and 994 DF, p-value: < 2.2e-16
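Before handing the work to a package, it may help to see the conceptual steps listed earlier written out by hand. Below is a minimal base-R sketch of 10-fold cross validation for mod1; the seed, fold assignment, and object names are our own illustrative choices:

set.seed(123)  # make the random fold assignment reproducible
k <- 10
# Randomly assign each row to one of k roughly equal-sized folds
folds <- sample(rep(1:k, length.out = nrow(grasslanddata)))
fold_mse <- numeric(k)
for (f in 1:k) {
  train_dat <- grasslanddata[folds != f, ]  # fit on all folds except the f-th
  test_dat  <- grasslanddata[folds == f, ]  # hold out the f-th fold
  fit <- lm(KgDMHA ~ Water + Salinity + Nitrogen, data = train_dat)
  preds <- predict(fit, newdata = test_dat)
  fold_mse[f] <- mean((test_dat$KgDMHA - preds)^2)  # MSE on the held-out fold
}
mean(fold_mse)  # overall cross-validated estimate of MSE for mod1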
Next, we can tell R which type of cross validation we plan to use for each model.
library(caret)          # provides trainControl() and train()
library(randomForest)   # needed because train()'s default model type is a random forest
# Request 10-fold cross validation
ctrl <- trainControl(method= 'cv', number= 10)
# No method argument is supplied, so train() fits its default model ("rf", a random forest)
abiotic <- train(KgDMHA~ Water + Salinity + Nitrogen, data= grasslanddata, trControl= ctrl)
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
print(abiotic)
#> Random Forest
#>
#> 1000 samples
#> 3 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 1029.135 0.3138250 827.2056
#> 3 1036.626 0.3077319 831.9005
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 2.
# Per-fold performance metrics for the abiotic model
abiotic$resample
#> RMSE Rsquared MAE Resample
#> 1 991.0775 0.3142691 793.9216 Fold01
#> 2 1016.8671 0.3540096 849.9105 Fold02
#> 3 1002.3371 0.3031511 783.0787 Fold03
#> 4 947.0895 0.4127291 785.8658 Fold04
#> 5 1134.9488 0.1938290 893.4463 Fold05
#> 6 1005.1726 0.3107187 786.8988 Fold06
#> 7 1020.4815 0.2523628 829.8101 Fold07
#> 8 1069.1704 0.2634308 844.4024 Fold08
#> 9 1059.3334 0.3934140 848.7989 Fold09
#> 10 1044.8677 0.3403354 855.9232 Fold10
biotic <- train(KgDMHA~ Pests + Graze, data= grasslanddata, trControl= ctrl)
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
print(biotic)
#> Random Forest
#>
#> 1000 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
#> Resampling results:
#>
#> RMSE Rsquared MAE
#> 983.8363 0.3855089 782.0081
#>
#> Tuning parameter 'mtry' was held constant at a value of 2
both <- train(KgDMHA~ Water + Salinity + Nitrogen + Pests + Graze, data= grasslanddata, trControl= ctrl)
print(both)
#> Random Forest
#>
#> 1000 samples
#> 5 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
#> Resampling results across tuning parameters:
#>
#> mtry RMSE Rsquared MAE
#> 2 462.2708 0.8943824 359.7300
#> 3 456.6378 0.8864726 356.3873
#> 5 460.0898 0.8783105 357.3706
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 3.
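One caveat: because we never passed a method argument to train(), caret fit its default model type, a random forest (hence the library(randomForest) call and the 'Random Forest' header in the output). To cross validate the linear models themselves, you could specify method = 'lm', for example:

# Cross validate the five-predictor linear model rather than a random forest
both_lm <- train(KgDMHA~ Water + Salinity + Nitrogen + Pests + Graze,
                 data= grasslanddata, method= 'lm', trControl= ctrl)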
Which of these models appears to predict growth more accurately (as shown by \(R^2\) and RMSE)?
Take a look at the resample values. Why is there a different \(R^2\) for each of the k folds even though each iteration uses the same predictor variables?
Why are the estimated \(R^2\) values for the cross-validated models lower than they were when we ran the model summaries above?
Create an R Markdown file to do the following:
Begin with an R chunk that loads the data. Use the data available to fit 5 linear models which predict jay abundance. Each model should include at least one interaction, and at least one model should have a quadratic term (1 pt).
Compare the AIC values of each model and determine which model appears best (1 pt).
Construct an AIC table by hand using the equations found in the lecture notes (2 pt).
Use the equation \(AIC = -2\log(\mathcal{L}) + 2k\), where \(\mathcal{L}\) is the model's maximized likelihood and \(k\) is the number of estimated parameters (the log-likelihood can be calculated using the R function logLik())
Determine the Akaike weight for the best performing model and explain what that weight means in comparison to the second best performing model
Perform a k-fold cross validation on your top two performing models (2 pt).
Use 5 folds
How do you think the number of folds affects this process? When might we choose to use fewer or more folds?
A few things to remember when creating the document:
Be sure the output type is set to: output: html_document
Be sure to set echo = TRUE in all R chunks so that all code and output is visible in the knitted document
Regularly knit the document as you work to check for errors