Lab 7


Today’s topics

  • Model selection approaches

    • Purpose → Method

    • Exploration

      • Variable screening

      • Stepwise Procedures

    • Prediction


Model Selection Approaches

Purpose precedes method

Model selection is the formal process of choosing a model (out of possibly many) that best describes the system according to some defined performance criterion (or several). As we have seen in lecture, many approaches to model selection are available. Importantly, however, the way we go about model selection should be directly informed by the purpose of our research study. Models used for exploration should not be used for prediction, and those used for prediction may or may not be helpful in the context of inference. Below, we will look at several model selection approaches within the context of study purpose.



Exploration

When a biological system is not yet well understood, exploration is often the purpose of many research studies. Exploration is used to help identify potential relationships between variables and can be the motivation for the development of future hypotheses.

It is important to remember that the results of an exploratory study should not be viewed as demonstrating proof of a hypothesis. It is only providing evidence for a potential hypothesis that will need future testing with new data.

Several model selection approaches are commonly used during exploratory studies. Below, we will talk about two broad approaches: variable screening and stepwise procedures.

Data example

Genetic modification allows for the production of food crops which have increased usefulness to people. These modifications may be directly related to the nutritional content of the crop, such as increasing digestibility, or they may broadly affect the process of crop production, such as decreasing the time required before harvest can occur. In either case, knowledge of the relationships between genetic variability and phenotypic response is a necessary step in modifying a species’ genetic structure.

Soybeans (Glycine max) are a species of legume native to East Asia which are used worldwide for a variety of food products as well as feed for livestock. In this dataset, a researcher is studying the relationship between several genetic markers and the flowering time of soybean plants. In each row of the dataset, the average flowering time of 10 soybean plants of a particular variety is recorded, as well as the parental type (A or H) of each of 141 genetic markers.


Import the data into R and examine the dataset

library(FANR6750)
data('soybeandata')

Variable screening

Having seen the dataset, it is clear that there are many more possible predictor variables than we would want to put into one model. Determining which to include and which to leave out, however, can be a tedious process (especially with so many). We will begin this process using variable screening: determining a subset of the possible predictors which have the strongest relationships with the response variable.

Note: The code used to set up this analysis is complex. Do not get too bogged down in the details of how this data management is done; the main principles demonstrated here can be seen with other, simpler datasets as well.

library(magicfor)
library(MASS)
library(dplyr) # provides mutate_if() and the %>% pipe used below

# Create a dataframe which contains only the genetic markers
genetic_matrix <- soybeandata[, -c(1:3)]

# Convert the character marker columns to factors
genetic_matrix <- genetic_matrix %>% mutate_if(is.character, as.factor)

# Function which stores the results from a for loop
magic_for(print, silent = TRUE)

for (i in 1:ncol(genetic_matrix)) {
  output1 <- t.test(soybeandata$Flower09 ~ genetic_matrix[, i])
  output2 <- round(c(output1$statistic, output1$stderr, output1$p.value), 4)
  output3 <- c(paste("marker", i, sep = ""), output2)
  print(output3)
}

Above, we see our first example of a for loop in this course. For loops allow us to execute a set of statements iteratively, which can greatly reduce the number of lines of code we have to write.

The above loop iterates through each column of the genetic_matrix dataframe and performs multiple tasks. First, it conducts a t-test using the two levels (A and H) of a particular genetic marker as the grouping variable and flowering time as the response; it does this for each of the 141 genetic markers. Next, it stores the t-test output, rounds those values, and labels each set of results with its marker.

# Use magic_result_as_dataframe() to save the output of the for loop as a dataframe
flower_output1 <- magic_result_as_dataframe()

Above, we have saved the output from our for loop into a dataframe. The format of the data, however, is not very convenient. All of the information has been piled into one column. Below, we can take the values and spread them out into a matrix which will make them easier to view and work with.

# Reshape the stored values into one row per marker
flower_output <- as.data.frame(matrix(flower_output1[, 2], nrow = 141, byrow = TRUE))

# Create more informative column names
colnames(flower_output) <- c("marker", "t.stat", "Std.Error", "pval")

# The p-values were stored as text, so convert them back to numbers
flower_output$pval <- as.numeric(as.character(flower_output$pval))
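
As an aside, the same screening table can be built without magicfor. Below is a minimal base-R sketch under the same assumptions (it reuses the soybeandata and genetic_matrix objects created above); the helper function screen_one is our own invention, not part of any package.

# Base-R alternative to the magicfor approach above
screen_one <- function(i) {
  tt <- t.test(soybeandata$Flower09 ~ genetic_matrix[, i])
  c(t.stat = unname(tt$statistic), Std.Error = tt$stderr, pval = tt$p.value)
}
screen <- as.data.frame(t(sapply(seq_len(ncol(genetic_matrix)), screen_one)))
screen$marker <- paste0("marker", seq_len(ncol(genetic_matrix)))
head(screen[, c("marker", "t.stat", "Std.Error", "pval")])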

Now that we have conducted all of these t-tests, we need a way of deciding which genetic markers to consider. In this case, it is wise to be somewhat liberal: we don’t want to miss any potential relationships, so we will retain the markers for which the t-test was anywhere near significant. We can do this by keeping only the markers whose t-test p-value is less than 0.1.

flower_marker1 <- filter(flower_output, pval < 0.10)
head(flower_marker1)
#>     marker  t.stat Std.Error   pval
#> 1 marker12 -1.9443    3.3881 0.0599
#> 2 marker13 -2.4719    2.9354 0.0159
#> 3 marker14 -2.7228    2.9659 0.0081
#> 4 marker15 -2.4083    3.1559 0.0191
#> 5 marker16 -2.6763    2.9922 0.0093
#> 6 marker17 -1.7569    2.9337 0.0831

From the full dataframe (only the first six rows are shown above), we can see that there were 36 genetic markers which may be at least somewhat related to flowering time.

Below we can take a subset of our original dataset, selecting only the columns that correspond to the potentially relevant genetic markers.

# Keep the first two columns of the original data plus only the screened markers
flower_data2 <- cbind(soybeandata[, c(1, 2)],
                      genetic_matrix[, which(colnames(genetic_matrix) %in% flower_marker1$marker)])

# Drop the first column so that only the response (Flower09) and the markers remain
flower_data2 <- flower_data2[-1]

Note: In reality, if our purpose in this analysis was simply to determine which genetic markers may have some relationship with soybean flowering time, we would be done after this step. This information could now be used to further genetic research with the goal of minimizing flowering time. We will continue to use this dataset below, however, to demonstrate how we might construct a parsimonious model estimating flowering time from a large number of predictor variables.

Stepwise procedures

Now that we have screened our set of predictor variables, we can attempt to construct a parsimonious model. The problem, however, is that we still have 36 potential predictors. How do we decide which to include in the model?

Three stepwise procedures (forward selection, backward elimination, and stepwise regression) could be used to resolve this issue. Below we will see all three in action.

Forward selection starts with the simplest possible model (the intercept-only model) and adds one predictor variable at a time. The computer iteratively fits a single-variable model for each predictor and chooses the predictor with the smallest p-value below alpha (the most significant). Having chosen the first predictor, it then fits a set of two-variable models (each containing that first predictor plus one of the remaining candidates) and selects the best two-variable model. Larger and larger models are constructed until either the pool of predictors is exhausted or the addition of another predictor is no longer significant (its p-value is greater than alpha).

Backward elimination works in exactly the opposite way. The most complex model is specified first, and predictors are dropped one at a time (the least significant first) until all of the predictors remaining in the model are significant (p-value less than alpha).

Stepwise regression can be thought of as doing both of these processes at once: working from both ends (the intercept-only model and the full model), a set of predictors somewhere in the middle is determined to be appropriate (there are multiple variations of exactly how this is performed).


Some things to consider:

  1. Because of how these procedures work (one variable added or removed at a time), it is possible to miss the ‘optimal’ model. This is especially true with correlated predictors, where the inclusion or removal of one may affect how meaningful it is to include another.

  2. Stepwise procedures tend to select smaller models than might be helpful for prediction.

  3. Different procedures can arrive at different ‘optimal’ models. In that case, other metrics (adjusted \(R^2\), root mean square error, predicted residual sum of squares (PRESS), or Mallows’ \(C_p\)) should be considered; a sketch of how to compute a few of these follows this list.
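
For reference, here is a minimal sketch of how a few of these metrics can be computed for any fitted lm object. The object name fit is a placeholder; any of the models fit below could be substituted.

# Comparison metrics for a fitted lm object 'fit' (placeholder name)

# Adjusted R-squared, as reported by summary()
adj_r2 <- function(fit) summary(fit)$adj.r.squared

# Root mean square error of the in-sample residuals
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

# PRESS: the sum of squared leave-one-out (predicted) residuals,
# computed from the ordinary residuals and the hat values
press <- function(fit) {
  r <- residuals(fit)
  h <- hatvalues(fit)
  sum((r / (1 - h))^2)
}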



We can begin by specifying the intercept-only model and the full model. Next, we will tell R which stepwise procedure to perform; in this case, we will try all three. Note that R’s step() function uses AIC, rather than the p-value rules described above, as its criterion for adding and dropping predictors.

# Fitting the models
intercept_only <- lm(Flower09 ~ 1, data = flower_data2)
full.model <- lm(Flower09 ~ ., data = flower_data2)


forward.model <- step(intercept_only, direction = 'forward', scope = formula(full.model))

backward.model <- step(full.model, direction = 'backward', scope = formula(full.model))

both.model <- step(intercept_only, direction = 'both', scope = formula(full.model))
summary(forward.model)
#> 
#> Call:
#> lm(formula = Flower09 ~ marker125 + marker110 + marker139 + marker14 + 
#>     marker66 + marker19 + marker92 + marker34 + marker47 + marker112 + 
#>     marker117, data = flower_data2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -22.561  -8.201  -1.903   5.712  39.848 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   65.178      2.596  25.106  < 2e-16 ***
#> marker125H     7.617      2.295   3.318  0.00115 ** 
#> marker110H   -16.185      5.165  -3.133  0.00210 ** 
#> marker139H     9.909      2.815   3.520  0.00058 ***
#> marker14H      6.173      2.396   2.577  0.01098 *  
#> marker66H      6.281      2.162   2.905  0.00426 ** 
#> marker19H      6.047      2.246   2.693  0.00793 ** 
#> marker92H     -5.943      2.207  -2.693  0.00792 ** 
#> marker34H      4.931      2.549   1.935  0.05500 .  
#> marker47H      4.226      2.183   1.936  0.05484 .  
#> marker112H    10.321      5.403   1.910  0.05808 .  
#> marker117H    -4.234      2.389  -1.772  0.07852 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13.06 on 143 degrees of freedom
#> Multiple R-squared:  0.4004, Adjusted R-squared:  0.3542 
#> F-statistic:  8.68 on 11 and 143 DF,  p-value: 1.175e-11
summary(backward.model)
#> 
#> Call:
#> lm(formula = Flower09 ~ marker15 + marker19 + marker34 + marker47 + 
#>     marker66 + marker92 + marker110 + marker112 + marker116 + 
#>     marker125 + marker139, data = flower_data2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -22.618  -7.923  -2.788   6.597  40.008 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   65.309      2.599  25.126  < 2e-16 ***
#> marker15H      6.219      2.469   2.519 0.012877 *  
#> marker19H      5.426      2.294   2.365 0.019365 *  
#> marker34H      5.363      2.558   2.097 0.037793 *  
#> marker47H      4.456      2.180   2.044 0.042772 *  
#> marker66H      6.306      2.177   2.896 0.004368 ** 
#> marker92H     -6.121      2.214  -2.764 0.006459 ** 
#> marker110H   -15.154      5.168  -2.932 0.003919 ** 
#> marker112H     9.298      5.448   1.707 0.090046 .  
#> marker116H    -4.015      2.533  -1.585 0.115144    
#> marker125H     7.894      2.304   3.426 0.000800 ***
#> marker139H     9.550      2.832   3.372 0.000961 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13.12 on 143 degrees of freedom
#> Multiple R-squared:  0.3952, Adjusted R-squared:  0.3486 
#> F-statistic: 8.493 on 11 and 143 DF,  p-value: 2.061e-11
summary(both.model)
#> 
#> Call:
#> lm(formula = Flower09 ~ marker125 + marker110 + marker139 + marker14 + 
#>     marker66 + marker19 + marker92 + marker34 + marker47 + marker112 + 
#>     marker117, data = flower_data2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -22.561  -8.201  -1.903   5.712  39.848 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   65.178      2.596  25.106  < 2e-16 ***
#> marker125H     7.617      2.295   3.318  0.00115 ** 
#> marker110H   -16.185      5.165  -3.133  0.00210 ** 
#> marker139H     9.909      2.815   3.520  0.00058 ***
#> marker14H      6.173      2.396   2.577  0.01098 *  
#> marker66H      6.281      2.162   2.905  0.00426 ** 
#> marker19H      6.047      2.246   2.693  0.00793 ** 
#> marker92H     -5.943      2.207  -2.693  0.00792 ** 
#> marker34H      4.931      2.549   1.935  0.05500 .  
#> marker47H      4.226      2.183   1.936  0.05484 .  
#> marker112H    10.321      5.403   1.910  0.05808 .  
#> marker117H    -4.234      2.389  -1.772  0.07852 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 13.06 on 143 degrees of freedom
#> Multiple R-squared:  0.4004, Adjusted R-squared:  0.3542 
#> F-statistic:  8.68 on 11 and 143 DF,  p-value: 1.175e-11

AIC(forward.model)
#> [1] 1250.027
AIC(backward.model)
#> [1] 1251.367
AIC(both.model)
#> [1] 1250.027


Notice the differences among the results of these three approaches: not all of the predictors are shared across the three models, and the AIC values differ slightly. What does this tell us? Additionally, notice the adjusted \(R^2\) value, which is the proportion of the variability in the response explained by the predictors. While it differs slightly across the models, none of the values are particularly good.

Why is this?


This model was built for the purpose of exploration, not prediction. Given how many genetic markers there were in total, it makes sense that this handful of them will not predict most of the variation in flowering time. What this model does do, though, is provide an opportunity to explore hypotheses in future studies.



Prediction

Now we will see an example of how we can use model selection for the purpose of prediction. Here, we would like to select a model which not only performs well for the initial dataset, but can also be used to create accurate predictions when presented with new data.

One way we can do this without needing two completely separate datasets is by using cross validation. In this example, we will use a particular kind of cross validation known as k-fold. The conceptual steps of k-fold cross validation for model selection are as follows (a hand-rolled sketch follows the list):

  1. Randomly divide the data set into k groups (preferably of equal size)

  2. Fit the model to all but one of the groups

  3. Calculate a metric such as MSE using the observations from the held-out group that was not used to train the model

  4. Repeat this process k times, holding out each group once

  5. Calculate the overall MSE as the average of the k values calculated above

  6. Repeat this process for each model you wish to compare

  7. Compare the metric across the candidate models to select the ‘best’ model
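
To make these steps concrete, below is a minimal hand-rolled sketch of steps 1 through 5 for a single model. The built-in mtcars data and the mpg ~ wt + hp formula are stand-ins for illustration only; they are not the lab data.

# Hand-rolled 5-fold cross validation for a single model
set.seed(123)
k <- 5
n <- nrow(mtcars)

# 1. Randomly assign each row to one of k roughly equal groups
folds <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (i in 1:k) {
  train_dat <- mtcars[folds != i, ]  # 2. Fit to all but the i-th group
  test_dat  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train_dat)
  pred <- predict(fit, newdata = test_dat)
  fold_mse[i] <- mean((test_dat$mpg - pred)^2)  # 3. MSE on the held-out group
}

# 5. The overall cross-validated MSE is the average across folds
mean(fold_mse)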

Data example

Growth of grasses on ranches can be affected by many factors. In this example, researchers have developed three models which they would like to use to predict grass growth (kg dry matter per hectare). One of these models contains abiotic factors (gallons of water, soil salinity in ppm, and % soil nitrogen), another contains biotic factors (% pest cover and the number of grazers on the plot), and the third contains both.


Load the data set and examine the data

data('grasslanddata')

Below, we specify our three models:

mod1 <- lm(KgDMHA ~ Water + Salinity + Nitrogen, data = grasslanddata)
summary(mod1)
#> 
#> Call:
#> lm(formula = KgDMHA ~ Water + Salinity + Nitrogen, data = grasslanddata)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2991.99  -652.64    27.78   654.49  3105.87 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  4.373e+02  3.162e+02   1.383    0.167    
#> Water        7.154e-03  3.805e-04  18.802   <2e-16 ***
#> Salinity    -1.189e+02  9.003e+00 -13.207   <2e-16 ***
#> Nitrogen     4.556e+02  3.969e+01  11.479   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 958.3 on 996 degrees of freedom
#> Multiple R-squared:  0.395,  Adjusted R-squared:  0.3931 
#> F-statistic: 216.7 on 3 and 996 DF,  p-value: < 2.2e-16

mod2 <- lm(KgDMHA ~ Pests + Graze, data = grasslanddata)
summary(mod2)
#> 
#> Call:
#> lm(formula = KgDMHA ~ Pests + Graze, data = grasslanddata)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2659.53  -571.16     4.26   601.87  2453.67 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 10115.034    134.935   74.96   <2e-16 ***
#> Pests         -40.570      1.964  -20.66   <2e-16 ***
#> Graze         -57.924      2.349  -24.66   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 873 on 997 degrees of freedom
#> Multiple R-squared:  0.4974, Adjusted R-squared:  0.4964 
#> F-statistic: 493.4 on 2 and 997 DF,  p-value: < 2.2e-16

mod3 <- lm(KgDMHA ~ Water + Salinity + Nitrogen + Pests + Graze, data = grasslanddata)
summary(mod3)
#> 
#> Call:
#> lm(formula = KgDMHA ~ Water + Salinity + Nitrogen + Pests + Graze, 
#>     data = grasslanddata)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -829.03 -181.67    0.55  172.43  839.83 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  4.076e+03  9.471e+01   43.04   <2e-16 ***
#> Water        7.921e-03  1.067e-04   74.27   <2e-16 ***
#> Salinity    -1.161e+02  2.518e+00  -46.09   <2e-16 ***
#> Nitrogen     5.087e+02  1.113e+01   45.72   <2e-16 ***
#> Pests       -4.224e+01  6.037e-01  -69.96   <2e-16 ***
#> Graze       -6.237e+01  7.238e-01  -86.17   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 268 on 994 degrees of freedom
#> Multiple R-squared:  0.9528, Adjusted R-squared:  0.9525 
#> F-statistic:  4010 on 5 and 994 DF,  p-value: < 2.2e-16

Next, we can tell R which type of cross validation to use for each model. Note that because no model type is specified in the calls to train() below, caret fits its default model, a random forest (hence the randomForest package), rather than re-fitting the linear models above.

library(caret)
library(randomForest)

# Ten-fold cross validation
ctrl <- trainControl(method = 'cv', number = 10)

abiotic <- train(KgDMHA ~ Water + Salinity + Nitrogen, data = grasslanddata, trControl = ctrl)
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .

print(abiotic)
#> Random Forest 
#> 
#> 1000 samples
#>    3 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE      Rsquared   MAE     
#>   2     1023.360  0.3182402  821.8443
#>   3     1031.408  0.3115387  826.8721
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 2.

abiotic$resample
#>         RMSE  Rsquared      MAE Resample
#> 1  1009.4804 0.3628314 795.2984   Fold01
#> 2  1040.0691 0.2551683 806.0872   Fold02
#> 3   979.7285 0.3312794 798.6756   Fold03
#> 4  1047.9130 0.2695540 834.7730   Fold04
#> 5  1027.8042 0.3224239 819.4395   Fold05
#> 6   986.5046 0.3781157 800.0342   Fold06
#> 7  1045.5120 0.2538705 869.3104   Fold07
#> 8  1020.9493 0.2817806 795.0477   Fold08
#> 9  1026.0453 0.4211395 851.2316   Fold09
#> 10 1049.5891 0.3062390 848.5461   Fold10

biotic <- train(KgDMHA ~ Pests + Graze, data = grasslanddata, trControl = ctrl)
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

print(biotic)
#> Random Forest 
#> 
#> 1000 samples
#>    2 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
#> Resampling results:
#> 
#>   RMSE      Rsquared   MAE     
#>   968.9116  0.4017874  771.6155
#> 
#> Tuning parameter 'mtry' was held constant at a value of 2

both <- train(KgDMHA ~ Water + Salinity + Nitrogen + Pests + Graze, data = grasslanddata, trControl = ctrl)

print(both)
#> Random Forest 
#> 
#> 1000 samples
#>    5 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  RMSE      Rsquared   MAE     
#>   2     463.0675  0.8930002  362.2867
#>   3     458.1630  0.8853716  359.9529
#>   5     461.3546  0.8777393  361.9696
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was mtry = 3.

Which of these models appears to predict growth most accurately (as shown by \(R^2\) and RMSE)?

Take a look at the resample values. Why is there a different \(R^2\) for each of the k folds even though each iteration uses the same predictor variables?

Why is the estimated \(R^2\) value for the cross-validated models lower than it was when we ran the model summaries above?
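
To help separate what is going on, caret can also cross-validate the linear model itself by setting method = 'lm'. Below is a sketch, reusing the ctrl object from above:

# Cross-validate the linear model itself rather than the default random forest
both_lm <- train(KgDMHA ~ Water + Salinity + Nitrogen + Pests + Graze,
                 data = grasslanddata, method = 'lm', trControl = ctrl)
print(both_lm)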




Assignment

Create an R Markdown file to do the following:

  1. Create an R chunk to load the data using:

library(FANR6750)
data('jaydata')

  2. Use the data available to fit 5 linear models which predict jay abundance. Each model should include at least one interaction, and at least one model should have a quadratic term (see the syntax reminders after this list).

  3. Compare the AIC values of each model and determine which model appears best.

  4. Construct an AIC table by hand using the equations found in the lecture notes.

    1. Use the equation \(\text{AIC} = -2\log(\mathcal{L}) + 2K\), where \(K\) is the number of estimated parameters (the log-likelihood can be calculated using the R function logLik())

    2. Determine the Akaike weight for the best performing model and explain what that weight means in comparison to the second best performing model

  5. Perform a k-fold cross validation on your top two performing models.

    1. Use 5 folds

    2. Explain how you think the number of folds affects this process. When might we choose to use fewer or more folds?
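
As a syntax reminder for item 2, the snippet below shows how interactions and quadratic terms are written in an lm() formula, and how the pieces of a hand-built AIC can be extracted for item 4. The names y, x1, and x2 are placeholders, not the actual jaydata columns.

# Placeholder names: substitute the actual jaydata columns
# Interaction (main effects of x1 and x2 plus their interaction):
m_int <- lm(y ~ x1 * x2, data = jaydata)

# Quadratic term; I() protects the arithmetic inside the formula:
m_quad <- lm(y ~ x1 + I(x1^2), data = jaydata)

# Pieces needed for a hand-built AIC:
ll <- as.numeric(logLik(m_int))  # log-likelihood
K  <- attr(logLik(m_int), "df")  # number of estimated parameters
aic_by_hand <- -2 * ll + 2 * K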


A few things to remember when creating the document:

  • Be sure the output type is set to: output: html_document

  • Be sure to set echo = TRUE in all R chunks so that all code and output are visible in the knitted document

  • Regularly knit the document as you work to check for errors