class: center, middle, inverse, title-slide .title[ # LECTURE 7: linear models part 2: categorical predictor > 2 levels ] .subtitle[ ## FANR 6750 (Experimental design) ] .author[ ###
Fall 2023 ] --- class: inverse # outline #### 1) Motivating example <br/> -- #### 2) Extending the linear model <br/> -- #### 3) Dummy variables with > 2 levels <br/> -- #### 4) *F*-tests <br/> -- #### 5) Comparing linear model and ANOVA approaches --- # categorical predictors #### In lecture 4, we learned about fitting models with categorical predictors with 1 (measurement error) or 2 (frog) levels -- - Linear models with these categorical predictors correspond to one-sample or two-sample *t*-tests -- - Binary categorical predictors are quite common in ecological studies (e.g., male vs female, juvenile vs adult, disturbed vs undisturbed) -- #### However, you will often encounter categorical predictors with >2 levels - What are some examples? --- # motivating example Foresters are studying the effect of 4 different fertilizers (treatments) on the growth of loblolly pine, with 3 plots (replicates) receiving each treatment. Data are average height per plot after 5 years: .pull-left[ <br/> <table class="table table-condensed" style="font-size: 18px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="4"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Treatment</div></th> </tr> <tr> <th style="text-align:center;"> Replicate </th> <th style="text-align:center;"> A </th> <th style="text-align:center;"> B </th> <th style="text-align:center;"> C </th> <th style="text-align:center;"> D </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 11 </td> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 8 </td> <td style="text-align:center;"> 6 </td> <td style="text-align:center;"> 4 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 3 </td> </tr> </tbody> </table> ] -- .pull-right[ #### Notation - The number of groups (treatments) is `\(\large a=4\)` - The number of observations within each group (replicates) is `\(\large n=3\)` - `\(\large y_{ij}\)` denotes the `\(\large j\)`th observation from the `\(\large i\)`th group ] --- # a brief tangent #### What counts as an observation? -- #### Experimental unit > the physical unit that receives a particular treatment -- #### Observational unit > the physical unit on which measurements are taken -- These are not always the same! -- Examples - Agricultural fields given different fertilizers, crop yield measured - Rats given different diets, disease state measured - Microcosms given different predator abundances, tadpole growth measured --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? -- <img src="07_anova_files/figure-html/pine1-1.png" width="576" style="display: block; margin: auto;" /> --- # motivating example #### Hypotheses - `\(\large H_0 : \mu_A = \mu_B = \mu_C = \mu_D\)` - `\(\large H_a :\)` At least one inequality -- #### How should we test the null? 
-- We could do this using 6 *t*-tests <br/> -- But this would alter the overall (experiment-wise) `\(\large \alpha\)` level because each individual test has a chance (usually `\(\large \alpha = 0.05\)`) of incorrectly rejecting a true null hypothesis, and these errors compound across tests (with 6 independent tests at `\(\large \alpha = 0.05\)`, the probability of at least one false rejection is `\(1 - 0.95^6 \approx 0.26\)`) --- # extending the linear model #### Fortunately, extending the previous linear model to include additional predictor levels is "straightforward" - The model looks familiar `$$\large y_{ij} = \beta_0 + \beta_ix_{ij} + \epsilon_{ij}$$` `$$\large \epsilon_{ij} \sim normal(0, \sigma)$$` -- - What's tricky is how we code `\(\large x_{ij}\)` when it is more than just 0 or 1 -- - Fortunately, what we learned about how `R` codes dummy variables will help --- # dummy variable coding ```r data(pinedata) pinedata[c(1,4,7,10),] ``` ``` ## Replicate Treatment Height ## 1 1 A 11 ## 4 1 B 9 ## 7 1 C 7 ## 10 1 D 5 ``` ```r fit1 <- lm(Height ~ Treatment, data = pinedata) model.matrix(fit1)[c(1,4,7,10),] ``` ``` ## (Intercept) TreatmentB TreatmentC TreatmentD ## 1 1 0 0 0 ## 4 1 1 0 0 ## 7 1 0 1 0 ## 10 1 0 0 1 ``` --- # dummy variable coding ```r model.matrix(fit1)[c(1,4,7,10),] ``` ``` ## (Intercept) TreatmentB TreatmentC TreatmentD ## 1 1 0 0 0 ## 4 1 1 0 0 ## 7 1 0 1 0 ## 10 1 0 0 1 ``` Remember that the model matrix has one column for every parameter in the model - How many parameters is this model estimating? -- `$$\large E[y_{ij}] = \beta_0 + \beta_1 I(B)_{ij} + \beta_2 I(C)_{ij} + \beta_3 I(D)_{ij}$$` where `\(I(B/C/D)_{ij}\)` are dummy variables (0/1 depending on treatment) -- - What is the interpretation of each parameter? --- # motivating example ```r broom::tidy(fit1) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 10.0 0.577 17.3 0.000000126 ## 2 TreatmentB -2.00 0.816 -2.45 0.0400 ## 3 TreatmentC -4.00 0.816 -4.90 0.00120 ## 4 TreatmentD -6.00 0.816 -7.35 0.0000801 ``` -- - Be sure you understand what each parameter means! -- - *t*-statistics and *p*-values are calculated and interpreted in the same way as the frog example -- - But wait, the null hypothesis was `\(\mu_A = \mu_B = \mu_C = \mu_D\)` --- # regression F-statistic .small-code[ ```r summary(fit1) ``` ``` ## ## Call: ## lm(formula = Height ~ Treatment, data = pinedata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1 -1 0 1 1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.000 0.577 17.32 1.3e-07 *** ## TreatmentB -2.000 0.816 -2.45 0.0400 * ## TreatmentC -4.000 0.816 -4.90 0.0012 ** ## TreatmentD -6.000 0.816 -7.35 8.0e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 8 degrees of freedom ## Multiple R-squared: 0.882, Adjusted R-squared: 0.838 ## F-statistic: 20 on 3 and 8 DF, p-value: 0.000449 ``` ] --- # regression F-statistic The *F*-statistic tests whether the regression coefficients (other than the intercept) are all 0 - `\(H_0\)`: All regression coefficients are 0 - `\(H_A\)`: At least one regression coefficient is not 0 -- *F*-statistics are the ratio of *sample variances* `$$F = \frac{s^2_A}{s^2_B}$$` - Null hypothesis is that population variances are equal ( `\(\sigma^2_A = \sigma^2_B\)` ) - Always positive (variances cannot be negative) - Usually > 1 (by convention, the larger variance goes in the numerator) -- Where do the variances come from in the linear model and what do they tell us about differences among groups? 
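One way to preview the answer is to compute the two variances directly (a minimal sketch, not part of the original slides; it uses `pinedata` and `fit1` from the previous slides, but the variable names below are ours):

```r
# Sketch: among- and within-group variation for the pine data
grand_mean <- mean(pinedata$Height)                     # grand mean
group_mean <- ave(pinedata$Height, pinedata$Treatment)  # group mean for each observation

SSA <- sum((group_mean - grand_mean)^2)                 # variation among groups (= 60)
SSW <- sum((pinedata$Height - group_mean)^2)            # variation within groups (= 8)

a <- 4; n <- 3
MSA <- SSA / (a - 1)        # among-group variance (= 20)
MSW <- SSW / (a * (n - 1))  # within-group variance (= 1)
MSA / MSW                   # ratio (= 20) matches the F-statistic reported by summary(fit1)
```

The next slides develop exactly these quantities (sums of squares, mean squares, and their ratio).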
--- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: -- .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pine_grm-1.png" width="288" style="display: block; margin: auto;" /> ] --- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` - Group means `$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pinedata_gm-1.png" width="288" style="display: block; margin: auto;" /> ] --- # regression F-statistic #### To understand why the test is based on variance, it is helpful to consider several types of means: .pull-left[ - Grand mean `$$\large \bar{y}. = \frac{\sum_i \sum_jy_{ij}}{a \times n}$$` - Group means `$$\large \bar{y}_i = \frac{\sum_j y_{ij}}{n}$$` We can now decompose the observations as `$$\large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$` ] .pull-right[ <img src="07_anova_files/figure-html/pinedata_var-1.png" width="288" style="display: block; margin: auto;" /> ] --- # the additive model #### The decomposition `$$\Large y_{ij} = \color{#446E9B}{\bar{y}.} + \color{#D47500}{(\bar{y}_i - \bar{y}.)} + \color{#3CB521}{(y_{ij} - \bar{y}_i)}$$` -- #### The additive model `$$\Large y_{ij} = \color{#446E9B}{\mu} + \color{#D47500}{\alpha_i} + \color{#3CB521}{\epsilon_{ij}}$$` -- #### where `$$\Large \epsilon_{ij} \sim normal(0, \sigma^2)$$` --- # the additive model `$$\large y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$` `$$\large \epsilon_{ij} \sim normal(0, \sigma^2)$$` #### Notes - `\(\large \mu\)` is the grand mean of the population, estimated by `\(\large \bar{y}.\)` -- - `\(\large \alpha_i\)` is the effect of treatment *i*, estimated by `\(\large\bar{y}_i - \bar{y}.\)` -- + It is the deviation of the group mean from the grand mean + If all `\(\large\alpha_i = 0\)`, there is no treatment effect + Thus, we can re-write the null hypothesis `\(H_0 : \alpha_1 = \alpha_2=... =\alpha_a = 0\)` -- - `\(\large \epsilon_{ij}\)` is the residual error, estimated by `\(\large y_{ij} - \bar{y}_i\)` + It is the unexplained (random) deviation of the observation from the group mean --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? <img src="07_anova_files/figure-html/pinedata_ssa-1.png" width="576" style="display: block; margin: auto;" /> --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` #### Variation within groups `$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? 
<img src="07_anova_files/figure-html/pinedata_ssw-1.png" width="576" style="display: block; margin: auto;" /> --- # sums of squares #### Variation among groups `$$\Large SS_A = n \sum_i (\bar{y}_i - \bar{y}.)^2$$` #### Variation within groups `$$\Large SS_W = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$$` #### Total variation `$$\Large SS_T = SS_A + SS_W = \sum_i \sum_j (y_{ij} - \bar{y}.)^2$$` --- # motivating example **Question:** Is there a difference in growth among the four treatment groups? <img src="07_anova_files/figure-html/pinedata_sst-1.png" width="576" style="display: block; margin: auto;" /> --- # mean squares ### To convert the sums of squares to variances, divide by the degrees of freedom -- #### Mean squares among `$$\Large MS_A = \frac{SS_A}{a-1}$$` -- #### Mean squares within `$$\Large MS_W = \frac{SS_W}{a(n-1)}$$` --- # F-statistic <br/> <br/> `$$\LARGE F = \frac{MS_A}{MS_W}$$` -- #### Note: -- - *F* is the ratio of the variance among groups to the variance within groups -- - If there is a large treatment effect, what happens to `\(MS_A\)`? -- - If there is little residual variation, what happens to `\(MS_W\)`? -- - Large values of `\(F\)` indicate treatment effects are large relative to residual variation, but can we conclude there is a treatment effect? --- # the F-distribution #### **If the null hypothesis is true** ( `\(\sigma^2_A = \sigma^2_B\)`), the ratio of sample variances follows an *F*-distribution .pull-left[ #### Properties: - `\(F > 0\)` - `\(F\)`-distribution is not symmetrical - Shape of distribution depends on an ordered pair of degrees of freedom, `\(df_A\)` and `\(df_B\)` ] .pull-right[ <img src="07_anova_files/figure-html/f-1.png" width="288" style="display: block; margin: auto;" /> ] --- # f-distribution <br/> <img src="07_anova_files/figure-html/f2-1.png" width="504" style="display: block; margin: auto;" /> Just like the *t*-distribution, we can use the *F*-distribution to calculate *p*-values and test the null hypothesis that the variances are equal --- # analysis of variance #### Using the *F*-statistic to test whether there is a treatment effect is commonly referred to as *Analysis of Variance* (ANOVA) -- - Especially in experimental settings, ANOVA is very commonly used -- - The linear model approach is increasingly common, especially in observational settings -- - From a practical standpoint, the linear model and ANOVA approaches provide exactly the same information, just presented in different ways (just like the linear model vs t-test we saw previously) --- # worked example #### Suppose we are interested in the effect of elevation on the abundance of Canada Warblers .pull-left[ <img src="https://upload.wikimedia.org/wikipedia/commons/b/b1/8G7D5475-Canada.jpg" width="80%" /> ] -- .pull-right[ <table class="table table-condensed" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="3"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Elevation</div></th> </tr> <tr> <th style="text-align:center;"> Replicate </th> <th style="text-align:center;"> Low </th> <th style="text-align:center;"> Medium </th> <th style="text-align:center;"> High </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 4 </td> </tr> <tr> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 7 </td> </tr> <tr> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 0 </td> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 5 </td> </tr> </tbody> </table> ] ??? Image courtesy of William H. Majoros via Wikicommons --- # worked example .small-code[ ```r data(warblerdata) fit.lm <- lm(Count ~ Elevation, data = warblerdata) summary(fit.lm) ``` ``` ## ## Call: ## lm(formula = Count ~ Elevation, data = warblerdata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.250 -0.688 -0.250 0.938 1.750 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.250 0.717 7.32 4.5e-05 *** ## ElevationLow -3.750 1.014 -3.70 0.0049 ** ## ElevationMedium -3.000 1.014 -2.96 0.0160 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.43 on 9 degrees of freedom ## Multiple R-squared: 0.63, Adjusted R-squared: 0.548 ## F-statistic: 7.66 on 2 and 9 DF, p-value: 0.0114 ``` ] --- # worked example ```r data(warblerdata) fit.aov <- aov(Count ~ Elevation, data = warblerdata) summary(fit.aov) ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## Elevation 2 31.5 15.75 7.66 0.011 * ## Residuals 9 18.5 2.06 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: - The `lm()` function also returns residuals (i.e., `\(y_i - E[y_i]\)`) .small-code[ ```r fit.lm$residuals ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 ## -0.50 1.50 -1.50 0.50 -0.25 -2.25 1.75 0.75 -1.25 1.75 -0.25 -0.25 ``` ] - Residual sum of squares .small-code[ ```r sum(fit.lm$residuals^2) ``` ``` ## [1] 18.5 ``` ] - Residual mean square .small-code[ ```r sum(fit.lm$residuals^2)/9 ``` ``` ## [1] 2.056 ``` ] --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: What about among group variation? 
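One shortcut worth knowing (a sketch, not part of the original slides): calling `anova()` on the `lm` fit returns the same ANOVA table as `summary(fit.aov)`, with the among-group and residual sums of squares, mean squares, and *F*-statistic computed for you.

```r
# Sketch: the full ANOVA table directly from the lm fit
# (should match summary(fit.aov) from the previous slide)
anova(fit.lm)
```

The chunks below rebuild the among-group pieces of that table by hand.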
.small-code[ ```r fit.lm$fitted.values ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 1.50 1.50 1.50 1.50 2.25 2.25 2.25 2.25 5.25 5.25 5.25 5.25 ``` ] - Treatment sum of squares .small-code[ ```r sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2) ``` ``` ## [1] 31.5 ``` ] - Treatment mean square .small-code[ ```r sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2)/2 ``` ``` ## [1] 15.75 ``` ] --- # anova table from `lm` In case you want more detail about the relationship between `lm` and `aov` output: *F*-statistic and *p*-value ```r MSa <- sum((fit.lm$fitted.values - mean(fit.lm$fitted.values))^2)/2 MSe <- sum(fit.lm$residuals^2)/9 (F <- MSa/MSe) ``` ``` ## [1] 7.662 ``` ```r (p <- pf(F, 2, 9, lower.tail = FALSE)) ``` ``` ## [1] 0.0114 ``` --- # recap 1) Models with categorical variables can be fit using either `lm()` or `aov()` -- 2) The test traditionally known as *ANOVA* is just an extension of the linear model used to analyze a binary predictor variable -- 3) `R` codes categorical predictors using dummy variables, and `lm` coefficients correspond to the differences between each treatment level and the reference level -- 4) `lm()` and `aov()` provide the exact same information, just presented in different ways -- 5) ANOVA output is commonly used for manipulative experiments, linear model output for observational studies -- 6) But how do we tell which groups differ from each other (aside from just the reference level)... --- # looking ahead <br/> ### **Next time**: Multiple comparisons <br/> ### **Reading**: [Fieberg chp. 3.9](https://statistics4ecologists-v1.netlify.app/matrixreg#pairwise-comparisons) and [3.12](https://statistics4ecologists-v1.netlify.app/matrixreg#contrasts)