Processing math: 100%
+ - 0:00:00
Notes for current slide
Notes for next slide

LECTURE 9: multiple regression

FANR 6750 (Experimental design)




Fall 2024

1 / 31

outline

1) Model structure


2 / 31

outline

1) Model structure


2) Factor + continuous predictor (ANCOVA)


2 / 31

outline

1) Model structure


2) Factor + continuous predictor (ANCOVA)


3) Two factors (blocking design)


2 / 31

outline

1) Model structure


2) Factor + continuous predictor (ANCOVA)


3) Two factors (blocking design)


4) Two continuous predictors


2 / 31

outline

1) Model structure


2) Factor + continuous predictor (ANCOVA)


3) Two factors (blocking design)


4) Two continuous predictors


5) Centering predictors

2 / 31

review

yi=β0+β1×Xi+ϵi ϵinormal(0,σ) So far, we have learned about linear models that contain only a single predictor variable X:

1) Single continuous variable = "simple" regression (lecture 2)

2) Single categorical predictor w/ one level = one-sample t-test (lecture 4)

3) Single categorical predictor w/ two levels = two-sample t-test (lecture 4)

4) Single categorical predictor w/ > 2 levels = ANOVA (lecture 7)

3 / 31

review

yi=β0+β1×Xi+ϵi ϵinormal(0,σ) So far, we have learned about linear models that contain only a single predictor variable X:

1) Single continuous variable = "simple" regression (lecture 2)

2) Single categorical predictor w/ one level = one-sample t-test (lecture 4)

3) Single categorical predictor w/ two levels = two-sample t-test (lecture 4)

4) Single categorical predictor w/ > 2 levels = ANOVA (lecture 7)

More often than not, your models will need to contain more than one predictor

3 / 31

multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of multiple regression of the form:

yi=β0+β1×X1i+β2×X2i+ϵi ϵinormal(0,σ)

*Note that this model only contains two predictors but multiple regression models often contain many predictors

4 / 31

multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of multiple regression of the form:

yi=β0+β1×X1i+β2×X2i+ϵi ϵinormal(0,σ)

*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept β0, residual error ϵi, and residual variance σ remains the same as before

4 / 31

multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of multiple regression of the form:

yi=β0+β1×X1i+β2×X2i+ϵi ϵinormal(0,σ)

*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept β0, residual error ϵi, and residual variance σ remains the same as before

Interpretation of the slope coefficients β1, β2, etc. changes slightly

4 / 31

multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of multiple regression of the form:

yi=β0+β1×X1i+β2×X2i+ϵi ϵinormal(0,σ)

*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept β0, residual error ϵi, and residual variance σ remains the same as before

Interpretation of the slope coefficients β1, β2, etc. changes slightly

  • Slopes are the expected change in the response variable yi for a unit change in the corresponding predictor variable while holding all other predictors constant
4 / 31

multiple regression

Models with more than one predictor go by many names (blocking, ANCOVA, factorial, etc) but are all forms of multiple regression of the form:

yi=β0+β1×X1i+β2×X2i+ϵi ϵinormal(0,σ)

*Note that this model only contains two predictors but multiple regression models often contain many predictors

Interpretation of the intercept β0, residual error ϵi, and residual variance σ remains the same as before

Interpretation of the slope coefficients β1, β2, etc. changes slightly

  • Slopes are the expected change in the response variable yi for a unit change in the corresponding predictor variable while holding all other predictors constant

  • This is a subtle difference but an important one

4 / 31

example

One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)

5 / 31

example

One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)

Why would we want to control for extraneous variation?

5 / 31

example

One of the most common reasons for using multiple regression models is to control for extraneous sources of variation (i.e., sources of variation that influence the response variable but are not of interest in and of themselves)

Why would we want to control for extraneous variation?

Perhaps we are raising desert tortoises (Gopherus agassizii) for release into the wild and are interested in whether different diets influence their weight gain

5 / 31

example

A tortoises final weight, however, is not influenced only by diet. For example, it may also be influenced by it's starting body size

6 / 31

example

A tortoises final weight, however, is not influenced only by diet. For example, it may also be influenced by it's starting body size

As we will see, accurately measuring the effect of diet requires taking into account each individual's initial body size. Multiple regression allows us to do this

6 / 31

anova

Let's start by fitting a model we've already learned about:

7 / 31

anova

Let's start by fitting a model we've already learned about:

fit.lm <- lm(weight ~ diet, data = dietdata)
summary(fit.lm)
##
## Call:
## lm(formula = weight ~ diet, data = dietdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.39 -19.29 -0.37 19.78 54.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.19 7.69 33.94 <2e-16 ***
## dietLow 4.56 10.88 0.42 0.677
## dietMed 11.02 10.88 1.01 0.316
## dietHigh 25.28 10.88 2.32 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.8 on 56 degrees of freedom
## Multiple R-squared: 0.0989, Adjusted R-squared: 0.0506
## F-statistic: 2.05 on 3 and 56 DF, p-value: 0.117
7 / 31

anova

Let's start by fitting a model we've already learned about:

fit.lm <- lm(weight ~ diet, data = dietdata)
summary(fit.lm)
##
## Call:
## lm(formula = weight ~ diet, data = dietdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.39 -19.29 -0.37 19.78 54.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.19 7.69 33.94 <2e-16 ***
## dietLow 4.56 10.88 0.42 0.677
## dietMed 11.02 10.88 1.01 0.316
## dietHigh 25.28 10.88 2.32 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.8 on 56 degrees of freedom
## Multiple R-squared: 0.0989, Adjusted R-squared: 0.0506
## F-statistic: 2.05 on 3 and 56 DF, p-value: 0.117

Conclusions?

7 / 31

"ancova"

Now let's fit a slightly different model:

8 / 31

"ancova"

Now let's fit a slightly different model:

fit.lm2 <- lm(weight ~ diet + length, data = dietdata)
summary(fit.lm2)
##
## Call:
## lm(formula = weight ~ diet + length, data = dietdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.25 -12.18 -2.53 12.15 49.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 191.364 8.862 21.59 < 2e-16 ***
## dietLow 13.727 6.878 2.00 0.05091 .
## dietMed 25.226 6.974 3.62 0.00065 ***
## dietHigh 30.784 6.833 4.51 3.5e-05 ***
## length 2.789 0.297 9.38 5.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.6 on 55 degrees of freedom
## Multiple R-squared: 0.654, Adjusted R-squared: 0.628
## F-statistic: 25.9 on 4 and 55 DF, p-value: 4.14e-12
8 / 31

"ancova"

Now let's fit a slightly different model:

fit.lm2 <- lm(weight ~ diet + length, data = dietdata)
summary(fit.lm2)
##
## Call:
## lm(formula = weight ~ diet + length, data = dietdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.25 -12.18 -2.53 12.15 49.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 191.364 8.862 21.59 < 2e-16 ***
## dietLow 13.727 6.878 2.00 0.05091 .
## dietMed 25.226 6.974 3.62 0.00065 ***
## dietHigh 30.784 6.833 4.51 3.5e-05 ***
## length 2.789 0.297 9.38 5.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.6 on 55 degrees of freedom
## Multiple R-squared: 0.654, Adjusted R-squared: 0.628
## F-statistic: 25.9 on 4 and 55 DF, p-value: 4.14e-12

What changed?

8 / 31

signal vs. noise

yi=β0+β1X1i+β2X2i+ϵi

In any statistical test, our goal is to detect a signal ( β ) in the presence of noise (residual variation ϵ)

9 / 31

signal vs. noise

yi=β0+β1X1i+β2X2i+ϵi

In any statistical test, our goal is to detect a signal ( β ) in the presence of noise (residual variation ϵ)

Our ability to do that depends on the strength of the signal (usually beyond our control) and the amount of noise (partially within our control)

9 / 31

signal vs. noise

yi=β0+β1X1i+β2X2i+ϵi

In any statistical test, our goal is to detect a signal ( β ) in the presence of noise (residual variation ϵ)

Our ability to do that depends on the strength of the signal (usually beyond our control) and the amount of noise (partially within our control)

  • In the first model, where was all the variation in weight caused by variation in length?

  • We can see this clearly by looking at the ANOVA tables for the two models

9 / 31

anova tables

anova(fit.lm)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 3 5459 1820 2.05 0.12
## Residuals 56 49734 888
anova(fit.lm2)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 3 5459 1820 5.23 0.003 **
## length 1 30615 30615 88.07 5.2e-13 ***
## Residuals 55 19119 348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10 / 31

anova tables

anova(fit.lm)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 3 5459 1820 2.05 0.12
## Residuals 56 49734 888
anova(fit.lm2)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 3 5459 1820 5.23 0.003 **
## length 1 30615 30615 88.07 5.2e-13 ***
## Residuals 55 19119 348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note the residual sum of squares and mean square error of the two models

10 / 31

sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

11 / 31

sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

We wrote the model formula with diet first and length second (weight ~ diet + length). R calculates the sums of squares for diet first, then length

  • Notice that the diet SS in the two previous tables are the same
11 / 31

sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

We wrote the model formula with diet first and length second (weight ~ diet + length). R calculates the sums of squares for diet first, then length

  • Notice that the diet SS in the two previous tables are the same

If we wrote weight ~ length + diet, R calculates the length SS first, then the diet SS

  • The diet SS tells us how much variation is explained by the treatment variable after accounting for the covariate
11 / 31

sums of squares

When calculating the sums of squares for multiple regression models, the order that we write the model matters

We wrote the model formula with diet first and length second (weight ~ diet + length). R calculates the sums of squares for diet first, then length

  • Notice that the diet SS in the two previous tables are the same

If we wrote weight ~ length + diet, R calculates the length SS first, then the diet SS

  • The diet SS tells us how much variation is explained by the treatment variable after accounting for the covariate

This sequential method of calculating is called the Type I sums of squares

  • Type I SS is the default used by R
11 / 31

sums of squares

anova(fit.lm2)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## diet 3 5459 1820 5.23 0.003 **
## length 1 30615 30615 88.07 5.2e-13 ***
## Residuals 55 19119 348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.lm3 <- lm(weight ~ length + diet, data = dietdata)
anova(fit.lm3)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## length 1 27882 27882 80.21 2.5e-12 ***
## diet 3 8192 2731 7.86 0.00019 ***
## Residuals 55 19119 348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
12 / 31

sums of squares

Type I SS is only one of several ways of calculating sums of squares

13 / 31

sums of squares

Type I SS is only one of several ways of calculating sums of squares

  • Type II and Type III SS use slightly different methods that do not depend on order
13 / 31

sums of squares

Type I SS is only one of several ways of calculating sums of squares

  • Type II and Type III SS use slightly different methods that do not depend on order

  • For balanced designs, all three approaches will give the same answers

13 / 31

sums of squares

Type I SS is only one of several ways of calculating sums of squares

  • Type II and Type III SS use slightly different methods that do not depend on order

  • For balanced designs, all three approaches will give the same answers

  • For unbalanced designs, the correct approach depends on the hypotheses being tested

13 / 31

sums of squares

Type I SS is only one of several ways of calculating sums of squares

  • Type II and Type III SS use slightly different methods that do not depend on order

  • For balanced designs, all three approaches will give the same answers

  • For unbalanced designs, the correct approach depends on the hypotheses being tested

  • Other software programs (SAS, SPSS) default to Type III so may give different results than R (but all programs have methods to calculate SS using each method)

13 / 31

another example

In 1960, Herbert Stoddard established an experiment at Tall Timbers Research Station to study the effects of fire frequency on long-leaf pine forests in the Red Hills region of north Florida and southwest Georgia. The experiment originally consisted of eighty four, 0.5ha plots, each of which was assigned one of four treatments: 1-year, 2-year, or 3-year fire intervals or unburned. Stoddard recognized the soil type likely influenced response to fire, so treatment plots were allocated across the 3 primary soil types found on Tall Timbers

14 / 31

another example

We will use a made-up data set from this real experiment, with the goal of understanding whether understory plant richness (measured as the number of understory vascular plants recorded on each plot) differs among burn treatments.

Fire frequency
Replicate Unburned 1-year 2-year 3-year Soil
1 30 43 36 37 Loamy sand
2 28 37 30 27 Sandy loam
3 18 27 24 21 Upland
15 / 31

another example

We will use a made-up data set from this real experiment, with the goal of understanding whether understory plant richness (measured as the number of understory vascular plants recorded on each plot) differs among burn treatments.

Fire frequency
Replicate Unburned 1-year 2-year 3-year Soil
1 30 43 36 37 Loamy sand
2 28 37 30 27 Sandy loam
3 18 27 24 21 Upland

What are the experimental units? What are the observational units?

15 / 31

another example

Fit the model with treatment only:

data(firedata)
fm1 <- lm(species ~ interval, data = firedata)
anova(fm1)
## Analysis of Variance Table
##
## Response: species
## Df Sum Sq Mean Sq F value Pr(>F)
## interval 3 173 57.6 1.1 0.4
## Residuals 8 419 52.4

Conclusions?

16 / 31

another example

Because the soil types differ in a variety of ways that may influence plant communities, we probably want to control for this variability in our analysis

We do that by adding it to the model:

data(firedata)
fm2 <- lm(species ~ interval + soil, data = firedata)
anova(fm2)
## Analysis of Variance Table
##
## Response: species
## Df Sum Sq Mean Sq F value Pr(>F)
## interval 3 173 57.6 13.4 0.00456 **
## soil 2 394 196.8 45.7 0.00023 ***
## Residuals 6 26 4.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusions?

17 / 31

another example

Note that in this case, each treatment is applied exactly once within each soil type

18 / 31

another example

Note that in this case, each treatment is applied exactly once within each soil type

Soil type itself is not a experimental treatment and is not replicated

18 / 31

another example

Note that in this case, each treatment is applied exactly once within each soil type

Soil type itself is not a experimental treatment and is not replicated

This design is often referred to as a "randomized complete blocking" design (RCBD)

  • The soil types are called "blocks" and should be identified as a source of variability before the experiment is conducted

  • "Complete" refers to each treatment being represented in each block

  • "Randomized" means that treatments are assigned randomly to each experimental unit

18 / 31

another example

Note that in this case, each treatment is applied exactly once within each soil type

Soil type itself is not a experimental treatment and is not replicated

This design is often referred to as a "randomized complete blocking" design (RCBD)

  • The soil types are called "blocks" and should be identified as a source of variability before the experiment is conducted

  • "Complete" refers to each treatment being represented in each block

  • "Randomized" means that treatments are assigned randomly to each experimental unit

Again, blocks must be identified during the design phase. You should not "search" for blocks after the fact (why?)

18 / 31

multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

  • Remember that F=MStreatment/MSerror

  • By explaining some of the residual variation, including predictors in the model reduces MSerrer and thereby increases F

19 / 31

multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

  • Remember that F=MStreatment/MSerror

  • By explaining some of the residual variation, including predictors in the model reduces MSerrer and thereby increases F

  • When one predictor is a factor and the other(s) is continuous, this model is often referred to as an ANCOVA (Analysis of Covariance)

19 / 31

multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

  • Remember that F=MStreatment/MSerror

  • By explaining some of the residual variation, including predictors in the model reduces MSerrer and thereby increases F

  • When one predictor is a factor and the other(s) is continuous, this model is often referred to as an ANCOVA (Analysis of Covariance)

  • When both predictors are factors and each treatment is included within each block, this model is often referred to a randomized complete block design

19 / 31

multiple regression

As these examples show, one reason to use a multiple regression model is to increase power to detect treatment effects

  • Remember that F=MStreatment/MSerror

  • By explaining some of the residual variation, including predictors in the model reduces MSerrer and thereby increases F

  • When one predictor is a factor and the other(s) is continuous, this model is often referred to as an ANCOVA (Analysis of Covariance)

  • When both predictors are factors and each treatment is included within each block, this model is often referred to a randomized complete block design

In experimental setting, ANCOVA and blocking are used when there is extraneous variation that we cannot control during the design phase

  • Predictor/blocks are not of interest; only included to increase power
19 / 31

interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

  • This is often the case in experimental settings
20 / 31

interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

  • This is often the case in experimental settings

However, in observational studies, we are often interested in the effects of both predictors

  • In this case, it helps to understand the model structure and parameter interpretation in more detail
20 / 31

interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

  • This is often the case in experimental settings

However, in observational studies, we are often interested in the effects of both predictors

  • In this case, it helps to understand the model structure and parameter interpretation in more detail

  • Understanding how multiple regression models are structured will also help when you need to include different combinations of factors and continuous predictors

20 / 31

interpreting model output

If we are only interested in the treatment effect and not the blocks/continuous predictor, the ANOVA table (possibly in combination with multiple comparisons) is often sufficient for interpreting the model

  • This is often the case in experimental settings

However, in observational studies, we are often interested in the effects of both predictors

  • In this case, it helps to understand the model structure and parameter interpretation in more detail

  • Understanding how multiple regression models are structured will also help when you need to include different combinations of factors and continuous predictors

Luckily, you already have the tools exploring these models in detail...

20 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
21 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
  • Intercept = predicted weight of control group when length = 0
21 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
  • Intercept = predicted weight of control group when length = 0

  • dietLow = difference between control and low diet

21 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
  • Intercept = predicted weight of control group when length = 0

  • dietLow = difference between control and low diet

  • dietMed = difference between control and medium diet

21 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
  • Intercept = predicted weight of control group when length = 0

  • dietLow = difference between control and low diet

  • dietMed = difference between control and medium diet

  • dietHigh = difference between control and high diet

21 / 31

interpreting the ancova model

broom::tidy(fit.lm2)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 191. 8.86 21.6 1.56e-28
## 2 dietLow 13.7 6.88 2.00 5.09e- 2
## 3 dietMed 25.2 6.97 3.62 6.48e- 4
## 4 dietHigh 30.8 6.83 4.51 3.50e- 5
## 5 length 2.79 0.297 9.38 5.16e-13
  • Intercept = predicted weight of control group when length = 0

  • dietLow = difference between control and low diet

  • dietMed = difference between control and medium diet

  • dietHigh = difference between control and high diet

  • length = predicted increase in weight for one unit increase in length

21 / 31

interpreting the ancova model

22 / 31

interpreting the ancova model

Question: does the effect of length depend on diet treatment?

22 / 31

interpreting the ancova model

Question: does the effect of length depend on diet treatment?

Answer: No. The model assumes the effect of length is the same for every treatment

22 / 31

interpreting the rcbd model

summary(fm2)
##
## Call:
## lm(formula = species ~ interval + soil, data = firedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5000 -0.9167 0.0833 0.9375 2.2500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.08 1.47 21.87 6.0e-07 ***
## interval1 10.33 1.69 6.10 0.00088 ***
## interval2 4.67 1.69 2.75 0.03310 *
## interval3 2.67 1.69 1.57 0.16656
## soilsandy loam -6.25 1.47 -4.26 0.00532 **
## soilupland -14.00 1.47 -9.54 7.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.07 on 6 degrees of freedom
## Multiple R-squared: 0.956, Adjusted R-squared: 0.92
## F-statistic: 26.3 on 5 and 6 DF, p-value: 0.000518
23 / 31

interpreting the rcbd model

summary(fm2)
##
## Call:
## lm(formula = species ~ interval + soil, data = firedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5000 -0.9167 0.0833 0.9375 2.2500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.08 1.47 21.87 6.0e-07 ***
## interval1 10.33 1.69 6.10 0.00088 ***
## interval2 4.67 1.69 2.75 0.03310 *
## interval3 2.67 1.69 1.57 0.16656
## soilsandy loam -6.25 1.47 -4.26 0.00532 **
## soilupland -14.00 1.47 -9.54 7.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.07 on 6 degrees of freedom
## Multiple R-squared: 0.956, Adjusted R-squared: 0.92
## F-statistic: 26.3 on 5 and 6 DF, p-value: 0.000518

This scenario (two factors) is a little more tricky. But, as before, we'll use the design matrix to help understand what each parameter means

23 / 31

interpreting the rcbd model

model.matrix(fm2)
## (Intercept) interval1 interval2 interval3 soilsandy loam soilupland
## 1 1 0 0 0 1 0
## 2 1 0 0 0 0 0
## 3 1 0 0 0 0 1
## 4 1 1 0 0 1 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 1
## 7 1 0 1 0 1 0
## 8 1 0 1 0 0 0
## 9 1 0 1 0 0 1
## 10 1 0 0 1 1 0
## 11 1 0 0 1 0 0
## 12 1 0 0 1 0 1
## attr(,"assign")
## [1] 0 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$interval
## [1] "contr.treatment"
##
## attr(,"contrasts")$soil
## [1] "contr.treatment"
24 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand
25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

  • interval2 = difference between unburned and 2-year fire interval

25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

  • interval2 = difference between unburned and 2-year fire interval

  • interval3 = difference between unburned and 3-year fire interval

25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

  • interval2 = difference between unburned and 2-year fire interval

  • interval3 = difference between unburned and 3-year fire interval

  • soilsandy loam = difference between loamy sand and sandy loam soils

25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

  • interval2 = difference between unburned and 2-year fire interval

  • interval3 = difference between unburned and 3-year fire interval

  • soilsandy loam = difference between loamy sand and sandy loam soils

  • soilupland = difference between loamy sand and upland soils

25 / 31

interpreting the rcbd model

  • Intercept = predicted number of species in the unburned treatment in loamy sand

  • interval1 = difference between unburned and 1-year fire interval

  • interval2 = difference between unburned and 2-year fire interval

  • interval3 = difference between unburned and 3-year fire interval

  • soilsandy loam = difference between loamy sand and sandy loam soils

  • soilupland = difference between loamy sand and upland soils

Again, because there is no interaction between interval and soil, the model assumes that the effects of fire frequency are the same in every soil type (and that the soil differences are the same for every fire frequency)

25 / 31

one more example

We have looked at multiple regression examples with two factors and with one factor and one continuous covariate, but what about multiple continuous covariates?

26 / 31

one more example

We have looked at multiple regression examples with two factors and with one factor and one continuous covariate, but what about multiple continuous covariates?

The biomassdata object contains (made up) biomass measurements (kg) as a function of rainfall (mm) and elevation (km)

data("biomassdata")
head(biomassdata)
## biomass rainfall elevation
## 1 96.12 166.0 4.4397
## 2 122.89 150.3 4.9462
## 3 69.14 195.3 5.8502
## 4 47.02 192.0 0.5679
## 5 20.56 211.6 2.5554
## 6 10.77 218.0 1.0106
26 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 320. 4.11 78.0 2.31e-51
## 2 rainfall -1.44 0.0217 -66.4 3.97e-48
## 3 elevation 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall = 0 and elevation = 0
27 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 320. 4.11 78.0 2.31e-51
## 2 rainfall -1.44 0.0217 -66.4 3.97e-48
## 3 elevation 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall = 0 and elevation = 0

  • rainfall = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

27 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall + elevation, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 320. 4.11 78.0 2.31e-51
## 2 rainfall -1.44 0.0217 -66.4 3.97e-48
## 3 elevation 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall = 0 and elevation = 0

  • rainfall = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

  • elevation = predicted change in biomass for 1km increase in elevation while holding rainfall constant

27 / 31

centering and standardizing data

When fitting models to continuous covariates, it is common to center or standardize covariates.

28 / 31

centering and standardizing data

When fitting models to continuous covariates, it is common to center or standardize covariates.

Centering is done by subtracting the mean

biomassdata$elevation.c <- biomassdata$elevation - mean(biomassdata$elevation)
biomassdata$rainfall.c <- biomassdata$rainfall - mean(biomassdata$rainfall)
28 / 31

centering and standardizing data

When fitting models to continuous covariates, it is common to center or standardize covariates.

Centering is done by subtracting the mean

biomassdata$elevation.c <- biomassdata$elevation - mean(biomassdata$elevation)
biomassdata$rainfall.c <- biomassdata$rainfall - mean(biomassdata$rainfall)

When interpreting centered data:

  • positive values indicate observations larger than the mean

  • negative values indicate observation smaller than the mean

  • units don't change

28 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62.0 0.414 150. 1.24e-64
## 2 rainfall.c -1.44 0.0217 -66.4 3.97e-48
## 3 elevation.c 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall and elevation are at their mean
29 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62.0 0.414 150. 1.24e-64
## 2 rainfall.c -1.44 0.0217 -66.4 3.97e-48
## 3 elevation.c 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall and elevation are at their mean

  • rainfall = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

29 / 31

multiple regression model

fit.lm <- lm(biomass ~ rainfall.c + elevation.c, data = biomassdata)
broom::tidy(fit.lm)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62.0 0.414 150. 1.24e-64
## 2 rainfall.c -1.44 0.0217 -66.4 3.97e-48
## 3 elevation.c 3.97 0.241 16.5 3.77e-21
  • Intercept = predicted biomass when rainfall and elevation are at their mean

  • rainfall = predicted change in biomass for 1mm increase in rainfall while holding elevation constant

  • elevation = predicted change in biomass for 1km increase in elevation while holding rainfall constant

29 / 31

multiple regression model

30 / 31

looking ahead


Next time: Interactions


Reading: Fieberg chp. 3.8

31 / 31

outline

1) Model structure


2 / 31
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow