---
title: "Computer exercise 2: solutions"
author: "Anna Lindgren"
date: "8 April 2019"
output:
pdf_document:
fig_caption : true
classoption:
a4paper
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>"
)
```
```{r data}
sleep <- read.delim("../data/sleep.txt")
sleep$Danger <- factor(sleep$Danger, levels = c(1, 2, 3),
labels = c("low", "medium", "high"))
```
## Exercise 2: Sleep in Mammals
### Danger
#### (a) Danger frequencies
```{r tabell}
with(sleep, table(Danger))
```
Medium danger is the most frequent category.
#### (b) Average sleep in different danger groups
```{r dangermeans}
aggregate(TotalSleep ~ Danger, data = sleep, FUN = "mean")
```
The more danger the less sleep. Seems reasonable.
```{r dangerplot}
ytext = "Total sleep (hrs/day)"
xtext.d = "Danger category"
with(sleep, plot(TotalSleep ~ Danger,
ylim = c(0, 24),
ylab = ytext,
xlab = xtext.d,
main = "Total daily sleep in the danger categories"))
```
The amount of daily sleep decreases with increasing danger. The medians in the box plots agree fairly well with the corresponding means. There is also a large variation within each group.
#### (c) Linear model with danger categories
```{r dangermodel}
model.danger <- lm(TotalSleep ~ Danger, data = sleep)
summary(model.danger)
```
```{r dangerpredict}
x0 <- data.frame(Danger = c("low", "medium", "high"))
pred.danger <- cbind(x0, fit = predict(model.danger, x0))
pred.danger
```
The predictions are the same as the category averages. They are computed as
$$\hat{Y}_{\text{low}} = \hat{\beta}_0 = `r model.danger$coefficients["(Intercept)"]`$$
$$\hat{Y}_{\text{medium}} = \hat{\beta}_0 + \hat{\beta}_1 = `r model.danger$coefficients["(Intercept)"]` + (`r model.danger$coefficients["Dangermedium"]`) = `r pred.danger[2, 2]`$$
$$\hat{Y}_{\text{high}} = \hat{\beta}_0 + \hat{\beta}_2 = `r model.danger$coefficients["(Intercept)"]` + (`r model.danger$coefficients["Dangerhigh"]`) = `r pred.danger[3, 2]`$$
#### (d) F-test for Danger
The Danger is the only variable in the model we can use a global F-test.
```{r testdanger}
anova(model.danger)
```
Since the P-value $= `r anova(model.danger)[1, "Pr(>F)"]` < \alpha = 0.05$ we can reject $H_0$: $\beta_1 = \beta_2 = 0$. Yes, we need the danger variable.
We could also look at the last line in the summary:
```{r}
summary(model.danger)
```
#### (e) Danger parameter confidence intervals
```{r dangerint}
confint(model.danger)
```
Note that the interval for $\beta_1$, Dangermedium, covers zero, in agreement
with the non-significant result in the summary.
### Danger and body weight
#### (f) Linear function of weight
```{r weight, fig.height = 3.5}
par(mfrow = c(1, 2))
with(sleep, plot(TotalSleep ~ BodyWt))
with(sleep, plot(TotalSleep ~ BodyWt, log = "x"))
```
The body weight variable is very skewed (Elephants are heavy!) and it is difficult to see any linear relationship. By taking the logarithm of the weight we get something much more reasonable. When using body weight as it is, we would assume that adding 1kg to the body weight would give the same decrease in the amount of sleep, regardless of the size of the species. But 1kg is a huge change in a small animal but unnoticable in large animals (adding 1kg to a 6+ ton elephant would not be expected to give any noticable change in the amount of sleep).
When taking the log of the body weight we assume that relative changes in weight, e.g. doubling the weight, would have a fixed effect on the amount of sleep needed.
#### (g) Log weight and danger
We first fit the linear model $\text{TotalSleep}_i = \beta_0 + \beta_1 \ln(\text{BodyWt}_i) + \epsilon_i$:
```{r weightmodel}
model.bodywt <- lm(TotalSleep ~ log(BodyWt), data = sleep)
summary(model.bodywt)
```
We then fit the linear model $\text{TotalSleep}_i = \beta_0 + \beta_1 \ln(\text{BodyWt}_i) + \beta_2 \text{Danger}_{\text{medium},i} + \beta_3 \text{Danger}_{\text{high},i} + \epsilon_i$:
```{r weightdanger}
model.bodywtdanger <- lm(TotalSleep ~ log(BodyWt) + Danger, data = sleep)
summary(model.bodywtdanger)
```
Do we need the danger variable?
```{r adddanger}
anova(model.bodywt, model.bodywtdanger)
```
Since the P-value $= `r anova(model.bodywt, model.bodywtdanger)[2, "Pr(>F)"]` < 0.05 = \alpha$ we should reject $H_0$: $\beta_2 = \beta_3 = 0$. Yes, we still need the danger variable.
Do we need the body weight?
```{r addbody}
anova(model.danger, model.bodywtdanger)
```
Since the P-value $= `r anova(model.danger, model.bodywtdanger)[2, "Pr(>F)"]` < 0.05 = \alpha$ we should reject $H_0$: $\beta_1 = 0$. Yes, we still need the body weight.
#### (h) New danger parameter confidence intervals
```{r cibeta}
confint(model.danger)
confint(model.bodywtdanger)
```
We can note that when we add log body weight, the interval for medium danger no longer covers zero.
```{r bodydangerplot}
with(sleep, plot(BodyWt ~ Danger, log = "y"))
```
As seen in the plot, body weight and danger are not independent. The medium danger species tend to be smaller than either low or high danger species. Adding body weight in the model allows us to separate the two conflicting effects. Species living in medium danger sleep less that species *with the same body weight* living in low danger. But this effect was blurred since the medium danger species were often smaller than the low danger species, and smaller species sleep more.
#### (i) Residuals
```{r resids, fig.height = 5}
par(mfrow = c(2, 2))
e <- model.bodywtdanger$residuals
qqnorm(e, main = "(a) Normal Q-Q-plot")
qqline(e)
plot(e ~ predict(model.bodywtdanger), main = "(b) residuals vs Y-hat")
abline(h = 0)
plot(e ~ sleep$BodyWt, log = "x", main = "(c) residuals vs log(BodyWt)")
abline(h = 0)
plot(e ~ sleep$Danger, main = "(d) residuals vs Danger")
abline(h = 0)
```
The Q-Q-plot in Fig (a) shows that the residuals are close to normally distributed.
In Fig (b) we see a tendency towards increasing variance with increasing $\hat{Y}$.
In Fig (c) we do not see anyting worrying. There is no systematic tendencies and the variance seems constant over ln(BodyWt).
In Fig (d) we see that the variance is largest among the low danger species and decreasing with increasing danger. This could explain the behaviour in Fig (b) since the low danger species are the ones who tend to sleep more and thus have a higher predicted amount of sleep.
### Predictions
#### (j) A new species
```{r newpred}
x0 <- data.frame(BodyWt = 30, Danger = "medium")
cbind(x0, pred = predict(model.bodywtdanger, x0, interval = "prediction"))
```
#### (k) Homo Sapiens
```{r homoconf}
x0 <- data.frame(Species = "Homo Sapiens", BodyWt = 62, Danger = "low")
cbind(x0, pred = predict(model.bodywtdanger, x0, interval = "confidence"))
```
#### (l) Marmot vs Vervet
We have the expected values for the two species as
$$E(Y_{\text{Marmot}}) = \beta_0 + \beta_1 \cdot \ln 4$$
and
$$E(Y_{\text{Vervet}}) = \beta_0 + \beta_1 \cdot \ln 4 + \beta_3.$$
The difference is
$$E(Y_{\text{Marmot}}) - E(Y_{\text{Vervet}}) = \beta_0 + \beta_1 \cdot \ln 4 - (\beta_0 + \beta_1 \cdot \ln 4 + \beta_3) = - \beta_3$$
and the confidence interval is simply $-I_{\beta_3}$ hours:
```{r mvsvci}
ci.l <- -confint(model.bodywtdanger)["Dangerhigh", c(2, 1)]
ci.l
```
### (m) Marmot vs Homo Sapiens
We have the expected values for the two species as
$$E(Y_{\text{Marmot}}) = \beta_0 + \beta_1 \cdot \ln 4$$
and
$$E(Y_{\text{Homo Sapiens}}) = \beta_0 + \beta_1 \cdot \ln 62.$$
The difference is
$$E(Y_{\text{Marmot}}) - E(Y_{\text{Homo Sapiens}}) = \beta_0 + \beta_1 \cdot \ln 4 - (\beta_0 + \beta_1 \cdot \ln 62) = \beta_1 (\ln 4 - \ln 62) = - \ln \frac{62}{4} \cdot \beta_1$$
and the confidence interval is simply $-\ln\frac{62}{4} \cdot I_{\beta_1}$ hours:
```{r mhsci}
ci.m <- -log(62/4) * confint(model.bodywtdanger)["log(BodyWt)", c(2, 1)]
ci.m
```
#### (n) Vervet vs Homo Sapiens
We now have the difference as
$$E(Y_{\text{Vervet}}) - E(Y_{\text{Homo Sapiens}}) = \beta_0 + \beta_1 \cdot \ln 4 + \beta_3 - (\beta_0 + \beta_1 \cdot \ln 62) = \beta_1 (\ln 4 - \ln 62) + \beta_3 = \ln \frac{4}{62} \cdot \beta_1 + \beta_3.$$
```{r predd}
xx <- matrix(c(0, log(4/62), 0, 1), nrow = 1)
preddif <- xx %*% coefficients(model.bodywtdanger)
```
The estimate is $\ln \frac{4}{62} \cdot \hat{\beta}_1 + \hat{\beta}_3 = `r preddif`$ with variance
$$(\ln \frac{4}{62})^2 V(\hat{\beta}_1) + V(\hat{\beta}_3) + 2 \ln \frac{4}{62} Cov(\hat{\beta}_1, \hat{\beta}_3)$$
```{r covariance}
s <- summary(model.bodywtdanger)$sigma
XtXinv <- summary(model.bodywtdanger)$cov.unscaled
Vxx <- xx %*% XtXinv %*% t(xx)
# preddif and Vxx are 1 x 1 matrices so we need to pick the scalar value out first
# if we want to calculate both limits at the same time with +/- as c(-1, 1)
# which cannot be multiplied on to a matrix. R "feature".
preddif[1, 1] + qt(1 - 0.05 / 2, nrow(sleep) - 4) * s * sqrt(Vxx[1, 1]) * c(-1, 1)
```
The Vervet is expected to sleep less than Homo Sapiens.