---
title: "Lab-10-ANCOVA-with-Code.Rmd"
author: "Matt Defenderfer"
date: "9/29/2019"
output:
  html_document:
    toc: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Outline for Lab 10:
1. Load Packages and Data
    a. Overview of the Data
2. ANCOVA Assumptions
    a. Linearity
    b. Independence
    c. Normality within each group
    d. Homogeneity of Variance
    e. Homogeneity of Regression Slopes
3. ANCOVA
    a. Anova
    b. ezANOVA
    c. Reporting an ANCOVA
## Load Packages
Like always, we will need the `tidyverse` package, the `car` package for performing the ANCOVA, the `QuantPsyc` package for standardized coefficients from linear models, the `pastecs` package for testing normality, and the `broom` package for extracting residuals.
```{r Packages, message=FALSE}
library(car)
library(pastecs)
library(QuantPsyc)
library(broom)
library(tidyverse)
theme_set(theme_bw())
```
The data is located in the file "ANCOVA_Data.RData". It contains a single dataframe, `mouse.mass`, which has the following variables:
- `idnum` (factor, 69 levels): a unique ID number for each mouse
- `group` (factor, 3 levels): a designation of whether the mouse was fully raised in the lab ('lab'), raised in the wild ('wild'), or born in the wild but has lived in the lab for a few weeks ('wild-der')
- `bodyfat` (numeric): a measure of the amount of body fat on each mouse
- `leanmass` (numeric): a measure of the mass of the mouse after a period of food deprivation
```{r Load the data}
load("/Users/Matt/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/ANCOVA_Data.RData")
```
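Before moving on, it can be worth confirming the structure of what we loaded. A quick sanity check (my addition, not part of the original lab workflow):

```{r inspect structure}
# Confirm variable types and factor levels match the description above:
# idnum (factor, 69 levels), group (factor, 3 levels), bodyfat and leanmass (numeric)
str(mouse.mass)
```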
## Assumptions
ANCOVA has essentially the same assumptions as ANOVA with a couple more thrown in.
### Linearity
The linearity assumption applies to the covariate and the dependent variable. We can test for linearity in the same way as before: plot both variables on a scatterplot, add a regression line, and see if the points follow the line. Put the covariate `leanmass` on the x-axis and the dependent variable `bodyfat` on the y-axis.
```{r Linearity}
ggplot(mouse.mass, aes(x = leanmass, y = bodyfat)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
```
We can see that the points follow the line fairly well, which is good. Linearity seems to be preserved here.
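If you want a number to back up the visual check, a Pearson correlation between the covariate and the dependent variable is one option (a supplemental check on my part, not something the lab requires):

```{r linearity correlation}
# A strong, significant correlation is consistent with a linear relationship
# between the covariate and the dependent variable
cor.test(mouse.mass$leanmass, mouse.mass$bodyfat)
```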
### Independence
Data must be independent within and across groups. We are only measuring each variable once from each mouse and each mouse is only a part of a single group. So independence is preserved.
### Normality Within Groups
We will test normality just like always: histograms, Q-Q plots, boxplots, and statistical tests. We need to do this for both the dependent variable and the covariate.
#### Histograms
```{r hist leanmass}
ggplot(mouse.mass, aes(x = leanmass, fill = group)) +
geom_histogram(bins = 7, color = "black") +
facet_wrap(~group, scales = "free")
```
Histograms for leanmass show fairly normal distributions overall. The wild-der group may be slightly skewed, but given the small-ish sample size, I wouldn't declare it skewed at this point.
```{r hist bodyfat}
ggplot(mouse.mass, aes(x = bodyfat, fill = group)) +
geom_histogram(bins = 7, color = "black") +
facet_wrap(~group, scales = "free")
```
Histograms for bodyfat also show generally normal distributions. Wild might be somewhat positively skewed, but again, nothing overt.
#### Q-Q plots
```{r qq leanmass}
ggplot(mouse.mass, aes(sample = leanmass)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~group, scales = "free")
```
The Q-Q plots for leanmass all look normal. Nothing to worry about here.
```{r qq bodyfat}
ggplot(mouse.mass, aes(sample = bodyfat)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~group, scales = "free")
```
Again, the Q-Q plots show normal distributions all around. No evidence of skewness in any group.
#### Boxplots
```{r box leanmass}
ggplot(mouse.mass, aes(x = group, y = leanmass)) +
geom_boxplot()
```
Medians are close to the center of the IQR for all 3 groups. Whiskers are also around the same length, outside of the Wild-Der group, whose 1st quartile whisker is shorter than its 3rd quartile whisker. The lab group also has a slightly shorter 1st quartile whisker, but it seems within tolerance. There isn't anything overtly skewed about these distributions, but we should keep an eye on Wild-Der's leanmass in further tests.
```{r box bodyfat}
ggplot(mouse.mass, aes(x = group, y = bodyfat)) +
geom_boxplot()
```
Wild's 3rd quartile whisker is much longer than its 1st quartile whisker, but its median is firmly in the middle of the IQR. Otherwise, the whiskers and medians look fine in the other 2 groups. No boxplot showed any outliers, which points towards normal distributions.
#### Statistical Tests
We need to compute the skewness statistics for both `leanmass` and `bodyfat` within the different groups. We can do it across groups all at once using the `tapply` function from the T-test lab.
We can calculate the normality stats for `leanmass` in each group using `tapply` like so:
```{r stats leanmass}
list.out <- tapply(mouse.mass$leanmass, mouse.mass$group, stat.desc, basic = FALSE, norm = TRUE)
stats <- as.data.frame(do.call(rbind,list.out))
t(stats)
```
For our skewness metric, 2 distributions came out slightly skewed: the Wild-Der and Lab groups. However, the skew.2SE values are far lower than our rule of thumb for smaller sample sizes, 1.58, and none of the SW tests came out significant either. I would say these tests point towards normal leanmass distributions for all groups. Very slight skew may exist, but nothing that needs to be outright corrected.
```{r stats bodyfat}
list.out <- tapply(mouse.mass$bodyfat, mouse.mass$group, stat.desc, basic = FALSE, norm = TRUE)
stats <- as.data.frame(do.call(rbind,list.out))
t(stats)
```
As for the bodyfat distributions, Wild has skewness > 0.5; however, the skew.2SE value was again far less than 1.58, and the SW test came out non-significant. The other two groups show no evidence of skewness in any metric output here.
Taking everything into account, I would say all of our distributions are normal enough to continue without any transformations. Given the smaller sample size per group (N = 23), I would put more emphasis on the results of the SW test and the Q-Q plots. Histograms are much more difficult to interpret correctly with small sample sizes. The skewness metric for any given distribution was either not skewed or only slightly skewed and was never greater than 0.55, only 10% above our threshold for skewness. Our rule of thumb using skew.2SE also never passed the 1.58 threshold for smaller sample sizes.
### Homogeneity of Variance
We need to use Levene's Test to check homogeneity of variance across groups for both `leanmass` and `bodyfat`. We can easily do this with the `leveneTest` function, like before.
```{r levene leanmass}
leveneTest(leanmass ~ group, mouse.mass)
```
```{r levene bodyfat}
leveneTest(bodyfat ~ group, mouse.mass)
```
Neither Levene's Test came out significant, so it seems like homogeneity of variance has been preserved.
### Homogeneity of Regression Slopes
The last assumption that differs from basic ANOVA is homogeneity of regression slopes. When doing an ANCOVA, you are essentially running a normal linear model of the covariate on the dependent variable for each group. You do not want the covariate to interact with the groups, which means the regression slopes should be essentially the same across groups. We can test this graphically by plotting scatterplots of the covariate and dependent variable along with regression lines, like we did when testing linearity, except we split by group this time. We are going to use the color aesthetic in ggplot to split across groups.
```{r regression slope homogeneity}
ggplot(mouse.mass, aes(x = leanmass, y = bodyfat, color = group)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
```
What you want to see is that the slopes for the different groups are about the same, essentially parallel. In this case, `wild-der` and `lab` have essentially the same slope. `Wild` is not perfectly parallel, but it is definitely close enough.
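If you ever want a formal test to go with the plot, one standard approach (an extra check, not part of the original workflow here) is to fit a model that includes the group-by-covariate interaction and confirm the interaction term is non-significant:

```{r slope interaction test}
# A non-significant group:leanmass interaction means the regression slopes
# can be treated as homogeneous across groups
slopes.m <- lm(bodyfat ~ group * leanmass, data = mouse.mass)
Anova(slopes.m, type = 'III')
```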
#### Checking Standardized Regression Slopes
For this example, there is no doubt that the regression slopes are homogeneous enough. However, there will be borderline cases where you will need something more quantifiable. You can use the `lm.beta` function from the `QuantPsyc` package to see what the standardized slope is if you create a linear model for each group. In general, you do not want any single standardized slope to be more than 0.4 units away from any other slope. For practice, let's calculate the standardized slopes for each group here and compare them.
For this example, we will be making 3 different linear models. All of them will model bodyfat as a function of leanmass, but each will be fit with data from one of the 3 different groups.
```{r leanmass betas}
# Wild group linear model
wild.m <- lm(bodyfat ~ leanmass, data = subset(mouse.mass, group == 'wild'))
lm.beta(wild.m)
# Wild-derived group linear model
wildder.m <- lm(bodyfat ~ leanmass, data = subset(mouse.mass, group == 'wild-der'))
lm.beta(wildder.m)
# Lab group linear model
lab.m <- lm(bodyfat ~ leanmass, data = subset(mouse.mass, group == 'lab'))
lm.beta(lab.m)
```
As you can see, all of the slopes are ~0.74, about as close to equal as you're going to see, which is great. Homogeneity of regression slopes has been maintained.
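As an aside, if you'd rather not write three nearly identical models, the same betas can be computed in one pass with `split` and `sapply` (a sketch equivalent to the code above):

```{r betas one pass}
# Fit bodyfat ~ leanmass within each group and pull out the standardized slope
sapply(split(mouse.mass, mouse.mass$group),
       function(d) lm.beta(lm(bodyfat ~ leanmass, data = d)))
```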
## Group Means and Standard Errors
Quickly, let's also plot the mean and SEM of bodyfat for each group to get an idea of what we are dealing with.
```{r mean and SEM}
ggplot(mouse.mass, aes(x = group, y = bodyfat, fill = group, color = group)) +
  geom_bar(stat = "summary", fun = mean, alpha = 0.3, show.legend = FALSE) +
geom_jitter(size = 2, width = 0.25, show.legend = FALSE) +
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.3, color = 'black')
```
## Running an ANCOVA
### Using Anova
ANCOVA using `Anova` looks very similar to our basic ANOVA from last lab. For a refresher, let's create an ANOVA of just bodyfat as a function of group. Use type-III sums of squares again as well.
```{r bodyfat by group}
bf.anova <- lm(bodyfat ~ group, mouse.mass)
Anova(bf.anova, type = 'III')
```
So we see a significant main effect of group by itself. If you left it there, you could definitely convince yourself that the result is real when looking at the mean and standard error plot we made just a bit ago. The means of the distributions are definitely different. However, we still have a covariate to account for as well.
Just like last time, we are going to begin by creating an `lm` object. The `lm` object is going to have bodyfat as a function of the main effects of leanmass and group.
```{r lm}
bf.ancova <- lm(bodyfat ~ group + leanmass, mouse.mass)
Anova(bf.ancova, type = 'III')
```
You can now see that the main effect of group has completely disappeared. In its place, there is a highly significant effect of the covariate. So, if you have a covariate that may explain some part of the variance in your outcome variable, add it to the model, and do not be surprised if it explains away any perceived difference between the groups.
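To see why the group effect vanished, it can help to look at the covariate-adjusted group means: each group's predicted bodyfat at the overall mean of leanmass. A sketch using `predict()` (the `adj` dataframe name is just for illustration):

```{r adjusted means}
# Predicted bodyfat for each group, holding leanmass at its grand mean;
# these adjusted means are what the ANCOVA is effectively comparing
adj <- data.frame(group = levels(mouse.mass$group),
                  leanmass = mean(mouse.mass$leanmass))
cbind(adj, predicted.bodyfat = predict(bf.ancova, newdata = adj))
```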
To get the Multiple R^2^ and F-statistic for the overall model, run the lm object through `summary`.
```{r summary lm}
summary(bf.ancova)
```
DO NOT USE THE RESULTS FROM THE COEFFICIENTS TABLE FROM THE SUMMARY FUNCTION WHEN YOU HAVE MORE THAN 2 GROUPS. THE TABLE WILL SPLIT THE GROUPING VARIABLE INTO PAIRS, GIVING YOU AN ESTIMATE, P-VALUE, AND ALL THAT FOR EACH PAIR, WHICH ISN'T WHAT YOU WANT. JUST USE IT FOR THE OVERALL MODEL STANDARD ERROR, R^2, F-STAT, DF, AND P-VALUE AT THE BOTTOM.
### Using ezANOVA
Unfortunately, ezANOVA is not equipped to properly use covariates. Do not be fooled by the different covariate options in the function. The function will instead fully regress out the effects of the covariate and only leave you with a results section regarding the grouping variable. It will not tell you whether the covariate was a significant predictor or not. If you have a covariate in your model, use the Anova method from above, no matter what.
### Reporting an ANCOVA
The unadjusted model testing bodyfat between groups of lab-reared mice, wild mice, and wild-derived mice was significant, F(2,66) = 5.754, p = 0.00497, R^2^ = 0.1485. \<Include sentence here describing specific differences between the groups using a post-hoc test ON THE ANOVA WITHOUT THE COVARIATE\> (The F-stat, DF, p, and R^2^ values for the whole ANOVA came from the output of `summary(bf.anova)`, at the bottom).
After adjusting for leanmass, the mass of the mouse after fasting, the model remained significant, F(3,65) = 32.66, p = 5.399e-13, R^2^ = 0.6012. However, the effect of group was no longer significant, F(2,65) = 0.188, p = 0.829. \<If the group effect is still significant, include output from post-hoc tests describing specific differences between groups\>. (Whole-model ANCOVA stats obtained from `summary(bf.ancova)`; stats for the sole effect of `group` obtained from `Anova(bf.ancova, type = 3)`.)
#### Differences Between Reporting the ANCOVAs Here and in Lecture
If you remember from lecture, the report slides included t-values, p-values, and interpretations of one group being greater than another group. You can do that in that example without running post-hoc tests because there are only two levels in the grouping variable, so `summary` will output the correct things to put into the report. However, when you have more than 2 levels in your grouping variable, you can only say "there was a significant main effect of group" and give the F-statistic and p-value for the effect of group as a whole. You then have to perform post-hoc tests to find specific differences, and add in those results after you say "there was a significant main effect of group \<insert stats\>". When we go over post-hoc tests, we can talk about the full report sentence there. Just don't go looking for a t-statistic (which implies a t-test was run) right out of the ANOVA or ANCOVA summary to put into the report sentence when you have more than 2 levels in your grouping variable.
## Residual Plot
Like before, we need to use `augment` to extract the residuals and fitted values from the ANCOVA model object.
```{r augment}
bf.a <- augment(bf.ancova)
```
As before, we will plot it as a scatterplot, with .resid on the y-axis and .fitted on the x-axis, and color the points by group. We should also facet by group.
```{r resid}
ggplot(bf.a, aes(x = .fitted, y = .resid)) +
geom_point(aes(color = group), show.legend = FALSE) +
geom_hline(yintercept = 0) +
facet_wrap(~group, scales = "free")
```
As before, we are looking for randomly distributed points centered around 0. All of our groups seem to pass. We don't see any evidence of heterogeneity of variance in the residuals, and they all seem to be centered around the horizontal 0 line.
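If you want a statistical complement to the visual check, one option (my addition, using the `bf.a` dataframe from `augment` above) is to run Levene's Test on the residuals by group:

```{r levene residuals}
# A non-significant result supports homogeneity of variance in the residuals
leveneTest(.resid ~ group, data = bf.a)
```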