Lab-3-Intro-to-Statistical-Models-With-Code.Rmd

---
title: "Statistical Models With Code"
author: "Matt Defenderfer"
date: "8/2/2019"
output:
  html_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Overall Progression For Lab 3
1. Viewing Normality
    a. Histograms with Normal Distributions
    b. Q-Q Plots
    c. Boxplots
2. Testing Normality
3. Calculating Skew and Kurtosis
4. Homoscedasticity and Levene's Test
5. Comparing Means Across Multiple Groups

## Package Setup
For this lab, we will be doing data tidying using `dplyr` and plotting using `ggplot2`. Both of these are contained within the `tidyverse` package group so we will just load it. We will also be using the `stat.desc` function from the `pastecs` package and the `leveneTest` function from the `car` package, so download and load those as well.

We want all of our graphs to have the nice bw theme, so we will also set our deafult theme to `theme_bw()`

```{r Load Packages, message=FALSE}
library(car)
library(pastecs)
library(tidyverse)

theme_set(theme_bw())
```

## Loading Data
We will be using the same data as last lab so load that as done last time. It is named "Microbiome_Data.RData" and can be downloaded from Canvas if necessary

```{r Load Data}
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/Microbiome_Data.RData")
```


## Viewing Normality
According to wikipedia: "Normality tests are used to determine if a data set is well modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed." This section will show how we can view normality by visual inspection using histograms and q-q plots


### Histograms with Normal Distributions
We have made histograms before, but we will be adding a normal distribution this time as well. Remember previously, we were able to rapidly make histograms for each of the Microbe groups using faceting. Unfortunately, adding normal distributions in the way we need to does not behave well with faceting or grouping so we will make them individually. Let's begin with the `Fir` group. So we want:

Histogram with the following properties:

- X axis: 40 bins of `Fir` concentrations
- Y axis: **density** of the bins (done by setting `y = ..density..`)
- black outlines for the bins (`color` aesthetic)
- white fill for bins

Normal distribution with the following properties:

- mean and standard deviation of `Fir` microbe
- red color

We will use the `stat_function` function to create the normal distribution. For this use, it has the following form:

`stat_function(fun, args)`

- `fun` is the function we are going to add to the graph, in this case `dnorm`, the function that creates a normal distribution.
- `args` is a list of arguments to be passed to `fun`. In this case, `dnorm` needs the mean and sd for our variable `Fir`

```{r Fir Hist with dnorm}
ggplot(mb,aes(x = Fir)) +
  geom_histogram(aes(y = ..density..), bins = 40, fill = "white", color = "black") +
  stat_function(fun = dnorm, args = list(mean(mb$Fir), sd(mb$Fir)), color = "red")
```

1. How do we determine if the distribution is skewed?
2. Does the distribution look normal or skewed?

Let's try it with the Actin microbe now as opposed to Fir

```{r Actin Hist with dnorm}
ggplot(mb,aes(x = Actin)) +
  geom_histogram(aes(y = ..density..), bins = 40, fill = "white", color = "black") +
  stat_function(fun = dnorm, args = list(mean(mb$Actin), sd(mb$Actin)), color = "red")
```

3. Does the `Actin` distribution look normal or skewed?

### Q-Q Plots
Q-Q (quantile-quantile) plots on univariate data essentially show the data versus a normal distribution of the data along with a line. For normally distributed data, each point should fall on or near the line. If the data tails off, this indicates skew.

Using the `Fir` microbe, let's make a Q-Q plot to see the distribution of the data. We will be using the `geom_qq` function to plot the individual points and the `geom_qq_line` to plot the line they should fall on. The main aesthetic for `geom_qq` is sample, not x or y. We will assign `Fir` to sample here.

```{r Fir QQ}
ggplot(mb,aes(sample = Fir)) + 
  geom_qq() +
  geom_qq_line()
```

1. Based on the QQ plot, do you think Fir is skewed? 

Let's try it with Actin again to see if the QQ plot shows us skew like the histogram did.

```{r Actin QQ}
ggplot(mb,aes(sample = Actin)) + 
  geom_qq() +
  geom_qq_line()
```

2. Based on the QQ plot, do you think Actin is skewed?

You can easily make a QQ plot for each microbe using the same method as for the histograms in the last lab. First, we will gather our data to where Microbe designation is in a single column. Then we will `facet_wrap` by the new Microbe variable

```{r Facet QQ}
# First, gather the data into mb.g. All the microbe columns will be gathered
mb.g <- gather(mb, key = "Microbe", value = "Value", Fir:Verru)

# The call will be very similar to previously, except sample will now be the Value variable and we will add a facet_wrap at the end as well. I want the graph for each Microbe to be different as well, so I will also add a color aesthetic
ggplot(mb.g, aes(sample = Value, color = Microbe)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~Microbe, scales = "free")
```

After seeing all 4 variables, you can see that Bacter and Fir are both fairly normally distributed according to the QQ plots since the points generally fall on their respective line. However, Actin and Verru both seem to be positively skewed since data points tail away from the line dramatically

### Boxplots
As a third way to look at normality, you can make boxplots that show the quantiles and any outliers for the data. If the data comes from a normal distribution, the box will be evenly distributed with the mean in the center.

- Use `geom_boxplot` to create a boxplot showing the distribution of `Fir`. The y value will be the group variable (`Fir` in this case). No x value is needed due to only plotting one variable and how we have the data frame currently set up.

```{r Fir boxplot}
ggplot(mb, aes(y = Fir)) +
  geom_boxplot()
```

1. Does the boxplot show `Fir` to be normally distributed?
2. What else do boxplots show us that histograms and q-q plots have trouble showing?

Typically, boxplots are coded to show outliers as any point beyond 1.5*IQR (1st-3rd quartile). You can change that (with some difficulty) using `stat_summary` to create the boxplot as opposed to `geom_boxplot`, but that will not be shown here.

Again, to show a more skewed distribution, let's plot Actin as a boxplot now.

```{r}
ggplot(mb, aes(y = Actin)) + 
  geom_boxplot()
```

You can see the lower whisker is much smaller than the upper whisker, the median - 3rd quantile is also a larger range than the median - 1st quantile, another sign that the distribution is not normal. You can also see a large amount of outliers in this dataset. 

How about if we wanted to show boxplots for all of the microbes at once? In order to split the data by group easily in ggplot, we will use the gathered data from previously: `mb.g`. 

```{r Boxplot by Microbe}
ggplot(mb.g, aes(x = Microbe, y = Value)) +
  geom_boxplot()
```

As you can see, plotting all on the same graph can make it difficult to see the distribution for more tightly grouped variables, so we will facet them by Microbe and free the y-axis like previously

```{r Facet Boxplot by Microbe}
ggplot(mb.g, aes(x = Microbe, y = Value)) +
  geom_boxplot() +
  facet_wrap(~Microbe, scales = "free")
```

You can see that Fir and Bacter are relatively normally distributed with very few outliers, however, the boxplots back up what we saw in the Q-Q plots and histograms for Actin and Verru. They are not evenly distributed.

## Testing Normality
Normality can be visualized, but visual inspection is not always accurate enough for borderline cases. Statistical tests exist to determine whether a sample is likely to have come from a normal distribution. We will talk about how to perform the two most common, Kolmogorov-Smirnov and Shapiro-Wilks

### Kolmogorov-Smirnov

This test is performed using the `ks.test` function. The necessary inputs for a one-sample test are:

- `x`: your data
- `y`: the normal probability function (defined by `pnorm`)
- `mean` and `sd`: used to define `pnorm`

We will test `Fir` first. Remember, we are comparing Fir to a normal probability function defined by the mean and sd of Fir, not to just the standard normal distribution.

```{r KS Test Fir, warning=FALSE}
ks.test(x = mb$Fir, y = pnorm, mean = mean(mb$Fir), sd = sd(mb$Fir))
```

You will get a short table with the test statistic (D) and the p value for significance. If the p > 0.05, the distribution is not significantly different from normal. For Fir, p = 0.65, so the distribution is not different from normal. As far as reporting KS statistics, you can phrase it like so:

"The Firmicutes microbe, D = 0.0728, p = 0.65, is not significantly non-normal." 

You can ignore the warning that comes out as well. It is not saying ties are present, just that they shouldn't be present.

We can use the KS test on the Actin variable to verify its skewness as well

```{r KS Test Actin, warning=FALSE}
ks.test(x = mb$Actin, y = pnorm, mean = mean(mb$Actin), sd = sd(mb$Actin))
```

You can see that the p value is extremely small meaning the Actin distribution is significantly non-normal.

### Shapiro-Wilks 

This test is performed using the `shapiro.test` function. It only takes one input ever, your distribution, and does a similar process as the KS test. 

```{r SW test Fir}
shapiro.test(mb$Fir)
```

You will again be provided with a short table containing the test statistic (W), and the p value. If p > 0.05, the distribution is not significantly non-normal. For Fir, p = 0.1901, therefore it is not significantly non-normal. For reporting the statistics, you can phrase it in the same way as the KS test. When reporting normality, you might as well give both the SW and KS test results.

## Calculating Skew and Kurtosis
Reporting skew and kurtosis for your data can be informational since the SW and KS tests do not inherently give such information. In order to calculate these measures, we will use the `stat.desc` function from the `pastecs` package. If you have not downloaded `pastecs`, do so using `install.packages('pastecs')`, then load the package.

`stat.desc` has the following inputs:

- `x`: the distribution to describe
- `basic`: logical (T/F) value saying whether to output basic statistics. True by default
- `desc`: logical (T/F) value saying whether to output descriptive statistics. True by default
- `norm`: logical (T/F) value saying whether to output normal distribution statistics. False by default
- `p`: probability level to calculate confidence interval on. 0.95 by default

For this purpose, we do not need the basic statistics but do need the normal distribution statistics. We will try it on Fir first. We can leave `desc` and `p` alone since we like their default values but will change `basic` and `norm`

```{r Fir stat.desc}
stat.desc(mb$Fir, basic = FALSE, norm = TRUE)
```

The outputs we are concerned with for skewness and kurtosis are: skewness, skew.2SE, kurtosis, and kurt.2SE. So we have values but how do we interpret them?

1. Skewness can be either positive or negative. Closer to 0 means less skewed. Positive values mean there is a pile-up of values on the left side of the distribution while a negative value means they are piled on the right side.
2. Kurtosis can be positive or negative. Positive values mean the tail are heavy and drop rapidly, negative values indicate a much more gradual fall.

For both of these, the further from 0 means the less likely the data is normally distributed. Typically, skewness values between -0.5 and 0.5 are approximately symmetric, between 0.5 and 1 are slightly skewed, and > 1 are highly skewed. Same for between -0.5 and -1, and less than -1.

The ?.2SE values show skew and kurtosis divided by 2 SEMs. For small and medium sample sizes, you can compare this to a cutoff and see whether the skew and kurtosis are significant. For small samples, the cutoff is 1.96 (anything greater being significant), whereas for medium samples, the cutoff is 2.58. USE CAUTION WHEN APPLYING THESE. For large samples ( > 200 samples), it is important to look at the shape itself rather than relying on these rules of thumb. 

As a note, `stat.desc` also outputs the results from the SW test as well. The normtest.W value corresponds to W from the `shapiro.test` output while normtest.p is the p value. 

As opposed to calculating these stats for each microbe separately, we can also do it for all microbes at the same time.

```{r Wide stat.desc}
# For wide data, where all of the microbes are in their own column, we will just input all the columns we want to describe in x. This corresponds to Fir, Bacter, Actin, Verru which we can shorten in the select function to Fir:Verru
stat.desc(select(mb,Fir:Verru), basic = FALSE, norm = TRUE)
```

So we now have a column for each bacteria and a row for each statistic. This makes comparisons across groups much easier.

## Homoscedasticity and Levene's Test
When comparing some measure across groups, the groups should have "equal" vairance. In order to test if variances in different groups are the same, we will use Levene's test. For this, we will use the `leveneTest` function from the `car` package. 

### Formulas
The `leveneTest` function uses another type of R object called a formula. Formulas are a type of "language" in R that denote an expression for a function to evaluate. They are typically denoted with the following form:

dep.var ~ ind.var1 + ind.var2 + ...

I typically read this as "dependent variable as a function of (~) independent variable 1 and independent variable 2". The `leveneTest` function will have the following form:

`leveneTest(dep ~ ind, data = df)`

where you are testing the variances of the dependent variable as a function of the independent variable. In our case, we can test to see if the variance of the `Bacter` concentration differs as a function of `Sex`.

```{r Levene Bacter}
leveneTest(Bacter ~ Sex, data = mb)
```

The output of the `leveneTest` function is a small table with the grouping variable, the degrees of freedom, the F value, and the p value. If the p-value is < 0.05, the variances between the groups are significantly different. In our case, the p-value is 0.29 meaning the variances are homoscedastic.

In fact, if you test all of our microbes as a function of Sex, they are all homoscedastic. However, if you test the variance of `Verru` as a function of `ACF` instead:

```{r Levene Verru}
leveneTest(Verru ~ ACF, data = mb)
```

The test comes out highly significant meaning the variances of `Verru` at the different levels of the `ACF` group are significantly different.

## Comparing Means Across Multiple Groups

In the previous lab, we talked about plotting means and SEMs from our data. Up to now, we have only used either all of the data (such as all of Fir) or only grouped by Microbe in the long format. However, multiple groups exist in the dataset (Gender and ACF) that we can use to further subdivide and compare across. For this problem, we can use a number of aesthetics to split by group. Typically, this is done with `color`, `fill`, or `group`.

Before starting, let's load up a larger version of this dataset. One problem with grouping is that it necessarily decreases the number of samples that go into calculating means and SEMs for each bar. We also have a large amount of subjects missing Sex and/or ACF values which will need to be removed when making the graphs.

The file you will want to load is "Microbiome_Data_Full.csv" and can be downloaded from Canvas. The variable stored is also named `mb` so our dataframe we have been working with so far will be overwritten.

```{r Load Full Data}
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/Microbiome_Data_Full.RData")
```

We will want to gather the data just like before to get the Microbe types into a single column

```{r Gather Full Data}
mb.g <- gather(mb, key = "Microbe", value = "Value", Fir:Verru)
```

Let's create a plot comparing mean concentrations of each microbe across sex first. Since we have the raw data, we will make the bar graphs using the raw data method from last lab.

`ggplot` properties:
- `data`: rows that have either Male or Female sex, no NA values. We can remove these rows with `filter` combined with `!is.na`.
- `x`: Microbe type
- `y`: Value
- `fill`: Sex. We will be changing the fill color dependent on the values in Sex

`geom_bar` properties:
- The bar graph should plot the mean
- The bars for the different groups should be side by side as opposed to stacked

`stat_summary` properties:
- Use the "errorbar" geom
- Bars should represent the SEM
- Errorbars need to be on top of the bar they correspond with (position dodged along with the bar graph)
- Errorbars should have a width of 0.2

```{r Comparing Means across Microbe and Sex}
ggplot(filter(mb.g, !is.na(Sex)), aes(x= Microbe, y = Value, fill = Sex)) +
  geom_bar(stat = "summary", fun.y = "mean", position = position_dodge(0.9)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(0.9), width = 0.2)
```

So by visual inspection, there doesn't appear to be an obvious difference between male and female in any of the 4 microbiota concentrations. Please, don't do actual statistics by just visual inspection....

So by plotting Value across Sex, we split the bar at each microbe into two bars since there are two groups for Sex (after removing NA values). When we do this for ACF, since ACF has 5 factor levels, there will be 5 bars for each microbe instead but the process is basically the same

```{r Comparing Means across Microbe and ACF}
ggplot(filter(mb.g, !is.na(ACF)), aes(x= Microbe, y = Value, fill = ACF)) +
  geom_bar(stat = "summary", fun.y = "mean", position = position_dodge()) +
  stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(0.9), width = 0.2)
```

So you can see that `ggplot` easily splits groups into any number of levels based on the factor levels in the grouping variable. By visual inspection, there may be some differences between the ACG groups within the Microbes, but nothing too drastic.

Lastly, we can split the graph by both ACF and Sex at the same time. Splitting by both ACF and Sex on the same plot would create 10 bars for each microbe (2 groups for Sex times 5 for ACF). This would overcrowd the plot making it more difficult to interpret, so instead we will facet the plots across Sex. Very little will need to change from the last plot call besides adding a `facet_wrap` and excluding samples with `NA` in either ACF or Sex

```{r Comparing Means Across Microbe, Sex, and ACF}
ggplot(filter(mb.g, !is.na(ACF), !is.na(Sex)), aes(x= Microbe, y = Value, fill = ACF)) +
  geom_bar(stat = "summary", fun.y = "mean", position = position_dodge()) +
  stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(0.9), width = 0.2) +
  facet_wrap(~Sex)
```

When splitting across multiple groups, keep in mind how many graphs you are creating by faceting vs. how many groups are going to be shown within each facet. Make sure the plots will not be too dense but will contain a sufficient amount of information to make your point.