---
title: "Lab-16-ChiSq-With-Code"
author: "Matt"
date: "10/29/2019"
output: 
  html_document:
    toc: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Load Data and Packages
For this lab, we are going to use a new package, `gmodels`, to run our chi-squared test, and another package, `RVAideMemoire`, to run the log-likelihood ratio (G) test.

```{r Load Packages}
library(RVAideMemoire)
library(gmodels)
```

The data that we will be using is stored in `Categorical-Data.RData`. This file contains a single dataframe, `bacteria`, with the following variables:

1. `drug.tx` (factor, 2 levels): current drug treatment, either treatment or placebo
2. `diet` (factor, 2 levels): vegetarian or meat eater
3. `bac.presence` (factor, 2 levels): presence or absence of bacteria

This is simulated data from an experiment testing whether an experimental drug has adverse effects on the presence of a certain bacteria. Additional data about the person's diet was also collected as it may play a role in the presence or absence of that bacteria. For this lab, we will be ignoring the diet variable.

```{r Load data}
load('/Users/Matthew/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/Categorical-Data.RData')
```


## Making a Contingency Table
Since chi-square tests compare frequencies, the data are commonly presented in a contingency table, an overall summary of how many subjects fall into each unique combination of groups. Our data is stored in a rawer form, with one subject per row, but it can easily be converted to a contingency table using the `xtabs` function, which has the following form:

`xtabs(formula, data)`

- `formula`: a formula detailing which variables to make the contingency table from.
- `data`: the original dataframe

In our case, we will tabulate the data as a function of just the drug treatment and bacteria presence variables.

```{r contingency}
bac.table <- xtabs(~ drug.tx + bac.presence, bacteria)
bac.table
```
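
If you want to try `xtabs` without the lab file, here is a minimal sketch using a small made-up data frame. The `toy` data frame and its counts are hypothetical and are not the `bacteria` data; only the column names are borrowed from it. `addmargins` and `prop.table` are base R helpers for adding totals and converting counts to proportions.

```{r toy contingency}
# Hypothetical data, NOT the lab's `bacteria` dataframe: one row per subject
toy <- data.frame(
  drug.tx      = rep(c("Placebo", "Treatment"), times = c(40, 40)),
  bac.presence = c(rep(c("Not Present", "Present"), times = c(10, 30)),
                   rep(c("Not Present", "Present"), times = c(25, 15)))
)

# Same formula as above: cross-tabulate treatment by bacteria presence
toy.table <- xtabs(~ drug.tx + bac.presence, data = toy)
toy.table

addmargins(toy.table)              # add row, column, and grand totals
prop.table(toy.table, margin = 1)  # proportions within each row (treatment group)
```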

## Chi-Square Test
The chi-square test compares the observed frequencies in each group to the frequencies you would expect if the two variables were unrelated (that is, by chance alone). We can run the chi-square test using the `CrossTable` function from the `gmodels` package. `CrossTable` has the following form:

`CrossTable(x, y, expected = FALSE, chisq = FALSE, fisher = FALSE, resid = FALSE, sresid = FALSE, format, ...)`

- `x`: vector or matrix of data. Can be either a vector like our drug.tx variable or a contingency table.
- `y`: vector of data. If x is a vector, then y should also be a vector. If x is a contingency table, y should be unspecified.
- `expected`: whether to include expected values for each unique group from the chi-square test. When set to TRUE, automatically sets chisq to TRUE.
- `chisq`: whether to perform a chi-square test.
- `fisher`: whether to perform Fisher's exact test.
- `resid`: whether to include the Pearson residual.
- `sresid`: whether to include the standardized residual.
- `format`: whether to print using SAS or SPSS format. We will choose SPSS.

For this example we are going to set all of these options to TRUE to see what the output looks like.

```{r chisq 1}
CrossTable(x = bacteria$drug.tx,
           y = bacteria$bac.presence,
           expected = TRUE,
           chisq = TRUE,
           fisher = TRUE,
           resid = TRUE,
           sresid = TRUE,
           format = 'SPSS')
```

The top box is a key showing how each cell of the table below it is organized. The first thing to check is that none of the expected values in the table is lower than 5, since this is one of the assumptions of the chi-square test. Each cell also gives the row-wise, column-wise, and total percentage of the data, followed by the residuals. Results from the chi-square test, with and without Yates' continuity correction, are shown below the table. We will use the uncorrected chi-squared values. As you can see, there is a significant association between drug treatment and presence of the bacteria.
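
To see where the expected values and the uncorrected chi-squared statistic come from, the calculation can be sketched by hand in base R. The counts in `toy.table` are made up for illustration and are not the lab's data; all `toy.*` names are hypothetical.

```{r toy expected counts}
# Hypothetical 2x2 counts, NOT the lab's data
toy.table <- matrix(c(10, 30,
                      25, 15),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(drug.tx = c("Placebo", "Treatment"),
                                    bac.presence = c("Not Present", "Present")))

# Expected count for each cell = (row total * column total) / grand total
toy.expected <- outer(rowSums(toy.table), colSums(toy.table)) / sum(toy.table)
toy.expected
all(toy.expected >= 5)  # check the expected-count assumption

# Pearson chi-squared: sum over cells of (observed - expected)^2 / expected
toy.chisq <- sum((toy.table - toy.expected)^2 / toy.expected)
toy.df    <- (nrow(toy.table) - 1) * (ncol(toy.table) - 1)
toy.p     <- pchisq(toy.chisq, df = toy.df, lower.tail = FALSE)
c(chisq = toy.chisq, df = toy.df, p = toy.p)

# Base R's chisq.test() without Yates' correction should give the same statistic
chisq.test(toy.table, correct = FALSE)
```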

We could also have passed the contingency table to `x` and gotten the same results.

```{r chisq 2}
CrossTable(x = bac.table,
           expected = TRUE,
           chisq = TRUE,
           fisher = TRUE,
           resid = TRUE,
           sresid = TRUE,
           format = 'SPSS')
```

Either one works; use whichever is more convenient for the data you are given.

## Interpretation of a ChiSq Test
So what does this all mean? We have a significant test, but what does that really tell us? The standardized residuals can tell us a little more. As a rule of thumb, a standardized residual with an absolute value greater than 1.96 is significant at p < 0.05, greater than 2.58 at p < 0.01, and greater than 3.29 at p < 0.001. Looking through our table, we have significant standardized residuals in 3 out of 4 cells. A significant residual means the observed count in that cell was significantly different from the expected count, and the sign tells us the direction: negative means lower than expected, positive means higher. For this example, the Treatment:Present and Placebo:Not Present combinations had significantly lower frequencies than expected, and Placebo:Present had a significantly higher frequency than expected.
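
The cutoffs above are two-tailed critical values of the standard normal distribution, which we can confirm with `qnorm`. The sketch below also pulls residuals out of base R's `chisq.test` for a made-up table (hypothetical counts, not the lab's data). Base R reports two kinds of residuals; which one lines up with the `sresid` column from `CrossTable` is an assumption worth checking against the `gmodels` documentation.

```{r toy residuals}
# Two-tailed critical z values for p = 0.05, 0.01, and 0.001
toy.cutoffs <- round(qnorm(1 - c(0.05, 0.01, 0.001) / 2), 2)
toy.cutoffs

# Hypothetical 2x2 counts, NOT the lab's data
toy.table <- matrix(c(10, 30,
                      25, 15),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(drug.tx = c("Placebo", "Treatment"),
                                    bac.presence = c("Not Present", "Present")))

toy.test <- chisq.test(toy.table, correct = FALSE)
toy.test$residuals  # Pearson residuals: (observed - expected) / sqrt(expected)
toy.test$stdres     # residuals additionally scaled by the row and column proportions

# Flag cells whose Pearson residual passes the p < 0.05 cutoff
abs(toy.test$residuals) > toy.cutoffs[1]
```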

### Odds Ratio
One other useful metric is the odds ratio, which is essentially an effect size for the association. The estimated odds ratio, its confidence interval, and a p-value can all be found under the Fisher's Exact Test section of the output. Here, we have an odds ratio of 0.07. This means that the odds of the bacteria being present in the treatment group are 0.07 times the odds in the placebo group, or roughly 93% lower.
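
As a rough illustration of what the odds ratio measures, here is a sketch that computes it by hand for a made-up table (hypothetical counts, not the lab's data) and compares it to base R's `fisher.test`. Note that `fisher.test` reports a conditional maximum likelihood estimate, so its value will be close to, but not exactly equal to, the simple ratio of odds.

```{r toy odds ratio}
# Hypothetical 2x2 counts, NOT the lab's data
toy.table <- matrix(c(10, 30,
                      25, 15),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(drug.tx = c("Placebo", "Treatment"),
                                    bac.presence = c("Not Present", "Present")))

# Odds of bacteria being present within each group
odds.placebo   <- toy.table["Placebo", "Present"]   / toy.table["Placebo", "Not Present"]
odds.treatment <- toy.table["Treatment", "Present"] / toy.table["Treatment", "Not Present"]

# Odds ratio: treatment odds relative to placebo odds
toy.or <- odds.treatment / odds.placebo
toy.or

# Fisher's exact test gives an estimate plus a confidence interval
toy.fisher <- fisher.test(toy.table)
toy.fisher$estimate
toy.fisher$conf.int
```

An odds ratio below 1 here means the odds of bacteria being present are lower in the treatment group than in the placebo group.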

### Reporting a ChiSq Test
For this example we would report it like so:

There was a significant association between drug treatment and presence of bacteria, $\chi^2(1) = 17.01, p < 0.001$. Based on the odds ratio, the odds of bacteria being present for a person taking the drug treatment were 0.07 (0.01, 0.31) times the odds for a person taking a placebo.

## Likelihood Ratio

An alternative to the chi-squared test is the likelihood ratio test, also known as the G-test of independence. The function we will use for this is `G.test` from the `RVAideMemoire` package. It behaves much like the `CrossTable` function, except that it only takes a contingency table. No other inputs are necessary.

```{r G.test table}
G.test(bac.table)
```


The output is fairly sparse, including only the test statistic (G), the degrees of freedom, and the p-value. Since we only have a 2x2 table, we don't need to run any post-hoc tests; however, the `pairwise.G.test` function from the same package will do that for you if necessary. See http://rcompanion.org/rcompanion/b_06.html for more information, as it is beyond the scope of this course.
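
To see what the G statistic is measuring, it can be sketched by hand in base R as twice the sum of each observed count times the log of its observed-to-expected ratio. The counts below are made up (not the lab's data), and the `toy.*` names are hypothetical. Whether this matches `G.test` exactly depends on whether that function applies any correction, which is an assumption to verify in its documentation.

```{r toy G by hand}
# Hypothetical 2x2 counts, NOT the lab's data
toy.table <- matrix(c(10, 30,
                      25, 15),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(drug.tx = c("Placebo", "Treatment"),
                                    bac.presence = c("Not Present", "Present")))

# Expected counts under independence, as in the chi-square test
toy.expected <- outer(rowSums(toy.table), colSums(toy.table)) / sum(toy.table)

# G = 2 * sum(observed * log(observed / expected)), with no correction applied
toy.G  <- 2 * sum(toy.table * log(toy.table / toy.expected))
toy.df <- (nrow(toy.table) - 1) * (ncol(toy.table) - 1)
toy.p  <- pchisq(toy.G, df = toy.df, lower.tail = FALSE)
c(G = toy.G, df = toy.df, p = toy.p)
```

G is compared to the same chi-squared distribution as the Pearson statistic, which is why the two tests usually give similar answers.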

### Reporting a LogLikelihood Ratio Test
The report will look very similar to the chi-square report. All that we replace are the test statistic, df, and p-value. The odds ratio and its confidence interval remain the same. It will look like this:

There was a significant association between drug treatment and presence of bacteria, $G(1) = 17.77, p < 0.001$. Based on the odds ratio, the odds of bacteria being present for a person taking the drug treatment were 0.07 (0.01, 0.31) times the odds for a person taking a placebo.