## Overall Progression for Lab 7:
# 1. Load Packages and Data
# 2. Review our Simple Model
# 3. Checking Assumptions
#    a. Linearity
#    b. Homoscedasticity of residuals
#    c. Normality of residuals
#    d. Independence
#    e. Multicollinearity (for multiple linear model)
## Loading Packages and Data
# For this lab, we will be using the `tidyverse`, `broom`, `car`, and `pastecs`
# packages, as well as the same dataset as last lab:
library(pastecs)
library(broom)
library(car)
library(tidyverse)
theme_set(theme_bw())
load("~/Google Drive/Grad School/GR 770 Statistics/R Labs/Data/framingham_n50.RData")
# Reviewing the Simple Model
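# Let's review our simple model of systolic blood pressure (sysBP) as a function
# of age alone. We will refit it using `lm`. The predictor is assumed here to be
# stored in a column named `age` of the `fhs` data frame.
sysBP.m <- lm(sysBP ~ age, data = fhs)
summary(sysBP.m)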
# Checking Assumptions
## Linearity
## You can assess linearity using a scatterplot. We can recreate that
## plot here, again adding a regression line using `geom_smooth`
ggplot(fhs, aes(x = age, y = sysBP)) +
  geom_point() +
  geom_smooth(method = "lm")  # method = "lm" draws the straight regression line
## Homoscedasticity
## We will be assessing homoscedasticity of the residuals. Residuals are awkward
## to pull out of an lm object directly, but the `augment` function from `broom`
## returns them, along with the fitted values and other per-observation
## information, as columns of a data frame.
## Let's augment sysBP.m and see what we get
sysBP.a <- augment(sysBP.m)
head(sysBP.a, 10)
## First, let's make sure our residuals average out to 0
mean(sysBP.a$.resid)
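## A minimal sketch of why this check should pass, using R's built-in `cars`
## data rather than the Framingham data (the names cars.m and cars.resid.mean
## are hypothetical and used only here). Least squares with an intercept forces
## the residuals to sum to zero, so the mean comes out as a tiny floating-point
## number rather than exactly 0.
cars.m <- lm(dist ~ speed, data = cars)
cars.resid.mean <- mean(resid(cars.m))
cars.resid.mean
## Compare against a small tolerance instead of testing for exact equality
abs(cars.resid.mean) < 1e-8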
## To assess homoscedasticity, we will make a scatterplot of the residuals
## versus the fitted values. We will add a horizontal line at zero for comparison
ggplot(sysBP.a, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
## We want to see no pattern in the residuals: they should scatter randomly
## around the horizontal line, with roughly the same vertical spread across the
## whole range of fitted values
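## A hedged sketch of what a "pattern" can look like, using simulated data
## rather than the Framingham data. Everything here (sim, y.const, y.fan,
## m.const, m.fan, spread.by.half) is hypothetical and exists only for this
## illustration. One response has constant noise; the other has noise that
## grows with x, which is the classic fan shape in a residual plot.
set.seed(770)
sim <- data.frame(x = runif(200, 30, 70))
sim$y.const <- 100 + 0.8 * sim$x + rnorm(200, sd = 10)
sim$y.fan   <- 100 + 0.8 * sim$x + rnorm(200, sd = 0.4 * (sim$x - 25))
m.const <- lm(y.const ~ x, data = sim)
m.fan   <- lm(y.fan ~ x, data = sim)
## Crude numeric check: residual SD below and above the median fitted value
spread.by.half <- function(m) {
  upper <- fitted(m) > median(fitted(m))
  c(lower = sd(resid(m)[!upper]), upper = sd(resid(m)[upper]))
}
spread.const <- spread.by.half(m.const)  # the two halves should be similar
spread.fan   <- spread.by.half(m.fan)    # the upper half should be clearly larger
spread.const
spread.fan
## This split is only a rough companion to the plot, not a formal test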
## Normality of Residuals
## We will test normality in the same way as before, using a histogram, a Q-Q
## plot, a boxplot, and statistical tests
### Histogram
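### Let's create a histogram with 15 bins for the residuals, adding on top a
### normal distribution. The histogram is drawn on the density scale so that
### the normal curve and the bars share the same y-axis
ggplot(sysBP.a, aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "orange", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sysBP.a$.resid), sd = sd(sysBP.a$.resid)),
                color = "red")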
### Q-Q Plot
ggplot(sysBP.a, aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line()
### Boxplot
ggplot(sysBP.a, aes(y = .resid)) +
  geom_boxplot()
### Statistical Tests
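### Let's calculate skew, kurtosis, and perform the Shapiro-Wilk (SW) test on
### the residuals. `norm = TRUE` adds these normality statistics to the output
stat.desc(sysBP.a$.resid, basic = FALSE, norm = TRUE)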
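### A hedged sketch of how the Shapiro-Wilk test behaves, using base R's
### `shapiro.test` on simulated values instead of the lab residuals. The names
### norm.resid, skew.resid, sw.norm, and sw.skew are hypothetical. One sample
### is drawn from a normal distribution and one from a right-skewed
### (exponential) distribution shifted to have mean 0.
set.seed(770)
norm.resid <- rnorm(200)
skew.resid <- rexp(200) - 1
sw.norm <- shapiro.test(norm.resid)
sw.skew <- shapiro.test(skew.resid)
### W near 1 is consistent with normality; a small p-value is evidence against it
c(W.normal = unname(sw.norm$statistic), W.skewed = unname(sw.skew$statistic))
c(p.normal = sw.norm$p.value, p.skewed = sw.skew$p.value)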
### Independence
### We don't talk about that in this class
### Multicollinearity
### This only applies to multiple linear regression. We need to make sure that
### our predictor variables are not highly correlated with each other. We can
### test for multicollinearity using the `vif` (variance inflation factor)
### function from the `car` package. As a rule of thumb, a predictor with
### VIF > 10 is strongly collinear with the other predictors and is a candidate
### for removal from the model
### Let's look at our 2 variable model of sysBP predicted by main effects of age
### and glucose
sysBP.m.2 <- lm(sysBP ~ age + glucose, data = fhs)
vif(sysBP.m.2)
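### A minimal sketch of what the VIF measures, computed by hand in base R on
### simulated predictors rather than the Framingham data. The names preds, x1,
### x2, x3, and vif.by.hand are hypothetical. For numeric predictors, the VIF
### of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that
### predictor on all of the other predictors.
set.seed(770)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # built to be strongly correlated with x1
x3 <- rnorm(n)                       # built to be unrelated to x1 and x2
preds <- data.frame(x1, x2, x3)
vif.by.hand <- function(var, data) {
  # Regress one predictor on the remaining predictors and convert R^2 to a VIF
  f <- reformulate(setdiff(names(data), var), response = var)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}
vifs <- sapply(names(preds), vif.by.hand, data = preds)
vifs
### x1 and x2 should show large values and x3 a value near 1. For a model with
### only numeric main effects, these should agree with what `vif` reports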