5/29/2019

What Is R

  • R is a widely used language for statistical computing and graphics.
  • Completely open source and available for free on multiple platforms
  • Provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques
  • R is an interpreted language, meaning it runs commands directly without needing to be compiled

Why Use R

  • Computers are good at mindless, repetitive tasks and statistics is both!
  • Open source
  • Easy reproducibility
  • Extremely robust statistical packages
    • If you can think of it, there’s probably a package for it
  • Makes pretty pictures
  • Learning R is learning coding in general

RStudio

R can be run out of the box, but can feel both basic and difficult at the same time. RStudio is a comprehensive IDE built on top of R that makes using R feel better by keeping various windows organized, introducing new functionality such as Markdown documents, and adding a semi-GUI to R. It’s free and highly used in the R community. It can be downloaded here:

https://www.rstudio.com/

Using R as a Free Calculator

If you want, R can be just a calculator. Type any simple mathematical expression at the command prompt, press Enter, and see the result.

10 + 20
## [1] 30
10/5
## [1] 2
3^3
## [1] 27

Order of operations still matters here

2+5*4
## [1] 22
(2+5)*4
## [1] 28

Sometimes you will need to enclose in multiple parentheses. Make sure they line up exactly where you mean them to.

(4+2)/3^3
## [1] 0.2222222
((4+2)/3)^3
## [1] 8

Scripts

Commands can be entered directly into the command line, or they can be stored in modified text files called scripts.

Scripts are the basic way coding is done for a few reasons:

  • Organization
  • Efficiency
  • Reproducibility

To create a new script in RStudio, click the page icon in the top left of the GUI and choose R Script

A Very Special “Operator”

In R, one of the best “operators” you can use is #. This symbol single-handedly prevents the downfall of civilization. It is the comment symbol and a comment looks like this:

# This is a comment, everything after the # will be ignored 
# when the code is running. I can use it to tell future me 
# and any other person what is happening at this section.

a <- 1 # I can even put it on a line with an operation.

There is an unlimited supply of # in R. They are free. Use them everywhere, all the time, and you will save yourself pain and heartache down the line

Questions?

Variables

A variable is a kind of label or container for a piece of information. You assign values to a variable. The assignment operator is the <- combination

a <- 1

b <- 2

c <- 3*2

Unlike before, no output is printed at the command prompt. The new variables appear in the Environment pane in RStudio instead.

Variables can also be used in arithmetic operations

spikes <- 120
sec <- 20

spikes.per.sec <- spikes/sec

spikes.per.sec
## [1] 6

You can also use a variable to reassign a value to itself. That’s confusing wording but the code is simple:

a
## [1] 1
a <- a + 5
a
## [1] 6

A Note on the Assignment Operator

Why <- and not =?

  • = works as well! Try it!
  • Inside a function call, = names an argument rather than assigning a variable, so using = for assignment there can silently do something different from what you intended.
  • Just stick with <- and it will save you headache
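
A minimal sketch of the difference between the two operators inside a function call. The names avg.named, avg.assigned, and the two x.exists.* flags are made up for this illustration, and the sketch assumes you do not need an existing variable called x:

```r
# Start clean so the checks below are meaningful (x is only a demo name).
if (exists("x", inherits = FALSE)) rm(x)

# Inside a call, `=` labels the argument named x; no variable is created.
avg.named <- mean(x = c(2, 4, 6))
x.exists.after.equals <- exists("x", inherits = FALSE)

# Inside a call, `<-` first creates a variable x, then passes it along.
avg.assigned <- mean(x <- c(2, 4, 6))
x.exists.after.arrow <- exists("x", inherits = FALSE)

avg.named               # 4
x.exists.after.equals   # FALSE
x.exists.after.arrow    # TRUE
```

Both calls return the same mean, but only the second one leaves a variable x behind in the workspace.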

Rules for Naming Variables

  • Can include any alphanumeric character, a period ., or an underscore _
  • Variables are case sensitive. num is different from Num is different from NUM
  • Variable names cannot include spaces.
  • Variable names must start with a letter or a period (.). A leading period cannot be followed by a digit, and names starting with a period are generally reserved for special cases
  • Variable names cannot be one of the reserved keywords such as TRUE, if, else, NaN, while, and more. If you try to assign to a keyword, R will complain so you’ll know pretty quickly
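
One way to check these rules without trial and error is base R's make.names() function, which rewrites a string into a legal name. The sketch below assumes a handful of made-up candidate names; a name that comes back unchanged was already legal:

```r
# Hypothetical candidate names, some legal and some not.
candidates <- c("spikes.per.sec", "mouse_weight", "2nd.trial",
                "my var", "if", ".hidden")

# make.names() repairs illegal names, so comparing each string with its
# repaired version tells us which ones were legal to begin with.
is.valid <- candidates == make.names(candidates)
names(is.valid) <- candidates
is.valid
```

Here spikes.per.sec, mouse_weight, and .hidden pass, while the name starting with a digit, the name containing a space, and the reserved keyword do not.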

Tips for Naming Variables

  • Names should be informative to what the variable is. Do not use a, b, a1, var1 etc.
  • Be succinct. You can use abbreviations if they are somewhat memorable or you add comments about what they stand for in the code.
  • Multiword variables are named using different conventions based on the person.
    • _ : example_underscore
    • . : example.period
    • Camel case: exampleCamelCase

Questions?

Functions

Functions are commands that will act on inputs to produce some output.

abs(-21)
## [1] 21
abs(21)
## [1] 21

The standard form of a function call is: object <- function_name(input1, input2, ...)

Inputs are also known as arguments

Do not try to memorize all the functions. Google is your best friend for remembering and finding functions.

If you don’t remember the exact form of a function, use the help operator ?

?round

Arguments can be specified one of two ways, by position or by name. When matching by position, order matters, as the last call below shows. Let’s use the round function as an example

round(3.14159,2)
## [1] 3.14
round(x = 3.14159, digits = 2)
## [1] 3.14
round(digits = 2, x = 3.14159)
## [1] 3.14
round(2,3.14159)
## [1] 2

One of the best things about RStudio is the tab-complete capability. When typing a function (or variable or sometimes file path) name, press tab to bring up a small window with a list of functions containing that string of letters. You can use the up/down arrows to traverse this list. Press tab again when highlighting the function you want to autocomplete it. Practicing this will save time in the long run.

As well as the tab-complete, you can also list out possible arguments for a command after you have finished typing it with parentheses by pressing Tab.

Functions can be called inside other functions

sqrt(abs(-4))
## [1] 2

These are called nested functions. abs() is nested within the call to sqrt(). Functions will always be evaluated from inside to out. You can nest as many functions as you like, but it makes code much more difficult to read. Use nested functions reasonably.
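
As a rough illustration of the readability trade-off, the sketch below writes the same calculation twice, once nested and once as separate steps. The readings vector and the intermediate names are invented for the example:

```r
# Made-up input values for the illustration.
readings <- c(-4.2, 9.7, -16.1)

# Nested: evaluated from the inside out (abs, then sqrt, then round).
nested <- round(sqrt(abs(readings)), digits = 1)

# Stepwise: the same three operations, one per line.
magnitudes <- abs(readings)
roots <- sqrt(magnitudes)
stepwise <- round(roots, digits = 1)

identical(nested, stepwise)  # TRUE
```

The stepwise version is longer, but each intermediate result can be inspected and commented on its own.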

Creating Vectors

Vectors (a.k.a arrays) are data structures that can store multiple values. All of the values must be the same data type which we will cover soon

You can easily create a vector using the c() command which stands for combine.

mice.per.cage <- c(5,6,12,1,14,3,6,8,5,5,9,14,5)

mice.per.cage
##  [1]  5  6 12  1 14  3  6  8  5  5  9 14  5

This vector is now said to contain 13 elements. The number in brackets refers to the index of the leftmost printed value on that row.
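
To see what "the same data type" means in practice, here is a small sketch with made-up vectors. When types are mixed inside c(), R does not throw an error; it quietly converts everything to one common type:

```r
numbers.only <- c(5, 6, 12)
mixed.text <- c(5, "six", TRUE)      # one string turns everything into text
mixed.logical <- c(5, TRUE, FALSE)   # logicals become 1 and 0

class(numbers.only)   # "numeric"
class(mixed.text)     # "character"
mixed.text            # "5" "six" "TRUE"
mixed.logical         # 5 1 0
```

This is worth remembering when a column of numbers unexpectedly behaves like text: a single stray string is usually the cause.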

Indexing

Extracting data from a data structure is vital for anything in R. The basic way of doing this is via the [] operator.

# We want the number of mice in cage 6
mice.per.cage[6]
## [1] 3

The number in the [] is called an index. You told R you wanted the number in the 6th spot in the mice.per.cage variable. The output of this can also be stored as a variable

cage.6.pop <- mice.per.cage[6]

You can extract multiple indices at a time as well.

mice.per.cage[c(1,2,3,4,5)]
## [1]  5  6 12  1 14

For a sequential number vector changing by 1, you can also use another operator, :.

1:5
## [1] 1 2 3 4 5
mice.per.cage[1:5]
## [1]  5  6 12  1 14

Changing Elements in a Vector

There will be times where you will want to edit specific indices in a vector. To do that, we can use the <- and [] operators in conjunction.

mice.per.cage[3]
## [1] 12
mice.per.cage[3] <- 2

mice.per.cage[3]
## [1] 2

Other Vector Operations

Determine the number of elements in a vector:

length(mice.per.cage)
## [1] 13

Alter all elements of a vector:

mice.per.cage*2
##  [1] 10 12  4  2 28  6 12 16 10 10 18 28 10

Now say we were adding mice to each cage, but we aren’t adding the same amount to each cage. R makes it easy to do element-by-element arithmetic:

mice.per.cage
##  [1]  5  6  2  1 14  3  6  8  5  5  9 14  5
mice.added <- c(1,2,1,1,3,6,4,3,8,7,2,4,5)
mice.per.cage + mice.added
##  [1]  6  8  3  2 17  9 10 11 13 12 11 18 10

For element-by-element operations, the vectors should be the exact same length. If they are not, R recycles the shorter vector instead of stopping, which is rarely what you want. Division and multiplication work the same way without needing special operators.
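
The sketch below shows what happens when the lengths do not match, using a shortened, made-up version of the cage counts. R reuses ("recycles") the shorter vector, and only warns when the longer length is not a multiple of the shorter one:

```r
mice.per.cage <- c(5, 6, 2, 1, 14, 3)

# Same length: plain element-by-element addition.
same.length <- mice.per.cage + c(1, 2, 1, 1, 3, 6)   # 6 8 3 2 17 9

# Length 2 against length 6: c(1, 2) is silently reused three times.
recycled <- mice.per.cage + c(1, 2)                   # 6 8 3 3 15 5

# Length 4 against length 6: R still recycles, but raises a warning.
uneven <- tryCatch(mice.per.cage + c(1, 2, 3, 4),
                   warning = function(w) "warning raised")
uneven
```

The silent case is the dangerous one, so it is worth checking length() on both vectors before combining them.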

Functions can also perform their operation on each element of a vector:

sqrt(mice.per.cage)
##  [1] 2.236068 2.449490 1.414214 1.000000 3.741657 1.732051 2.449490
##  [8] 2.828427 2.236068 2.236068 3.000000 3.741657 2.236068

Questions?

Basics of Matrices

Matrices are just two-dimensional arrays: every value sits in a row and a column. Matrices can be created with the cbind, rbind, or matrix functions

# cbind concatenates column vectors together
mice.per.cage
##  [1]  5  6  2  1 14  3  6  8  5  5  9 14  5
mice.added
##  [1] 1 2 1 1 3 6 4 3 8 7 2 4 5
mice.mat <- cbind(mice.per.cage,mice.added)

mice.mat
##       mice.per.cage mice.added
##  [1,]             5          1
##  [2,]             6          2
##  [3,]             2          1
##  [4,]             1          1
##  [5,]            14          3
##  [6,]             3          6
##  [7,]             6          4
##  [8,]             8          3
##  [9,]             5          8
## [10,]             5          7
## [11,]             9          2
## [12,]            14          4
## [13,]             5          5

Accessing parts of a matrix is similar to vectors, still using the [] operator except we are adding another number. Values in a matrix are accessed using a var[row,col] format

# Return the element in row 1, column 1. 
mice.mat[1,1]
## mice.per.cage 
##             5
# Return the element in row 3, column 2. 
mice.mat[3,2]
## mice.added 
##          1

# Return the elements from rows 1 through 5 in column 2
mice.mat[1:5,2]
## [1] 1 2 1 1 3
# Return elements from rows 2 through 6 in columns 1 and 2
mice.mat[2:6,1:2]
##      mice.per.cage mice.added
## [1,]             6          2
## [2,]             2          1
## [3,]             1          1
## [4,]            14          3
## [5,]             3          6

# Return all rows of the second column by leaving the row field blank
mice.mat[,2]
##  [1] 1 2 1 1 3 6 4 3 8 7 2 4 5
# Same with columns. Return all columns of row 4 by leaving the 
# col field blank
mice.mat[4,]
## mice.per.cage    mice.added 
##             1             1
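
The examples above only use cbind. As a brief sketch with made-up numbers, the other two functions mentioned earlier, matrix and rbind, work like this:

```r
# matrix() fills column by column unless byrow = TRUE is given.
by.column <- matrix(1:6, nrow = 2, ncol = 3)
by.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by.column[1, 2]   # 3
by.row[1, 2]      # 2

# rbind() stacks vectors as rows instead of columns.
stacked <- rbind(c(5, 6, 2), c(1, 2, 1))
dim(stacked)      # 2 3 (rows, then columns)
```

The same var[row,col] indexing applies no matter which function built the matrix.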

Storing Text Data

At times, you may need to store text data instead of numbers. To denote something as text, you use the "" operators

greeting <- "hello"
greeting
## [1] "hello"

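This is extremely different from the following, which looks for a variable named hello instead of storing the text and throws an error if no such variable exists: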

greeting <- hello

As a note, '' is generally interchangeable with "". You can use either one to denote a string

You can create character vectors in the same way as vectors of numbers:

metallica<-c("Lars","James","Jason","Kirk")
metallica
## [1] "Lars"  "James" "Jason" "Kirk"

You can index them exactly the same as well

metallica[2]
## [1] "James"

Functions are data type-specific. For example, you cannot take the square root of "hello" or "Lars". There are character-specific functions as well, such as the nchar() function which tells us how many characters are in a character string

nchar(metallica[4])
## [1] 4

You can use this function on the entire metallica array as well to get the number of characters in each string

nchar(metallica)
## [1] 4 5 5 4

Logical Values

Logical values tell you whether something is TRUE or FALSE. These can also be represented as 1 and 0, respectively. This data type can be created from a variety of commands, but most simply are created when testing if numbers are equal to, greater than, or less than each other which are represented by the ==, >, and < operators, respectively.

# Less than
21 < 50
## [1] TRUE
# Greater than
21 > 50
## [1] FALSE
# Equal to
21 == 50
## [1] FALSE

Table of logical operators:

operation                 operator  example input  answer
less than                 <         2 < 3          TRUE
less than or equal to     <=        2 <= 3         TRUE
greater than              >         2 > 3          FALSE
greater than or equal to  >=        2 >= 3         FALSE
equal to                  ==        2 == 3         FALSE
not equal to              !=        2 != 3         TRUE

Other important logical operators:

operation  operator  example input     answer
not        !         !(1==1)           FALSE
or         |         (1==1) | (2==3)   TRUE
and        &         (1==1) & (2==3)   FALSE

The ! operator essentially just switches TRUE to FALSE and vice-versa. You can use it in conjunction with commands that return logicals as well.

is.nan(c(2, NaN, 5))
## [1] FALSE  TRUE FALSE
!is.nan(c(2, NaN, 5))
## [1]  TRUE FALSE  TRUE
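
The operators from the tables can be chained in a single expression. A small sketch, assuming a made-up weights vector with one missing measurement recorded as NaN:

```r
weights <- c(25, NaN, 19, 27)

# Keep a value only if it is a real number AND it is above 20.
usable <- !is.nan(weights) & weights > 20
usable            # TRUE FALSE FALSE TRUE

weights[usable]   # 25 27
```

Each side of & is evaluated element by element, so the result has one TRUE or FALSE per entry in weights.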

Up to this point, we have introduced 3 types of data we can store and use in R: numerics, characters (strings), and logicals. There are plenty more, but those are generally the more common types. It is possible to forget what type of variable something is, or when you import data, you will want to make sure R imports variables as the correct types. The str and class functions work great for this.

class(mice.per.cage)
## [1] "numeric"
str(mice.per.cage)
##  num [1:13] 5 6 2 1 14 3 6 8 5 5 ...

Logical Indexing

This is an extremely useful tool for pulling out data that meet specific parameters you want. We know logical operators return a vector of TRUE and FALSE values. If we use that output as an index, R will return only the values at the indices corresponding to TRUE

For example, say we only wanted the cages of mice with >= 6 mice

ind <- mice.per.cage >= 6
ind
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [12]  TRUE FALSE

The indices equalling TRUE are 2, 5, 7, 8, 11, and 12

mice.per.cage
##  [1]  5  6  2  1 14  3  6  8  5  5  9 14  5
mice.per.cage[ind]
## [1]  6 14  6  8  9 14
mice.per.cage[mice.per.cage >= 6]
## [1]  6 14  6  8  9 14

Logical expressions can be combined with the & and | operators as seen before. So if we only wanted cages with between 6 and 20 mice:

mice.per.cage >= 6
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
## [12]  TRUE FALSE
mice.per.cage <= 20
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mice.per.cage[mice.per.cage >= 6 & mice.per.cage <= 20]
## [1]  6 14  6  8  9 14

Another extremely useful thing about logical indexing is that it can also allow you to count the number of instances your data matches specific parameters. For example, how many cages have more than 5 mice:

sum(mice.per.cage >= 6)
## [1] 6

This is because TRUE is treated as a numerical 1 and FALSE is treated as a numerical 0. You are only summing the 1’s
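
Two related base R helpers follow from the same idea. This sketch re-creates the cage counts so it can run on its own; which() reports the positions that are TRUE, and mean() of a logical vector gives the proportion of TRUE values:

```r
mice.per.cage <- c(5, 6, 2, 1, 14, 3, 6, 8, 5, 5, 9, 14, 5)

crowded <- mice.per.cage >= 6

which(crowded)   # 2 5 7 8 11 12 (the cage numbers)
sum(crowded)     # 6 (how many cages)
mean(crowded)    # 6 / 13, roughly 0.46 (the fraction of cages)
```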

Factors

Factors are a special type of variable. They can take the shape of characters or numbers, but don’t necessarily represent either. They represent a classification of the data point. For example, if you are measuring the difference between treatment and control groups and each group has 5 subjects, a factor variable may look like this:

# Control = 0, Treatment = 1
group <- factor(x = c(0,0,0,0,0,1,1,1,1,1))
group
##  [1] 0 0 0 0 0 1 1 1 1 1
## Levels: 0 1

0 and 1 aren’t entirely descriptive by themselves, so you can add labels to them:

group.2 <- factor(x = c(0,0,0,0,0,1,1,1,1,1), 
                  labels = c("Ctrl","Tx"))
group.2
##  [1] Ctrl Ctrl Ctrl Ctrl Ctrl Tx   Tx   Tx   Tx   Tx  
## Levels: Ctrl Tx

In fact, just always add labels to factors. Coming back to a data set with 0 and 1 coding gender is frustrating when you don’t remember which is which.

Another important point is that labels are assigned in order. The factor function sorts the unique values in your numerical array in ascending order and assigns them in order to the label array.

group.3 <- factor(x = c(1,1,1,1,1,0,0,0,0,0), 
                  labels = c("Ctrl","Tx"))
group.3
##  [1] Tx   Tx   Tx   Tx   Tx   Ctrl Ctrl Ctrl Ctrl Ctrl
## Levels: Ctrl Tx

So even though 1 comes first in the numerical array, it still is assigned to Tx because 0 is first numerically and gets assigned to the first value in the labels array, Ctrl.
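
If relying on the sort order feels fragile, one option is to spell out the pairing with the levels argument of factor. The sketch below uses a made-up codes vector and makes the mapping explicit, so 0 is paired with Ctrl and 1 with Tx regardless of the order in the data:

```r
codes <- c(1, 1, 1, 0, 0)

# levels lists the raw codes; labels gives the name for each, in the same order.
group <- factor(codes, levels = c(0, 1), labels = c("Ctrl", "Tx"))

as.character(group)   # "Tx" "Tx" "Tx" "Ctrl" "Ctrl"
levels(group)         # "Ctrl" "Tx"
table(group)          # counts per level: Ctrl 2, Tx 3
```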

Questions?

Data Frames

A data frame is a container data structure for many different variable types relating to a full or partial data set.

The basic organization:

  1. Each column is a variable
  2. Each row is a sample or individual
  3. Each index only contains a single value corresponding to the variable type.

For example, if we are storing information on multiple mice, we could store it as we have done previously:

ID <- factor(c("mouse1","mouse2","mouse3","mouse4","mouse5",
               "mouse6","mouse7","mouse8","mouse9","mouse10"), ordered = TRUE)
age <- c(16,28,19,21,23,20,25,24,27,22)
gender <- factor(c(0,1,1,0,1,0,0,1,1,0), levels = c(0,1), 
                 labels = c("F","M"))
weight <- c(25,19,21,19,19,27,22,21,17,22)

age
##  [1] 16 28 19 21 23 20 25 24 27 22
gender
##  [1] F M M F M F F M M F
## Levels: F M

So we would know that age, gender, and weight correspond to each individual mouse, however R would not know to treat them that way. We can instead join them together in a data frame using the data.frame command

mouse.df <- data.frame(ID,age,gender,weight)

# Output first 5 rows and all columns of mouse.df
mouse.df[1:5,]
##       ID age gender weight
## 1 mouse1  16      F     25
## 2 mouse2  28      M     19
## 3 mouse3  19      M     21
## 4 mouse4  21      F     19
## 5 mouse5  23      M     19

Accessing data in a data frame can be done in a variety of ways, but the most common is using the $ operator.

mouse.df$gender
##  [1] F M M F M F F M M F
## Levels: F M

The [] operator can also be used in combination with $ to pull out specific rows of the variable of interest

mouse.df$gender[1:5]
## [1] F M M F M
## Levels: F M

Data frames can be accessed similar to matrices as well

# Using numbers (value in the 1st row, 2nd column)
mouse.df[1,2]
## [1] 16
# Or using names of the variables (all values in the "age" column)
mouse.df[,"age"]
##  [1] 16 28 19 21 23 20 25 24 27 22

As well, changing the original variables outside the data frame DOES NOT change the data inside the data frame after it has been created

age <- c(0,0,0,0,0,0,0,0,0,0)
age
##  [1] 0 0 0 0 0 0 0 0 0 0
mouse.df$age
##  [1] 16 28 19 21 23 20 25 24 27 22

Other Data Frame Operations

Being able to check the variable types in a data frame after it has been created is important both when importing data and when coming back to it after a time. The str function is very useful for this

str(mouse.df)
## 'data.frame':    10 obs. of  4 variables:
##  $ ID    : Ord.factor w/ 10 levels "mouse1"<"mouse10"<..: 1 3 4 5 6 7 8 9 10 2
##  $ age   : num  16 28 19 21 23 20 25 24 27 22
##  $ gender: Factor w/ 2 levels "F","M": 1 2 2 1 2 1 1 2 2 1
##  $ weight: num  25 19 21 19 19 27 22 21 17 22

str gives a list of all variables, their type, any factor levels, and a small sample of the data.

Providing simple summary statistics for a data frame can also be helpful. This can be done using the summary() function

summary(mouse.df)
##        ID         age        gender     weight    
##  mouse1 :1   Min.   :16.00   F:5    Min.   :17.0  
##  mouse10:1   1st Qu.:20.25   M:5    1st Qu.:19.0  
##  mouse2 :1   Median :22.50          Median :21.0  
##  mouse3 :1   Mean   :22.50          Mean   :21.2  
##  mouse4 :1   3rd Qu.:24.75          3rd Qu.:22.0  
##  mouse5 :1   Max.   :28.00          Max.   :27.0  
##  (Other):4

This will provide quartiles and means for numeric data as well as counts for factors.
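
When only one or two numbers are needed, the individual summary functions can be called on a single column instead. A short sketch, re-creating just the age and weight columns from the mouse data so it runs on its own:

```r
mouse.df <- data.frame(
  age = c(16, 28, 19, 21, 23, 20, 25, 24, 27, 22),
  weight = c(25, 19, 21, 19, 19, 27, 22, 21, 17, 22)
)

mean(mouse.df$age)        # 22.5
median(mouse.df$weight)   # 21
range(mouse.df$age)       # 16 28
```

These match the corresponding entries in the summary() output above.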

Dataframe Subset

Most likely, you won’t want to perform operations on all samples of a data frame every time, such as when there are multiple groups in the dataset. Subsetting a dataframe allows you to pick and choose the rows you want, and there are multiple ways to do it.

# Can use logical indexing to grab the mice who are over 4 months old
mouse.df[mouse.df$age > 4,]
##         ID age gender weight
## 1   mouse1  16      F     25
## 2   mouse2  28      M     19
## 3   mouse3  19      M     21
## 4   mouse4  21      F     19
## 5   mouse5  23      M     19
## 6   mouse6  20      F     27
## 7   mouse7  25      F     22
## 8   mouse8  24      M     21
## 9   mouse9  27      M     17
## 10 mouse10  22      F     22

# Can also grab only single variable if necessary
mouse.df[mouse.df$age > 4, "weight"]
##  [1] 25 19 21 19 19 27 22 21 17 22

The basic structure is that mouse.df$age > 4 returns a logical vector of TRUE and FALSE values that chooses rows, whereas "weight" says which column, or variable, to take values from. Use c() to choose multiple variables.

The subset function

Subsetting using [] can often become messy and hard to read after the fact. Multiple functions have been created to handle this type of operation in a more readable fashion. The most common one is subset

# Mice over 4 months old
subset(x = mouse.df, subset = age > 4)
##         ID age gender weight
## 1   mouse1  16      F     25
## 2   mouse2  28      M     19
## 3   mouse3  19      M     21
## 4   mouse4  21      F     19
## 5   mouse5  23      M     19
## 6   mouse6  20      F     27
## 7   mouse7  25      F     22
## 8   mouse8  24      M     21
## 9   mouse9  27      M     17
## 10 mouse10  22      F     22

When subsetting by multiple conditions, use the logical operators discussed previously. So if you only wanted female mice who were over 4 months old:

# Mice over 4 months old who are female
subset(x = mouse.df, subset = age > 4 & gender == "F")
##         ID age gender weight
## 1   mouse1  16      F     25
## 4   mouse4  21      F     19
## 6   mouse6  20      F     27
## 7   mouse7  25      F     22
## 10 mouse10  22      F     22

Use the select option in the subset function to return 1 or more specific variables

# Mice over 4 months old who are male, returning only weight
subset(x = mouse.df, subset = age > 4 & gender == "M", 
       select = weight)
##   weight
## 2     19
## 3     21
## 5     19
## 8     21
## 9     17

Something important to notice is the data type returned by subset as opposed to [].

  1. subset returns selected output as a dataframe no matter what
  2. [] will return an array if only a single variable is selected as output.

View output structure from mouse.df[mouse.df$age > 4, "weight"] and subset(mouse.df, age > 4 & gender == "M", select = weight) for example
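
A compact sketch of that comparison, using a made-up three-row data frame so it runs on its own. It also shows the drop = FALSE argument of [], which is one way to keep the data frame structure when a single column is selected:

```r
mouse.df <- data.frame(age = c(16, 28, 19), weight = c(25, 19, 21))

by.bracket <- mouse.df[mouse.df$age > 18, "weight"]
by.subset <- subset(mouse.df, age > 18, select = weight)
by.bracket.kept <- mouse.df[mouse.df$age > 18, "weight", drop = FALSE]

class(by.bracket)        # "numeric" (a plain vector)
class(by.subset)         # "data.frame"
class(by.bracket.kept)   # "data.frame"

# Several columns at once: pass their names with c().
mouse.df[mouse.df$age > 18, c("age", "weight")]
```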

Adding Variables

Adding a variable to a dataframe is simple. Use the df$<new.var.name> notation and assign something to it

# Store information on average number of hours each mouse sleeps per day.
mouse.df$avg.sleep <- c(6,8,3,14,2,7,7,8,5,10)
mouse.df
##         ID age gender weight avg.sleep
## 1   mouse1  16      F     25         6
## 2   mouse2  28      M     19         8
## 3   mouse3  19      M     21         3
## 4   mouse4  21      F     19        14
## 5   mouse5  23      M     19         2
## 6   mouse6  20      F     27         7
## 7   mouse7  25      F     22         7
## 8   mouse8  24      M     21         8
## 9   mouse9  27      M     17         5
## 10 mouse10  22      F     22        10

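An important note is that any time you add a new variable to a dataframe, it should have one value per row. A single value is recycled to fill every row, but a length that does not divide evenly into the number of rows causes an error. All of the data in the variable will also share one type, because R converts mixed values to a common type.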
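
New variables are often calculated from existing ones rather than typed in. The sketch below assumes a made-up three-mouse data frame, derives a hypothetical avg.awake column from avg.sleep, and then shows the error raised when the lengths do not line up:

```r
mouse.df <- data.frame(ID = c("mouse1", "mouse2", "mouse3"),
                       avg.sleep = c(6, 8, 3))

# Derived column: hours awake per day, one value per row.
mouse.df$avg.awake <- 24 - mouse.df$avg.sleep
mouse.df$avg.awake   # 18 16 21

# Two values cannot be spread evenly over three rows, so R refuses.
outcome <- tryCatch({
  mouse.df$wrong.length <- c(1, 2)
  "no error"
}, error = function(e) "error")
outcome              # "error"
```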

Basic data frame manipulation is supported in base R, however custom functions are available to make life easier in packages.

Questions?

Packages

What is a package?

  • Collection of functions and data sets grouped under a single name that can be loaded and unloaded from the workspace.
  • Fill in gaps in functionality that base R leaves.
  • Mainly stored on CRAN making them easily downloaded in RStudio

Thousands of packages exist, and most of them do not come with R when you download the language. You must install the ones you want.

There is a major difference between a package being installed and loaded.

Installed: A package has been downloaded onto the individual’s computer. Does not say anything about whether the components of the package are available to the current R environment. Only done once!

Loaded: Components of a package are available to the environment after it has been loaded. You have to load a package each time RStudio is opened. Commands for loading packages can be placed at the beginning of scripts

A package must be installed before it can be loaded. A package must be loaded before it can be used.

There are a couple of ways to install a package. RStudio makes the process straightforward using the Packages Window.

Click the Install button to bring up a new window

The options describe the following:

  1. Whether you are installing from CRAN (the default), or from a downloaded tar.gz file.
  2. Either the package name (when downloading from CRAN), or the path to the tar.gz file.
  3. Where the package will be installed to on your computer. Defaults to a library in the R program folder

Packages can also be installed using the install.packages command.

install.packages("dplyr")

In my experience, however, the GUI is generally easier to use.

After being installed, packages can be loaded using the library command

library(dplyr)

Or they can be loaded by clicking the checkbox next to their names in the Packages panel in RStudio
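
The installed versus loaded distinction can be checked from code as well. This sketch uses the tools package as a stand-in because it ships with R, so nothing is downloaded; the variable names are made up for the illustration:

```r
pkg <- "tools"   # stand-in package that is installed alongside R

# Installed? requireNamespace() returns TRUE if the package can be found.
is.installed <- requireNamespace(pkg, quietly = TRUE)

# Loaded? Attached packages appear on the search path as "package:<name>".
attached.before <- paste0("package:", pkg) %in% search()

library(pkg, character.only = TRUE)
attached.after <- paste0("package:", pkg) %in% search()

is.installed     # TRUE
attached.after   # TRUE
```

In a fresh session attached.before is usually FALSE, which is the "installed but not loaded" state described above.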

Installed packages should be updated from time to time for bug-fixes and added functionality that the authors may have put in. This can be done in a couple of ways

  1. Click the Update button in the Packages window. Select the packages to update or select all to update all.
  2. The update.packages() function

The tidyverse

The tidyverse is a suite of packages that make data-wrangling much easier overall. It is composed of dplyr, ggplot2, tidyr, and more. You can install the tidyverse suite all at once using the methods from above inserting tidyverse into the field for the package name. You can load the tidyverse packages in the same way.

We will only touch on a few of the functions from dplyr and ggplot2 from here on out, but becoming familiar with the other aspects of the tidyverse will improve how you interact with R and your data.

Some dplyr Functions (aka verbs)

  1. select: choosing variables to keep or remove
  2. arrange: sort by specific variables
  3. summarize: reduce many values down to a single summary value
  4. group_by: make data operations done by group instead of as a whole
  5. gather and spread: switching between long and wide data organization (these two come from tidyr, another tidyverse package)

Choosing variables to keep or remove from a data frame can be done with the select function. The basic structure is:

select(df, var1, var2, ...)

# Choose variables to keep
select(mouse.df, gender, age)
##    gender age
## 1       F  16
## 2       M  28
## 3       M  19
## 4       F  21
## 5       M  23
## 6       F  20
## 7       F  25
## 8       M  24
## 9       M  27
## 10      F  22

Remove variables using the - sign before the dropped variable’s name

# Remove specific variables
select(mouse.df, -weight)
##         ID age gender avg.sleep
## 1   mouse1  16      F         6
## 2   mouse2  28      M         8
## 3   mouse3  19      M         3
## 4   mouse4  21      F        14
## 5   mouse5  23      M         2
## 6   mouse6  20      F         7
## 7   mouse7  25      F         7
## 8   mouse8  24      M         8
## 9   mouse9  27      M         5
## 10 mouse10  22      F        10

arrange is useful when trying to sort data by any specific variable(s). The basic structure is:

arrange(df, var1, var2, ...)

arrange(mouse.df, age)
##         ID age gender weight avg.sleep
## 1   mouse1  16      F     25         6
## 2   mouse3  19      M     21         3
## 3   mouse6  20      F     27         7
## 4   mouse4  21      F     19        14
## 5  mouse10  22      F     22        10
## 6   mouse5  23      M     19         2
## 7   mouse8  24      M     21         8
## 8   mouse7  25      F     22         7
## 9   mouse9  27      M     17         5
## 10  mouse2  28      M     19         8

You can also sort in descending order using the desc function

arrange(mouse.df, desc(age))
##         ID age gender weight avg.sleep
## 1   mouse2  28      M     19         8
## 2   mouse9  27      M     17         5
## 3   mouse7  25      F     22         7
## 4   mouse8  24      M     21         8
## 5   mouse5  23      M     19         2
## 6  mouse10  22      F     22        10
## 7   mouse4  21      F     19        14
## 8   mouse6  20      F     27         7
## 9   mouse3  19      M     21         3
## 10  mouse1  16      F     25         6

Sorting by multiple variables is also possible. Put the variables in the order you want the data frame sorted by in the arrange function call

# Sort by gender, then by age
arrange(mouse.df, gender, age)
##         ID age gender weight avg.sleep
## 1   mouse1  16      F     25         6
## 2   mouse6  20      F     27         7
## 3   mouse4  21      F     19        14
## 4  mouse10  22      F     22        10
## 5   mouse7  25      F     22         7
## 6   mouse3  19      M     21         3
## 7   mouse5  23      M     19         2
## 8   mouse8  24      M     21         8
## 9   mouse9  27      M     17         5
## 10  mouse2  28      M     19         8
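
For comparison, the same kinds of sorts can be sketched in base R with order(), which returns the row positions in sorted order. This uses a made-up five-mouse subset so it runs without dplyr:

```r
mouse.df <- data.frame(
  ID = c("mouse1", "mouse2", "mouse3", "mouse4", "mouse5"),
  age = c(16, 28, 19, 21, 23),
  gender = c("F", "M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Ascending by age.
by.age <- mouse.df[order(mouse.df$age), ]

# Descending by age.
by.age.desc <- mouse.df[order(mouse.df$age, decreasing = TRUE), ]

# By gender first, then by age within each gender.
by.gender.age <- mouse.df[order(mouse.df$gender, mouse.df$age), ]

by.gender.age$ID   # "mouse1" "mouse4" "mouse3" "mouse5" "mouse2"
```

One visible difference: the base R version keeps the original row names, while the arrange output above is renumbered from 1.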

summarize returns a dataframe that contains user-defined summary statistics for an existing dataframe. The basic structure is:

summarize(df, new.var1 = func1(var1), new.var2 = func2(var2))

# get the average age and weight for all mice
summarize(mouse.df, mean.age = mean(age), mean.weight = mean(weight))
##   mean.age mean.weight
## 1     22.5        21.2

group_by is useful for applying functions across groups. By itself, it doesn’t do much, but is very powerful in combination with other functions. For example, if we wanted to get the mean age and weight separated by gender, we can use group_by with summarize

mouse.grouped <- group_by(mouse.df, gender)
summarize(mouse.grouped, mean.age = mean(age), mean.weight = mean(weight))
## # A tibble: 2 x 3
##   gender mean.age mean.weight
##   <fct>     <dbl>       <dbl>
## 1 F          20.8        23  
## 2 M          24.2        19.4
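
As a cross-check that does not need dplyr, the same grouped means can be sketched with base R's aggregate() function. The data frame is re-created here so the block runs on its own:

```r
mouse.df <- data.frame(
  age = c(16, 28, 19, 21, 23, 20, 25, 24, 27, 22),
  gender = c("F", "M", "M", "F", "M", "F", "F", "M", "M", "F"),
  weight = c(25, 19, 21, 19, 19, 27, 22, 21, 17, 22),
  stringsAsFactors = FALSE
)

# Mean of age and weight, computed separately for each gender.
by.gender <- aggregate(cbind(age, weight) ~ gender, data = mouse.df, FUN = mean)
by.gender
```

The F row should show a mean age of 20.8 and mean weight of 23, and the M row 24.2 and 19.4, matching the group_by and summarize result.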

The last pair of functions included here involve transforming how the data frame itself is structured and creating tidy data. Tidy data refers to the data with the following properties:

  1. Each column is a separate variable
  2. Each row is an observation
  3. Each cell has a single value for that observation

Tidiness is important for data with repeated measures, where a single variable type is measured on multiple occasions. For example, there are a couple of ways to store repeated measure variables: the wide format, and the long format.

In the wide format, repeated measure categories (e.g. different days) are spread across different variables. Say we were measuring temperature at different locations across seasons.

state.temp <- data.frame(state = c("Alabama","Tennessee","Michigan"),
                         winter = c(47, 39, 22),
                         summer = c(79, 76, 66))
state.temp
##       state winter summer
## 1   Alabama     47     79
## 2 Tennessee     39     76
## 3  Michigan     22     66

In this case, the summer and winter variables both represent the same thing, temperature. It is a repeated measure of the same variable spread across multiple columns. Wide formats typically do not conform to tidy data principles since multiple observations of a variable are stored in a single row

The long format has repeated measure categories grouped in a single column. So winter and summer would be factors in a season variable and temperature would be a separate column

##       state season temperature
## 1   Alabama winter          47
## 2   Alabama summer          79
## 3  Michigan winter          22
## 4  Michigan summer          66
## 5 Tennessee winter          39
## 6 Tennessee summer          76

The long format typically complies with tidy data standards as now there is a single observation of temperature described by season and state per row.

Continuing with our loveable mice, say the average sleep time we entered earlier was only in the summer, and we add sleep time for the winter as a separate variable

# rename the avg.sleep variable to summer
mouse.df <- rename(mouse.df, summer = avg.sleep)

mouse.df$winter <- c(10,8,12,16,7,11,12,12,5,7)
mouse.df
##         ID age gender weight summer winter
## 1   mouse1  16      F     25      6     10
## 2   mouse2  28      M     19      8      8
## 3   mouse3  19      M     21      3     12
## 4   mouse4  21      F     19     14     16
## 5   mouse5  23      M     19      2      7
## 6   mouse6  20      F     27      7     11
## 7   mouse7  25      F     22      7     12
## 8   mouse8  24      M     21      8     12
## 9   mouse9  27      M     17      5      5
## 10 mouse10  22      F     22     10      7

There are now columns for both summer and winter average sleep time. The data are in a wide form, where average amount of sleep is spread across different columns based on time of year. To convert to a long format, we will use the gather function. gather has the following format:

gather(df, key, value, variables)

  1. key is the name (a string) of the new grouping variable
  2. value is the name of the new variable that will hold the values
  3. variables are the current columns that are going to be gathered

mouse.df.g <- gather(mouse.df, key = "season", value = "avg.sleep", 
                     summer, winter)

##         ID age gender weight season avg.sleep
## 1   mouse1  16      F     25 summer         6
## 2   mouse1  16      F     25 winter        10
## 3  mouse10  22      F     22 summer        10
## 4  mouse10  22      F     22 winter         7
## 5   mouse2  28      M     19 summer         8
## 6   mouse2  28      M     19 winter         8
## 7   mouse3  19      M     21 summer         3
## 8   mouse3  19      M     21 winter        12
## 9   mouse4  21      F     19 summer        14
## 10  mouse4  21      F     19 winter        16
## 11  mouse5  23      M     19 summer         2
## 12  mouse5  23      M     19 winter         7
## 13  mouse6  20      F     27 summer         7
## 14  mouse6  20      F     27 winter        11
## 15  mouse7  25      F     22 summer         7
## 16  mouse7  25      F     22 winter        12
## 17  mouse8  24      M     21 summer         8
## 18  mouse8  24      M     21 winter        12
## 19  mouse9  27      M     17 summer         5
## 20  mouse9  27      M     17 winter         5

spread is the opposite of gather. It turns a long structure into a wide structure, and the command is very similar to gather:

spread(df, key, value)

However, key and value are now both existing columns in the dataframe, as opposed to ones that will be created.

##        ID age gender weight season avg.sleep
## 1  mouse1  16      F     25 summer         6
## 2  mouse1  16      F     25 winter        10
## 3 mouse10  22      F     22 summer        10
## 4 mouse10  22      F     22 winter         7
## 5  mouse2  28      M     19 summer         8
## 6  mouse2  28      M     19 winter         8
head(spread(mouse.df.g, key = "season", value = "avg.sleep"),3)
##        ID age gender weight summer winter
## 1  mouse1  16      F     25      6     10
## 2 mouse10  22      F     22     10      7
## 3  mouse2  28      M     19      8      8
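To see that gather and spread really are inverses, here is a minimal round-trip sketch. The data frame is retyped from the mouse.df table printed earlier; the names mouse.long, mouse.wide, orig.sorted and wide.sorted are assumptions made for this illustration.

```r
library(tidyr)

# Data copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  ID     = paste0("mouse", 1:10),
  age    = c(16, 28, 19, 21, 23, 20, 25, 24, 27, 22),
  gender = c("F", "M", "M", "F", "M", "F", "F", "M", "M", "F"),
  weight = c(25, 19, 21, 19, 19, 27, 22, 21, 17, 22),
  summer = c(6, 8, 3, 14, 2, 7, 7, 8, 5, 10),
  winter = c(10, 8, 12, 16, 7, 11, 12, 12, 5, 7),
  stringsAsFactors = FALSE
)

# Wide -> long: one row per mouse per season
mouse.long <- gather(mouse.df, key = "season", value = "avg.sleep",
                     summer, winter)

# Long -> wide: back to one row per mouse
mouse.wide <- spread(mouse.long, key = "season", value = "avg.sleep")

# Compare after sorting both by ID, since spread may reorder the rows
orig.sorted <- mouse.df[order(mouse.df$ID), ]
wide.sorted <- mouse.wide[order(mouse.wide$ID), ]
all.equal(orig.sorted, wide.sorted, check.attributes = FALSE)
```

The sorting step is there because the ID column is text, so mouse10 sorts directly after mouse1, as in the output above.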

Other dplyr verbs

  1. filter - dplyr’s version of subset
  2. mutate - more easily add variables to a dataframe
  3. sample_n / sample_frac - randomly sample rows from a dataframe
  4. %>% - pipe operator. Chain together commands as opposed to nesting commands

Many, many more.
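A short sketch of how these verbs fit together, using the mouse data printed earlier. The column name sleep.diff and the variable females are hypothetical names chosen for this example.

```r
library(dplyr)

# Data copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  ID     = paste0("mouse", 1:10),
  gender = c("F", "M", "M", "F", "M", "F", "F", "M", "M", "F"),
  summer = c(6, 8, 3, 14, 2, 7, 7, 8, 5, 10),
  winter = c(10, 8, 12, 16, 7, 11, 12, 12, 5, 7),
  stringsAsFactors = FALSE
)

# Without the pipe: calls are nested and read inside-out
mutate(filter(mouse.df, gender == "F"), sleep.diff = winter - summer)

# With the pipe: the same steps read top to bottom
females <- mouse.df %>%
  filter(gender == "F") %>%               # keep only female mice
  mutate(sleep.diff = winter - summer)    # add a new column

females

# Randomly pick 3 of those rows (the result varies between runs)
sample_n(females, 3)
```

Both versions produce the same data frame; the piped form is usually easier to read once more than two steps are involved.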

Reading in Outside Data

Up to now, we have been concerned with user-generated data created inside R. However, most data will need to be imported instead. Luckily, there are a variety of functions and packages for importing all kinds of data.

  1. utils: import using base R. read.table, read.csv, read.delim, etc.
  2. readr: specialized for rectangular text data. read_table, read_csv, read_delim
  3. foreign: import from other data processing programs. read.spss, read.octave, etc.
  4. readxl: import excel tables. read_xls, read_xlsx, read_excel

Let’s look at an example comma-separated value (csv) text file and how to import that into R.

NOTE: The commas in this file are called the delimiter. Any single symbol can be used as a delimiter, however the most common are commas, tabs, and spaces.

mouse.df.read <- read.delim("~/Desktop/mousedf.csv", sep = ",")

mouse.df.read
##         ID age gender weight summer winter
## 1   mouse1  16      F     25      6     10
## 2   mouse2  28      M     19      8      8
## 3   mouse3  19      M     21      3     12
## 4   mouse4  21      F     19     14     16
## 5   mouse5  23      M     19      2      7
## 6   mouse6  20      F     27      7     11
## 7   mouse7  25      F     22      7     12
## 8   mouse8  24      M     21      8     12
## 9   mouse9  27      M     17      5      5
## 10 mouse10  22      F     22     10      7

read.csv is a shortcut for this situation: like read.delim, it is a wrapper around read.table, but it assumes the delimiter is a comma.

mouse.df.read <- read.csv("~/Desktop/mousedf.csv")

mouse.df.read
##         ID age gender weight summer winter
## 1   mouse1  16      F     25      6     10
## 2   mouse2  28      M     19      8      8
## 3   mouse3  19      M     21      3     12
## 4   mouse4  21      F     19     14     16
## 5   mouse5  23      M     19      2      7
## 6   mouse6  20      F     27      7     11
## 7   mouse7  25      F     22      7     12
## 8   mouse8  24      M     21      8     12
## 9   mouse9  27      M     17      5      5
## 10 mouse10  22      F     22     10      7
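Since the file ~/Desktop/mousedf.csv may not exist on your machine, here is a self-contained sketch that writes a small data frame to a temporary csv file and reads it back both ways. The temporary file and the names mouse.small, csv.path, from.delim and from.csv are assumptions for illustration only.

```r
# A few rows from the mouse data printed above
mouse.small <- data.frame(ID = c("mouse1", "mouse2", "mouse3"),
                          age = c(16, 28, 19),
                          gender = c("F", "M", "M"))

# Write to a temporary csv file (row.names = FALSE avoids an extra column)
csv.path <- tempfile(fileext = ".csv")
write.csv(mouse.small, csv.path, row.names = FALSE)

# Both calls read the same comma-delimited file
from.delim <- read.delim(csv.path, sep = ",")
from.csv   <- read.csv(csv.path)

# Check whether the two import routes agree
identical(from.delim, from.csv)
```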

Sometimes, data may be stored in Excel sheets. R can import those with ease as well using the readxl package. Use read_xls or read_xlsx if you know the file type; otherwise, use read_excel and it will guess which is more appropriate. So our data in Excel would look like this:

Since we know our file type (xlsx here), we will use read_xlsx.

mouse.df.read <- read_xlsx(path = "~/Desktop/mousedf.xlsx")

mouse.df.read
## # A tibble: 10 x 6
##    ID        age gender weight summer winter
##    <chr>   <dbl> <chr>   <dbl>  <dbl>  <dbl>
##  1 mouse1     16 F          25      6     10
##  2 mouse2     28 M          19      8      8
##  3 mouse3     19 M          21      3     12
##  4 mouse4     21 F          19     14     16
##  5 mouse5     23 M          19      2      7
##  6 mouse6     20 F          27      7     11
##  7 mouse7     25 F          22      7     12
##  8 mouse8     24 M          21      8     12
##  9 mouse9     27 M          17      5      5
## 10 mouse10    22 F          22     10      7

After importing data, be sure to check the variable types for the data frame like we did before (str function). Import functions have trouble distinguishing between characters, factors, and other types, so make sure your data is the way you want it before working with it.
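A minimal sketch of that check, assuming we want gender stored as a factor. To keep the example self-contained, the csv content is passed through the text argument of read.csv instead of a file path; the three rows are taken from the table above.

```r
# Read csv text directly (the text argument stands in for a file here)
mouse.df.read <- read.csv(text = "ID,age,gender
mouse1,16,F
mouse2,28,M
mouse3,19,M", stringsAsFactors = FALSE)

# Inspect the type of each column
str(mouse.df.read)

# gender came in as character; convert it to a factor if that is what you want
mouse.df.read$gender <- factor(mouse.df.read$gender)
str(mouse.df.read)
```

Setting stringsAsFactors explicitly makes the result the same regardless of the R version's default.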

Questions?

Graphing with ggplot

Lastly, ggplot is the plotting tool of choice for most things in R. The gg in the name stands for the grammar of graphics, the high-level idea that you can build any graph from the same components:

  1. Data
  2. A coordinate system
  3. Visual marks that represent data points

At a low level, every ggplot function call has 3 properties:

  1. Data
  2. Aesthetic mappings between the data and visual properties of the graph (x, y, color, etc.)
  3. Geoms - layers that render the data

Aesthetics are controlled via the aes function. This function maps properties of the graph to variables in the data frame.

A simple example would be plotting the weight of a mouse versus its age in a scatter plot.

NOTE: I have added data to the dataframe, specifically 10 more mice. This will make the plots more populated.

ggplot(data = mouse.df, mapping = aes(x = age, y = weight)) + 
  geom_point()

This produces a scatterplot defined by:

  1. Data: mouse.df
  2. Aesthetics: age mapped to the x axis, weight mapped to the y axis
  3. Layer: points

The basic structure is that data and aesthetics are supplied in the ggplot call and the layers are added on with +

x and y are not the only aesthetics you’ll want to use when plotting. aes gives access to many different properties of the geoms such as

  • color
  • size
  • shape
  • fill
  • transparency (alpha)
  • and more!

For instance, if we wanted to differentiate males and females by color, it would look like:

#Plot different genders as different colors
ggplot(mouse.df, aes(age, weight, color = gender)) + 
  geom_point()

In the previous graph, those points may have been difficult to see. This problem raises the question of how to change properties of the graph without mapping to a variable. Luckily, you can manually set a value for any of the previous properties in the geom call. For example:

#Plot different genders as different colors, with larger points
ggplot(mouse.df, aes(age, weight, color = gender)) + 
  geom_point(size = 3)

In this relatively simple dataset, it’s easy to see differences between the groups on that plot. However, with much larger datasets, visual differences can be muddled if all the groups are plotted on the same graph. Faceting helps with this. Faceting splits the data into separate panels by a variable specified in the facet command. The most generally useful facet command is

facet_wrap(~var)

#Plot different genders in different facet plots
ggplot(mouse.df, aes(age, weight, color = gender)) + 
  geom_point(size = 3) +
  facet_wrap(~gender)

Multiple Geoms

Up until now, we have worked with one geom at a time to represent our data. However, plotting multiple geoms is common. Multiple geoms can be added easily with extra geom functions. For example, we could draw a combined dot and line plot using the age and weight data:

ggplot(mouse.df,aes(age,weight)) + 
  geom_point() +
  geom_line()

However, what if we wanted to differentiate dots by color again? We can add back in that aesthetic to the ggplot call, but it has an unintended consequence on the line plot:

ggplot(mouse.df,aes(age,weight, color = gender)) + 
  geom_point(size = 3) +
  geom_line()

Inherited Aesthetics

In the previous plot, we only wanted the dots to change color, but the line to stay the same, so why did the line geom split by gender as well?

In ggplot, all aesthetics named in the ggplot call are passed down to every following geom function. So both geom_point and geom_line have the same aesthetics of:

  1. x = age
  2. y = weight
  3. color = gender

We say that the geoms inherit the aesthetics of the original ggplot call

Inherited aesthetics are an extremely important concept in ggplot. They mean you do not have to tell each individual geom call what’s x, y, and whatever else, which makes code much cleaner overall.

However, inherited aesthetics become slightly problematic if you want one geom to apply to subgroups (plotting dots as different colors by gender) and another geom to apply to the data as a whole (connecting all points with a single line).

Luckily, ggplot also makes this fairly easy as you can name aesthetics inside the geom calls themselves, and those aesthetics will be local only to that geom.

For example, let’s again try to plot the dots as different colors, but plot a single line plot for the whole group:

ggplot(mouse.df,aes(age,weight)) + 
  geom_point(aes(color = gender), size = 3) +
  geom_line()

Constantly keep in mind how you want your graph to look at the end and which aesthetics are appropriate for which geoms.

Sketching your graph on paper first will help you decide which geoms and aesthetics are appropriate, and will save time overall.

Questions?

Plotting Discrete Vs. Continuous Values

So far, we have only been concerned with continuous values on the x and y axes. However, plotting discrete values is just as easy.

Let’s plot a boxplot of male vs. female average summer sleep time.

ggplot(mouse.df,aes(gender,summer)) +
  geom_boxplot(outlier.shape = 3)

We can also easily overlay points on top of the boxplot:

ggplot(mouse.df,aes(gender,summer)) +
  geom_boxplot(outlier.shape = 3) +
  geom_point(size = 3)

You may notice that the number of points on the graph does not match the number of points we know we have for each group. This is because some mice sleep the same amount of time, so their points are plotted over each other. We can offset x positions using geom_jitter as opposed to geom_point so that all dots can be seen on the graph.

ggplot(mouse.df,aes(gender,summer)) +
  geom_boxplot(outlier.shape = 3) +
  geom_jitter(size = 3)
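One way to check that overlapping values really are the cause is to count how many mice share each sleep time within a gender. The sketch below uses only the 10 mice printed earlier (not the extended data used for the plots) and the count function from dplyr; the name overlaps is an assumption for this example.

```r
library(dplyr)

# gender and summer columns copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  gender = c("F", "M", "M", "F", "M", "F", "F", "M", "M", "F"),
  summer = c(6, 8, 3, 14, 2, 7, 7, 8, 5, 10),
  stringsAsFactors = FALSE
)

# Count mice per gender / sleep-time combination;
# any combination with n > 1 would be drawn as a single visible point
overlaps <- mouse.df %>%
  count(gender, summer) %>%
  filter(n > 1)

overlaps
```

Any row that shows up in overlaps corresponds to a set of points that geom_point draws on top of one another.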

We can change the width of the jitter using the width argument in the geom_jitter call:

ggplot(mouse.df,aes(gender,summer)) +
  geom_boxplot(outlier.shape = 3) +
  geom_jitter(size = 3, width = 0.1)

Tidy Data and ggplot

Given how aesthetics are named, ggplot works best with tidy data. For example, with our current data frame, plotting average sleep time for summer vs. winter would be very difficult, because the values we want on one axis are split across two columns. However, if we gather the data like we did earlier, plotting becomes much easier.

mouse.df.g <- gather(mouse.df, key = "season", value = "avg.sleep", summer,winter)
head(mouse.df.g)
##       ID age gender weight season avg.sleep
## 1 mouse1  16      F     25 summer         6
## 2 mouse2  28      M     17 summer         8
## 3 mouse3  19      M     19 summer         3
## 4 mouse4  21      F     19 summer        14
## 5 mouse5  23      M     17 summer         2
## 6 mouse6  20      F     27 summer         7

Now, we can make a bar plot of avg.sleep time vs. season.

ggplot(mouse.df.g, aes(season,avg.sleep)) +
  geom_bar(stat = "summary", fun.y = "mean")

However, what if we wanted to see if there are also differences between males and females as well as between seasons? We can add a fill aesthetic for this.

ggplot(mouse.df.g, aes(season,avg.sleep, fill = gender)) +
  geom_bar(stat = "summary", fun.y = "mean")

By default, geom_bar stacks the groups on top of one another. We can get around this using the position = position_dodge() option in geom_bar, which places the bars side by side instead.

ggplot(mouse.df.g, aes(season,avg.sleep, fill = gender)) +
  geom_bar(stat = "summary", fun.y = "mean", position = position_dodge())
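The bar heights in these plots are group means, so they can be cross-checked by computing the means directly. This is a sketch using only the 10 mice printed earlier, so the numbers will differ from plots made with the extended data; sleep.means and mean.sleep are names assumed for this example.

```r
library(dplyr)
library(tidyr)

# Columns copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  gender = c("F", "M", "M", "F", "M", "F", "F", "M", "M", "F"),
  summer = c(6, 8, 3, 14, 2, 7, 7, 8, 5, 10),
  winter = c(10, 8, 12, 16, 7, 11, 12, 12, 5, 7),
  stringsAsFactors = FALSE
)

# Reshape to long format, as in the plotting code
mouse.df.g <- gather(mouse.df, key = "season", value = "avg.sleep",
                     summer, winter)

# One mean per season/gender combination: the same grouping the dodged bars use
sleep.means <- mouse.df.g %>%
  group_by(season, gender) %>%
  summarise(mean.sleep = mean(avg.sleep))

sleep.means
```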

Storing Plots as Variables

ggplot objects can be stored as variables to make trying out different layers easier with less typing. For example, we can store the bar plot we just made in a variable bar

bar <- ggplot(mouse.df.g, aes(season, avg.sleep, fill = gender)) +
  geom_bar(stat = "summary", fun.y = "mean", position = position_dodge())

We can then just call bar to plot the stored graph again:

bar

You can also add layers using +. So if we wanted to add points to the plot, we can call

bar + geom_point(position = position_dodge(width = 0.9))

Errorbars

Errorbars are essential for publication quality plots. We can use the stat_summary function to create errorbars representing the standard error of the mean for each group.

For this case, the stat_summary function works by giving it 2 important arguments:

  1. fun.data: the function to apply to the data. Various functions exist for this such as mean_se, mean_sdl, mean_cl_boot, and others
  2. geom: the geom used to represent the errorbars. Can be one of 4 different types: errorbar, pointrange, linerange, or crossbar

For example, if we wanted to add errorbars to the previous plot:

bar + stat_summary(fun.data = mean_se, geom = "errorbar", 
                   position = position_dodge())
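To make clear what mean_se hands to the errorbar geom, the sketch below computes the mean plus and minus one standard error by hand and compares it with the output of mean_se. The vector x holds the summer sleep times of the five female mice from the table printed earlier; the names se, manual and auto are assumptions for this example.

```r
library(ggplot2)

# Summer sleep times of the five female mice from the earlier table
x <- c(6, 14, 7, 7, 10)

# Standard error of the mean: standard deviation divided by sqrt(n)
se <- sd(x) / sqrt(length(x))

# Centre and limits computed by hand
manual <- data.frame(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)

# The same quantities as returned by ggplot2's mean_se
auto <- mean_se(x)

manual
auto
```

stat_summary applies this function to each group, and the errorbar geom draws from ymin to ymax.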

By default, the width of the errorbars is the width of the bars. We can change this using the width option in stat_summary, but this also changes the errorbars’ horizontal offset.

bar + stat_summary(fun.data = mean_se, geom = "errorbar", 
                   position = position_dodge(), width = 0.2)

To fix this, we will manually set the position_dodge argument to be 0.9, the default offset. For more complex graphs, you may need to play around with this number to place errorbars exactly where you want them.

bar + stat_summary(fun.data = mean_se, geom = "errorbar", 
                   position = position_dodge(0.9), width = 0.2)

Themes, axes, and titles

ggplot claims to be able to make publication quality plots out of the box; however, the defaults may not be suitable. The theme of a graph refers to general aesthetics of the graph itself such as the font type and size, the background color, the presence of gridlines, etc. Titles are not added by default but can be with a simple command, and axis labels are named after the variable they represent by default but can also be changed.

Themes

Any thematic element of the graph can be changed using the theme command. For the majority of the little things you want to change about the graph, you will need to Google the right option. For example, if we wanted to change the font size on the axes, we could use a command like this:

bar + theme(text = element_text(size = 30))

There are a ton of different options to change that cannot be summarized well here, but Google is your best friend.

There are some stock themes included in ggplot that can be used as one-liners to change many things about a plot at once. These themes are listed and can be seen at this link: ggplot themes
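As a small sketch of how a stock theme is applied, the example below uses a simple scatter plot as a stand-in for the bar plot. The plot variable p is an assumption for this example; theme_classic and theme_bw are two of the stock themes that ship with ggplot2.

```r
library(ggplot2)

# age and weight copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  age    = c(16, 28, 19, 21, 23, 20, 25, 24, 27, 22),
  weight = c(25, 19, 21, 19, 19, 27, 22, 21, 17, 22)
)

# A basic scatter plot stored as a variable
p <- ggplot(mouse.df, aes(age, weight)) +
  geom_point(size = 3)

# A stock theme is added like any other layer
p.classic <- p + theme_classic()
p.classic

# Stock themes can be combined with further theme() tweaks
p.bw <- p + theme_bw() + theme(text = element_text(size = 20))
p.bw
```

Put the theme() tweaks after the stock theme, since a stock theme replaces any theme settings made before it.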

Titles and Axis Labels

Titles and axis labels can be added easily using the xlab, ylab, and ggtitle commands. These have the basic form:

bar + ggtitle("This is a title for a plot")

Obviously, you will need to center the text and change the font size for it using the theme command, but that is the general way of adding titles and axis labels to the plot.
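A minimal sketch putting these pieces together, again with a scatter plot standing in for the bar plot. The title and label text, the font size, and the names p and p.labeled are assumptions chosen for illustration; hjust = 0.5 is the setting that centers the title.

```r
library(ggplot2)

# age and weight copied from the mouse.df table printed earlier
mouse.df <- data.frame(
  age    = c(16, 28, 19, 21, 23, 20, 25, 24, 27, 22),
  weight = c(25, 19, 21, 19, 19, 27, 22, 21, 17, 22)
)

p <- ggplot(mouse.df, aes(age, weight)) +
  geom_point(size = 3)

# Add a title and axis labels, then center and enlarge the title
p.labeled <- p +
  ggtitle("Mouse weight by age") +
  xlab("Age") +
  ylab("Weight") +
  theme(plot.title = element_text(hjust = 0.5, size = 20))

p.labeled
```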

Helpful Outside Resources