In order to facilitate the data analysis pipeline, it is crucial to have tidy data. What this means is that every column in your data frame represents a variable and every row represents an observation. This is also referred to as long format (as opposed to wide format).
tidyr is a package that provides useful functions for converting raw data into tidy data. This is typically the first step in the data analysis pipeline after you have collected your data.
This tutorial will focus on step 2 of the process. The main verbs we will use are:
gather() and spread(), which convert between long and wide data
separate(), which splits a single column into several variables and is most commonly used in conjunction with gather() for linguistic research (e.g. when separating columns in Praat)
gather(df, newVar1, newVar2, vector1, vector2)
library(tidyr); library(dplyr)
set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)
print(tidyr.ex)
## participant info day1score day2score
## 1 p1 g1m 70.60319 91.89943
## 2 p2 g1m 82.75465 93.90660
## 3 p3 g1f 67.46557 92.60625
## 4 p4 g2m 103.92921 85.55689
## 5 p5 g2m 84.94262 100.09425
## 6 p6 g2m 67.69297 91.11875
tidyr.ex %>%
  gather(day, score, c(day1score, day2score))
## participant info day score
## 1 p1 g1m day1score 70.60319
## 2 p2 g1m day1score 82.75465
## 3 p3 g1f day1score 67.46557
## 4 p4 g2m day1score 103.92921
## 5 p5 g2m day1score 84.94262
## 6 p6 g2m day1score 67.69297
## 7 p1 g1m day2score 91.89943
## 8 p2 g1m day2score 93.90660
## 9 p3 g1f day2score 92.60625
## 10 p4 g2m day2score 85.55689
## 11 p5 g2m day2score 100.09425
## 12 p6 g2m day2score 91.11875
Essentially we took the columns day1score and day2score, which represent the variable day and the variable score, and gathered them. Why? Remember that tidy data has one column for each variable and one row for each observation. The numbers in the two columns we changed were observations, thus they should each get their own row.
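To make the argument order of gather() concrete, here is a minimal sketch on a smaller data frame. The object names wide and long and the score values are made up for illustration; only gather() itself and its key and value arguments come from tidyr.

```r
library(tidyr)

# Hypothetical toy data (not tidyr.ex): two participants, two score columns
wide <- data.frame(
  participant = c("p1", "p2"),
  day1score   = c(70, 82),
  day2score   = c(91, 93)
)

# key   = name of the new column that will hold the old column names
# value = name of the new column that will hold the cell contents
# the remaining arguments are the columns to gather
long <- gather(wide, key = "day", value = "score", day1score, day2score)

# Each gathered column contributes one row per participant
nrow(long)
names(long)
```

Every column that is not gathered (here participant) is repeated once for each gathered column, which is why the long version has more rows and fewer columns.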
spread() does the opposite of gather(). The spread() verb takes different levels of a factor and spreads them out into different columns. This means we can convert from long data to wide.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  spread(day, score)
## participant info day1score day2score
## 1 p1 g1m 70.60319 91.89943
## 2 p2 g1m 82.75465 93.90660
## 3 p3 g1f 67.46557 92.60625
## 4 p4 g2m 103.92921 85.55689
## 5 p5 g2m 84.94262 100.09425
## 6 p6 g2m 67.69297 91.11875
Now we are back to how we started.
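The round trip can also be checked programmatically rather than by eye. The following is a small sketch on made-up data; the names wide, long and wide_again are hypothetical.

```r
library(tidyr)

# Hypothetical toy data (not tidyr.ex)
wide <- data.frame(
  participant = c("p1", "p2"),
  day1score   = c(70, 82),
  day2score   = c(91, 93)
)

# Wide -> long -> wide again
long       <- gather(wide, key = "day", value = "score", day1score, day2score)
wide_again <- spread(long, key = day, value = score)

# Compare the contents of the two data frames, ignoring attributes
all.equal(wide, wide_again, check.attributes = FALSE)
```

Note that in tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(); the older verbs still work, which is why they are used throughout this tutorial.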
separate(df, col, into, sep)
Consider the column info of our fake data. You can probably guess what the observations represent. How many variables are there? Take a second to think about it if it doesn't jump out at you. The answer is 2. g1 and g2 appear to be levels of a grouping variable (g = group), and m/f is an indication of gender. Because there are two separate variables, there should be two columns in the data frame: one for group and one for gender.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2)
## participant group gender day score
## 1 p1 g1 m day1score 70.60319
## 2 p2 g1 m day1score 82.75465
## 3 p3 g1 f day1score 67.46557
## 4 p4 g2 m day1score 103.92921
## 5 p5 g2 m day1score 84.94262
## 6 p6 g2 m day1score 67.69297
## 7 p1 g1 m day2score 91.89943
## 8 p2 g1 m day2score 93.90660
## 9 p3 g1 f day2score 92.60625
## 10 p4 g2 m day2score 85.55689
## 11 p5 g2 m day2score 100.09425
## 12 p6 g2 m day2score 91.11875
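The sep argument of separate() behaves differently depending on its type: a number is read as a character position, while a string is read as a regular expression that matches the delimiter. The sketch below contrasts the two on made-up labels; the column names fixed and delimited are assumptions for illustration only.

```r
library(tidyr)

# Hypothetical labels: one fixed-width, one with an explicit delimiter
df <- data.frame(
  fixed     = c("g1m", "g2f"),
  delimited = c("g1_m", "g2_f")
)

# Numeric sep: split after the 2nd character
by_position <- separate(df, col = fixed, into = c("group", "gender"), sep = 2)

# Character sep: treated as a regular expression matching the delimiter
by_delimiter <- separate(df, col = delimited, into = c("group", "gender"), sep = "_")

by_position
by_delimiter
```

Splitting by position only works when every label has the same width; a delimiter is usually the safer choice when labels can vary in length.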
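unite() does the reverse of separate(): it pastes several columns back into one, using "_" as the default separator.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  unite(infoAgain, group, gender)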
## participant infoAgain day score
## 1 p1 g1_m day1score 70.60319
## 2 p2 g1_m day1score 82.75465
## 3 p3 g1_f day1score 67.46557
## 4 p4 g2_m day1score 103.92921
## 5 p5 g2_m day1score 84.94262
## 6 p6 g2_m day1score 67.69297
## 7 p1 g1_m day2score 91.89943
## 8 p2 g1_m day2score 93.90660
## 9 p3 g1_f day2score 92.60625
## 10 p4 g2_m day2score 85.55689
## 11 p5 g2_m day2score 100.09425
## 12 p6 g2_m day2score 91.11875
Now that our data are tidy (using just the gather() and separate() verbs), we can plot and analyze them.
library(ggplot2) # provides ggplot(), geom_point(), facet_wrap() and geom_smooth()
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  ggplot(aes(x = day, y = score)) +
  geom_point() +
  facet_wrap(~ group) +
  geom_smooth(method = "lm", aes(group = 1), se = FALSE)
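Tidy data also feeds directly into numerical summaries. As a rough sketch of the "analyze" half, the example below computes the mean score per group and day with dplyr. The data frame scores and its values are invented for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(tidyr)
library(dplyr)

# Hypothetical data in the same shape as tidyr.ex, with fixed scores
scores <- data.frame(
  participant = c("p1", "p2", "p3", "p4"),
  info        = c("g1m", "g1f", "g2m", "g2m"),
  day1score   = c(70, 80, 60, 100),
  day2score   = c(90, 94, 85, 95)
)

means <- scores %>%
  gather(day, score, c(day1score, day2score)) %>%               # wide -> long
  separate(col = info, into = c("group", "gender"), sep = 2) %>% # split info
  group_by(group, day) %>%                                       # one cell per group and day
  summarise(mean_score = mean(score), .groups = "drop")

means
```

Because group and day are now ordinary columns, the grouping step needs no special handling; the same summary on the wide data would require naming each score column separately.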
These are the essential verbs used for tidying data. There are other commands that can be useful, but mainly they are different takes on the ones we have covered here (e.g. extract(), which is similar to separate() but uses regex, and unite(), which does the reverse of separate()).
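For completeness, here is a small sketch of extract() and unite() on made-up labels. The regular expression is an assumption about what the labels look like (a group code such as g1 followed by m or f), and tidyr::extract() is called with its namespace because other packages also export a function named extract().

```r
library(tidyr)

# Hypothetical labels in the same style as the info column
df <- data.frame(info = c("g1m", "g1f", "g2m"))

# Each parenthesised group in the regex becomes one new column
parts <- tidyr::extract(
  df,
  col   = info,
  into  = c("group", "gender"),
  regex = "(g[0-9]+)([mf])"
)

# unite() pastes the columns back together; sep = "" avoids the default "_"
rejoined <- unite(parts, col = "info", group, gender, sep = "")

parts
rejoined
```

Unlike separate() with a fixed position, the regex version keeps working if the group code grows to more than two characters (for example g10).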