library(tidyverse)
Tidyverse is a collection of R packages for convenient data cleanup. Loading the tidyverse package will load the core tidyverse packages, including tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.
Intro
- Tidyverse is optimized for interactive workflow with data
- Each function does one thing simply and well
- Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
- Everything works with tibbles
- Web page: http://tidyverse.org/
- Workshop page (with example scripts): http://bodowinter.com/carpentry/index.html
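The two call styles from the "Basic idea" item are interchangeable. A minimal sketch, assuming a small made-up tibble `d` that is not part of the workshop data:

```r
library(tibble)
library(dplyr)

# Hypothetical toy data, only for illustration
d <- tibble(x = c(3, 1, 2))

# Function-call style: action(data, some_arguments)
a <- arrange(d, x)

# Pipe style: data %>% action(some_arguments)
b <- d %>% arrange(x)

# Both styles produce the same sorted tibble
identical(a, b)
```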
Core packages
tibble
- A modern version of dataframes
- The first argument of most tidyverse functions, and what most of them return
- Tibbles use characters instead of factors for texts
- Tibbles print more nicely than normal dataframes: they show the data type of each column, the number of rows, and only the first few rows/columns instead of all of the data
mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
age <- c(NA, 30, 20, 25, 18)
pre <- round(rnorm(length(mynames)), 2)
post <- round(rnorm(length(mynames)), 2)
mydata <- tibble(mynames, age, pre, post) # create tibble from data
mydf <- data.frame(mynames, age, pre, post)
as_tibble(mydf) # convert data frame into tibble
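The point about characters versus factors can be checked directly. A small sketch with made-up columns (the names `tb` and `df` are only for illustration); `stringsAsFactors = TRUE` is set explicitly because it stopped being the `data.frame()` default in R 4.0:

```r
library(tibble)

# Hypothetical toy columns, only for illustration
tb <- tibble(name = c("a", "b"), value = c(1.5, 2.5))
class(tb$name)   # tibble() keeps text as character

# data.frame() turns text into factors when asked to
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = TRUE)
class(df$name)
```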
readr
- Does the same as read.csv from base R: it reads a csv file
- Faster
- Automatically creates tibbles
- Progress bar for big files
read_csv('somefile.csv')
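A self-contained round trip, as a sketch: the temporary file and its two columns are assumptions made up for this example, so it does not depend on 'somefile.csv' existing.

```r
library(readr)

# Hypothetical temporary file with two made-up columns
path <- tempfile(fileext = ".csv")
write_csv(data.frame(name = c("a", "b"), score = c(1.5, 2)), path)

# Declaring the column types makes the import explicit
scores <- read_csv(path, col_types = cols(name = col_character(), score = col_double()))

# The result is a tibble, not a plain data frame
class(scores)
```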
tidyr
- A data frame is a rectangular array of variables (columns) and observations (rows)
- A tidy data frame is a data frame where…
** Each variable is in a column.
** Each observation is a row.
** Each value is a cell.
- Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …
- Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
- Two functions: gather() to convert from wide format to long format and spread() to convert from long format to wide format
# Convert to long format, so that every observation is one row,
# with either the text 'pre' or 'post' in the column 'exam'
# and the value that was in pre or post now in the column 'score'
tidydf <- gather(mydata, exam, score, pre:post)
# From tidydf create the same thing back that we had in mydata (wide format)
spread(tidydf, exam, score)
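tidyr 1.0.0 and later also provide pivot_longer() and pivot_wider() for the same reshaping, and the tidyr documentation marks gather() and spread() as superseded. A sketch of the same round trip, assuming a small made-up tibble `wide` with fixed values in place of mydata:

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for mydata, with fixed values instead of rnorm()
wide <- tibble(mynames = c("bla", "jkl"), pre = c(0.1, 0.2), post = c(0.3, 0.4))

# Wide -> long: one row per person and exam
long <- pivot_longer(wide, cols = pre:post, names_to = "exam", values_to = "score")

# Long -> wide: back to one row per person
wide_again <- pivot_wider(long, names_from = exam, values_from = score)
```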
- Easily split columns with separate() and merge with unite()
court # tibble with lots of comma-separated text in one column ‘text’
# Split it into 14 columns with the names A-N,
# convert = TRUE -> try to guess the datatypes, otherwise everything would be characters
court <- separate(court, text, into = LETTERS[1:14], convert = TRUE)
# Put columns B, C and D into one column 'condition'
court <- unite(court, condition, B, C, D)
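Because the `court` tibble is not defined in these notes, here is a runnable sketch of the same two steps on a made-up tibble `toy` with four comma-separated fields instead of fourteen:

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for 'court': one comma-separated text column
toy <- tibble(text = c("1,a,x,10", "2,b,y,20"))

# Split into four columns; convert = TRUE turns A and D into numbers
toy <- separate(toy, text, into = c("A", "B", "C", "D"), sep = ",", convert = TRUE)

# Merge B and C into one column; unite() joins the values with "_" by default
toy <- unite(toy, condition, B, C)
```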
dplyr
- Filter rows with filter()
filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
filter(mydata, mynames %in% c('jkl', 'bla'))
filter(mydata, post > pre)
- Select columns with select()
select(mydata, pre) # select a column
select(mydata, -pre) # select everything besides this column
select(mydata, age:pre) # select all columns between age and pre
select(mydata, -(pre:post)) # select all columns besides those between pre and post
select(mydata, pre:post, age, mynames) # select and reorder
- Sort a tibble by a column with arrange()
arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
- Rename one or more columns with rename()
rename(mydata, newname=pre, othernew=post)
- Add new columns with mutate() and transmute()
mutate(mydata,
diff = pre-post,
diff = diff*2,
diff_c = diff-mean(diff, na.rm=T))
mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
# transmute does the same, but returns only newly defined columns
transmute(mydata, diff = pre-post, diff2 = diff*2)
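- Aggregate data with summarize() (also spelled summarise())
# mydata has no 'gender' or 'score' column, so start from the long-format tidydf and add gender first
tidydf %>%
  mutate(gender = ifelse(mynames == 'jkl', 'F', 'M')) %>%
  group_by(gender) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE), Mean = mean(score, na.rm = TRUE), SD = sd(score, na.rm = TRUE))
# na.rm = TRUE -> remove NA values before computing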
- Merge tibbles with left_join() (there are also other joins)
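A minimal join sketch. The two tables `people` and `cities` and their contents are assumptions invented for this example; inner_join() is shown as one of the other joins:

```r
library(tibble)
library(dplyr)

# Hypothetical tables, only for illustration
people <- tibble(mynames = c("bla", "jkl", "xyz"), age = c(NA, 30, 20))
cities <- tibble(mynames = c("jkl", "xyz", "qrs"), city = c("Berlin", "Bonn", "Kiel"))

# Keep every row of 'people'; add 'city' where mynames matches, NA otherwise
joined <- left_join(people, cities, by = "mynames")

# inner_join() keeps only the names present in both tables
both <- inner_join(people, cities, by = "mynames")
```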
Other packages
magrittr
- Pipes: %>%
- Send the same dataframe as input to a pipeline of actions.
- Example:
mydf %>%
filter(!is.na(F0)) %>%
mutate(LogFreq = log(Freq)) %>%
group_by(Condition) %>%
summarise(mean = mean(LogFreq))
- Does the same as:
mydf.filtered <- filter(mydf, !is.na(F0))
mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
mydf.grouped <- group_by(mydf.log, Condition)
summarise(mydf.grouped, mean = mean(LogFreq))
ggplot2
- “An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
- “A geom is the geometrical object that a plot uses to represent data.”
- General form:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
- Examples:
ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aesthetics/graphic properties
mapping = aes( # aes -> set of aesthetic mappings,
x = pred, y = resp # map x/y-values of plot to dataframe columns with these names
)) + geom_point() # add shape to the plot
ggplot(mydf, mapping = aes( x = pred)) +
geom_histogram(binwidth = .5,
fill = rgb(0.2,0.4,0.8,0.6), # rgb values in [0..1], last part is alpha
color = 'black') # use colors() to get a list of all colors
ggplot(mydf, mapping = aes( x = pred)) +
geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black')
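The examples above use a geom and a mapping only. As a sketch of the coordinate and facet slots from the general form, the following uses the mtcars dataset that ships with R as a stand-in for mydf; the column choices are arbitrary:

```r
library(ggplot2)

# mtcars ships with R; it stands in for mydf here
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +  # <GEOM_FUNCTION> with <MAPPINGS>
  coord_cartesian(xlim = c(1, 6)) +             # <COORDINATE_FUNCTION>: zoom the x axis
  facet_wrap(~ cyl)                             # <FACET_FUNCTION>: one panel per cylinder count

# Printing the object draws the plot
p
```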
stringr
- Basic String manipulation
s1 <- "THis is a String 123 that has numbers 456 "
str_to_lower(s1)
str_to_upper(s1)
str_length(s1)
- String concatenation and splitting
str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
xsplit <- str_split(s2, ' ') # Returns a list of character vectors
unlist(xsplit) # Flattens the list into a vector of characters
str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
- Substrings
str_sub(s2, 1, 1) # get the first letter of every entry
- Regular expressions on a string or a vector of strings
str_view(s1, "(S|s)tr") # Search and show the result
str_detect(s1, "[0-9]") # Check presence
str_extract(s1, "[0-9]+") # Extract the (first) match
str_replace(s1, "[0-9]+", ":)") # replace first occurrence
str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all