Tidyverse R package

library(tidyverse)

Tidyverse is a collection of R packages for the comfortable cleanup of data. Loading the package tidyverse will laod the core tidyverse packages: tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.

Intro

  • Tidyverse is optimized for interactive workflow with data
  • Each function does one thing easy and well
  • Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
  • Everything works with tibbles
  • Web page: http://tidyverse.org/
  • Workshop page (withe example scripts): http://bodowinter.com/carpentry/index.html

Core packages

tibble

  • A modern version of dataframes
  • The first argument of every tidyverse function and what every tidyverse function returns
  • Tibbles use characters instead of factors for texts
  • Tibbles have nicer printout than normal dataframes: show data type of columns, number of rows, only the first few rows/columns not all of the data
 mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
 age <- c(NA, 30, 20, 25, 18)
 pre <- round(rnorm(length(mynames)), 2)
 post <- round(rnorm(length(mynames)), 2)
 mydata <- tibble(mynames, age, pre, post)  # create tibble from data
 mydf <- data.frame(mynames, age, pre, post)
 as_tibble(mydf)  # convert data frame into tibble

readr

  • Does the same as read.csv from base R, it reads a csv file
  • Faster
  • Automatically creates tibbles
  • Progress bar for big files
 read_csv('somefile.csv')

tidyr

  • A data frame is a rectangular array of variables (columns) and observations (rows)
  • A tidy data frame is a data frame where…
    ** Each variable is in a column.
    ** Each observation is a row.
    ** Each value is a cell.

  • Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …

  • Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
  • Two functions: gather() to convert from wide format to long format and spread() to convert from wide format to long format
 # Convert to long format, so that every observation is one row,
 # with either the text 'pre' or 'post' in the column 'exam'
 # and the value that was in pre or post now in the column 'score'
 tidydf <- gather(mydata, exam, score, pre:post)

 # From tidydf create the same thing back that we had in mydata (wide format)
 spread(tidydf, exam, score)
  • Easily split columns with separate() and merge with unite()
    court # tibble with lots of comma-separated text in one column ‘text’
 # Split it into 14 columns with the names A-N, 
 # Convert = True -> try to guess the datatypes, otherwise everything would be characters
 court <- separate(court, text, into = LETTERS[1:14], convert = T)

 # Put columns B, C and D into one column 'condition'
 court <- unite(court, condition, B, C, D)

dplyr

  • Filter rows with filter()
 filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
 filter(mydata, mynames %in% c('jkl', 'bla'))
 filter(mydata, post > pre)
  • Select columns with select()
 select(mydata, pre) # select a column
 select(mydata, -pre) # select everything besides this column
 select(mydata, age:pre) # select all columns between pre and post
 select(mydata, -(pre:post)) # select all columns besides those between pre and post
 select(mydata, pre:post, age, mynames) # select and reorder
  • Sort a tibble by a column with arrange()
 arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
  • Rename one or more columns with rename()
 rename(mydata, newname=pre, othernew=post)
  • Add new columns with mutate() and transmute()
 mutate(mydata, 
        diff = pre-post, 
        diff = diff*2, 
        diff_c = diff-mean(diff, na.rm=T))
 mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
 # transmute does the same, but returns only newly defined columns
 transmute(mydata,  diff = pre-post,  diff2 = diff*2) 
  • Aggregate data with summarize()
 mydata %>% group_by(gender) %>% 
        summarise(MeanAge = mean(age, na.rm=T), Mean = mean(score, na.rm=T), SD = sd(score, na.rm=T))
 # na.rm -> remove NA values
  • Merge tibbles with left_join() (there are also other joins)

Other packages

magrittr

  • Pipes: %>%
  • Send the same dataframe as input to a pipeline of actions.
  • Example:
 mydf %>%
        filter(!is.na(F0)) %>%
        mutate(LogFreq = log(Freq)) %>%
        group_by(Condition) %>%
        summarise(mean = mean(LogFreq))
  • Does the same as:
 mydf.filtered <- filter(mydf, !is.na(F0))
 mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
 mydf.grouped <- group_by(mydf.log, mydf.log)
 summarise(mydf.grouped, mean = mean(LogFreq))

ggplot2

  • “An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
  • “A geom is the geometrical object that a plot uses to represent data.”
  • General form:
 ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(
                mapping = aes(<MAPPINGS>),
                stat = <STAT>,
                position = <POSITION>
        ) +
        <COORDINATE_FUNCTION> +
        <FACET_FUNCTION>
  • Examples:
 ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aestetics/graphic properties
        mapping = aes( # aes -> set of aestetics mappings,
           x = pred, y = resp #  map x/y-values of plot to dataframe columns with these names
         )) + geom_point() # add shape to the plot
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_histogram(binwidth = .5, 
        fill = rgb(0.2,0.4,0.8,0.6),   # rgb values in [0..1], last part is alpha
        color = 'black')   # use colors() to get a list of all colors
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black') 

stringr

  • Basic String manipulation
 s1 <- "THis is a String 123 that has numbers 456 "
 str_to_lower(s1)
 str_to_upper(s1)
 str_length(s2) 
  • String concatenation and splitting
 str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
 s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
 xsplit <- str_split(s2, ' ') # Returns a list of character vectors
 unlist(xsplit) # Flattens the list into a vector of characters
 str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
  • Substrings
 str_sub(s2, 1, 1) # get the first letter of every entry
  • Regular expressions on a (list of) Strings
 str_view(s1, "(S|s)tr") # Search and show the result
 str_detect(s1, "[0-9]") # Check presence
 str_extract(s1, "[0-9]+") # Extract the (first) match
 str_replace(s1, "[0-9]+", ":)") # replace first occurrence
 str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all

rJava troubles

I am running code that needs the R package rJava. When I call that code, R just crashes without any indication of what is going wrong. This is a segmentation fault, that for some reason never makes it to the surface. You can solve this by setting the following (see also stackoverflow):

export _JAVA_OPTIONS="-Xss2560k -Xmx2g"

The result in my case is that R starts, but throws the error

Error: 'Error: C stack usage  141780829840 is too close to the limit'

I haven’t found a solution for that problem yet, tell me if you have any hints 🙁

Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a matrix distribution with the different categories in column one (called "x") and the number of times this label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this matrix by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original matrix as row numbers.

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Kategorie", ylab="Häufigkeit", # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My categorie names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Häufigkeit", # Axis title
         )
axis(side=1, at=mp, pos=-0.3, 
     tick=TRUE, tcl=0, 
     labels=distribution_sorted[,1], las=2, 
     )
mtext("Kategorie", side=1, line=8.5) # x-axis label

There has to be an easier way !?