Tidyverse R package

Posted on December 23, 2018 by swk

library(tidyverse)

Tidyverse is a collection of R packages for conveniently cleaning up data. Loading the package tidyverse will load the core tidyverse packages, among them tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.

Intro

Tidyverse is optimized for interactive workflow with data
Each function does one thing simply and well
Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
Everything works with tibbles
Web page: http://tidyverse.org/
Workshop page (with example scripts): http://bodowinter.com/carpentry/index.html

Core packages

tibble

A modern version of dataframes
The first argument of most tidyverse functions and what most of them return
Tibbles use characters instead of factors for texts
Tibbles have a nicer printout than normal dataframes: they show the data type of each column, the number of rows, and only the first few rows/columns instead of all of the data

 mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
 age <- c(NA, 30, 20, 25, 18)
 pre <- round(rnorm(length(mynames)), 2)
 post <- round(rnorm(length(mynames)), 2)
 mydata <- tibble(mynames, age, pre, post)  # create tibble from data
 mydf <- data.frame(mynames, age, pre, post)
 as_tibble(mydf)  # convert data frame into tibble

readr

Does the same as read.csv from base R: it reads a CSV file
Faster
Automatically creates tibbles
Progress bar for big files

 read_csv('somefile.csv')
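
A self-contained sketch of the same call, for trying it out without a real file. The temporary file, its contents, and the names path and exams are invented for illustration. The col_types argument is optional and only spells out what read_csv would otherwise guess.

```r
library(readr)

# Hypothetical example file; path and contents are made up for this sketch
path <- tempfile(fileext = ".csv")
writeLines(c("name,age,score",
             "bla,NA,0.5",
             "jkl,30,-1.2"), path)

# Without col_types, read_csv guesses the column types from the data
exams <- read_csv(path, col_types = cols(
  name  = col_character(),
  age   = col_integer(),
  score = col_double()
))

exams  # a tibble; the text "NA" in the file is read as a missing value
```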

tidyr

A data frame is a rectangular array of variables (columns) and observations (rows)
A tidy data frame is a data frame where…
** Each variable is in a column.
** Each observation is a row.
** Each value is a cell.
Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …
Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
Two functions: gather() to convert from wide format to long format and spread() to convert from long format back to wide format

 # Convert to long format, so that every observation is one row,
 # with either the text 'pre' or 'post' in the column 'exam'
 # and the value that was in pre or post now in the column 'score'
 tidydf <- gather(mydata, exam, score, pre:post)

 # From tidydf create the same thing back that we had in mydata (wide format)
 spread(tidydf, exam, score)
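
Since tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same round trip with the newer functions. It assumes tidyr 1.0.0 or later, and the tibble wide with its fixed values is a made-up stand-in for mydata.

```r
library(tibble)
library(tidyr)

# Small stand-in for mydata with fixed values (invented for this sketch)
wide <- tibble(mynames = c("bla", "jkl", "xyz"),
               pre     = c(0.1, -0.4, 1.2),
               post    = c(0.3,  0.2, 0.9))

# wide -> long: the columns pre and post become the rows of 'exam'/'score'
long <- pivot_longer(wide, cols = pre:post,
                     names_to = "exam", values_to = "score")

# long -> wide: the values of 'exam' become column names again
wide_again <- pivot_wider(long, names_from = exam, values_from = score)
```

Note that pivot_longer() orders the rows person by person, while gather() stacks all pre rows before all post rows. The content is the same.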

Easily split columns with separate() and merge with unite()
 court # tibble with lots of comma-separated text in one column 'text'

 # Split it into 14 columns with the names A-N
 # convert = TRUE -> try to guess the data types, otherwise everything would be characters
 court <- separate(court, text, into = LETTERS[1:14], convert = TRUE)

 # Put columns B, C and D into one column 'condition'
 court <- unite(court, condition, B, C, D)
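
Because court is not defined in this post, here is a small runnable sketch of the same two steps. The tibble toy and its contents are hypothetical. One thing to watch: by default separate() splits on any run of non-alphanumeric characters, so a value like 2.5 would be split at the dot. The sketch therefore passes sep = "," explicitly.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for 'court': one column of comma-separated text
toy <- tibble(text = c("a,1,x,2.5", "b,2,y,3.5"))

# Split on commas only; convert = TRUE turns B into integer and D into double
toy <- separate(toy, text, into = c("A", "B", "C", "D"),
                sep = ",", convert = TRUE)

# Merge A and C into one column; the parts are joined with "_" by default
toy <- unite(toy, condition, A, C)
```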

dplyr

Filter rows with filter()

 filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
 filter(mydata, mynames %in% c('jkl', 'bla'))
 filter(mydata, post > pre)

Select columns with select()

 select(mydata, pre) # select a column
 select(mydata, -pre) # select everything besides this column
 select(mydata, age:pre) # select all columns from age to pre
 select(mydata, -(pre:post)) # select all columns besides those between pre and post
 select(mydata, pre:post, age, mynames) # select and reorder

Sort a tibble by a column with arrange()

 arrange(mydata, desc(age), pre) # sort by age (descending), then by pre

Rename one or more columns with rename()

 rename(mydata, newname=pre, othernew=post)

Add new columns with mutate() and transmute()

 mutate(mydata, 
        diff = pre-post, 
        diff = diff*2, 
        diff_c = diff-mean(diff, na.rm=T))
 mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
 # transmute does the same, but returns only newly defined columns
 transmute(mydata,  diff = pre-post,  diff2 = diff*2)

Aggregate data with summarize()

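 # mydata lacks 'gender' and 'score': use the long-format tidydf and add gender first
 tidydf %>% 
        mutate(gender = ifelse(mynames == 'jkl', 'F', 'M')) %>% 
        group_by(gender) %>% 
        summarise(MeanAge = mean(age, na.rm=TRUE), Mean = mean(score, na.rm=TRUE), SD = sd(score, na.rm=TRUE))
 # na.rm -> remove NA values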

Merge tibbles with left_join() (there are also other joins)
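
A minimal sketch of a left join. The two tibbles people and cities and their contents are invented for illustration.

```r
library(tibble)
library(dplyr)

# Two hypothetical tibbles sharing the key column 'mynames'
people <- tibble(mynames = c("bla", "jkl", "xyz"),
                 age     = c(NA, 30, 20))
cities <- tibble(mynames = c("jkl", "xyz", "qrs"),
                 city    = c("Berlin", "Hamburg", "Bonn"))

# Keep every row of 'people'; rows without a match in 'cities' get NA in 'city'
joined <- left_join(people, cities, by = "mynames")
```

Here 'bla' has no entry in cities and ends up with a missing city, while 'qrs' appears only in cities and is dropped. inner_join(), right_join() and full_join() differ in which unmatched rows they keep.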

Other packages

magrittr

Pipes: %>%
Send a dataframe through a pipeline of actions: the result of each step becomes the first argument of the next.
Example:

 mydf %>%
        filter(!is.na(F0)) %>%
        mutate(LogFreq = log(Freq)) %>%
        group_by(Condition) %>%
        summarise(mean = mean(LogFreq))

Does the same as:

 mydf.filtered <- filter(mydf, !is.na(F0))
 mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
 mydf.grouped <- group_by(mydf.log, Condition)
 summarise(mydf.grouped, mean = mean(LogFreq))

ggplot2

“An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
“A geom is the geometrical object that a plot uses to represent data.”
General form:

 ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(
                mapping = aes(<MAPPINGS>),
                stat = <STAT>,
                position = <POSITION>
        ) +
        <COORDINATE_FUNCTION> +
        <FACET_FUNCTION>

Examples:

 ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aesthetics/graphic properties
        mapping = aes( # aes -> set of aesthetic mappings,
           x = pred, y = resp #  map x/y-values of plot to dataframe columns with these names
         )) + geom_point() # add shape to the plot
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_histogram(binwidth = .5, 
        fill = rgb(0.2,0.4,0.8,0.6),   # rgb values in [0..1], last part is alpha
        color = 'black')   # use colors() to get a list of all colors
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black')
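
The examples above assume a dataframe mydf with columns pred and resp. As a runnable sketch, the following builds such data from random numbers (the data, the name toy, and the axis titles are invented) and stacks two geoms on the same aesthetic mapping.

```r
library(ggplot2)

# Invented data; 'pred' and 'resp' mirror the column names used above
set.seed(1)
toy <- data.frame(pred = rnorm(100))
toy$resp <- 2 * toy$pred + rnorm(100)

p <- ggplot(toy, mapping = aes(x = pred, y = resp)) +
        geom_point() +                                           # first layer: points
        geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # second layer: fitted line
        labs(x = "Predictor", y = "Response")                    # axis titles

print(p)  # draws the plot on the active graphics device
```

Both geoms inherit the mapping given in ggplot(), which is why x and y only need to be stated once.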

stringr

Basic String manipulation

 s1 <- "THis is a String 123 that has numbers 456 "
 str_to_lower(s1)
 str_to_upper(s1)
 str_length(s1)

String concatenation and splitting

 str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
 s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
 xsplit <- str_split(s2, ' ') # Returns a list of character vectors
 unlist(xsplit) # Flattens the list into a vector of characters
 str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list

Substrings

 str_sub(s2, 1, 1) # get the first letter of every entry

Regular expressions on a (vector of) strings

 str_view(s1, "(S|s)tr") # Search and show the result
 str_detect(s1, "[0-9]") # Check presence
 str_extract(s1, "[0-9]+") # Extract the (first) match
 str_replace(s1, "[0-9]+", ":)") # replace first occurrence
 str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all

rJava troubles

Posted on October 28, 2017 by swk

I am running code that needs the R package rJava. When I call that code, R just crashes without any indication of what is going wrong. This is a segmentation fault that for some reason never makes it to the surface. You can solve this by setting the following (see also Stack Overflow):

export _JAVA_OPTIONS="-Xss2560k -Xmx2g"

The result in my case is that R starts, but throws the error

Error: 'Error: C stack usage  141780829840 is too close to the limit'

I haven’t found a solution for that problem yet. Tell me if you have any hints 🙁

Histograms of category frequencies in R

Posted on December 22, 2016 by swk

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original data frame as row numbers.
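
For comparison, the same counting and sorting can be sketched in base R without plyr. This is an alternative, not what the post uses. The name freq_sorted is made up, set.seed is only there to make the sketch reproducible, and single letters are used instead of the ten-fold repeated ones to keep it short.

```r
set.seed(42)  # only so that the sketch is reproducible
labels <- sample(letters[1:20], 100, replace = TRUE)

# table() counts each distinct value; sort() orders the counts
freq_sorted <- sort(table(labels), decreasing = TRUE)

head(freq_sorted, 3)  # the three most frequent categories
```

The result is a named table, so barplot(freq_sorted, las = 2) would pick up the category names without a separate names.arg.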

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Category", ylab="Frequency" # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Frequency" # Axis title
         )
axis(side=1, at=mp, pos=-0.3, 
     tick=TRUE, tcl=0, 
     labels=distribution_sorted[,1], las=2
     )
mtext("Category", side=1, line=8.5) # x-axis label

There has to be an easier way !?
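
One candidate for an easier way, sketched with ggplot2 rather than base graphics. This is an assumption about what might work better, not something the post tested. ggplot2 sizes the margin for long rotated labels on its own, so the manual par/axis/mtext adjustments are not needed. The names counts, df and p are made up for the sketch.

```r
library(ggplot2)

set.seed(42)  # only so that the sketch is reproducible
labels <- sample(letters[1:20], 100, replace = TRUE)
labels <- strrep(labels, 10)  # repeat each letter 10 times

# Order the factor levels by frequency so the bars come out sorted
counts <- sort(table(labels), decreasing = TRUE)
df <- data.frame(label = factor(labels, levels = names(counts)))

p <- ggplot(df, aes(x = label)) +
        geom_bar(fill = "blue") +  # geom_bar counts the rows per category itself
        labs(x = "Category", y = "Frequency") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

print(p)  # draws the plot on the active graphics device
```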
