Tidyverse R package

library(tidyverse)

Tidyverse is a collection of R packages for convenient data cleanup. Loading the tidyverse package will load the core tidyverse packages, including tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.

Intro

  • Tidyverse is optimized for interactive workflow with data
  • Each function does one thing simply and well
  • Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
  • Everything works with tibbles
  • Web page: http://tidyverse.org/
  • Workshop page (with example scripts): http://bodowinter.com/carpentry/index.html

Core packages

tibble

  • A modern version of dataframes
  • The first argument of every tidyverse function and what every tidyverse function returns
  • Tibbles use characters instead of factors for texts
  • Tibbles have a nicer printout than normal dataframes: they show the data type of each column, the number of rows, and only the first few rows/columns instead of all of the data
 mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
 age <- c(NA, 30, 20, 25, 18)
 pre <- round(rnorm(length(mynames)), 2)
 post <- round(rnorm(length(mynames)), 2)
 mydata <- tibble(mynames, age, pre, post)  # create tibble from data
 mydf <- data.frame(mynames, age, pre, post)
 as_tibble(mydf)  # convert data frame into tibble

readr

  • Does the same as read.csv from base R: it reads a CSV file
  • Faster
  • Automatically creates tibbles
  • Progress bar for big files
 read_csv('somefile.csv')
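
A minimal sketch of reading a file with explicitly declared column types instead of letting readr guess them. The file contents and column names are made up for illustration, and the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Hypothetical example file, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("mynames,age,pre",
             "bla,NA,0.5",
             "jkl,30,-1.2"), tmp)

# Declare the column types explicitly; "NA" in the file becomes a missing value
dat <- read_csv(tmp, col_types = cols(
  mynames = col_character(),
  age     = col_integer(),
  pre     = col_double()
))

dat  # printed as a tibble
```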

tidyr

  • A data frame is a rectangular array of variables (columns) and observations (rows)
  • A tidy data frame is a data frame where…
    ** Each variable is in a column.
    ** Each observation is a row.
    ** Each value is a cell.

  • Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …

  • Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
  • Two functions: gather() to convert from wide format to long format and spread() to convert from long format to wide format
 # Convert to long format, so that every observation is one row,
 # with either the text 'pre' or 'post' in the column 'exam'
 # and the value that was in pre or post now in the column 'score'
 tidydf <- gather(mydata, exam, score, pre:post)

 # From tidydf create the same thing back that we had in mydata (wide format)
 spread(tidydf, exam, score)
  • Easily split columns with separate() and merge with unite()
 # court: a tibble with lots of comma-separated text in one column 'text'
 # Split it into 14 columns with the names A-N.
 # convert = TRUE -> try to guess the data types, otherwise everything would be characters
 court <- separate(court, text, into = LETTERS[1:14], convert = TRUE)

 # Put columns B, C and D into one column 'condition'
 court <- unite(court, condition, B, C, D)
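
In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The following is a small sketch of the same wide/long round trip with the newer functions. The tibble scores is a made-up stand-in with the same columns as mydata.

```r
library(tibble)
library(tidyr)

# Hypothetical example data with the same columns as mydata
scores <- tibble(
  mynames = c("bla", "jkl", "xyz"),
  age     = c(NA, 30, 20),
  pre     = c(0.5, -1.2, 0.3),
  post    = c(1.1, 0.4, -0.7)
)

# Wide -> long: one row per person and exam
long <- pivot_longer(scores, cols = pre:post,
                     names_to = "exam", values_to = "score")

# Long -> wide: back to one row per person
wide <- pivot_wider(long, names_from = exam, values_from = score)
```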

dplyr

  • Filter rows with filter()
 filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
 filter(mydata, mynames %in% c('jkl', 'bla'))
 filter(mydata, post > pre)
  • Select columns with select()
 select(mydata, pre) # select a column
 select(mydata, -pre) # select everything besides this column
 select(mydata, age:pre) # select all columns from age to pre
 select(mydata, -(pre:post)) # select all columns besides those between pre and post
 select(mydata, pre:post, age, mynames) # select and reorder
  • Sort a tibble by a column with arrange()
 arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
  • Rename one or more columns with rename()
 rename(mydata, newname=pre, othernew=post)
  • Add new columns with mutate() and transmute()
 mutate(mydata, 
        diff = pre-post, 
        diff = diff*2, 
        diff_c = diff-mean(diff, na.rm=T))
 mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
 # transmute does the same, but returns only newly defined columns
 transmute(mydata,  diff = pre-post,  diff2 = diff*2) 
  • Aggregate data with summarize()
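 # mydata has no 'gender' or 'score' column, so start from the long-format tidydf and add gender first
 tidydf %>% mutate(gender = ifelse(mynames == 'jkl', 'F', 'M')) %>%
        group_by(gender) %>% 
        summarise(MeanAge = mean(age, na.rm=T), Mean = mean(score, na.rm=T), SD = sd(score, na.rm=T))
 # na.rm -> remove NA values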
  • Merge tibbles with left_join() (there are also other joins)
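
A short sketch of how a join might look. The two tibbles and the key column are invented for illustration. left_join() keeps every row of the first tibble and fills missing matches with NA, while inner_join() keeps only rows that have a match in both.

```r
library(tibble)
library(dplyr)

# Hypothetical example tibbles sharing the key column 'mynames'
people <- tibble(mynames = c("bla", "jkl", "xyz"),
                 age     = c(NA, 30, 20))
cities <- tibble(mynames = c("jkl", "xyz", "qrs"),
                 city    = c("Berlin", "Bonn", "Kiel"))

# All rows of people; city is NA where cities has no matching name
joined <- left_join(people, cities, by = "mynames")

# Only the names that appear in both tibbles
only_matches <- inner_join(people, cities, by = "mynames")
```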

Other packages

magrittr

  • Pipes: %>%
  • Send a dataframe through a pipeline of actions: the output of each step becomes the first argument of the next.
  • Example:
 mydf %>%
        filter(!is.na(F0)) %>%
        mutate(LogFreq = log(Freq)) %>%
        group_by(Condition) %>%
        summarise(mean = mean(LogFreq))
  • Does the same as:
 mydf.filtered <- filter(mydf, !is.na(F0))
 mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
 mydf.grouped <- group_by(mydf.log, Condition)
 summarise(mydf.grouped, mean = mean(LogFreq))

ggplot2

  • “An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
  • “A geom is the geometrical object that a plot uses to represent data.”
  • General form:
 ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(
                mapping = aes(<MAPPINGS>),
                stat = <STAT>,
                position = <POSITION>
        ) +
        <COORDINATE_FUNCTION> +
        <FACET_FUNCTION>
  • Examples:
 ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aesthetics/graphic properties
        mapping = aes( # aes -> set of aesthetic mappings,
           x = pred, y = resp #  map x/y-values of plot to dataframe columns with these names
         )) + geom_point() # add shape to the plot
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_histogram(binwidth = .5, 
        fill = rgb(0.2,0.4,0.8,0.6),   # rgb values in [0..1], last part is alpha
        color = 'black')   # use colors() to get a list of all colors
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black') 

stringr

  • Basic String manipulation
 s1 <- "THis is a String 123 that has numbers 456 "
 str_to_lower(s1)
 str_to_upper(s1)
 str_length(s1) 
  • String concatenation and splitting
 str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
 s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
 xsplit <- str_split(s2, ' ') # Returns a list of character vectors
 unlist(xsplit) # Flattens the list into a vector of characters
 str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
  • Substrings
 str_sub(s2, 1, 1) # get the first letter of every entry
  • Regular expressions on a string (or a vector of strings)
 str_view(s1, "(S|s)tr") # Search and show the result
 str_detect(s1, "[0-9]") # Check presence
 str_extract(s1, "[0-9]+") # Extract the (first) match
 str_replace(s1, "[0-9]+", ":)") # replace first occurrence
 str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all

Collecting membership fees with Hibiscus and JVerein

Collecting the fees has two parts: first we have to define for whom we want to collect the fee, then we have to have the bank collect the direct debits.

Determining the members and their fees is done in JVerein under Abrechnung in the Abrechnung folder. Enter the following there:

  • Modus (mode): Alle (all)
  • Fälligkeit (due date): enter the date on which the fee should be collected
  • Stichtag (cutoff date): 31.12.[year]
  • Zahlungsgrund (payment reference): Mitgliedsbeitrag [year]
  • SEPA-Datei (SEPA file): Nein (no)
  • Abbuchungsausgabe (debit output): Hibiscus

In Hibiscus, under SEPA Zahlungsverkehr, there is then the item SEPA Lastschriften. It should now list one direct debit with the correct membership fee for each member. To actually execute one, right-click it and choose Jetzt ausführen. Hibiscus should then ask for a TAN, and then the money should be debited.

Unfortunately, our version of Hibiscus can only execute direct debits one at a time, i.e., you have to click and enter the TAN again for every member. Alternatively, you can generate a SEPA file and upload it in the bank's online banking.

Cropmarks (Beschnittmarken)

If you print a professional book where graphics extend to the very edge of the page, you need to give the printer a file with cropmarks (Beschnittmarken). Producing them in LaTeX is actually very simple with the package crop:

\usepackage[cam,width=154truemm,height=216truemm,center]{crop}

The book in question was A5 paper, which has a size of 148mm x 210mm. I wanted to have 3 more millimeters on each side. This makes the final paper size 154mm x 216mm, which is what I have given above. As I want the markings equally on each side, I use the option center to center the content on the larger page. The option cam prints standard cropmarks (this is also the default).

Now the only thing left to do is adjust the graphics. For every page where a graphic goes to the edge of the page, extend it a little bit beyond the page edge. There are several ways to do that, depending on what exactly you are doing. This is an example for a colored background image that fills the whole page. The important parts are \dimexpr\paperwidth+6mm, which adds 6mm to the width of the picture, and \dimexpr\paperheight+6mm, which does the same for the height:

\begin{tikzpicture}[remember picture, overlay]
\node[inner sep=0pt] at (current page.center) { 
  \includegraphics[width=\dimexpr\paperwidth+6mm\relax,
      height=\dimexpr\paperheight+6mm\relax]
    {img/background1}
};
\end{tikzpicture}

Find file names with invalid encoding on Linux

I have files copied from Windows computers in ancient times. The filenames contain special characters and they have been messed up somewhere along the way. For example I got a file named 9.5.2 Modelo de aceptaci??n (espa??ol).doc in the folder 9 Garant??a del Estado.

First, I want to find and list these files. Stack Exchange tells us how to do that:

LC_ALL=C find . -name '*[! -~]*'

This will find all names that contain non-ASCII characters, not only those that are broken. But in my case I have folders where ALL of the names are broken, so I don't mind.

Second, I want to fix the names. I did it manually, but for future reference, if I ever were to do anything like that again, I might use one of the solutions proposed in this thread on serverfault.com.
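
For completeness, here is a rough sketch in R (to stay with the language used elsewhere on this page) of how one might narrow the list down to names that are actually broken, i.e. not valid UTF-8, and preview a repaired name. This is not the solution from the thread mentioned above. The helper find_invalid_utf8 is made up, the example name is constructed by hand, and the assumption that the original encoding was Latin-1 is just a guess that would need checking against the real files.

```r
# Hypothetical helper: return the entries of `paths` that are not valid UTF-8
find_invalid_utf8 <- function(paths) {
  paths[!validUTF8(paths)]
}

# Constructed example: "español" stored as Latin-1 bytes (0xF1 is ñ in Latin-1)
broken <- rawToChar(as.raw(c(0x65, 0x73, 0x70, 0x61, 0xF1, 0x6F, 0x6C)))
names_in_dir <- c("readme.txt", broken)

bad <- find_invalid_utf8(names_in_dir)

# Assumption: the names were originally Latin-1; convert them to UTF-8
fixed <- iconv(bad, from = "latin1", to = "UTF-8")

# On a real directory tree the input could come from something like
# list.files(".", recursive = TRUE, include.dirs = TRUE),
# and file.rename() could apply the new names after checking them by eye.
```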