Histograms of category frequencies in R

I am learning R, so this is my first attempt at creating histograms with it. My data is a vector with one category label per data point. For this example we will use a random sample of letters. The important thing is that we want a histogram of the frequencies of text labels, not of numbers, and that the labels are longer than a single letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function returns a vector of row indices in the desired order, so I need to use it to index the rows of the original data frame.

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Category", ylab="Frequency" # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis, and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than I can, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well, so in the end I also took the x-axis names out of the plot and put them on a separate axis at the bottom (side=1) with zero-length ticks (tcl=0), intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Frequency" # Axis title
         )
axis(side=1, at=mp, pos=-0.3, 
     tick=TRUE, tcl=0, 
     labels=distribution_sorted[,1], las=2
     )
mtext("Category", side=1, line=8.5) # x-axis label

There has to be an easier way!?

Marking significance in a bar plot

Still on the topic of LaTeX presentations: this time I am trying to plot a symbol over a bar to indicate significance.

This is how it works:

\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.47) {$\bullet$}; 

You need to put this code directly after the \addplot command of the data series that the symbol belongs to. Example:

\begin{tikzpicture}
\begin{axis}[xtick=data,axis x line*=bottom,axis y line=left,symbolic x coords={Xcoord1, Xcoord2}]

\addplot [ybar,seagreen] coordinates {(Xcoord1, -0.027) (Xcoord2, 0.436)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$}; 
\addlegendentry{System 1}

\addplot+ [ybar,blue] coordinates  {(Xcoord1, 0.331) (Xcoord2, 0.095)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.36) {$\bullet$};
\addlegendentry{System 2}

\addplot+ [ybar,orange] coordinates {(Xcoord1, 0.222) (Xcoord2, 0.441)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.25) {$\bullet$};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$};
\addlegendentry{System 3}
\end{axis}
\end{tikzpicture}
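
If you mark many bars, the long \node line can be wrapped in a small macro. The following is only a sketch under a few assumptions of mine: the name \sigmark is made up (it is not part of pgfplots), the compat level, colors and y limits are arbitrary, and ybar is set on the axis as well as on each plot. The call still has to come directly after the \addplot it belongs to.

```latex
\documentclass{standalone}
\usepackage{pgfplots} % also loads TikZ
\pgfplotsset{compat=1.9} % assumption: adjust to your pgfplots version

% Hypothetical helper (name is my own):
% #1 = symbolic x coordinate, #2 = y position of the symbol
\newcommand{\sigmark}[2]{%
  \node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south]
    at (axis cs:#1,#2) {$\bullet$};%
}

\begin{document}
\begin{tikzpicture}
\begin{axis}[ybar,xtick=data,axis x line*=bottom,axis y line=left,
             symbolic x coords={Xcoord1, Xcoord2},
             enlarge x limits=0.5,ymin=0,ymax=0.55]
\addplot+ [ybar,blue] coordinates {(Xcoord1, 0.331) (Xcoord2, 0.095)};
\sigmark{Xcoord1}{0.36} % meant for the blue bar at Xcoord1
\addplot+ [ybar,orange] coordinates {(Xcoord1, 0.222) (Xcoord2, 0.441)};
\sigmark{Xcoord1}{0.25} % meant for the orange bar at Xcoord1
\sigmark{Xcoord2}{0.47} % meant for the orange bar at Xcoord2
\end{axis}
\end{tikzpicture}
\end{document}
```

The macro expands to exactly the \node line from above, so it should behave the same way; it just saves typing.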

Overlays for bar charts (take 2)

A while back I posted about using overlays for bar charts to show one value at a time. For my latest presentation I had a similar but slightly different wish: show all values for one system at a time, one system after the other.

Easily done: I just adapted the code from my previous post so that all values of a data series appear at the same time:

\newcommand{\addplotoverlay}[3][]{
\alt<#3->{
\addplot+ [ybar,#1] coordinates {#2}; 
}{
\addplot+ [ybar,#1] coordinates {(Xcoord1,0)}; % + don't show zero values in plot
}
}

The placeholder coordinate is specific to my plot: Xcoord1 is one of the symbolic x coordinates I use. Other than that, the code is completely independent of the coordinates used and of how many there are, which makes it more flexible than my old stuff.

Usage (this will let seagreen bars at the given coordinates appear on slide 2):

\addplotoverlay[seagreen]{(Xcoord1, 0.331) (Xcoord2, 0.095)}{2}
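
To see the macro in context, here is a minimal beamer document as a sketch. The compat level, the colors, the fixed y limits and the frame title are my own assumptions for the illustration; the macro body and the coordinates are taken from above. The first series is shown from slide 1 on, so that xtick=data picks up both coordinates right away.

```latex
\documentclass{beamer}
\usepackage{pgfplots} % also loads TikZ
\pgfplotsset{compat=1.9} % assumption: adjust to your pgfplots version

% #1 = plot options (optional), #2 = coordinates, #3 = first slide
\newcommand{\addplotoverlay}[3][]{%
  \alt<#3->{%
    \addplot+ [ybar,#1] coordinates {#2};%
  }{%
    \addplot+ [ybar,#1] coordinates {(Xcoord1,0)};% placeholder bar of height 0
  }%
}

\begin{document}
\begin{frame}{Results}
\begin{tikzpicture}
% Fixed y limits keep the axis from changing between slides
\begin{axis}[ybar,symbolic x coords={Xcoord1, Xcoord2},xtick=data,
             enlarge x limits=0.5,ymin=0,ymax=0.5]
\addplotoverlay[blue]{(Xcoord1, 0.331) (Xcoord2, 0.095)}{1}   % from slide 1
\addplotoverlay[orange]{(Xcoord1, 0.222) (Xcoord2, 0.441)}{2} % from slide 2
\end{axis}
\end{tikzpicture}
\end{frame}
\end{document}
```

Every series is always added to the axis, either with its real values or with the placeholder, so the number of plots is the same on every slide.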

Skip a style for a bar in a bar plot

Let’s say you add five data series to a bar plot and they would get the colors blue – red – brown – gray – purple. Now suppose you have another plot with only four data series, but you would like them to have the colors blue – red – gray – purple, because they are similar to the series 1, 2, 4 and 5 in the first plot. You also don’t want to change the order. What can you do?

The style (colors, markers, etc.) of a data series is determined by the cycle list in pgfplots. This is a list of style definitions that are applied to your data series one after the other. You can of course define one cycle list for each of the plots and assign the colors the way you want:

\pgfplotscreateplotcyclelist{my five bars}{%
solid,fill,blue, \\%
solid,fill,red, \\%
solid,fill,brown, \\%
solid,fill,gray, \\%
solid,fill,purple, \\%
}
\pgfplotscreateplotcyclelist{my four bars}{%
solid,fill,blue, \\%
solid,fill,red, \\%
solid,fill,gray, \\%
solid,fill,purple, \\%
}

\begin{tikzpicture}
\begin{axis}[cycle list name=my five bars,...]
... add the five data series ...
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[cycle list name=my four bars,...]
... add the four data series ...
\end{axis}
\end{tikzpicture}

But then you always need to remember to keep both lists in sync. Fortunately there is an easier way! You can shift the index into the cycle list:

\begin{tikzpicture}
\begin{axis}[cycle list name=my five bars,...]
... add first two data series ...
\pgfplotsset{cycle list shift=1} % Skips one style
... add the other two data series ...
\end{axis}
\end{tikzpicture}
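
For reference, here is the same idea as a complete document that can be compiled on its own. It is a sketch: the compat level, the symbolic coordinates A and B, and the data values are made up for the illustration; only the cycle list and the cycle list shift key come from the text above.

```latex
\documentclass{standalone}
\usepackage{pgfplots} % also loads TikZ
\pgfplotsset{compat=1.9} % assumption: adjust to your pgfplots version

\pgfplotscreateplotcyclelist{my five bars}{%
solid,fill,blue\\%
solid,fill,red\\%
solid,fill,brown\\%
solid,fill,gray\\%
solid,fill,purple\\%
}

\begin{document}
\begin{tikzpicture}
\begin{axis}[ybar,cycle list name=my five bars,
             symbolic x coords={A, B},xtick=data,
             enlarge x limits=0.5,ymin=0]
\addplot coordinates {(A, 1) (B, 2)}; % list entry 0: blue
\addplot coordinates {(A, 2) (B, 1)}; % list entry 1: red
\pgfplotsset{cycle list shift=1}      % skip one style (brown)
\addplot coordinates {(A, 3) (B, 2)}; % plot 2 + shift 1 = entry 3: gray
\addplot coordinates {(A, 1) (B, 3)}; % plot 3 + shift 1 = entry 4: purple
\end{axis}
\end{tikzpicture}
\end{document}
```

The shift is simply added to the index of every following plot, which is why a single \pgfplotsset line is enough for both remaining series.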

Done!

Overlays for bar charts

Yesterday I posted about creating bar charts with TikZ and pgfplots.

Today I want to present a command to make the bars of one data series (i.e., one of my systems) appear one after the other on a beamer LaTeX slide.

This is the code to put into your preamble:

\newcounter{MyNextSlide}
\newcounter{MyNextNextSlide}
\newcommand{\addplotoverlay}[5][]{
\setcounter{MyNextSlide}{#5}
\stepcounter{MyNextSlide}
\setcounter{MyNextNextSlide}{\theMyNextSlide}
\stepcounter{MyNextNextSlide}
\alt<#5->{\only<#5->{\alt<\theMyNextSlide->{\alt<\theMyNextNextSlide->{
\addplot+ [ybar,#1] coordinates {#2 #3 #4}; 
}{
\addplot+ [ybar,#1] coordinates {#2 #3}; 
}}{
\addplot+ [ybar,#1] coordinates {#2}; 
}}}{
\addplot+ [ybar,#1] coordinates {(PI,0)}; % + don't show zero values in plot
}
}

Usage (‘first slide’ is the slide on which value 1 first appears; it stays visible, the next slide adds value 2, and the slide after that adds value 3):

\addplotoverlay [color or other options] {value 1}{value 2}{value 3}{first slide}

This depends on there being exactly three data points in a data series, and I have hardcoded the x coordinate PI. You’ll probably need to adjust this before you are able to do something useful with this code.

Bar charts in LaTeX with TikZ

I have four systems to compare (baseline, minimal, window, syntax) on three different tasks (let’s call them PI, AI and AC), and I want a bar chart (similar to this example). We of course use TikZ and pgfplots, where the ybar option gives us a bar chart. At first the outer bars are cut off, so we need to add a little space on both sides with enlarge x limits. We can play around with the axes, the height and width of the plot, and the legend, but you can look at other examples for that; I’ll focus on two things here.

First, I would like to have the three tasks side by side, each with a nice name. In pgfplots we can use symbolic x coordinates for this: we just give them some names and can then use them like any other x coordinate, e.g., to put a data point at (PI, 50). With xticklabels we can give the coordinates labels that are nicer to read. Usually the ‘ticks’ (i.e., the markers on the x axis) end up at seemingly random positions; to get exactly one tick per x-axis label/task, use xtick=data.

symbolic x coords={PI, AI, AC},
xticklabels={Pred. ident., Arg. ident., Arg. class.},
xtick=data,

Second, I would like to have the numbers above the bars with one decimal place. We can get them with these two lines (the first one prints the numbers; as they come out too big, the second one reduces the font size):

nodes near coords={\pgfmathprintnumber[fixed zerofill,fixed,precision=1]{\pgfplotspointmeta}},
every node near coord/.append style={font=\tiny},

To get rid of zeros, we can replace the second line with

every node near coord/.append style={
      check for zero/.code={
        \pgfmathfloatifflags{\pgfplotspointmeta}{0}{
           \pgfkeys{/tikz/coordinate}
        }{}
      }, 
      check for zero, font=\tiny},

So this is my final axis style:

\pgfplotsset{resultsplot/.style={
axis x line*=bottom, 
axis y line=left, 
ybar,
symbolic x coords={PI, AI, AC},
xticklabels={Pred. ident., Arg. ident., Arg. class.},
xtick=data,
enlarge x limits=0.2,
nodes near coords={\pgfmathprintnumber[fixed zerofill,fixed,precision=1]{\pgfplotspointmeta}},
every node near coord/.append style={
      check for zero/.code={
        \pgfmathfloatifflags{\pgfplotspointmeta}{0}{
           \pgfkeys{/tikz/coordinate}
        }{}
      }, check for zero, font=\tiny},
area legend,
legend style={at={(0.5,-0.12)},
anchor=north,legend columns=-1},
}
}

And now we can get the actual graph that uses this axis style. Each plot represents a different system (the numbers are F1 scores):

 
\begin{tikzpicture}
\begin{axis}[resultsplot]
\addplot+ [ybar,green] coordinates {(PI, 67.8) (AI, 30.6) (AC, 20.2)};
\addlegendentry{Baseline}
\addplot+ [ybar,blue] coordinates {(PI, 78.6) (AI, 21.2) (AC, 16.5)};
\addlegendentry{Minimal system}
\addplot+ [ybar,orange] coordinates {(PI, 80.0) (AI, 44.2) (AC, 36.6)};
\addlegendentry{Window}
\addplot+ [ybar,red] coordinates {(PI, 80.1) (AI, 54.2) (AC, 44.8)};
\addlegendentry{Syntax}
\end{axis}
\end{tikzpicture}
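
To try the snippets from this post, they need a document around them. Here is a minimal sketch with just the options discussed at the beginning (symbolic coordinates, tick labels and numbers above the bars) and the baseline series; the document class and the compat level are my assumptions, so adjust them to your setup.

```latex
\documentclass{article}
\usepackage{pgfplots} % also loads TikZ
\pgfplotsset{compat=1.9} % assumption: adjust to your pgfplots version
% A style definition such as resultsplot would also go here in the preamble

\begin{document}
\begin{tikzpicture}
\begin{axis}[
  ybar,
  symbolic x coords={PI, AI, AC},
  xticklabels={Pred. ident., Arg. ident., Arg. class.},
  xtick=data,
  enlarge x limits=0.2,
  % print each value above its bar with one decimal place
  nodes near coords={\pgfmathprintnumber[fixed zerofill,fixed,precision=1]{\pgfplotspointmeta}},
  every node near coord/.append style={font=\tiny},
]
\addplot coordinates {(PI, 67.8) (AI, 30.6) (AC, 20.2)};
\end{axis}
\end{tikzpicture}
\end{document}
```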

Have fun!