Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector with one category for each data point. For this example we will use a random sample of letters. The important thing is that we want to plot the frequencies of text labels, not of numbers (strictly speaking, that makes it a bar chart rather than a histogram), and the labels are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of row indices in the correct order, so I need to plug this into the original data frame as row numbers.
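To make the indexing step with order more concrete, here is a minimal sketch on a tiny hand-made data frame. The name toy and its values are made up for illustration; only the column names "x" and "freq" mirror what count returns.

```r
# Toy data frame with the same column names that count() produces
# (the values are invented for this illustration)
toy <- data.frame(x = c("aaa", "bbb", "ccc"), freq = c(2, 5, 3))

# order() returns row indices, here sorted by decreasing frequency
idx <- order(toy[, "freq"], decreasing = TRUE)
print(idx)

# Using the indices as row numbers reorders the whole data frame
toy_sorted <- toy[idx, ]
print(toy_sorted)
```

The row with the highest frequency ("bbb") ends up first, because order put its row number at the front of idx.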

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col="blue", # blue bars (just for fun)
         xlab="Category", ylab="Frequency" # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).
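As a rough sketch of those three settings in one call: the counts in freq are invented for illustration, and the cex values are arbitrary choices, not recommendations.

```r
# Made-up counts; barplot() takes the bar names from the vector names
freq <- c(aaaaaaaaaa = 9, bbbbbbbbbb = 6, cccccccccc = 4)

mp <- barplot(freq,
         las=2,           # turn labels by 90 degrees
         col="blue",
         xlab="Category", ylab="Frequency",
         cex.axis=1.2,    # numerical y-axis values
         cex.names=0.8,   # categorical x-axis names
         cex.lab=1.5      # axis labels
         )
```

Values above 1 enlarge the text relative to the default size, values below 1 shrink it.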

In my plot there is one problem. My category names are much longer than the values on the y-axis, so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues far better than I can, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well, so in the end I also took the x-axis names out of the plot and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0), intersecting the y-axis at pos=-0.3.

# mai = space around the plot in inches: bottom - left - top - right
# mgp = spacing (in margin lines) for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col="blue", # blue bars (just for fun)
         ylab="Frequency" # Axis title
         )
# mp holds the bar midpoints, so the names end up centred under the bars
axis(side=1, at=mp, pos=-0.3,
     tick=TRUE, tcl=0,
     labels=distribution_sorted[,1], las=2
     )
mtext("Category", side=1, line=8.5) # x-axis label

There has to be an easier way!?
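One possibly shorter route, at least for the counting and sorting part, uses only base R: table counts the categories and sort orders the counts. This is a sketch, not a tested replacement for the solution above. The margin sizes in par(mar=...) and the line used in mtext are guesses for 10-character labels and will need adjusting for other data, and set.seed is only there to make the sketch reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
labels <- sample(letters[1:20], 100, replace=TRUE)
labels <- strrep(labels, 10)  # repeat each letter 10 times

# table() counts each category, sort() orders the counts
freq_sorted <- sort(table(labels), decreasing=TRUE)

# mar = margins in lines: bottom - left - top - right (guessed values)
par(mar=c(9, 4, 1, 1))
mp <- barplot(freq_sorted,
         las=2,       # turn labels by 90 degrees
         col="blue",
         ylab="Frequency"
         )
mtext("Category", side=1, line=7.5) # x-axis label below the names
```

The large bottom margin leaves room for the rotated names, and mtext puts the axis title underneath them. It does not draw the extra x-axis line from the solution above.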

Using GIMP to draw a rectangle

GIMP is not your typical program for drawing, but it is the only graphics-related program installed on my Linux machine. So I have this screenshot and I want to draw a red rectangle around the part that needs to be clicked. This is how:

  1. Open your graphic file with GIMP.
  2. Use the "Rectangle Select Tool" and mark the place where you want your rectangle to be.
  3. Select the color you want to draw the rectangle in as foreground color (in my case that would be red).
  4. In the menu "Edit" choose "Stroke selection".
  5. In the dialogue that comes up, choose "Stroke line" with "solid colour" (it will take the current foreground color). You can adjust the width, and if you open up "Line style" you can do more things (e.g., rounded edges).
  6. Click "Stroke" and voilà!

Pie charts with LaTeX TikZ

Define a new command to insert a pie slice:

\newcommand{\slice}[4]{
  \pgfmathparse{0.5*#1+0.5*#2}
  \let\midangle\pgfmathresult

  % slice
  \draw[thick,fill=black!10] (0,0) -- (#1:1) arc (#1:#2:1) -- cycle;

  % outer label
  \node[label=\midangle:#4] at (\midangle:1) {};

  % inner label
  \pgfmathparse{min((#2-#1-10)/110*(-0.3),0)}
  \let\temp\pgfmathresult
  \pgfmathparse{max(\temp,-0.5) + 0.8}
  \let\innerpos\pgfmathresult
  \node at (\midangle:\innerpos) {#3};
}

Then define the slices in the order you want to have them and with the percentages and labels. You can start at a different point in the circle by setting the counter ‘d’ to a different value before the loop, e.g. \setcounter{d}{25}.

\begin{tikzpicture}[scale=3]
\newcounter{c}
\newcounter{d}
\foreach \p/\t in {66/, 17/Equative, 10/Difference, 7/}
  {
    \setcounter{c}{\value{d}}
    \addtocounter{d}{\p}
    \slice{\thec/100*360}
          {\thed/100*360}
          { \small \p\%}{\t}
  }
  \node[label=0.5:Ranked] at (1,0.6) {};
  \node[label=0.5:Superlative] at (1,-0.3) {};
\end{tikzpicture}
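To check which angles the loop above hands to \slice, the counter arithmetic can be replayed outside LaTeX. The following R sketch mirrors the two counters and the formulas inside \slice. The variable names are mine, and start_pct stands for the value of counter d before the loop (0 here, 25 in the \setcounter example).

```r
p <- c(66, 17, 10, 7)   # percentages from the \foreach list
start_pct <- 0          # counter d before the loop

d_val <- start_pct + cumsum(p)  # counter d after each iteration
c_val <- d_val - p              # counter c: d before the slice was added

start_angle <- c_val / 100 * 360  # first argument of \slice
end_angle   <- d_val / 100 * 360  # second argument of \slice
mid_angle   <- 0.5 * start_angle + 0.5 * end_angle

# Radius of the inner label, following the \pgfmathparse lines in \slice
temp <- pmin((end_angle - start_angle - 10) / 110 * (-0.3), 0)
inner_pos <- pmax(temp, -0.5) + 0.8

print(data.frame(p, start_angle, end_angle, mid_angle, inner_pos))
```

The inner label radius is clamped between 0.3 and 0.8: the 66% slice hits the lower bound of 0.3, while narrower slices push the percentage further out towards the edge.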

I didn’t like the automatic placement of two of the labels. That is why I gave ‘Ranked’ and ‘Superlative’ an empty label in the loop and placed them by hand afterwards.

The original is from Texample, uploaded by Robert Vollmert.