# Histograms of category frequencies in R

I am learning R, and this is my first attempt at creating histograms with it. My data is a vector with one category per data point. For this example we will use a random sample of letters. The important thing is that we want a histogram of the frequencies of text labels, not of numbers (strictly speaking that makes it a bar chart, which is why the plotting function used here is `barplot`). The labels are also longer than a single letter. So let’s start with this:

```
labels <- sample(letters[1:20], 100, replace = TRUE)
# Repeat each letter 10 times to get longer category names
labels <- vapply(seq_along(labels),
                 function(x) paste(rep(labels[x], 10), collapse = ""),
                 character(1L))
library(plyr)  # for the function 'count'
distribution <- count(labels)
# Sort the rows from the most frequent to the least frequent category
distribution_sorted <-
  distribution[order(distribution[, "freq"], decreasing = TRUE), ]
```

I use the function `count` from the package `plyr` to get a data frame `distribution` with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function `order`. `order` does not sort anything itself: it returns a vector of row indices in the desired order, so I need to plug this vector back into the original data frame as the row selector.
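The same counting and sorting can probably be done without `plyr`. The following is only a minimal sketch in base R, not what I used above: the names `freq_sorted` and `distribution_base` are made up for illustration, `set.seed` is added so the result is reproducible, and `strrep` (available since R 3.3.0) is assumed as a shortcut for repeating each letter.

```r
# Assumption: a fixed seed, only to make this sketch reproducible
set.seed(42)

# Same kind of data as above: 100 labels, each a letter repeated 10 times
labels <- strrep(sample(letters[1:20], 100, replace = TRUE), 10)

# table() counts each category, sort() orders the counts
freq_sorted <- sort(table(labels), decreasing = TRUE)

# Optional: convert to a data frame with the same column names as plyr::count
distribution_base <- as.data.frame(freq_sorted, stringsAsFactors = FALSE)
names(distribution_base) <- c("x", "freq")

head(distribution_base, 3)  # the three most frequent categories
```

Because `sort` works directly on the table, there is no separate `order` step and no indexing by row numbers here.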

Now let's do the histogram:

```
mp <- barplot(distribution_sorted[, "freq"],
              names.arg = distribution_sorted[, 1],   # x-axis names
              las = 2,                                # turn labels by 90 degrees
              col = "blue",                           # blue bars (just for fun)
              xlab = "Category", ylab = "Frequency")  # axis labels
```

There are many more settings to adapt. For example, the `cex` family of arguments scales the font size: `cex.axis` for the numerical y-axis values, `cex.names` for the categorical x-axis names, and `cex.lab` for the axis labels.
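As a rough illustration of these three settings, here is a small self-contained sketch. The scaling factors, the margin values, and the helper name `freq_sorted` are arbitrary choices for the example, not recommendations, and the data is generated in base R so the snippet runs on its own.

```r
# Assumption: a fixed seed, only to make this sketch reproducible
set.seed(1)
labels <- strrep(sample(letters[1:20], 100, replace = TRUE), 10)
freq_sorted <- sort(table(labels), decreasing = TRUE)

# Enlarge the bottom margin (in lines) and remember the old settings
op <- par(mar = c(9, 4, 1, 1))

mp <- barplot(as.vector(freq_sorted),
              names.arg = names(freq_sorted),
              las = 2,
              col = "blue",
              ylab = "Frequency",
              cex.axis  = 1.2,  # numerical y-axis values, 20% larger
              cex.names = 0.8,  # categorical x-axis names, 20% smaller
              cex.lab   = 1.5)  # axis label ("Frequency"), 50% larger

par(op)  # restore the previous graphics settings
```

All three values are multipliers relative to the default text size, so 1 leaves the size unchanged.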

In my plot there is one problem. My category names are much longer than the values on the y-axis, so the x-axis label ends up on top of the category names. This is the point to give up and do the plot in Excel (ahem, LaTeX!), or to take input from fellow bloggers. They explain the issues far better than I can, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with `mtext`. I then wanted a line for the x-axis as well, so in the end I also took the x-axis names out of the plot and put them into a separate `axis` at the bottom (`side=1`) with zero-length ticks (`tcl=0`), intersecting the y-axis at `pos=-0.3`.

```
# mai = margins around the plot in inches: bottom - left - top - right
# mgp = spacing in margin lines for axis title - axis labels - axis line
par(mai = c(2.5, 1, 0.3, 0.15), mgp = c(2.5, 0.75, 0))
mp <- barplot(distribution_sorted[, "freq"],
              # names.arg = distribution_sorted[, 1],  # x-axis names, now drawn by axis() below
              las = 2,             # turn labels by 90 degrees
              col = "blue",        # blue bars (just for fun)
              ylab = "Frequency")  # y-axis title
# mp holds the bar midpoints, so the names line up with the bars
axis(side = 1, at = mp, pos = -0.3,
     tick = TRUE, tcl = 0,
     labels = distribution_sorted[, 1], las = 2)
mtext("Category", side = 1, line = 8.5)  # x-axis title
```
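If the extra x-axis line is not needed, a shorter variant might be enough. The sketch below is an assumption on my part rather than a tested replacement for the code above: it keeps the names inside `barplot`, enlarges the bottom margin with `mar` (measured in lines instead of inches), and places the x-axis title with the `line` argument of `title`. The margin and line values are guesses that will need tuning for other label lengths, and `freq_sorted` is again a made-up helper name.

```r
# Assumption: a fixed seed, only to make this sketch reproducible
set.seed(1)
labels <- strrep(sample(letters[1:20], 100, replace = TRUE), 10)
freq_sorted <- sort(table(labels), decreasing = TRUE)

# mar = margins in lines: bottom - left - top - right
op <- par(mar = c(10, 4, 1, 1))

mp <- barplot(as.vector(freq_sorted),
              names.arg = names(freq_sorted),  # names stay inside barplot
              las = 2,                         # turn labels by 90 degrees
              col = "blue",
              ylab = "Frequency")

# Put the x-axis title below the long category names
title(xlab = "Category", line = 8.5)

par(op)  # restore the previous graphics settings
```

This drops the separate `axis` and `mtext` calls, at the cost of not drawing a line along the x-axis.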

There has to be an easier way!?

Posted in Programming by swk.