I am learning R, and this is my first attempt at creating histograms in it. My data is a vector with one category per data point. For this example we will use a random sample of letters. The important point is that we want a histogram of the frequencies of text labels, not of numbers (strictly speaking this is a bar plot, which is why the code further down uses barplot), and the labels are longer than a single letter. So let's start with this:
labels <- sample(letters[1:20], 100, replace=TRUE)
# Repeat each letter 10 times
labels <- vapply(seq_along(labels),
                 function(x) paste(rep(labels[x], 10), collapse = ""),
                 character(1L))
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- distribution[order(distribution[,"freq"], decreasing=TRUE),]
I use the function count from the package plyr to get a data frame, distribution, with the distinct categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent, I then sort this data frame by frequency with the function order. The function returns a vector of indices in the correct order, so I need to plug this into the original data frame as row numbers.
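To make the index trick concrete, here is a minimal sketch on a tiny made-up vector. It uses base R's table instead of plyr's count, which is my substitution and not what the post itself uses; the names labels_demo, dist_demo, idx and dist_sorted_demo are purely illustrative.

```r
# Toy data, invented for illustration only
labels_demo <- c("bbb", "aaa", "bbb", "ccc", "bbb", "ccc")

# table() counts each distinct value; converting to a data frame gives
# a two-column layout similar to the one described above
dist_demo <- as.data.frame(table(labels_demo), stringsAsFactors = FALSE)
names(dist_demo) <- c("x", "freq")   # rows are aaa (1), bbb (3), ccc (2)

# order() returns row indices, not the sorted values themselves
idx <- order(dist_demo$freq, decreasing = TRUE)
idx  # 2 3 1

# Using the indices as row numbers reorders the whole data frame
dist_sorted_demo <- dist_demo[idx, ]
```

The point of the sketch is only that order gives positions, so the sorting happens in the final subsetting step.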
Now let's do the histogram:
mp <- barplot(distribution_sorted[,"freq"],
              names.arg=distribution_sorted[,1], # X-axis names
              las=2,                             # turn labels by 90 degrees
              col=c("blue"),                     # blue bars (just for fun)
              xlab="Kategorie", ylab="Häufigkeit" # Axis labels (German for "category" and "frequency")
)
There are many more settings to adapt, e.g., you can use the cex parameters to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and the axis labels (cex.lab).
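As a rough illustration of those three settings, the sketch below plots a small made-up frequency vector. The data, the variable names freq_demo and mp_demo, and the particular scaling factors are assumptions chosen for the example, not values from my actual plot.

```r
# Made-up frequencies with longish category names (illustration only)
freq_demo <- c(long_category_a = 12, long_category_b = 7, long_category_c = 3)

mp_demo <- barplot(freq_demo,
                   las = 2,          # turn labels by 90 degrees
                   cex.axis = 1.2,   # larger numbers on the y-axis
                   cex.names = 0.8,  # smaller category names on the x-axis
                   cex.lab = 1.5,    # larger axis label
                   ylab = "Frequency")
```

Values above 1 enlarge the text relative to the default and values below 1 shrink it, so shrinking cex.names is one way to make long category names fit.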
In my plot there is one problem: my category names are much longer than the values on the y-axis, so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) or to take input from fellow bloggers. They explain the issues far better than I can, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well, so in the end I also took the x-axis names out of the plot and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0), drawn at pos=-0.3 on the y-axis.
# mai = space around the plot in inches: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
              #names.arg=distribution_sorted[,1], # X-axis names/labels
              las=2,            # turn labels by 90 degrees
              col=c("blue"),    # blue bars (just for fun)
              ylab="Häufigkeit" # Axis title
)
axis(side=1, at=mp, pos=-0.3, tick=TRUE, tcl=0,
     labels=distribution_sorted[,1], las=2)
mtext("Kategorie", side=1, line=8.5) # x-axis label
There has to be an easier way!?
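One candidate, offered only as a sketch and not as a tested replacement for the solution above: count and sort with base R's table and sort, let barplot take the names from the table, and widen the bottom margin so the long labels fit. The margin sizes and the mtext line are guesses that would need tuning for other label lengths, and set.seed is there only to make the sample reproducible.

```r
set.seed(1)  # reproducible sample (assumption for the example)

# Same kind of data as above: 100 labels, each a letter repeated 10 times
labels <- strrep(sample(letters[1:20], 100, replace = TRUE), 10)

# Count the labels and sort from most to least frequent in one step
freq <- sort(table(labels), decreasing = TRUE)

# mar = margin in lines of text: bottom - left - top - right
par(mar = c(8, 4, 1, 1))
barplot(freq,
        las = 2,               # turn labels by 90 degrees
        col = "blue",
        ylab = "Häufigkeit")   # German for "frequency"
mtext("Kategorie", side = 1, line = 6.5)  # x-axis label below the names
```

This skips plyr and the separate axis call, at the cost of losing the explicit x-axis line drawn by axis in the version above.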