# Precision, Recall and F-measure

In the last post we discussed accuracy, a straightforward method of calculating the performance of a classification system. Using accuracy is fine when the classes are of equal size, but this is often not the case in real world tasks. In such cases the very large number of true negatives outweighs the number of true positives in the evaluation so that accuracy will always be artificially high.

Luckily there are performance measures that ignore the number of true negatives. Two frequently used measures are precision and recall. Precision P indicates how many of the items that we have identified as positives are really positives. In other words, how precise have we been in our identification. How many of those that we think are X, really are X. Formally, this means that we divide the number of true positives by the number of all identified positives (true and false):
$P = TP/(TP+FP)$

Recall R indicates how many of the real positives we have found. So from all of the positive items that are there, how many did we manage to identify. In other words, how exhaustive we were. Formally, this means that we divide the number of true positives by the number of all existing positives (true positives and false negatives):
$R = TP/(TP+FN)$

For our example from the last post, precision and recall are as follows:
$P = 1/(1+3) = 1/4 = 0.25$
$R = 1/(1+2) = 1/3 \approx 0.33$
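
The two formulas translate directly into code. Here is a minimal Python sketch that reproduces the numbers above from the counts TP = 1, FP = 3 and FN = 2. The function names `precision` and `recall` are my own choice for illustration, and returning 0.0 when the denominator is zero is an assumed convention, not part of the definitions.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that really are positives: TP / (TP + FP)."""
    if tp + fp == 0:
        # Nothing was predicted as positive; 0.0 is an assumed convention here.
        return 0.0
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of real positives that were found: TP / (TP + FN)."""
    if tp + fn == 0:
        # There are no real positives; 0.0 is an assumed convention here.
        return 0.0
    return tp / (tp + fn)


# Counts from the spam example: TP = 1, FP = 3, FN = 2
p = precision(tp=1, fp=3)
r = recall(tp=1, fn=2)
print(f"P = {p:.2f}, R = {r:.2f}")  # P = 0.25, R = 0.33
```

Note that the number of true negatives does not appear anywhere in the two functions.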

It is easy to get a recall of 100%: we just say for everything that it is a positive. But as this will probably not be the case (or else we have a really easy dataset to classify!), this approach will give us a really low precision. On the other hand, we can usually get a high precision if we classify as positive only one single item that we are really, really sure about. But if we do that, recall will be low, as there will be more than one positive item in the dataset (or else it is not a very meaningful set).
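
To make the two extremes concrete, here is a small Python sketch. It uses the actual labels of the ten example mails from the confusion matrix post (3 Spam, 7 NonSpam). The helper `precision_recall` and the two made-up "classifiers" are hypothetical names introduced only for this illustration.

```python
def precision_recall(actual, predicted, positive="Spam"):
    """Return (precision, recall) for the class given in `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0  # 0.0 is an assumed convention
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec


# Actual labels of Mail 1 to Mail 10
actual = ["Spam", "NonSpam", "NonSpam", "Spam", "NonSpam",
          "NonSpam", "Spam", "NonSpam", "NonSpam", "NonSpam"]

# Extreme 1: call everything a positive
everything_positive = ["Spam"] * len(actual)

# Extreme 2: call only one mail (Mail 4) a positive
only_one_positive = ["NonSpam"] * len(actual)
only_one_positive[3] = "Spam"

for name, predicted in [("everything positive", everything_positive),
                        ("only one positive", only_one_positive)]:
    prec, rec = precision_recall(actual, predicted)
    print(f"{name}: P = {prec:.2f}, R = {rec:.2f}")
# everything positive: P = 0.30, R = 1.00
# only one positive: P = 1.00, R = 0.33
```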

So recall and precision are in a sort of balance. The F1 score or F1 measure is a way of putting the two of them together to produce one single number. Formally it calculates the harmonic mean of the two numbers and weights the two of them with the same importance (there are other variants that put more importance on one of them):
$F_1 = (2 \cdot P \cdot R)/(P + R)$

Using the values for precision and recall for our example, F1 is:
$F_1 = (2 \cdot 0.25 \cdot 0.33)/(0.25 + 0.33) = 0.165 / 0.58 \approx 0.28$

Intuitively, F1 lies between the two values of precision and recall, but closer to the lower of the two. In other words, it penalizes systems that concentrate on only one of the values and rewards systems where precision and recall are closer together.
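
The following Python sketch shows this behaviour. It uses the general form $F_\beta = (1+\beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, which reduces to F1 for $\beta = 1$. The function name `f_beta` is my own, and returning 0.0 when both inputs are zero is an assumed convention. With the unrounded values P = 1/4 and R = 1/3 the result is 2/7, so slightly above the 0.28 we get from the rounded inputs.

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r.

    beta = 1 gives F1; beta > 1 puts more weight on recall,
    beta < 1 puts more weight on precision.
    """
    if p == 0.0 and r == 0.0:
        return 0.0  # assumed convention to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Values from the spam example, without rounding
f1 = f_beta(1 / 4, 1 / 3)
print(f"F1 = {f1:.3f}")  # F1 = 0.286

# A lopsided system: perfect precision, very low recall
lopsided = f_beta(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
print(f"F1 = {lopsided:.3f}, arithmetic mean = {arithmetic_mean:.3f}")
# F1 = 0.182, arithmetic mean = 0.550
```

The lopsided example shows why the harmonic mean is used: the ordinary average would still look decent, while F1 stays close to the weak recall.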

Link for a second explanation: Explanation from an Information Retrieval perspective

# Accuracy

We are still trying to figure out how good our system is at determining whether e-mails are spam or not. In the last post we ended up with a confusion matrix like this:

| | Actual label: Spam | Actual label: NonSpam |
| --- | --- | --- |
| Predicted label: Spam | 1 (true positives, TP) | 3 (false positives, FP) |
| Predicted label: NonSpam | 2 (false negatives, FN) | 4 (true negatives, TN) |

Now we want to calculate numbers from this table to describe the performance of our system. One easy way of doing this is to use accuracy A. Accuracy basically describes which percentage of decisions we got right. So we would take the diagonal entries in the matrix (the true positives and true negatives) and divide by the total number of entries. Formally:
$A = (TP+TN)/(TP+TN+FP+FN)$

In our example the accuracy is:
$A = (1+4)/(1+4+2+3) = 5/10 = 0.5$

Using accuracy is fine in examples like the above when both classes occur more or less with the same frequency. But frequently the number of true negatives is larger than the number of true positives by many orders of magnitude. So let’s assume 994 for true negatives and when we calculate accuracy again, we get this:
$A = (1+994)/(1+994+2+3) = 995/1000 = 0.995$

It doesn’t really matter if we correctly identify any spam mails. Even if we always say NonSpam, so we get zero Spam-Mails right, we still get nearly the same accuracy as above. So accuracy is not a good indicator of performance for our system in this situation. In the next post we will look at other measures we can use instead.
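
Here is a short Python sketch of the three cases. The function name `accuracy` is my own. The counts for the "always NonSpam" system are derived from the example above: the 3 spam mails all become false negatives and the 997 other mails all become true negatives.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all decisions that were right: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# The ten mails from the confusion matrix
balanced = accuracy(tp=1, tn=4, fp=3, fn=2)

# Same system, but with 994 true negatives
imbalanced = accuracy(tp=1, tn=994, fp=3, fn=2)

# A system that always says NonSpam on the same 1000 mails
always_nonspam = accuracy(tp=0, tn=997, fp=0, fn=3)

print(balanced, imbalanced, always_nonspam)  # 0.5 0.995 0.997
```

The system that never finds a single spam mail even comes out slightly ahead.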


# Confusion matrix

Let’s say we want to analyze e-mails to determine whether they are spam or not. We have a set of mails and for each of them we have a label that says either "Spam" or "NonSpam" (for example we could get these labels from users who mark mails as spam). On this set of documents (the training data) we can train a machine learning system which, given an e-mail, can predict the label. So now we want to know how the system that we have trained is performing, whether it really recognizes spam or not.

So how can we find out? We take another set of mails that have been marked as "Spam" or "NonSpam" (the test data), apply our machine learning system and get predicted labels for these documents. So we end up with a list like this:

| Mail | Actual label | Predicted label |
| --- | --- | --- |
| Mail 1 | Spam | NonSpam |
| Mail 2 | NonSpam | NonSpam |
| Mail 3 | NonSpam | NonSpam |
| Mail 4 | Spam | Spam |
| Mail 5 | NonSpam | NonSpam |
| Mail 6 | NonSpam | NonSpam |
| Mail 7 | Spam | NonSpam |
| Mail 8 | NonSpam | Spam |
| Mail 9 | NonSpam | Spam |
| Mail 10 | NonSpam | Spam |

We can now compare the predicted labels from our system to the actual labels to find out how many of them we got right. When we have two classes, there are four possible outcomes for the comparison of a predicted label and an actual label. We could have predicted "Spam" and the actual label is also "Spam". Or we predicted "NonSpam" and the label is actually "NonSpam". In both of these cases we were right, so these are the true predictions. But we could also have predicted "Spam" when the actual label is "NonSpam", or "NonSpam" when we should have predicted "Spam". These are the false predictions, the cases where we have been wrong.

Let’s assume that we are interested in how well we can predict "Spam". Every mail for which we have predicted the class "Spam" is a positive prediction, a prediction for the class we are interested in. Every mail where we have predicted "NonSpam" is a negative prediction, a prediction for the other class. So we can summarize the possible outcomes and their names in this table:

| | Actual label: Spam | Actual label: NonSpam |
| --- | --- | --- |
| Predicted label: Spam | true positives (TP) | false positives (FP) |
| Predicted label: NonSpam | false negatives (FN) | true negatives (TN) |

The true positives are the mails where we have predicted "Spam", the class we are interested in, so it is a positive prediction, and the actual label was also "Spam", so the prediction was true. The false positives are the mails where we have predicted "Spam" (a positive prediction), but the actual label is "NonSpam", so the prediction is false. Correspondingly, the false negatives are the mails we should have labeled as "Spam" but didn’t, and the true negatives are the mails that we correctly recognized as "NonSpam". This matrix is called a confusion matrix.

Let’s create the confusion matrix for the table with the ten mails that we classified above. Mail 1 is "Spam", but we predicted "NonSpam", so this is a false negative. Mail 2 is "NonSpam" and we predicted "NonSpam", so this is a true negative. And so on. We end up with this table:

| | Actual label: Spam | Actual label: NonSpam |
| --- | --- | --- |
| Predicted label: Spam | 1 | 3 |
| Predicted label: NonSpam | 2 | 4 |
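
The tallying can also be done by a few lines of code. This is a minimal Python sketch, assuming the ten mails are available as (actual, predicted) pairs. The names `POSITIVE`, `outcome` and `mails` are hypothetical and only serve this illustration.

```python
from collections import Counter

POSITIVE = "Spam"  # the class we are interested in


def outcome(actual: str, predicted: str) -> str:
    """Name the cell of the confusion matrix for one mail."""
    if predicted == POSITIVE:
        return "TP" if actual == POSITIVE else "FP"
    return "FN" if actual == POSITIVE else "TN"


# (actual label, predicted label) for Mail 1 to Mail 10
mails = [
    ("Spam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "Spam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "NonSpam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
]

counts = Counter(outcome(actual, predicted) for actual, predicted in mails)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])  # 1 3 2 4
```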

In the next post we will take a look at how we can calculate performance measures from this table.


# Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20], 100, replace = TRUE)
labels <- vapply(seq_along(labels),
                 function(x) paste(rep(labels[x], 10), collapse = ""),
                 character(1L))  # Repeat each letter 10 times
library(plyr)  # for the function 'count'
distribution <- count(labels)
distribution_sorted <-
  distribution[order(distribution[, "freq"], decreasing = TRUE), ]


I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original data frame as row numbers.

Now let's do the histogram:

mp <- barplot(distribution_sorted[, "freq"],
              names.arg = distribution_sorted[, 1],  # X-axis names
              las = 2,  # turn labels by 90 degrees
              col = "blue",  # blue bars (just for fun)
              xlab = "Category", ylab = "Frequency")  # Axis labels


There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai = c(2.5, 1, 0.3, 0.15), mgp = c(2.5, 0.75, 0))
mp <- barplot(distribution_sorted[, "freq"],
              #names.arg = distribution_sorted[, 1],  # X-axis names/labels
              las = 2,  # turn labels by 90 degrees
              col = "blue",  # blue bars (just for fun)
              ylab = "Frequency")  # Axis title
axis(side = 1, at = mp, pos = -0.3,
     tick = TRUE, tcl = 0,
     labels = distribution_sorted[, 1], las = 2)
mtext("Category", side = 1, line = 8.5)  # x-axis label


There has to be an easier way !?

# Citations

Thank you Google Scholar Alerts for bringing to my attention this latest reference to one of my papers:

(3) 基于语义角色标注的提取

Kessler 等 [37] 运用 SRL 对英文比较句的元素进行标注

SRL 中 [38] 。上述研究取得了一定成果, 但是采用 SRL

Whatever it says, it counts towards my H-index!

# Marking significance in a bar plot

And still on the topic of LaTeX presentations, this time trying to plot a symbol over a bar to indicate significance.

This is how it works:

\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.47) {$\bullet$};


You need to put this code directly after the point where the data series has been plotted. Example:

\begin{tikzpicture}
\begin{axis}[xtick=data,axis x line*=bottom,axis y line=left,symbolic x coords={Xcoord1, Xcoord2}]

\addplot [ybar,seagreen] coordinates {(Xcoord1, -0.027) (Xcoord2, 0.436)};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$};

\addplot+ [ybar,blue] coordinates  {(Xcoord1, 0.331) (Xcoord2, 0.095)};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.36) {$\bullet$};

\addplot+ [ybar,orange] coordinates {(Xcoord1, 0.222) (Xcoord2, 0.441)};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.25) {$\bullet$};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$};
\end{axis}
\end{tikzpicture}


# Overlays for bar charts (take 2)

A while back I posted about using overlays for bar charts to show one value at a time. For my latest presentation I had a similar but slightly different wish: show all values for one system at a time, one system after the other.

Easily done, I just adapt the code from my previous post to show all values at the same time:

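\newcommand{\addplotoverlay}[3][]{
\alt<#3->{
\addplot+ [ybar,#1] coordinates {#2}; % from the given slide on: plot the real values
}{
\addplot+ [ybar,#1] coordinates {(Xcoord1,0)}; % before that: a zero value, so no bar shows up in the plot
}
}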


One detail is specific to my plot: Xcoord1 is one of my symbolic x-coordinates. Other than that, the code is completely independent of the coordinates used and of how many there are, which makes it more flexible than my old stuff.

Usage (this will let seagreen bars at the given coordinates appear on slide 2):

\addplotoverlay[seagreen]{(Xcoord1, 0.331) (Xcoord2, 0.095)}{2}


# LaTeX ‘correct’ and ‘wrong’ symbols with TikZ

A symbol for a checkmark to indicate something is correct:

\newcommand{\correct}{$\color{green}\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;$}


A symbol for a cross to indicate something is wrong:

\newcommand{\wrong}{$\mathbin{\tikz [x=1.4ex,y=1.4ex,line width=.2ex, red] \draw (0,0) -- (1,1) (0,1) -- (1,0);}$}%


You’ll need TikZ for this.

# LaTeX presentation background picture

In one slide of a presentation I wanted to have a background picture and overlay it with several text blocks one after the other to have the effect of the text “coming out of” the background. It is tricky to align things in LaTeX beamer, especially if you want to have them on top of each other, so this is my solution: Two minipages that cover the whole slide on top of each other.

A slide is more or less 7cm high (depending a bit on your template). There probably is a length defined for that, but I was too lazy to look for it so I took the actual value. The width of the slide is of course \textwidth. I use vertically centered alignment for the minipage, but that is up to you (see the post Set height of a minipage for the options you can give to minipage).

The way it now works is the following. Create one minipage of full width and height. Use this to display the background image. Then jump back the full height and create a second minipage of full width and height to display the text inside of that. This is the code for my slide:

\begin{minipage}[c][7cm][c]{\textwidth}
\centering
\includegraphics[width=0.8\linewidth]{img/Reviews}
\end{minipage}

\vspace{-7cm}
\begin{minipage}[c][7cm][c]{\textwidth}
\centering

\visible<2->{
\colorbox{white}{\fbox{\textcolor{blue}{I was impressed by the fast shutter speed of D3200.}\only<3->{\textcolor{darkgreen}{~(\emph{positive})}}}}
}

\vspace{1cm}
\visible<4->{
\colorbox{white}{\fbox{\textcolor{blue}{The autofocus was \textbf{not} so reliable.}\only<5->{\textcolor{red}{~(\emph{negative})}}}}
}
\end{minipage}