Let’s say we want to analyze e-mails to determine whether they are spam or not. We have a set of mails, and for each of them a label that says either "Spam" or "NonSpam" (for example, we could get these labels from users who mark mails as spam). On this set of documents (the *training data*) we can train a machine learning system that, given an e-mail, predicts the label. Now we want to know how well the trained system performs: does it really recognize spam?

So how can we find out? We take another set of mails that have been marked as "Spam" or "NonSpam" (the *test data*), apply our machine learning system, and get predicted labels for these documents. We end up with a list like this:

| | Actual label | Predicted label |
|---|---|---|
| Mail 1 | Spam | NonSpam |
| Mail 2 | NonSpam | NonSpam |
| Mail 3 | NonSpam | NonSpam |
| Mail 4 | Spam | Spam |
| Mail 5 | NonSpam | NonSpam |
| Mail 6 | NonSpam | NonSpam |
| Mail 7 | Spam | NonSpam |
| Mail 8 | NonSpam | Spam |
| Mail 9 | NonSpam | Spam |
| Mail 10 | NonSpam | Spam |

We can now compare the predicted labels from our system to the actual labels to find out how many of them we got right. When we have two classes, there are four possible outcomes for the comparison of a predicted label and an actual label. We could have predicted "Spam" and the actual label is also "Spam". Or we predicted "NonSpam" and the label is actually "NonSpam". In both of these cases we were right, so these are the *true* predictions. But, we could also have predicted "Spam" when the actual label is "NonSpam". Or "NonSpam" when we should have predicted "Spam". So these are the *false* predictions, the cases where we have been wrong. Let’s assume that we are interested in how well we can predict "Spam". Every mail for which we have predicted the class "Spam" is a *positive* prediction, a prediction *for* the class we are interested in. Every mail where we have predicted "NonSpam" is a *negative* prediction, a prediction of *not* the class we are interested in. So we can summarize the possible outcomes and their names in this table:

| | Actual label: Spam | Actual label: NonSpam |
|---|---|---|
| Predicted label: Spam | true positives (TP) | false positives (FP) |
| Predicted label: NonSpam | false negatives (FN) | true negatives (TN) |

The *true positives* are the mails where we have predicted "Spam", the class we are interested in, so it is a *positive* prediction, and the actual label was also "Spam", so the prediction was *true*. The *false positives* are the mails where we have predicted "Spam" (a *positive* prediction), but the actual label is "NonSpam", so the prediction is *false*. Correspondingly the *false negatives*, the mails we should have labeled as "Spam" but didn’t. And the *true negatives* that we correctly recognized as "NonSpam". This matrix is called a *confusion matrix*.
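The four outcomes can be expressed as a small decision rule. Here is a minimal sketch (the function name `outcome` and the `positive` parameter are my own choices for illustration), with "Spam" as the positive class:

```python
def outcome(actual, predicted, positive="Spam"):
    """Categorize one (actual, predicted) label pair as TP, FP, FN, or TN."""
    if predicted == positive:
        # A positive prediction: true if the actual label agrees, false otherwise.
        return "TP" if actual == positive else "FP"
    else:
        # A negative prediction: false if we missed a positive, true otherwise.
        return "FN" if actual == positive else "TN"

print(outcome("Spam", "Spam"))        # correctly predicted positive -> TP
print(outcome("NonSpam", "Spam"))     # predicted positive, actually negative -> FP
print(outcome("Spam", "NonSpam"))     # missed a positive -> FN
print(outcome("NonSpam", "NonSpam"))  # correctly predicted negative -> TN
```

Note that "positive" only refers to the class we are interested in, not to whether the prediction was right; that is what "true" and "false" express.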

Let’s create the confusion matrix for the table with the ten mails that we classified above. Mail 1 is "Spam", but we predicted "NonSpam", so this is a false negative. Mail 2 is "NonSpam" and we predicted "NonSpam", so this is a true negative. And so on. We end up with this table:

| | Actual label: Spam | Actual label: NonSpam |
|---|---|---|
| Predicted label: Spam | 1 | 3 |
| Predicted label: NonSpam | 2 | 4 |
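We can check these counts by going through the ten mails programmatically. This is a sketch of that tallying step (the names `mails` and `confusion_counts` are made up for this example):

```python
from collections import Counter

# The ten mails from the table above, as (actual label, predicted label) pairs.
mails = [
    ("Spam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "Spam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "NonSpam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
]

def confusion_counts(pairs, positive="Spam"):
    """Tally TP, FP, FN, and TN over a list of (actual, predicted) pairs."""
    counts = Counter()
    for actual, predicted in pairs:
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

c = confusion_counts(mails)
print(c["TP"], c["FP"], c["FN"], c["TN"])  # -> 1 3 2 4
```

The counts match the confusion matrix above: 1 true positive, 3 false positives, 2 false negatives, and 4 true negatives.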

In the next post we will take a look at how we can calculate performance measures from this table.

For a second explanation, see: Explanation from an Information Retrieval perspective