Learning to learn – What to look for in the evaluation of classification

Applying some machine learning algorithm to classify some data is made easy these days. There is a large amount of programming libraries, applications and online services available on the market. But how do we know whether the algorithm works? Or which of the methods to chose from is the best for our task? This post will give a very short introduction to the four most relevant issues in the evaluation of classification results. None of these points is restricted to a specific machine learning algorithm. In fact, none of them requires any understanding of the used classification method at all.

1. Good machine learning requires good data
The first and most obvious point is, that machine learning requires training data. Training data consists of a number of data items with associated labels assigned by humans. For example, we could train on a set of e-mails which have been labeled as spam or non-spam by the person receiving them. In order for a learning algorithm to learn something useful, a few conditions should be met by the training data. First, there should be enough data. Learning from only 100 e-mails will not work, as there are many types of spam mails. Second, the data should be as clean as possible. If half of the spam-labels are wrong, there is no way how an algorithm can learn what real spam is. And finally, the data should also be as close as possible to the real data that you want to classify later. Training a spam-recognition system on English and then apply it to German will probably not work well.

2. Don’t evaluate on your training data
After the machine learning algorithm has learned to distinguish the classes on the training data, it is ready to be applied to new data. Based on what it has learned from the training data, the algorithm will assign a class to each new data item. A common beginner’s mistake is to apply the algorithm to the training data again. This will lead to very good results, but these results are misleading. Imagine that the “learning algorithm” is just memorizing complete e-mails. If the exact same e-mail is shown to the algorithm again, it will confidently assign the correct class. 100% of training set e-mails will be correct! But even changing one word will cause the algorithm to fail, so it is no use to us in reality. Of course real learning algorithms are more complex, but the issue is the same. It is very easy to be confident about what you already have seen. The hard part is to deal with new stuff. So in order to have a reliable evaluation, the algorithm should be trained on one data set and applied to another totally separate set.

3. Evaluate on data that is close to the data you want to classify later
As we have just discussed, we need to evaluate on data that is different from the training data. But, just like the training data, the evaluation data should be as close as possible to the real data that you want to classify later. If you want to classify German, it doesn’t help you to know that the spam-classifier works very well on English. A common procedure for a good evaluation is to create one data set with labeled data, and then split it up into training and test data (e.g., 80% training data, 20% test data). No item is allowed to be in both sets at the same time. Another common technique is called k-fold cross-validation. This method splits the data into k (often 10) folds and does k train-test runs. In each run, one of the folds is used as test data and the other folds are used as training data. The folds do not change between runs, so in the end every item in the data has been assigned a label, but at that point this item was not in the training set, so point 2 is not violated. For both technique it is worth thinking about whether to randomly shuffle the folds or to enforce a similar label-distribution in all the folds in order to avoid artificial inflation of the results.

4. Chose the right evaluation metric for your problem
After the machine learning system has assigned a class to every data item, we compare the assigned labels to the real labels. The larger the percentage of correct labels, the better the system. There are many ways of comparing the labels depending on the nature of the labels and their distribution. The simplest measure, called accuracy, is to count the number of correct assignments, e.g., how many real spam-mails have been classified as spam by the system and how many non-spam-mails have been classified as non-spam. But accuracy is not a good measure in some cases. Let’s assume that 90% of mails are non-spam. If a system always assigns the label non-spam, it will be 90% accurate – but not useful at all. The same thing happens with many classes if some are much bigger than the others. Accuracy is also not a good choice when labels are on a scale. In this case confusing 1 and 5 is much more serious than confusing 1 and 2 and accuracy does not reflect this. There are alternative metrics for such scenarios that should be used.

I will stop here, although there is more much to be said. I encourage everybody to investigate the topic in more detail. Good evaluation is at least as important as good machine learning algorithms. If evaluation numbers do not reflect the expected real performance of a system, how can they be the basis of any decision?


This post has first appeared at 5analytics.com

Precision, Recall and F-measure

In the last post we discussed accuracy, a straightforward method of calculating the performance of a classification system. Using accuracy is fine when the classes are of equal size, but this is often not the case in real world tasks. In such cases the very large number of true negatives outweighs the number of true positives in the evaluation so that accuracy will always be artificially high.

Luckily there are performance measures that ignore the number of true negatives. Two frequently used measures are precision and recall. Precision P indicates how many of the items that we have identified as positives are really positives. In other words, how precise have we been in our identification. How many of those that we think are X, really are X. Formally, this means that we divide the number of true positives by the number of all identified positives (true and false):
P = TP/(TP+FP)

Recall R indicates how many of the real positives we have found. So from all of the positive items that are there, how many did we manage to identify. In other words, how exhaustive we were. Formally, this means that we divide the number of true positives by the number of all existing positives (true positives and false negatives):
R = TP/(TP+FN)

For our example from the last post, precision and recall are as follows:
P = 1/(1+3) = 1/4 = 0.25
R = 1/(1+2) = 1/3 = 0.33

It is easy to get a recall of 100%. We just say for everything that it is a positive. But as this will probably not the case (or else we have a really easy dataset to classify!), this approach will give us a really low precision. On the other hand, we can usually get a high precision if we only classify as positive one single item that we are really, really sure about. But if we do that, recall will be low, as there will be more than one item in the dataset to be classified (or else it is not a very meaningful set).

So recall and precision are in a sort of balance. The F1 score or F1 measure is a way of putting the two of them together to produce one single number. Formally it calculates the harmonic mean of the two numbers and weights the two of them with the same importance (there are other variants that put more importance on one of them):
F_1 = (2 \cdot P \cdot R)/(P + R)

Using the values for precision and recall for our example, F1 is:
F_1 = (2 \cdot 0.25 \cdot 0.33)/(0.25 + 0.33) = 0.165 / 0.58 = 0.28

Intuitively, F1 is between the two values of precision and recall, but closer to the lower of the two. In other words, it penalizes if we concentrate only on one of the values and rewards systems where precision and recall are closer together.

Link for a second explanation: Explanation from an Information Retrieval perspective

Accuracy

We are still trying to figure out how good our system for determining whether e-mails are spam or not is. In the last post we ended up with a confusion matrix like this:

Actual label
Spam NonSpam
Predicted label Spam 1 (true positives, TP) 3 (false positives, FP)
NonSpam 2 (false negatives, FN) 4 (true negatives, TN)

Now we want to calculate numbers from this table to describe the performance of our system. One easy way of doing this is to use accuracy A. Accuracy basically describes which percentage of decisions we got right. So we would take the diagonal entries in the matrix (the true positives and true negatives) and divide by the total number of entries. Formally:
A = (TP+TN)/(TP+TN+FP+FN)

In our example the accuracy is:
A = (1+4)/(1+4+2+3) = 5/10 = 0.5

Using accuracy is fine in examples like the above when both classes occur more or less with the same frequency. But frequently the number of true negatives is larger than the number of true positives by many orders of magnitudes. So let’s assume 994 for true negatives and when we calculate accuracy again, we get this:
A = (1+994)/(1+994+2+3) = 995/1000 = 0.995

It doesn’t really matter if we correctly identify any spam mails. Even if we always say NonSpam, so we get zero Spam-Mails right, we still get more nearly the same accuracy as above. So accuracy is not a good indicator of performance for our system in this situation. In the next post we will look at other measures we can use instead.

Link for a second explanation: Explanation from an Information Retrieval perspective