Learning to learn – What to look for in the evaluation of classification

Applying a machine learning algorithm to classify data is easy these days. There is a large number of programming libraries, applications and online services available on the market. But how do we know whether the algorithm works? Or which of the available methods is the best for our task? This post gives a very short introduction to the four most relevant issues in the evaluation of classification results. None of these points is restricted to a specific machine learning algorithm. In fact, none of them requires any understanding of the classification method used.

1. Good machine learning requires good data
The first and most obvious point is that machine learning requires training data. Training data consists of a number of data items with associated labels assigned by humans. For example, we could train on a set of e-mails which have been labeled as spam or non-spam by the person receiving them. In order for a learning algorithm to learn something useful, the training data should meet a few conditions. First, there should be enough data. Learning from only 100 e-mails will not work, as there are many types of spam mails. Second, the data should be as clean as possible. If half of the spam labels are wrong, there is no way an algorithm can learn what real spam is. And finally, the data should also be as close as possible to the real data that you want to classify later. Training a spam-recognition system on English and then applying it to German will probably not work well.

2. Don’t evaluate on your training data
After the machine learning algorithm has learned to distinguish the classes on the training data, it is ready to be applied to new data. Based on what it has learned from the training data, the algorithm will assign a class to each new data item. A common beginner’s mistake is to apply the algorithm to the training data again. This will lead to very good results, but these results are misleading. Imagine that the “learning algorithm” is just memorizing complete e-mails. If the exact same e-mail is shown to the algorithm again, it will confidently assign the correct class. 100% of the training set e-mails will be classified correctly! But even changing one word will cause the algorithm to fail, so it is of no use to us in reality. Of course real learning algorithms are more complex, but the issue is the same. It is very easy to be confident about what you have already seen. The hard part is to deal with new stuff. So in order to have a reliable evaluation, the algorithm should be trained on one data set and applied to another, totally separate set.
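The memorizing "learning algorithm" described above can be sketched in a few lines of Python. This is only a toy illustration: the class name, the e-mails and the fallback label "non-spam" for unknown e-mails are all made up for this example.

```python
# Toy "learning algorithm" that only memorizes complete e-mails.
# All e-mails and labels below are invented for illustration.

class MemorizingClassifier:
    def __init__(self, default_label="non-spam"):
        self.memory = {}
        # Assumption: e-mails that were never seen get this fallback label.
        self.default_label = default_label

    def fit(self, emails, labels):
        # "Training" just stores every e-mail together with its label.
        for email, label in zip(emails, labels):
            self.memory[email] = label

    def predict(self, emails):
        # Exact matches get the stored label, everything else the fallback.
        return [self.memory.get(email, self.default_label) for email in emails]


def accuracy(true_labels, predicted_labels):
    # Fraction of items whose predicted label equals the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


train_emails = [
    "win a free prize now",
    "meeting moved to 3 pm",
    "cheap pills online",
    "lunch tomorrow?",
]
train_labels = ["spam", "non-spam", "spam", "non-spam"]

# Each test e-mail differs from a training e-mail by a single word.
test_emails = [
    "win a free holiday now",
    "meeting moved to 4 pm",
    "cheap watches online",
    "dinner tomorrow?",
]
test_labels = ["spam", "non-spam", "spam", "non-spam"]

clf = MemorizingClassifier()
clf.fit(train_emails, train_labels)

train_accuracy = accuracy(train_labels, clf.predict(train_emails))
test_accuracy = accuracy(test_labels, clf.predict(test_emails))
print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {test_accuracy:.2f}")
```

On the training data every e-mail is found in memory, so all labels are correct. On the slightly changed e-mails nothing is found, every item gets the fallback label, and the classifier is only right where the fallback happens to match.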

3. Evaluate on data that is close to the data you want to classify later
As we have just discussed, we need to evaluate on data that is different from the training data. But, just like the training data, the evaluation data should be as close as possible to the real data that you want to classify later. If you want to classify German, it doesn’t help you to know that the spam-classifier works very well on English. A common procedure for a good evaluation is to create one data set with labeled data, and then split it up into training and test data (e.g., 80% training data, 20% test data). No item is allowed to be in both sets at the same time. Another common technique is called k-fold cross-validation. This method splits the data into k (often 10) folds and does k train-test runs. In each run, one of the folds is used as test data and the other folds are used as training data. The folds do not change between runs, so in the end every item in the data has been assigned a label exactly once, and always by a system that did not have this item in its training set, so point 2 is not violated. For both techniques it is worth thinking about whether to randomly shuffle the data before splitting it or to enforce a similar label distribution in all parts in order to avoid artificial inflation of the results.
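Both procedures can be sketched with the Python standard library alone. The function names, the fixed random seed and the placeholder data are assumptions made for this example, and the sketch shuffles randomly rather than enforcing a similar label distribution.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    # Shuffle the positions, then cut once: no item can land in both sets.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


def k_fold_indices(n_items, k=10, seed=0):
    # Shuffle once, then deal the positions into k fixed folds.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        # Fold i is the test data, all other folds form the training data.
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Placeholder data: 50 hypothetical labeled e-mails.
data = [f"email_{i}" for i in range(50)]

train, test = train_test_split(data, test_fraction=0.2)
print(len(train), "training items,", len(test), "test items")

runs = list(k_fold_indices(len(data), k=10))
for run_number, (train_idx, test_idx) in enumerate(runs, start=1):
    print(f"run {run_number}: {len(train_idx)} training items, {len(test_idx)} test items")
```

In practice a library would usually do this work. scikit-learn, for example, provides `train_test_split`, `KFold` and `StratifiedKFold` in `sklearn.model_selection`, the last of which keeps the label distribution similar across folds.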

4. Choose the right evaluation metric for your problem
After the machine learning system has assigned a class to every data item, we compare the assigned labels to the real labels. The larger the percentage of correct labels, the better the system. There are many ways of comparing the labels depending on the nature of the labels and their distribution. The simplest measure, called accuracy, is the fraction of correct assignments, e.g., how many real spam-mails have been classified as spam by the system and how many non-spam-mails have been classified as non-spam, divided by the total number of mails. But accuracy is not a good measure in some cases. Let’s assume that 90% of mails are non-spam. If a system always assigns the label non-spam, it will be 90% accurate, but not useful at all. The same thing happens with many classes if some are much bigger than the others. Accuracy is also not a good choice when labels are on a scale. In this case confusing 1 and 5 is much more serious than confusing 1 and 2, and accuracy does not reflect this. For such scenarios alternative metrics should be used, for example precision and recall for unbalanced classes or the mean absolute error for labels on a scale.
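The two weaknesses of accuracy can be reproduced with a small Python sketch. Recall and mean absolute error are used here as two common examples of alternative metrics; the data is invented to match the 90% non-spam scenario and the 1-to-5 scale from the text.

```python
def accuracy(true_labels, predicted_labels):
    # Fraction of items whose predicted label equals the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def recall(true_labels, predicted_labels, positive):
    # Fraction of the truly positive items that were also predicted positive.
    predictions_for_positives = [
        p for t, p in zip(true_labels, predicted_labels) if t == positive
    ]
    if not predictions_for_positives:
        return 0.0
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


def mean_absolute_error(true_values, predicted_values):
    # Average distance between predicted and true value on a numeric scale.
    total = sum(abs(t - p) for t, p in zip(true_values, predicted_values))
    return total / len(true_values)


# Unbalanced classes: 90% non-spam, and a system that always says "non-spam".
true_labels = ["non-spam"] * 90 + ["spam"] * 10
always_non_spam = ["non-spam"] * 100

baseline_accuracy = accuracy(true_labels, always_non_spam)
spam_recall = recall(true_labels, always_non_spam, positive="spam")
print(f"accuracy: {baseline_accuracy:.2f}, recall for spam: {spam_recall:.2f}")

# Labels on a scale: both systems are always wrong, but not equally wrong.
true_ratings = [1, 1, 1, 1]
near_misses = [2, 2, 2, 2]
far_misses = [5, 5, 5, 5]

near_accuracy = accuracy(true_ratings, near_misses)
far_accuracy = accuracy(true_ratings, far_misses)
near_error = mean_absolute_error(true_ratings, near_misses)
far_error = mean_absolute_error(true_ratings, far_misses)
print(f"accuracy: {near_accuracy:.2f} vs {far_accuracy:.2f}")
print(f"mean absolute error: {near_error:.2f} vs {far_error:.2f}")
```

The always-non-spam system looks good by accuracy but finds none of the spam, which recall makes visible. On the scale, accuracy rates both systems as equally bad, while the mean absolute error shows that confusing 1 with 5 is the larger mistake.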

I will stop here, although there is much more to be said. I encourage everybody to investigate the topic in more detail. Good evaluation is at least as important as good machine learning algorithms. If evaluation numbers do not reflect the expected real performance of a system, how can they be the basis of any decision?


This post first appeared at 5analytics.com