In this blog post, I would like to introduce the two main forms of machine learning, supervised and unsupervised machine learning. The two differ quite a lot in the task they address, in the data that is necessary and in the algorithms that are used.
Supervised learning starts out from a set of data where each item is associated with a label that indicates a category. One example data set could be a collection of e-mails where each one is labeled as “spam” or “non-spam“. Another example data set could be a photo collection with categories such as “shows a mountain“, “is a portrait” or “taken at night“. These labels have usually been assigned by a human. The task for the machine learning algorithm is now to learn how to assign these labels. To this end, it is shown a large number of items with labels and it tries to learn how to distinguish one category from the other. The process is similar to a human who tries to learn something new. A child might first call everything with four legs a cat, but after seeing enough animals and the accompanying comment “no, that’s not a cat, that’s a X“, she will over time come to distinguish actual cats from dogs, cows or horses. Supervised machine learning algorithms do basically the same thing. Given a large amount of examples and their category, they try to find features that separate one class from the others. Coming back to the example of e-mails, the algorithm may find that e-mails that contain the phrases “earn a lot of money” or “prince from Nigeria” are likely spam. Or in the case of photos, it may learn that when a picture is dark, it has been taken at night. There are two main differences to the learning process of us humans. One disadvantage is, that the algorithm cannot generalize as well as we do. But this is offset by the advantage that it is much faster than we are and can look at a much larger data set than we ever could. Supervised learning is sometimes also called classification and there are many machine learning algorithms available. Examples include decision trees, Naive Bayes, logistic regression and neural networks.
Let us now turn to unsupervised learning. Just like with supervised learning, we start with a large data set to show the computer. But in contrast to supervised learning, there are no labels. No one is telling the algorithm what to learn. The task is rather to use the internal characteristics of the data to come up with groups inside the data. For example we could try to find groups of users with similar shopping habits out of all the online customers of your company. Or products that are similar to each other in the set of items those sold at a web shop. Or group the web pages in the result of a web search, e.g., the pages discussing jaguar the car versus those about the cat. The resulting division in the data is not based on outside input, like it is for classification, where a human has to define the categories for the data beforehand. The division is only based on the similarity of items in the data set among each other. No human has defined that for the search “jaguar” there are results for a cat and a car, but just by looking at the pages it turns out that there are two groups of pages that use a very different vocabulary. Algorithms for unsupervised learning include clustering algorithms and methods for covariance analysis like principal component analysis/singular value decomposition.
For the sake of completeness, let me mention that supervised and unsupervised learning are the two poles of machine learning methods, but not everything falls clearly into one camp or the other. Several semi-supervised approaches exist that fall somewhere in between. Some of these approaches use partial labels or external information to create the data set from where supervised learning can then start. Other methods use supervised learning to incrementally increase the data set on which the learning algorithm itself is trained. And of course there is no limit to creativity in this area.
To summarize, supervised and unsupervised learning differ in the task they want to solve (supervised learning assigns human-defined categories while unsupervised learning tries to find inherent groups in the data), the data that is necessary (supervised learning needs a set of items with associated categories, unsupervised learning needs only the items) and in the algorithms that are used (classification algorithms for supervised learning versus clustering algorithms for unsupervised learning).
This post has first appeared at 5analytics.com