Discontinuous x axis with pgfplots

Having a discontinuous y axis is common and Stackoverflow has a few solutions for that. I wanted an x axis with a gap (values 0-10 plus value 20). So this is what I did.

I create an axis from 0 to 12 and give 12 the label “20”. I add an extra tick on the x-axis at about halfway between 10 and “12”, where I want the gap and make it thick and white – basically I want a break in the axis. Then over that break I draw the “label” of this tick, which is two vertical lines at an angle, symbolizing the discontinuity. The relevant part of the style:

xticklabels={0, 2, 4, 6, 8, 10, 20},
extra x ticks={11.1},
extra x tick style={grid=none, tick style={white, very thick}, tick label style={xshift=0cm,yshift=.50cm, rotate=-20}},
extra x tick label={\color{black}{/\!\!/}},

And then I add the data with x-values 20 at x-coordinate “12”:

\addplot coordinates {
(0, 43.3) (1, 43.2) (2, 43.3) (3, 42.9) (4, 42.1) (5, 41.4) 
(6, 41.2) (7, 41.7) (8, 41.7) (9, 42.1) (10, 42.1) }; 
\pgfplotsset{cycle list shift=-1}
\addplot coordinates { (12, 43.8) };
\draw[dotted] (axis cs:10, 42.1) -- (axis cs:12, 43.8);

Adding the last point separately from the rest of the data serves the purpose that I can draw the dotted line by hand. cycle list shift=-1 causes the new “plot” to have the same style as the previous. There might be a way of doing this, but this works.

Hat tip: Stackoverflow, but I currently cannot find the question(s) and answer(s) that helped me solve this. Still, thank you, anonymous people.

Learning to learn – supervised versus unsupervised machine learning

In this blog post, I would like to introduce the two main forms of machine learning, supervised and unsupervised machine learning. The two differ quite a lot in the task they address, in the data that is necessary and in the algorithms that are used.

Supervised learning starts out from a set of data where each item is associated with a label that indicates a category. One example data set could be a collection of e-mails where each one is labeled as “spam” or “non-spam“. Another example data set could be a photo collection with categories such as “shows a mountain“, “is a portrait” or “taken at night“. These labels have usually been assigned by a human. The task for the machine learning algorithm is now to learn how to assign these labels. To this end, it is shown a large number of items with labels and it tries to learn how to distinguish one category from the other. The process is similar to a human who tries to learn something new. A child might first call everything with four legs a cat, but after seeing enough animals and the accompanying comment “no, that’s not a cat, that’s a X“, she will over time come to distinguish actual cats from dogs, cows or horses. Supervised machine learning algorithms do basically the same thing. Given a large amount of examples and their category, they try to find features that separate one class from the others. Coming back to the example of e-mails, the algorithm may find that e-mails that contain the phrases “earn a lot of money” or “prince from Nigeria” are likely spam. Or in the case of photos, it may learn that when a picture is dark, it has been taken at night. There are two main differences to the learning process of us humans. One disadvantage is, that the algorithm cannot generalize as well as we do. But this is offset by the advantage that it is much faster than we are and can look at a much larger data set than we ever could. Supervised learning is sometimes also called classification and there are many machine learning algorithms available. Examples include decision trees, Naive Bayes, logistic regression and neural networks.

Let us now turn to unsupervised learning. Just like with supervised learning, we start with a large data set to show the computer. But in contrast to supervised learning, there are no labels. No one is telling the algorithm what to learn. The task is rather to use the internal characteristics of the data to come up with groups inside the data. For example we could try to find groups of users with similar shopping habits out of all the online customers of your company. Or products that are similar to each other in the set of items those sold at a web shop. Or group the web pages in the result of a web search, e.g., the pages discussing jaguar the car versus those about the cat. The resulting division in the data is not based on outside input, like it is for classification, where a human has to define the categories for the data beforehand. The division is only based on the similarity of items in the data set among each other. No human has defined that for the search “jaguar” there are results for a cat and a car, but just by looking at the pages it turns out that there are two groups of pages that use a very different vocabulary. Algorithms for unsupervised learning include clustering algorithms and methods for covariance analysis like principal component analysis/singular value decomposition.

For the sake of completeness, let me mention that supervised and unsupervised learning are the two poles of machine learning methods, but not everything falls clearly into one camp or the other. Several semi-supervised approaches exist that fall somewhere in between. Some of these approaches use partial labels or external information to create the data set from where supervised learning can then start. Other methods use supervised learning to incrementally increase the data set on which the learning algorithm itself is trained. And of course there is no limit to creativity in this area.

To summarize, supervised and unsupervised learning differ in the task they want to solve (supervised learning assigns human-defined categories while unsupervised learning tries to find inherent groups in the data), the data that is necessary (supervised learning needs a set of items with associated categories, unsupervised learning needs only the items) and in the algorithms that are used (classification algorithms for supervised learning versus clustering algorithms for unsupervised learning).

This post has first appeared at 5analytics.com

Include pages from a pdf into a LaTeX beamer presentation

As you know, I do basically everything with LaTeX. But, I have colleagues who work with other tools and sometimes we exchange slides. Fortunately by now people have realized that I don’t like to get weird formats, so they send me pdfs. Yay!

It is actually really easy to include pages from a presentation in pdf format into a LaTeX beamer presentation. You will need the package pdfpages and then just write:

\setbeamercolor{background canvas}{bg=}

The first line is necessary, because it seems like otherwise the pdf slides end up being inserted behind the background of the slides, which doesn’t make so much sense to me, but anyway.

You can also include one pdf page into a beamer-slide (“frame”). This is useful if you want to edit the slide a bit, for example to hack your own footer back into the slide to get consistent page numbering:

\setbeamercolor{background canvas}{bg=}

\vspace{0.81\paperheight} % go down to where we want the footer

\hspace*{0.31\paperwidth} % space to the left
\begin{minipage}{0.6\paperwidth} % insert my footer
\tiny\colorbox{white}{~\insertshortauthor: \insertshorttitle} 
\hfill \insertframenumber ~/ \inserttotalframenumber