I am a computational linguist, teacher of computer science and above all a huge fan of LaTeX. I use LaTeX for everything, including things you never wanted to do with LaTeX. My latest love is lilypond, aka LaTeX for music. I'll post at irregular intervals about cool stuff, stupid hacks and annoying settings I want to remember for the future.

# Discontinuous x axis with pgfplots

Having a discontinuous y axis is common and Stackoverflow has a few solutions for that. I wanted an x axis with a gap (values 0-10 plus value 20). So this is what I did.

I create an axis from 0 to 12 and give 12 the label “20”. I add an extra tick on the x-axis at about halfway between 10 and “12”, where I want the gap and make it thick and white – basically I want a break in the axis. Then over that break I draw the “label” of this tick, which is two vertical lines at an angle, symbolizing the discontinuity. The relevant part of the style:

xmin=0,
xmax=12.5,
xticklabels={0, 2, 4, 6, 8, 10, 20},
extra x ticks={11.1},
extra x tick style={grid=none, tick style={white, very thick}, tick label style={xshift=0cm,yshift=.50cm, rotate=-20}},
extra x tick label={\color{black}{/\!\!/}},


And then I add the data with x-values 20 at x-coordinate “12”:

\addplot coordinates {
(0, 43.3) (1, 43.2) (2, 43.3) (3, 42.9) (4, 42.1) (5, 41.4)
(6, 41.2) (7, 41.7) (8, 41.7) (9, 42.1) (10, 42.1) };
\pgfplotsset{cycle list shift=-1}
\addplot coordinates { (12, 43.8) };
\draw[dotted] (axis cs:10, 42.1) -- (axis cs:12, 43.8);


Adding the last point separately from the rest of the data lets me draw the dotted line by hand. cycle list shift=-1 causes the new “plot” to have the same style as the previous one. There might be a better way of doing this, but this works.

Hat tip: Stackoverflow, but I currently cannot find the question(s) and answer(s) that helped me solve this. Still, thank you, anonymous people.

# Learning to learn – supervised versus unsupervised machine learning

In this blog post, I would like to introduce the two main forms of machine learning, supervised and unsupervised machine learning. The two differ quite a lot in the task they address, in the data that is necessary and in the algorithms that are used.

Supervised learning starts out from a set of data where each item is associated with a label that indicates a category. One example data set could be a collection of e-mails where each one is labeled as “spam” or “non-spam”. Another example data set could be a photo collection with categories such as “shows a mountain”, “is a portrait” or “taken at night”. These labels have usually been assigned by a human. The task for the machine learning algorithm is now to learn how to assign these labels. To this end, it is shown a large number of items with labels and it tries to learn how to distinguish one category from the other. The process is similar to a human who tries to learn something new. A child might first call everything with four legs a cat, but after seeing enough animals and the accompanying comment “no, that’s not a cat, that’s an X”, she will over time come to distinguish actual cats from dogs, cows or horses. Supervised machine learning algorithms do basically the same thing. Given a large number of examples and their categories, they try to find features that separate one class from the others. Coming back to the example of e-mails, the algorithm may find that e-mails that contain the phrases “earn a lot of money” or “prince from Nigeria” are likely spam. Or in the case of photos, it may learn that when a picture is dark, it has been taken at night. There are two main differences from the way we humans learn. One disadvantage is that the algorithm cannot generalize as well as we do. But this is offset by the advantage that it is much faster than we are and can look at a much larger data set than we ever could. Supervised learning of categories is also called classification, and many machine learning algorithms are available for it. Examples include decision trees, Naive Bayes, logistic regression and neural networks.

Let us now turn to unsupervised learning. Just like with supervised learning, we start with a large data set to show the computer. But in contrast to supervised learning, there are no labels. No one is telling the algorithm what to learn. The task is rather to use the internal characteristics of the data to come up with groups inside the data. For example, we could try to find groups of users with similar shopping habits among all the online customers of a company. Or products that are similar to each other in the set of items sold at a web shop. Or we could group the web pages in the result of a web search, e.g., the pages discussing jaguar the car versus those about the cat. The resulting division of the data is not based on outside input, as it is for classification, where a human has to define the categories for the data beforehand. The division is based only on the similarity of the items in the data set to each other. No human has defined that for the search “jaguar” there are results for a cat and a car, but just by looking at the pages it turns out that there are two groups of pages that use very different vocabulary. Algorithms for unsupervised learning include clustering algorithms and methods for covariance analysis like principal component analysis/singular value decomposition.

For the sake of completeness, let me mention that supervised and unsupervised learning are the two poles of machine learning methods, but not everything falls clearly into one camp or the other. Several semi-supervised approaches exist that fall somewhere in between. Some of these approaches use partial labels or external information to create the data set from where supervised learning can then start. Other methods use supervised learning to incrementally increase the data set on which the learning algorithm itself is trained. And of course there is no limit to creativity in this area.

To summarize, supervised and unsupervised learning differ in the task they want to solve (supervised learning assigns human-defined categories while unsupervised learning tries to find inherent groups in the data), the data that is necessary (supervised learning needs a set of items with associated categories, unsupervised learning needs only the items) and in the algorithms that are used (classification algorithms for supervised learning versus clustering algorithms for unsupervised learning).
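To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the feature (message length in words), the toy numbers and the helper names `train_centroids`, `classify` and `two_means`. The supervised part averages the feature per human-assigned label and classifies new items by the nearest average. The unsupervised part is a bare-bones two-cluster k-means that never sees a label.

```python
# Toy 1-D data: (message length in words, label). Values are invented.
labeled = [(3, "spam"), (4, "spam"), (5, "spam"),
           (20, "non-spam"), (22, "non-spam"), (25, "non-spam")]


def train_centroids(examples):
    """Supervised: average the feature value per human-assigned label."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(centroids, value):
    """Assign the label whose average is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split values into two groups without any labels.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = float(min(values)), float(max(values))  # initial centers
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            groups[0 if abs(v - low) <= abs(v - high) else 1].append(v)
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return groups


centroids = train_centroids(labeled)
print(classify(centroids, 6))                 # spam
print(two_means([v for v, _ in labeled]))     # ([3, 4, 5], [20, 22, 25])
```

Note that the clustering finds the same two groups but has no name for them. Deciding that one group is "spam" still takes a human.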

This post first appeared at 5analytics.com

# Include pages from a pdf into a LaTeX beamer presentation

As you know, I do basically everything with LaTeX. But, I have colleagues who work with other tools and sometimes we exchange slides. Fortunately by now people have realized that I don’t like to get weird formats, so they send me pdfs. Yay!

It is actually really easy to include pages from a presentation in pdf format into a LaTeX beamer presentation. You will need the package pdfpages and then just write:

{
\setbeamercolor{background canvas}{bg=}
\includepdf[pages=3-8]{slides.pdf}
}


The first line is necessary, because it seems like otherwise the pdf slides end up being inserted behind the background of the slides, which doesn’t make so much sense to me, but anyway.

You can also include one pdf page into a beamer-slide (“frame”). This is useful if you want to edit the slide a bit, for example to hack your own footer back into the slide to get consistent page numbering:

{
\setbeamercolor{background canvas}{bg=}
\begin{frame}[t]
\includepdf[pages=3]{slides.pdf}

\vspace{0.81\paperheight} % go down to where we want the footer

\hspace*{0.31\paperwidth} % space to the left
\begin{minipage}{0.6\paperwidth} % insert my footer
\tiny\colorbox{white}{~\insertshortauthor: \insertshorttitle}
\end{minipage}

\end{frame}
}


# Fun with newlines

Use a typewriter lately? No? Well, who cares… except when you encounter stupidities left over from the early days of computing where people were still used to typewriters. Because typewriters had two ways of going to a new line, ASCII knows two ways of representing the newline:

• LF (line feed, German Zeilenvorschub), represented as Unicode code point 0x0A, ASCII 00001010 and escape character \n
• CR (carriage return, German Wagenrücklauf), represented as Unicode code point 0x0D, ASCII 00001101 and escape character \r

ASCII is one of the earliest standardized encodings for representing text in bits. It’s from the 1960s and at the time someone probably thought it was a good idea to have two characters for the concept of a new line. We’d think "who cares about stuff from the 1960s", it’s 2017, right? But unfortunately many later encodings are based on ASCII, most notably those from the Unicode family, e.g., the widely used UTF-8. So – thank you, 1960s! /sarcasm

Two characters for a new line would not be too bad if they were used consistently, but that is where the fun begins. Of course they are not! Different operating systems use different conventions to mark the end of a line:

• Linux and Mac OS X use LF
• Windows uses CR LF
• (and to make the chaos complete, Mac OS from before version X uses CR)

So have fun reading "plain text" files! /sarcasm
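If you have to deal with files of unknown origin, a small sketch like the following can help. It is written for Python 3 and works on raw bytes. The function names `count_line_endings` and `normalize_newlines` are my own, not from any library.

```python
def count_line_endings(data: bytes) -> dict:
    """Count how often each of the three conventions occurs in the data."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # lone CR (old Mac OS)
        "LF": data.count(b"\n") - crlf,  # lone LF (Linux, Mac OS X)
    }


def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and lone CR line endings to LF."""
    # Replace CR LF first, otherwise each CR LF would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"windows\r\nold mac\rlinux\n"
print(count_line_endings(mixed))   # {'CRLF': 1, 'CR': 1, 'LF': 1}
print(normalize_newlines(mixed))   # b'windows\nold mac\nlinux\n'
```

When a file is opened in text mode, Python 3 already does this translation on reading by default, so the sketch is mainly useful for binary data or for checking what a file really contains.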

# Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in Python and writing it to another file is easy:

input_file = open("input.txt")
output_file = open("output.txt", "w")
for line in input_file:
    # line still ends with its newline, so write it unchanged
    output_file.write(line)
input_file.close()
output_file.close()


But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, none: unless told otherwise, line is just a byte string (type 'str') with no encoding attached.

Because this is usually not what we want, the first step is to convert the line we read to Unicode. This is done with the method decode. The method has two parameters. The first is the encoding that should be used. This is a predefined string value which you can guess for the more common encodings (or look up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with a UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file, and therefore the line we read from it, is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')


or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')


After decoding, we have something of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)


Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!
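To see what the error-handling parameter does in practice, here is a small sketch with a made-up Latin-1 byte string. Note the assumption: it is written for Python 3, where byte strings are `bytes` and decoded text is `str`. The `decode` and `encode` calls themselves take the same arguments as described above.

```python
raw = b"Gr\xfc\xdfe"  # the word "Grüße" encoded as Latin-1

text = raw.decode("latin-1", "strict")      # 'Grüße', 5 characters

# The same bytes are not valid ASCII; the second argument decides what happens.
ignored = raw.decode("ascii", "ignore")     # drops the two unknown bytes
replaced = raw.decode("ascii", "replace")   # inserts U+FFFD for each of them

try:
    raw.decode("ascii", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True                    # 'strict' aborts with an error

utf8 = text.encode("utf-8")                     # ü and ß take two bytes each
ascii_lossy = text.encode("ascii", "replace")   # unencodable characters become '?'

print(ignored)       # Gre
print(utf8)          # b'Gr\xc3\xbc\xc3\x9fe'
print(ascii_lossy)   # b'Gr??e'
```

The asymmetry is worth remembering: when decoding, 'replace' inserts U+FFFD, but when encoding it inserts a plain question mark, because U+FFFD itself may not exist in the target encoding.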

Further reading: Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).

# Learning to learn – What to look for in the evaluation of classification

Applying some machine learning algorithm to classify some data is made easy these days. There is a large number of programming libraries, applications and online services available on the market. But how do we know whether the algorithm works? Or which of the available methods is the best for our task? This post will give a very short introduction to the four most relevant issues in the evaluation of classification results. None of these points is restricted to a specific machine learning algorithm. In fact, none of them requires any understanding of the classification method used at all.

1. Good machine learning requires good data
The first and most obvious point is that machine learning requires training data. Training data consists of a number of data items with associated labels assigned by humans. For example, we could train on a set of e-mails which have been labeled as spam or non-spam by the person receiving them. In order for a learning algorithm to learn something useful, a few conditions should be met by the training data. First, there should be enough data. Learning from only 100 e-mails will not work, as there are many types of spam mails. Second, the data should be as clean as possible. If half of the spam labels are wrong, there is no way an algorithm can learn what real spam is. And finally, the data should also be as close as possible to the real data that you want to classify later. Training a spam-recognition system on English and then applying it to German will probably not work well.

2. Don’t evaluate on your training data
After the machine learning algorithm has learned to distinguish the classes on the training data, it is ready to be applied to new data. Based on what it has learned from the training data, the algorithm will assign a class to each new data item. A common beginner’s mistake is to apply the algorithm to the training data again. This will lead to very good results, but these results are misleading. Imagine that the “learning algorithm” is just memorizing complete e-mails. If the exact same e-mail is shown to the algorithm again, it will confidently assign the correct class. 100% of training set e-mails will be correct! But even changing one word will cause the algorithm to fail, so it is no use to us in reality. Of course real learning algorithms are more complex, but the issue is the same. It is very easy to be confident about what you already have seen. The hard part is to deal with new stuff. So in order to have a reliable evaluation, the algorithm should be trained on one data set and applied to another totally separate set.

3. Evaluate on data that is close to the data you want to classify later
As we have just discussed, we need to evaluate on data that is different from the training data. But, just like the training data, the evaluation data should be as close as possible to the real data that you want to classify later. If you want to classify German, it doesn’t help you to know that the spam-classifier works very well on English. A common procedure for a good evaluation is to create one data set with labeled data, and then split it up into training and test data (e.g., 80% training data, 20% test data). No item is allowed to be in both sets at the same time. Another common technique is called k-fold cross-validation. This method splits the data into k (often 10) folds and does k train-test runs. In each run, one of the folds is used as test data and the other folds are used as training data. The folds do not change between runs, so in the end every item in the data has been assigned a label, but at the time it was labeled it was not in the training set, so point 2 is not violated. For both techniques it is worth thinking about whether to randomly shuffle the data before splitting or to enforce a similar label distribution in all parts in order to avoid artificial inflation of the results.

4. Choose the right evaluation metric for your problem
After the machine learning system has assigned a class to every data item, we compare the assigned labels to the real labels. The larger the percentage of correct labels, the better the system. There are many ways of comparing the labels depending on the nature of the labels and their distribution. The simplest measure, called accuracy, is the fraction of correct assignments, e.g., how many real spam-mails have been classified as spam by the system and how many non-spam-mails have been classified as non-spam, divided by the total number of mails. But accuracy is not a good measure in some cases. Let’s assume that 90% of mails are non-spam. If a system always assigns the label non-spam, it will be 90% accurate – but not useful at all. The same thing happens with many classes if some are much bigger than the others. Accuracy is also not a good choice when labels are on a scale. In this case confusing 1 and 5 is much more serious than confusing 1 and 2 and accuracy does not reflect this. There are alternative metrics for such scenarios that should be used.
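Points 3 and 4 can be sketched together in a few lines of plain Python. The helper names (`k_fold_indices`, `accuracy`, `recall`) and the 90/10 toy data are assumptions for illustration. Shuffling and stratification are deliberately left out, and the "classifier" is the always-non-spam baseline from the text, not a real learner. Recall is used here as one example of an alternative metric.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; every item is in exactly one test fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the real label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def recall(gold, predicted, label):
    """Fraction of the items that really have `label` and were predicted as such."""
    relevant = [p for g, p in zip(gold, predicted) if g == label]
    return sum(1 for p in relevant if p == label) / len(relevant)


# Hypothetical data: 90 non-spam mails and 10 spam mails.
gold = ["non-spam"] * 90 + ["spam"] * 10
predicted = ["non-spam"] * len(gold)  # baseline that always answers non-spam

print(accuracy(gold, predicted))        # 0.9
print(recall(gold, predicted, "spam"))  # 0.0

fold_scores = []
for train, test in k_fold_indices(len(gold), k=5):
    # A real learner would be fitted on `train` here; this baseline ignores it.
    fold_gold = [gold[i] for i in test]
    fold_pred = ["non-spam"] * len(test)
    fold_scores.append(accuracy(fold_gold, fold_pred))
```

The baseline looks fine by accuracy in every fold, but its recall for spam is zero: it never finds a single spam mail.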

I will stop here, although there is much more to be said. I encourage everybody to investigate the topic in more detail. Good evaluation is at least as important as good machine learning algorithms. If evaluation numbers do not reflect the expected real performance of a system, how can they be the basis of any decision?

This post first appeared at 5analytics.com

# Encoding question mark in TikZ

I’m trying to draw the question mark that is sometimes displayed when there are encoding issues: �

This is my solution:

\tikz[baseline=(wi.base)]{
\node[fill=black,rotate=45,inner sep=.1ex,text height=1.8ex,text width=1.8ex] {};
\node[font=\color{white}] (wi) {?};
}


# Accessing JVM arguments from inside Java

Whenever I get a ClassNotFoundException error in Java, I think to myself “but it is there!” and then I correct the typo in the classpath or get angry at Eclipse for messing up my classpath. Lately I have programmed in more complex settings where it was not always clear to me where the application gets the classpath from, so I wanted to check which of my libraries actually end up on the classpath. Turns out it is not very complicated. Here is code to print a number of useful things:

System.out.println("Working directory: "
+ Paths.get(".").toAbsolutePath().normalize().toString());
System.out.println("Classpath: "
+ System.getProperty("java.class.path"));
System.out.println("Library path: "
+ System.getProperty("java.library.path"));
System.out.println("java.ext.dirs: "
+ System.getProperty("java.ext.dirs"));


The current working directory is the starting point for all relative paths, e.g., for reading and writing files. The normalization of the path makes it a bit more readable, but is not necessary. The class Paths is from the package java.nio.file. The classpath is the place where Java looks for (bytecode for) classes. The entries should be folders or jar-files. The Java library path is where Java looks for native libraries, i.e., platform-dependent things. You can of course access other system properties with the same method, but I cannot at the moment think of a useful example.

Related (at least related enough to put it into the same post), this is how you can print the space used and available on the JVM heap:

int mbfactor = 1024*1024;
System.out.println("Memory free/used/total/max "
+ Runtime.getRuntime().freeMemory()/mbfactor + "/"
+ (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/mbfactor + "/"
+ Runtime.getRuntime().totalMemory()/mbfactor + "/"
+ Runtime.getRuntime().maxMemory()/mbfactor + " MB"
);


# Tuplets (Triolen) in Lilypond

\times 2/3 { a8( b c) }


And as of Lilypond 2.17 (note that \tuplet takes the inverse fraction of \times):

\tuplet 3/2 { a8( b c) }


# Positioning rests in the middle of a two-voice line

In choir scores, you often have the score for two voices (e.g., soprano and alto) in one line:

\new Staff = "Frauen" <<
  \new Voice = "Sopran" { \voiceOne \global \soprano }
  \new Voice = "Alt" { \voiceTwo \global \alto }
>>


When they both have a pause at the same time with the same length, lilypond will still print two rests in different positions. If you (like me) think this looks weird, here is how you can change it:

soprano = \relative c' { a2 \oneVoice r4 \voiceOne a4 }
alto = \relative c' { a2 s4 a4 }


In one voice, switch to \oneVoice for the rest and then back to the usual voice, here \voiceOne. If you do the same in the other voice, you will get warnings about clashing notes, so instead of a rest, use an invisible rest (spacer) with s there.

An alternative is the following command which causes all rests to appear in the middle of the line. It should be used inside the \layout block:

   \override Voice.Rest #'staff-position = #0