Confusion matrix

Let’s say we want to analyze e-mails to determine whether they are spam or not. We have a set of mails, and for each of them a label that says either "Spam" or "NonSpam" (for example, we could get these labels from users who mark mails as spam). On this set of documents (the training data) we can train a machine learning system which, given an e-mail, predicts the label. Now we want to know how well the trained system performs, i.e., whether it really recognizes spam.

So how can we find out? We take another set of mails that have been marked as "Spam" or "NonSpam" (the test data), apply our machine learning system, and get predicted labels for these documents. We end up with a list like this:

          Actual label   Predicted label
Mail 1    Spam           NonSpam
Mail 2    NonSpam        NonSpam
Mail 3    NonSpam        NonSpam
Mail 4    Spam           Spam
Mail 5    NonSpam        NonSpam
Mail 6    NonSpam        NonSpam
Mail 7    Spam           NonSpam
Mail 8    NonSpam        Spam
Mail 9    NonSpam        Spam
Mail 10   NonSpam        Spam

We can now compare the predicted labels from our system to the actual labels to find out how many of them we got right. When we have two classes, there are four possible outcomes for the comparison of a predicted label and an actual label. We could have predicted "Spam" and the actual label is also "Spam". Or we predicted "NonSpam" and the actual label is "NonSpam". In both of these cases we were right, so these are the true predictions. But we could also have predicted "Spam" when the actual label is "NonSpam", or "NonSpam" when the actual label is "Spam". These are the false predictions, the cases where we were wrong.

Let’s assume that we are interested in how well we can predict "Spam". Every mail for which we have predicted the class "Spam" is a positive prediction, a prediction for the class we are interested in. Every mail for which we have predicted "NonSpam" is a negative prediction, a prediction against the class we are interested in. We can summarize the possible outcomes and their names in this table:

                            Actual label
                            Spam                   NonSpam
Predicted label   Spam      true positives (TP)    false positives (FP)
                  NonSpam   false negatives (FN)   true negatives (TN)

The true positives are the mails where we have predicted "Spam", the class we are interested in, so it is a positive prediction, and the actual label was also "Spam", so the prediction was true. The false positives are the mails where we have predicted "Spam" (a positive prediction), but the actual label is "NonSpam", so the prediction is false. Correspondingly, the false negatives are the mails we should have labeled as "Spam" but didn’t, and the true negatives are the mails that we correctly recognized as "NonSpam". This matrix is called a confusion matrix.
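As a small illustration, the four outcomes can be written as a function of the actual and the predicted label. This is only a sketch in Python; the function name outcome and the parameter positive are made up for this example and are not part of any library.

```python
def outcome(actual, predicted, positive="Spam"):
    """Name the outcome of one prediction relative to the positive class.

    Hypothetical helper for illustration: 'positive' is the class we are
    interested in, every other label counts as negative.
    """
    if predicted == positive:
        # Positive prediction: true if the actual label agrees, else false.
        return "TP" if actual == positive else "FP"
    # Negative prediction: false if the mail actually was positive.
    return "FN" if actual == positive else "TN"


print(outcome("Spam", "Spam"))        # true positive
print(outcome("NonSpam", "Spam"))     # false positive
print(outcome("Spam", "NonSpam"))     # false negative
print(outcome("NonSpam", "NonSpam"))  # true negative
```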

Let’s create the confusion matrix for the table with the ten mails that we classified above. Mail 1 is "Spam", but we predicted "NonSpam", so this is a false negative. Mail 2 is "NonSpam" and we predicted "NonSpam", so this is a true negative. And so on. We end up with this table:

                            Actual label
                            Spam   NonSpam
Predicted label   Spam      1      3
                  NonSpam   2      4
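If you do not want to count by hand, the same table can be computed with a few lines of code. The following is a minimal sketch in Python (my choice for the illustration, any language would do); the variable names are made up, and the data is simply the list of ten mails from above typed in as (actual, predicted) pairs.

```python
from collections import Counter

# (actual, predicted) pairs for Mail 1 to Mail 10, copied from the list above
mails = [
    ("Spam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "Spam"),
    ("NonSpam", "NonSpam"),
    ("NonSpam", "NonSpam"),
    ("Spam", "NonSpam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
    ("NonSpam", "Spam"),
]

labels = ["Spam", "NonSpam"]

# Count how often each (actual, predicted) combination occurs.
counts = Counter(mails)

# Rows are predicted labels, columns are actual labels, as in the table.
matrix = [[counts[(actual, predicted)] for actual in labels]
          for predicted in labels]

for predicted, row in zip(labels, matrix):
    print(f"{predicted:8}", row)
```

The first row holds TP and FP, the second row FN and TN, so the layout matches the table above.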

In the next post we will take a look at how we can calculate performance measures from this table.

Link for a second explanation: Explanation from an Information Retrieval perspective

Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original data frame as row indices.

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Category", ylab="Frequency" # Axis labels
         )

There are many more settings to adapt, e.g., you can use the cex parameters to change the font size of the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and the axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis, so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than I could, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well, and in the end I took the x-axis names out of the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Frequency" # Axis title
         )
# mp holds the bar midpoints, so the names end up centered under the bars
axis(side=1, at=mp, pos=-0.3,
     tick=TRUE, tcl=0,
     labels=distribution_sorted[,1], las=2
     )
mtext("Category", side=1, line=8.5) # x-axis label

There has to be an easier way !?

Citations

Thank you Google Scholar Alerts for bringing to my attention this latest reference to one of my papers:

(3) 基于语义角色标注的提取
语义角色标注 SRL 是将词语序列分组, 并按照语
义角色对其分类。SRL 的目的就是找出给定句子中谓
语词的对应语义成分, 即核心语义角色(主语、宾语等)
和附属角色(时间、地点等)。SRL 只针对句子中的部
分成分与谓语的关系进行标注, 属于浅层语义分析。
Kessler 等 [37] 运用 SRL 对英文比较句的元素进行标注
与提取, 效果优于之前的方法。但是, 只使用 SRL 对
中文比较关系提取效果较差, 为此进行不同程度的改
进。例如, 构建混合比较模式的 SRL 模型, 对汉语比
较句进行两阶段标注 [9] ; 将 SRL 与句法分析树相结合,
提出语义角色分析树 [28] , 通过计算两棵子树之间的匹
配相似度抽取比较关系; 还有学者尝试将 CRF 应用到
SRL 中 [38] 。上述研究取得了一定成果, 但是采用 SRL
进行中文标注的效果还有待提高, 对涉及上下句的比
较信息提取尚未能够有效解决。

Whatever it says, it counts towards my H-index!

Marking significance in a bar plot

And still on the topic of LaTeX presentations, this time trying to plot a symbol over a bar to indicate significance.

This is how it works:

\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.47) {$\bullet$}; 

You need to put this code directly after the point where the data series has been plotted. Example:

\begin{tikzpicture}
\begin{axis}[xtick=data,axis x line*=bottom,axis y line=left,symbolic x coords={Xcoord1, Xcoord2}]

\addplot [ybar,seagreen] coordinates {(Xcoord1, -0.027) (Xcoord2, 0.436)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$}; 
\addlegendentry{System 1}

\addplot+ [ybar,blue] coordinates  {(Xcoord1, 0.331) (Xcoord2, 0.095)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.36) {$\bullet$};
\addlegendentry{System 2}

\addplot+ [ybar,orange] coordinates {(Xcoord1, 0.222) (Xcoord2, 0.441)}; 
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord1,0.25) {$\bullet$};
\node[xshift=\pgfkeysvalueof{/pgf/bar shift},anchor=south] at (axis cs:Xcoord2,0.47) {$\bullet$};
\addlegendentry{System 3}
\end{axis}
\end{tikzpicture}

Overlays for bar charts (take 2)

A while back I posted about using overlays for bar charts to show one value at a time. For my latest presentation I had a similar but slightly different wish: show all values for one system at a time, one system after the other.

Easily done, I just adapt the code from my previous post to show all values at the same time:

\newcommand{\addplotoverlay}[3][]{
\alt<#3->{
\addplot+ [ybar,#1] coordinates {#2}; 
}{
\addplot+ [ybar,#1] coordinates {(Xcoord1,0)}; % + don't show zero values in plot
}
}

This is specific to my plot: Xcoord1 is one of my symbolic x-coordinates. Other than that, the code is completely independent of the coordinates used and of their number, which makes it more flexible than my old stuff.

Usage (this will let seagreen bars at the given coordinates appear on slide 2):

\addplotoverlay[seagreen]{(Xcoord1, 0.331) (Xcoord2, 0.095)}{2}

LaTeX ‘correct’ and ‘wrong’ symbols with TikZ

A symbol for a checkmark to indicate something is correct:

\newcommand{\correct}{$\color{green}\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;$}

A symbol for a cross to indicate something is wrong:

\newcommand{\wrong}{$\mathbin{\tikz [x=1.4ex,y=1.4ex,line width=.2ex, red] \draw (0,0) -- (1,1) (0,1) -- (1,0);}$}%

You’ll need TikZ for this.

LaTeX presentation background picture

In one slide of a presentation I wanted to have a background picture and overlay it with several text blocks one after the other to have the effect of the text “coming out of” the background. It is tricky to align things in LaTeX beamer, especially if you want to have them on top of each other, so this is my solution: Two minipages that cover the whole slide on top of each other.

A slide is more or less 7cm high (depending a bit on your template). There probably is a length defined for that, but I was too lazy to look for it so I took the actual value. The width of the slide is of course \textwidth. I use vertically centered alignment for the minipage, but that is up to you (see the post Set height of a minipage for the options you can give to minipage).

The way it now works is the following. Create one minipage of full width and height. Use this to display the background image. Then jump back the full height and create a second minipage of full width and height to display the text inside of that. This is the code for my slide:

\begin{minipage}[c][7cm][c]{\textwidth}
\centering
\includegraphics[width=0.8\linewidth]{img/Reviews}
\end{minipage}

\vspace{-7cm}
\begin{minipage}[c][7cm][c]{\textwidth}
\centering

\visible<2->{
\colorbox{white}{\fbox{\textcolor{blue}{I was impressed by the fast shutter speed of D3200.}\only<3->{\textcolor{darkgreen}{~(\emph{positive})}}}}
}

\vspace{1cm}
\visible<4->{
\colorbox{white}{\fbox{\textcolor{blue}{The autofocus was \textbf{not} so reliable.}\only<5->{\textcolor{red}{~(\emph{negative})}}}}
}
\end{minipage}

Dependency trees with tikz-dependency

There is a package for drawing dependency trees in LaTeX called tikz-dependency:

\usepackage{tikz-dependency} % draw example with dependency tree

First the sentence is defined inside a deptext environment. You could add more rows to the sentence, e.g., for lemmas or parts-of-speech.
The edges between words are given with depedge commands (outside the deptext environment). Edges can in theory go both ways, but it looks better if they go from the head to the child. Here is an example dependency tree for a sentence:

\begin{dependency}
\begin{deptext}
It \& has \& a \& larger \& LCD \& than \& the \& T3i \& .\\
\end{deptext}
\deproot{2}{ROOT}
\depedge{2}{1}{NMOD}
\depedge{5}{3}{NMOD}
\depedge{5}{4}{NMOD}
\depedge{2}{5}{NMOD}
\depedge{5}{6}{PMOD}
\depedge{8}{7}{AMOD}
\depedge{6}{8}{PRD}
\end{dependency}

Besides drawing dependency trees, the package is also useful to create nice mark-up for words and phrases. The command is wordgroup, which is inserted at the same place as the dependency edges and works on the elements of the deptext. The following draws a red box around words 7 and 8 in the sentence above (the 1 stands for the row):

\wordgroup[group style={fill=red!30, draw=red}]{1}{7}{8}{a}

There are lots of styling options for nodes, edges and word groups. I use the following in my thesis (this defines a style that combines options for both the dependency and the deptext environment, I just use the same in both):

\depstyle{depex}{%
   edge style = {gray},
   group style={inner sep=.2ex},
   column sep=0.5em,
   edge unit distance=2ex,
   edge horizontal padding=0.5ex,
   row sep=0.2em,
   label style={draw=none,font=\scriptsize},
   edge vertical padding=0.4ex,
}

I want all mark-up of all word groups to be styled the same way, so I don’t want to write the part with fill and draw all the time with different colors. And if I want to change the percentage of white, I would like to do it at one place for all nodes. So I have written a macro that only expects the color to be given:

\tikzset{
   coloring of/.style={fill=#1!30, draw=#1},
}

And finally, for presentations with dependency trees I want to be able to use overlays, i.e., to show a word group at a specific time. For edges you can use the normal \visible command from beamer, but for word groups it does not work for some reason. So this is a macro that does work and shows word groups only on specific slides (I got this from LaTeX Stack Exchange):

\tikzset{
    invisible/.style={opacity=0},
    visible on/.style={alt={#1{}{invisible}}},
    alt/.code args={<#1>#2#3}{%
      \alt<#1>{\pgfkeysalso{#2}}{\pgfkeysalso{#3}}},
}

Finally, here is an example of using the two macros:

\wordgroup[group style={coloring of=predicatecolor},visible on=<4->]{1}{2}{4}{pred}

MacOSX Postinstall – Todo list

Just a list of steps to take after a clean Mac OS X install; it may get ordered later.

[ ] Install Little Snitch
https://www.obdev.at/products/littlesnitch/index.html
Config on gdrive enc container
[ ] Install Little Flocker
https://github.com/jzdziarski/littleflocker
Config on gdrive enc container
[ ] Install Launchbar
https://www.obdev.at/products/launchbar/download.html
Config on gdrive enc container
[ ] Put ssh keys in ~/.ssh
[ ] Install gpg key in Mail.app
[ ] Install Firefox and Chrome
[ ] Install Filezilla
[ ] Install Pulse VPN
[ ] Install Remote Desktop Manager Free
http://remotedesktopmanager.com/Home/DownloadFree
Config on gdrive enc container
[ ] Install Skype for Business
[ ] Install Skype (not sure, needs way too many permissions etc.)
[ ] Install apps from AppStore
[ ] Unset natural scrolling
[ ] Install Shades
[ ] Install BTT (Better Touch Tool)
[ ] Install Dropbox
[ ] Install GDrive
[ ] Install OneDrive