About swk

I am a software developer, data scientist, computational linguist, teacher of computer science and, above all, a huge fan of LaTeX. I use LaTeX for everything, including things you never wanted to do with LaTeX. My latest love is LilyPond, aka LaTeX for music. I'll post at irregular intervals about cool stuff, stupid hacks and annoying settings I want to remember for the future.

Accessing JVM arguments from inside Java

Whenever I get a ClassNotFoundException error in Java, I think to myself “but it is there!” and then I correct the typo in the classpath or get angry at Eclipse for messing up my classpath. Lately I have programmed in more complex settings where it was not always clear to me where the application gets the classpath from, so I wanted to check which of my libraries actually end up on the classpath. Turns out it is not very complicated. Here is code to print a number of useful things:

// needs: import java.nio.file.Paths;
System.out.println("Working directory: "
      + Paths.get(".").toAbsolutePath().normalize().toString());
System.out.println("Classpath: "
      + System.getProperty("java.class.path"));
System.out.println("Library path: "
      + System.getProperty("java.library.path"));
System.out.println("java.ext.dirs: "
      + System.getProperty("java.ext.dirs"));

The current working directory is the starting point for all relative paths, e.g., for reading and writing files. Normalizing the path makes it a bit more readable, but is not necessary. The class Paths is from the package java.nio.file. The classpath is where Java looks for (the bytecode of) classes; its entries should be folders or jar files. The Java library path is where Java looks for native libraries, i.e., platform-dependent code. The property java.ext.dirs listed the directories of the old extension mechanism; that mechanism was removed in Java 9, so on newer JVMs expect null here. You can of course read other system properties with the same method (these are JVM system properties, not environment variables; for those there is System.getenv), but I cannot at the moment think of a useful example.
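If the classpath is long, one entry per line is easier to check than one long string. Here is a minimal sketch of that idea. The class name ClasspathEntries and its method split are made up for this example; the only assumption is that the entries are separated by the platform's path separator.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the name ClasspathEntries is made up for this sketch.
class ClasspathEntries {

    // Splits a classpath string at the platform's path separator
    // (":" on Linux/macOS, ";" on Windows) and drops empty entries.
    static List<String> split(String classpath) {
        List<String> entries = new ArrayList<>();
        if (classpath == null) {
            return entries;
        }
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        for (String entry : split(System.getProperty("java.class.path"))) {
            // An entry that does not exist on disk cannot contribute any classes.
            String status = new File(entry).exists() ? "ok     " : "MISSING";
            System.out.println(status + " " + entry);
        }
    }
}
```

An entry marked MISSING is a good first suspect when a class "is there" but Java cannot find it.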

Related (at least related enough to put it into the same post), this is how you can print the space used and available on the JVM heap:

int mbfactor = 1024*1024;
System.out.println("Memory free/used/total/max " 
      + Runtime.getRuntime().freeMemory()/mbfactor + "/"
      + (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/mbfactor + "/"
      + Runtime.getRuntime().totalMemory()/mbfactor + "/"
      + Runtime.getRuntime().maxMemory()/mbfactor + " MB"
);

Positioning rests in the middle of a two-voice line

In choir scores, you often have the score for two voices (e.g., soprano and alto) in one line:

      \new Staff  = "Frauen"<<
         \new Voice = "Sopran" { \voiceOne \global  \soprano }
         \new Voice = "Alt" { \voiceTwo \global  \alto }
      >>

When they both have a rest of the same length at the same time, LilyPond will still print two rests in different positions. If you (like me) think this looks weird, here is how you can change it:

soprano = \relative c' { a2 \oneVoice r4 \voiceOne a4 }
alto = \relative c' { a2 s4 a4 }

In one voice, switch to a single voice with \oneVoice for the rest and then back to the usual voice, here \voiceOne. If you do the same in the other voice, you will get warnings about clashing notes, so instead of a rest, use an invisible rest (spacer) with s.

An alternative is the following command which causes all rests to appear in the middle of the line. It should be used inside the \layout block:

   \override Voice.Rest #'staff-position = #0

Precision, Recall and F-measure

In the last post we discussed accuracy, a straightforward method of calculating the performance of a classification system. Using accuracy is fine when the classes are of equal size, but this is often not the case in real world tasks. In such cases the very large number of true negatives outweighs the number of true positives in the evaluation so that accuracy will always be artificially high.

Luckily there are performance measures that ignore the number of true negatives. Two frequently used measures are precision and recall. Precision P indicates how many of the items that we have identified as positives are really positives. In other words, how precise have we been in our identification. How many of those that we think are X, really are X. Formally, this means that we divide the number of true positives by the number of all identified positives (true and false):
P = TP/(TP+FP)

Recall R indicates how many of the real positives we have found. So from all of the positive items that are there, how many did we manage to identify. In other words, how exhaustive we were. Formally, this means that we divide the number of true positives by the number of all existing positives (true positives and false negatives):
R = TP/(TP+FN)

For our example from the last post, precision and recall are as follows:
P = 1/(1+3) = 1/4 = 0.25
R = 1/(1+2) = 1/3 = 0.33
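The two formulas translate directly into code. This is only a small sketch: the class name PrecisionRecall and the method names are made up for this example, and returning NaN when the denominator is zero is my own choice, not part of the definitions above.

```java
import java.util.Locale;

// Hypothetical helper; class and method names are made up for this sketch.
class PrecisionRecall {

    // P = TP / (TP + FP); NaN if nothing was predicted as positive.
    static double precision(int tp, int fp) {
        return (tp + fp == 0) ? Double.NaN : (double) tp / (tp + fp);
    }

    // R = TP / (TP + FN); NaN if there are no actual positives.
    static double recall(int tp, int fn) {
        return (tp + fn == 0) ? Double.NaN : (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        int tp = 1, fp = 3, fn = 2; // counts from the spam example
        System.out.printf(Locale.ROOT, "P = %.2f%n", precision(tp, fp));
        System.out.printf(Locale.ROOT, "R = %.2f%n", recall(tp, fn));
    }
}
```

Note that the number of true negatives does not appear anywhere, which is exactly the point of these two measures.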

It is easy to get a recall of 100%. We just say for everything that it is a positive. But as this will probably not be the case (or else we have a really easy dataset to classify!), this approach will give us a really low precision. On the other hand, we can usually get a high precision if we only classify as positive one single item that we are really, really sure about. But if we do that, recall will be low, as there will be more than one positive item in the dataset (or else it is not a very meaningful set).

So recall and precision are in a sort of balance. The F1 score or F1 measure is a way of putting the two of them together to produce one single number. Formally it calculates the harmonic mean of the two numbers and weights the two of them with the same importance (there are other variants that put more importance on one of them):
F1 = (2 * P * R)/(P + R)

Using the values for precision and recall for our example, F1 is:
F1 = (2 * 0.25 * 0.33)/(0.25 + 0.33) = 0.165/0.58 = 0.28

Intuitively, F1 is between the two values of precision and recall, but closer to the lower of the two. In other words, it penalizes if we concentrate only on one of the values and rewards systems where precision and recall are closer together.
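As a sketch, here is the harmonic mean in code, together with the weighted variant usually called F-beta, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 puts more weight on recall and beta = 1 gives F1. The class name FMeasure and the method names are made up for this example, and returning 0 when both precision and recall are 0 is a convention I assume here.

```java
import java.util.Locale;

// Hypothetical helper; class and method names are made up for this sketch.
class FMeasure {

    // F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    // Assumption: return 0 when the denominator is 0 (P = R = 0).
    static double fBeta(double p, double r, double beta) {
        double b2 = beta * beta;
        double denominator = b2 * p + r;
        return denominator == 0.0 ? 0.0 : (1 + b2) * p * r / denominator;
    }

    // F1 weights precision and recall equally (beta = 1).
    static double f1(double p, double r) {
        return fBeta(p, r, 1.0);
    }

    public static void main(String[] args) {
        double p = 1.0 / 4.0; // precision from the spam example
        double r = 1.0 / 3.0; // recall from the spam example
        System.out.printf(Locale.ROOT, "F1 = %.3f%n", f1(p, r));
    }
}
```

With the unrounded values 1/4 and 1/3 the result is 2/7, about 0.286, so it differs slightly from the 0.28 that you get when plugging in the rounded 0.33.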

Link for a second explanation: Explanation from an Information Retrieval perspective

Accuracy

We are still trying to figure out how good our system for determining whether e-mails are spam or not is. In the last post we ended up with a confusion matrix like this:

                          Actual label
                          Spam                      NonSpam
Predicted label  Spam     1 (true positives, TP)    3 (false positives, FP)
                 NonSpam  2 (false negatives, FN)   4 (true negatives, TN)

Now we want to calculate numbers from this table to describe the performance of our system. One easy way of doing this is to use accuracy A. Accuracy basically describes which percentage of decisions we got right. So we would take the diagonal entries in the matrix (the true positives and true negatives) and divide by the total number of entries. Formally:
A = (TP+TN)/(TP+TN+FP+FN)

In our example the accuracy is:
A = (1+4)/(1+4+2+3) = 5/10 = 0.5

Using accuracy is fine in examples like the above, when both classes occur with more or less the same frequency. But frequently the number of true negatives is larger than the number of true positives by many orders of magnitude. So let’s assume 994 true negatives; when we calculate accuracy again, we get this:
A = (1+994)/(1+994+2+3) = 995/1000 = 0.995

It doesn’t really matter whether we correctly identify any spam mails. Even if we always say NonSpam, so that we get zero spam mails right, we still get nearly the same accuracy as above (997/1000 = 0.997). So accuracy is not a good indicator of performance for our system in this situation. In the next post we will look at other measures we can use instead.
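To play with the numbers, here is a small sketch of the accuracy formula. The class name Accuracy and the method name are made up for this example. The third call is the "always say NonSpam" system on the large dataset: the one true positive becomes a false negative and the three false positives become true negatives.

```java
import java.util.Locale;

// Hypothetical helper; class and method names are made up for this sketch.
class Accuracy {

    // A = (TP + TN) / (TP + TN + FP + FN)
    static double accuracy(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    public static void main(String[] args) {
        // Balanced example from the confusion matrix
        System.out.printf(Locale.ROOT, "balanced:       %.3f%n", accuracy(1, 4, 3, 2));
        // Same system, but with 994 true negatives
        System.out.printf(Locale.ROOT, "many negatives: %.3f%n", accuracy(1, 994, 3, 2));
        // A system that always answers NonSpam on the same 1000 mails
        System.out.printf(Locale.ROOT, "always NonSpam: %.3f%n", accuracy(0, 997, 0, 3));
    }
}
```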

Link for a second explanation: Explanation from an Information Retrieval perspective

Settings swk

Ubuntu / Gnome settings:

  • System settings / Appearance / Behavior: check “Enable workspaces”, show the menus “in the window’s title bar”, menu visibility “always displayed”.
  • System settings / Regional format: Change to “English (Ireland)”.
  • System settings / Bluetooth: Turn off.
  • System settings / Details / Removable media: set all to “Ask what to do”.
  • System settings / Time & Date / Clock : check “Weekday”, “date and month”, “24-hour time”, “include week numbers”
  • System settings / Display: turn off “Sticky edges”, check “Launcher on all displays”
  • System settings / Text entry: set to “Allow different sources for each window” and “new windows use the default source”.
  • Unity tweak tool / Hotcorners: turn on, upper left corner set “Window spread”

Suse, Kubuntu / KDE settings:

  • Settings / Desktop Behaviour / Desktop effects – deactivate “Fade”, “Blur”, “Translucency”
  • Settings / Desktop Behaviour / Accessibility – deactivate “use system bell” in “audible bell”
  • Settings / Account Details / KDE Wallet – deactivate
  • Settings / Input devices / Keyboard – configure English keyboard
  • Settings / Input devices / Mouse / General – set “double click to open files”
  • Settings / Task Manager Settings / General – Sorting “manually”, Grouping “do not group”, mark “show only tasks from the current desktop”
  • Settings / Startup and Shutdown / Desktop session – On startup “start with an empty session”
  • Panel – Remove “Show Desktop” widget, add “Quick launcher” widget.

Firefox settings:

  • General: check “Make Firefox your default browser”, “Always ask me where to save files”, “Open new windows in a new tab instead”.
  • Search: uncheck “Provide search suggestions”.
  • Applications: change pdf to “Always ask”.
  • Privacy: “Use custom settings for history”, uncheck “Remember search and form history”, keep cookies until “I close Firefox”.
  • Security: uncheck “Remember logins for sites”.
  • Advanced / General: check “Search for text when I start typing”,
    uncheck “Check my spelling as I type”.
  • In about:config: set “browser.bookmarks.showRecentlyBookmarked” to False

Thunderbird settings:

  • Enable menu bar
  • Preferences / General: uncheck “When Thunderbird launches show start page”, uncheck “play a sound when new message arrives”.
  • Preferences / Display / Advanced: check “Close message window/tab on move or delete”, uncheck “Show only display name for people in my address book”.
  • Preferences / Composition / Spelling: uncheck “Enable spell check as you type”.
  • Preferences / Privacy: Uncheck “Accept cookies from sites”, check “Tell sites that I do not want to be tracked”.
  • View / Layout: uncheck “Message pane”
  • View / Today pane: uncheck “Show”
  • Account settings / Copies and Folders: change “Place a copy in”, check “Place replies in the folder of message”.
  • Account settings / Composition: uncheck “Compose messages in HTML format.”
  • Install Enigmail and import keys.
  • Install Lightning and import calendars.

Pidgin settings:

  • Preferences / Interface: set “Hide new IM conversations” to “Never”. Set “New conversations” to “New window”. Show system tray icon “Always”
  • Preferences / Conversations: uncheck “show formatting”, uncheck “buddy animation”, uncheck “highlight misspelled words”, uncheck “resize smileys”.
  • Preferences / Sounds: check “Mute sounds”
  • Preferences / Status: set “Idle time” to “Never”, uncheck “change to this status”, set “startup status” to “available”.
  • Plugins: Enable “Message Notification”, “Message Timestamp Formats”
  • Show: “Offline Buddies”, “Empty groups”
  • Install Skype plugin

Atom settings:

  • Core settings: uncheck “audio beep”, set “Restore previous windows on start” to “no”
  • Editor: check “Scroll past end”, check “Soft wrap at preferred line length”
  • Themes: Set to “Atom light”
  • Install Packages:
    • atom-latex (custom toolchain %TEX %ARG %DOC, add *.synctex.gz for cleaning, save files before build)
    • script
    • minimap
    • linter-flake8
  • Disable packages: autocomplete-plus

Konsole/Terminal settings

  • TabBar: check “Show New Tab and Close Tab buttons”
  • Profile / Scrolling: “Unlimited Scrollback”

Confusion matrix

Let’s say we want to analyze e-mails to determine whether they are spam or not. We have a set of mails and for each of them we have a label that says either "Spam" or "NonSpam" (for example, we could get these labels from users who mark mails as spam). On this set of documents (the training data) we can train a machine learning system which, given an e-mail, can predict the label. So now we want to know how the system that we have trained is performing, whether it really recognizes spam or not.

So how can we find out? We take another set of mails that have been marked as "Spam" or "NonSpam" (the test data), apply our machine learning system and get predicted labels for these documents. So we end up with a list like this:

         Actual label  Predicted label
Mail 1   Spam          NonSpam
Mail 2   NonSpam       NonSpam
Mail 3   NonSpam       NonSpam
Mail 4   Spam          Spam
Mail 5   NonSpam       NonSpam
Mail 6   NonSpam       NonSpam
Mail 7   Spam          NonSpam
Mail 8   NonSpam       Spam
Mail 9   NonSpam       Spam
Mail 10  NonSpam       Spam

We can now compare the predicted labels from our system to the actual labels to find out how many of them we got right. When we have two classes, there are four possible outcomes for the comparison of a predicted label and an actual label. We could have predicted "Spam" and the actual label is also "Spam". Or we predicted "NonSpam" and the label is actually "NonSpam". In both of these cases we were right, so these are the true predictions. But, we could also have predicted "Spam" when the actual label is "NonSpam". Or "NonSpam" when we should have predicted "Spam". So these are the false predictions, the cases where we have been wrong. Let’s assume that we are interested in how well we can predict "Spam". Every mail for which we have predicted the class "Spam" is a positive prediction, a prediction for the class we are interested in. Every mail where we have predicted "NonSpam" is a negative prediction, a prediction of not the class we are interested in. So we can summarize the possible outcomes and their names in this table:

                          Actual label
                          Spam                   NonSpam
Predicted label  Spam     true positives (TP)    false positives (FP)
                 NonSpam  false negatives (FN)   true negatives (TN)

The true positives are the mails where we have predicted "Spam", the class we are interested in, so it is a positive prediction, and the actual label was also "Spam", so the prediction was true. The false positives are the mails where we have predicted "Spam" (a positive prediction), but the actual label is "NonSpam", so the prediction is false. Correspondingly, the false negatives are the mails we should have labeled as "Spam" but didn’t, and the true negatives are the mails that we correctly recognized as "NonSpam". This matrix is called a confusion matrix.

Let’s create the confusion matrix for the table with the ten mails that we classified above. Mail 1 is "Spam", but we predicted "NonSpam", so this is a false negative. Mail 2 is "NonSpam" and we predicted "NonSpam", so this is a true negative. And so on. We end up with this table:

                          Actual label
                          Spam   NonSpam
Predicted label  Spam     1      3
                 NonSpam  2      4
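Counting by hand gets tedious quickly, so here is a minimal sketch that does the same comparison for the ten mails from the list. The class name ConfusionMatrix and the method count are made up for this example; the result is returned in the order TP, FP, FN, TN.

```java
import java.util.Arrays;

// Hypothetical helper; class and method names are made up for this sketch.
class ConfusionMatrix {

    // Compares actual and predicted labels pairwise and returns the counts
    // {TP, FP, FN, TN}, where 'positive' is the class we are interested in.
    static int[] count(String[] actual, String[] predicted, String positive) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean isPositive = positive.equals(actual[i]);
            boolean saidPositive = positive.equals(predicted[i]);
            if (saidPositive && isPositive) {
                tp++; // predicted Spam, actually Spam
            } else if (saidPositive) {
                fp++; // predicted Spam, actually NonSpam
            } else if (isPositive) {
                fn++; // predicted NonSpam, actually Spam
            } else {
                tn++; // predicted NonSpam, actually NonSpam
            }
        }
        return new int[] {tp, fp, fn, tn};
    }

    public static void main(String[] args) {
        // Mails 1 to 10 from the list of actual and predicted labels
        String[] actual = {"Spam", "NonSpam", "NonSpam", "Spam", "NonSpam",
                           "NonSpam", "Spam", "NonSpam", "NonSpam", "NonSpam"};
        String[] predicted = {"NonSpam", "NonSpam", "NonSpam", "Spam", "NonSpam",
                              "NonSpam", "NonSpam", "Spam", "Spam", "Spam"};
        System.out.println("TP, FP, FN, TN = "
                + Arrays.toString(count(actual, predicted, "Spam")));
    }
}
```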

In the next post we will take a look at how we can calculate performance measures from this table.

Link for a second explanation: Explanation from an Information Retrieval perspective

Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times each category occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original data frame as row indices.

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Category", ylab="Frequency" # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Frequency" # Axis title
         )
axis(side=1, at=mp, pos=-0.3,
     tick=TRUE, tcl=0,
     labels=distribution_sorted[,1], las=2
     )
mtext("Category", side=1, line=8.5) # x-axis label

There has to be an easier way!?

Citations

Thank you Google Scholar Alerts for bringing to my attention this latest reference to one of my papers:

(3) 基于语义角色标注的提取
语义角色标注 SRL 是将词语序列分组, 并按照语
义角色对其分类。SRL 的目的就是找出给定句子中谓
语词的对应语义成分, 即核心语义角色(主语、宾语等)
和附属角色(时间、地点等)。SRL 只针对句子中的部
分成分与谓语的关系进行标注, 属于浅层语义分析。
Kessler 等 [37] 运用 SRL 对英文比较句的元素进行标注
与提取, 效果优于之前的方法。但是, 只使用 SRL 对
中文比较关系提取效果较差, 为此进行不同程度的改
进。例如, 构建混合比较模式的 SRL 模型, 对汉语比
较句进行两阶段标注 [9] ; 将 SRL 与句法分析树相结合,
提出语义角色分析树 [28] , 通过计算两棵子树之间的匹
配相似度抽取比较关系; 还有学者尝试将 CRF 应用到
SRL 中 [38] 。上述研究取得了一定成果, 但是采用 SRL
进行中文标注的效果还有待提高, 对涉及上下句的比
较信息提取尚未能够有效解决。

Whatever it says, it counts towards my H-index!