Typesetting text in math mode

In information retrieval and text classification, tf-idf plays a big role. Read the Wikipedia article to learn what it is about; here I want to deal with the problem of typesetting the formula in LaTeX.

The formula is the log-weighted term frequency tf times the inverse document frequency idf. If we naively write this down, we arrive at this:

tf-idf_{t,d} = (1 +\log tf_{t,d}) \cdot \log \frac{N}{df_t}

When you look at the LaTeX output, you will see that several things go wrong. In math mode, LaTeX interprets two letters next to each other as a product of two variables. So the name tf becomes the mathematical expression “t times f” and is typeset accordingly. Also, in the case of tf-idf, the name contains a hyphen. In math mode, a hyphen between two expressions is interpreted as a minus sign. So this is definitely not what we want.

How do we solve the problem? We want this part to be interpreted as normal text. One way to add text to equations is the command \mbox{} (another is the command \text{}, which requires the amsmath package). So this is it:

\mbox{tf-idf}_{t,d} = (1 +\log \mbox{tf}_{t,d}) \cdot \log \frac{N}{\mbox{df}_t}
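For comparison, here is a minimal sketch of the same formula using \text{} instead of \mbox{}, wrapped in a complete document so it can be compiled as-is. The document class and the display-math brackets are my choice for illustration only. One difference worth knowing: \text{} adapts its size inside subscripts and superscripts, while \mbox{} always uses the normal text size.

```latex
\documentclass{article}
\usepackage{amsmath} % provides \text

\begin{document}

% The names tf-idf, tf and df are set as text, so the letters are
% not treated as products of variables and the hyphen stays a hyphen.
\[
  \text{tf-idf}_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \log \frac{N}{\text{df}_t}
\]

\end{document}
```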

Which process has an open handle on my file x (fuser, lsof or Process Explorer)?

Here’s how to find out if a file is locked because of another process that still has an open file handle.

On Linux/Unix just use: fuser or lsof

lsof | grep <filename>
fuser -v <filename>
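To try this out without a real locked file, the following sketch keeps a scratch file open from a background process and then asks lsof and fuser who holds it. The variable names and the use of sleep as the "locking" process are assumptions for the demo. Passing the file name directly to lsof avoids scanning every open file on the system, and lsof -t prints only the PIDs.

```shell
# Create a scratch file and keep it open from a background process.
tmpfile=$(mktemp)
sleep 30 < "$tmpfile" &
holder_pid=$!
sleep 1   # give the background process a moment to start

# Ask which processes have the file open (only if the tools are installed).
pids=""
if command -v lsof >/dev/null 2>&1; then
  lsof "$tmpfile" || true               # full listing: command, PID, user, ...
  pids=$(lsof -t "$tmpfile" || true)    # -t prints just the PIDs
fi
if command -v fuser >/dev/null 2>&1; then
  fuser -v "$tmpfile" || true           # verbose listing of accessing processes
fi

# Clean up the demo process and file.
kill "$holder_pid"
rm -f "$tmpfile"
```

The PID reported by both tools should be the one stored in holder_pid.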

On Windows the Sysinternals Process Explorer is a great answer to this (and many other questions):

Just press Ctrl+F, enter the file name (or part of it), and search.

Stanford Tokenizer options for MATE Parser

These are the options I use for the Stanford tokenizer to preprocess my data for parsing with the MATE Parser:

normalizeParentheses=false,
normalizeOtherBrackets=false,
untokenizable=allKeep,
escapeForwardSlashAsterisk=false

This is the explanation of the options from the documentation:

  • normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
  • normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
  • untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is “firstDelete”.
  • escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
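As a rough sketch of how these options can be passed on the command line: the tokenizer class edu.stanford.nlp.process.PTBTokenizer takes them as one comma-separated string via -options. The jar path and the file names below are placeholders I made up, so adjust them to your installation and check the flags against the documentation of your tokenizer version.

```shell
# The four options from above, joined into one comma-separated string.
options="normalizeParentheses=false,normalizeOtherBrackets=false,untokenizable=allKeep,escapeForwardSlashAsterisk=false"

# Placeholder names (assumptions): point these at your own jar and data.
jar="stanford-corenlp.jar"
input="input.txt"

if [ -f "$jar" ] && [ -f "$input" ]; then
  # Tokenize the input file and write the tokens to input.txt.tok.
  java -cp "$jar" edu.stanford.nlp.process.PTBTokenizer -options "$options" "$input" > "$input.tok"
else
  # Nothing to run here, so just show the command that would be used.
  echo "Would run: java -cp $jar edu.stanford.nlp.process.PTBTokenizer -options $options $input"
fi
```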

Backup slides in LaTeX beamer

Sometimes you have a LaTeX beamer presentation and want to have some "backup" slides that you show only if the audience is really interested in a particular detail. There is a simple solution for that: the package appendixnumberbeamer.

You need to load the package in the preamble:

\usepackage{appendixnumberbeamer}

Then you just need to put the command \appendix before the slides you want to have as backup:

\begin{frame}
Thank you for your attention!
\end{frame}

\appendix
% start backup slides here

\begin{frame}
\frametitle{Detailed Results of User Study}
...
\end{frame}

Remember to run pdflatex twice for the changes to take effect!

The slides in the appendix will not count towards the total slide number that is displayed for the normal slides. Backup slides will have their own slide numbers and total slide numbers counted anew from the start of the appendix. Very handy!
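To see the numbering in action, here is a minimal complete presentation that can be compiled as-is. It is only a sketch: the frame contents are made up, and the footline setting is my addition so that the frame numbers are actually visible with the default theme.

```latex
\documentclass{beamer}
\usepackage{appendixnumberbeamer}

% Assumption for the demo: show "current / total" in the footline.
\setbeamertemplate{footline}[frame number]

\begin{document}

\begin{frame}
\frametitle{Main Result}
Our main result.
\end{frame}

\begin{frame}
Thank you for your attention!
\end{frame}

\appendix
% start backup slides here

\begin{frame}
\frametitle{Detailed Results of User Study}
Backup material.
\end{frame}

\end{document}
```

After two pdflatex runs, the total shown on the two normal slides should not include the backup slide.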

You can organize your backup slides in sections; these sections will not appear in the table of contents. If you use a beamer template with navigation (miniframes like in Szeged, or split like in Malmoe), the backup slides will not appear in the navigation. A cool thing is that on the backup slides, the navigation will show the structure of the backup slides, so you can easily jump to the slide you want. A disadvantage is of course that everybody will see that you have more backup slides than actual slides 😉

Change the encoding of a file

My favourite topic is "encoding" (of course that was sarcasm). So my first post is about how to change the encoding of a text file from Latin-1 to UTF-8 on the command line:

iconv -f latin1 -t utf8 source_file > target_file
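A quick way to convince yourself that the conversion did something is to run it on a tiny test file and look at the bytes. The following sketch assumes a scratch directory and sample file names of my own choosing. The Latin-1 byte e4 ("ä") should come out as the two-byte UTF-8 sequence c3 a4.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Write "Mädchen" in Latin-1: \344 is the octal escape for byte 0xE4 ("ä").
printf 'M\344dchen\n' > "$workdir/sample_latin1.txt"

# Convert from Latin-1 to UTF-8, exactly as in the command above.
iconv -f latin1 -t utf8 "$workdir/sample_latin1.txt" > "$workdir/sample_utf8.txt"

# Dump the result as hex bytes and strip the whitespace.
utf8_bytes=$(od -An -tx1 "$workdir/sample_utf8.txt" | tr -d ' \n')
echo "$utf8_bytes"

# Remove the scratch directory again.
rm -rf "$workdir"
```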

Of course we need to know what encoding the file is in… which may be a topic for some future post.

Hello world!

hro: Welcome to our new blog. We’ll use this site to dump anything too useful to trash but too hard to remember at our next cup of coffee…

swk: Where do you want to attach the PP “at our next cup”? “Dump it at our next cup” or “hard to remember at our next cup”?

hro: Arrgghh, I’m writing with a computational linguist. Ok then, how do you want to phrase it then?

swk: pfff… “We’ll use this site to dump anything too useful to trash, but too hard to remember should we need it again.”