About swk

I am a software developer, data scientist, computational linguist, teacher of computer science, and above all a huge fan of LaTeX. I use LaTeX for everything, including things you never wanted to do with LaTeX. My latest love is LilyPond, aka LaTeX for music. I'll post at irregular intervals about cool stuff, stupid hacks and annoying settings I want to remember for the future.

Sorting a HashMap by Value in Java

Sometimes Java drives you nuts… Say you want to store word bigrams and their frequencies. A HashMap<String, Integer> is very convenient. Now we want to sort it by value. And the fun begins! Of course we do not want to lose the connection between keys and values, so we cannot just sort map.keySet() or map.values() on their own (and Collections.sort only accepts a List anyway). Using a TreeMap with a custom-written Comparator that compares the values does not work either: a TreeMap treats keys that compare as equal as one and the same key, so entries with identical values would overwrite each other (another fun fact).

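Here is a generic method that sorts the given HashMap by value. V can be any type that implements Comparable<V>, i.e. anything that can be compared with itself. The result is a list of the map's entries in ascending order of value.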

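import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;

public static <K, V extends Comparable<V>> List<Entry<K, V>> sortHashMapByValue (HashMap<K, V> theMap) {
   // Copy the entries into a list so that each key stays with its value
   List<Entry<K, V>> resultList = new ArrayList<Entry<K, V>>(theMap.entrySet());
   Collections.sort(resultList,
          new Comparator<Entry<K, V>>() {
            @Override
            public int compare(Entry<K, V> o1, Entry<K, V> o2) {
               // Ascending by value; swap o1 and o2 for descending order
               return o1.getValue().compareTo(o2.getValue());
            }
       });
   return resultList;
}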

Letters in LaTeX

My standard letter in LaTeX with the KOMA-Script class scrlttr2 (written by a German to follow German letter conventions):

\documentclass[fromalign=location, fromphone=true, fromemail=true, locfield=wide]{scrlttr2}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\setkomavar{fromname}{Max Mustermann}
\setkomavar{fromaddress}{Musterstr.\ 1\\12345 Musterstadt}
\setkomavar{fromphone}{01234/56789000}
\setkomavar{fromemail}{mustermx@provider.xy}
\setkomavar{signature}{Max Mustermann}
\setkomavar{subject}[]{}
\setkomavar{date}{7.\ August 2013}

\begin{document}
\begin{letter}{Jane Doe\\
Example street 2\\
54321 Exampleville
}
\opening{Dear Mrs.\ X,}
this is the letter I promised.
\closing{Kind regards,}
\end{letter}
\end{document}

The ‘from…’ variables set the sender; the recipient is given right after \begin{letter}. The sender information can end up in rather strange places. For a very simple letter I use ‘fromalign=location’, which puts the sender information at the top right, a bit higher than the recipient, but not in the letterhead. With the standard settings the e-mail address is too wide for that field, so I widen it with ‘locfield=wide’.

Get folder/filename from a path

Get the folder name, file name and file extension from a path, using only bash parameter expansion:


# BASEFILE is assumed to hold the path to split and to contain at least one '/'
# File name: Strip from start longest match of [*/]
FILENAME="${BASEFILE##*/}"
echo "FILENAME $FILENAME"

# Folder: Substring from 0 to start of filename
FOLDER="${BASEFILE:0:${#BASEFILE} - ${#FILENAME} - 1}"
echo "FOLDER $FOLDER"

# File prefix: Strip from end shortest match of [dot plus at least one non-dot char], i.e. the last extension
FILEPREFIX="${FILENAME%.[^.]*}"
echo "FILEPREFIX $FILEPREFIX"

# File extension: Strip from start longest match of [at least one non-dot char plus dot], i.e. everything up to the last dot
EXTENSION="${FILENAME##[^.]*.}"
echo "EXTENSION $EXTENSION"
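
To see what the expansions above actually produce, here is a small sketch with a made-up sample path (the path is only an assumption for illustration). It is not the exact code from above: the folder is taken with ‘%/*’ and the bracket patterns use the POSIX spelling ‘[!.]’, so it should also run in a plain sh. At the end it cross-checks the result against basename and dirname.

```shell
# Hypothetical sample path, chosen only for illustration.
BASEFILE="/home/user/docs/archive.tar.gz"

# File name: strip the longest prefix that ends in '/'
FILENAME="${BASEFILE##*/}"
# Folder: strip the shortest suffix that starts with '/'
FOLDER="${BASEFILE%/*}"
# File prefix: strip the shortest suffix of the form ".x..." (the last extension)
FILEPREFIX="${FILENAME%.[!.]*}"
# Extension: strip the longest prefix that ends in '.'
EXTENSION="${FILENAME##[!.]*.}"

echo "FOLDER $FOLDER"          # FOLDER /home/user/docs
echo "FILENAME $FILENAME"      # FILENAME archive.tar.gz
echo "FILEPREFIX $FILEPREFIX"  # FILEPREFIX archive.tar
echo "EXTENSION $EXTENSION"    # EXTENSION gz

# Cross-check against the standard utilities.
[ "$FILENAME" = "$(basename "$BASEFILE")" ] && echo "basename agrees"
[ "$FOLDER" = "$(dirname "$BASEFILE")" ] && echo "dirname agrees"
```

Note that only the last extension is split off, so ‘archive.tar.gz’ gives the prefix ‘archive.tar’ and the extension ‘gz’.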

Amazon Review Downloader

If you do sentiment analysis at the document level, there are huge amounts of data annotated with star ratings available on Amazon and similar pages. In theory. In practice, to get this data, you need to crawl Amazon pages, download the reviews and parse the HTML to extract the individual reviews. And this would be the n-th time somebody wrote a script to do that. So, to save you the waste of time, Andrea Esuli kindly offers some scripts to download Amazon reviews and convert them to a CSV file. Thank you! You can find them on Andrea Esuli’s web page.

Ugly LaTeX

How to make your LaTeX document look ugly. More specifically: the request is to use Arial, 12pt, 1.5 line spacing. Looks very ugly, but here it is:

\documentclass[12pt]{scrartcl} % use 12pt
[...]
\usepackage{setspace} % manipulate line spacing
\renewcommand{\rmdefault}{phv} % Helvetica (the closest stand-in for Arial) for the serif typeface (normal text)
\renewcommand{\sfdefault}{phv} % Helvetica for sans-serif (headings)
[...]
\begin{document}
[...]
\onehalfspacing % use 1.5 line spacing
[...]

Hebrew hyphenation patterns for babel

So… in the last post I convinced LaTeX to typeset Hebrew with texlive and babel. There are still a few details to work out, so this deals with the first one: Hyphenation patterns.

The error:

Package babel Warning: No hyphenation patterns were loaded for
(babel)                the language `Hebrew'
(babel)                I will use the patterns loaded for \language=0 instead.

As Hebrew is not hyphenated at all, it is of no concern that no hyphenation patterns are found. But LaTeX's "solution" of falling back to the English hyphenation patterns leads to very strange results. So tell babel that it shouldn’t even try, because there really are no hyphenation patterns for Hebrew:

\makeatletter\let\l@hebrew\l@nohyphenation\makeatother

If this doesn’t work, you can fall back to defining a hyphenation pattern length of 255 (see Stackexchange).

Using babel with Hebrew in texlive

Many articles say that to use Hebrew with LaTeX, you should use xelatex instead of pdflatex. It IS possible to write Hebrew with the pdflatex from TeX Live (which is the default on Ubuntu), here is how. Working minimal example:

\documentclass{article}
\usepackage[utf8x]{inputenc} 
\usepackage[hebrew,english]{babel} 

\begin{document}
test
 \sethebrew
שלום
\end{document}

On my Ubuntu 12.10 installation this fails with this error (even though I have the packages ‘culmus’ and ‘texlive-lang-hebrew’):

kpathsea: Running mktextfm jerus10
mktextfm: Running mf-nowin -progname=mf \mode:=ljfour; mag:=1; nonstopmode; input jerus10
This is METAFONT, Version 2.718281 (TeX Live 2012/Debian)
kpathsea: Running mktexmf jerus10
! I can't find file `jerus10'.

Solution:

  1. Get the ‘jerus10.mf’ file from the hebtex LaTeX package (available on CTAN). Don’t install the package, it is deprecated (as in REALLY old).
  2. Put the ‘jerus10.mf’ file in the folder ~/texmf/fonts/source/hebrew/
  3. In terminal, run the command ‘texhash’. If it says ‘done’ without error, everything is fine.
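
Steps 2 and 3 can be sketched as a small shell function. This is only a sketch: the function name install_jerus and its two arguments are made up for this example, and it assumes you have already downloaded ‘jerus10.mf’ yourself. It only calls texhash if the command is installed.

```shell
# Hypothetical helper, not part of any package.
# $1 = path to the downloaded jerus10.mf
# $2 = texmf root (optional, defaults to ~/texmf as in step 2)
install_jerus() {
    src="$1"
    texmf="${2:-$HOME/texmf}"
    dest="$texmf/fonts/source/hebrew"

    # Step 2: put the file into the font source folder.
    mkdir -p "$dest" || return 1
    cp "$src" "$dest/" || return 1

    # Step 3: refresh the filename database for this tree,
    # but only if texhash is available on this machine.
    if command -v texhash >/dev/null 2>&1; then
        texhash "$texmf"
    fi
}

# Usage (assuming jerus10.mf is in the current directory):
# install_jerus ./jerus10.mf
```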

The above file should work now. It’s not particularly pretty and there are compatibility issues with some packages that I might address another time. There are two possible sources of error if you didn’t copy/paste carefully enough:

1. Enable UTF8X input

! Package inputenc Error: Keyboard character used is undefined
(inputenc)                in inputencoding `8859-8'.

or

! Package inputenc Error: Unicode char \u8:ש not set up for use with LaTeX.

If you want to write Unicode Hebrew or copy-paste Hebrew from somewhere, you need to set the input encoding to [utf8x], not just [utf8]:

\usepackage[utf8x]{inputenc} 

2. Be sure to change the language to Hebrew

! LaTeX Error: Command \hebshin unavailable in encoding OT1.

Fix this by declaring that the following text is Hebrew with

\sethebrew

Undefined references – LaTeX Warning

Sometimes LaTeX tells you this:

LaTeX Warning: There were undefined references.

If you get this warning, you will notice some ?? in your document at places where references should be. For references to sections, tables or figures, just run pdflatex again (and check for typos). For bibliography references you need to run bibtex.

Let’s assume you are writing a LaTeX file with the name ‘report.tex’. Do the following:

> pdflatex report.tex
[...]
LaTeX Warning: Citation `Liu2010' on page 1 undefined on input line 39.
[...]
LaTeX Warning: Reference `fig:results' on page 1 undefined on input line 65.
[...]
LaTeX Warning: There were undefined references.
[...]
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
[...]
> bibtex report
[...]
> pdflatex report.tex
[...]
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
[...]
> pdflatex report.tex
[...]

You need to run pdflatex twice more after calling bibtex. Twice, because the first run inserts the references, which may change the layout so that things end up somewhere else, and only the second run gets the cross-references right again.
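
If you get tired of typing the four commands, the sequence can be wrapped in a small function. This is a sketch only: the function name build_report and the DRY_RUN variable are made up for this example (DRY_RUN is not a LaTeX feature). With DRY_RUN=1 the commands are only printed, so you can check the order without having LaTeX installed.

```shell
# Hypothetical helper: run (or, with DRY_RUN=1, just print) the
# pdflatex / bibtex / pdflatex / pdflatex sequence for one document.
build_report() {
    base="${1%.tex}"          # accept "report" as well as "report.tex"
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"            # print the commands instead of running them
    fi

    $run pdflatex "$base.tex" # first pass: collects citations and labels
    $run bibtex "$base"       # builds the bibliography
    $run pdflatex "$base.tex" # second pass: inserts the bibliography
    $run pdflatex "$base.tex" # third pass: fixes the cross-references
}

DRY_RUN=1
build_report report.tex
```

Set DRY_RUN=0 (or leave it unset) to really run the commands.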

Replace newlines with sed

sed is a Linux command-line tool for replacing text in a file or input stream. Normally sed works line by line, i.e., a line is read, the expression is applied, then the next line is read. Say we have a file with one word per line and we want to reconstruct the sentence. How do we replace all line breaks in the file with spaces? Simple:

sed "{:q;N;s/\n/ /g;t q}" 

The expression ‘s/\n/ /’ says substitute line breaks (\n) with a space. ‘g’ says apply this globally. ‘N’ says append the next line to what is being processed. Using only ‘N’ would join the lines in pairs, i.e. replace only every second line break. The rest is a trick to join all lines together: we define the label q (‘:q;’), then we say that if there was a successful substitution, go back to label q (‘t q’).
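
Here is the same loop as a sketch with every sed command in its own -e expression, which makes the four steps easier to see. It assumes GNU sed (which prints the collected text when ‘N’ runs out of input); the function name join_lines and the sample words are made up for this example.

```shell
# Hypothetical wrapper around the loop described above (GNU sed assumed).
join_lines() {
    # :q         define the label q
    # N          append the next input line to the pattern space
    # s/\n/ /g   replace the line break(s) in the pattern space by a space
    # t q        if a substitution happened, jump back to q
    sed -e ':q' -e 'N' -e 's/\n/ /g' -e 't q'
}

JOINED="$(printf 'the\nquick\nbrown\nfox\n' | join_lines)"
echo "$JOINED"    # prints: the quick brown fox
```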

Now we have all words in one line. Across sentences! In the input, sentences are separated by an empty line, which has now turned into two adjacent spaces. So, easy: replace line breaks with spaces, then replace two adjacent spaces with a line break. This gives you one sentence per line, words separated by spaces. Voilà:

cat  | sed "{:q;N;s/\n/ /g;t q}" | sed "{s/  /\n/g}"
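
A quick way to check the whole pipeline without a file is to feed it some words with printf. This is a sketch that assumes GNU sed (for ‘\n’ in the replacement); the sample sentences are made up, and the first sed call is written with separate -e expressions instead of the braces.

```shell
# Two made-up "sentences", one word per line, separated by an empty line.
SENTENCES="$(printf 'I\nlike\nLaTeX\n\nsed\nis\nfun\n' |
    sed -e ':q' -e 'N' -e 's/\n/ /g' -e 't q' |
    sed 's/  /\n/g')"

echo "$SENTENCES"
# prints:
# I like LaTeX
# sed is fun
```

The empty line becomes two spaces after the first sed call, and the second call turns exactly those two spaces back into a line break.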