If you do sentiment analysis at the document level, huge amounts of data annotated with star ratings are available on Amazon and similar sites. In theory. In practice, to get this data you need to crawl Amazon pages, download the reviews, and parse the HTML to extract the individual reviews. And this would be the n-th time somebody wrote a script to do that. So, to save you the wasted time, Andrea Esuli kindly offers some scripts that download Amazon reviews and convert them to a CSV file. Thank you! You can find them on Andrea Esuli’s web page.


Ugly LaTeX

How to make your LaTeX document look ugly. More specifically: the request is to use Arial, 12pt, and 1.5 line spacing. It looks very ugly, but here it is:

\documentclass[12pt]{scrartcl} % use 12pt
[...]
\usepackage{setspace} % manipulate line spacing
\renewcommand{\rmdefault}{phv} % phv is Helvetica, the usual Arial stand-in, for the serif typeface (normal text)
\renewcommand{\sfdefault}{phv} % same family for sans-serif (headings)
[...]
\begin{document}
[...]
\onehalfspacing % use 1.5 line spacing
[...]
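
For orientation, here is how the pieces above might fit together in one compilable file. This is only a minimal sketch: the section title and the body text are placeholders made up for illustration, and everything that was elided with [...] above is simply left out.

```latex
\documentclass[12pt]{scrartcl} % 12pt base font size

\usepackage{setspace}          % provides \onehalfspacing

% phv is the Helvetica family, used here as the Arial look-alike
\renewcommand{\rmdefault}{phv} % normal text
\renewcommand{\sfdefault}{phv} % headings

\begin{document}
\onehalfspacing                % 1.5 line spacing from here on

\section{Placeholder heading}  % hypothetical content, not from the original post
Placeholder body text that should run over more than one line so that
the increased line spacing actually becomes visible in the output.
\end{document}
```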


If you access Google, you are usually redirected to the country-specific Google search for the country you are in. If you want to disable this redirect, use www.google.XX/ncr (ncr stands for ‘no country redirect’).

Hebrew hyphenation patterns for babel

So… in the last post I convinced LaTeX to typeset Hebrew with texlive and babel. There are still a few details to work out, so this deals with the first one: Hyphenation patterns.

The warning:

Package babel Warning: No hyphenation patterns were loaded for
(babel)                the language `Hebrew'


As Hebrew is not hyphenated at all, it is of no concern that no hyphenation patterns are found. But LaTeX’s "solution" of falling back to the English hyphenation patterns leads to very strange results. So tell babel that it shouldn’t even try, because there really are no hyphenation patterns for Hebrew:

\makeatletter\let\l@hebrew\l@nohyphenation\makeatother


If this doesn’t work, you can fall back to defining a hyphenation pattern length of 255 (see Stackexchange).
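
To show where the \let line from above could live, here is a minimal sketch of a preamble. It assumes that the line goes after babel has been loaded and that your TeX installation defines \l@nohyphenation (which depends on how the format was built); neither assumption is stated in the original fix, so treat this as a starting point, not as the one correct placement.

```latex
\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[hebrew,english]{babel}

% Assumption: placed after babel, so that it overrides the fallback
% to the English patterns that babel set up for Hebrew.
\makeatletter                  % allow @ in command names
\let\l@hebrew\l@nohyphenation  % Hebrew now uses the "no hyphenation" language
\makeatother                   % restore the normal meaning of @

\begin{document}
test
\end{document}
```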

Using babel with Hebrew in texlive

Many articles say that to use Hebrew with LaTeX, you should use xelatex instead of the standard LaTeX from TeX Live (which is the default on Ubuntu). It IS possible to write Hebrew with TeX Live and babel; here is how. Working minimal example:

\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[hebrew,english]{babel}

\begin{document}
test
\sethebrew
\end{document}


On my Ubuntu 12.10 installation this fails with this error (even though I have the packages ‘culmus’ and ‘texlive-lang-hebrew’):

kpathsea: Running mktextfm jerus10
mktextfm: Running mf-nowin -progname=mf \mode:=ljfour; mag:=1; nonstopmode; input jerus10
This is METAFONT, Version 2.718281 (TeX Live 2012/Debian)
kpathsea: Running mktexmf jerus10
! I can't find file `jerus10'.


Solution:

1. Get the ‘jerus10.mf’ file from the hebtex LaTeX package (available on CTAN). Don’t install the package, it is deprecated (as in REALLY old).
2. Put the ‘jerus10.mf’ file in the folder ~/texmf/fonts/source/hebrew/
3. In a terminal, run the command ‘texhash’. If it says ‘done’ without an error, everything is fine.

The minimal example above should now compile. It’s not particularly pretty, and there are some compatibility issues with some packages that I might address another time. There are two possible sources of error if you didn’t copy and paste carefully enough:

1. Enable UTF8X input

! Package inputenc Error: Keyboard character used is undefined
(inputenc)                in inputencoding `8859-8'.


or

! Package inputenc Error: Unicode char \u8:ש not set up for use with LaTeX.


If you want to type Unicode Hebrew or copy and paste Hebrew from somewhere, you need to set the input encoding to [utf8x], not just [utf8]:

\usepackage[utf8x]{inputenc}


2. Be sure to change the language to Hebrew

! LaTeX Error: Command \hebshin unavailable in encoding OT1.


Fix this by declaring that the following text is Hebrew with

\sethebrew
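
As a rough illustration of both fixes together, the sketch below extends the minimal example with a short Hebrew phrase and a switch back to left-to-right text. The sample sentences are made up for illustration, and the use of \unsethebrew to leave Hebrew mode is an addition that does not appear in the example above. It also assumes the jerus10 font problem described earlier is already solved.

```latex
\documentclass{article}
\usepackage[utf8x]{inputenc}        % utf8x, not utf8, so Hebrew input is accepted
\usepackage[hebrew,english]{babel}  % last language listed (english) is the default

\begin{document}
English text, typeset left to right.

\sethebrew    % switch to Hebrew: Hebrew font encoding, right-to-left
שלום עולם

\unsethebrew  % switch back to the previous encoding and direction
More English text.
\end{document}
```

If the Hebrew line were placed before \sethebrew, you would expect to run into the \hebshin error shown above, because the Hebrew letters are only available in the Hebrew font encoding.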