Der LISA Webcrawler – Wer sucht, der wird auch finden

Egal was man sucht, im Internet gibt es eine Seite dazu. Aber um diese Seite finden zu können, muss man erst einmal einen Haufen Internetseiten haben, damit man diese Seiten dann lesen (oder automatisch analysieren) kann. Wie kommen wir also an die Seiten? Man könnte vielleicht auf die Idee kommen alle möglichen Adressen alphabetisch aufzulisten. Das wäre aber wenig zielführend, da es eine unendliche Anzahl an Adressen gibt. Eine bessere Idee ist die Strategie, die ein Webcrawler verwendet.

Ein Webcrawler besucht ausgehend von einer Startseite immer weitere Seiten, indem er den Links auf den Seiten folgt. Nehmen wir als Beispiel als Startseite die Seite mit Links zu zwei anderen Seiten, und Zuerst schreibt der Webcrawler die beiden Links auf eine Liste noch zu besuchender Seiten. Diese Liste wird Frontier genannt. Dann sucht er aus dem Frontier eine neue Seite aus, z.B., und besucht diese. Auf der neuen Seite geht dann der gleiche Prozess von vorne los. Die Seite selbst wird abgespeichert. Alle Links auf dieser Seite werden identifiziert und zum Frontier hinzugefügt. Und dann kommt die nächste Seite an die Reihe.

Die Grundidee hinter einem Webcrawler ist einfach, aber in der Umsetzung gibt es einiges, was berücksichtigt werden muss. Zum einen ist da schiere Größe des Internets. Ein einzelner Webcrawler wird nicht weit kommen, daher wird man den Prozess vermutlich parallelisieren wollen. Die nächste Frage ist wie die nächste zu besuchende Seite ausgewählt wird. Die Auswahl kann nach verschiedenen Kriterien erfolgen. Zum Beispiel könnten Seite priorisiert werden, die sich häufig ändern. Oder es werden erst deutsche Seiten besucht. Oder Links die möglichst dicht an der Startseite dran sind (Breitensuche). Eine Schwierigkeit stellen auch dynamisch generierte Seiten dar, die je nach Nutzereingaben unterschiedlich sind. Würde ein Webcrawler z.B. die Startseite einer Suchmaschine besuchen, hätte die Seite für jede mögliche Suche einen anderen Inhalt und andere Links. Der Crawler könnte sich also endlos auf dieser Seite verfangen. Als letzter Punkt sei die Politeness erwähnt. Webcrawler schicken prinzipbedingt sehr viele Anfragen. Das könnte die Webserver auf denen die Seiten liegen schnell überlasten. Daher sollte sich ein Crawler “höflich” verhalten und Wartezeiten zwischen den Anfragen einhalten.

LISA enthält einen Webcrawler, der mit einer Breitensuche Seiten aus dem deutschen Internet besucht. Verschiedene Normalisierungen der URLs im Frontier sorgen dafür, dass LISA sich nicht in dynamischen Inhalten verfängt. LISA kann mehrere Crawler-Threads parallel laufen lassen und verhält sich nach Politeness-Regeln um Webserver nicht zu überlasten. Die vom Crawler erhaltenen Seiten werden in LISA gespeichert und an den LISA Analyzer übergeben, der dann relevante Informationen extrahiert.

This post has first appeared at

Levels of sentiment analysis

Sentiment analysis can be done on different levels of granularity. We could determine the sentiment of a complete review which would give us something similar to the star ratings. We can then go down to a more detailed level, e.g., sentences or to aspects.

Document-level analysis
Input: Text of a complete document
Output: Polarity (as a label positive/negative/neutral or rating on a scale)

This task can be done fairly reliable with automatic methods, as there is usually some redundancy, so if the method misses one clue, there are other clues that are sufficient to know what polarity is expressed. Document-level analysis makes a few assumptions that may not be true. First, it assumes that a document talks about a single target and that all opinion expressions refer to this target. This may be true in some cases, but especially in longer reviews people like to compare the product they are discussing to other similar products, they may describe the plot of a movie or book, they may give opinions about the delivery, tell stories about how they got the product as a gift, and so on. Second, the assumption is that one document expresses one opinion, but human authors may be undecided. Finally, it assumes that the complete review expresses the opinion of one author, but there may be parts where other people’s opinions are cited (for completeness or to refute them).

Sentence-level analysis
Input: One sentence
Output: Polarity + target

On sentence level, we can add the task of finding out what the sentence talks about (the target) to the task of determining the sentiment. While this level of analysis allows us in some cases to overcome the difficulties we talked about on document level, it still makes the same assumptions for each sentence. And all of them may be false even in a single sentence, there may be more than one target ("A is better than B"), more than one opinion ("I liked the UI, but the ring tones were horrible") or opinions of more than one person ("I liked the size, but my wife hated it").

Aspect-level analysis
Input: A document or sentence
Output: many tuples of (polarity, target, possibly holder)

Instead of using the linguistic units of a sentence or a document, we can use individual opinion expressions as the main unit of what we want to extract. A sentence "I liked the UI, but my wife thought the ring tones were horrible" would result in two tuples: (positive, UI, author) and (negative, ring tones, my wife). The different tuples can then be added to get one polarity per aspect, per holder or even an overall polarity

What is sentiment analysis?

People have always been interested in other opinions before taking a decision, traditionally by asking friends or reading surveys or professional reviews published in a magazine. Over the last years, huge amounts of opinions have become available on the web in the form of consumer reviews, blogs, forum posts or twitter and they are widely used in decision making:

(c) Randall Munroe,

Companies have caught up to this and opinions have become a topic of interest for many of them in recent years. Because of the huge amount of available opinions, it is impossible for a human to read them all, so an automatic method of analyzing them is necessary: Sentiment analysis is born (sometimes also called subjectivity analsyis or opinion mining). In its most basic form, sentiment analysis attempts to determine whether a givne text expresses an opinion and whether this opinion is positive or negative.

You might think that we could just use the star ratings that usually accompany reviews to perform the same task. If we only want to know the sentiment of a complete review, this would indeed give us more or less the same information. But reviews usually contain much more detailed information:

(c) Randall Munroe,

While it is nice that the app has a good UI and runs smoothly, a user might want to assign more weight to the review that discusses the aspect of "warning about a tornado". After all, this is the main functionality of the app and probably the main reason of getting it. Everything else is an added bonus. Besides identifying reviews that discuss the important aspects of an item, real reviews usually also contain opinions on a variety of aspects that may be evaluated quite differently by different users ("I liked the UI, but I hated the alarm tones"). So a more detailed analysis is necessary, and this is what makes sentiment analysis interesting!

Tidyverse R package


Tidyverse is a collection of R packages for the comfortable cleanup of data. Loading the package tidyverse will laod the core tidyverse packages: tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.


  • Tidyverse is optimized for interactive workflow with data
  • Each function does one thing easy and well
  • Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
  • Everything works with tibbles
  • Web page:
  • Workshop page (withe example scripts):

Core packages


  • A modern version of dataframes
  • The first argument of every tidyverse function and what every tidyverse function returns
  • Tibbles use characters instead of factors for texts
  • Tibbles have nicer printout than normal dataframes: show data type of columns, number of rows, only the first few rows/columns not all of the data
 mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
 age <- c(NA, 30, 20, 25, 18)
 pre <- round(rnorm(length(mynames)), 2)
 post <- round(rnorm(length(mynames)), 2)
 mydata <- tibble(mynames, age, pre, post)  # create tibble from data
 mydf <- data.frame(mynames, age, pre, post)
 as_tibble(mydf)  # convert data frame into tibble


  • Does the same as read.csv from base R, it reads a csv file
  • Faster
  • Automatically creates tibbles
  • Progress bar for big files


  • A data frame is a rectangular array of variables (columns) and observations (rows)
  • A tidy data frame is a data frame where…
    ** Each variable is in a column.
    ** Each observation is a row.
    ** Each value is a cell.

  • Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …

  • Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
  • Two functions: gather() to convert from wide format to long format and spread() to convert from wide format to long format
 # Convert to long format, so that every observation is one row,
 # with either the text 'pre' or 'post' in the column 'exam'
 # and the value that was in pre or post now in the column 'score'
 tidydf <- gather(mydata, exam, score, pre:post)

 # From tidydf create the same thing back that we had in mydata (wide format)
 spread(tidydf, exam, score)
  • Easily split columns with separate() and merge with unite()
    court # tibble with lots of comma-separated text in one column ‘text’
 # Split it into 14 columns with the names A-N, 
 # Convert = True -> try to guess the datatypes, otherwise everything would be characters
 court <- separate(court, text, into = LETTERS[1:14], convert = T)

 # Put columns B, C and D into one column 'condition'
 court <- unite(court, condition, B, C, D)


  • Filter rows with filter()
 filter(mydata, !, pre>0, !duplicated(mynames))
 filter(mydata, mynames %in% c('jkl', 'bla'))
 filter(mydata, post > pre)
  • Select columns with select()
 select(mydata, pre) # select a column
 select(mydata, -pre) # select everything besides this column
 select(mydata, age:pre) # select all columns between pre and post
 select(mydata, -(pre:post)) # select all columns besides those between pre and post
 select(mydata, pre:post, age, mynames) # select and reorder
  • Sort a tibble by a column with arrange()
 arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
  • Rename one or more columns with rename()
 rename(mydata, newname=pre, othernew=post)
  • Add new columns with mutate() and transmute()
        diff = pre-post, 
        diff = diff*2, 
        diff_c = diff-mean(diff, na.rm=T))
 mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
 # transmute does the same, but returns only newly defined columns
 transmute(mydata,  diff = pre-post,  diff2 = diff*2) 
  • Aggregate data with summarize()
 mydata %>% group_by(gender) %>% 
        summarise(MeanAge = mean(age, na.rm=T), Mean = mean(score, na.rm=T), SD = sd(score, na.rm=T))
 # na.rm -> remove NA values
  • Merge tibbles with left_join() (there are also other joins)

Other packages


  • Pipes: %>%
  • Send the same dataframe as input to a pipeline of actions.
  • Example:
 mydf %>%
        filter(! %>%
        mutate(LogFreq = log(Freq)) %>%
        group_by(Condition) %>%
        summarise(mean = mean(LogFreq))
  • Does the same as:
 mydf.filtered <- filter(mydf, !
 mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
 mydf.grouped <- group_by(mydf.log, mydf.log)
 summarise(mydf.grouped, mean = mean(LogFreq))


  • “An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
  • “A geom is the geometrical object that a plot uses to represent data.”
  • General form:
 ggplot(data = <DATA>) +
                mapping = aes(<MAPPINGS>),
                stat = <STAT>,
                position = <POSITION>
        ) +
  • Examples:
 ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aestetics/graphic properties
        mapping = aes( # aes -> set of aestetics mappings,
           x = pred, y = resp #  map x/y-values of plot to dataframe columns with these names
         )) + geom_point() # add shape to the plot
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_histogram(binwidth = .5, 
        fill = rgb(0.2,0.4,0.8,0.6),   # rgb values in [0..1], last part is alpha
        color = 'black')   # use colors() to get a list of all colors
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black') 


  • Basic String manipulation
 s1 <- "THis is a String 123 that has numbers 456 "
  • String concatenation and splitting
 str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
 s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
 xsplit <- str_split(s2, ' ') # Returns a list of character vectors
 unlist(xsplit) # Flattens the list into a vector of characters
 str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
  • Substrings
 str_sub(s2, 1, 1) # get the first letter of every entry
  • Regular expressions on a (list of) Strings
 str_view(s1, "(S|s)tr") # Search and show the result
 str_detect(s1, "[0-9]") # Check presence
 str_extract(s1, "[0-9]+") # Extract the (first) match
 str_replace(s1, "[0-9]+", ":)") # replace first occurrence
 str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all

Mitgliederbeiträge mit Hibiscus und JVerein einziehen

Der Einzug hat zwei Teile: Erstmal müssen wir definieren für wen wir den Beitrag einziehen wollen, dann müssen wir die Lastschriften über die Bank einziehen lassen.

Die Ermittlung der Mitglieder und ihrer Beiträge geht in JVerein unter Abrechnung im Ordner Abrechnung. Dort ist folgendes einzugeben:

  • Modus: Alle
  • Fälligkeit: Datum eintragen an dem eingezogen werden soll
  • Stichtag: 31.12.[Jahr]
  • Zahlungsgrund: Mitgliedsbeitrag [Jahr]
  • SEPA-Datei: Nein
  • Abbuchungsausgabe: Hibiscus

In Hibiscus gibt des dann unter SEPA Zahlungsverkehr den Punkt SEPA Lastschriften. Dort sollte jetzt für jedes Mitglied eine Lastschrift mit dem korrekten Mitgliedsbeitrag aufgeführt sein. Um sie tatsächlich auszuführen muss man mit rechts klicken und Jetzt ausführen wählen. Dann sollte Hibiscus nach einer TAN fragen. Und dann sollte Geld abgebucht werden.

Leider kann man in unserer Version von Hibiscus nur einzelne Lastschriften machen, d.h. man muss für jedes Mitglied neu klicken und die TAN eingeben. Alternativ kann man eine SEPA-Datei erzeugen und die bei der Bank im Online-Banking hochladen.

Cropmarks (Beschnittmarken)

If you print a professional book where graphics go until the very edge of the page, you need to give a file with cropmarks (Beschnittmarken) to the printer. This is how to produce them in LaTeX, it is actually very simple with the package crop:


The book in question was A5 paper which has a size of 148mm x 210mm. I wanted to have 3 more milimeters to each side. This makes the final paper size I want to have 154mm x 216mm, which I have given above. As I want the markings equally on each side, I use the option center to center the content in the middle of the larger page. The option cam print standard cropmarks (this is also the default).

Now the only thing left to do is adjust the graphics. For every page where a graphic goes to the edge of the page, increase it a little bit over the margin. There are several ways to do that, depending on what exactly it is you do. This is an example for a colored background image that fills the whole page. The important part is \dimexpr\paperwidth+6mm which just adds 6mm to the height of the picture:

\begin{tikzpicture}[remember picture, overlay]
\node[inner sep=0pt] at (current { 

Find file names with invalid encoding on Linux

I have files copied from Windows computers in ancient times. The filenames contain special characters and they have been messed up somewhere along the way. For example I got a file named 9.5.2 Modelo de aceptaci??n (espa??ol).doc in the folder 9 Garant??a del Estado.

First, I want to find and list these files. Stackexchange tells us how to do that:

LC_ALL=C find . -name '*[! -~]*'

This will find all names that have non-ASCII letters, not only those that are broken. But in my case I have folders where ALL of the names are broken, so I don’t mind.

Second, I want to fix the names. I did it manually, but for future reference, if I ever were to do anything like that again, I might use one of the solutions proposed in this thread on

Unison preferences for syncing

Unison is a tool to compare and synchronize two folders. You can configure it by GUI, but at least for me (Kubuntu 18.04) not all settings work. Specifically, I cannot set the value “0”. But there is an easy way around the problem. Unison puts a file called Profilename.prf (where “Profilename” should be replaced with the actual name of your profile) into the folder .unison in your home directory. This is simply a text file with key-value pairs, that you can edit at your leisure.

Here are standard settings for comparing two directories without comparing the file permissions:

label = My first comparison
root = /home/test/FolderOne/
root = /home/test/FolderTwo/
perms = 0
dontchmod = true

Now for the coolest feature of Unison: It is written in OCaml!! OCaml was used in my third semester to teach functional programming. I remember clearly the teacher telling us about the “usefulness” of the language. She had one slide with examples of programs written in OCaml. And she must have looked very hard to find any. There were a grand total of three programs on the slide. Two formal logic resolvers or something to that effect (we were like “yeah, really useful”). And MLDonkey (peer-to-peer filesharing was BIG in those days before Netflix, Spotify and fast internet) which she clearly didn’t know what it was for. So now, if she still has that slide, I can add another program! And a really useful one at that!

Typearea settings for a scrbook

Here are dirty hacks for page layout in LaTeX and a few useful standard settings.

I load the following document class:


The option bibliography=totoc puts the bibliography into table of contents. The option 12pt sets normal font size to 12pt instead of the usual 11pt. This font size was a requirement for my thesis. The option a4paper sets DIN A4 paper (which should be the default anyway). The option headsepline adds a line below the header content on pages without a title.

A useful additional option may be oneside, which creates symmetric margins for one-sided print. Again, that was a requirement for the manuscript of my thesis. For the final print, I needed a normal two-sided print. A useful option there is titlepage=firstiscover, which gives equal margins for the first two pages (the book cover).

Usually you don’t want to tamper with the margins that LaTeX gives you. But in some cases, you may have specific guidelines that you need to adhere to. Or you have a fixed number of pages and run out of space, so you want smaller margins. Anyway, this is not recommended, I am just showing you how it works, because I can.

We have loaded the documentclass scrbook which is the KOMA-Script document class for an DIN A4 page book with a font size of 12pt. At that paper and font size, KOMA-Script uses a value of DIV=12 to calculate margins and text area sizes. The page has a width of 157.50mm and a height of 222.75mm for the text area. The top margin is 24.75mm and the inner margin 17.50mm. You can increase or decrease the margins by setting a different DIV value. So if you use the option DIV=13 for example, you will have a bigger text area (161mm wide instead of only 157mm). You can play around with the values until you find something you like. Here are the measurements for different DIV values for an DIN A4 page:

If you don’t find anything you like, you can set all values by hand with the geometry package. Use at your own peril. This is an example with a larger text height:

\usepackage[width=157.50mm,top=35mm,left=24mm]{geometry}  % gives textheight=226.36mm

When you play around with margins and text area settings, the package showframe is useful to see what you are doing: