About swk

I am a software developer, data scientist, computational linguist, teacher of computer science and above all a huge fan of LaTeX. I use LaTeX for everything, including things you never wanted to do with LaTeX. My latest love is LilyPond, aka LaTeX for music. I'll post at irregular intervals about cool stuff, stupid hacks and annoying settings I want to remember for the future.

Version conflict in Google protobuf on EMR cluster

I have written a Spark application that depends on a library which in turn uses version 3.3.1 of the Google protobuf library. On my computer I can run it with my local Spark and everything is fine. But now I want to run it on an EMR cluster on AWS, and I get this:

java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;

It seems that this is a common incompatibility issue (says Stack Overflow). Java libraries are loaded from the classpath, which is searched from front to back. On the EMR cluster, the Spark and Hadoop libraries are inserted before the user libraries. So the old protobuf version that ships with them is found first, and that version lacks the method my library calls.
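To make the ordering problem concrete, here is a toy model of a first-match lookup in Python. It is only an illustration of "front to back" searching, not how the JVM or Spark actually loads classes, and the jar names are made up.

```python
# Toy model of first-match class lookup; NOT the real JVM or Spark class loader.
# The jar names below are hypothetical.

def resolve(class_name, classpath):
    """Return the first jar on the classpath that provides class_name."""
    for jar, classes in classpath:
        if class_name in classes:
            return jar
    raise LookupError(class_name)

CLASS = "com.google.protobuf.CodedInputStream"

cluster_jars = [("hadoop-protobuf-old.jar", {CLASS})]
user_jars = [("my-app-protobuf-3.3.1.jar", {CLASS})]

# Default on the cluster: framework jars come first, so the old class wins.
default_order = cluster_jars + user_jars
# With user jars first, the version bundled with the application wins.
user_first_order = user_jars + cluster_jars

print(resolve(CLASS, default_order))
print(resolve(CLASS, user_first_order))
```

Both jars contain a class with the same name, so whichever comes first on the search path decides which version the application sees.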

But this behaviour can be changed, as described in the Spark 2.4.4 documentation. So I added the two corresponding options to spark-submit in the EMR cluster step:

--conf spark.driver.userClassPathFirst=true 
--conf spark.executor.userClassPathFirst=true

And it worked!
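For my own reference, a small sketch of how the whole spark-submit call could be assembled in Python, for example when the step is created from a script. The jar location and main class are placeholders I made up; only the two --conf options come from this post.

```python
import shlex

def spark_submit_args(app_jar, main_class, user_first=True):
    """Build a spark-submit argument list; app_jar and main_class are placeholders."""
    args = ["spark-submit", "--deploy-mode", "cluster", "--class", main_class]
    if user_first:
        # Give the jars shipped with the application precedence over the cluster's jars.
        args += ["--conf", "spark.driver.userClassPathFirst=true",
                 "--conf", "spark.executor.userClassPathFirst=true"]
    args.append(app_jar)
    return args

# Hypothetical jar location and class name.
args = spark_submit_args("s3://my-bucket/my-app.jar", "com.example.Main")
print(" ".join(shlex.quote(a) for a in args))
```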

NB: The solution proposed on Stack Overflow is to create a “shaded” fat jar. That probably works as well and should be simple for Maven users. Maybe my next post is about that 😉

Login to AWS with the Java SDK and role assumption

The following seems to work for me as contents of .aws/credentials:

[default]
region = eu-west-1
role_arn = arn:aws:iam::12345:role/blubb
source_profile = blabla
[blabla]
aws_access_key_id = XXX
aws_secret_access_key = YYY
aws_session_token = ZZZ
valid_until = DDD

The content of .aws/config does not seem to matter. So it can be whatever it needs to be for the CLI.
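To see how the two profiles hang together, the file can be read with Python's standard configparser. This is just a sketch of the relationship between the sections (default points to blabla via source_profile), not the resolution logic of the Java SDK; the function name describe_profile is my own.

```python
import configparser

# Same structure as the credentials file above; all values are placeholders.
CREDENTIALS = """
[default]
region = eu-west-1
role_arn = arn:aws:iam::12345:role/blubb
source_profile = blabla
[blabla]
aws_access_key_id = XXX
aws_secret_access_key = YYY
aws_session_token = ZZZ
valid_until = DDD
"""

def describe_profile(text, name="default"):
    """Show which role a profile assumes and which profile supplies the keys."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    profile = parser[name]
    # Follow source_profile once; a profile without it supplies its own keys.
    if "source_profile" in profile:
        source = parser[profile["source_profile"]]
    else:
        source = profile
    return {
        "role_arn": profile.get("role_arn"),
        "credentials_from": source.name,
        "access_key_id": source.get("aws_access_key_id"),
    }

info = describe_profile(CREDENTIALS)
print(info)
```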

Throw away your changes and reset a git branch to remote

Sometimes, you do something really stupid and just want to get rid of it. Or you are suddenly in the middle of a complicated merge and don’t really know why anymore (“But I didn’t change anything!”). In this case, if you are sure you want to throw away everything in your local branch and just want to be at the same status as the remote branch, this is your rescue:

git reset --hard origin/<branchname>

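But be careful: the --hard option discards all uncommitted changes to tracked files, and those cannot be recovered. Local commits that are not on the remote branch also drop out of the branch history, although git reflog can still find them for a while.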
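If you want a small safety net, the reset can be wrapped in a script that first fetches the remote and optionally leaves a backup branch at the current commit. The sketch below assumes the branch to reset is the one currently checked out; the function names and the backup branch name are my own invention.

```python
import subprocess

def reset_to_remote_commands(branch, remote="origin", backup=None):
    """Return the git commands for a hard reset to the remote branch."""
    # Make sure origin/<branch> reflects the current state of the remote.
    commands = [["git", "fetch", remote]]
    if backup:
        # Leave a branch at the current commit so local commits stay reachable.
        commands.append(["git", "branch", backup])
    commands.append(["git", "reset", "--hard", f"{remote}/{branch}"])
    return commands

def run(commands, cwd="."):
    """Execute the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)

commands = reset_to_remote_commands("main", backup="before-reset")
for command in commands:
    print(" ".join(command))
# run(commands)  # uncomment to actually reset the checked-out branch
```

The backup branch only protects commits. Uncommitted changes are still lost by the reset.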

Creating a donation receipt with jVerein

To generate a donation receipt (Spendenbescheinigung) automatically, the booking must be correctly marked as a donation and assigned to a member. This works as follows:
1. Under “JVerein – Buchführung – Buchungen”, find the booking in question and open its detail view.
2. Select the booking type (Buchungsart) “Spende” (or whatever the corresponding category is called). At “Mitgliedskonto”, click the dots, go to the tab “Soll und Ist” and select the correct person in the table at the bottom (a search over the members has been run there using the name on the bank transfer; if it did not find the correct member: oh dear!). As a result, the field now reads “Name, Sollbuchung erzeugen”.

Once all bookings have been recorded correctly in this way, the receipts can be generated. This works as follows:
1. “JVerein – Spendenbescheinigungen”
2. Click “neu (automatisch)” at the bottom.
3. Select the correct year and template. A list of the donations should then appear at the bottom.
4. Click “erstellen”.
5. The corresponding entries should now be found under “Spendenbescheinigungen”. Select each entry there, click “speichern” once and close the entry again (absolutely necessary!!!)
6. Reopen the entry and use “pdf (individuell)” to generate a PDF with the donation receipt.

The receipt can now be printed, sent, or whatever else is needed.

Clean Spark environment in Zeppelin

When you have experimented a lot, your notebook accumulates many variables, and quite often you are no longer sure what is defined and what is not. Zeppelin does not really give you a good tool to inspect or reset that state, so I wrote my own function to clean up the global variables of Python:

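def clear(keep=("__builtins__", "clear", "completion", "z", "__zeppelin_completion__", "_zsc_", "__zSqlc__", "__zeppelin__", "__zSpark__", "sc", "spark", "sqlc", "sqlContext")):
    """Remove all global names except the Zeppelin and Spark handles in keep."""
    keeps = {}
    # .items() works on Python 2 and 3; list() takes a snapshot before anything is changed.
    for name, value in list(globals().items()):
        if name in keep:
            keeps[name] = value
    globals().clear()
    # Put back the names that Zeppelin and Spark need.
    for name, value in keeps.items():
        globals()[name] = value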

You can then call the function with clear().

Additionally, if you are dealing with libraries, you may want to explicitly remove them with del sys.modules['modulename'].
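A minimal sketch of that module cleanup, wrapped in a helper that also drops submodules. The name forget_module is my own, and colorsys is only used here as a harmless standard-library stand-in for your library.

```python
import importlib
import sys

def forget_module(name):
    """Drop a module and its submodules from the import cache."""
    for loaded in list(sys.modules):
        if loaded == name or loaded.startswith(name + "."):
            del sys.modules[loaded]

import colorsys            # stand-in for the library you are working on
old = colorsys

forget_module("colorsys")
# The next import executes the module code again and creates a fresh module object.
colorsys = importlib.import_module("colorsys")
print(colorsys is old)
```

Objects created from the old module keep referring to the old code, so it is usually safest to recreate them after the re-import.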

CloudFormation templates for existing resources

You can use the AWS CLI to get a description of existing resources. This description is in JSON format, and parts of it can sometimes be used directly in a CloudFormation template.

Example:

aws glue get-job --job-name MyJobName

Makes the job easier! For example, with the Glue job we can see the undocumented options:

"--enable-metrics": ""
"--TempDir": "s3://blablubbtest"
"--enable-continuous-cloudwatch-log": "true"

So since CloudFormation is a pain to debug, a possible way to write a template might be to click the resources together in the Console, then get their description with the CLI, and use that to create the resources with CloudFormation next time.
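As a rough sketch of that workflow, the CLI output for a Glue job could be wrapped into a template skeleton like this. The list of copied keys is my assumption and should be checked against the AWS::Glue::Job reference; the role, the script location and the logical ID are made-up placeholders, and the sample JSON is a shortened, hypothetical CLI output.

```python
import json

# Keys assumed (not verified here) to carry the same name in the CLI output
# and in the AWS::Glue::Job resource properties.
COPIED_KEYS = ("Name", "Role", "Command", "DefaultArguments",
               "MaxRetries", "Timeout", "GlueVersion")

def glue_job_to_template(description, logical_id="MyGlueJob"):
    """Wrap the output of `aws glue get-job` into a template skeleton."""
    job = description["Job"]
    # Read-only fields such as CreatedOn are dropped by the whitelist.
    properties = {key: job[key] for key in COPIED_KEYS if key in job}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            logical_id: {"Type": "AWS::Glue::Job", "Properties": properties}
        },
    }

# Hypothetical, shortened output of: aws glue get-job --job-name MyJobName
description = json.loads("""
{
  "Job": {
    "Name": "MyJobName",
    "Role": "MyGlueRole",
    "CreatedOn": 1565000000.0,
    "Command": {"Name": "glueetl", "ScriptLocation": "s3://blablubbtest/script.py"},
    "DefaultArguments": {
      "--enable-metrics": "",
      "--TempDir": "s3://blablubbtest",
      "--enable-continuous-cloudwatch-log": "true"
    }
  }
}
""")

template = glue_job_to_template(description)
print(json.dumps(template, indent=2))
```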

Disable graphical prompt for ssh passphrase

When I open an ssh session in the terminal, it asks for my passphrase in a graphical prompt window. That would be ok in theory. But I don’t know my passphrase. So I need to copy it from my password manager. And unfortunately the stupid window doesn’t allow me to access anything else. So, I wanted to disable it.

The usual way is with the environment variable SSH_ASKPASS. To disable the graphical prompt, just remove the value of this variable:

unset SSH_ASKPASS

Unfortunately, in my case this did not work and I also needed to unset another variable (the one that tells ssh where to find the running agent):

unset SSH_AUTH_SOCK
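If you only want this for a single call instead of the whole shell session, the same idea can be sketched in Python by starting ssh with a copy of the environment that lacks both variables. The helper names are my own.

```python
import os
import subprocess

GRAPHICAL_PROMPT_VARS = ("SSH_ASKPASS", "SSH_AUTH_SOCK")

def terminal_prompt_env(env=None):
    """Return a copy of the environment without the two ssh variables."""
    env = dict(os.environ if env is None else env)
    for name in GRAPHICAL_PROMPT_VARS:
        env.pop(name, None)  # ignore variables that are not set
    return env

def ssh(host):
    """Start ssh with the cleaned environment; the current shell stays untouched."""
    return subprocess.call(["ssh", host], env=terminal_prompt_env())

example = terminal_prompt_env({
    "SSH_ASKPASS": "/usr/bin/some-askpass",  # hypothetical values
    "SSH_AUTH_SOCK": "/tmp/agent.sock",
    "HOME": "/home/me",
})
print(example)
```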

Pip and custom prefixes… again! This time it’s Ubuntu’s fault

I wanted to install a Python library to a custom location. Thanks to a long fight with Python on that issue (I can’t believe I haven’t blogged about this!), I know that --prefix does the trick for pip. So I run pip and this happens:

> pip3 install --prefix tmp/ boto3
ERROR: Can not combine '--user' and '--prefix' 
as they imply different installation locations

Alternatively the error is:

distutils.errors.DistutilsOptionError: can't combine user
with prefix, exec_prefix/home, or install_(plat)base

It seems to be an option that Ubuntu adds by default. The magic solution comes from a GNU bug tracker thread:

> pip3 install -U pip

Basically, this installs pip into my user directory (you can find it now in .local/bin/pip). pip3 still fails afterwards with a version mismatch:

> pip3 install --prefix tmp/ boto3
Traceback (most recent call last):
  File "/usr/bin/pip3", line 9, in <module>
    from pip import main
ImportError: cannot import name 'main'

But now I can call my local pip (which is a pip3):

> pip install --prefix tmp/ boto3
Collecting boto3
...
Successfully installed boto3-1.9.206 botocore-1.12.206

To force a re-install, even if the library is already installed somewhere else, use the flag --ignore-installed.
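When the install has to happen from a script, one way that might sidestep the broken /usr/bin/pip3 wrapper is to call pip as a module of the running interpreter. I have not verified this on the Ubuntu setup described above, so treat it as an assumption; the helper name is my own.

```python
import subprocess
import sys

def pip_install_command(package, prefix, ignore_installed=False):
    """Build a pip call that runs pip as a module of the current interpreter."""
    command = [sys.executable, "-m", "pip", "install", "--prefix", prefix]
    if ignore_installed:
        # Install into the prefix even if the package already exists elsewhere.
        command.append("--ignore-installed")
    command.append(package)
    return command

command = pip_install_command("boto3", "tmp/", ignore_installed=True)
print(" ".join(command))
# subprocess.run(command, check=True)  # uncomment to actually install
```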