Version conflict in Google protobuf on EMR cluster

I have written a Spark application that uses a library which in turn uses the Google protobuf library in version 3.3.1. On my computer I can run it with my local Spark installation and everything is fine. But now I want to run it on an EMR cluster on AWS, and I get this:

java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;

It seems that this is a common incompatibility issue (says Stack Overflow). Java classes are loaded from the classpath, which is searched from front to back. On the EMR cluster, the Spark and Hadoop libraries are placed before the user libraries, so their old protobuf version is found first.
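To check which jar actually wins, one can print the origin of the protobuf class from inside the application. This is only a quick diagnostic sketch (the class name WhichProtobuf is made up; the same line could also be logged from the Spark driver code):

import com.google.protobuf.CodedInputStream;

public class WhichProtobuf {
    public static void main(String[] args) {
        // Prints the jar that the protobuf class was actually loaded from,
        // e.g. a bundled Spark/Hadoop jar instead of the 3.3.1 jar shipped with the app.
        System.out.println(CodedInputStream.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}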

But the Spark 2.4.4 documentation describes a way to change this behaviour. So I added the two corresponding options to spark-submit in the EMR cluster step:

--conf spark.driver.userClassPathFirst=true 
--conf spark.executor.userClassPathFirst=true
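For context, the full spark-submit call in the EMR step then looks roughly like this (class name, bucket and jar name are placeholders, not my actual setup):

spark-submit \
  --deploy-mode cluster \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MySparkApp \
  s3://my-bucket/my-spark-app.jar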

And it worked!

NB: The solution proposed on Stack Overflow is to create a “shaded” fat jar. That probably works as well and should be simple for Maven users. Maybe my next post will be about that 😉

