Version conflict in Google protobuf on EMR cluster

I have written a Spark application that uses a library that uses the Google protobuf library in version 3.3.1. On my computer I can run it with my local Spark and everything is fine. But now, I want to run it on an EMR cluster in on AWS. And I get this:

java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;

It seems that this is a common incopatibility issue (says stackoverflow). Java libraries are loaded from the classpath. The classpath is searched from front to back. And in the EMR cluster, Spark and Hadoop libraries are inserted before the user libraries. So the old protobuf version is found.

But there is the possibility to change this behaviour described in the Spark 2.4.4 documentation. So I added two corresponding options to spark-submit in the EMR cluster step:

--conf spark.driver.userClassPathFirst=true 
--conf spark.executor.userClassPathFirst=true

And it worked!

NB: The proposed solution at stackoverflow is to create a “shaded” fat jar. That probably works as well and should be simple for Maven users. Maybe my next post is about that 😉

Accessing JVM arguments from inside Java

Whenever I get a ClassNotFoundException error in Java, I think to myself “but it is there!” and then I correct the typo in the classpath or get angry at Eclipse for messing up my classpath. Lately I have programmed in more complex settings where it was not always clear to me where the application gets the classpath from, so I wanted to check which of my libraries actually end up on the classpath. Turns out it is not very complicated. Here is code to print a number of useful things:

System.out.println("Working directory: " 
      + Paths.get(".").toAbsolutePath().normalize().toString());
System.out.println("Classpath: " 
      + System.getProperty("java.class.path"));
System.out.println("Library path: " 
      + System.getProperty("java.library.path"));
System.out.println("java.ext.dirs: " 
      + System.getProperty("java.ext.dirs"));

The current working directory is the starting point for all relative paths, e.g., for reading and writing files. The normalization of the path makes it a bit more readable, but is not necessary. The class Paths is from the package java.nio.file.Paths. The classpath is the place where Java looks for (bytecode for) classes. The entries should be folders or jar-files. The Java library path is where Java looks for native libraries, e.g., platform dependent things. You can of course access other environment variables with the same method, but I cannot at the moment think of a useful example.

Related (at least related enough to put it into the same post), this is how you can print the space used and available on the JVM heap:

int mbfactor = 1024*1024;
System.out.println("Memory free/used/total/max " 
      + Runtime.getRuntime().freeMemory()/mbfactor + "/"
      + (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/mbfactor + "/"
      + Runtime.getRuntime().totalMemory()/mbfactor + "/"
      + Runtime.getRuntime().maxMemory()/mbfactor + " MB"
);

Sorting a HashMap by Value in Java

Sometimes Java drives you nuts… you want to save word bigrams and their frequencies as an example. A HashMap<String, Integer> is very convenient. Now we want to sort it by value. And the fun begins! Of course we do not want to lose the connection between keys and values, so we cannot just use Collections.sort(map.keySet()) (or the same for values). Also using a TreeMap with a custom-wrote Comparator for our pairs does not work, because there you cannot have identical values (another fun fact).

Here is a generic method that sorts the given HashMap by value. V can be anything as long as it can be compared with itself.

public static <K, V extends Comparable<V>> List<Entry<K, V>> sortHashMapByValue (HashMap<K, V> theMap) {
   List<Entry<K, V>> resultList = new ArrayList<Entry<K, V>>(theMap.entrySet());
   Collections.sort(resultList,
          new Comparator<Entry<K, V>>() {
            @Override
            public int compare(Entry<K, V> o1, Entry<K, V> o2) {
               return o1.getValue().compareTo(o2.getValue());
            }
       });
   return resultList;