Load data in Spark from multiple known partitions

Posted on October 19, 2022 by swk

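// basePath is the root directory of the partitioned data; each entry in paths is a partition directory below it.
val paths = Seq("path1", "path2", "path3")
val data = spark.read.option("basePath", basePath).parquet(paths: _*)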
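
A minimal PySpark sketch of the same idea, assuming a Hive-style layout (column=value directories). The helper names, the bucket, the column name and the dates are made up for illustration. Setting basePath tells Spark where partition discovery starts, so the partition column still shows up in the resulting DataFrame even though only some partition directories are read.

```python
def partition_paths(base_path, column, values):
    """Build one directory path per known partition value (Hive-style layout assumed)."""
    base = base_path.rstrip("/")
    return ["{}/{}={}".format(base, column, value) for value in values]


def read_partitions(spark, base_path, column, values):
    """Read only the listed partitions; needs an existing SparkSession."""
    paths = partition_paths(base_path, column, values)
    # parquet() accepts several paths; basePath keeps the partition column in the schema.
    return spark.read.option("basePath", base_path).parquet(*paths)


# Hypothetical bucket and partition values, for illustration only.
paths = partition_paths("s3a://my-bucket/events/", "date", ["2022-10-17", "2022-10-18"])
```

With a running session, read_partitions(spark, "s3a://my-bucket/events/", "date", [...]) would return a DataFrame restricted to those partitions.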

Run Spark locally and access S3

Posted on February 7, 2022 by swk

There are three ways to set the Spark master and the S3A credentials provider. By changing the code:

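import org.apache.spark.SparkConf

// The "spark.hadoop." prefix makes Spark copy the setting into the Hadoop configuration.
val sparkConfig = new SparkConf()
   .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
   .setMaster("local[*]")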

By adding JVM arguments to Java:

-Dspark.master=local[*]
-Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain

By setting the JVM properties in code, before the SparkConf is created (I have not tested whether this works for the credentials provider, but it should):

System.setProperty("spark.master", "local[*]")
System.setProperty("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

The AWS credentials are taken from the default profile, or you can select a profile with the environment variable AWS_PROFILE=<your profile>.
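
For PySpark, the same two settings can be passed to the session builder. This is only a sketch: the function name and the default application name are made up, and it assumes that pyspark and a matching hadoop-aws jar are available. I have not verified it against every Spark version.

```python
import os

# The same two settings as above; the "spark.hadoop." prefix forwards the key to Hadoop.
LOCAL_S3_SETTINGS = {
    "spark.master": "local[*]",
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
}


def local_s3_session(profile=None, app_name="local-s3"):
    """Create a local SparkSession that takes S3 credentials from the AWS default chain."""
    if profile is not None:
        # Set the profile before the JVM starts so the AWS SDK can see it.
        os.environ["AWS_PROFILE"] = profile
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in LOCAL_S3_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling local_s3_session(profile="my-profile") would then pick up the credentials of that (hypothetical) profile.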

Working configuration for Spark, Hadoop and AWS SDK

Posted on February 3, 2022 by swk

Tested combination 1 (works):
– Spark 2.4.4
– Hadoop 3.1.1
– AWS SDK 1.11.271

Tested combination 2 (works):
– Spark 2.4.4
– Hadoop 2.8.5
– AWS SDK 1.11.271

Version conflict in Google protobuf on EMR cluster

Posted on May 15, 2020 by swk

I have written a Spark application that depends on a library which in turn uses Google protobuf version 3.3.1. On my computer it runs fine with my local Spark. But now I want to run it on an EMR cluster on AWS, and I get this:

java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;

It seems that this is a common incompatibility issue (according to Stack Overflow). Java libraries are loaded from the classpath, which is searched from front to back. On the EMR cluster, the Spark and Hadoop libraries are placed before the user libraries, so the old protobuf version is found first.

But the Spark 2.4.4 documentation describes a way to change this behaviour. So I added the two corresponding options to spark-submit in the EMR cluster step:

--conf spark.driver.userClassPathFirst=true 
--conf spark.executor.userClassPathFirst=true

And it worked!
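
To make the ordering argument concrete, here is a toy model in Python. It is not how the JVM or Spark is implemented (the real mechanism is a class loader, not a list lookup), and the jar names are hypothetical. It only shows why "first match wins" picks the old library by default and the new one when user jars come first.

```python
def resolve(class_name, classpath):
    """Return the first classpath entry that provides class_name (first match wins)."""
    for jar_name, provided_classes in classpath:
        if class_name in provided_classes:
            return jar_name
    raise LookupError("class not found: " + class_name)


CODED_INPUT_STREAM = "com.google.protobuf.CodedInputStream"

# Hypothetical jar names; both jars provide the same class in different versions.
cluster_jars = [("cluster/protobuf-java-OLD.jar", {CODED_INPUT_STREAM})]
user_jars = [("user/protobuf-java-3.3.1.jar", {CODED_INPUT_STREAM})]

# Default on the cluster: Spark and Hadoop jars come before the user jars.
default_order = resolve(CODED_INPUT_STREAM, cluster_jars + user_jars)

# With userClassPathFirst=true the user jars take precedence.
user_first_order = resolve(CODED_INPUT_STREAM, user_jars + cluster_jars)
```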

NB: The solution proposed on Stack Overflow is to create a “shaded” fat jar. That probably works as well and should be simple for Maven users. Maybe my next post is about that 😉

Clean Spark environment in Zeppelin

Posted on November 21, 2019 by swk

When you have experimented a lot, your notebook accumulates many variables, and quite often you are not sure what is defined and what is not. Zeppelin does not really give you a good tool to clean this up, so I wrote my own function to clear the global variables of Python:

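# Names used by Zeppelin and Spark are kept; everything else is removed.
def clear(keep=("__builtins__", "clear", "completion", "z", "__zeppelin_completion__", "_zsc_", "__zSqlc__", "__zeppelin__", "__zSpark__", "sc", "spark", "sqlc", "sqlContext")):
    # Save the entries to keep before wiping the global namespace.
    keeps = {}
    for name, value in globals().items():  # items() works in Python 2 and 3
        if name in keep:
            keeps[name] = value
    globals().clear()
    # Restore the saved entries.
    for name, value in keeps.items():
        globals()[name] = value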

You can then call the function with clear().

Additionally, if you are dealing with libraries, you may want to explicitly remove them with del sys.modules['modulename'].
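
A small sketch of that last step, with a helper (forget_module is a name I made up) that also removes submodules. It uses the standard library module colorsys as a stand-in for your own library. Note that variables still pointing to the old module object keep working with the old code, so re-import after removing.

```python
import importlib
import sys


def forget_module(name):
    """Drop a module and its submodules from the import cache; return the removed names."""
    prefix = name + "."
    removed = [m for m in list(sys.modules) if m == name or m.startswith(prefix)]
    for module_name in removed:
        del sys.modules[module_name]
    return removed


import colorsys  # small stdlib module used as a stand-in

removed = forget_module("colorsys")
gone = "colorsys" not in sys.modules

# Importing again executes the module code afresh and yields a new module object.
fresh = importlib.import_module("colorsys")
```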

Yesterday's Coffee

Too good to throw away – too hard to remember

Category Archives: Big data
