Load data in Spark from multiple known partitions
val data = spark.read.option("basePath", basePath).parquet(paths:_*)
Error while Processing folder changes in 'Deleted Items'. database disk image is malformed
Just delete the file! Evolution will re-create it. Source: https://mail.gnome.org/archives/evolution-list/2015-July/msg00209.html
docker network ls
docker network rm xxxx
# Figure out which commit you want to edit by getting its SHA.
git log
# Start an interactive rebase ($SHA = your commit's SHA and the ^ is important!).
git rebase --interactive $SHA^
# [Change 'pick' to 'edit' for your commit and save the buffer]
# [Add your changes with git add -p, etc.]
# Change the commit and optionally add --no-edit if you want to keep the existing message.
git commit --amend
# Finalize and apply the rebase.
git rebase --continue
# Or cancel the rebase and go back to what it was like before you started rebasing.
git rebase --abort
From Nick Janetakis – Change a Git Commit in the Past with Amend and Rebase Interactive (https://nickjanetakis.com/blog/change-a-git-commit-in-the-past-with-amend-and-rebase-interactive)
val sparkConfig = new SparkConf()
.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
.setMaster("local[*]")
By adding JVM arguments to Java:
-Dspark.master=local[*]
-Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
By setting the JVM property from Java (I have not tested if this works for the credentials provider, but it should):
System.setProperty("spark.master", "local[*]")
System.setProperty("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
The AWS credentials will be taken from the default profile, or you can specify the profile with the environment variable AWS_PROFILE=<your profile>.
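When the JVM arguments are passed from a shell, the square brackets in local[*] are glob characters, so it is safer to quote them. Here is a minimal sketch of a wrapper that does this. The function name run_with_spark_props and the jar name my-app.jar are placeholders I made up for illustration.

```shell
# Hypothetical helper: runs the given launcher (e.g. java) with the two
# Spark system properties placed in front of the remaining arguments.
run_with_spark_props() {
  launcher=$1
  shift
  # Quote local[*] so the shell never treats [*] as a glob pattern.
  "$launcher" \
    "-Dspark.master=local[*]" \
    "-Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" \
    "$@"
}
```

Assuming a runnable jar, it could be called as AWS_PROFILE=my-profile run_with_spark_props java -jar my-app.jar, where my-profile is again a placeholder.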
To check which ports on your computer are listening for connections:
ss -tlnp
Output:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=445,fd=3))
LISTEN 0 100 0.0.0.0:25 0.0.0.0:* users:(("master",pid=929,fd=13))
LISTEN 0 128 *:3306 *:* users:(("mysqld",pid=534,fd=30))
LISTEN 0 128 *:80 *:* users:(("apache2",pid=765,fd=4),("apache2",pid=764,fd=4),("apache2",pid=515,fd=4))
LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=445,fd=4))
LISTEN 0 100 [::]:25 [::]:* users:(("master",pid=929,fd=14))
LISTEN 0 70 *:33060 *:* users:(("mysqld",pid=534,fd=33))
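If you only care about which port belongs to which program, the output above can be condensed. This is a small sketch that assumes the column layout shown above; the function name listening_ports is my own. Note that ss usually only shows the process column for your own processes unless you run it as root, so the program name may come out empty.

```shell
# Hypothetical helper: reads `ss -tlnp` output on stdin and prints
# "port program" once per distinct pair, sorted by port number.
listening_ports() {
  awk 'NR > 1 && $1 == "LISTEN" {
    n = split($4, parts, ":")        # the port is the text after the last colon
    proc = $6
    sub(/^users:\(\("/, "", proc)    # drop the leading users:(("
    sub(/".*$/, "", proc)            # keep only the first program name
    print parts[n], proc
  }' | sort -u | sort -n
}
```

Usage: ss -tlnp | listening_ports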
I recently wanted to merge a pull request in a git repository where a lot had changed in master since the pull request was opened. The owner of the repository wanted me to rebase instead of merge, so this is what I figured out:
Step 1) Update your branches.
Update all branches from the remote repository to make sure everything is up to date (newfeature is your branch with the pull request, develop is the branch it should be merged to):
$ git checkout develop
$ git pull
$ git checkout newfeature
$ git pull
Step 2) Perform the rebase:
git rebase develop
...
CONFLICT (content): Merge conflict in <bla>
error: Failed to merge in the changes.
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Each time git stops with an error, resolve the conflicts and use git add. Do not commit anything during the rebase. Instead, just continue the rebase until everything is fine.
git rebase --continue
Now, you will have some modified files. You can make some final changes if necessary, and then commit your changes:
git add <bla> <blubb>
git commit -m "Resolve conflicts from rebase"
Step 3) Push your changes to the remote repository.
Unfortunately, using only git push will not work. It fails with the message “Updates were rejected because the tip of your current branch is behind its remote counterpart. Integrate the remote changes (e.g. ‘git pull …’) before pushing again.”
Do NOT use git pull at this moment!
Instead, force push your changes onto the remote repository:
git push --force-with-lease
Now, you have overwritten the history on the remote repository with your local history. That is fine. Now you can merge the pull request and delete the branch and go on your merry way.
Step 4) BUT…
If you are not yet at the point where you merge the pull request, but someone else wants to work with the branch, that person is in trouble! A normal git pull on the branch will fail, because history was changed! What that person needs to do is:
git pull --rebase
See also: Gerald Versluis: Git Rebase: Don’t be Afraid of the Force (Push)
I have written a Spark application that uses a library that uses the Google protobuf library in version 3.3.1. On my computer I can run it with my local Spark and everything is fine. But now, I want to run it on an EMR cluster on AWS. And I get this:
java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;
It seems that this is a common incompatibility issue (says stackoverflow). Java libraries are loaded from the classpath. The classpath is searched from front to back. And in the EMR cluster, Spark and Hadoop libraries are inserted before the user libraries. So the old protobuf version is found.
But the Spark 2.4.4 documentation describes a way to change this behaviour. So I added the two corresponding options to spark-submit in the EMR cluster step:
--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true
And it worked!
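To avoid retyping the two options, they can be wrapped in a small shell function. This is just a sketch: the function name submit_user_first and the SPARK_SUBMIT override variable are my own inventions, and com.example.MyApp and my-app.jar below are placeholders.

```shell
# Hypothetical wrapper: calls spark-submit (or whatever $SPARK_SUBMIT names)
# with both userClassPathFirst options set, then passes on all other arguments.
submit_user_first() {
  "${SPARK_SUBMIT:-spark-submit}" \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    "$@"
}
```

For example: submit_user_first --class com.example.MyApp my-app.jar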
NB: The proposed solution at stackoverflow is to create a “shaded” fat jar. That probably works as well and should be simple for Maven users. Maybe my next post is about that 😉
When I start a Scala console and type at the prompt, nothing I type is visible. This fixes the issue temporarily:
import sys.process._
"reset" !
This is a known issue for Ubuntu 18.04 and Scala 2.11 (see stackoverflow).