Clean Spark environment in Zeppelin

When you have experimented a lot, there are many things in your notebook and quite often, you are not sure what is there and what not. Zeppelin does not really give you a good tool to do that, so I wrote my own function to clean up the global variables of Python:

def clear(keep=("__builtins__", "clear", 'completion', 'z', '__zeppelin_completion__', '_zsc_', '__zSqlc__', '__zeppelin__', '__zSpark__', 'sc', 'spark', 'sqlc', 'sqlContext')):
    keeps = {}
    for name, value in globals().iteritems():
        if name in keep: 
            keeps[name] = value
    globals().clear()
    for name, value in keeps.iteritems():
        globals()[name] = value

You can then call the function with clear().

Additionally, if you are dealing with libraries, you may want to explicitly remove them with del sys.modules['modulename'].

Pip and custom prefixes… again! This time it’s Ubuntu’s fault

I wanted to install a Python library to a custom location. Thanks to a long fight with Python on that issue (I can’t believe I haven’t blogged about this!), I know that --prefix does the trick for pip. So I run pip and this happens:

> pip3 install --prefix tmp/ boto3
ERROR: Can not combine '--user' and '--prefix' 
as they imply different installation locations

Alternatively the error is:

distutils.errors.DistutilsOptionError: can't combine user
with prefix, exec_prefix/home, or install_(plat)base

It seems to be an option that Ubuntu adds by default. The magic solution comes from a GNU bug tracker thread:

> pip3 install -U pip

Basically, this installs pip into my user directory (you can find it now in .local/bin/pip). pip3 still fail afterwards with a version mismatch:

> pip3 install --prefix tmp/ boto3
Traceback (most recent call last):
  File "/usr/bin/pip3", line 9, in <module>
    from pip import main
ImportError: cannot import name 'main'

But now I can call my local pip (which is a pip3):

> pip install --prefix tmp/ boto3
Collecting boto3
...
Successfully installed boto3-1.9.206 botocore-1.12.206

To force a re-install, even if the library is already installed somewhere else, use the flag --ignore-installed.

Anaconda and environments (basics)

TLDR: Install Anaconda NOT into your path, then do:

ln -s ~/anaconda3/bin/activate ~/bin/activate
source activate
conda create --name myenv
conda activate myenv
python

Long version: Install Anaconda. In the process you will be asked whether you want to edit your .bashrc to setup Anaconda. Answer NO!!

You will still need to add the Anaconda executables into your path to make Anaconda work. The easiest solution that does not destroy your system is to link the activate script somewhere in your path. I do this in a folder ~/bin which is always on my path:

ln -s ~/anaconda3/bin/activate ~/bin/activate

Now call

source activate

The text (base) will be prepended to your prompt and the Anaconda binaries will be in your paths. This should be more or less equivalent to what happens when you call conda init, but without changes to your .bashrc. You could now call python and it should print “[GCC 7.3.0] :: Anaconda, Inc. on linux” instead of your default operating system python.

If you don’t do anything, you have the base environment activated. First thing you want to do, is create a new environment and activate it. Then you can install your few packages into it and work with this environment. The advantage is, that you can throw away this environment if anything goes wrong and start from scratch easily.

These are the most important commands for dealing with environments (in my examples myenv is used as a name for the environment, but of course you can use a better one):
– List all available environments: conda info --envs
– Create an environment: conda create --name myenv
– Delete an environment with all contained packages: conda remove --name myenv --all
– Activate an environment: conda activate myenv
– Deactivate the current environment: conda deactivate
– List all packages installed in the current environment: conda list
– Install a package into the current environment: conda install packagename
– Delete a package from the current environment: conda remove packagename

When you start python with python from an activated environment, you will have all packages in this environment available to you.

Alternative:

export PATH="/home/wkessler/software/miniconda3/bin:$PATH"
source ~/software/miniconda3/bin/activate

Anaconda destroys Plasma

I just started my computer. Plasma did not start. I only saw the message “Could not start D-bus. Can you call qdbus-qt5?”. No. I cannot call this! I don’t have a working desktop!

What I managed to do was start a session with a different desktop environment and then search the internet for a solution. Will the error message, you will quickly find that “Anaconda update breaks KDE if it’s added to PATH”. Yes! I installed Anaconda the last time I used this computer! So I removed the lines it had added to my .bashrc and everything worked again.

Note to self: Do not add Anaconda to your PATH. Do not let it edit your .bashrc (which it does when you use conda init). Stupid snake!

Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in python and writing it to another file is easy:

input_file = open("input.txt")
outputFile = open("output.txt", "w")
for line in input_file:
   outputFile.write(line + "\n")

But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, not really anything, without specification, line is just a Byte-String (type 'str') with no encoding.

Because this is usually not what we want, the first step is to convert the read line to Unicode. This is done with the method decode. The method has two parameters. The fist is the encoding that should be used. This is a predefined String value which you can guess it for the more common encodings (or look it up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file and therefor the line we read from there is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')

or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')

After decoding, we have sometihng of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)

Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!

Further reading:
Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).