Pip and custom prefixes… again! This time it’s Ubuntu’s fault

I wanted to install a Python library to a custom location. Thanks to a long fight with Python on that issue (I can’t believe I haven’t blogged about this!), I know that --prefix does the trick for pip. So I run pip and this happens:

> pip3 install --prefix tmp/ boto3
ERROR: Can not combine '--user' and '--prefix' 
as they imply different installation locations

Alternatively the error is:

distutils.errors.DistutilsOptionError: can't combine user
with prefix, exec_prefix/home, or install_(plat)base

It seems to be an option that Ubuntu adds by default. The magic solution comes from a GNU bug tracker thread:

> pip3 install -U pip

Basically, this installs pip into my user directory (you can find it now in .local/bin/pip). pip3 still fails afterwards because of a version mismatch:

> pip3 install --prefix tmp/ boto3
Traceback (most recent call last):
  File "/usr/bin/pip3", line 9, in <module>
    from pip import main
ImportError: cannot import name 'main'

But now I can call my local pip (which is a pip3):

> pip install --prefix tmp/ boto3
Collecting boto3
...
Successfully installed boto3-1.9.206 botocore-1.12.206

To force a re-install, even if the library is already installed somewhere else, use the flag --ignore-installed.
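
To actually import a library installed this way, Python has to be told about the custom location. Here is a minimal sketch, assuming the common Linux layout <prefix>/lib/pythonX.Y/site-packages (some distributions use a different subfolder, so check what pip really created). The function names are made up for illustration:

```python
import os
import sys


def prefix_site_packages(prefix):
    """Guess the site-packages folder below a pip --prefix location.

    Assumption: the common <prefix>/lib/pythonX.Y/site-packages layout.
    """
    version = "python{}.{}".format(sys.version_info[0], sys.version_info[1])
    return os.path.abspath(os.path.join(prefix, "lib", version, "site-packages"))


def use_prefix(prefix):
    """Put the prefix's site-packages folder at the front of the import path."""
    path = prefix_site_packages(prefix)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


if __name__ == "__main__":
    # After this call, "import boto3" would look below tmp/ first
    print(use_prefix("tmp/"))
```

Setting the environment variable PYTHONPATH to the same folder before starting Python has the same effect.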

Anaconda and environments (basics)

TLDR: Install Anaconda NOT into your path, then do:

ln -s ~/anaconda3/bin/activate ~/bin/activate
source activate
conda create --name myenv
conda activate myenv
python

Long version: Install Anaconda. In the process you will be asked whether you want to edit your .bashrc to set up Anaconda. Answer NO!!

You will still need to add the Anaconda executables into your path to make Anaconda work. The easiest solution that does not destroy your system is to link the activate script somewhere in your path. I do this in a folder ~/bin which is always on my path:

ln -s ~/anaconda3/bin/activate ~/bin/activate

Now call

source activate

The text (base) will be prepended to your prompt and the Anaconda binaries will be on your path. This should be more or less equivalent to what happens when you call conda init, but without changes to your .bashrc. You could now call python and it should print “[GCC 7.3.0] :: Anaconda, Inc. on linux” instead of starting your default operating system python.

If you don’t do anything, the base environment is activated. The first thing you want to do is create a new environment and activate it. Then you can install your packages into it and work with this environment. The advantage is that you can throw the environment away and easily start from scratch if anything goes wrong.

These are the most important commands for dealing with environments (in my examples myenv is used as a name for the environment, but of course you can use a better one):
– List all available environments: conda info --envs
– Create an environment: conda create --name myenv
– Delete an environment with all contained packages: conda remove --name myenv --all
– Activate an environment: conda activate myenv
– Deactivate the current environment: conda deactivate
– List all packages installed in the current environment: conda list
– Install a package into the current environment: conda install packagename
– Delete a package from the current environment: conda remove packagename

When you start python with python from an activated environment, you will have all packages in this environment available to you.
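
If you are not sure which Python you ended up in, you can ask the interpreter itself. A small sketch (the function name is mine; CONDA_DEFAULT_ENV is the variable conda sets on activation, and it is simply missing when no environment is active):

```python
import os
import sys


def describe_interpreter():
    """Collect a few facts that show which Python installation is running."""
    return {
        "executable": sys.executable,  # path of the python binary
        "prefix": sys.prefix,          # root folder of the installation/environment
        "version": sys.version,        # build string, may mention the distributor
        # Set by conda on activation; None if no environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print("{}: {}".format(key, value))
```

With myenv activated, the executable and prefix should point somewhere below your Anaconda folder and not to /usr.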

Alternative:

export PATH="/home/wkessler/software/miniconda3/bin:$PATH"
source ~/software/miniconda3/bin/activate

Tidyverse R package

library(tidyverse)

Tidyverse is a collection of R packages for the comfortable cleanup of data. Loading the package tidyverse will load the core tidyverse packages, including tibble, tidyr, readr, purrr, and dplyr. You can of course also load each package individually.

Intro

  • Tidyverse is optimized for interactive workflow with data
  • Each function does one thing easy and well
  • Basic idea: action(data, some_arguments) or data %>% action(some_arguments)
  • Everything works with tibbles
  • Web page: http://tidyverse.org/
  • Workshop page (with example scripts): http://bodowinter.com/carpentry/index.html

Core packages

tibble

  • A modern version of dataframes
  • The first argument of every tidyverse function and what every tidyverse function returns
  • Tibbles use characters instead of factors for texts
  • Tibbles have nicer printout than normal dataframes: show data type of columns, number of rows, only the first few rows/columns not all of the data
 mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
 age <- c(NA, 30, 20, 25, 18)
 pre <- round(rnorm(length(mynames)), 2)
 post <- round(rnorm(length(mynames)), 2)
 mydata <- tibble(mynames, age, pre, post)  # create tibble from data
 mydf <- data.frame(mynames, age, pre, post)
 as_tibble(mydf)  # convert data frame into tibble

readr

  • Does the same as read.csv from base R, it reads a csv file
  • Faster
  • Automatically creates tibbles
  • Progress bar for big files
 read_csv('somefile.csv')

tidyr

  • A data frame is a rectangular array of variables (columns) and observations (rows)
  • A tidy data frame is a data frame where…
    ** Each variable is in a column.
    ** Each observation is a row.
    ** Each value is a cell.

  • Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …

  • Long format: each observation is a separate row, time is a new column, e.g., row1 is T0, row2 is T1, row3 is T2
  • Two functions: gather() to convert from wide format to long format and spread() to convert from long format to wide format
 # Convert to long format, so that every observation is one row,
 # with either the text 'pre' or 'post' in the column 'exam'
 # and the value that was in pre or post now in the column 'score'
 tidydf <- gather(mydata, exam, score, pre:post)

 # From tidydf create the same thing back that we had in mydata (wide format)
 spread(tidydf, exam, score)
  • Easily split columns with separate() and merge with unite()
    court # tibble with lots of comma-separated text in one column ‘text’
 # Split it into 14 columns with the names A-N, 
 # convert = T -> try to guess the datatypes, otherwise everything would be characters
 court <- separate(court, text, into = LETTERS[1:14], convert = T)

 # Put columns B, C and D into one column 'condition'
 court <- unite(court, condition, B, C, D)

dplyr

  • Filter rows with filter()
 filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
 filter(mydata, mynames %in% c('jkl', 'bla'))
 filter(mydata, post > pre)
  • Select columns with select()
 select(mydata, pre) # select a column
 select(mydata, -pre) # select everything besides this column
 select(mydata, age:pre) # select all columns between age and pre
 select(mydata, -(pre:post)) # select all columns besides those between pre and post
 select(mydata, pre:post, age, mynames) # select and reorder
  • Sort a tibble by a column with arrange()
 arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
  • Rename one or more columns with rename()
 rename(mydata, newname=pre, othernew=post)
  • Add new columns with mutate() and transmute()
 mutate(mydata, 
        diff = pre-post, 
        diff = diff*2, 
        diff_c = diff-mean(diff, na.rm=T))
 mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
 # transmute does the same, but returns only newly defined columns
 transmute(mydata,  diff = pre-post,  diff2 = diff*2) 
  • Aggregate data with summarize()
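 # gender has to exist as a column, and score comes from the long format (tidydf)
 tidydf %>% mutate(gender = ifelse(mynames == 'jkl', 'F', 'M')) %>% group_by(gender) %>% 
        summarise(MeanAge = mean(age, na.rm=T), Mean = mean(score, na.rm=T), SD = sd(score, na.rm=T))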
 # na.rm -> remove NA values
  • Merge tibbles with left_join() (there are also other joins)

Other packages

magrittr

  • Pipes: %>%
  • Send the same dataframe as input to a pipeline of actions.
  • Example:
 mydf %>%
        filter(!is.na(F0)) %>%
        mutate(LogFreq = log(Freq)) %>%
        group_by(Condition) %>%
        summarise(mean = mean(LogFreq))
  • Does the same as:
 mydf.filtered <- filter(mydf, !is.na(F0))
 mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
 mydf.grouped <- group_by(mydf.log, Condition)
 summarise(mydf.grouped, mean = mean(LogFreq))

ggplot2

  • “An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.”
  • “A geom is the geometrical object that a plot uses to represent data.”
  • General form:
 ggplot(data = <DATA>) +
        <GEOM_FUNCTION>(
                mapping = aes(<MAPPINGS>),
                stat = <STAT>,
                position = <POSITION>
        ) +
        <COORDINATE_FUNCTION> +
        <FACET_FUNCTION>
  • Examples:
 ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aesthetics/graphic properties
        mapping = aes( # aes -> set of aesthetic mappings,
           x = pred, y = resp #  map x/y-values of plot to dataframe columns with these names
         )) + geom_point() # add shape to the plot
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_histogram(binwidth = .5, 
        fill = rgb(0.2,0.4,0.8,0.6),   # rgb values in [0..1], last part is alpha
        color = 'black')   # use colors() to get a list of all colors
 ggplot(mydf, mapping = aes( x = pred)) + 
        geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black') 

stringr

  • Basic String manipulation
 s1 <- "THis is a String 123 that has numbers 456 "
 str_to_lower(s1)
 str_to_upper(s1)
 str_length(s1)
  • String concatenation and splitting
 str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
 s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
 xsplit <- str_split(s2, ' ') # Returns a list of character vectors
 unlist(xsplit) # Flattens the list into a vector of characters
 str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
  • Substrings
 str_sub(s2, 1, 1) # get the first letter of every entry
  • Regular expressions on a (list of) Strings
 str_view(s1, "(S|s)tr") # Search and show the result
 str_detect(s1, "[0-9]") # Check presence
 str_extract(s1, "[0-9]+") # Extract the (first) match
 str_replace(s1, "[0-9]+", ":)") # replace first occurrence
 str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all

rJava troubles

I am running code that needs the R package rJava. When I call that code, R just crashes without any indication of what is going wrong. This is a segmentation fault, that for some reason never makes it to the surface. You can solve this by setting the following (see also stackoverflow):

export _JAVA_OPTIONS="-Xss2560k -Xmx2g"

The result in my case is that R starts, but throws the error

Error: 'Error: C stack usage  141780829840 is too close to the limit'

I haven’t found a solution for that problem yet, tell me if you have any hints 🙁

Fun with newlines

Use a typewriter lately? No? Well, who cares… except when you encounter stupidities left over from the early days of computing where people were still used to typewriters. Because typewriters had two ways of going to a new line, ASCII knows two ways of representing the newline:

  • LF (line feed, German Zeilenvorschub), represented as Unicode code point 0x0A, ASCII 00001010 and escape character \n
  • CR (carriage return, German Wagenrücklauf), represented as Unicode code point 0x0D, ASCII 00001101 and escape character \r
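
You don’t have to take my word for these numbers. Here is a quick sketch that asks Python for them (the helper describe is only there for the example):

```python
# The two newline characters as Python sees them
LF = "\n"
CR = "\r"


def describe(char):
    """Return the code point as (decimal, hex, 8-bit binary)."""
    code = ord(char)
    return code, "0x{:02X}".format(code), "{:08b}".format(code)


if __name__ == "__main__":
    print("LF", describe(LF))  # (10, '0x0A', '00001010')
    print("CR", describe(CR))  # (13, '0x0D', '00001101')
```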

ASCII was one of the earliest encodings for representing text in bits. It’s from the 1960s, and at the time someone probably thought it was a good idea to have two characters for the concept of a new line. We’d think "who cares about stuff from the 1960s", it’s 2017, right? But unfortunately many later encodings are based on ASCII, most notably those from the Unicode family, e.g., the widely used UTF-8. So – thank you, 1960s! /sarcasm

Two characters for a new line would not be too bad if they were used consistently, but that is where the fun begins. Of course they are not! Different operating systems use different conventions to mark the end of a line:

  • Linux and Mac OS X use LF
  • Windows uses CR LF
  • (and to make the chaos complete, Mac OS from before version X uses CR)

So have fun reading "plain text" files! /sarcasm
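
If you get a file with mixed or unknown line endings, one way out is to normalize everything to LF. A minimal sketch that works on raw bytes (the function name is made up for this example):

```python
def normalize_newlines(data):
    """Convert CR LF (Windows) and bare CR (old Mac OS) to LF in a byte string."""
    # Order matters: handle CR LF first, otherwise it would turn into two LFs
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    mixed = b"linux\nwindows\r\nold mac\rend"
    print(normalize_newlines(mixed).split(b"\n"))
    # splitlines() on text also copes with all three conventions
    print(mixed.decode("ascii").splitlines())
```

In Python 3, opening a file in text mode with the default settings already translates all three conventions to \n while reading, so the explicit replacement is mostly needed when you work with bytes.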

Encoding in Python 2.x

One of the annoying things where I always forget the specifics. So here it is…

Reading a file line-by-line in python and writing it to another file is easy:

input_file = open("input.txt")
output_file = open("output.txt", "w")
for line in input_file:
   output_file.write(line)  # line still ends with its newline
input_file.close()
output_file.close()

But whenever encodings are involved, everything gets complicated. What is the encoding of line? Actually, none: without further specification, line is just a byte string (type 'str') with no encoding attached.

Because this is usually not what we want, the first step is to convert the line we read to Unicode. This is done with the method decode. The method has two parameters. The first is the encoding that should be used. This is a predefined string value which you can guess for the more common encodings (or look up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict' will abort with a UnicodeDecodeError, 'ignore' will leave the character out of the Unicode result, and 'replace' will replace every unknown pattern with U+FFFD. Let’s assume our input file, and therefore the line we read from it, is in Latin-1. We convert this to Unicode with:

lineUnicode = line.decode('latin-1','strict')

or equivalently

lineUnicode = unicode(line, encoding='latin-1', errors='strict')
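
To see what the three error modes described above actually do, you can try them on a tiny byte string. This is a sketch that only uses decode, so it should behave the same in Python 2 and 3. The sample bytes and the helper decode_with are made up for the example:

```python
# "cafe" with an e-acute, encoded in Latin-1; byte 0xE9 is not valid ASCII
raw = b"caf\xe9"


def decode_with(errors):
    """Decode raw as ASCII with the given error handler."""
    return raw.decode("ascii", errors)


if __name__ == "__main__":
    print(repr(raw.decode("latin-1")))   # correct encoding, nothing is lost
    print(repr(decode_with("ignore")))   # the bad byte is dropped
    print(repr(decode_with("replace")))  # the bad byte becomes U+FFFD
    try:
        decode_with("strict")
    except UnicodeDecodeError as err:
        print("strict failed: {}".format(err))
```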

After decoding, we have something of type 'unicode'. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):

lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)

Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!
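
Putting the pieces together, here is a sketch of the complete file conversion. It uses io.open, which is not mentioned above but exists in Python 2.6+ and in Python 3, and does the decoding and encoding for you. The function name is made up for this example:

```python
import io


def convert_latin1_to_utf8(input_path, output_path):
    """Copy a Latin-1 text file to a new file encoded as UTF-8."""
    # io.open decodes while reading and encodes while writing
    with io.open(input_path, "r", encoding="latin-1") as source:
        with io.open(output_path, "w", encoding="utf-8") as target:
            for line in source:
                # line is already Unicode and still ends with its newline
                target.write(line)
```

For the files from the beginning of this post, the call would be convert_latin1_to_utf8("input.txt", "output.txt").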

Further reading:
Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).

Accessing JVM arguments from inside Java

Whenever I get a ClassNotFoundException error in Java, I think to myself “but it is there!” and then I correct the typo in the classpath or get angry at Eclipse for messing up my classpath. Lately I have programmed in more complex settings where it was not always clear to me where the application gets the classpath from, so I wanted to check which of my libraries actually end up on the classpath. Turns out it is not very complicated. Here is code to print a number of useful things:

System.out.println("Working directory: " 
      + Paths.get(".").toAbsolutePath().normalize().toString());
System.out.println("Classpath: " 
      + System.getProperty("java.class.path"));
System.out.println("Library path: " 
      + System.getProperty("java.library.path"));
System.out.println("java.ext.dirs: " 
      + System.getProperty("java.ext.dirs"));

The current working directory is the starting point for all relative paths, e.g., for reading and writing files. The normalization of the path makes it a bit more readable, but is not necessary. The class Paths is from the package java.nio.file. The classpath is the place where Java looks for (bytecode for) classes. The entries should be folders or jar-files. The Java library path is where Java looks for native libraries, i.e., platform-dependent things. You can of course access other system properties with the same method, but I cannot at the moment think of a useful example.

Related (at least related enough to put it into the same post), this is how you can print the space used and available on the JVM heap:

int mbfactor = 1024*1024;
System.out.println("Memory free/used/total/max " 
      + Runtime.getRuntime().freeMemory()/mbfactor + "/"
      + (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/mbfactor + "/"
      + Runtime.getRuntime().totalMemory()/mbfactor + "/"
      + Runtime.getRuntime().maxMemory()/mbfactor + " MB"
);

Histograms of category frequencies in R

I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:

labels <- sample(letters[1:20],100,replace=TRUE)
labels <- vapply(seq_along(labels), 
                 function(x) paste(rep(labels[x],10), collapse = ""),
                 character(1L)) # Repeat each letter 10 times
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- 
   distribution[order(distribution[,"freq"], decreasing=TRUE),]

I use the function count from the package plyr to get a data frame distribution with the different categories in column one (called "x") and the number of times this label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this data frame by frequency with the function order. The function gives back a vector of indices in the correct order, so I need to plug this into the original data frame as row numbers.

Now let's do the histogram:

mp <- barplot(distribution_sorted[,"freq"],
         names.arg=distribution_sorted[,1], # X-axis names
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         xlab="Kategorie", ylab="Häufigkeit", # Axis labels
         )

There are many more settings to adapt, e.g., you can use cex to increase the font size for the numerical y-axis values (cex.axis), the categorical x-axis names (cex.names), and axis labels (cex.lab).

In my plot there is one problem. My category names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis at the bottom (side=1) with zero-length ticks (tcl=0) intersecting the y-axis at pos=-0.3.

# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
         #names.arg=distribution_sorted[,1], # X-axis names/labels
         las=2,  # turn labels by 90 degrees
         col=c("blue"), # blue bars (just for fun)
         ylab="Häufigkeit", # Axis title
         )
axis(side=1, at=mp, pos=-0.3, 
     tick=TRUE, tcl=0, 
     labels=distribution_sorted[,1], las=2, 
     )
mtext("Kategorie", side=1, line=8.5) # x-axis label

There has to be an easier way !?