When I start a scala console and type after the prompt, nothing is visible. This fixes the issue temporarily:
import sys.process._
"reset" !
This is a known issue for Ubuntu 18.04 and Scala 2.11 (see stackoverflow).
I wanted to install a Python library to a custom location. Thanks to a long fight with Python on that issue (I can’t believe I haven’t blogged about this!), I know that --prefix
does the trick for pip
. So I run pip
and this happens:
> pip3 install --prefix tmp/ boto3
ERROR: Can not combine '--user' and '--prefix'
as they imply different installation locations
Alternatively the error is:
distutils.errors.DistutilsOptionError: can't combine user
with prefix, exec_prefix/home, or install_(plat)base
It seems to be an option that Ubuntu adds by default. The magic solution comes from a GNU bug tracker thread:
> pip3 install -U pip
Basically, this installs pip
into my user directory (you can find it now in .local/bin/pip
). pip3
still fails afterwards with a version mismatch:
> pip3 install --prefix tmp/ boto3
Traceback (most recent call last):
File "/usr/bin/pip3", line 9, in <module>
from pip import main
ImportError: cannot import name 'main'
But now I can call my local pip
(which is a pip3
):
> pip install --prefix tmp/ boto3
Collecting boto3
...
Successfully installed boto3-1.9.206 botocore-1.12.206
To force a re-install, even if the library is already installed somewhere else, use the flag --ignore-installed
.
TLDR: Install Anaconda NOT into your path, then do:
ln -s ~/anaconda3/bin/activate ~/bin/activate
source activate
conda create --name myenv
conda activate myenv
python
Long version: Install Anaconda. In the process you will be asked whether you want to edit your .bashrc
to setup Anaconda. Answer NO!!
You will still need to add the Anaconda executables into your path to make Anaconda work. The easiest solution that does not destroy your system is to link the activate
script somewhere in your path. I do this in a folder ~/bin
which is always on my path:
ln -s ~/anaconda3/bin/activate ~/bin/activate
Now call
source activate
The text (base)
will be prepended to your prompt and the Anaconda binaries will be in your paths. This should be more or less equivalent to what happens when you call conda init
, but without changes to your .bashrc
. You could now call python
and it should print “[GCC 7.3.0] :: Anaconda, Inc. on linux” instead of your default operating system python.
If you don’t do anything, you have the base environment activated. The first thing you want to do is create a new environment and activate it. Then you can install your few packages into it and work with this environment. The advantage is that you can throw away this environment if anything goes wrong and easily start from scratch.
These are the most important commands for dealing with environments (in my examples myenv
is used as a name for the environment, but of course you can use a better one):
– List all available environments: conda info --envs
– Create an environment: conda create --name myenv
– Delete an environment with all contained packages: conda remove --name myenv --all
– Activate an environment: conda activate myenv
– Deactivate the current environment: conda deactivate
– List all packages installed in the current environment: conda list
– Install a package into the current environment: conda install packagename
– Delete a package from the current environment: conda remove packagename
When you start python with python
from an activated environment, you will have all packages in this environment available to you.
Alternative:
export PATH="/home/wkessler/software/miniconda3/bin:$PATH"
source ~/software/miniconda3/bin/activate
library(tidyverse)
Tidyverse is a collection of R packages for the comfortable cleanup of data. Loading the package tidyverse
will load the core tidyverse packages: tibble
, tidyr
, readr
, purrr
, and dplyr
. You can of course also load each package individually.
Tidyverse functions can be called directly as action(data, some_arguments) or with the pipe operator as data %>% action(some_arguments)
tibbles
mynames <- c('bla', 'jkl', 'xyz', 'asdf', 'asdf')
age <- c(NA, 30, 20, 25, 18)
pre <- round(rnorm(length(mynames)), 2)
post <- round(rnorm(length(mynames)), 2)
mydata <- tibble(mynames, age, pre, post) # create tibble from data
mydf <- data.frame(mynames, age, pre, post)
as_tibble(mydf) # convert data frame into tibble
read_csv('somefile.csv') reads a csv file into a tibble (unlike read.csv from base R, which returns a data frame).
A tidy data frame is a data frame where…
– Each variable is in a column.
– Each observation is a row.
– Each value is a cell.
Wide format: a row has many entries for observations, e.g., time-series in columns T0, T1, T2, …
gather()
to convert from wide format to long format and spread()
to convert from long format back to wide format.
# Convert to long format, so that every observation is one row,
# with either the text 'pre' or 'post' in the column 'exam'
# and the value that was in pre or post now in the column 'score'
tidydf <- gather(mydata, exam, score, pre:post)
# From tidydf create the same thing back that we had in mydata (wide format)
spread(tidydf, exam, score)
separate()
and merge with unite()
# Split it into 14 columns with the names A-N,
# Convert = True -> try to guess the datatypes, otherwise everything would be characters
court <- separate(court, text, into = LETTERS[1:14], convert = T)
# Put columns B, C and D into one column 'condition'
court <- unite(court, condition, B, C, D)
filter()
filter(mydata, !is.na(age), pre>0, !duplicated(mynames))
filter(mydata, mynames %in% c('jkl', 'bla'))
filter(mydata, post > pre)
select()
select(mydata, pre) # select a column
select(mydata, -pre) # select everything besides this column
select(mydata, age:pre) # select all columns between age and pre
select(mydata, -(pre:post)) # select all columns besides those between pre and post
select(mydata, pre:post, age, mynames) # select and reorder
arrange()
arrange(mydata, desc(age), pre) # sort by age (descending), then by pre
rename()
rename(mydata, newname=pre, othernew=post)
mutate()
and transmute()
mutate(mydata,
diff = pre-post,
diff = diff*2,
diff_c = diff-mean(diff, na.rm=T))
mutate(mydata, gender = ifelse(mynames == 'jkl', 'F', 'M'))
# transmute does the same, but returns only newly defined columns
transmute(mydata, diff = pre-post, diff2 = diff*2)
summarize()
mydata %>% group_by(gender) %>%
summarise(MeanAge = mean(age, na.rm=T), Mean = mean(score, na.rm=T), SD = sd(score, na.rm=T))
# na.rm -> remove NA values
left_join()
(there are also other joins)
mydf %>%
filter(!is.na(F0)) %>%
mutate(LogFreq = log(Freq)) %>%
group_by(Condition) %>%
summarise(mean = mean(LogFreq))
mydf.filtered <- filter(mydf, !is.na(F0))
mydf.log <- mutate(mydf.filtered, LogFreq = log(Freq))
mydf.grouped <- group_by(mydf.log, Condition)
summarise(mydf.grouped, mean = mean(LogFreq))
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
ggplot(mydf, # dataframe/tibble as first arg, mapping -> from data to aesthetics/graphic properties
mapping = aes( # aes -> set of aesthetic mappings,
x = pred, y = resp # map x/y-values of plot to dataframe columns with these names
)) + geom_point() # add shape to the plot
ggplot(mydf, mapping = aes( x = pred)) +
geom_histogram(binwidth = .5,
fill = rgb(0.2,0.4,0.8,0.6), # rgb values in [0..1], last part is alpha
color = 'black') # use colors() to get a list of all colors
ggplot(mydf, mapping = aes( x = pred)) +
geom_density(fill = rgb(0.8,0.4,0.3,0.6), color = 'black')
s1 <- "THis is a String 123 that has numbers 456 "
str_to_lower(s1)
str_to_upper(s1)
str_length(s1)
str_c("Hello", "Bodo", "nice", "to", "meet", "you", sep = " ")
s2 <- c('Anna Beispiel', 'Cornelia Daten', 'Egon Fritz')
xsplit <- str_split(s2, ' ') # Returns a list of character vectors
unlist(xsplit) # Flattens the list into a vector of characters
str_split(s2, ' ', simplify = T) # Returns a matrix instead of a list
str_sub(s2, 1, 1) # get the first letter of every entry
str_view(s1, "(S|s)tr") # Search and show the result
str_detect(s1, "[0-9]") # Check presence
str_extract(s1, "[0-9]+") # Extract the (first) match
str_replace(s1, "[0-9]+", ":)") # replace first occurrence
str_replace_all(s1, "([0-9]+)", "\\1 :)") # replace all
It’s easy!
String version = myObject.getClass().getPackage().getImplementationVersion();
Where does it come from…? I think from the MANIFEST.MF file (META-INF/MANIFEST.MF) in the jar.
I am running code that needs the R package rJava
. When I call that code, R just crashes without any indication of what is going wrong. This is a segmentation fault, that for some reason never makes it to the surface. You can solve this by setting the following (see also stackoverflow):
export _JAVA_OPTIONS="-Xss2560k -Xmx2g"
The result in my case is that R starts, but throws the error
Error: 'Error: C stack usage 141780829840 is too close to the limit'
I haven’t found a solution for that problem yet, tell me if you have any hints 🙁
Use a typewriter lately? No? Well, who cares… except when you encounter stupidities left over from the early days of computing, when people were still used to typewriters. Because going to a new line on a typewriter involved two separate motions (returning the carriage and feeding the paper), ASCII knows two ways of representing the newline:
Line feed (LF): 0x0A, binary 00001010, escape character \n
Carriage return (CR): 0x0D, binary 00001101, escape character \r
ASCII was the first widely adopted encoding for representing text in bits. It’s from the 1960s, and at the time someone probably thought it was a good idea to have two characters for the concept of a new line. We’d think "who cares about stuff from the 1960s", it’s 2017, right? But unfortunately many later encodings base themselves on ASCII, most notably those from the Unicode family, e.g., the widely used UTF-8. So – thank you, 1960s! /sarcasm
Two characters for a new line would not be too bad if they were used consistently, but that is where the fun begins. Of course they are not! Different operating systems use different conventions to mark the end of a line: Unix-like systems (including Linux and modern macOS) use LF, Windows uses CR followed by LF, and classic Mac OS used a lone CR.
So have fun reading "plain text" files! /sarcasm
One of the annoying things where I always forget the specifics. So here it is…
Reading a file line-by-line in python and writing it to another file is easy:
input_file = open("input.txt")
output_file = open("output.txt", "w")
for line in input_file:
    output_file.write(line)  # line already ends with a newline
But whenever encodings are involved, everything gets complicated. What is the encoding of line
? Actually, not really anything, without specification, line
is just a Byte-String (type 'str'
) with no encoding.
Because this is usually not what we want, the first step is to convert the read line to Unicode. This is done with the method decode
. The method has two parameters. The first is the encoding that should be used. This is a predefined String value which you can guess for the more common encodings (or look it up in the documentation). If left out, ASCII is assumed. The second parameter defines how to handle unknown byte patterns. The value 'strict'
will abort with UnicodeDecodeError
, 'ignore'
will leave the character out of the Unicode result, and 'replace'
will replace every unknown pattern with U+FFFD
. Let’s assume our input file and therefore the line we read from there is in Latin-1. We convert this to Unicode with:
lineUnicode = line.decode('latin-1','strict')
or equivalently
lineUnicode = unicode(line, encoding='latin-1', errors='strict')
After decoding, we have something of type 'unicode'
. If we try to print this and it is not simple English, it will probably give an error (UnicodeEncodeError: 'ascii' codec can't encode characters in position 63-66: ordinal not in range(128)
). This is because Python will try to convert the characters to ASCII, which is not possible for characters that are not ASCII. So, to print out a Unicode text, we have to convert it to some encoding. Let’s say we want UTF-8 (there is no reason not to use UTF-8, so this is what you should always want):
lineUtf8 = lineUnicode.encode('utf-8')
print(lineUtf8)
Here again, there is a second parameter which defines how to handle characters that cannot be represented (which shouldn’t happen too often with UTF-8). Happy coding!
Further reading:
Unicode HOWTO in the Python documentation, Overcoming frustration: Correctly using unicode in python2 from the Python Kitchen project, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (not specific to Python, but gives a good background).
Whenever I get a ClassNotFoundException error in Java, I think to myself “but it is there!” and then I correct the typo in the classpath or get angry at Eclipse for messing up my classpath. Lately I have programmed in more complex settings where it was not always clear to me where the application gets the classpath from, so I wanted to check which of my libraries actually end up on the classpath. Turns out it is not very complicated. Here is code to print a number of useful things:
System.out.println("Working directory: " + Paths.get(".").toAbsolutePath().normalize().toString());
System.out.println("Classpath: " + System.getProperty("java.class.path"));
System.out.println("Library path: " + System.getProperty("java.library.path"));
System.out.println("java.ext.dirs: " + System.getProperty("java.ext.dirs"));
The current working directory is the starting point for all relative paths, e.g., for reading and writing files. The normalization of the path makes it a bit more readable, but is not necessary. The class Paths
is from the package java.nio.file
. The classpath is the place where Java looks for (bytecode for) classes. The entries should be folders or jar-files. The Java library path is where Java looks for native libraries, e.g., platform dependent things. You can of course access other environment variables with the same method, but I cannot at the moment think of a useful example.
Related (at least related enough to put it into the same post), this is how you can print the space used and available on the JVM heap:
int mbfactor = 1024*1024;
System.out.println("Memory free/used/total/max "
    + Runtime.getRuntime().freeMemory()/mbfactor + "/"
    + (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory())/mbfactor + "/"
    + Runtime.getRuntime().totalMemory()/mbfactor + "/"
    + Runtime.getRuntime().maxMemory()/mbfactor + " MB");
I am learning R, so this is my first attempt to create histograms in R. The data that I have is a vector of one category for each data point. For this example we will use a vector of a random sample of letters. The important thing is that we want a histogram of the frequencies of texts, not numbers. And the texts are longer than just one letter. So let’s start with this:
labels <- sample(letters[1:20], 100, replace=TRUE)
# Repeat each letter 10 times
labels <- vapply(seq_along(labels), function(x) paste(rep(labels[x], 10), collapse = ""), character(1L))
library(plyr) # for the function 'count'
distribution <- count(labels)
distribution_sorted <- distribution[order(distribution[,"freq"], decreasing=TRUE),]
I use the function count
from the package plyr
to get a matrix distribution
with the different categories in column one (called "x") and the number of times this label occurs in column two (called "freq"). As I would like the histogram to display the categories from the most frequent to the least frequent one, I then sort this matrix by frequency with the function order
. The function gives back a vector of indices in the correct order, so I need to plug this into the original matrix as row numbers.
Now let's do the histogram:
mp <- barplot(distribution_sorted[,"freq"],
              names.arg=distribution_sorted[,1], # X-axis names
              las=2, # turn labels by 90 degrees
              col=c("blue"), # blue bars (just for fun)
              xlab="Kategorie", ylab="Häufigkeit" # Axis labels
              )
There are many more settings to adapt, e.g., you can use cex
to increase the font size for the numerical y-axis values (cex.axis
), the categorical x-axis names (cex.names
), and axis labels (cex.lab
).
In my plot there is one problem. My category names are much longer than the values on the y-axis and so the axis labels are positioned incorrectly. This is the point to give up and do the plot in Excel (ahem, LaTeX!) - or take input from fellow bloggers. They explain the issues way better than me, so I will just post my final solution. I took the x-axis label out of the plot and inserted it separately with mtext
. I then wanted a line for the x-axis as well and in the end I took out the x-axis names from the plot again and put them into a separate axis
at the bottom (side=1
) with zero-length ticks (tcl=0
) intersecting the y-axis at pos=-0.3
.
# mai = space around the plot: bottom - left - top - right
# mgp = spacing for axis title - axis labels - axis line
par(mai=c(2.5,1,0.3,0.15), mgp=c(2.5,0.75,0))
mp <- barplot(distribution_sorted[,"freq"],
              #names.arg=distribution_sorted[,1], # X-axis names/labels
              las=2, # turn labels by 90 degrees
              col=c("blue"), # blue bars (just for fun)
              ylab="Häufigkeit" # Axis title
              )
axis(side=1, at=mp, pos=-0.3, tick=TRUE, tcl=0,
     labels=distribution_sorted[,1], las=2)
mtext("Kategorie", side=1, line=8.5) # x-axis label
There has to be an easier way !?