Typesetting text in math mode (2)

In a previous post (Typesetting text in math mode) I advertised the use of \mbox to write text in mathematical formulas. This works when you are in the "standard size", but looks funny if you have subscripts because the sizes are off:

$ 50 \mbox{ apples}_{\mbox{yellow}} \times 
100 \mbox{ apples}_{\mbox{red-green}} 
= \mbox{lots of apples}^{\mbox{to eat}} $

looks like
50 \mbox{ apples}_{\mbox{yellow}} \times 100 \mbox{ apples}_{\mbox{red-green}} = \mbox{lots of apples}^{\mbox{to eat}}

In these cases you can use the command \text, which adapts to the current math size (at the standard size it looks the same as \mbox). In addition to plain \text, there are also \textbf (bold face), \textit (italics) and \texttt (typewriter).

$ 50 \text{ apples}_{\text{yellow}} \times 
100 \textit{ apples}_{\texttt{red-green}} 
= \textbf{lots of apples}^\text{to eat} $

looks like
50 \text{ apples}_{\text{yellow}} \times 100 \textit{ apples}_{\texttt{red-green}} = \textbf{lots of apples}^\text{to eat}

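Note: \text is defined by the package amstext, which amsmath loads automatically. It often works without loading anything yourself because another package or the document class already pulls in amsmath. If you get an "undefined control sequence" error, load amsmath (or amstext) explicitly.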
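A minimal sketch of a complete document that loads the package explicitly, in case \text is not available in your setup. The document class is just an assumption for the sake of a compilable example:

```latex
\documentclass{article}
\usepackage{amsmath} % loads amstext, which defines \text
\begin{document}
$ 50 \text{ apples}_{\text{yellow}} \times
100 \text{ apples}_{\text{red-green}}
= \text{lots of apples}^{\text{to eat}} $
\end{document}
```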

Euclidean and cosine distance for unit vectors (and negative entries!)

Just a few quick words about the assumption we made in the last post that all entries in our vectors are positive, so that we can define the cosine distance as 1 minus the similarity. This assumption is actually not necessary. We can have negative entries; as long as our vectors are normalized to unit length, everything still works.

Remember Euclidean distance for unit vectors:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{2(1 - \sum_i p_i q_i)}

And cosine similarity for two unit vectors:
s_{\text{cosine}}(\vec{p},\vec{q}) = \sum_i p_i q_i

So now, like we did in the last post, let’s say we have two vectors v and w and we know that measured with Euclidean distance, v is closer to some other point p than w is*:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})

We do the same steps as in the last post, but then go on and get rid of the 1 and the minus (careful: subtracting 1 is harmless, but multiplying by -1 flips the direction of the inequality):
1 - \sum_i p_i v_i \leq 1 - \sum_i p_i w_i
\Leftrightarrow  - \sum_i p_i v_i \leq - \sum_i p_i w_i
\Leftrightarrow  \sum_i p_i v_i \geq \sum_i p_i w_i

Voila, cosine similarity!

So if p is closer to v than to w as measured with Euclidean distance, the cosine similarity of p and v is higher than that of p and w:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})  \Leftrightarrow  s_{\text{cosine}}(\vec{p},\vec{v}) \geq s_{\text{cosine}}(\vec{p},\vec{w})

So whenever you have unit length vectors and are only interested in relative distances, it shouldn’t make a difference whether you use Euclidean distance or cosine similarity.
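As a quick sanity check with negative entries, take three unit vectors chosen purely for illustration (they are not from the original derivation): p = (1, 0), v = (0.6, -0.8) and w = (-0.6, 0.8). Plugging them into the formulas above gives:

```latex
\sum_i p_i v_i = 0.6, \qquad \sum_i p_i w_i = -0.6
d_{\text{euclid}}(\vec{p},\vec{v}) = \sqrt{2(1 - 0.6)} = \sqrt{0.8} \approx 0.894
d_{\text{euclid}}(\vec{p},\vec{w}) = \sqrt{2(1 + 0.6)} = \sqrt{3.2} \approx 1.789
s_{\text{cosine}}(\vec{p},\vec{v}) = 0.6 \geq -0.6 = s_{\text{cosine}}(\vec{p},\vec{w})
```

So v is closer to p in Euclidean distance and also has the higher cosine similarity, even though one of the similarities is negative.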

* Same footnote as last time: The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.

Euclidean and cosine distance for unit vectors

The Euclidean distance between two vectors p and q is the length of the line segment that connects them (here and in all following formulas the sum is over all dimensions of the vectors, i.e., if we have n dimensions the sum ranges from i=1 to n):
d_{\text{euclid}}(\vec{p},\vec{q}) = |\vec{p} - \vec{q}| = \sqrt{\sum_i (p_i - q_i)^2}

Using the binomial expansion, we can write this as follows:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{\sum_i p_i^2 - 2\sum_i p_i q_i +\sum_i q_i^2}

Unit vectors have a length of 1 (by definition). Length is calculated as the Euclidean norm, that is, the Euclidean distance of a vector to the zero vector, i.e., the square root of the sum of all squared entries in the vector:
|\vec{p}| = d_{\text{euclid}}(\vec{p},0) = \sqrt{\sum_i (p_i-0)^2 } = \sqrt{\sum_i p_i^2 }

If something is 1, its square is also 1:
\sqrt{\sum_i p_i^2 } = 1  \Leftrightarrow \sum_i p_i^2 = 1

We can now replace the sums of squared entries in the formula for Euclidean distance with 1:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{1 - 2\sum_i p_i q_i + 1} = \sqrt{2 - 2\sum_i p_i q_i} = \sqrt{2(1 - \sum_i p_i q_i)}
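To see that the shortcut agrees with the original definition, here is a small worked example with two unit vectors picked only for illustration, p = (0.6, 0.8) and q = (1, 0):

```latex
\sqrt{\sum_i (p_i - q_i)^2} = \sqrt{(0.6 - 1)^2 + (0.8 - 0)^2} = \sqrt{0.16 + 0.64} = \sqrt{0.8}
\sum_i p_i q_i = 0.6 \cdot 1 + 0.8 \cdot 0 = 0.6
\sqrt{2(1 - \sum_i p_i q_i)} = \sqrt{2(1 - 0.6)} = \sqrt{0.8} \approx 0.894
```

Both routes give the same distance, as they should for unit vectors.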

Now let’s see how the cosine distance is defined. The more common thing to do is to calculate the cosine similarity of two vectors as the cosine of the angle between them:
s_{\text{cosine}}(\vec{p},\vec{q}) = \frac{\vec{p} \cdot \vec{q}}{|\vec{p}| |\vec{q}|} = \frac{\sum_i p_i q_i}{|\vec{p}| |\vec{q}|}

As we have unit vectors, we can get rid of the division by the length (which is always 1), so the formula is simplified to the dot product between the two vectors:
s_{\text{cosine}}(\vec{p},\vec{q}) = \sum_i p_i q_i

When we have a vector space where the entries correspond to occurrences of terms in a document, all entries are non-negative, so the value of the cosine similarity will always be between zero and one. This means we can define the cosine distance as:
d_{\text{cosine}}(\vec{p},\vec{q}) = 1 - s_{\text{cosine}}(\vec{p},\vec{q}) = 1 - \sum_i p_i q_i

So let’s put it together. Let’s say we have two vectors v and w and we know that measured with Euclidean distance, v is closer to some other point p than w is*:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})

We can now replace the Euclidean distance with the formula from above, square both sides (both sides are non-negative, so squaring doesn’t change the inequality relation) and get rid of the two that appears on both sides:
\sqrt{2(1 - \sum_i p_i v_i)} \leq \sqrt{2(1 - \sum_i p_i w_i)}
\Leftrightarrow  2(1 - \sum_i p_i v_i) \leq 2(1 - \sum_i p_i w_i)
\Leftrightarrow  1 - \sum_i p_i v_i \leq 1 - \sum_i p_i w_i

What we are left with is the cosine distance! So, putting start and end together, what we have shown is:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})  \Leftrightarrow  d_{\text{cosine}}(\vec{p},\vec{v}) \leq d_{\text{cosine}}(\vec{p},\vec{w})

This doesn’t mean that when you calculate Euclidean distance and cosine distance between two vectors that you will get the same number. But whenever you are only interested in relative distances (that means you only want to know which of two vectors is closer to something than the other) and you have vectors that are normalized to unit length with only positive entries, then the result should be the same whether you use cosine or Euclidean distance.
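A small worked example, with unit vectors chosen only for illustration: p = (1, 0), v = (0.8, 0.6) and w = (0.6, 0.8). The numbers differ between the two measures, but the order is the same:

```latex
d_{\text{cosine}}(\vec{p},\vec{v}) = 1 - 0.8 = 0.2, \qquad d_{\text{cosine}}(\vec{p},\vec{w}) = 1 - 0.6 = 0.4
d_{\text{euclid}}(\vec{p},\vec{v}) = \sqrt{2 \cdot 0.2} = \sqrt{0.4} \approx 0.632
d_{\text{euclid}}(\vec{p},\vec{w}) = \sqrt{2 \cdot 0.4} = \sqrt{0.8} \approx 0.894
```

This also shows the relation that follows directly from the formulas above: for unit vectors the Euclidean distance is the square root of twice the cosine distance.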

* The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.

LaTeX a5 paper size

In theory, the option ‘a5paper’ should give you an A5 page. Unfortunately, this option only sets the area in which LaTeX typesets, not the physical size of the output PDF. You need to load the ‘geometry’ package to achieve this:

\documentclass[a5paper]{scrartcl}
\usepackage[pass]{geometry}

Letters in LaTeX

My standard letter in LaTeX with the class scrlttr2 (created by a German to adhere to German letter guidelines):

\documentclass[fromalign=location, fromphone=true, fromemail=true, locfield=wide]{scrlttr2}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\setkomavar{fromname}{Max Mustermann}
\setkomavar{fromaddress}{Musterstr.\ 1\\12345 Musterstadt}
\setkomavar{fromphone}{01234/56789000}
\setkomavar{fromemail}{mustermx@provider.xy}
\setkomavar{signature}{Max Mustermann}
\setkomavar{subject}[]{}
\setkomavar{date}{7.\ August 2013}

\begin{document}
\begin{letter}{Jane Doe\\
Example street 2\\
54321 Exampleville
}
\opening{Dear Mrs.\ X,}
this is the letter I promised.
\closing{Kind regards,}
\end{letter}
\end{document}

The variables ‘from…’ set the sender; the recipient is given right after \begin{letter}. The sender information can end up in strange places. For a very simple letter I use ‘fromalign=location’, which puts the sender information somewhere at the top right, a bit higher than the recipient, but not in the headline. With the standard settings the e-mail address is too wide for the sender field, so I widen it with ‘locfield=wide’.

Get folder/filename from a path

Get folder name, filename, file extension from a path:


# BASEFILE must hold the full path before this runs.
# File name: Strip from start longest match of */ (everything up to the last slash)
FILENAME="${BASEFILE##*/}"
echo "FILENAME $FILENAME"

# Folder: Substring from 0 to start of filename (assumes BASEFILE contains at least one slash)
FOLDER="${BASEFILE:0:${#BASEFILE} - ${#FILENAME} - 1}"
echo "FOLDER $FOLDER"

# File prefix: Strip from end shortest match of [dot, one non-dot char, then anything], i.e. the last extension
FILEPREFIX="${FILENAME%.[^.]*}"
echo "FILEPREFIX $FILEPREFIX"

# File extension: Strip from start longest match of [one non-dot char, anything, then dot], i.e. everything up to the last dot
EXTENSION="${FILENAME##[^.]*.}"
echo "EXTENSION $EXTENSION"
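To make the expansions concrete, here is a small sketch with a sample path. The path is hypothetical and only there for illustration. The sketch also swaps in two things that are not in the snippet above: the folder is taken with the suffix-stripping pattern %/* instead of the substring arithmetic, and [!.] is used as the POSIX spelling of [^.], so that it also runs in a plain sh.

```shell
# Hypothetical sample path, chosen only for illustration
BASEFILE="/home/user/docs/archive.tar.gz"

# Same splitting as above, written with POSIX-only expansions
FILENAME="${BASEFILE##*/}"        # strip everything up to the last slash
FOLDER="${BASEFILE%/*}"           # strip the last slash and the file name
FILEPREFIX="${FILENAME%.[!.]*}"   # strip the last extension
EXTENSION="${FILENAME##[!.]*.}"   # strip everything up to the last dot

echo "FOLDER $FOLDER"
echo "FILENAME $FILENAME"
echo "FILEPREFIX $FILEPREFIX"
echo "EXTENSION $EXTENSION"
```

Run as a script, this prints /home/user/docs, archive.tar.gz, archive.tar and gz. Note that only the last extension is split off, and that for a file name without any dot the extension pattern matches nothing, so EXTENSION ends up equal to FILENAME.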

Undefined references – LaTeX Warning

Sometimes LaTeX tells you this:

LaTeX Warning: There were undefined references.

If you get this warning, you will notice some ?? in your document at places where references should be. For references to sections, tables or figures, just run pdflatex again (and check for typos). For bibliography references you need to run bibtex.

Let’s assume you are writing a LaTeX file with the name ‘report.tex’. Do the following:

> pdflatex report.tex
[...]
LaTeX Warning: Citation `Liu2010' on page 1 undefined on input line 39.
[...]
LaTeX Warning: Reference `fig:results' on page 1 undefined on input line 65.
[...]
LaTeX Warning: There were undefined references.
[...]
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
[...]
> bibtex report
[...]
> pdflatex report.tex
[...]
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
[...]
> pdflatex report.tex
[...]

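You need to run pdflatex twice more after calling bibtex: the first run pulls in the bibliography that bibtex generated, and the second one resolves the citations and any references that ended up somewhere else because the layout changed after the bibliography was inserted.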
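If you get tired of typing the sequence, it can be wrapped in a small shell function. This is only a sketch: the function name and the dry-run argument are made up for this example, and it assumes pdflatex and bibtex are on your PATH.

```shell
# Hypothetical helper: run the full cycle for a document's base name.
# An optional second argument is put in front of every command, so passing
# "echo" gives a dry run that only prints what would be executed.
build_with_bibtex() {
    name="$1"
    run="${2:-}"
    $run pdflatex "$name.tex" &&   # first pass: writes citations and labels to the .aux file
    $run bibtex "$name" &&         # builds the bibliography from the .aux file
    $run pdflatex "$name.tex" &&   # pulls in the bibliography
    $run pdflatex "$name.tex"      # resolves citations and moved labels
}

# Dry run for report.tex: prints the four commands in order
build_with_bibtex report echo
```

Calling build_with_bibtex report without the second argument runs the real commands and stops at the first one that fails.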

The most important commands for SVN

Here are the most important commands for using SVN in the command line on Linux. You have to be inside the folder of your local working copy, otherwise it won’t work (the most common cause of the errors “Skipped '.'” or “'.' is not a working copy”).

update

To update your local working copy to the newest version that exists on the server (ALWAYS do this before you start to change things or your teammates will kill you!!):

svn update

add

Files you move into the local working copy folder are not added automatically. If you want the file to be part of the SVN, you have to add it. It works for multiple files or folders, too.

svn add filename

delete

To delete files from the repository, first mark them for deletion:

svn rm filename

This removes the file from your working copy right away, and on the next commit it will be deleted from the repository as well! If you want to keep the local copy, do

svn rm --keep-local filename

revert

With revert, you can undo pending changes in your working copy (e.g. add, delete) before the next commit.

svn revert filename

Also handy in case you forgot what local changes you made and you want to return to the latest “safe” version from the repository.
Note that this does NOT enable you to go back to a previous, already-committed version. To do that, you can check out the specific version of your repository at some other place (with the option -r) and manually get what you need or follow the procedure outlined here.

commit (changes to the repository)

If you have changed a file or added or deleted something and want to put the changes into the SVN, you have to commit them. Without a commit, the changes exist only in your working copy and not on the server!

svn ci -m "your log message"

log

It is good practice to write log messages with commits. You can review these log messages with

svn log

You should do an update of your working copy before this command, otherwise you will not get all messages. In case this is a lot of messages, you can add a limit, e.g., display only the latest 5 log entries:

svn log -l 5

status

To see which files of your working copy haven’t been committed yet:

svn status

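Common SVN status codes:

A    item is scheduled for addition
D    item is scheduled for deletion
M    item has been modified
C    item is in conflict
?    item is not under version control
!    item is missing (e.g. deleted without using svn)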

diff

To see what has changed in a file from the last version to the current version:

svn diff filename

More resources: You can always use “svn help” to see what else is there or take a look at the excellent book.

A typical SVN session

We assume you have created a working copy and there is already some content in your SVN that you share with others. All of this assumes that you are using a Linux shell and are in the folder of your working copy. If you are in the wrong folder, it won’t work (the most common cause of the errors “Skipped '.'” or “'.' is not a working copy”).

First thing you do is update (i.e. get the latest changes from the server), in case your teammates changed something. You don’t want to work on an old version!

svn update

Then you open some files, change some things (in "main.adb"), add a new file ("list.adb") and delete a different file ("array.adb"). After two hours of work you need a coffee, and it’s always a good idea to commit (i.e. send your changes to the server) before taking a longer break. Before you commit, you want to know what changed:

svn status

The message you get will look more or less like this:

M    main.adb
?    list.adb
!    array.adb

This means you have modified "main.adb", there is a file "list.adb" that SVN doesn’t know about (it is not under version control), and "array.adb" should be there, but SVN cannot find it.

If you just commit, only "main.adb" will get changed and on the next update "array.adb" will be restored in your working copy. Why? Because you need to tell SVN explicitly that you want a file to be added or deleted. So let’s do that.

svn add list.adb
svn del array.adb
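The two commands above follow mechanically from the status codes, which a few lines of shell can illustrate. This is only a sketch: the function name is made up, it works on the sample output from this post rather than calling svn, and it assumes the simple case where each line is just a one-letter code followed by the file name.

```shell
# Sample `svn status` output from this post (not read from a real working copy)
status_output='M    main.adb
?    list.adb
!    array.adb'

# Hypothetical helper: print the svn command each unknown or missing file needs
suggest() {
    printf '%s\n' "$1" | while read -r code file; do
        case "$code" in
            '?') echo "svn add $file" ;;   # not under version control yet
            '!') echo "svn del $file" ;;   # missing from the working copy
        esac
    done
}

suggest "$status_output"
```

For the sample above this prints svn add list.adb and svn del array.adb. It only prints the commands, so you can still decide whether an unknown file really belongs in the SVN.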

Now let’s check the status again, the result will be:

M    main.adb
A    list.adb
D    array.adb

We are satisfied and commit the whole thing:

svn ci -m "Replaced array with list, added list.adb, deleted array.adb"

It is always a very good idea to write a meaningful commit message (the parameter -m), so that your teammates know what has been changed. It also makes it easier to go back to a specific version, e.g. the version just before you removed the array.

Creating a SVN working copy (checkout)

You will need to do this once to get the first working copy from the server to your computer.

svn co server_url folder_where_you_want_to_have_your_working_copy

The "server URL" usually isn’t a URL like the ones you know from the web. It can point to a local path via file:// (this would work e.g. if you are inside the IMS and want to access an SVN repository that is located in a folder that you have mounted) or use something like svn+ssh. Whoever created the SVN for you should tell you the server URL.