Euclidean and cosine distance for unit vectors (and negative entries!)

Just a few quick words about the assumption we made in the last post about all our entries in the vectors being positive so that we can define the cosine distance as 1 minus the similarity. This assumption is actually not necessary. We can have negative entries, as long as our vectors are normalized to unit length everything still works.

Remember Euclidean distance for unit vectors:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{2(1 - \sum_i p_i q_i)}

And cosine similarity for two unit vectors:
s_{\text{cosine}}(\vec{p},\vec{q}) = \sum_i p_i q_i

So now, like we did in the last post, let’s say we have two vectors v and w and we know that measured with Euclidean distance, v is closer to some other point p than w*:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})

We do the same steps as in the last post, but then go on and get rid of the 1 and the minus (attention, this changes the direction of the inequality):
1 - \sum_i p_i v_i \leq 1 - \sum_i p_i w_i
\Leftrightarrow  - \sum_i p_i v_i \leq - \sum_i p_i w_i
\Leftrightarrow  \sum_i p_i v_i \geq \sum_i p_i w_i

Voila, cosine similarity!

So if p is closer to v than to w as measured with Euclidean distance, the cosine similarity of p and v is higher than that of p and w:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})  \Leftrightarrow  s_{\text{cosine}}(\vec{p},\vec{v}) \geq s_{\text{cosine}}(\vec{p},\vec{w})

So whenever you have unit length vectors and are only interested in relative distances, it shouldn’t make a distance whether you use Euclidean distance or cosine similarity.

* Same footnote as last time: The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.