Euclidean and cosine distance for unit vectors (and negative entries!)

Just a few quick words about the assumption we made in the last post that all entries in our vectors are positive, so that we can define the cosine distance as 1 minus the similarity. This assumption is actually not necessary. We can have negative entries: as long as our vectors are normalized to unit length, everything still works.

Remember Euclidean distance for unit vectors:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{2(1 - \sum_i p_i q_i)}

And cosine similarity for two unit vectors:
s_{\text{cosine}}(\vec{p},\vec{q}) = \sum_i p_i q_i

So now, like we did in the last post, let’s say we have two vectors v and w and we know that, measured with Euclidean distance, v is closer to some other point p than w is*:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})

We do the same steps as in the last post, but then go on to subtract the 1 from both sides and multiply by -1 (attention, multiplying by a negative number changes the direction of the inequality):
1 - \sum_i p_i v_i \leq 1 - \sum_i p_i w_i
\Leftrightarrow  - \sum_i p_i v_i \leq - \sum_i p_i w_i
\Leftrightarrow  \sum_i p_i v_i \geq \sum_i p_i w_i

Voila, cosine similarity!

So if p is closer to v than to w as measured with Euclidean distance, the cosine similarity of p and v is higher than that of p and w:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})  \Leftrightarrow  s_{\text{cosine}}(\vec{p},\vec{v}) \geq s_{\text{cosine}}(\vec{p},\vec{w})

So whenever you have unit length vectors and are only interested in relative distances, it shouldn’t make a difference whether you use Euclidean distance or cosine similarity.
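To make the claim a bit more tangible, here is a minimal sketch in plain Python that checks it on random data. The helper names (`normalize`, `euclidean_distance`, `cosine_similarity_unit`), the dimension, the number of candidates and the seed are all assumptions made up for this illustration. The entries are drawn from [-1, 1], so negative values are included on purpose.

```python
import math
import random

def normalize(vec):
    """Scale vec to unit length (Euclidean norm 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity_unit(p, q):
    # For unit vectors the cosine similarity is just the dot product.
    return sum(a * b for a, b in zip(p, q))

random.seed(0)  # fixed seed so the run is repeatable
dim = 5

# Query point p and 20 candidates, all with possibly negative entries.
p = normalize([random.uniform(-1, 1) for _ in range(dim)])
candidates = [
    normalize([random.uniform(-1, 1) for _ in range(dim)])
    for _ in range(20)
]

indices = range(len(candidates))
# Candidate indices from nearest to farthest by Euclidean distance ...
by_distance = sorted(indices, key=lambda i: euclidean_distance(p, candidates[i]))
# ... and from most to least similar by cosine similarity.
by_similarity = sorted(indices, key=lambda i: -cosine_similarity_unit(p, candidates[i]))

# Both orderings should agree (barring floating-point near-ties).
print(by_distance == by_similarity)
```

Sorting by increasing distance and by decreasing similarity should give the same order of candidates, which is exactly the equivalence shown above.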

* Same footnote as last time: The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.

Euclidean and cosine distance for unit vectors

The Euclidean distance between two vectors p and q is the length of the line segment that connects them (here and in all following formulas the sum is over all dimensions of the vectors, i.e., if we have n dimensions the sum ranges from i=1 to n):
d_{\text{euclid}}(\vec{p},\vec{q}) = |\vec{p} - \vec{q}| = \sqrt{\sum_i (p_i - q_i)^2}

Using the binomial expansion, we can write this as follows:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{\sum_i p_i^2 - 2\sum_i p_i q_i +\sum_i q_i^2}

Unit vectors have a length of 1 (by definition). Length is calculated as the Euclidean norm, that is, the Euclidean distance of a vector to the zero vector, i.e., the square root of the sum of all squared entries in the vector:
|\vec{p}| = d_{\text{euclid}}(\vec{p},0) = \sqrt{\sum_i (p_i-0)^2 } = \sqrt{\sum_i p_i^2 }

If something is 1, its square is also 1:
\sqrt{\sum_i p_i^2 } = 1  \Leftrightarrow \sum_i p_i^2 = 1

We can now replace the two sums of squared vector elements in the formula for Euclidean distance with 1:
d_{\text{euclid}}(\vec{p},\vec{q}) = \sqrt{1 - 2\sum_i p_i q_i + 1} = \sqrt{2 - 2\sum_i p_i q_i} = \sqrt{2(1 - \sum_i p_i q_i)}
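As a quick sanity check of this simplification, the following sketch computes the distance once directly and once via the shortened formula. The two example vectors and the helper name `normalize` are assumptions chosen for illustration only.

```python
import math

def normalize(vec):
    """Scale vec to unit length (Euclidean norm 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

p = normalize([3.0, 4.0])  # roughly [0.6, 0.8]
q = normalize([1.0, 0.0])  # [1.0, 0.0]

# Direct definition: square root of the summed squared differences.
direct = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Simplified form for unit vectors: sqrt(2 * (1 - dot product)).
dot = sum(a * b for a, b in zip(p, q))
via_formula = math.sqrt(2 * (1 - dot))

print(direct, via_formula)
print(math.isclose(direct, via_formula))
```

Both values should agree up to floating-point rounding. Without the normalization step the two would generally differ, because the sums of squares would no longer be 1.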

Now let’s see how the cosine distance is defined. The more common thing to do is to calculate the cosine similarity of two vectors as the cosine of the angle between them:
s_{\text{cosine}}(\vec{p},\vec{q}) = \frac{\vec{p} \cdot \vec{q}}{|\vec{p}| |\vec{q}|} = \frac{\sum_i p_i q_i}{|\vec{p}| |\vec{q}|}

As we have unit vectors, we can get rid of the division by the length (which is always 1), so the formula is simplified to the dot product between the two vectors:
s_{\text{cosine}}(\vec{p},\vec{q}) = \sum_i p_i q_i
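The same simplification can be checked numerically. This is only a sketch with made-up example vectors; `norm` and `cosine_similarity` are hypothetical helper names, not functions from any particular library.

```python
import math

def norm(vec):
    """Euclidean length of vec."""
    return math.sqrt(sum(x * x for x in vec))

def cosine_similarity(p, q):
    """General form: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (norm(p) * norm(q))

# Two vectors that are NOT unit length yet.
p = [2.0, 0.0, 1.0]
q = [1.0, 3.0, 1.0]

# Normalize first, then take the plain dot product.
p_unit = [x / norm(p) for x in p]
q_unit = [x / norm(q) for x in q]
dot_of_units = sum(a * b for a, b in zip(p_unit, q_unit))

print(math.isclose(cosine_similarity(p, q), dot_of_units))
```

Dividing by the lengths inside the formula or normalizing the vectors up front should lead to the same value, so for vectors that are already normalized the division can be dropped.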

When we have a vector space where the entries correspond to occurrences of terms in a document, all entries are non-negative, so the value of the cosine similarity will always be between zero and one. This means we can define the cosine distance as:
d_{\text{cosine}}(\vec{p},\vec{q}) = 1 - s_{\text{cosine}}(\vec{p},\vec{q}) = 1 - \sum_i p_i q_i
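To get a feeling for the range of this distance, here is a small sketch with hypothetical term counts for three tiny documents (the counts and the helper names are invented for this example). It also includes one vector with negative entries, purely to show what the restriction to term counts rules out.

```python
import math

def normalize(vec):
    """Scale vec to unit length (Euclidean norm 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance_unit(p, q):
    # 1 minus the dot product; only valid for unit vectors.
    return 1 - sum(a * b for a, b in zip(p, q))

# Hypothetical term counts (one entry per term in the vocabulary).
doc_a = normalize([2, 0, 1, 0])
doc_b = normalize([0, 3, 0, 1])  # shares no terms with doc_a
doc_c = normalize([4, 0, 2, 0])  # same term proportions as doc_a

# Not a term-count vector: doc_a pointing in the opposite direction.
opposite = [-x for x in doc_a]

dist_ab = cosine_distance_unit(doc_a, doc_b)           # no overlap: 1
dist_ac = cosine_distance_unit(doc_a, doc_c)           # same direction: about 0
dist_opposite = cosine_distance_unit(doc_a, opposite)  # about 2

print(dist_ab, dist_ac, dist_opposite)
```

With non-negative entries the distance stays between 0 and 1. Once negative entries are allowed, the similarity can drop below zero and the distance can grow up to 2.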

So let’s put it together. Let’s say we have two vectors v and w and we know that measured with Euclidean distance, v is closer to some other point p than w is*:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})

We can now replace the Euclidean distance with the formula from above, square both sides (because both sides are non-negative, that doesn’t change the inequality relation) and get rid of the two that appears on both sides:
\sqrt{2(1 - \sum_i p_i v_i)} \leq \sqrt{2(1 - \sum_i p_i w_i)}
\Leftrightarrow  2(1 - \sum_i p_i v_i) \leq 2(1 - \sum_i p_i w_i)
\Leftrightarrow  1 - \sum_i p_i v_i \leq 1 - \sum_i p_i w_i

What we are left with is the cosine distance! So, putting start and end together, what we have shown is:
d_{\text{euclid}}(\vec{p},\vec{v}) \leq d_{\text{euclid}}(\vec{p},\vec{w})  \Leftrightarrow  d_{\text{cosine}}(\vec{p},\vec{v}) \leq d_{\text{cosine}}(\vec{p},\vec{w})

This doesn’t mean that you will get the same number when you calculate the Euclidean distance and the cosine distance between two vectors. But whenever you are only interested in relative distances (that means you only want to know which of two vectors is closer to something than the other) and you have vectors that are normalized to unit length with only non-negative entries, then the result should be the same whether you use cosine or Euclidean distance.
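A small worked example may help to see both points at once: the two distances give different numbers, but they agree on which vector is closer. The three vectors below are made-up term counts and the helper names are assumptions for this sketch.

```python
import math

def normalize(vec):
    """Scale vec to unit length (Euclidean norm 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance_unit(p, q):
    # 1 minus the dot product; only valid for unit vectors.
    return 1 - sum(a * b for a, b in zip(p, q))

# Hypothetical term-count vectors, normalized to unit length.
p = normalize([1, 2, 0])
v = normalize([1, 3, 0])
w = normalize([3, 0, 1])

d_e_v, d_e_w = euclidean_distance(p, v), euclidean_distance(p, w)
d_c_v, d_c_w = cosine_distance_unit(p, v), cosine_distance_unit(p, w)

# The absolute numbers differ between the two measures ...
print(d_e_v, d_c_v)
print(d_e_w, d_c_w)
# ... but both agree on whether v or w is closer to p.
print((d_e_v < d_e_w) == (d_c_v < d_c_w))
```

Here v should come out closer to p than w under both measures, even though the Euclidean and cosine distances themselves are quite different numbers.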

* The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.