Just a few quick words about the assumption we made in the last post, namely that all entries in our vectors are positive, so that we can define the cosine distance as 1 minus the similarity. This assumption is actually not necessary. We can have negative entries: as long as our vectors are normalized to unit length, everything still works.
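Remember the Euclidean distance for unit vectors:
$$\|v - w\| = \sqrt{\|v\|^2 + \|w\|^2 - 2\, v \cdot w} = \sqrt{2 - 2\, v \cdot w}$$
And the cosine similarity for two unit vectors:
$$\cos(v, w) = \frac{v \cdot w}{\|v\|\,\|w\|} = v \cdot w$$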
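As a quick sanity check, here is a small sketch in plain Python that normalizes two vectors with negative entries and compares the squared Euclidean distance with 2 minus twice the dot product. The helper names (`normalize`, `dot`, `euclidean`) and the example vectors are made up for this illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Arbitrary example vectors, deliberately containing negative entries.
v = normalize([3.0, -4.0, 1.0])
w = normalize([-2.0, 0.5, -1.5])

squared_distance = euclidean(v, w) ** 2
from_dot_product = 2 - 2 * dot(v, w)

# For unit vectors both quantities should agree up to rounding error.
identity_holds = math.isclose(squared_distance, from_dot_product, abs_tol=1e-12)
print(identity_holds)
```

Nothing here depends on the sign of the entries. The only ingredient is that both vectors have length 1.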
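So now, like we did in the last post, let’s say we have two vectors v and w and we know that, measured with Euclidean distance, v is closer to some other point p than w is*:
$$\|p - v\| \le \|p - w\|$$
We do the same steps as in the last post (plug in the unit-vector formula, square both sides, divide by 2), but then go on and get rid of the 1 and the minus (attention, multiplying by -1 changes the direction of the inequality):
$$\sqrt{2 - 2\, p \cdot v} \le \sqrt{2 - 2\, p \cdot w}$$
$$1 - p \cdot v \le 1 - p \cdot w$$
$$-\, p \cdot v \le -\, p \cdot w$$
$$p \cdot v \ge p \cdot w$$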
Voila, cosine similarity!
So if p is closer to v than to w as measured with Euclidean distance, the cosine similarity of p and v is higher than that of p and w:
$$\|p - v\| \le \|p - w\| \;\Rightarrow\; \cos(p, v) \ge \cos(p, w)$$
So whenever you have unit length vectors and are only interested in relative distances, it shouldn’t make a difference whether you use Euclidean distance or cosine similarity: both rank the neighbours of a point in the same order.
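If you want to see this in action, the following sketch (plain Python, with made-up helper names and randomly generated data, so treat it as an illustration rather than a proof) ranks a set of random unit vectors around a query point p, once by increasing Euclidean distance and once by decreasing cosine similarity, and checks that both orderings agree.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def random_unit_vector(rng, dim):
    """Gaussian entries (positive and negative), normalized to length 1."""
    return normalize([rng.gauss(0, 1) for _ in range(dim)])

rng = random.Random(42)  # fixed seed so the run is repeatable
p = random_unit_vector(rng, 10)
candidates = [random_unit_vector(rng, 10) for _ in range(50)]
indices = range(len(candidates))

# Nearest first by Euclidean distance.
by_distance = sorted(indices, key=lambda i: euclidean(p, candidates[i]))
# Most similar first by cosine similarity (negated for descending order).
by_similarity = sorted(indices, key=lambda i: -dot(p, candidates[i]))

same_ranking = by_distance == by_similarity
print(same_ranking)
```

One caveat: with floating point numbers, two candidates that are almost exactly equally far from p could in principle swap places between the two rankings, so exact ties are best ignored, just like the equal-part in the formulas.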
* Same footnote as last time: The text says “closer” and not “closer or the same” and that is actually what I wanted to say, but there seems to be some strange bug in this LaTeX plugin that doesn’t allow you to use the < sign in a formula... so we'll take the less-or-equal sign and just ignore the equal-part.