Thursday, December 20, 2012

What is Cosine Similarity?

This tutorial is for newbies who are trying to work with similarity measures between two documents or sentences and found nothing useful on the web.
Cosine similarity measures how similar two sentences or documents are, expressed as a single number. In general a cosine lies in the range [-1,1], but because term frequencies are never negative, for text the value falls in [0,1]: 0 means the two texts share no terms, and 1 means their term-frequency vectors point in exactly the same direction. That's all, that is cosine similarity. Let me clarify it with an example.

Let's consider two sentences:
1. Xeon goes to marry Xeonian girl, a girl.

2. Leon goes to forest to find Xeon.

From the first sentence, we count the terms and their respective frequencies:
Terms      Frequencies
Xeon       1
goes       1
to         1
marry      1
Xeonian    1
girl       2
a          1

If we do the same for the second sentence:
Terms      Frequencies
Leon       1
goes       1
to         2
forest     1
find       1
Xeon       1

Adding up the frequencies in the tables above, the total number of terms in sentence 1 is 8 and in sentence 2 is 7.
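If you would rather let a program do the counting, here is a minimal Python sketch. It assumes a very simple tokenizer (any run of letters is a term, and case is kept as written); the helper name `term_frequencies` is made up for this example.

```python
import re
from collections import Counter


def term_frequencies(sentence):
    """Count how often each term occurs in a sentence."""
    # Assumption: a term is any run of letters, so "girl," becomes "girl".
    # Case is preserved, so "Xeon" and "xeon" would count as different terms.
    terms = re.findall(r"[A-Za-z]+", sentence)
    return Counter(terms)


freq1 = term_frequencies("Xeon goes to marry Xeonian girl, a girl.")
freq2 = term_frequencies("Leon goes to forest to find Xeon.")

print(freq1["girl"], sum(freq1.values()))  # 2 8
print(freq2["to"], sum(freq2.values()))    # 2 7
```

A real project would usually also lowercase the text and drop very common words, but that is left out here to keep the counts identical to the tables.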

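Now grab a coffee and let's take a break. When there's mathematics, just take a break. Ready for some mathematics? Here it goes. Recall vectors: let's suppose vector a = [2,2] and vector b = [0,1].
Then the cos product of vectors a and b, i.e. the cosine of the angle between them, is their dot product divided by the product of their lengths:

$$\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$$

i.e. for the above example it will be:

$$\cos\theta = \frac{2 \cdot 0 + 2 \cdot 1}{\sqrt{2^2 + 2^2}\,\sqrt{0^2 + 1^2}} = \frac{2}{\sqrt{8}} \approx 0.7071$$

That's all.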
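The cosine of the angle between two vectors is their dot product divided by the product of their lengths. Here is a small Python sketch of that calculation for the example vectors; the function name `cosine` is just an illustration, and it assumes both vectors have the same length and neither is all zeros.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))     # a . b
    norm_a = math.sqrt(sum(x * x for x in a))  # length of a
    norm_b = math.sqrt(sum(y * y for y in b))  # length of b
    # Assumption: neither vector is all zeros, otherwise this divides by zero.
    return dot / (norm_a * norm_b)


print(round(cosine([2, 2], [0, 1]), 4))  # 0.7071
```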

Now let's move on to our topic. Cosine Similarity !!

Assuming you now know what the cos product is, watch what I do with the terms in sentence 1 and sentence 2: I list every term that appears in either sentence, along with its frequency in each.

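Terms      Freq. in 1   Freq. in 2
Xeon       1            1
goes       1            1
to         1            2
marry      1            0
Xeonian    1            0
girl       2            0
a          1            0
Leon       0            1
forest     0            1
find       0            1


Then let: vec1 = [1,1,1,1,1,2,1,0,0,0] and vec2 = [1,1,2,0,0,0,0,1,1,1].
Therefore finally we get:

$$\cos\theta = \frac{\mathrm{vec1} \cdot \mathrm{vec2}}{\|\mathrm{vec1}\|\,\|\mathrm{vec2}\|} = \frac{1 + 1 + 2}{\sqrt{10}\,\sqrt{9}} = \frac{4}{3\sqrt{10}} \approx 0.4216$$

Only the terms that occur in both sentences (Xeon, goes, to) contribute to the dot product, so the cosine similarity of the two sentences is about 0.42.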
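For anyone who wants to check the number by machine, here is a hedged end-to-end Python sketch. It counts every term in each sentence (including "a", which occurs once in sentence 1) and works directly on the frequency tables instead of building the full vectors, since terms missing from either sentence add nothing to the dot product. The function name `cosine_similarity` and the letters-only tokenizer are assumptions made for this example, not part of any library.

```python
import math
import re
from collections import Counter


def cosine_similarity(text1, text2):
    """Cosine similarity of two texts based on raw term frequencies."""
    # Assumption: a term is any run of letters, case-sensitive.
    freq1 = Counter(re.findall(r"[A-Za-z]+", text1))
    freq2 = Counter(re.findall(r"[A-Za-z]+", text2))

    # Only terms present in both texts contribute to the dot product.
    shared = freq1.keys() & freq2.keys()
    dot = sum(freq1[t] * freq2[t] for t in shared)

    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm1 * norm2)


score = cosine_similarity(
    "Xeon goes to marry Xeonian girl, a girl.",
    "Leon goes to forest to find Xeon.",
)
print(round(score, 4))  # 0.4216
```

Raw counts are the simplest weighting; swapping in tf-idf weights changes the numbers in the tables but not the cosine formula itself.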

Further reading:

Personally, in my projects I use Lucene. Lucene is very cool: you can treat the index made by Lucene as a database, run searches very fast, and it supports various queries. If you need the code to calculate similarity using Lucene 3.x in Java, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY USING LUCENE 3.x.
And if you need code for calculating cosine similarity in Java using Lucene 4.x, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY USING LUCENE 4.X.

To calculate cosine similarity using tf-idf in Java without using Lucene, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY, TF-IDF

Fire away if you need any help!!

2 comments:

  1. I think you left the freq of "a" in while calculating Freq 1 and the dot product is wrong..

  2. @SiGKiLL :
    Yes, exactly. Thank you. I will update the blog soon.
