NOTE: Lucene 4.x users, please refer to "Calculate Cosine Similarity Using Lucene".
Beginners doing a text-mining project are often troubled by terms like:
- TF-IDF
- COSINE SIMILARITY
- CLUSTERING
- DOCUMENT VECTORS
Many of you must be familiar with Tf-Idf (Term Frequency-Inverse Document Frequency).
I will explain these terms in brief.
Term Frequency:
Suppose a document titled "Tf-Idf Brief Introduction" contains 60000 words in total, and the word "Term-Frequency" occurs 60 times.
Then, mathematically, its term frequency is TF = 60/60000 = 0.001.
Inverse Document Frequency:
Suppose you bought the whole Harry Potter series. There are 7 books in the series, and the word "AbraKaDabra" appears in 2 of them.
Then, mathematically, its inverse document frequency is IDF = 1 + log(7/2) = ....... (calculate it, guys; don't be lazy. I am the lazy one, not you.) Here log is the natural logarithm, which is what Math.log computes in the code below.
And finally, TFIDF = TF * IDF.
Having seen the math, I assume you now understand what it means in practice.
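To make the arithmetic above concrete, here is a minimal sketch that plugs the numbers from the two examples into the formulas. The class name TfIdfArithmetic and its method names are made up for illustration, and the logarithm is assumed to be the natural log.

```java
public class TfIdfArithmetic {

    /** Term frequency: occurrences of the term divided by total words in the document. */
    public static double tf(int occurrences, int totalWords) {
        return (double) occurrences / totalWords;
    }

    /** Inverse document frequency as used in this post: 1 + ln(N / n). */
    public static double idf(int totalDocs, int docsWithTerm) {
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double tf = tf(60, 60000);   // the "Term-Frequency" example
        double idf = idf(7, 2);      // the "AbraKaDabra" example: 1 + ln(3.5), roughly 2.2528
        System.out.println("TF     = " + tf);
        System.out.println("IDF    = " + idf);
        System.out.println("TF-IDF = " + tf * idf);
    }
}
```

The casts to double matter: with plain int division, 60 / 60000 would evaluate to 0.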
Document Vector:
There are various ways to calculate document vectors; I am just giving you one example. Suppose I calculate the TF-IDF of every term in document A and store the scores in an array (or a list, a matrix, or any other ordered structure; you guys are geniuses, you know how to create a vector). Then I get a document vector of TF-IDF scores for document A. In the code below, the vector has one slot for every unique term across all the documents, so every document vector has the same length.
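As a rough illustration of the idea, the sketch below builds TF-IDF document vectors for two tiny in-memory documents. The class DocumentVectorSketch, its method names, and the sample words are all made up for this example; it uses exact string matching and the 1 + ln(N/n) form of IDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentVectorSketch {

    /** Collects every distinct term across all documents, in first-seen order. */
    public static List<String> vocabulary(List<String[]> docs) {
        List<String> vocab = new ArrayList<>();
        for (String[] doc : docs) {
            for (String term : doc) {
                if (!vocab.contains(term)) {
                    vocab.add(term);
                }
            }
        }
        return vocab;
    }

    /** Builds one TF-IDF vector for doc, with one slot per vocabulary term. */
    public static double[] vector(String[] doc, List<String[]> docs, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            String term = vocab.get(i);
            long inDoc = Arrays.stream(doc).filter(term::equals).count();
            long docsWithTerm = docs.stream()
                    .filter(d -> Arrays.asList(d).contains(term))
                    .count();
            double tf = (double) inDoc / doc.length;
            double idf = 1 + Math.log((double) docs.size() / docsWithTerm);
            v[i] = tf * idf;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"harry", "potter", "magic"},
                new String[] {"magic", "wand"});
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String[] doc : docs) {
            System.out.println(Arrays.toString(vector(doc, docs, vocab)));
        }
    }
}
```

Both vectors have four slots even though the documents have three and two words; a term that a document does not contain simply gets a score of 0 in its slot.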
The class shown below calculates the term frequency (TF) and the inverse document frequency (IDF).
//TfIdf.java
package com.computergodzilla.tfidf;

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0; //to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : the term arrays of all the documents, one array per document
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0; //number of documents that contain termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
The class shown below parses the text documents and splits them into tokens. It calls the TfIdf.java class to calculate the TF-IDF scores, and the CosineSimilarity.java class to calculate the similarity between the parsed documents.
//DocumentParser.java
package com.computergodzilla.tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Class to read documents
 *
 * @author Mubin Shrestha
 */
public class DocumentParser {

    //This variable will hold all terms of each document in an array.
    private List<String[]> termsDocsArray = new ArrayList<>();
    private List<String> allTerms = new ArrayList<>(); //to hold all terms
    private List<double[]> tfidfDocsVector = new ArrayList<>();

    /**
     * Method to read files and store in array.
     * @param filePath : path of the folder that holds the source files
     * @throws FileNotFoundException
     * @throws IOException
     */
    public void parseFiles(String filePath) throws FileNotFoundException, IOException {
        File[] allfiles = new File(filePath).listFiles();
        if (allfiles == null) { //listFiles() returns null when filePath is not a readable folder
            throw new FileNotFoundException("Not a readable folder: " + filePath);
        }
        for (File f : allfiles) {
            if (f.getName().endsWith(".txt")) {
                StringBuilder sb = new StringBuilder();
                try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                    String s = null;
                    while ((s = in.readLine()) != null) {
                        sb.append(s).append(' '); //keep words on adjacent lines apart
                    }
                }
                String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
                for (String term : tokenizedTerms) {
                    if (!allTerms.contains(term)) { //avoid duplicate entry
                        allTerms.add(term);
                    }
                }
                termsDocsArray.add(tokenizedTerms);
            }
        }
    }

    /**
     * Method to create termVector according to its tfidf score.
     */
    public void tfIdfCalculator() {
        double tf; //term frequency
        double idf; //inverse document frequency
        double tfidf; //term frequency inverse document frequency
        for (String[] docTermsArray : termsDocsArray) {
            double[] tfidfvectors = new double[allTerms.size()];
            int count = 0;
            for (String terms : allTerms) {
                tf = new TfIdf().tfCalculator(docTermsArray, terms);
                idf = new TfIdf().idfCalculator(termsDocsArray, terms);
                tfidf = tf * idf;
                tfidfvectors[count] = tfidf;
                count++;
            }
            tfidfDocsVector.add(tfidfvectors); //storing document vectors;
        }
    }

    /**
     * Method to calculate cosine similarity between all the documents.
     */
    public void getCosineSimilarity() {
        for (int i = 0; i < tfidfDocsVector.size(); i++) {
            for (int j = 0; j < tfidfDocsVector.size(); j++) {
                System.out.println("between " + i + " and " + j + " = "
                        + new CosineSimilarity().cosineSimilarity(
                                tfidfDocsVector.get(i),
                                tfidfDocsVector.get(j)));
            }
        }
    }
}
This is the class that calculates Cosine Similarity:
//CosineSimilarity.java
package com.computergodzilla.tfidf;

/**
 * Cosine similarity calculator class
 * @author Mubin Shrestha
 */
public class CosineSimilarity {

    /**
     * Method to calculate cosine similarity between two documents.
     * @param docVector1 : document vector 1 (a)
     * @param docVector2 : document vector 2 (b)
     * @return cosine similarity of the two vectors, or 0.0 if either vector is all zeros
     */
    public double cosineSimilarity(double[] docVector1, double[] docVector2) {
        double dotProduct = 0.0;
        double magnitude1 = 0.0;
        double magnitude2 = 0.0;
        double cosineSimilarity = 0.0;

        for (int i = 0; i < docVector1.length; i++) { //docVector1 and docVector2 must be of same length
            dotProduct += docVector1[i] * docVector2[i]; //a.b
            magnitude1 += Math.pow(docVector1[i], 2); //(a^2)
            magnitude2 += Math.pow(docVector2[i], 2); //(b^2)
        }

        magnitude1 = Math.sqrt(magnitude1); //sqrt(a^2)
        magnitude2 = Math.sqrt(magnitude2); //sqrt(b^2)

        if (magnitude1 != 0.0 && magnitude2 != 0.0) { //both must be non-zero to divide safely
            cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
        } else {
            return 0.0;
        }
        return cosineSimilarity;
    }
}
Here's the main class to run the code:
//TfIdfMain.java
package com.computergodzilla.tfidf;

import java.io.FileNotFoundException;
import java.io.IOException;

/**
 *
 * @author Mubin Shrestha
 */
public class TfIdfMain {

    /**
     * Main method
     * @param args
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static void main(String args[]) throws FileNotFoundException, IOException {
        DocumentParser dp = new DocumentParser();
        dp.parseFiles("D:\\FolderToCalculateCosineSimilarityOf"); // give the location of the folder that holds the source files
        dp.tfIdfCalculator(); //calculates tfidf
        dp.getCosineSimilarity(); //calculates cosine similarity
    }
}
You can also download the whole source code from here: Download.
Overall, I first calculate the TF-IDF scores for all the documents and build a document vector for each document. Then I use those document vectors to calculate the cosine similarity.
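For a feel of what the last step produces, here is a small standalone sketch of cosine similarity on hand-made vectors. The class CosineSketch and the sample vectors are made up for illustration; it returns 0.0 when either vector is all zeros, which is a design choice of this sketch.

```java
public class CosineSketch {

    /** Cosine of the angle between a and b; returns 0.0 if either vector is all zeros. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 0.0};
        double[] b = {2.0, 4.0, 0.0};   // same direction as a: similarity close to 1
        double[] c = {0.0, 0.0, 3.0};   // shares no non-zero slot with a: similarity 0
        System.out.println(cosine(a, b));
        System.out.println(cosine(a, c));
    }
}
```

Since TF-IDF scores are never negative here, the similarity of two document vectors falls between 0 (no shared terms) and 1 (same direction).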
If you think the explanation is not clear enough, hit me up.
Happy Text-Mining!!
java.lang.NoClassDefFoundError: com/computergodzilla/tfidf/TfIdfMain
Caused by: java.lang.ClassNotFoundException: com.computergodzilla.tfidf.TfIdfMain
at java.net.URLClassLoader$1.run(URLClassLoader.java:221)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:209)
at java.lang.ClassLoader.loadClass(ClassLoader.java:324)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:269)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:337)
Exception in thread "main" Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
@prasanna wadekar
Create a package named "com.computergodzilla.tfidf", copy all the downloaded files into this package, and run the project. This should solve your problem.
What if I want to print the TfIdf value for a particular term?
@Abha
You can simply do that by using:
tf = new TfIdf().tfCalculator(docTermsArray, term); //give your term here
idf = new TfIdf().idfCalculator(termsDocsArray, term);
tfidf = tf * idf;
System.out.println(tfidf); //this is your required tfidf value.
How can I print the document file names in the output ("between " + i + " and " + j + " = ") in getCosineSimilarity?
@Jubilee:
First add a list to store all the filenames. For this, add the line below to DocumentParser.java:
private List<String> fileNameList = new ArrayList<>();
Next, add all the filenames to the list as shown below:
if (f.getName().endsWith(".txt")) {
fileNameList.add(f.getName()); ///add here
in = new BufferedReader(new FileReader(f));
StringBuilder sb = new StringBuilder();
Then you can specify document file name as below:
System.out.println("between " + fileNameList.get(i) + " and " + fileNameList.get(j) + " = "
@shresthaMubin Thank you - it works. I also noticed that you have specified that docVector1 and docVector2 must be of the same length. Just wondering where you specified the length normalization in the CosineSimilarity class, since not all documents are of the same length for comparison.
Thank you for your reply! I also noticed that you have specified that docVector must be of the same length in CosineSimilarity. Just wondering where you specify the length normalization in that class, since not all documents are of the same length.
Thank you for your quick reply! Another question: I would like to know if you have done length normalization when comparing two document vectors (in case they are not of the same length) in CosineSimilarity. Thanks!
Hi shresthaMubin, thanks for the great tutorial. It's very easy to understand. I'd like to point out a possible optimization. You could precalculate the idfCalculator results and store them in a Hashtable before you start calculating TF. Neither of the arguments used in that function changes once you start calculating TFIDF. But it's probably easier to understand the code the way you wrote it.
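The precalculation suggested above could look roughly like this. It is only a sketch: the class name IdfCache and the method precompute are made up here, a HashMap stands in for the Hashtable mentioned in the comment, and every term is assumed to occur in at least one document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdfCache {

    /** Computes 1 + ln(N / n) once per term and stores it for later lookups. */
    public static Map<String, Double> precompute(List<String[]> docs, List<String> allTerms) {
        Map<String, Double> idf = new HashMap<>();
        for (String term : allTerms) {
            int docsWithTerm = 0;
            for (String[] doc : docs) {
                for (String s : doc) {
                    if (s.equalsIgnoreCase(term)) {
                        docsWithTerm++;
                        break;   // count each document at most once
                    }
                }
            }
            idf.put(term, 1 + Math.log((double) docs.size() / docsWithTerm));
        }
        return idf;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"magic", "wand"},
                new String[] {"magic", "broom"});
        Map<String, Double> idf = precompute(docs, Arrays.asList("magic", "wand", "broom"));
        System.out.println(idf);
    }
}
```

With such a map, the inner loop of tfIdfCalculator could look up the IDF of each term instead of rescanning every document for every term of every document.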
Also, in CosineSimilarity.java, for the line:
if (magnitude1 != 0.0 | magnitude2 != 0.0) {
Shouldn't it be this instead?
if ((magnitude1 != 0.0) && (magnitude2 != 0.0)) {
If one of the variables was zero, it will still end up trying to divide by zero in the original code which is what you seem to be avoiding.
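To see what the two conditions do when exactly one magnitude is zero, here is a small sketch. The class ZeroMagnitudeCheck and its method names are made up for illustration. Note that with Java doubles the division does not throw an exception; it quietly yields NaN.

```java
public class ZeroMagnitudeCheck {

    /** The guard with a single "|": true when at least one magnitude is non-zero. */
    public static boolean originalGuard(double m1, double m2) {
        return m1 != 0.0 | m2 != 0.0;
    }

    /** The suggested guard: true only when both magnitudes are non-zero. */
    public static boolean suggestedGuard(double m1, double m2) {
        return m1 != 0.0 && m2 != 0.0;
    }

    public static void main(String[] args) {
        double m1 = 0.0;   // magnitude of an all-zero document vector
        double m2 = 2.5;
        System.out.println(originalGuard(m1, m2));    // true: lets the division through
        System.out.println(suggestedGuard(m1, m2));   // false: blocks it
        System.out.println(0.0 / (m1 * m2));          // NaN, because this is 0.0 / 0.0
    }
}
```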
@sw2de3fr4gt
Yes, that's a bug. I will fix it soon and update the content. Thank you.
@Jubilee:
The above code works for documents of any length. The document vector is created over all the unique terms of all the documents.
run:
Exception in thread "main" java.lang.NullPointerException
at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:37)
at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:26)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
I am getting this error while executing...
And the program shows error in the below line,
for (String[] ss : allTerms)
and the error is,
incompatible types
required: java.lang.String[]
found: java.lang.Object
Thank u
Update your jdk.
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
Type mismatch: cannot convert from element type Object to String[]
Type mismatch: cannot convert from element type Object to String
at DocumentParser.tfIdfCalculator(DocumentParser.java:64)
at TfIdfMain.main(TfIdfMain.java:28)
I get this error in TFIDF calculator method
public void tfIdfCalculator() {
    double tf; //term frequency
    double idf; //inverse document frequency
    double tfidf; //term frequency inverse document frequency
    for (String[] docTermsArray : termsDocsArray) {
        double[] tfidfvectors = new double[allTerms.size()];
        int count = 0;
        for (String terms : allTerms) {
            tf = new TfIdf().tfCalculator(docTermsArray, terms);
            idf = new TfIdf().idfCalculator(termsDocsArray, terms);
            tfidf = tf * idf;
            tfidfvectors[count] = tfidf;
            count++;
        }
        tfidfDocsVector.add(tfidfvectors); //storing document vectors;
    }
}
Hello. First, thank you for your effort in explaining the program. I have a question:
How can I calculate the cosine similarity between a file from one path and a file from another path?
What changes would the program need?
Just modify the function
parseFiles(String filePath)
to
parseFiles(String filePath1, String filePath2)
and replace
File[] allfiles = new File(filePath).listFiles();
with
List<File> allfiles = new ArrayList<File>(); //same name as before, so the loop that follows still compiles
for (File f : new File(filePath1).listFiles())
{
    allfiles.add(f);
}
for (File f : new File(filePath2).listFiles())
{
    allfiles.add(f);
}
Hi everyone, I need help with my assignment, which requires me to create a program to check the tf-idf of each word that a user searches for.
1. Loading in all the text document information from all the files. A set of files from Open American National Corpus is used for testing in this assignment.
2. Pre-process each text document to do the relevant word counts, storing the data in hashmaps (one hashmap for one text document) for fast retrieval during the analysis phase.
3. Provide a menu for the user to enter the search query terms, and then calculate the tf-idf score for each text document. For example, if the user enters the query term “Singapore attraction”, then the document will have a tf-idf score which is the sum of the tf-idf of Singapore + the tf-idf of attraction.
4. Display the top 10 query search documents with the score information. You are required to make use of the Comparable interface to help you do sorting.
Please share the code for this problem.
run:
Exception in thread "main" java.lang.NullPointerException
at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:37)
at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:26)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
I am getting this error while executing...
I have created a package, placed the code above in it, and executed it in NetBeans.
Still, it is not showing any output.
Did you give the location of the source files?
I made these changes in the main method:
dp.parseFiles("D:\student.txt");
student.txt is the tabulated source file that I've given.
run:
Exception in thread "main" java.lang.NullPointerException
at javaapplication3.DocumentParser.parseFiles(DocumentParser.java:36)
at javaapplication3.TfIdfMain.main(TfIdfMain.java:25)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
I'm getting this error.
Use dp.parseFiles("D:\\student.txt"); or dp.parseFiles("D:/student.txt"); instead of dp.parseFiles("D:\student.txt");. I guess you know why. Also, the above program calculates the cosine similarity between two or more files, and you are using only one file, so it will not work. Finally, pass a folder location to dp.parseFiles(""); instead of a file name.
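A likely cause of the NullPointerException above: File.listFiles() returns null when the path is not a folder (for example, when it points to a single file such as student.txt), and looping over that null result fails. The sketch below shows one way to guard against it; the class ListFilesCheck and the method safeList are made up for illustration.

```java
import java.io.File;

public class ListFilesCheck {

    /** Returns the entries of a folder, or an empty array when the path is not a readable folder. */
    public static File[] safeList(String folderPath) {
        File[] files = new File(folderPath).listFiles();   // null if folderPath is not a directory
        return files == null ? new File[0] : files;
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : ".";
        File[] files = safeList(path);
        if (files.length == 0) {
            System.out.println("No files found; is " + path + " an existing folder?");
        }
        for (File f : files) {
            System.out.println(f.getName());
        }
    }
}
```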
error: incompatible types
for (String[] ss : allTerms) {
required: String[]
found: Object
1 error
object cannot be converted to string
error in tfidf.java please help me
This has been a recurring problem for users. I will update my code base soon to make it run with older versions of the JDK. Please upgrade your JDK.
I have updated JDK 1.7 to JDK 1.8 as you said, but it is still giving the same error. Can you please help?
Download the source code from the download link in the post. I am sure that will work. I will update the blog. Let me know if it works for you.
Thanks a lot, it works.
I have one more query: what if I have to find the tfidf for only a single text document? How do I do this?
Hope you will help.
I am new to Java, so I am facing this many problems.
I am having an issue with the code. When I add two files to the folder, it shows a similarity of 0.0 between them, but when I add more than two, it shows a proper score. Why would that be? How can I correct it?
Please tell me why it is not showing the similarity when I add two files to the folder. It just shows a 0.0 score. But if I add more than two files, the score is correct.
It was a bug in my code base. I have corrected it. The issue was not with the number of files present in the folder; rather, the idf formula was wrong. The idf value would have been 0.0 when you ran the program. This scenario occurs when both files contain the same word. The code will work fine now. I have also updated the idf formula.
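The effect described above can be reproduced in a few lines. This is only a sketch under my reading of the bug, namely that the earlier formula was ln(N/n) without the leading "1 +"; the class IdfVariants and its method names are made up for illustration.

```java
public class IdfVariants {

    /** Plain IDF: ln(N / n). Zero whenever the term occurs in every document. */
    public static double plainIdf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    /** IDF with the leading 1, as in the post: 1 + ln(N / n). Never below 1. */
    public static double smoothedIdf(int totalDocs, int docsWithTerm) {
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // Two files, and the term appears in both of them.
        System.out.println(plainIdf(2, 2));      // 0.0
        System.out.println(smoothedIdf(2, 2));   // 1.0
    }
}
```

With only two files, every word they share gets a plain IDF of 0, and every word they do not share is 0 in one of the two vectors, so the dot product (and with it the cosine similarity) comes out as 0. A third file lets some shared words have a non-zero IDF, which would explain why the scores looked right with more files.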
Can you tell me how I can show only those files whose cosine score is greater than 0.4?
It's very simple. Make the allfiles variable a class-level field (it is a local variable inside parseFiles in the code above) and add an if condition to the code base as below:
/**
 * Method to calculate cosine similarity between all the documents.
 */
public void getCosineSimilarity() {
    for (int i = 0; i < tfidfDocsVector.size(); i++) {
        for (int j = 0; j < tfidfDocsVector.size(); j++) {
            double cosineSimilarity = new CosineSimilarity().cosineSimilarity
                    (
                        tfidfDocsVector.get(i),
                        tfidfDocsVector.get(j)
                    );
            if (cosineSimilarity > 0.4)
            {
                System.out.println("between " + allfiles[i].getName() + " and " + allfiles[j].getName() + " = " + cosineSimilarity);
            }
        }
    }
}
public double idfCalculator(List allTerms, String termToCheck) {
double count = 0;
for (String[] ss : allTerms) {
It's showing an error in this 3rd line now.
Thanks, it worked perfectly.
Thanks, it worked.
I need some changes in the formula, because this formula needs docs of the same length. Can we use a tf-idf formula where the length of the files won't affect the similarity score? One thing: if we use tf = 1 + log(tf) and idf = log(idf), can we achieve this goal? I did it, but I am getting NaN because the tf.idf score is negative. How can we resolve it? If it can be resolved, can you write the code for it?
First, to be clear: the formula does not need documents of the same length; the source documents can be of any length. For calculating cosine similarity, the two vectors undergoing the dot product must be of the same length. This does not mean that the documents need to be of the same length. My code transforms documents of any length into document vectors of the required length. Please read my "What is cosine similarity" blog.
Great article shresthaMubin. So helpful. Thanks.
Hmm, OK, thanks for clearing that up. Now my question is: what would happen if we calculate tf = 1 + Math.log(count / totalterms.length) and idf = Math.log(allTerms.size() / count)? Can we do this? If not, why?
Please study the wiki page http://en.wikipedia.org/wiki/Tf%E2%80%93idf to clarify your understanding of TF and IDF. The formulas you mentioned above are wrong, so you certainly can't use them.
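One way to see the problem with the tf variant asked about above is to evaluate it for a rare and an absent term. The sketch below does that and, for comparison, evaluates ln(1 + count), a commonly used log-scaled term frequency. The class LogTfSketch and its method names are made up for illustration, and the link to the reported NaN is my assumption: once -Infinity values enter the vectors, the cosine formula can end up dividing infinity by infinity.

```java
public class LogTfSketch {

    /** The variant asked about: 1 + ln(count / total). Negative or -Infinity for rare or absent terms. */
    public static double proposedTf(int count, int total) {
        return 1 + Math.log((double) count / total);
    }

    /** A common log-scaled alternative: ln(1 + count). Zero for absent terms, never negative. */
    public static double logScaledTf(int count) {
        return Math.log(1 + count);
    }

    public static void main(String[] args) {
        System.out.println(proposedTf(0, 1000));   // -Infinity: the term is absent
        System.out.println(proposedTf(1, 1000));   // negative: 1 + ln(0.001)
        System.out.println(logScaledTf(0));        // 0.0
        System.out.println(logScaledTf(3));        // ln(4)
    }
}
```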
Thanks for this code.
I want to ask how we can calculate the cosine similarity using TFIDF between two ontologies instead of documents, with the elements of the ontologies (such as classes and properties) taking the place of the words in a document.
shresthaMubin, I want source code for an information retrieval system in Java which will have the following functionalities:
1. The user will give the query to the system.
2. The system will show us the related ranked documents retrieved from the directory or corpus.
Kindly help me.. :(
my email id is : firstwebdevelopers@gmail.com
Why do different inputs give the same output? How do I give the input?
Give the location of the folder where you have all the files to be processed. I have commented the section where you should give the folder location.
Can you please tell me where I can add the file names? I'm so confused.
error: cannot find symbol
DocumentParser dp=new DocumentParser() ;
Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - incompatible types: java.lang.Object cannot be converted to java.lang.String[]
at com.computergodzilla.tfidf.DocumentParser.tfIdfCalculator(DocumentParser.java:67)
at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:32)
Java Result: 1
I get this even though I am using JDK 1.8. I have two txt files of unique words in a folder, but it does not work.
Did you work with the downloadable source code from the link https://drive.google.com/file/d/0BzQONlWil3VGRVNmYm5KUEJsTWM/view?usp=sharing? If not, please try it and let me know.
Can you provide sample data?
Why do you need sample data? You can try with any text files.
Can you please provide code for finding the idf value of more than one term jointly?
There certainly isn't any such thing as calculating idf for "more than one word jointly." TFIDF scoring is for a single term in a collection of documents. Please clarify your concepts regarding TFIDF. BTW, I have provided the TfIdf class for calculating TF and IDF above in the blog. You will have to calculate the tfidf of each term individually.
Can you please provide code for finding the idf of more than one term jointly?
NOTE: The reply below is the same as the one given to rohini 454 above.
There certainly isn't any such thing as calculating idf for "more than one word jointly." TFIDF scoring is for a single term in a collection of documents. Please clarify your concepts regarding TFIDF. BTW, I have provided the TfIdf class for calculating TF and IDF above in the blog. You will have to calculate the tfidf of each term individually.
I have copied all the Java programs into the TfIdfMain.java program. I am getting the following error. Please give a solution for this error.
error: class TfIdf is public, should be declared in a file named TfIdf.java.
You don't have to copy all the Java programs into TfIdfMain.java. TfIdfMain.java is the executor class. You should look into the tfIdfCalculator method of DocumentParser.java and follow up accordingly.
Can you please send a video of the execution of the above program? I tried, but I always get the error: can't find the symbol DocumentParser. Please show me the execution procedure once.
Please explain the execution procedure of the above program. Please help.
I need the above urgently, so please reply as early as possible.
Hello abc123, create a new project in your favourite IDE. Create a new package called com.computergodzilla.tfidf. Now copy all the above class files into that package. Change your source folder in TfIdfMain.java. And then run the program. If that doesn't help: I am really busy right now, but I will put up a detailed explanation this weekend. Just let me know if it helped. Thank you.
Thank you so much, it's working. But I got the output as follows:
between 0 and 0=1.0
between 0 and 1=0.0
between 1 and 0=0.0
between 1 and 1=1.0
This is the output I got. Please explain what the above values represent.
Please read my blog on "What is cosine similarity?"
computergodzilla.blogspot.com/2012/12/what-is-cosine-similarity.html
to understand what those values mean.
Actually, I need the tfidf value for a particular term which is present in the text files. Above, you gave modifications for finding the tfidf value of a particular term. I tried them, but it shows the error: gladiator cannot be resolved to a variable. Here gladiator is a term which is present in the text files, and I want to find the tfidf value for the term gladiator.
The above code gives the cosine similarity scores. Above all, giving gladiator as the input is not accepted. The above code takes files as input, not terms. So it is obvious that you will get the error.
How can we find the tfidf of a particular term? Please explain.
Replace tfIdfCalculator() with the method below:
/**
 * Method to create termVector according to its tfidf score.
 * term : pass your term here.
 */
public void tfIdfCalculator(String term) {
    double tf; //term frequency
    double idf; //inverse document frequency
    double tfidf; //term frequency inverse document frequency
    for (String[] docTermsArray : termsDocsArray) {
        double[] tfidfvectors = new double[allTerms.size()];
        int count = 0;
        tf = new TfIdf().tfCalculator(docTermsArray, term);
        idf = new TfIdf().idfCalculator(termsDocsArray, term);
        tfidf = tf * idf;
        System.out.println(tfidf); //tfidf of the term in this document
        tfidfvectors[count] = tfidf;
        count++;
        tfidfDocsVector.add(tfidfvectors); //storing document vectors;
    }
}
Now pass your term to the above function as a string, for example tfIdfCalculator("gladiator"), and enjoy.
Hi shresthaMubin,
you have a mistake in your downloadable files. In TfIdf.java, in the function "idfCalculator", a "1+" is missing:
return 1 + Math.log(allTerms.size() / count);
Regards,
Chris
Yes Chris, Thank you. I will correct it soon. Thank you for your valuable comment.
When will you update the code for K-means clustering with cosine similarity as a distance measure?? :) Waiting!!
Hi Rizwan,
I won't be adding the code for K-means clustering. Since the cosine measures are there, it is a straightforward job to implement K-means clustering. You will find a lot of open source projects, or even source code on Stack Overflow or Google, for K-means clustering.
Awesome code. A great big thank you ;-).
In December 2014 someone asked you about modifying getCosineSimilarity to print the file names in "between + [i] + " and " + [j]. When I made allFiles public in parseFiles, I got a lot of underlined code.
I changed:
File[] allfiles = new File(filePath).listFiles();
to
public File[] allfiles = new File(filePath).listFiles();
but received "Illegal start of expression". Can you please help? Thank you.
I managed it by declaring File[] allfiles as a class-level variable under the private variables at the beginning.
Hello Mubin
Would it be possible to modify the code so that it computes the similarity in one pass? For example, say I have 3 documents of type txt and 10 documents of type html, all in one folder, and I want to find the cosine similarity of the first 3 with the rest, without comparing every document with every other. So the iteration would compare the first document with the remaining 12, the second with 12, and the third with 12, and then stop. Any help would be greatly appreciated. Thanks
What should I do about this error?
Exception in thread "main" java.lang.NullPointerException
at com.computergodzilla.tfidf.DocumentParser.parseFiles(DocumentParser.java:36)
at com.computergodzilla.tfidf.TfIdfMain.main(TfIdfMain.java:25)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
How do you perform clustering on the output, and what are the steps for that?
Hey Shrestha Mubin,
This is exactly what I wanted and it worked perfectly. Nice explanation and sample code. Thanks a lot!!!