Get real time news update from your favorite websites.
Don't miss any news about your favorite topic.
Personalize your app.
Check out NTyles.
***EDIT
A lot of people have been facing problems with the code base pasted below. Please download the whole project from HERE; it should work. Cheers.
Someone recently asked me about calculating cosine similarity between documents using Lucene 4.10.2. It had been more than two years since I last used Lucene. There is a great article on computing document similarity, Salmon Run: Computing Document Similarity Using Lucene, implemented in Lucene version 3.x. To bring myself up to date, this new year I wrote some code of my own to calculate cosine similarity using Lucene 4.10.2. My three test files are as follows:
| Document Id | File Name | Text |
|---|---|---|
| 0 | Document1.txt | This New Year I am learning how to calculate cosine similarity using Lucene. It will be fun. |
| 1 | Document2.txt | Huh!! What you want to learn cosine similarity. Its new year man. Do something crazy and blasting. By the way, what is Cosine Similarity? |
| 2 | Document3.txt | Dude, don't under estimate the power of cosine similarity. I can tell you which types of books are there in your computer simply by running my scripts of cosine similarity. |
Moving on, I assume you know the following:
- Cosine Similarity
- TF-IDF
- Basics of Lucene
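Before diving into Lucene, it may help to see the bare computation. Below is a minimal, self-contained sketch of cosine similarity over bag-of-words term-frequency maps; the class and method names are my own illustration and are not part of the Lucene code that follows.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: cosine similarity between two bag-of-words
// term-frequency maps. Not part of the Lucene code below.
public class CosineSketch {

    // cos(a, b) = (a . b) / (|a| * |b|)
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean (L2) norm of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("cosine", 1); d1.put("similarity", 1); d1.put("lucene", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("cosine", 2); d2.put("similarity", 2);
        System.out.println(cosine(d1, d1)); // identical documents score 1.0
        System.out.println(cosine(d1, d2));
    }
}
```

The Lucene code below does the same thing, except that the term-frequency vectors come from term vectors stored in the index.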
For calculating cosine similarity in Lucene, you first need some pre-configuration of the Lucene index: you have to store the terms and their frequencies in it. Lucene has built-in support for creating term vectors during indexing. Enable the following while creating the index fields:

```java
FieldType fieldType = new FieldType();
fieldType.setIndexed(true);
fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
```

The lines above create term vectors for each term and store them in the Lucene index.
Now, let us calculate the cosine similarity between the above three documents using Lucene 4.10.2. The first step is to create the Lucene index. For my code I followed the same pattern as Sujit Pal did in his awesome blog. Before we start creating the Lucene index, make sure you have downloaded the following libraries and added them to your project.
Also, note that my program's package name is com.computergodzilla.cosinesimilarity. Make sure you copy and paste the code below into that package; otherwise, create your own package and edit the Java files accordingly.
Step I. Preparing Lucene 4.10.2 Index
I am creating the Lucene index on the hard drive. You can also create it in memory; please refer to the Lucene documentation for details. The following code creates a Lucene index with only one field, which stores all the terms of the documents with their respective frequencies. All the settings, such as the location of the source files and the index directory, are configured separately in the Configuration class, given just after Indexer.java below.
```java
// Indexer.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

/**
 * Class to create a Lucene index from files.
 * Remember: this class only indexes files directly inside the source
 * folder; files in nested folders are not indexed. It only indexes
 * text files.
 * @author Mubin Shrestha
 */
public class Indexer {

    private final File sourceDirectory;
    private final File indexDirectory;
    private static String fieldName;

    public Indexer() {
        this.sourceDirectory = new File(Configuration.SOURCE_DIRECTORY_TO_INDEX);
        this.indexDirectory = new File(Configuration.INDEX_DIRECTORY);
        fieldName = Configuration.FIELD_CONTENT;
    }

    /**
     * Method to create the Lucene index.
     * Keep in mind that you must always index text values in Lucene when
     * calculating cosine similarity: generate tokens, terms and their
     * frequencies and store them in the Lucene index.
     * @throws CorruptIndexException
     * @throws LockObtainFailedException
     * @throws IOException
     */
    public void index() throws CorruptIndexException, LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(indexDirectory);
        Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_2, analyzer);
        if (indexDirectory.exists()) {
            // Overwrite the existing index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Create a new index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter writer = new IndexWriter(dir, iwc);
        for (File f : sourceDirectory.listFiles()) {
            Document doc = new Document();
            FieldType fieldType = new FieldType();
            fieldType.setIndexed(true);
            fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            fieldType.setStored(true);
            fieldType.setStoreTermVectors(true);
            fieldType.setTokenized(true);
            Field contentField = new Field(fieldName, getAllText(f), fieldType);
            doc.add(contentField);
            writer.addDocument(doc);
        }
        writer.close();
    }

    /**
     * Method to get all the text of a text file.
     * Lucene cannot create term vectors and tokens from a Reader,
     * so you have to index the text values directly.
     * It would be better if this were in another class, but I am lazy,
     * as you all know.
     * @param f
     * @return
     * @throws FileNotFoundException
     * @throws IOException
     */
    public String getAllText(File f) throws FileNotFoundException, IOException {
        String textFileContent = "";
        for (String line : Files.readAllLines(Paths.get(f.getAbsolutePath()))) {
            textFileContent += line;
        }
        return textFileContent;
    }
}
```
Below is the Configuration Class:
```java
// Configuration.java
package com.computergodzilla.cosinesimilarity;

/**
 * @author Mubin Shrestha
 */
public class Configuration {
    public static final String SOURCE_DIRECTORY_TO_INDEX = "E:/TEST";
    public static final String INDEX_DIRECTORY = "E:/INDEXDIRECTORY";
    public static final String FIELD_CONTENT = "contents"; // name of the field to index
}
```
Step II. Preparing IndexReader to read in the Lucene Index
Once the index is created, prepare an index reader. The IndexReader class is used to read the index: to enumerate terms and their frequencies, and to count the total number of documents. Below is the IndexOpener.java class.
```java
// IndexOpener.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Class to get the Lucene index reader
 * @author Mubin Shrestha
 */
public class IndexOpener {

    public static IndexReader GetIndexReader() throws IOException {
        return DirectoryReader.open(FSDirectory.open(new File(Configuration.INDEX_DIRECTORY)));
    }

    /**
     * Returns the total number of documents in the index
     * @return
     * @throws IOException
     */
    public static Integer TotalDocumentInIndex() throws IOException {
        IndexReader indexReader = GetIndexReader();
        Integer maxDoc = indexReader.maxDoc();
        indexReader.close();
        return maxDoc;
    }
}
```
Step III. Getting all the terms indexed in the Lucene Index
This class is needed to generate the document vectors: the set of all terms in the index determines the length of the vector created for each document. Below is the AllTerms.java class.
```java
// AllTerms.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class that gets all the terms in the index.
 * @author Mubin Shrestha
 */
public class AllTerms {

    private Map<String, Integer> allTerms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;

    public AllTerms() throws IOException {
        allTerms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
    }

    public void initAllTerms() throws IOException {
        int pos = 0;
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            // getTermVector returns null if the document has no stored
            // term vector for this field.
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                allTerms.put(term, pos++);
            }
        }
        // Update positions
        pos = 0;
        for (Entry<String, Integer> s : allTerms.entrySet()) {
            System.out.println(s.getKey());
            s.setValue(pos++);
        }
    }

    public Map<String, Integer> getAllTerms() {
        return allTerms;
    }
}
```
Step IV. Generating Document Vectors
The next step is to create document vectors of all the documents indexed in the Lucene Index. Below is the VectorGenerator.java Class.
```java
// VectorGenerator.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class to generate document vectors from the Lucene index
 * @author Mubin Shrestha
 */
public class VectorGenerator {

    DocVector[] docVector;
    private Map<String, Integer> allterms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;

    public VectorGenerator() throws IOException {
        allterms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
        docVector = new DocVector[totalNoOfDocumentInIndex];
    }

    public void GetAllTerms() throws IOException {
        AllTerms allTerms = new AllTerms();
        allTerms.initAllTerms();
        allterms = allTerms.getAllTerms();
    }

    public DocVector[] GetDocumentVectors() throws IOException {
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            docVector[docId] = new DocVector(allterms);
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                int freq = (int) termsEnum.totalTermFreq();
                docVector[docId].setEntry(term, freq);
            }
            docVector[docId].normalize();
        }
        indexReader.close();
        return docVector;
    }
}
```
Step V. Create Cosine Similarity Class
The next step is to prepare the class that calculates cosine similarity. Below is the CosineSimilarity.java class.
```java
// CosineSimilarity.java
package com.computergodzilla.cosinesimilarity;

/**
 * Class to calculate cosine similarity
 * @author Mubin Shrestha
 */
public class CosineSimilarity {

    public static double CosineSimilarity(DocVector d1, DocVector d2) {
        double cosinesimilarity;
        try {
            cosinesimilarity = (d1.vector.dotProduct(d2.vector))
                    / (d1.vector.getNorm() * d2.vector.getNorm());
        } catch (Exception e) {
            return 0.0;
        }
        return cosinesimilarity;
    }
}
```
Step VI. Document vector class
```java
// DocVector.java
package com.computergodzilla.cosinesimilarity;

import java.util.Map;
import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;

/**
 * @author Mubin Shrestha
 */
public class DocVector {

    public Map<String, Integer> terms;
    public OpenMapRealVector vector;

    public DocVector(Map<String, Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (OpenMapRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}
```
Step VIII. All done, now fire up the program
All done; now run the program from the main class below.

```java
// Test.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import org.apache.lucene.store.LockObtainFailedException;

/**
 * Main class
 * @author Mubin Shrestha
 */
public class Test {

    public static void main(String[] args) throws LockObtainFailedException, IOException {
        getCosineSimilarity();
    }

    public static void getCosineSimilarity() throws LockObtainFailedException, IOException {
        Indexer index = new Indexer();
        index.index();
        VectorGenerator vectorGenerator = new VectorGenerator();
        vectorGenerator.GetAllTerms();
        DocVector[] docVector = vectorGenerator.GetDocumentVectors(); // getting document vectors
        for (int i = 0; i < docVector.length; i++) {
            double cosineSimilarity = CosineSimilarity.CosineSimilarity(docVector[0], docVector[i]);
            System.out.println("Cosine Similarity Score between document 0 and " + i + " = " + cosineSimilarity);
        }
    }
}
```

Output:
| Doc 1 | Doc 2 | Cosine Score |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 0.346410162 |
| 0 | 2 | 0.132453236 |
Please check out my first Android app, NTyles:
There is a compile error in the AllTerms class:
for (Entry s : allTerms.entrySet())
Type mismatch: cannot convert element type Object to Map.Entry
Please let us know if you fix it.
I have the same error. How can I solve it?
Please check the code in the blog against the code you downloaded. There could be a problem with copying the greater-than and less-than signs from the blog.
private Map allTerms = new HashMap();
should be:
private Map<String, Integer> allTerms = new HashMap<>();
This should work fine in JDK 1.8 and above.
All of you who are facing the problem, please download the whole project. The link is provided at the top of the blog.
Hi, thanks for the code, but there is an error like:
Exception in thread "main" java.lang.NullPointerException
at com.computergodzilla.cosinesimilarity.AllTerms.initAllTerms(AllTerms.java:41)
at com.computergodzilla.cosinesimilarity.VectorGenerator.GetAllTerms(VectorGenerator.java:32)
at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:23)
at com.computergodzilla.cosinesimilarity.Test.main(Test.java:15)
Hi, thanks for your code, but there is still a NullPointerException problem in the AllTerms class. Any advice? I used the downloaded code and it is the same as the code here in the blog.
Hi, I found my fault: in the INDEX folder there were some opened files ending with ~, so it must be cleaned so that it contains just the target files.
Thanks!
Hi,
I want to edit this code to handle more than one field, but I cannot write AllTerms and VectorGenerator correctly. Please help me do it.
Thanks
Could you please elaborate?
I have some documents with two fields, 'title' and 'body'. I want to treat these fields separately when generating the document vectors, so that each section has its own weight in scoring a document. I can build the two fields, but I cannot generate the vectors correctly.
You simply build a document vector for each field. To do this, you first have to get the total words in that field for the document whose vector you are generating, and second, get all the words that occur in that field across documents. Using these two pieces of information you can generate the document vector for each field. You will have to do your own coding. Please let me know if you face any further issues.
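To make the reply above concrete, here is a plain-Java sketch of the per-field idea. The class, field names, and weights are made up for illustration; in the real code the per-field term frequencies would come from term vectors stored per field in the Lucene index.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-field scoring: compute cosine similarity
// separately for a "title" field and a "body" field, then combine the
// two scores with (made-up) field weights.
public class FieldWeightedSimilarity {

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int f : b.values()) {
            normB += (double) f * f;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Weighted combination of the per-field cosine scores.
    public static double combined(Map<String, Integer> titleA, Map<String, Integer> titleB,
                                  Map<String, Integer> bodyA, Map<String, Integer> bodyB,
                                  double titleWeight, double bodyWeight) {
        return titleWeight * cosine(titleA, titleB) + bodyWeight * cosine(bodyA, bodyB);
    }

    public static void main(String[] args) {
        Map<String, Integer> title = new HashMap<>();
        title.put("cosine", 1); title.put("similarity", 1);
        Map<String, Integer> body = new HashMap<>();
        body.put("lucene", 2); body.put("index", 1);
        // A document compared with itself scores 1.0 in both fields.
        System.out.println(combined(title, title, body, body, 0.7, 0.3));
    }
}
```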
DeleteHi, Im getting errors such as:
ReplyDeleteException in thread "main" java.lang.NullPointerException
at org.apache.lucene.analysis.standard.StandardTokenizer.init(StandardTokenizer.java:144)
at org.apache.lucene.analysis.standard.StandardTokenizer.(StandardTokenizer.java:132)
at org.apache.lucene.analysis.standard.StandardAnalyzer.createComponents(StandardAnalyzer.java:111)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
at org.apache.lucene.document.Field.tokenStream(Field.java:552)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:103)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1204)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1185)
at com.computergodzilla.cosinesimilarity.Indexer.index(Indexer.java:67)
at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:20)
at com.computergodzilla.cosinesimilarity.Test.main(Test.java:14)
What seems to be the issue? Any idea?
Could you please send me your project files.
Hi, I was able to run the code. The error was in this line:
Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);
in the index() method. The code was giving a NullPointerException, and I made the following change and it ran:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47, StandardAnalyzer.STOP_WORDS_SET);
Awesome!! Enjoy Lucene.
How do I implement the pre-configuration part in the Lucene index?
There is no pre-configuration part. All you have to do is create the index with term vectors enabled, and the code above will use that index as input.
Okay, got it. Thanks :)
Hi,
I want to multiply my DocVector by an array of numbers, but I cannot use mapMultiply. How can I calculate this multiplication?
Hi, thank you very much for the nice information!
How can I get the cosine similarity between my query and all the other documents?
Hello Sagar,
If you want to do that, wrap your query in a file, index it in Lucene, treat it as a Lucene document, and calculate the cosine similarity between that file and the other documents. How cool is that :)
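As a rough illustration of that idea outside Lucene: build a term-frequency map for the query text and score it against each document's term-frequency map. The class name is my own, and the crude regex tokenizer below merely stands in for Lucene's StandardAnalyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "treat the query as a document": tokenize the
// query into a term-frequency map and score it against each document's
// term-frequency map with cosine similarity. In the real setup you
// would index the query file in Lucene next to the documents.
public class QuerySimilarity {

    // Crude tokenizer standing in for Lucene's StandardAnalyzer.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int f : b.values()) {
            normB += (double) f * f;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("cosine similarity");
        Map<String, Integer> doc1 = termFrequencies("This New Year I am learning how to calculate cosine similarity using Lucene.");
        Map<String, Integer> doc2 = termFrequencies("Personalize your app.");
        System.out.println("query vs doc1 = " + cosine(query, doc1));
        System.out.println("query vs doc2 = " + cosine(query, doc2)); // no shared terms, so 0.0
    }
}
```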
First off: thanks for the great work! That's a nice example for someone who didn't know much about Lucene in the first place, like me.
I do have a question though: I'd like to modify the code a bit, as I do not want to calculate the cosine similarity between two documents but to create a term-document matrix using tf-idf (and to extract the highest-rated terms for each document).
Do I understand correctly that term frequency is implemented (in VectorGenerator.GetDocumentVectors()) but inverse document frequency isn't?
I'd like to add that, but how can I access only the terms which occur in a specific document? It seems like the maps in all the docVectors contain all terms which occur in any of the documents, so I can't use those.
Thanks!
Hi Jacob, thank you for the awesome compliment.
Sorry for replying late. Below is sample pseudocode for calculating inverse document frequency:
```java
/**
 * Calculates the idf of the term termToCheck
 * @param allTerms : all the terms of all the documents
 * @param termToCheck
 * @return idf (inverse document frequency) score
 */
public double idfCalculator(List<String[]> allTerms, String termToCheck) {
    double count = 0;
    for (String[] ss : allTerms) {
        for (String s : ss) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
                break;
            }
        }
    }
    return 1 + Math.log(allTerms.size() / count);
}
```
Please refer here:
http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html?showComment=1449327050417#c553401772374378224
for more detail.
Thank you.
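For anyone who wants to try the idf pseudocode above directly, here is a runnable stand-alone version; the class name and sample documents are mine, not part of the blog's project.

```java
import java.util.Arrays;
import java.util.List;

// Runnable version of the idf pseudocode: counts how many documents
// contain the term and returns 1 + ln(N / df). It assumes the term
// occurs in at least one document (otherwise df is zero).
public class IdfDemo {

    public static double idf(List<String[]> allDocs, String termToCheck) {
        double count = 0;
        for (String[] docTerms : allDocs) {
            for (String term : docTerms) {
                if (term.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        return 1 + Math.log(allDocs.size() / count);
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[]{"cosine", "similarity", "lucene"},
                new String[]{"cosine", "similarity"},
                new String[]{"lucene", "index"});
        System.out.println(idf(docs, "lucene")); // appears in 2 of 3 docs
    }
}
```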
Hi, thanks for uploading the code. I have one difficulty: when I imported the project mentioned in the link, I got errors everywhere vector.iterator(termsEnum) is used. Can you please help me with that? I am using the latest version of Lucene (5.3.1).
Hello,
I am off Lucene now, so I am as dumb as you are :). I suggest you walk through the Terms and TermsEnum documentation of Lucene 5.3.1.
I got these values for your three documents. The third one disagrees with your results; I am wondering why.
Cosine similarity 00 = 1.0
Cosine similarity 01 = 0.3464101615137754
Cosine similarity 02 = 0.18057877962865382
Thank you for the code! It's of great help to my project.
Hi,
Thanks for your explanation.
I have a doubt: how do I get the file name for each doc vector?
Thanks a lot.