Friday, January 2, 2015

Calculate Cosine Similarity Using Lucene 4.10.2

***EDIT
A lot of people are facing problems with the code base pasted below. Please download the whole project from HERE. It should work. Cheers.

Someone recently asked me about calculating cosine similarity between documents using Lucene 4.10.2. It had been more than two years since I last used Lucene. There is a great article on computing document similarity, Salmon Run: Computing Document Similarity Using Lucene, implemented against Lucene 3.x. So, to bring myself up to date, this new year I wrote some code to calculate cosine similarity using Lucene 4.10.2. My code base uses the following three test files:
Document Id  File Name      Text
0            Document1.txt  This New Year I am learning how to calculate cosine similarity using Lucene. It will be fun.
1            Document2.txt  Huh!! What you want to learn cosine similarity. Its new year man. Do something crazy and blasting. By the way, what is Cosine Similarity?
2            Document3.txt  Dude, don't under estimate the power of cosine similarity.I can tell you which types of books are there in your computer simply by running my scripts of cosine similarity.
Moving on, to calculate cosine similarity in Lucene you first need some pre-configuration of the Lucene index: terms and their frequencies have to be stored in it. Lucene has built-in support for creating term vectors during indexing; you just have to enable the following options on the fields you create while indexing:
    FieldType fieldType = new FieldType();
    fieldType.setIndexed(true);
    fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    fieldType.setStored(true);
    fieldType.setStoreTermVectors(true);
    fieldType.setTokenized(true);
The lines above make Lucene create term vectors and store every term, with its frequency, in the index.
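To make the idea concrete, here is a plain-Java sketch of what a term vector is, with no Lucene involved (the class and method names are my own): a map from each term of a document to its frequency.

```java
import java.util.HashMap;
import java.util.Map;

public class TermVectorSketch {

    // Lower-cases and splits on non-letter characters: a rough stand-in for
    // what StandardAnalyzer does (the real analyzer also removes stop words
    // and handles Unicode properly).
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        // e.g. {cosine=2, similarity=1, is=1, fun=1} (iteration order may vary)
        System.out.println(termFrequencies("Cosine similarity is fun. Cosine!"));
    }
}
```

Storing exactly this kind of mapping in the index is what the FieldType options above enable.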
Now, let us calculate the cosine similarity between the three documents above using Lucene 4.10.2. The first step is to create the Lucene index. My code follows the same pattern Sujit Pal used in his awesome blog. Before we start creating the Lucene index, make sure you have downloaded the following libraries and added them to your project:
  1. lucene-core-4.10.2.jar
  2. lucene-analyzers-common-4.10.2.jar
  3. commons-math-2.0.jar

Also, note that my program's package name is com.computergodzilla.cosinesimilarity. Make sure you copy and paste the code below into the same package; if not, you will have to create your own package and edit the Java files accordingly.

Step I. Preparing Lucene 4.10.2 Index

I am creating the Lucene index on the hard drive; you can also create it in memory (please refer to the Lucene documentation for details). The code below creates a Lucene index with only one field, which stores all the terms of the documents with their respective frequencies. Details such as the location of the source files and the index directory are configured separately in the Configuration class, given just after Indexer.java below.
// Indexer.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;


/**
 * Class to create a Lucene index from files.
 * Remember that this class only indexes the files directly inside a folder.
 * If there are folders inside the source folder, the files in them
 * will not be indexed.
 *
 * It only indexes text files.
 * @author Mubin Shrestha
 */
public class Indexer {

    private final File sourceDirectory;
    private final File indexDirectory;
    private static String fieldName;

    public Indexer() {
        this.sourceDirectory = new File(Configuration.SOURCE_DIRECTORY_TO_INDEX);
        this.indexDirectory = new File(Configuration.INDEX_DIRECTORY);
        fieldName = Configuration.FIELD_CONTENT;
    }

    /**
     * Method to create the Lucene index.
     * Keep in mind to always index the text value itself when calculating
     * cosine similarity: Lucene has to generate the tokens, terms and their
     * frequencies and store them in the index.
     * @throws CorruptIndexException
     * @throws LockObtainFailedException
     * @throws IOException 
     */
    public void index() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(indexDirectory);
        Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);  // using stop words
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_2, analyzer);

        if (indexDirectory.exists()) {
            // Overwrite the existing index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Create a new index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(dir, iwc);
        for (File f : sourceDirectory.listFiles()) {
            Document doc = new Document();
            FieldType fieldType = new FieldType();
            fieldType.setIndexed(true);
            fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            fieldType.setStored(true);
            fieldType.setStoreTermVectors(true);
            fieldType.setTokenized(true);
            Field contentField = new Field(fieldName, getAllText(f), fieldType);
            doc.add(contentField);
            writer.addDocument(doc);
        }
        writer.close();
    }

    /**
     * Method to get all the text of a text file.
     * Lucene cannot create term vectors and tokens from a Reader field,
     * so you have to index the text value itself.
     * (This would be better off in another class; I am lazy, as you all know.)
     * @param f
     * @return
     * @throws FileNotFoundException
     * @throws IOException 
     */
    public String getAllText(File f) throws FileNotFoundException, IOException {
        StringBuilder textFileContent = new StringBuilder();
        for (String line : Files.readAllLines(Paths.get(f.getAbsolutePath()))) {
            // keep a separator so words on adjacent lines don't merge
            textFileContent.append(line).append(' ');
        }
        return textFileContent.toString();
    }
}
Below is the Configuration Class:
// Configuration.java
package com.computergodzilla.cosinesimilarity;

/**
 * @author Mubin Shrestha
 */
public class Configuration { 
    public static final String SOURCE_DIRECTORY_TO_INDEX = "E:/TEST";
    public static final String INDEX_DIRECTORY = "E:/INDEXDIRECTORY";
    public static final String FIELD_CONTENT = "contents"; // name of the field to index
}

Step II. Preparing IndexReader to read in the Lucene Index

Once the index is created, prepare a class that opens an index reader. You will use the IndexReader class to read the index: to enumerate the terms and their frequencies and to count the total number of documents. Below is the IndexOpener.java class.
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Class to Get the Lucene Index Reader
 * @author Mubin Shrestha
 */
public class IndexOpener {
    
    public static IndexReader GetIndexReader() throws IOException {
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(Configuration.INDEX_DIRECTORY)));
        return indexReader;
    }

    /**
     * Returns the total number of documents in the index
     * @return
     * @throws IOException 
     */
    public static Integer TotalDocumentInIndex() throws IOException
    {
        IndexReader indexReader = GetIndexReader();
        Integer maxDoc = indexReader.maxDoc();
        indexReader.close(); // close the same reader we opened
        return maxDoc;
    }
}

Step III. Getting all the terms indexed in the Lucene Index

This class is needed to generate the document vectors: the number of distinct terms in the index gives the length of the vector that must be created for each document. Below is the AllTerms.java class.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class that will get all the terms in the index.
 * @author Mubin Shrestha
 */
public class AllTerms {
    private Map<String, Integer> allTerms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;
    
    public AllTerms() throws IOException
    {    
        allTerms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
    }
        
    public void initAllTerms() throws IOException
    {
        int pos = 0;
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                allTerms.put(term, pos++);
            }
        }       
        
        // Re-assign positions in iteration order
        pos = 0;
        for(Entry<String,Integer> s : allTerms.entrySet())
        {        
            System.out.println(s.getKey());
            s.setValue(pos++);
        }
    }
    
    public Map<String,Integer> getAllTerms() {
        return allTerms;
    }
}
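Outside of Lucene, the same dictionary-building idea looks like this (a sketch; the names are my own): collect every distinct term across all documents and give each one a fixed position, which becomes that term's index in every document vector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionarySketch {

    // Assigns each distinct term a stable position across all documents.
    public static Map<String, Integer> buildDictionary(String[][] tokenizedDocs) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (String[] doc : tokenizedDocs) {
            for (String term : doc) {
                positions.putIfAbsent(term, positions.size());
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"new", "year", "cosine", "similarity"},
            {"cosine", "similarity", "crazy"}
        };
        // {new=0, year=1, cosine=2, similarity=3, crazy=4}
        System.out.println(buildDictionary(docs));
    }
}
```

AllTerms above does the same job, except that it pulls the terms out of the Lucene term vectors instead of raw token arrays.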

Step IV. Generating Document Vectors

The next step is to create the document vectors of all the documents indexed in the Lucene index. Below is the VectorGenerator.java class.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class to generate Document Vectors from Lucene Index
 * @author Mubin Shrestha
 */
public class VectorGenerator {
    DocVector[] docVector;
    private Map<String, Integer> allterms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;
    
    public VectorGenerator() throws IOException
    {
        allterms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
        docVector = new DocVector[totalNoOfDocumentInIndex];
    }
    
    public void GetAllTerms() throws IOException
    {
        AllTerms allTerms = new AllTerms();
        allTerms.initAllTerms();
        allterms = allTerms.getAllTerms();
    }
    
    public DocVector[] GetDocumentVectors() throws IOException {
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;            
            docVector[docId] = new DocVector(allterms);            
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                int freq = (int) termsEnum.totalTermFreq();
                docVector[docId].setEntry(term, freq);
            }
            docVector[docId].normalize();
        }        
        indexReader.close();
        return docVector;
    }
}
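For intuition, here is what GetDocumentVectors is doing, stripped of the Lucene API (a plain-Java sketch; the names are mine): walk a document's term frequencies and write each frequency at the term's dictionary position.

```java
import java.util.Map;

public class DocVectorFillSketch {

    // dictionary: term -> position across all docs; freqs: this doc's term counts.
    public static double[] toVector(Map<String, Integer> dictionary,
                                    Map<String, Integer> freqs) {
        double[] vector = new double[dictionary.size()];
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            Integer pos = dictionary.get(e.getKey());
            if (pos != null) {
                vector[pos] = e.getValue();
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = Map.of("new", 0, "year", 1, "cosine", 2);
        Map<String, Integer> freqs = Map.of("cosine", 3, "year", 1);
        // [0.0, 1.0, 3.0]
        System.out.println(java.util.Arrays.toString(toVector(dict, freqs)));
    }
}
```

Every document vector has the same length (the size of the dictionary), which is what makes the dot product in the next step meaningful.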

Step V. Create Cosine Similarity Class

The next step is to prepare the class that calculates the cosine similarity. Below is the CosineSimilarity.java class.
package com.computergodzilla.cosinesimilarity;

/**
 * Class to calculate cosine similarity
 * @author Mubin Shrestha
 */
public class CosineSimilarity {    
    public static double CosineSimilarity(DocVector d1,DocVector d2) {
        double cosinesimilarity;
        try {
            cosinesimilarity = (d1.vector.dotProduct(d2.vector))
                    / (d1.vector.getNorm() * d2.vector.getNorm());
        } catch (Exception e) {
            return 0.0;
        }
        return cosinesimilarity;
    }
}
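If you want to check the formula without the commons-math dependency, here is a self-contained sketch (the names are mine) of cosine similarity over sparse term-frequency maps: cos(a, b) = dot(a, b) / (|a| * |b|).

```java
import java.util.Map;

public class CosineSketch {

    // cos(a, b) = dot(a, b) / (|a| * |b|) over sparse term-frequency maps.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other; // only shared terms contribute
            }
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // empty vector: no meaningful angle
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = Map.of("cosine", 1, "similarity", 1);
        Map<String, Integer> d2 = Map.of("cosine", 1, "fun", 1);
        System.out.println(cosine(d1, d1)); // ~1.0 (identical documents)
        System.out.println(cosine(d1, d2)); // ~0.5 (one shared term of two)
    }
}
```

This is exactly what the commons-math call above computes; the Lucene machinery only exists to produce the term frequencies.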

Step VI. Document vector class

package com.computergodzilla.cosinesimilarity;

import java.util.Map;
import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;

/**
 *
 * @author Mubin
 */
public class DocVector {

    public Map<String, Integer> terms;
    public OpenMapRealVector vector;
    
    public DocVector(Map<String, Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());        
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (OpenMapRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}
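One design note on normalize(): it divides the vector by its L1 norm, but since cosine similarity already divides by the vector norms, it is scale-invariant, so this normalization does not change the scores (any positive scaling would do). A quick plain-Java sketch (names mine) demonstrating that:

```java
public class ScaleInvarianceSketch {

    // Plain dense-vector cosine similarity.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static double[] scale(double[] v, double k) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] * k;
        return out;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 0}, b = {0, 1, 1};
        // Dividing a by its L1 norm (here 3) leaves the cosine score unchanged.
        System.out.println(cosine(a, b));
        System.out.println(cosine(scale(a, 1.0 / 3.0), b));
    }
}
```

So normalize() is harmless here, but it is not what makes the cosine score come out between 0 and 1; the division by the norms inside CosineSimilarity does that.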

Step VII. All done, now fire up the program

All done! Now run the program from the main class below.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import org.apache.lucene.store.LockObtainFailedException;

/**
 * Main Class
 * @author Mubin Shrestha
 */
public class Test {
    
    public static void main(String[] args) throws LockObtainFailedException, IOException
    {
        getCosineSimilarity();
    }
    
    public static void getCosineSimilarity() throws LockObtainFailedException, IOException
    {
       Indexer index = new Indexer();
       index.index();
       VectorGenerator vectorGenerator = new VectorGenerator();
       vectorGenerator.GetAllTerms();       
       DocVector[] docVector = vectorGenerator.GetDocumentVectors(); // getting document vectors
       for(int i = 0; i < docVector.length; i++)
       {
           // compare document 0 against every document (including itself)
           double cosineSimilarity = CosineSimilarity.CosineSimilarity(docVector[0], docVector[i]);
           System.out.println("Cosine Similarity Score between document 0 and " + i + " = " + cosineSimilarity);
       }
    }
}
Output

Doc A  Doc B  Cosine Score
0      0      1
0      1      0.346410162
0      2      0.132453236
Fire away, guys, if you have any questions!!

34 comments:

  1. There is a compile error in the AllTerms class:
    for(Entry<String,Integer> s : allTerms.entrySet())
    Type mismatch: can't convert element type Object to Map.Entry

    Please let us know if you fix it.

    Replies
    1. I have the same error. How do I solve it?

    3. Please check the code in the blog against the code you downloaded. There could be a problem with copying the greater-than and less-than signs from the blog.
  2. private Map allTerms = new HashMap();

    Replies
    1. private Map<String, Integer> allTerms = new HashMap<>();
      This should work fine in JDK 7 and later.
  3. All of you who are facing problems, please download the whole project. The link is provided at the top of the blog.
  4. Hi, thanks for the code, but there is an error:
    Exception in thread "main" java.lang.NullPointerException
    at com.computergodzilla.cosinesimilarity.AllTerms.initAllTerms(AllTerms.java:41)
    at com.computergodzilla.cosinesimilarity.VectorGenerator.GetAllTerms(VectorGenerator.java:32)
    at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:23)
    at com.computergodzilla.cosinesimilarity.Test.main(Test.java:15)

  5. Hi, thanks for your code, but there is still a NullPointerException in the AllTerms class.
    Any advice? I used the downloaded code and it is the same as the code here in the blog.
  6. Hi, I found my fault: in the INDEX folder there were some leftover files ending with ~, so it must be cleaned to contain just the target files.
    Thanks
  8. Hi,
    I want to edit this code for more than one field, but I cannot get AllTerms and VectorGenerator right. Please help me do it.
    Thanks
    Replies
    1. I have some documents with two fields, 'title' and 'body'. I want to keep these fields separate when generating the document vectors, so that each section has its own weight in scoring. I could build the two fields, but I cannot generate the vectors correctly.

    2. You simply build a document vector for each field. To do this, you will have to get the total words in that field for the document whose vector you are generating, and also get all the words in that field. Using these two pieces of information you can generate the document vector for each field. You will have to do your own coding. Please let me know if you face any further issues.
  9. Hi, I'm getting errors such as:
    Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.analysis.standard.StandardTokenizer.init(StandardTokenizer.java:144)
    at org.apache.lucene.analysis.standard.StandardTokenizer.(StandardTokenizer.java:132)
    at org.apache.lucene.analysis.standard.StandardAnalyzer.createComponents(StandardAnalyzer.java:111)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
    at org.apache.lucene.document.Field.tokenStream(Field.java:552)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:103)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1204)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1185)
    at com.computergodzilla.cosinesimilarity.Indexer.index(Indexer.java:67)
    at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:20)
    at com.computergodzilla.cosinesimilarity.Test.main(Test.java:14)

    What seems to be the issue? Any idea?

  11. Could you please send me your project files.

  12. Hi, I was able to run the code. The error was in this line:

    Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);

    in the index() method. The code was throwing a NullPointerException; I made the following change and it ran:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47, StandardAnalyzer.STOP_WORDS_SET);

  14. How do I implement the pre-configuration part in the Lucene index?

    Replies
    1. There is no separate pre-configuration step. All you have to do is create the index with term vectors enabled, and the code above will use that index as input.
  15. Hi,
    I want to multiply my DocVector by an array of numbers, but I cannot use mapMultiply.
    How can I calculate this multiplication?
  16. Hi, thank you very much for the nice information!

    How can I get the cosine similarity between my query and all the other documents?
    Replies
    1. Hello Sagar,

      If you want to do that, wrap your query in a file, index it in Lucene, treat it as a Lucene document, and calculate the cosine similarity between that file and the other documents. How cool is that :)
  17. First off: thanks for the great work! That's a nice example for someone who didn't know much about Lucene in the first place - like me.

    I do have a question tho: I'd like to modify the code a bit, as I do not want to calculate the cosine similarity between two documents but to create a term-document matrix using tf-idf (and to extract the highest rated terms for each document).

    Do I understand correctly that term frequency is implemented (in VectorGenerator.GetDocumentVectors()) but inverse document frequency isn't?
    I'd like to add that, but how can I access only the terms which occur in a specific document? It seems like the Maps in all docVectors contain all terms which occur in any of the documents, so I can't use that.

    Thanks!

    Replies
    1. Hi Jacob. Thank you for the awesome compliment.
      And sorry for replying late. Below is sample pseudocode for calculating inverse document frequency:

      /**
      * Calculates the idf of the term termToCheck
      * @param allTerms : the terms of all the documents
      * @param termToCheck
      * @return idf (inverse document frequency) score
      */
      public double idfCalculator(List<String[]> allTerms, String termToCheck) {
          double count = 0;
          for (String[] ss : allTerms) {
              for (String s : ss) {
                  if (s.equalsIgnoreCase(termToCheck)) {
                      count++;
                      break;
                  }
              }
          }
          return 1 + Math.log(allTerms.size() / count);
      }

      Please refer here:
      http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html?showComment=1449327050417#c553401772374378224
      for more detail.

      Thank you.
  18. Hi, thanks for uploading the code. I have one difficulty: when I imported the project mentioned in the link, I got errors at all the places where vector.iterator(termsEnum) is used. Can you please help me with that? I am using the latest version of Lucene (5.3.1).

    Replies
    1. Hello,

      I am off Lucene now, so I am as much in the dark as you are. :) I suggest you walk through the Terms and TermsEnum documentation of Lucene 5.3.1.
  19. I got these values for your three documents. The third one disagrees with your results; I am wondering why.
    Cosine similarity 00 = 1.0
    Cosine similarity 01 = 0.3464101615137754
    Cosine similarity 02 = 0.18057877962865382

  20. Thank you for the code! It's of great help to my project

  21. Hi,
    Thanks for your explanation.

    I have a doubt: how do I get the file name for each doc vector?

    Thanks a lot.
