Get real time news update from your favorite websites.
Don't miss any news about your favorite topic.
Personalize your app.
Check out NTyles.
***EDIT
A lot of people have been facing problems with the code base pasted below. Please download the whole project from HERE; it should work. Cheers.
Someone recently asked me about calculating cosine similarity between documents using Lucene 4.10.2. It had been more than two years since I last used Lucene. There is a great article on computing document similarity, Salmon Run: Computing Document Similarity Using Lucene, implemented in Lucene version 3.x. To bring myself up to date, this new year I wrote some code of my own to calculate cosine similarity using Lucene 4.10.2. My three test files are as follows:
| Document Id | File Name | Text |
|---|---|---|
| 0 | Document1.txt | This New Year I am learning how to calculate cosine similarity using Lucene. It will be fun. |
| 1 | Document2.txt | Huh!! What you want to learn cosine similarity. Its new year man. Do something crazy and blasting. By the way, what is Cosine Similarity? |
| 2 | Document3.txt | Dude, don't under estimate the power of cosine similarity. I can tell you which types of books are there in your computer simply by running my scripts of cosine similarity. |
Moving on, I assume you know the following:
- Cosine Similarity
- TF-IDF
- Basics of Lucene
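Before diving into Lucene, it may help to see the bare computation. Below is a minimal, self-contained sketch of cosine similarity over bag-of-words term-frequency maps; the class and method names are my own illustration and are not part of the Lucene code that follows.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: cosine similarity between two bag-of-words
// term-frequency maps. Not part of the Lucene code below.
public class CosineSketch {

    // cos(a, b) = (a . b) / (|a| * |b|)
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean (L2) norm of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("cosine", 1); d1.put("similarity", 1); d1.put("lucene", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("cosine", 2); d2.put("similarity", 2);
        System.out.println(cosine(d1, d1)); // identical documents score 1.0
        System.out.println(cosine(d1, d2));
    }
}
```

The Lucene code below does the same thing, except that the term-frequency vectors come from term vectors stored in the index.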
For calculating cosine similarity in Lucene, you first need some pre-configuration of the Lucene index: you have to store the terms and their frequencies in it. Lucene has built-in support for creating term vectors during indexing. Enable the following while creating the index fields:

```java
FieldType fieldType = new FieldType();
fieldType.setIndexed(true);
fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
```

The lines above create term vectors for each term and store them in the Lucene index.
Now, let us calculate the cosine similarity between the above three documents using Lucene 4.10.2. The first step is to create the Lucene index. For my code I followed the same pattern as Sujit Pal did in his awesome blog. Before we start creating the Lucene index, make sure you have downloaded the following libraries and added them to your project.
Also, note that my program's package name is com.computergodzilla.cosinesimilarity. Make sure you copy and paste the code below into that package; otherwise, create your own package and edit the Java files accordingly.
Step I. Preparing Lucene 4.10.2 Index
I am creating the Lucene index on the hard drive. You can also create it in memory; please refer to the Lucene documentation for details. The following code creates a Lucene index with only one field, which stores all the terms of the documents with their respective frequencies. All the settings, such as the location of the source files and the index directory, are configured separately in the Configuration class, given just after Indexer.java below.
```java
// Indexer.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

/**
 * Class to create a Lucene index from files.
 * Remember: this class only indexes files directly inside the source
 * folder; files in nested folders are not indexed. It only indexes
 * text files.
 * @author Mubin Shrestha
 */
public class Indexer {

    private final File sourceDirectory;
    private final File indexDirectory;
    private static String fieldName;

    public Indexer() {
        this.sourceDirectory = new File(Configuration.SOURCE_DIRECTORY_TO_INDEX);
        this.indexDirectory = new File(Configuration.INDEX_DIRECTORY);
        fieldName = Configuration.FIELD_CONTENT;
    }

    /**
     * Method to create the Lucene index.
     * Keep in mind that you must always index text values in Lucene when
     * calculating cosine similarity: generate tokens, terms and their
     * frequencies and store them in the Lucene index.
     * @throws CorruptIndexException
     * @throws LockObtainFailedException
     * @throws IOException
     */
    public void index() throws CorruptIndexException, LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(indexDirectory);
        Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET); // using stop words
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_2, analyzer);
        if (indexDirectory.exists()) {
            // Overwrite the existing index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Create a new index:
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter writer = new IndexWriter(dir, iwc);
        for (File f : sourceDirectory.listFiles()) {
            Document doc = new Document();
            FieldType fieldType = new FieldType();
            fieldType.setIndexed(true);
            fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            fieldType.setStored(true);
            fieldType.setStoreTermVectors(true);
            fieldType.setTokenized(true);
            Field contentField = new Field(fieldName, getAllText(f), fieldType);
            doc.add(contentField);
            writer.addDocument(doc);
        }
        writer.close();
    }

    /**
     * Method to get all the text of a text file.
     * Lucene cannot create term vectors and tokens from a Reader,
     * so you have to index the text values directly.
     * It would be better if this were in another class, but I am lazy,
     * as you all know.
     * @param f
     * @return
     * @throws FileNotFoundException
     * @throws IOException
     */
    public String getAllText(File f) throws FileNotFoundException, IOException {
        String textFileContent = "";
        for (String line : Files.readAllLines(Paths.get(f.getAbsolutePath()))) {
            textFileContent += line;
        }
        return textFileContent;
    }
}
```
Below is the Configuration Class:
```java
// Configuration.java
package com.computergodzilla.cosinesimilarity;

/**
 * @author Mubin Shrestha
 */
public class Configuration {
    public static final String SOURCE_DIRECTORY_TO_INDEX = "E:/TEST";
    public static final String INDEX_DIRECTORY = "E:/INDEXDIRECTORY";
    public static final String FIELD_CONTENT = "contents"; // name of the field to index
}
```
Step II. Preparing IndexReader to read in the Lucene Index
Once the index is created, prepare an index reader. The IndexReader class is used to read the index: to enumerate terms and their frequencies, and to count the total number of documents. Below is the IndexOpener.java class.
```java
// IndexOpener.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Class to get the Lucene index reader
 * @author Mubin Shrestha
 */
public class IndexOpener {

    public static IndexReader GetIndexReader() throws IOException {
        return DirectoryReader.open(FSDirectory.open(new File(Configuration.INDEX_DIRECTORY)));
    }

    /**
     * Returns the total number of documents in the index
     * @return
     * @throws IOException
     */
    public static Integer TotalDocumentInIndex() throws IOException {
        IndexReader indexReader = GetIndexReader();
        Integer maxDoc = indexReader.maxDoc();
        indexReader.close();
        return maxDoc;
    }
}
```
Step III. Getting all the terms indexed in the Lucene Index
This class is needed to generate the document vectors: the set of all terms in the index determines the length of the vector created for each document. Below is the AllTerms.java class.
```java
// AllTerms.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class that gets all the terms in the index.
 * @author Mubin Shrestha
 */
public class AllTerms {

    private Map<String, Integer> allTerms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;

    public AllTerms() throws IOException {
        allTerms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
    }

    public void initAllTerms() throws IOException {
        int pos = 0;
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            // getTermVector returns null if the document has no stored
            // term vector for this field.
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                allTerms.put(term, pos++);
            }
        }
        // Update positions
        pos = 0;
        for (Entry<String, Integer> s : allTerms.entrySet()) {
            System.out.println(s.getKey());
            s.setValue(pos++);
        }
    }

    public Map<String, Integer> getAllTerms() {
        return allTerms;
    }
}
```
Step IV. Generating Document Vectors
The next step is to create document vectors of all the documents indexed in the Lucene Index. Below is the VectorGenerator.java Class.
```java
// VectorGenerator.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class to generate document vectors from the Lucene index
 * @author Mubin Shrestha
 */
public class VectorGenerator {

    DocVector[] docVector;
    private Map<String, Integer> allterms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;

    public VectorGenerator() throws IOException {
        allterms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
        docVector = new DocVector[totalNoOfDocumentInIndex];
    }

    public void GetAllTerms() throws IOException {
        AllTerms allTerms = new AllTerms();
        allTerms.initAllTerms();
        allterms = allTerms.getAllTerms();
    }

    public DocVector[] GetDocumentVectors() throws IOException {
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            docVector[docId] = new DocVector(allterms);
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                int freq = (int) termsEnum.totalTermFreq();
                docVector[docId].setEntry(term, freq);
            }
            docVector[docId].normalize();
        }
        indexReader.close();
        return docVector;
    }
}
```
Step V. Create Cosine Similarity Class
The next step is to prepare the class that calculates cosine similarity. Below is the CosineSimilarity.java class.
```java
// CosineSimilarity.java
package com.computergodzilla.cosinesimilarity;

/**
 * Class to calculate cosine similarity
 * @author Mubin Shrestha
 */
public class CosineSimilarity {

    public static double CosineSimilarity(DocVector d1, DocVector d2) {
        double cosinesimilarity;
        try {
            cosinesimilarity = (d1.vector.dotProduct(d2.vector))
                    / (d1.vector.getNorm() * d2.vector.getNorm());
        } catch (Exception e) {
            return 0.0;
        }
        return cosinesimilarity;
    }
}
```
Step VI. Document vector class
```java
// DocVector.java
package com.computergodzilla.cosinesimilarity;

import java.util.Map;
import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;

/**
 * @author Mubin Shrestha
 */
public class DocVector {

    public Map<String, Integer> terms;
    public OpenMapRealVector vector;

    public DocVector(Map<String, Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (OpenMapRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}
```
Step VIII. All done, now fire up the program
All done; now run the program from the main class below.

```java
// Test.java
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import org.apache.lucene.store.LockObtainFailedException;

/**
 * Main class
 * @author Mubin Shrestha
 */
public class Test {

    public static void main(String[] args) throws LockObtainFailedException, IOException {
        getCosineSimilarity();
    }

    public static void getCosineSimilarity() throws LockObtainFailedException, IOException {
        Indexer index = new Indexer();
        index.index();
        VectorGenerator vectorGenerator = new VectorGenerator();
        vectorGenerator.GetAllTerms();
        DocVector[] docVector = vectorGenerator.GetDocumentVectors(); // getting document vectors
        for (int i = 0; i < docVector.length; i++) {
            double cosineSimilarity = CosineSimilarity.CosineSimilarity(docVector[0], docVector[i]);
            System.out.println("Cosine Similarity Score between document 0 and " + i + " = " + cosineSimilarity);
        }
    }
}
```

Output:
| Doc 1 | Doc 2 | Cosine Score |
|---|---|---|
| 0 | 0 | 1 |
| 0 | 1 | 0.346410162 |
| 0 | 2 | 0.132453236 |
Please check out my first Android app, NTyles:
There is a compile error in the AllTerms class:
for (Entry s : allTerms.entrySet())
Type mismatch: cannot convert element type Object to Map.Entry
Please let us know if you fix it.
I have the same error. How can I solve it?
Please check the code in the blog against the code you downloaded. There could be a problem with copying the greater-than and less-than signs from the blog.
private Map allTerms = new HashMap();
should be:
private Map<String, Integer> allTerms = new HashMap<>();
This should work fine in JDK 1.8 and above.
All of you who are facing the problem, please download the whole project. The link is provided at the top of the blog.
Hi, thanks for the code, but there is an error like:
Exception in thread "main" java.lang.NullPointerException
at com.computergodzilla.cosinesimilarity.AllTerms.initAllTerms(AllTerms.java:41)
at com.computergodzilla.cosinesimilarity.VectorGenerator.GetAllTerms(VectorGenerator.java:32)
at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:23)
at com.computergodzilla.cosinesimilarity.Test.main(Test.java:15)
Hi, thanks for your code, but there is still a NullPointerException problem in the AllTerms class. Any advice? I used the downloaded code and it is the same as the code here in the blog.
Hi, I found my fault: in the INDEX folder there were some opened files ending with ~, so it must be cleaned so that it contains just the target files.
Thanks!
Hi,
I want to edit this code to handle more than one field, but I cannot write AllTerms and VectorGenerator correctly. Please help me do it.
Thanks
Could you please elaborate?
I have some documents with two fields, 'title' and 'body'. I want to treat these fields separately when generating the document vectors, so that each section has its own weight in scoring a document. I can build the two fields, but I cannot generate the vectors correctly.
You simply build a document vector for each field. To do this, you first have to get the total words in that field for the document whose vector you are generating, and second, get all the words that occur in that field across documents. Using these two pieces of information you can generate the document vector for each field. You will have to do your own coding. Please let me know if you face any further issues.
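To make the reply above concrete, here is a plain-Java sketch of the per-field idea. The class, field names, and weights are made up for illustration; in the real code the per-field term frequencies would come from term vectors stored per field in the Lucene index.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-field scoring: compute cosine similarity
// separately for a "title" field and a "body" field, then combine the
// two scores with (made-up) field weights.
public class FieldWeightedSimilarity {

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int f : b.values()) {
            normB += (double) f * f;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Weighted combination of the per-field cosine scores.
    public static double combined(Map<String, Integer> titleA, Map<String, Integer> titleB,
                                  Map<String, Integer> bodyA, Map<String, Integer> bodyB,
                                  double titleWeight, double bodyWeight) {
        return titleWeight * cosine(titleA, titleB) + bodyWeight * cosine(bodyA, bodyB);
    }

    public static void main(String[] args) {
        Map<String, Integer> title = new HashMap<>();
        title.put("cosine", 1); title.put("similarity", 1);
        Map<String, Integer> body = new HashMap<>();
        body.put("lucene", 2); body.put("index", 1);
        // A document compared with itself scores 1.0 in both fields.
        System.out.println(combined(title, title, body, body, 0.7, 0.3));
    }
}
```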
DeleteHi, Im getting errors such as:
ReplyDeleteException in thread "main" java.lang.NullPointerException
at org.apache.lucene.analysis.standard.StandardTokenizer.init(StandardTokenizer.java:144)
at org.apache.lucene.analysis.standard.StandardTokenizer.(StandardTokenizer.java:132)
at org.apache.lucene.analysis.standard.StandardAnalyzer.createComponents(StandardAnalyzer.java:111)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
at org.apache.lucene.document.Field.tokenStream(Field.java:552)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:103)
at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:455)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1204)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1185)
at com.computergodzilla.cosinesimilarity.Indexer.index(Indexer.java:67)
at com.computergodzilla.cosinesimilarity.Test.getCosineSimilarity(Test.java:20)
at com.computergodzilla.cosinesimilarity.Test.main(Test.java:14)
What seems to be the issue? Any idea?
Could you please send me your project files.
Hi, I was able to run the code. The error was in this line:
Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);
in the index() method. The code was giving a NullPointerException, and I made the following change and it ran:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47, StandardAnalyzer.STOP_WORDS_SET);
Awesome!! Enjoy Lucene.
How do I implement the pre-configuration part in the Lucene index?
There is no pre-configuration part. All you have to do is create the index with term vectors enabled, and the code above will use that index as input.
Okay, got it. Thanks :)
Hi,
I want to multiply my DocVector by an array of numbers, but I cannot use mapMultiply. How can I calculate this multiplication?
Hi, thank you very much for the nice information!
How can I get the cosine similarity between my query and all the other documents?
Hello Sagar,
If you want to do that, wrap your query in a file, index it in Lucene, treat it as a Lucene document, and calculate the cosine similarity between that file and the other documents. How cool is that :)
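As a rough illustration of that idea outside Lucene: build a term-frequency map for the query text and score it against each document's term-frequency map. The class name is my own, and the crude regex tokenizer below merely stands in for Lucene's StandardAnalyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "treat the query as a document": tokenize the
// query into a term-frequency map and score it against each document's
// term-frequency map with cosine similarity. In the real setup you
// would index the query file in Lucene next to the documents.
public class QuerySimilarity {

    // Crude tokenizer standing in for Lucene's StandardAnalyzer.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int f : b.values()) {
            normB += (double) f * f;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("cosine similarity");
        Map<String, Integer> doc1 = termFrequencies("This New Year I am learning how to calculate cosine similarity using Lucene.");
        Map<String, Integer> doc2 = termFrequencies("Personalize your app.");
        System.out.println("query vs doc1 = " + cosine(query, doc1));
        System.out.println("query vs doc2 = " + cosine(query, doc2)); // no shared terms, so 0.0
    }
}
```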
First off: thanks for the great work! That's a nice example for someone who didn't know much about Lucene in the first place, like me.
I do have a question though: I'd like to modify the code a bit, as I do not want to calculate the cosine similarity between two documents but to create a term-document matrix using tf-idf (and to extract the highest-rated terms for each document).
Do I understand correctly that term frequency is implemented (in VectorGenerator.GetDocumentVectors()) but inverse document frequency isn't?
I'd like to add that, but how can I access only the terms which occur in a specific document? It seems like the maps in all the docVectors contain all terms which occur in any of the documents, so I can't use those.
Thanks!
Hi Jacob, thank you for the awesome compliment.
Sorry for replying late. Below is sample pseudocode for calculating inverse document frequency:
```java
/**
 * Calculates the idf of the term termToCheck
 * @param allTerms : all the terms of all the documents
 * @param termToCheck
 * @return idf (inverse document frequency) score
 */
public double idfCalculator(List<String[]> allTerms, String termToCheck) {
    double count = 0;
    for (String[] ss : allTerms) {
        for (String s : ss) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
                break;
            }
        }
    }
    return 1 + Math.log(allTerms.size() / count);
}
```
Please refer here:
http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html?showComment=1449327050417#c553401772374378224
for more detail.
Thank you.
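For anyone who wants to try the idf pseudocode above directly, here is a runnable stand-alone version; the class name and sample documents are mine, not part of the blog's project.

```java
import java.util.Arrays;
import java.util.List;

// Runnable version of the idf pseudocode: counts how many documents
// contain the term and returns 1 + ln(N / df). It assumes the term
// occurs in at least one document (otherwise df is zero).
public class IdfDemo {

    public static double idf(List<String[]> allDocs, String termToCheck) {
        double count = 0;
        for (String[] docTerms : allDocs) {
            for (String term : docTerms) {
                if (term.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break; // count each document at most once
                }
            }
        }
        return 1 + Math.log(allDocs.size() / count);
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[]{"cosine", "similarity", "lucene"},
                new String[]{"cosine", "similarity"},
                new String[]{"lucene", "index"});
        System.out.println(idf(docs, "lucene")); // appears in 2 of 3 docs
    }
}
```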
Hi, thanks for uploading the code. I have one difficulty: when I imported the project mentioned in the link, I got errors everywhere vector.iterator(termsEnum) is used. Can you please help me with that? I am using the latest version of Lucene (5.3.1).
Hello,
I am off Lucene now, so I am as dumb as you are :). I suggest you walk through the Terms and TermsEnum documentation of Lucene 5.3.1.
I got these values for your three documents. The third one disagrees with your results; I am wondering why.
Cosine similarity 00 = 1.0
Cosine similarity 01 = 0.3464101615137754
Cosine similarity 02 = 0.18057877962865382
Thank you for the code! It's of great help to my project.
Hi,
Thanks for your explanation.
I have a doubt: how do I get the file name for each doc vector?
Thanks a lot.