Friday, January 2, 2015

Calculate Cosine Similarity Using Lucene 4.10.2


***EDIT
Many people have had problems with the code base pasted below. Please download the whole project from HERE; it should work. Cheers.

Someone recently asked me about calculating cosine similarity between documents using Lucene 4.10.2. It had been more than two years since I last used Lucene. There is a great article on computing document similarity, Salmon Run: Computing Document Similarity Using Lucene, but it is implemented against Lucene 3.x. So, to bring myself up to date, this new year I wrote some code to calculate cosine similarity using Lucene 4.10.2. I am using the following three text files as my test data:
Document Id | File Name | Text
0 | Document1.txt | This New Year I am learning how to calculate cosine similarity using Lucene. It will be fun.
1 | Document2.txt | Huh!! What you want to learn cosine similarity. Its new year man. Do something crazy and blasting. By the way, what is Cosine Similarity?
2 | Document3.txt | Dude, don't under estimate the power of cosine similarity. I can tell you which types of books are there in your computer simply by running my scripts of cosine similarity.
Moving on, to calculate cosine similarity in Lucene you first need some pre-configuration of the Lucene index: you have to store the terms and their frequencies in it. Lucene has built-in support for creating term vectors during indexing. You have to enable the following while indexing and creating the index fields:
    FieldType fieldType = new FieldType();
    fieldType.setIndexed(true);
    fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    fieldType.setStored(true);
    fieldType.setStoreTermVectors(true);
    fieldType.setTokenized(true);
The lines above tell Lucene to create term vectors for every term and store them in the index.
Now, let us calculate the cosine similarity between the three documents above using Lucene 4.10.2. The first step is to create the Lucene index. My code follows the same pattern Sujit Pal used in his excellent blog. Before we start creating the index, make sure you have downloaded the following libraries and added them to your project:
  1. lucene-core-4.10.2.jar
  2. lucene-analyzers-common-4.10.2.jar
  3. commons-math-2.0.jar

Also, note that my program's package name is com.computergodzilla.cosinesimilarity. Make sure that you copy the code below into the same package; otherwise you will have to create your own package and edit the Java files accordingly.

Step I. Preparing Lucene 4.10.2 Index

I am creating the Lucene index on the hard drive. You can also create it in memory; please refer to the Lucene documentation for details. The following code creates a Lucene index with a single field, which stores all the terms of the documents together with their frequencies. Settings such as the location of the source files and the index directory are kept in a separate Configuration class, given just after Indexer.java below.
// Indexer.java
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;


/**
 * Class to create a Lucene index from files.
 * Remember, this class only indexes files directly inside the source folder.
 * If there are sub-folders inside the source folder, the files in them
 * will not be indexed.
 * 
 * It will only index text files.
 * @author Mubin Shrestha
 */
public class Indexer {

    private final File sourceDirectory;
    private final File indexDirectory;
    private static String fieldName;

    public Indexer() {
        this.sourceDirectory = new File(Configuration.SOURCE_DIRECTORY_TO_INDEX);
        this.indexDirectory = new File(Configuration.INDEX_DIRECTORY);
        fieldName = Configuration.FIELD_CONTENT;
    }

    /**
     * Method to create the Lucene index.
     * Keep in mind that you must always index the text value itself
     * when calculating cosine similarity: generate the tokens, terms
     * and their frequencies and store them in the index.
     * @throws CorruptIndexException
     * @throws LockObtainFailedException
     * @throws IOException 
     */
    public void index() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        Directory dir = FSDirectory.open(indexDirectory);
        Analyzer analyzer = new StandardAnalyzer(StandardAnalyzer.STOP_WORDS_SET);  // using stop words
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_2, analyzer);

        if (indexDirectory.exists()) {
            // Overwrite the existing index
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Create a new index
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(dir, iwc);
        for (File f : sourceDirectory.listFiles()) {
            Document doc = new Document();
            FieldType fieldType = new FieldType();
            fieldType.setIndexed(true);
            fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            fieldType.setStored(true);
            fieldType.setStoreTermVectors(true);
            fieldType.setTokenized(true);
            Field contentField = new Field(fieldName, getAllText(f), fieldType);
            doc.add(contentField);
            writer.addDocument(doc);
        }
        writer.close();
    }

    /**
     * Method to get all the text of a text file.
     * Lucene cannot create term vectors and tokens from a Reader field,
     * so you have to index the text value itself.
     * It would be better if this were in another class,
     * but I am lazy, as you all know.
     * @param f
     * @return
     * @throws FileNotFoundException
     * @throws IOException 
     */
    public String getAllText(File f) throws FileNotFoundException, IOException {
        StringBuilder textFileContent = new StringBuilder();
        // Append a space between lines so words do not run together
        for (String line : Files.readAllLines(Paths.get(f.getAbsolutePath()))) {
            textFileContent.append(line).append(' ');
        }
        return textFileContent.toString();
    }
}
Below is the Configuration Class:
// Configuration.java
package com.computergodzilla.cosinesimilarity;

/**
 * @author Mubin Shrestha
 */
public class Configuration { 
    public static final String SOURCE_DIRECTORY_TO_INDEX = "E:/TEST";
    public static final String INDEX_DIRECTORY = "E:/INDEXDIRECTORY";
    public static final String FIELD_CONTENT = "contents"; // name of the field to index
}

Step II. Preparing IndexReader to read in the Lucene Index

Once the index is created, prepare an index-reader class. You will use IndexReader to read the index: to count terms and frequencies and to get the total number of documents. Below is the IndexOpener.java class.
package com.computergodzilla.cosinesimilarity;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

/**
 * Class to Get the Lucene Index Reader
 * @author Mubin Shrestha
 */
public class IndexOpener {
    
    public static IndexReader GetIndexReader() throws IOException {
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(Configuration.INDEX_DIRECTORY)));
        return indexReader;
    }

    /**
     * Returns the total number of documents in the index
     * @return
     * @throws IOException 
     */
    public static Integer TotalDocumentInIndex() throws IOException
    {
        // Open one reader, read the count, then close that same reader
        IndexReader indexReader = GetIndexReader();
        Integer maxDoc = indexReader.maxDoc();
        indexReader.close();
        return maxDoc;
    }
}

Step III. Getting all the terms indexed from the Lucene Index

This class is needed to generate the document vectors. The number of distinct terms in the index gives the length of the vector created for each document. Below is the AllTerms.java class.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class that will get all the terms in the index.
 * @author Mubin Shrestha
 */
public class AllTerms {
    private Map<String, Integer> allTerms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;
    
    public AllTerms() throws IOException
    {    
        allTerms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
    }
        
    public void initAllTerms() throws IOException
    {
        int pos = 0;
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                allTerms.put(term, pos++);
            }
        }       
        
        // Update position
        pos = 0;
        for(Entry<String,Integer> s : allTerms.entrySet())
        {        
            System.out.println(s.getKey());
            s.setValue(pos++);
        }
    }
    
    public Map<String,Integer> getAllTerms() {
        return allTerms;
    }
}
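To make the idea concrete, here is a toy sketch (plain Java, with a class name of my own, not the Lucene API) of building a term-to-position map over two small token lists, the same way AllTerms builds one over the whole index:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: each distinct term across all documents gets one
// position, and the number of distinct terms is the vector length.
public class TermPositions {
    public static Map<String, Integer> build(List<List<String>> docs) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (List<String> tokens : docs) {
            for (String t : tokens) {
                // Assign the next free position to each new term
                if (!positions.containsKey(t)) {
                    positions.put(t, positions.size());
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, Integer> pos = build(Arrays.asList(
                Arrays.asList("cosine", "similarity", "lucene"),
                Arrays.asList("cosine", "similarity", "fun")));
        // Four distinct terms, so the document vectors will have length 4
        System.out.println(pos);
    }
}
```

Here two documents share two terms, so the five tokens collapse into four distinct positions.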

Step IV. Generating Document Vectors

The next step is to create document vectors of all the documents indexed in the Lucene Index. Below is the VectorGenerator.java Class.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

/**
 * Class to generate Document Vectors from Lucene Index
 * @author Mubin Shrestha
 */
public class VectorGenerator {
    DocVector[] docVector;
    private Map<String, Integer> allterms;
    Integer totalNoOfDocumentInIndex;
    IndexReader indexReader;
    
    public VectorGenerator() throws IOException
    {
        allterms = new HashMap<>();
        indexReader = IndexOpener.GetIndexReader();
        totalNoOfDocumentInIndex = IndexOpener.TotalDocumentInIndex();
        docVector = new DocVector[totalNoOfDocumentInIndex];
    }
    
    public void GetAllTerms() throws IOException
    {
        AllTerms allTerms = new AllTerms();
        allTerms.initAllTerms();
        allterms = allTerms.getAllTerms();
    }
    
    public DocVector[] GetDocumentVectors() throws IOException {
        for (int docId = 0; docId < totalNoOfDocumentInIndex; docId++) {
            Terms vector = indexReader.getTermVector(docId, Configuration.FIELD_CONTENT);
            TermsEnum termsEnum = null;
            termsEnum = vector.iterator(termsEnum);
            BytesRef text = null;            
            docVector[docId] = new DocVector(allterms);            
            while ((text = termsEnum.next()) != null) {
                String term = text.utf8ToString();
                int freq = (int) termsEnum.totalTermFreq();
                docVector[docId].setEntry(term, freq);
            }
            docVector[docId].normalize();
        }        
        indexReader.close();
        return docVector;
    }
}

Step V. Create Cosine Similarity Class

The next step is to prepare the class that calculates cosine similarity. Below is the CosineSimilarity.java class.
package com.computergodzilla.cosinesimilarity;

/**
 * Class to calculate cosine similarity
 * @author Mubin Shrestha
 */
public class CosineSimilarity {    
    public static double CosineSimilarity(DocVector d1,DocVector d2) {
        double cosinesimilarity;
        try {
            cosinesimilarity = (d1.vector.dotProduct(d2.vector))
                    / (d1.vector.getNorm() * d2.vector.getNorm());
        } catch (Exception e) {
            return 0.0;
        }
        return cosinesimilarity;
    }
}
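If you want to see the formula itself without the Commons Math dependency, here is a minimal plain-Java sketch of the same dot-product-over-norms computation (the class and method names are mine, not part of the project above):

```java
// Minimal cosine similarity over raw term-frequency vectors:
// dot(a, b) / (||a|| * ||b||), the same computation as above.
public class PlainCosine {
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // same guard as the try/catch above
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] d1 = {1, 1, 0};
        double[] d2 = {1, 0, 1};
        System.out.println(cosine(d1, d1)); // identical vectors -> 1.0
        System.out.println(cosine(d1, d2)); // one shared term -> 0.5
    }
}
```

A document compared with itself always scores 1.0, and documents with no terms in common score 0.0, which matches the output table at the end of this post.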

Step VI. Document vector class

package com.computergodzilla.cosinesimilarity;

import java.util.Map;
import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;

/**
 *
 * @author Mubin
 */
public class DocVector {

    public Map<String, Integer> terms;
    public OpenMapRealVector vector;
    
    public DocVector(Map<String, Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());        
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (OpenMapRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}
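A side note on normalize(): it divides each entry by the L1 norm (the sum of the frequencies). Cosine similarity is scale-invariant, so this step does not change the final scores. The toy sketch below (my own code, not part of the classes above) illustrates that scaling a vector by a positive constant leaves its cosine with any other vector unchanged:

```java
// Toy check that scaling a vector does not change cosine similarity,
// which is why DocVector.normalize() leaves the scores intact.
public class ScaleInvariance {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double[] scale(double[] v, double k) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] * k;
        return out;
    }

    public static void main(String[] args) {
        double[] a = {2, 4, 0};
        double[] b = {1, 1, 1};
        double before = cosine(a, b);
        // The L1 norm of a is 6, mirroring what normalize() divides by
        double after = cosine(scale(a, 1.0 / 6.0), b);
        System.out.println(Math.abs(before - after) < 1e-12); // prints true
    }
}
```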

Step VII. All done, now fire up the program

All done. Now run the program from the main class below.
package com.computergodzilla.cosinesimilarity;

import java.io.IOException;
import org.apache.lucene.store.LockObtainFailedException;

/**
 * Main Class
 * @author Mubin Shrestha
 */
public class Test {
    
    public static void main(String[] args) throws LockObtainFailedException, IOException
    {
        getCosineSimilarity();
    }
    
    public static void getCosineSimilarity() throws LockObtainFailedException, IOException
    {
       Indexer index = new Indexer();
       index.index();
       VectorGenerator vectorGenerator = new VectorGenerator();
       vectorGenerator.GetAllTerms();       
       DocVector[] docVector = vectorGenerator.GetDocumentVectors(); // getting document vectors
       for(int i = 0; i < docVector.length; i++)
       {
           double cosineSimilarity = CosineSimilarity.CosineSimilarity(docVector[0], docVector[i]);
           System.out.println("Cosine Similarity Score between document 0 and "+i+"  = " + cosineSimilarity);
       }    
    }
}
Output:

Doc 1 | Doc 2 | Cosine Score
0 | 0 | 1
0 | 1 | 0.346410162
0 | 2 | 0.132453236
Fire away, guys, if you have any questions!!

Please check out my first Android app, NTyles: