Wednesday, July 17, 2013

How to use Lucene Highlighter.


In my previous post I showed some Java code for calculating cosine similarity and tf-idf, and for generating document vectors using tf-idf.
Somebody recently asked me about highlighting search results using Lucene. I didn't know that Lucene had a highlighting capability, so I googled a little and ran some experiments of my own. This post is dedicated to those who want to learn the Lucene highlighter.

What is highlighting in Lucene?
Highlighting in Lucene means returning the search word together with the text that surrounds it.
The Lucene documentation describes this as getting the "keyword in context". Highlighting helps in pulling out the parts of a text that relate to the search word. For example, if I search for Highlight in Google, it gives me related articles with snippets containing text like "Lucene highlighter", "How to use Lucene Highlighter", "Lucene Highlighter rocks", and so on.
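To make "keyword in context" concrete, here is a minimal plain-Java sketch. It is not how Lucene implements highlighting; the class name, the method name and the fixed character window are all assumptions made for illustration, and it assumes lowercasing does not change the length of the text.

```java
import java.util.Locale;

/** Plain-Java illustration of "keyword in context"; not Lucene's implementation. */
class KeywordInContextSketch {

    /**
     * Returns up to `window` characters of context on each side of the first
     * case-insensitive occurrence of `term`, with the match wrapped in <B> tags.
     * Returns null when the term does not occur in the text.
     */
    static String highlight(String text, String term, int window) {
        int start = text.toLowerCase(Locale.ROOT).indexOf(term.toLowerCase(Locale.ROOT));
        if (start < 0) {
            return null;
        }
        int end = start + term.length();
        int from = Math.max(0, start - window);
        int to = Math.min(text.length(), end + window);
        return text.substring(from, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end, to);
    }

    public static void main(String[] args) {
        // prints: use Lucene <B>Highlighter</B>
        System.out.println(highlight("How to use Lucene Highlighter", "highlighter", 11));
    }
}
```

Lucene does the same job on analyzed tokens rather than raw substrings, which is why the real highlighter needs an Analyzer and a Query.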

Highlighter is the central class. It extracts the interesting parts of the search hits and highlights them. By highlight I mean that you can color the interesting parts, make them bold, or format them in any other way using the formatting hooks the Lucene highlighter provides. For this formatting purpose there are interfaces such as:
  • Formatter
  • Fragmenter
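The division of labour between the two can be sketched in plain Java (Java 8 or later). The interfaces below are simplified, hypothetical stand-ins and not Lucene's own Formatter and Fragmenter, whose methods work on token streams: one role decides how the text is cut into fragments, the other decides how a matched term is marked up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-ins for the two roles; these are NOT Lucene's own interfaces. */
class HighlightRolesSketch {

    /** Hypothetical formatter role: decides how a matched term is marked up. */
    interface TermFormatter {
        String format(String matchedTerm);
    }

    /** Hypothetical fragmenter role: decides how the text is cut into pieces. */
    interface TextFragmenter {
        List<String> split(String text);
    }

    /** Fragmenter that cuts after each full stop followed by whitespace. */
    static List<String> bySentence(String text) {
        return Arrays.asList(text.split("(?<=\\.)\\s+"));
    }

    /** Formatter that makes the term bold. */
    static String bold(String term) {
        return "<B>" + term + "</B>";
    }

    /** Keeps only fragments containing the term (case-sensitive) and marks it up. */
    static List<String> bestFragments(String text, String term,
                                      TextFragmenter fragmenter, TermFormatter formatter) {
        List<String> result = new ArrayList<>();
        for (String fragment : fragmenter.split(text)) {
            if (fragment.contains(term)) {
                result.add(fragment.replace(term, formatter.format(term)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lucene is fast. Lucene Highlighter rocks. Search is fun.";
        System.out.println(bestFragments(text, "Highlighter",
                HighlightRolesSketch::bySentence, HighlightRolesSketch::bold));
        // Swapping the formatter changes the markup without touching the fragmenter.
        System.out.println(bestFragments(text, "Highlighter",
                HighlightRolesSketch::bySentence, t -> "<mark>" + t + "</mark>"));
    }
}
```

Keeping the two roles separate means the fragment size and the markup can be changed independently of each other.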
Implementing Lucene Highlighter in Java:
For this post I am using Lucene 4.2.1. For Lucene 4.2.1 the highlighter library is lucene-highlighter-4.2.1.jar, which sits in the "Highlighter" folder of the unzipped download. Lucene has three highlighter packages:
  1. org.apache.lucene.search.highlight
  2. org.apache.lucene.search.postingshighlight
  3. org.apache.lucene.search.vectorhighlight
Of these three I will explain only the first one. If you want to learn the other two, you can refer to the Lucene documentation. Without further ado, here are the steps involved in making the Lucene highlighter work:

Step 1 :
Create a Lucene document with two fields, one with term vectors enabled and one without.
Below is the Java code that does this:


        Document doc = new Document(); // create a new document

        // Field type with term vectors (including offsets) enabled
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);
        Field field = new Field("content", "Lucene Highlighter rocks", type); // with term vectors

        // A plain TextField does not store term vectors
        TextField f = new TextField("ncontent", "Lucene Highlighter rocks", Field.Store.YES);

        // Add both fields to the document
        doc.add(field);
        doc.add(f);


Step 2 : 
Add the document created in Step 1 to a Lucene index.
If you don't know how to build an index in Lucene, please refer to "How To Make Lucene Index" and "Use Lucene to Index Files".

Step 3 :
Integrate the Lucene highlighter into your Lucene search code.
Below is the code that uses the Lucene highlighter:

package com.computergodzilla.highlighter;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Example of Lucene Highlighter
 * @author Mubin Shrestha
 */
public class LuceneHighlighter {

    public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("D:/INDEXDIRECTORY")));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_42, "ncontent", analyzer);
        Query query = parser.parse("going");
        TopDocs hits = searcher.search(query, reader.maxDoc());
        System.out.println(hits.totalHits);
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        // Iterate over the hits actually returned; reader.maxDoc() can exceed that count
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int id = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(id);
            String text = doc.get("ncontent");
            TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }
            // Same again for the "content" field, which has term vectors stored
            text = doc.get("content");
            tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "content", analyzer);
            frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }
        }
        reader.close(); // release the index files once highlighting is done
    }
}


All the steps are complete.
SimpleHTMLFormatter wraps each matched term in <B> and </B> tags by default, so the search word shows up in bold within its surrounding text when the output is opened in a browser.
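Since the code prints the fragments to the console, they need to end up in an HTML file before a browser can render the bold tags. A minimal plain-Java sketch of that last step follows; the class name, the method name and the temporary file are assumptions made for illustration, and the fragments are assumed to contain no markup other than the highlighter's own tags.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Illustrative helper that turns highlighted fragments into a viewable page. */
class FragmentPageSketch {

    /** Wraps each already-highlighted fragment in its own paragraph. */
    static String toHtmlPage(List<String> fragments) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (String fragment : fragments) {
            html.append("<p>").append(fragment).append("</p>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the strings printed by frag[j].toString() in the example above
        List<String> fragments = Arrays.asList("Lucene <B>Highlighter</B> rocks");
        Path page = Files.createTempFile("highlights", ".html");
        Files.write(page, toHtmlPage(fragments).getBytes(StandardCharsets.UTF_8));
        System.out.println("Open in a browser: " + page.toAbsolutePath());
    }
}
```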

Happy Highlighting!!

3 comments:

  1. This loop seems questionable:

    for (int i = 0; i < reader.maxDoc(); i++) {
    int id = hits.scoreDocs[i].doc;

    You sure the top bound isn't supposed to be hits.totalHits?

  2. what is NaN in tf-idf calculation

    Replies
    1. NaN stands for Not-a-Number. It means some calculation produced an undefined numeric result, for example 0.0 / 0.0, and any further arithmetic on it stays NaN (Double.NaN + 2.0 is NaN). If you think there is a bug in my code, please let me know and I will verify. The code given in the tf-idf post has been verified, though. Please refer to the Java documentation on NaN: http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#NaN
