ComputerGodzilla: How to use Lucene Highlighter.

In my previous post I showed some Java code for calculating cosine similarity and TF-IDF, and for generating document vectors using TF-IDF.
Somebody recently asked me about highlighting search results using Lucene. I didn't know that Lucene had a highlighting capability, so I googled a little and ran some experiments of my own. This post is dedicated to those who want to learn the Lucene highlighter.

What is Highlighting in Lucene?
Highlighting in Lucene means returning the matched search words together with the text that surrounds them.
The Lucene documentation calls this "keyword in context". Highlighting helps you pull out the parts of a text that are related to the search word. For example, if I search for Highlight in Google, I get back articles containing text like "Lucene highlighter", "How to use Lucene Highlighter", "Lucene Highlighter rocks" and so on.
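
To make the "keyword in context" idea concrete, here is a minimal plain-Java sketch. It does not use Lucene and is not how Lucene implements highlighting; the class name KeywordInContextSketch, the method snippet and the window parameter are all made up for this illustration. It finds the first occurrence of a term, keeps a few characters of context on each side and wraps the hit in bold tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java illustration of "keyword in context"; hypothetical, not Lucene code. */
final class KeywordInContextSketch {

    /**
     * Returns the first occurrence of term in text with up to `window`
     * characters of context on each side and the hit wrapped in <B>...</B>.
     * Returns null when the term does not occur.
     */
    static String snippet(String text, String term, int window) {
        // Quote the term so it is matched literally, ignoring case
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE).matcher(text);
        if (!m.find()) {
            return null;
        }
        int from = Math.max(0, m.start() - window);
        int to = Math.min(text.length(), m.end() + window);
        return text.substring(from, m.start())
                + "<B>" + m.group() + "</B>"
                + text.substring(m.end(), to);
    }
}
```

Calling KeywordInContextSketch.snippet("Lucene Highlighter rocks", "highlighter", 100) returns the whole sentence with Highlighter wrapped in bold tags. A real highlighter works on analyzed tokens rather than raw characters, which is what the Lucene classes below take care of.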

Highlighter is the central class. It extracts the interesting parts of the search hits and highlights them. By highlight, I mean that you can color the interesting parts of a result or make them bold. More generally, you can control how the matched text is marked up and how the text is split into fragments. For this purpose the highlight package provides interfaces like:

Formatter
Fragmenter
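
To show how the two roles fit together, here is a rough plain-Java sketch. The interface TermFormatter and the methods fragment and highlight are simplified stand-ins that I invented for this illustration; they are not Lucene's Formatter and Fragmenter interfaces, which work on token streams and have different signatures. The sketch splits the text into fixed-size groups of words (the fragmenter role), marks up each matching word (the formatter role) and keeps only the fragments that contain a hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-ins for the two roles; these are NOT Lucene's interfaces. */
final class HighlightRolesSketch {

    /** Decides how a matched word is marked up (the formatter role). */
    interface TermFormatter {
        String format(String word);
    }

    /** Splits text into fragments of at most `size` words (the fragmenter role). */
    static List<String> fragment(String text, int size) {
        String[] words = text.trim().split("\\s+");
        List<String> fragments = new ArrayList<>();
        for (int i = 0; i < words.length; i += size) {
            int end = Math.min(words.length, i + size);
            fragments.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return fragments;
    }

    /** Returns only the fragments that contain the term, with each hit formatted. */
    static List<String> highlight(String text, String term, int size, TermFormatter formatter) {
        List<String> out = new ArrayList<>();
        for (String fragment : fragment(text, size)) {
            StringBuilder sb = new StringBuilder();
            boolean hit = false;
            for (String word : fragment.split(" ")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                if (word.equalsIgnoreCase(term)) {
                    sb.append(formatter.format(word));
                    hit = true;
                } else {
                    sb.append(word);
                }
            }
            if (hit) {
                out.add(sb.toString());
            }
        }
        return out;
    }
}
```

For example, highlight("Lucene Highlighter rocks and Lucene is fast", "lucene", 3, w -> "<B>" + w + "</B>") yields two fragments, one per group of three words that mentions Lucene. Swapping the lambda changes the markup without touching the splitting logic, which is the point of keeping the two roles separate.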

Implementing Lucene Highlighter in Java:
For this post I am using Lucene 4.2.1. For Lucene 4.2.1 the highlighter library is lucene-highlighter-4.2.1.jar, which resides in the "highlighter" folder after you unzip the downloaded archive. There are three highlighter packages in Lucene:

org.apache.lucene.search.highlight
org.apache.lucene.search.postingshighlight
org.apache.lucene.search.vectorhighlight

Of these three I will explain only the first one. If you want to learn about the other two, you can refer to the Lucene documentation. Without further ado, here are the steps involved in making the Lucene highlighter work:

Step 1 :
Create a Lucene document with two fields, one with term vectors enabled and another without term vectors.
Below is the Java code that does this:

        // Imports used: org.apache.lucene.document.{Document, Field, FieldType, TextField}
        // and org.apache.lucene.index.FieldInfo
        Document doc = new Document(); // create a new document

        // Field with term vectors (and offsets) enabled
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);
        Field field = new Field("content", "Lucene Highlighter rocks", type);

        // Field without term vectors
        TextField f = new TextField("ncontent", "Lucene Highlighter rocks", Field.Store.YES);

        // Add both fields to the document
        doc.add(field);
        doc.add(f);

Step 2 :
Add the documents created in Step 1 to a Lucene index. Read How To Make Lucene Index.
If you don't know how to build an index in Lucene, please refer to "Use Lucene to Index Files".

Step 3 :
Integrate the Lucene highlighter into your Lucene search code.
Below is the code for using the Lucene highlighter:

package com.computergodzilla.highlighter;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Example of Lucene Highlighter
 * @author Mubin Shrestha
 */
public class LuceneHighlighter {

    public void highLighter() throws IOException, ParseException, InvalidTokenOffsetsException {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("D:/INDEXDIRECTORY")));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(Version.LUCENE_42, "ncontent", analyzer);
        Query query = parser.parse("going"); // use a term that occurs in your own index
        TopDocs hits = searcher.search(query, reader.maxDoc());
        System.out.println(hits.totalHits);
        // SimpleHTMLFormatter wraps each matched term in <B>...</B> by default
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
        // Loop over the hits that were actually returned, not over every document
        // in the index; otherwise scoreDocs[i] runs past the end of the array
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            int id = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(id);
            // Field without term vectors: the stored text is re-analyzed
            String text = doc.get("ncontent");
            TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "ncontent", analyzer);
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }
            // Field with term vectors: getAnyTokenStream uses them when they are available
            text = doc.get("content");
            tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "content", analyzer);
            frag = highlighter.getBestTextFragments(tokenStream, text, false, 4);
            for (int j = 0; j < frag.length; j++) {
                if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                    System.out.println((frag[j].toString()));
                }
            }
        }
        reader.close(); // release the index files
    }
}

All the steps are complete.
The above code prints the best-matching fragments with each matched term wrapped in <B>...</B> tags, so the matches appear in bold when the output is opened in any browser.
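
If you want to actually look at the result in a browser, one option is to collect the printed fragments and write them into a small HTML page. The following is a minimal sketch using only the Java standard library; the class FragmentPageSketch and its methods are hypothetical helpers of mine, not part of Lucene, and they assume the fragments already contain the bold tags produced by the highlighter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that turns highlighted fragments into a viewable page. */
final class FragmentPageSketch {

    /** Builds a minimal HTML page with one paragraph per highlighted fragment. */
    static String toHtml(List<String> fragments) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (String fragment : fragments) {
            // Fragments are inserted as-is because they already carry <B> tags
            html.append("<p>").append(fragment).append("</p>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    /** Writes the page to the given file so it can be opened in a browser. */
    static void write(List<String> fragments, Path target) throws IOException {
        Files.write(target, toHtml(fragments).getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that this sketch does not escape the original text, so indexed content that itself contains HTML would be rendered as markup. As far as I can tell, Highlighter also has a constructor that takes an Encoder (for example SimpleHTMLEncoder) for that case, but I have not tried it in this post.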

Happy Highlighting!!

Wednesday, July 17, 2013