Monday, December 31, 2012

Apache Lucene -- How to index .doc, .pdf and .txt files?

In my previous two posts I showed you how to parse text from .doc and .pdf files. If you have worked through the source code and parsing examples from those posts, you are ready to index .doc and .pdf files. If you haven't, don't worry: all I am doing here is parsing the text out of .doc and .pdf files and then indexing it, and you are always welcome to visit the earlier posts first. Indexing .doc, .pdf and .txt files is very helpful for document management, clustering, classification, searching and many other tasks. Now, assuming you have your parsers set up, let's dive into the code. I am modifying our previous Indexer.java. The only new method I have added is:
 public void StartIndex(File file) throws FileNotFoundException, CorruptIndexException, IOException 
I have also introduced a new variable:
     private String fileContent;
This variable stores all the text returned after parsing the .doc and .pdf files. There is also a small edit in:
     private void checkFileValidity();
as follows:
&& file.isFile()) {
                    if (file.getName().endsWith(".txt")) {
                        indexTextFiles(file); // a plain text file needs no parsing
                        System.out.println("INDEXED FILE " + file.getAbsolutePath() + " :-) ");
                    } else if (file.getName().endsWith(".doc") || file.getName().endsWith(".pdf")) {
                        // a different method handles .doc and .pdf files
                        StartIndex(file);
                    }
All these modifications are needed only to call the parser methods for the .doc and .pdf files, get the text from them, and finally index that text along with the file's full path and name. You can download the whole project from here, or copy and paste the code below and run it yourself. The Indexer.java program is shown below:
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

//Indexer.java
package com.blogspot.computergodzilla.index;

/*The following code snippet uses APACHE LUCENE 3.4.0
    The parsers uses : Apache POI for .doc parsing and 
          PDFBox to parse .pdf files.
*/
import com.blogspot.computergodzilla.parsers.DocFileParser;
import com.blogspot.computergodzilla.parsers.PdfFileParser;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * This class builds an index of the files at the source location specified,
 * into the destination specified. It indexes .txt, .doc and .pdf files.
 *
 * @author Mubin Shrestha
 */
public class Indexer {

    private final String sourceFilePath = "H:/FolderToIndex";    //give the location of the source files location here
    private final String indexFilePath = "H:/DemoIndex";   //give the location where you guys want to create index
    private IndexWriter writer = null;
    private File indexDirectory = null;
    private String fileContent;  //temporary storer of all the text parsed from doc and pdf 

    /**
     *
     * @throws FileNotFoundException
     * @throws CorruptIndexException
     * @throws IOException
     */
    private Indexer() throws FileNotFoundException, CorruptIndexException, IOException {
        try {
            long start = System.currentTimeMillis();
            createIndexWriter();
            checkFileValidity();
            closeIndexWriter();
            long end = System.currentTimeMillis();
            System.out.println("Total Document Indexed : " + TotalDocumentsIndexed());
            System.out.println("Total time: " + (end - start) / 1000.0 + " s");
        } catch (Exception e) {
            System.out.println("Sorry, the task cannot be completed: " + e.getMessage());
        }
    }

    /**
     * Creates the IndexWriter, which writes data to the index. It is
     * configured with Lucene's StandardAnalyzer, which filters out English
     * stop words.
     */
    private void createIndexWriter() {
        try {
            indexDirectory = new File(indexFilePath);
            if (!indexDirectory.exists()) {
                indexDirectory.mkdir();
            }
            FSDirectory dir = FSDirectory.open(indexDirectory);
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
            writer = new IndexWriter(dir, config);
        } catch (Exception ex) {
            System.out.println("Sorry cannot get the index writer");
        }
    }

    /**
     * This function checks whether each file in the source folder is a valid,
     * readable file, and indexes it accordingly.
     */
    private void checkFileValidity() {
        File[] filesToIndex = new File(sourceFilePath).listFiles();
        for (File file : filesToIndex) {
            try {
                //check whether the file is a readable, non-empty regular file.
                if (!file.isDirectory()
                        && !file.isHidden()
                        && file.exists()
                        && file.canRead()
                        && file.length() > 0.0
                        && file.isFile() ) {
                    if (file.getName().endsWith(".txt")) {
                        indexTextFiles(file); // a plain text file needs no parsing
                        System.out.println("INDEXED FILE " + file.getAbsolutePath() + " :-) ");
                    } else if (file.getName().endsWith(".doc") || file.getName().endsWith(".pdf")) {
                        // a different method handles .doc and .pdf files
                        StartIndex(file);
                    }
                }
            } catch (Exception e) {
                System.out.println("Sorry cannot index " + file.getAbsolutePath());
            }
        }
    }
    
    
    /**
     * This method is for indexing pdf file and doc file.
     * The text parsed from them is indexed along with the filename and filepath.
     * @param file : the file which you want to index
     * @throws FileNotFoundException
     * @throws CorruptIndexException
     * @throws IOException 
     */
    public void StartIndex(File file) throws FileNotFoundException, CorruptIndexException, IOException {
        fileContent = null;
        try {
            Document doc = new Document();
            if (file.getName().endsWith(".doc")) {
                //call the doc file parser and get the content of doc file in txt format
                fileContent = new DocFileParser().DocFileContentParser(file.getAbsolutePath());
            }
            if (file.getName().endsWith(".pdf")) {
                //call the pdf file parser and get the content of pdf file in txt format
                fileContent = new PdfFileParser().PdfFileParser(file.getAbsolutePath());
            }
            doc.add(new Field("content", fileContent,
                    Field.Store.YES, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("filename", file.getName(),
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("fullpath", file.getAbsolutePath(),
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            System.out.println("Indexed " + file.getAbsolutePath());
        } catch (Exception e) {
            System.out.println("error in indexing " + file.getAbsolutePath());
        }
    }

    /**
     * This method indexes text files.
     * @param file
     * @throws CorruptIndexException
     * @throws IOException
     */
    private void indexTextFiles(File file) throws CorruptIndexException, IOException {
        Document doc = new Document();
        doc.add(new Field("content", new FileReader(file)));
        doc.add(new Field("filename", file.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fullpath", file.getAbsolutePath(),
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    /**
     * This method returns the total number of documents indexed.
     * @return total number of documents indexed.
     */
    private int TotalDocumentsIndexed() {
        try {
            IndexReader reader = IndexReader.open(FSDirectory.open(indexDirectory));
            return reader.maxDoc();
        } catch (Exception ex) {
            System.out.println("Sorry no index found");
        }
        return 0;
    }

    /**
     *  closes the IndexWriter
     */
    private void closeIndexWriter() {
        try {
            writer.optimize();
            writer.close();
        } catch (Exception e) {
            System.out.println("IndexWriter cannot be closed");
        }
    }

    /**
     * Main method.
     */
    public static void main(String arg[]) {
        try {
            new Indexer();
        } catch (Exception ex) {
            System.out.println("Cannot Start :(");
        }
    }
}
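
Once the index is built, you will of course want to search it. The following is a minimal sketch (not part of the project above) of how you might query the "content" field with the Lucene 3.4 API; the index path "H:/DemoIndex" matches the indexer above, and the query term "lucene" is just an example:

```java
//Searcher sketch for the index built by Indexer.java (Lucene 3.4 API).
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

    public static void main(String[] args) throws Exception {
        //open the same index directory the Indexer wrote to
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("H:/DemoIndex")));
        IndexSearcher searcher = new IndexSearcher(reader);
        //parse the query against the "content" field, with the same analyzer used at index time
        QueryParser parser = new QueryParser(Version.LUCENE_34, "content",
                new StandardAnalyzer(Version.LUCENE_34));
        Query query = parser.parse("lucene"); //example search term
        TopDocs topDocs = searcher.search(query, 10); //top 10 hits
        for (ScoreDoc sd : topDocs.scoreDocs) {
            //"fullpath" is one of the stored fields from Indexer.java
            System.out.println(searcher.doc(sd.doc).get("fullpath"));
        }
        searcher.close();
        reader.close();
    }
}
```

Because "filename", "fullpath" and "content" were stored/analyzed by the indexer above, the hits come back with enough information to locate the matching file on disk.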

You all can download the whole project from here.
Apache POI can now also parse text from .docx, .xlsx and .pptx files, so you can index those too. Parse text from .docx, .xlsx and .pptx shows how to parse text from these formats. If you want to index them, simply write a function that returns their text and index it with the code sample given above. If you need any help indexing .docx, .xlsx or .pptx files, please don't hesitate to reach out to me.
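
For example, a .docx parser in the same style as the DocFileParser used above could look like this. This is only a sketch: it assumes POI's XWPF classes (from the poi-ooxml jar) are on the classpath, and DocxFileParser / DocxFileContentParser are hypothetical names I chose to mirror the existing parsers:

```java
//Hypothetical .docx parser, mirroring DocFileParser. Requires poi-ooxml.
import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxFileParser {

    public String DocxFileContentParser(String fileName) {
        try {
            //XWPFDocument reads the OOXML container; the extractor pulls out plain text
            XWPFDocument docx = new XWPFDocument(new FileInputStream(fileName));
            XWPFWordExtractor extractor = new XWPFWordExtractor(docx);
            return extractor.getText();
        } catch (Exception e) {
            System.out.println("Cannot parse " + fileName);
            return "";
        }
    }
}
```

In checkFileValidity() you would then add a branch for file.getName().endsWith(".docx") that calls this parser and adds the returned text to the writer, exactly as StartIndex() does for .doc and .pdf.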
