Sunday, December 30, 2012

Apache Lucene: How to parse text from .doc, .xls and .ppt files

In my previous post I explained that text has to be extracted from .doc and other files before it can be added to a Lucene index. In this post I will show you how to parse text from .doc, .xls and .ppt files. Lucene itself has no mechanism for parsing text out of .doc files. If you had tried doing this:
public class Indexer {

    private final String sourceFilePath = "H:/FolderToIndex/abc.doc"; 

and then run the indexer, you must have gone crazy seeing dozens of errors! That's why I am using a third-party library to parse text from those files: Apache POI, which you can download from here. It is a wonderful, free, open-source library that can parse text from the following file extensions and more:
  • .doc
  • .xls
  • .ppt
  • .docx
  • .pptx
  • .xlsx
  • and many more..
Here is the sample code, which of course you can download from here.
//DocFileParser.java
package com.blogspot.computergodzilla.parsers;

import java.io.FileInputStream;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

/**
 * This class parses legacy Microsoft Office files (.doc, .xls and .ppt).
 * It does not handle .docx, .pptx or other newer Office file formats.
 * 
 * @author Mubin Shrestha
 */
public class DocFileParser {
    
   /**
    * This method parses the content of the .doc file.
    * i.e. this method will return all the text of the file passed to it.
    * @param fileName : name of the file whose content you want.
    * @return : returns the content of the file
    */
    public String DocFileContentParser(String fileName) {
        // Open the file up front: the Excel, PowerPoint and Word extractors
        // all read from the same POIFSFileSystem, so it must not be null.
        try (FileInputStream in = new FileInputStream(fileName)) {
            POIFSFileSystem fs = new POIFSFileSystem(in);

            if (fileName.endsWith(".xls")) { //if the file is excel file
                ExcelExtractor ex = new ExcelExtractor(fs);
                return ex.getText(); //returns text of the excel file
            } else if (fileName.endsWith(".ppt")) { //if the file is power point file
                PowerPointExtractor extractor = new PowerPointExtractor(fs);
                return extractor.getText(); //returns text of the power point file
            }

            //else for .doc file
            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);
            return we.getText(); //returns text of the word file
        } catch (Exception e) {
            System.out.println("document file can't be indexed: " + e.getMessage());
        }
        return "";
    }

    /**
     * Main method.
     * @param args 
     */
    public static void main(String args[])
    {
        String filepath = "H:/Filtering.ppt";
        System.out.println(new DocFileParser().DocFileContentParser(filepath));
        
    }
}
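
One thing to watch in DocFileContentParser: the endsWith checks are case-sensitive, so a file named Filtering.PPT would fall through to the Word branch and most likely fail. Below is a minimal sketch of a case-insensitive extension check. The OfficeFormat class, its Kind enum and the detect method are hypothetical names used only for illustration; they are not part of Apache POI or of the downloadable sample, and the sketch uses nothing beyond the JDK.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Apache POI or of the original sample. */
class OfficeFormat {

    /** The legacy formats DocFileParser handles, plus a catch-all. */
    public enum Kind { WORD, EXCEL, POWERPOINT, UNSUPPORTED }

    /** Maps a file name to a format, ignoring the case of the extension. */
    public static Kind detect(String fileName) {
        if (fileName == null) {
            return Kind.UNSUPPORTED;
        }
        // Locale.ROOT keeps the lower-casing independent of the machine's locale.
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) {
            return Kind.WORD;
        } else if (lower.endsWith(".xls")) {
            return Kind.EXCEL;
        } else if (lower.endsWith(".ppt")) {
            return Kind.POWERPOINT;
        }
        // .docx, .xlsx, .pptx, .pdf and everything else end up here.
        return Kind.UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(detect("H:/Filtering.ppt"));          // POWERPOINT
        System.out.println(detect("H:/FolderToIndex/ABC.DOC"));  // WORD
        System.out.println(detect("H:/FolderToIndex/abc.docx")); // UNSUPPORTED
    }
}
```

With something like this in place, the parser could switch on the returned Kind and refuse UNSUPPORTED files with a clear message instead of handing them to the Word extractor.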
Here is a sample video demonstrating the parsing done by DocFileParser:



My other post, "Parse text from .docx, .pptx and .xlsx files", shows how to parse text from the newer .docx, .pptx and .xlsx formats.
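
Because DocFileParser simply returns an empty string when it is given a .docx file, it can help to check what kind of container a file really is before choosing a parser. The sketch below looks at the first bytes of the file: legacy .doc, .xls and .ppt files are OLE2 compound files and start with the signature D0 CF 11 E0 A1 B1 1A E1, while .docx, .xlsx and .pptx files are ZIP archives and start with "PK". The OfficeFileSniffer class and its method names are hypothetical, written for illustration with the JDK only. Note that a ZIP signature alone does not prove the file is an Office document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper, not part of Apache POI or of the original sample. */
class OfficeFileSniffer {

    public enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header "PK\3\4"; .docx, .xlsx and .pptx are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Classifies a file from its leading bytes. */
    public static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    /** Reads up to eight bytes from the file and classifies them. */
    public static Container sniff(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sniff-demo", ".bin");
        try {
            Files.write(tmp, new byte[] { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 });
            System.out.println(sniff(tmp)); // ZIP
        } finally {
            Files.delete(tmp);
        }
    }
}
```

A file that sniffs as OLE2 can go to DocFileParser; one that sniffs as ZIP is a candidate for the .docx, .pptx and .xlsx parser from the other post.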

If you know of other third-party open-source tools, please feel free to share them. In my next post I will show how to parse text from PDF files.

6 comments:

  1. I'm getting an error when indexing a .docx file: DocFileContentParser() returns an empty string. Why is this happening? I have spent a lot of time on this. After debugging I noticed that it does not return anything. Why am I getting this error when indexing the abc.docx file?

    1. Hey Kunal,

      I have clearly mentioned that the above source code does not work for .docx files. Please see the class comment of DocFileParser. Please visit http://computergodzilla.blogspot.com/2013/05/index-docx-pptx-xlsx-file-using-apache.html for parsing .docx files.

  2. Hey.
    For PDF parsing, do you know how to extract the text as a .txt file, sir?
    I'm learning Lucene for OCR and I'd appreciate any help! Thank you.

    1. Hello Unknown,

      Please follow this article: http://computergodzilla.blogspot.com/2012/12/apache-lucene-how-to-parse-texts-from_30.html.

      Just save the text parsed from the PDF file into a text file, and you will have a text version of your PDF.

  3. I figured that part out. Thank you, you were very helpful. Now, would you happen to know any way to do the same with scanned PDFs?

    1. Scanned PDFs are image files, so Apache PDFBox cannot parse text from them. If you have to extract text from images in PDFs, I would suggest ABBYY OCR, but I have very little knowledge of it. If OCR is the way you go, Tesseract OCR is also great. Good luck.
