In my previous post I said that text has to be parsed out of .doc and other files before it can be indexed in a Lucene index. In this post I will show you how to parse text from .doc, .xls and .ppt files. Lucene has no built-in mechanism for parsing text from .doc files. If you tried doing this:

    public class Indexer {
        private final String sourceFilePath = "H:/FolderToIndex/abc.doc";

and tried to run it, you must have gone crazy seeing dozens of errors! That's why I am using a third-party library to parse text from those files. For this I am using the Apache POI library, which you can download from the Apache POI website. It is a wonderful, free, open source library that can parse text from the following file extensions and more:
- .doc
- .xls
- .ppt
- .docx
- .pptx
- .xlsx and many more
    //DocFileParser.java
    package com.blogspot.computergodzilla.parsers;

    import java.io.FileInputStream;
    import org.apache.poi.hslf.extractor.PowerPointExtractor;
    import org.apache.poi.hssf.extractor.ExcelExtractor;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    /**
     * This class parses the legacy Microsoft Office files (.doc, .xls, .ppt).
     * It does not handle .docx, .pptx or other newer Office formats.
     *
     * @author Mubin Shrestha
     */
    public class DocFileParser {

        /**
         * This method parses the content of a .doc, .xls or .ppt file,
         * i.e. it returns all the text of the file passed to it.
         * @param fileName : name of the file whose content you want.
         * @return : the content of the file, or "" if it could not be parsed
         */
        public String DocFileContentParser(String fileName) {
            try (FileInputStream in = new FileInputStream(fileName)) {
                // Open the OLE2 container up front: all three extractors read from it.
                POIFSFileSystem fs = new POIFSFileSystem(in);
                if (fileName.endsWith(".xls")) { // Excel file
                    ExcelExtractor ex = new ExcelExtractor(fs);
                    return ex.getText();
                } else if (fileName.endsWith(".ppt")) { // PowerPoint file
                    PowerPointExtractor extractor = new PowerPointExtractor(fs);
                    return extractor.getText();
                }
                // Anything else is treated as a .doc file.
                HWPFDocument doc = new HWPFDocument(fs);
                WordExtractor we = new WordExtractor(doc);
                return we.getText();
            } catch (Exception e) {
                System.out.println("document file cant be indexed");
            }
            return "";
        }

        /**
         * Main method.
         * @param args
         */
        public static void main(String args[]) {
            String filepath = "H:/Filtering.ppt";
            System.out.println(new DocFileParser().DocFileContentParser(filepath));
        }
    }
My post "Parse text from .docx, .pptx and .xlsx files" shows how to handle those newer formats.
If you know of other third-party open source tools, please feel free to share. In my next post I will show how to parse text from PDF files.
I'm getting an error when indexing a .docx file. DocFileContentParser() returns an empty string. Why is this happening? I have spent a lot of time on this. After debugging I noticed that it is not returning anything. Why am I getting this error when indexing the abc.docx file?
Hey Kunal,
I clearly mentioned that the above source code does not work for .docx files. Please see the class comment of DocFileParser. Please visit http://computergodzilla.blogspot.com/2013/05/index-docx-pptx-xlsx-file-using-apache.html for parsing .docx files.
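One way to see why the legacy parser comes back empty for a .docx file: .doc, .xls and .ppt files are OLE2 compound files, while .docx, .pptx and .xlsx files are ZIP archives, so POIFSFileSystem cannot open them and the catch block returns an empty string. The sketch below uses only the JDK to check a file's leading bytes and tell the two containers apart. The class OfficeFormatSniffer and its method names are hypothetical, chosen for illustration; they are not part of Apache POI or of the original post.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not part of Apache POI): sniffs the container format. */
final class OfficeFormatSniffer {

    // First 8 bytes of an OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // First 4 bytes of a ZIP archive, the container used by .docx, .xlsx, .pptx.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies leading file bytes as "OLE2", "ZIP" or "UNKNOWN". */
    static String detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Reads up to the first 8 bytes of the file and classifies them. */
    static String detectFile(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.readNBytes(header, 0, header.length); // requires Java 9+
            return detect(Arrays.copyOf(header, n));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OfficeFormatSniffer <file>");
            return;
        }
        System.out.println(args[0] + " -> " + detectFile(Paths.get(args[0])));
    }
}
```

A "ZIP" result only says the file is a ZIP archive, which every .docx, .pptx and .xlsx file is; it does not prove the archive is a valid Office document. It is still enough to route such files to the parser from the linked post instead of DocFileParser.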
Hey.
For PDF parsing, do you know how to extract the text as a .txt file, sir?
I'm working on learning Lucene for OCR and I'd appreciate any help! Thank you.
Hello Unknown,
Please follow this article: http://computergodzilla.blogspot.com/2012/12/apache-lucene-how-to-parse-texts-from_30.html.
Just save the text parsed from the PDF file into a text file, and there you are: you will have a text file of your PDF.
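The saving step can be sketched with the JDK alone. This is a minimal example, assuming the parsed text is already held in a String; the class TextFileWriter, the method save and the output file name parsed-output.txt are hypothetical names for illustration, and the placeholder string stands in for whatever your PDF parser returns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: writes already-parsed text to a plain .txt file. */
final class TextFileWriter {

    /** Writes the text as UTF-8, creating the file or overwriting an existing one. */
    static Path save(String text, Path target) throws IOException {
        return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder: in practice this is the String returned by your PDF parser.
        String parsedText = "text returned by your PDF parser";
        Path out = save(parsedText, Paths.get("parsed-output.txt"));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

Writing UTF-8 explicitly avoids depending on the platform default encoding, which matters if the PDF contains non-ASCII text.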
I figured that part out. Thank you. You were very helpful. Now, would you happen to know any way to do the same with scanned PDFs?
Scanned PDFs are image files, so Apache PDFBox cannot parse text from them. If you have to extract text from the images in PDFs, I would suggest ABBYY OCR, though I have very little knowledge of it. If OCR is the way you go, Tesseract OCR is also great. Good luck.