ComputerGodzilla: Apache POI : Parse text from .docx, .pptx, .xlsx file using Apache POI 3.9

Thursday, May 16, 2013

Apache POI : Parse text from .docx, .pptx, .xlsx file using Apache POI 3.9


Back when I first created this blog and wrote "Parse text from word files", the Apache POI package was not able to parse text from .docx, .xlsx and .pptx files. Now it can extract text from all three formats.

Here is the code to parse text from those files using Apache POI 3.9.

package com.blogspot.computergodzilla.xparser;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.SAXException;

/**
 * This class parses .docx, .pptx and .xlsx files.
 *
 * @author Mubin Shrestha
 */
public class XParsers {

    /**
     * This method parses the .docx files.
     *
     * @param docx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public void DocFileContentParser(OPCPackage docx) throws FileNotFoundException,
            IOException,
            XmlException,
            InvalidFormatException,
            OpenXML4JException,
            ParserConfigurationException,
            SAXException {
        XWPFWordExtractor xw = new XWPFWordExtractor(docx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses the .pptx files.
     *
     * @param pptx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void ppFileContentParser(OPCPackage pptx) throws FileNotFoundException,
            IOException,
            InvalidFormatException,
            XmlException,
            OpenXML4JException {
        XSLFPowerPointExtractor xw = new XSLFPowerPointExtractor(pptx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses the .xlsx files.
     *
     * @param xlsx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void excelContentParser(OPCPackage xlsx) throws FileNotFoundException,
            IOException,
            InvalidFormatException,
            XmlException,
            OpenXML4JException {
        XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
        System.out.println(xe.getText());
    }

    /**
     * main method
     *
     * @param args
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public static void main(String args[]) throws FileNotFoundException,
            IOException,
            XmlException,
            InvalidFormatException,
            OpenXML4JException,
            ParserConfigurationException,
            SAXException {
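        File file = new File("fileName"); // replace with the path of the file to parse
        FileInputStream fs = new FileInputStream(file);
        try {
            OPCPackage d = OPCPackage.open(fs);
            XParsers xp = new XParsers();
            // Compare in lower case so that ".DOCX" and ".docx" are treated alike.
            String name = file.getName().toLowerCase();
            if (name.endsWith(".docx")) {
                xp.DocFileContentParser(d);
            } else if (name.endsWith(".xlsx")) {
                xp.excelContentParser(d);
            } else if (name.endsWith(".pptx")) {
                xp.ppFileContentParser(d);
            } else {
                System.err.println("Unsupported file type: " + file.getName());
            }
        } finally {
            fs.close(); // release the file handle even if parsing fails
        }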
    }
}
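
The main method above picks the extractor from the file extension alone. As a possible complement, here is a minimal sketch that guesses the type from the file contents instead. It relies on the fact that .docx, .xlsx and .pptx files are ZIP packages, and assumes the conventional main part names written by Microsoft Office (word/document.xml, xl/workbook.xml, ppt/presentation.xml). Strictly speaking the format locates the main part through relationships, so treat this as a heuristic. The class name OoxmlKind and the method detect are hypothetical names of mine, not part of Apache POI, and the sketch uses only the JDK.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

public class OoxmlKind {

    /**
     * Returns "docx", "xlsx", "pptx" or "unknown", depending on which
     * conventional main part is present in the ZIP package.
     * Throws an IOException if the file is not a readable ZIP archive.
     */
    public static String detect(File file) throws IOException {
        ZipFile zip = new ZipFile(file);
        try {
            if (zip.getEntry("word/document.xml") != null) {
                return "docx";
            }
            if (zip.getEntry("xl/workbook.xml") != null) {
                return "xlsx";
            }
            if (zip.getEntry("ppt/presentation.xml") != null) {
                return "pptx";
            }
            return "unknown";
        } finally {
            zip.close(); // always release the file handle
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: java OoxmlKind <file>");
            return;
        }
        System.out.println(detect(new File(args[0])));
    }
}
```

The returned string could then drive the same if/else chain as in main, which would also handle files that were renamed or carry an upper-case extension.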

Edit:

Well, I must be a very bad blogger. Here is a picture of the list of libraries needed:
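
A missing library only shows up at run time, as in the NoClassDefFoundError for org/dom4j/DocumentException quoted in the comments. The following is a minimal sketch of a classpath check that can be run before the parser. The class ClasspathCheck and its methods are hypothetical helpers of mine, not part of Apache POI. The class names it probes are taken from the imports in the code above and from that stack trace, so the list is an assumption and may not be complete for your setup.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns true if the named class can be found on the classpath. */
    public static boolean isPresent(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its own dependencies is missing.
            return false;
        }
    }

    /** Returns the subset of the given class names that cannot be loaded. */
    public static List<String> findMissing(String... classNames) {
        List<String> result = new ArrayList<String>();
        for (String name : classNames) {
            if (!isPresent(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed probe list: one class per library seen in the post.
        List<String> missing = findMissing(
                "org.apache.poi.openxml4j.opc.OPCPackage", // POI OOXML support
                "org.apache.xmlbeans.XmlException",        // XMLBeans
                "org.dom4j.DocumentException");            // dom4j
        if (missing.isEmpty()) {
            System.out.println("All probed classes were found.");
        } else {
            System.out.println("Missing classes: " + missing);
        }
    }
}
```

If the check reports a missing class, adding the jar that contains it to the project's libraries is the usual remedy.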

15 comments:

Anonymous, July 5, 2013 at 10:03 PM
There is also a good set of Java APIs (JOffice, JWord, JSpreadsheet, JPresentation, JODF) to process documents, spreadsheets and presentations.

http://www.independentsoft.de

Unknown, November 25, 2013 at 4:31 PM
I was trying to run the program above, but it shows a runtime error. Please tell me how to fix it. The error is pasted below.

Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:52)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:81)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
at XParsers.main(XParsers.java:102)
Caused by: java.lang.ClassNotFoundException: org.dom4j.DocumentException
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 6 more

shresthaMubin, November 25, 2013 at 5:27 PM
@Kundan Dhande

First create a package "com.blogspot.computergodzilla.xparser" and create an empty Java class file. Copy the code above into the class file. This should solve your problem.

If you want to use a different package name, change the line "package com.blogspot.computergodzilla.xparser;" so that it names your own package.

shresthaMubin, November 26, 2013 at 4:46 PM
Do you have all the libraries needed for it? I am attaching a snapshot of the libraries that you will need:
[Inline image 1]
If this still doesn't solve your problem, create a new project in Eclipse or NetBeans and add all the libraries needed.
Create a new empty Java file and copy the code into it. Name the package according to your own package name. This should solve your problem.

Unknown, November 28, 2013 at 1:06 PM
Hi, I am unable to see the image. Can you please mail me the same image at dhkundan@gmail.com? It is very important for me to understand this program.

Unknown, November 28, 2013 at 4:44 PM
Sir, got it. I have included all the libraries and the program is running now. Thanks.

Unknown, December 4, 2013 at 2:01 PM
Why is there an out-of-memory error when trying to parse a docx file of 10 MB? It occurs at the line XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);

shresthaMubin, December 19, 2013 at 10:37 AM
@Kundan,

Please refer to this:
http://stackoverflow.com/questions/18147585/excel-sheet-poi-validation-out-of-memory-error

Unknown, March 12, 2014 at 12:28 PM
Can you please mail me the same image of the libraries needed to run this program at pravin.s731@gmail.com?

Unknown, August 17, 2015 at 3:42 PM
Sir, can you also mail me the image file so that I can see which jars are missing from my lib folder?

Unknown, June 2, 2016 at 10:22 PM
Can you please give me the link to the PDF parser file which I saw on YouTube?