Thursday, May 16, 2013

Apache POI : Parse text from .docx, .pptx, .xlsx file using Apache POI 3.9


When I first created this blog and wrote the post "Parse text from word files", Apache POI was not able to parse text from .docx, .xlsx and .pptx files. Now it can extract text from all three formats.

Here's the code to parse text from those files using Apache POI 3.9.


package com.blogspot.computergodzilla.xparser;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.SAXException;

/**
 * This class parses .docx, .pptx and .xlsx files.
 *
 * @author Mubin Shrestha
 */
public class XParsers {

    /**
     * This method parses the .docx files.
     *
     * @param docx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public void DocFileContentParser(OPCPackage docx) throws FileNotFoundException,
            IOException,
            XmlException,
            InvalidFormatException,
            OpenXML4JException,
            ParserConfigurationException,
            SAXException {
        XWPFWordExtractor xw = new XWPFWordExtractor(docx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses .pptx files.
     *
     * @param pptx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void ppFileContentParser(OPCPackage pptx) throws FileNotFoundException,
            IOException,
            InvalidFormatException,
            XmlException,
            OpenXML4JException {
        XSLFPowerPointExtractor xw = new XSLFPowerPointExtractor(pptx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses .xlsx files.
     *
     * @param xlsx
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void excelContentParser(OPCPackage xlsx) throws FileNotFoundException,
            IOException,
            InvalidFormatException,
            XmlException,
            OpenXML4JException {
        XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
        System.out.println(xe.getText());
    }

    /**
     * main method
     *
     * @param args
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public static void main(String args[]) throws FileNotFoundException,
            IOException,
            XmlException,
            InvalidFormatException,
            OpenXML4JException,
            ParserConfigurationException,
            SAXException {
        // Name of the .docx, .xlsx or .pptx file to extract text from.
        File file = new File("fileName");
        FileInputStream fs = new FileInputStream(file);
        try {
            OPCPackage d = OPCPackage.open(fs);
            XParsers xp = new XParsers();
            // Pick the extractor that matches the file extension.
            if (file.getName().endsWith(".docx")) {
                xp.DocFileContentParser(d);
            } else if (file.getName().endsWith(".xlsx")) {
                xp.excelContentParser(d);
            } else if (file.getName().endsWith(".pptx")) {
                xp.ppFileContentParser(d);
            }
        } finally {
            fs.close(); // release the file handle even if parsing fails
        }
    }
}
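
One thing to watch in main: the extension check is case-sensitive, and a file with any other extension is skipped silently. Below is a minimal sketch of how that check could be pulled into a helper. The class FormatDetector and the enum Format are hypothetical names used only for illustration. They are not part of Apache POI or of the code above.

```java
import java.util.Locale;

public class FormatDetector {

    /** Formats handled by the three extractors; UNSUPPORTED covers everything else. */
    public enum Format { DOCX, XLSX, PPTX, UNSUPPORTED }

    /** Maps a file name to a format by its extension, ignoring upper/lower case. */
    public static Format detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".docx")) {
            return Format.DOCX;
        } else if (name.endsWith(".xlsx")) {
            return Format.XLSX;
        } else if (name.endsWith(".pptx")) {
            return Format.PPTX;
        }
        return Format.UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(detect("Report.DOCX")); // prints DOCX
        System.out.println(detect("notes.txt"));   // prints UNSUPPORTED
    }
}
```

With a helper like this, main could switch on the returned value and print a message for UNSUPPORTED instead of doing nothing.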


Edit:

Well, I must be a very bad blogger. Here is a picture of the list of libraries needed:
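
If the picture does not load, you can also check for a missing library by asking the JVM whether a few key classes can be found. This is only a rough sketch. ClasspathCheck is a hypothetical helper name, and the class names it probes are taken from the imports in the code above and from the stack trace reported in the comments. As far as I know, in POI 3.9 they come from the poi-ooxml, xmlbeans and dom4j jars, but treat that mapping as an assumption and verify it against your own lib folder.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns those of the given class names that cannot be loaded from the classpath. */
    public static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<String>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize = false: only locate the class, do not run its static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                notFound.add(name);
            } catch (LinkageError e) {
                // the class was found but one of its own dependencies is missing
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing(
                "org.apache.poi.openxml4j.opc.OPCPackage",
                "org.apache.poi.xwpf.extractor.XWPFWordExtractor",
                "org.apache.xmlbeans.XmlException",
                "org.dom4j.DocumentException");
        if (notFound.isEmpty()) {
            System.out.println("All probed classes were found.");
        } else {
            System.out.println("Missing from classpath: " + notFound);
        }
    }
}
```

Run it with the same classpath as XParsers. Any class it lists points to a jar that still has to be added.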


15 comments:

  1. There is also a good set of Java APIs (JOffice, JWord, JSpreadsheet, JPresentation, JODF) to process documents, spreadsheets and presentations.

    http://www.independentsoft.de

  2. I was trying to run the above program,
    but it shows a runtime error.
    Please tell me how to fix this error.
    The error is pasted below.


    Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
    at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
    at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
    at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:52)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:81)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
    at XParsers.main(XParsers.java:102)
    Caused by: java.lang.ClassNotFoundException: org.dom4j.DocumentException
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 6 more

  3. @Kundan Dhande

    First create a package "com.blogspot.computergodzilla.xparser" and create an empty Java class file in it. Copy the above code into that class file. This should solve your problem.

    If you want to use a different package name, change the line "package com.blogspot.computergodzilla.xparser;" so that it names your own package.

    Replies
    1. Yes sir, I commented out that line.
      I tried all of the programs on this blog; only this program is not running.
      It is not a compilation error,
      it is a runtime error.
      I still have not found a way to remove this error.

  4. Do you have all the libraries needed for it? I am attaching a snapshot of the libraries that you will need:
    [Image: snapshot of the required libraries]
    If this still doesn't solve your problem, create a new project in Eclipse or NetBeans and add all the required libraries to it.
    Then create a new empty Java file and copy the code into it. Name the package to match your own package name. This should solve your problem.

  5. Hi, I am unable to see the image.
    Can you please mail me the same image at dhkundan@gmail.com?
    Please, if possible.
    It is very important for me to understand this program.

    Replies
    1. Excuse me, could you send the image with the libraries to my email? carrier_6@hotmail.com Thanks

  6. Sir, got it.
    I have included all the libraries and the program is running now.
    Thanks.

  7. Why is there an out of memory error while trying to parse a docx file of 10 MB?
    It happens at the line XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);

  8. @Kundan,

    Please refer to:
    http://stackoverflow.com/questions/18147585/excel-sheet-poi-validation-out-of-memory-error

  9. Can you please mail me the same image of the libraries needed to run this program at pravin.s731@gmail.com?

  10. Sir, can you also mail me the image file so that I can see which jars are missing from my lib folder?

    Replies
    1. If you got that image, would you please send it to me also?

    2. Hello gurjot please check the edit. Also, please use the app, rate and review. Thank you.

  11. Can you please give me the link for PDF parser file which i saw on youtube
