When I first started this blog and wrote "Parse text from word files", the Apache POI package could not parse text from .docx, .xlsx and .pptx files. It can now extract text from all three formats.
Here is the code to parse text from those files using Apache POI 3.9.
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package com.blogspot.computergodzilla.xparser;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.SAXException;

/**
 * This class parses .docx, .pptx and .xlsx files.
 *
 * @author Mubin Shrestha
 */
public class XParsers {

    /**
     * This method parses .docx files and prints their text.
     *
     * @param docx the opened .docx package
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public void DocFileContentParser(OPCPackage docx)
            throws FileNotFoundException, IOException, XmlException,
            InvalidFormatException, OpenXML4JException,
            ParserConfigurationException, SAXException {
        XWPFWordExtractor xw = new XWPFWordExtractor(docx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses .pptx files and prints their text.
     *
     * @param pptx the opened .pptx package
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void ppFileContentParser(OPCPackage pptx)
            throws FileNotFoundException, IOException, InvalidFormatException,
            XmlException, OpenXML4JException {
        XSLFPowerPointExtractor xw = new XSLFPowerPointExtractor(pptx);
        System.out.println(xw.getText());
    }

    /**
     * This method parses .xlsx files and prints their text.
     *
     * @param xlsx the opened .xlsx package
     * @throws FileNotFoundException
     * @throws IOException
     * @throws InvalidFormatException
     * @throws XmlException
     * @throws OpenXML4JException
     */
    public void excelContentParser(OPCPackage xlsx)
            throws FileNotFoundException, IOException, InvalidFormatException,
            XmlException, OpenXML4JException {
        XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
        System.out.println(xe.getText());
    }

    /**
     * main method
     *
     * @param args
     * @throws FileNotFoundException
     * @throws IOException
     * @throws XmlException
     * @throws InvalidFormatException
     * @throws OpenXML4JException
     * @throws ParserConfigurationException
     * @throws SAXException
     */
    public static void main(String args[])
            throws FileNotFoundException, IOException, XmlException,
            InvalidFormatException, OpenXML4JException,
            ParserConfigurationException, SAXException {
        // Path of the file whose text you want to parse.
        File file = new File("fileName");
        FileInputStream fs = new FileInputStream(file);
        try {
            OPCPackage d = OPCPackage.open(fs);
            XParsers xp = new XParsers();
            // Pick the extractor that matches the file extension.
            if (file.getName().endsWith(".docx")) {
                xp.DocFileContentParser(d);
            } else if (file.getName().endsWith(".xlsx")) {
                xp.excelContentParser(d);
            } else if (file.getName().endsWith(".pptx")) {
                xp.ppFileContentParser(d);
            }
        } finally {
            // Release the file handle even if parsing fails.
            fs.close();
        }
    }
}
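The main method above chooses an extractor with endsWith, which is case-sensitive, so a file named REPORT.DOCX would match none of the branches and print nothing. Below is a minimal sketch of a case-insensitive lookup that uses only the standard library. The names ExtensionDispatch, OfficeFormat and detect are hypothetical, introduced here for illustration; they are not part of the original code or of Apache POI.

```java
import java.util.Locale;

public class ExtensionDispatch {

    /** Hypothetical labels for the three formats handled by XParsers. */
    public enum OfficeFormat { DOCX, XLSX, PPTX, UNKNOWN }

    /** Maps a file name to a format by its extension, ignoring case. */
    public static OfficeFormat detect(String fileName) {
        // Lower-case with a fixed locale so the result does not depend
        // on the default locale of the machine.
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".docx")) {
            return OfficeFormat.DOCX;
        } else if (name.endsWith(".xlsx")) {
            return OfficeFormat.XLSX;
        } else if (name.endsWith(".pptx")) {
            return OfficeFormat.PPTX;
        }
        return OfficeFormat.UNKNOWN;
    }

    public static void main(String[] args) {
        // "fileName" is the same placeholder used in the listing above.
        String fileName = args.length > 0 ? args[0] : "fileName";
        System.out.println(fileName + " -> " + detect(fileName));
    }
}
```

Returning UNKNOWN instead of silently doing nothing lets the caller report an unsupported file rather than exit without output.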
There is also a good set of Java APIs (JOffice, JWord, JSpreadsheet, JPresentation, JODF) to process documents, spreadsheets and presentations.
http://www.independentsoft.de
I was trying to execute the above program,
but it shows a runtime error.
Please tell me how to remove this error.
The error is pasted below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:52)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:81)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
at XParsers.main(XParsers.java:102)
Caused by: java.lang.ClassNotFoundException: org.dom4j.DocumentException
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 6 more
@Kundan Dhande
First create a package "com.blogspot.computergodzilla.xparser" and create an empty Java class file. Copy the above code into the class file. This should solve your problem.
If you want to use a different package name, then update the line "package com.blogspot.computergodzilla.xparser;" to "package <your package name>;".
Yes sir, I commented out that line.
I tried all of the programs on this blog; only this program is not running.
It's not a compilation error,
it is a runtime error.
I am still not getting a solution to remove this error.
Do you have all the libraries needed for it? I am attaching here a snapshot of the libraries that you will need:
[Inline image 1]
If this still doesn't solve your problem, then create a new project in Eclipse or NetBeans and add all the libraries needed.
Create a new empty Java file and copy the code into that file. Name the package according to your package name. This should solve your problem.
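The NoClassDefFoundError in the stack trace above names org.dom4j.DocumentException, which means the jar that provides that class was not on the runtime classpath. As a rough way to find out which libraries are still missing, the sketch below tries to load a few classes by name. The class names come from the imports and the stack trace in this post; ClasspathCheck and missingClasses are hypothetical names used only for illustration, and the list is not a complete inventory of what your POI version needs.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns those of the given class names that cannot be loaded. */
    public static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<String>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run
                // its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                missing.add(name);
            } catch (LinkageError e) {
                // The class was found but something it depends on was not.
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
                "org.apache.poi.openxml4j.opc.OPCPackage",
                "org.apache.poi.xwpf.extractor.XWPFWordExtractor",
                "org.apache.xmlbeans.XmlException",
                "org.dom4j.DocumentException");
        if (missing.isEmpty()) {
            System.out.println("All checked classes were found.");
        } else {
            System.out.println("Not on the classpath: " + missing);
        }
    }
}
```

Run it with the same classpath you use for XParsers; any name it prints points at a jar that still has to be added to the project.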
Hi, I am unable to see the image.
Can you please mail me the same image at dhkundan@gmail.com?
Please, if possible.
It is very important for me to understand this program.
Excuse me, could you send the image with the libraries to my email? carrier_6@hotmail.com Thanks
Sir, got it.
I have included all the libraries and the program is running now.
Thanks.
Why is there an out-of-memory error while trying to parse a docx file of 10 MB?
It occurs at the line XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
@Kundan,
Please refer here:
http://stackoverflow.com/questions/18147585/excel-sheet-poi-validation-out-of-memory-error
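An OutOfMemoryError means the JVM heap filled up while the document was being loaded. Office Open XML files are compressed, so the in-memory representation of a 10 MB file can be considerably larger than the file on disk. The sketch below only prints the heap limits of the running JVM, so you can judge whether raising the limit with the standard -Xmx option is worth trying. HeapCheck and toMiB are hypothetical names, and this is not taken from the linked answer.

```java
public class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Upper bound the JVM will try to use for the heap.
        System.out.println("Max heap (MiB):       " + toMiB(rt.maxMemory()));
        // Heap currently reserved from the operating system.
        System.out.println("Allocated heap (MiB): " + toMiB(rt.totalMemory()));
        // Unused part of the currently reserved heap.
        System.out.println("Free heap (MiB):      " + toMiB(rt.freeMemory()));
    }
}
```

If the maximum is small, starting the parser with a larger heap, for example `java -Xmx1g ...`, is one thing to try; the value 1g is only an example.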
Can you please mail me the same image of the libraries needed to run this program at pravin.s731@gmail.com?
Sir, can you also mail me the image file so that I can see which jars are missing in my lib folder?
If you got that image, would you please send it to me also?
Hello gurjot, please check the edit. Also, please use the app, rate and review. Thank you.
Can you please give me the link for the PDF parser file which I saw on YouTube?