Sunday, December 30, 2012

Apache Lucene--How to parse texts from PDF files?


In my previous post I showed you how to parse text from MS Word files. In this post I will show you how to parse text from PDF files.
To parse PDF files I am using the PDFBox library, a free, open source tool. There is another library called Aspire PDF, but it is a commercial product, and its free version only parses about 1000 words per page or something like that. Aspire can also be used for OCR. I will not go into the details of these libraries here.
Here is sample code to parse text from PDF files.
package com.blogspot.computergodzilla.parsers;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

/**
 * This class parses a PDF file,
 * i.e. it returns the text contained in the PDF file.
 * @author Mubin Shrestha
 */
public class PdfFileParser {
   /**
    * This method returns the pdf content in text form.
    * @param pdffilePath : pdf file path of which you want to parse text
    * @return : texts from pdf file
    * @throws FileNotFoundException
    * @throws IOException 
    */
    public String PdfFileParser(String pdffilePath) throws FileNotFoundException, IOException
    {
        String content;
        FileInputStream fi = new FileInputStream(new File(pdffilePath));
        try {
            PDFParser parser = new PDFParser(fi);
            parser.parse();
            COSDocument cd = parser.getDocument();
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                content = stripper.getText(new PDDocument(cd));
            } finally {
                cd.close(); // release the parsed document even if text extraction fails
            }
        } finally {
            fi.close(); // always release the file handle
        }
        return content;
    }
    
    /**
     * Main method.
     * @param args
     * @throws FileNotFoundException
     * @throws IOException 
     */
    public static void main(String args[]) throws FileNotFoundException, IOException
    {
        String filepath = "H:/lab.pdf";
        System.out.println(new PdfFileParser().PdfFileParser(filepath));    
    }
}
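
One thing to watch for: if the PDF has no text in it (for example, it only contains scanned page images), the parser above will typically hand back an empty or whitespace-only string, and the program prints nothing. Here is a minimal sketch of a check you could run on the returned text before printing or indexing it. The class `ExtractedTextCheck` and its methods are hypothetical names I made up for illustration; they are not part of PDFBox, and the notice message is just an example.

```java
// Hypothetical helper, not part of PDFBox: checks whether the text returned
// by a parser is actually usable before printing or indexing it.
final class ExtractedTextCheck {

    private ExtractedTextCheck() { }

    /** Returns true if the text has at least one non-whitespace character. */
    static boolean hasUsableText(String extractedText) {
        return extractedText != null && !extractedText.trim().isEmpty();
    }

    /** Returns the text unchanged, or a short notice when nothing was extracted. */
    static String describe(String extractedText) {
        if (hasUsableText(extractedText)) {
            return extractedText;
        }
        // Assumption: empty output most likely means the PDF had no text to extract.
        return "No text extracted; the PDF may contain only scanned images.";
    }

    public static void main(String[] args) {
        System.out.println(describe("Hello from a text PDF"));
        System.out.println(describe(" \n\t "));
    }
}
```

In the main method of the parser class you would wrap the parsed string, e.g. pass the result of the parse call to `ExtractedTextCheck.describe(...)` before printing it.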


This is it. Enough with parsing and blah blah. In my next post I will show you how to index these parsed files. 

3 comments:

  1. Hi, when I run this, The line with
    public String PdfFileParser(String pdffilePath) throws FileNotFoundException, IOException

    has a warning which says this method has a constructor name.
    And the code runs and exits, but the console just says

    PdfFileParser [java application] C:\program files\java\jdk1.7.0_03\bin\javaw.exe

    and shows no text..

    I am trying to recognize text from a scanned pre processed pdf image

    1. Hello John,

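      Apache PDFBox is not able to process scanned images. It only extracts text that is actually stored in the PDF, and a scanned page is just a picture, so you get no text back. You will not be able to achieve what you are trying to do with Apache PDFBox.

      You will need to look for an alternative, such as an OCR tool.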
