Sunday, December 30, 2012

Apache Lucene--How to index .doc and .pdf files?

In my previous post I showed you how to index text files. Some of you may be thinking, "Why the heck did this guy index only text files? Why not .doc and .pdf files?" This post is for those of you wondering how to do it, and why I didn't. I will cover indexing .doc and .pdf files in a series of three posts. Before we get to the real topic, let me make one thing clear: Apache Lucene can only index text, so we first have to extract the text from file formats it cannot read directly (.doc, .pdf, .xls, etc.). In the first post we will parse text from .doc files, in the second we will parse .pdf files, and in the third we will finally index both .doc and .pdf files. It's going to be a long one, so be ready!
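To make the "extract the text first, then index it" idea a bit more concrete, here is a minimal Java sketch. It is not Lucene code, and it is not the parsing approach of the upcoming posts: `TextExtractorRegistry`, `register`, `extensionOf`, and `extractText` are hypothetical names made up for illustration, and the real .doc and .pdf parsers would be plugged in wherever the sketch accepts a function. It also assumes plain text files are UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not part of Lucene): picks a text extractor by file extension.
class TextExtractorRegistry {
    // Maps a lower-case extension ("txt", "doc", "pdf") to a function
    // that turns the raw file bytes into plain text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    TextExtractorRegistry() {
        // Plain text needs no parsing: just decode the bytes (UTF-8 is an assumption).
        register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
    }

    // Registers an extractor for one extension, e.g. a .doc or .pdf parser.
    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the lower-case extension without the dot, or "" if there is none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Extracts plain text from a file's bytes; fails loudly for unknown formats.
    String extractText(String fileName, byte[] content) {
        Function<byte[], String> extractor = extractors.get(extensionOf(fileName));
        if (extractor == null) {
            throw new IllegalArgumentException("No text extractor registered for: " + fileName);
        }
        return extractor.apply(content);
    }
}
```

The string that comes back from `extractText` is the part you would hand to Lucene as the text of a field. Lucene itself never sees the original .doc or .pdf bytes.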
