Monday, December 31, 2012

Apache Lucene -- How to index .doc , .pdf and .text file?

Want to follow news you care about?
Don't want to miss any action from the Premier League, the Spanish League and other leagues?
Want to make an app with your own layout?

Check out NTyles.

Get it on....

NTyles-App
In my previous two posts I showed you how to parse text from .doc and .pdf files. If you have tried the source code and the parsing steps from those posts, you are ready to index .doc and .pdf files. If you have not, don't worry: here I simply parse the text out of .doc and .pdf files and then index it, and you are always welcome to visit the earlier posts first. Indexing .doc, .pdf and .txt files is very helpful in document management, clustering, classification, searching and plenty of other tasks. Now, assuming you have your parsers set up, let's dive into the code. I am modifying our previous Indexer.java. The only new method I have added is:
 public void StartIndex(File file) throws FileNotFoundException, CorruptIndexException, IOException 
I have also introduced new variable :
     private String fileContent;
This variable stores all the text returned after parsing .doc and .pdf files. There is also a small edit in:
     private void checkFileValidity();
As :
&& file.isFile()) {
                    if (file.getName().endsWith(".txt")) {
                        indexTextFiles(file); // a .txt file needs no parsing
                        System.out.println("INDEXED FILE " + file.getAbsolutePath() + " :-) ");
                    } else if (file.getName().endsWith(".doc") || file.getName().endsWith(".pdf")) {
                        // separate method for indexing .doc and .pdf files
                        StartIndex(file);
                    }
All these modifications were needed only to call the parser methods for .doc and .pdf files, get the text from them, and finally index it along with the file's full path and name. You can download the whole project from here, or copy and paste the code and try to make it run. The Indexer.java program is shown below:
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

//Indexer.java
package com.blogspot.computergodzilla.index;

/* The following code snippet uses APACHE LUCENE 3.4.0.
   The parsers use Apache POI for .doc parsing and
   PDFBox for .pdf parsing.
*/
import com.blogspot.computergodzilla.parsers.DocFileParser;
import com.blogspot.computergodzilla.parsers.PdfFileParser;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * This class builds an index of the source location specified into the
 * destination specified. It indexes .txt, .doc and .pdf files.
 *
 * @author Mubin Shrestha
 */
public class Indexer {

    private final String sourceFilePath = "H:/FolderToIndex";    //give the location of the source files location here
    private final String indexFilePath = "H:/DemoIndex";   //give the location where you guys want to create index
    private IndexWriter writer = null;
    private File indexDirectory = null;
    private String fileContent;  //temporary storer of all the text parsed from doc and pdf 

    /**
     *
     * @throws FileNotFoundException
     * @throws CorruptIndexException
     * @throws IOException
     */
    private Indexer() throws FileNotFoundException, CorruptIndexException, IOException {
        try {
            long start = System.currentTimeMillis();
            createIndexWriter();
            checkFileValidity();
            closeIndexWriter();
            long end = System.currentTimeMillis();
            System.out.println("Total Document Indexed : " + TotalDocumentsIndexed());
            System.out.println("Total time: " + (end - start) / 1000 + " s");
        } catch (Exception e) {
            System.out.println("Sorry task cannot be completed");
        }
    }

    /**
     * Creates the IndexWriter, which writes the data to the index.
     * A StandardAnalyzer is used; it filters out English stop words.
     */
    private void createIndexWriter() {
        try {
            indexDirectory = new File(indexFilePath);
            if (!indexDirectory.exists()) {
                indexDirectory.mkdir();
            }
            FSDirectory dir = FSDirectory.open(indexDirectory);
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
            writer = new IndexWriter(dir, config);
        } catch (Exception ex) {
            System.out.println("Sorry cannot get the index writer");
        }
    }

    /**
     * This function checks whether each file in the source folder is valid for indexing.
     */
    private void checkFileValidity() {
        File[] filesToIndex = new File(sourceFilePath).listFiles();
        for (File file : filesToIndex) {
            try {
                //check whether the file is readable or not
                if (!file.isDirectory()
                        && !file.isHidden()
                        && file.exists()
                        && file.canRead()
                        && file.length() > 0.0
                        && file.isFile() ) {
                    if (file.getName().endsWith(".txt")) {
                        indexTextFiles(file); // a .txt file needs no parsing
                        System.out.println("INDEXED FILE " + file.getAbsolutePath() + " :-) ");
                    } else if (file.getName().endsWith(".doc") || file.getName().endsWith(".pdf")) {
                        // separate method for indexing .doc and .pdf files
                        StartIndex(file);
                    }
                }
            } catch (Exception e) {
                System.out.println("Sorry cannot index " + file.getAbsolutePath());
            }
        }
    }
    
    
    /**
     * This method indexes .pdf and .doc files.
     * The text parsed from them is indexed along with the file name and file path.
     * @param file : the file which you want to index
     * @throws FileNotFoundException
     * @throws CorruptIndexException
     * @throws IOException 
     */
    public void StartIndex(File file) throws FileNotFoundException, CorruptIndexException, IOException {
        fileContent = null;
        try {
            Document doc = new Document();
            if (file.getName().endsWith(".doc")) {
                //call the doc file parser and get the content of doc file in txt format
                fileContent = new DocFileParser().DocFileContentParser(file.getAbsolutePath());
            }
            if (file.getName().endsWith(".pdf")) {
                //call the pdf file parser and get the content of pdf file in txt format
                fileContent = new PdfFileParser().PdfFileParser(file.getAbsolutePath());
            }
            doc.add(new Field("content", fileContent,
                    Field.Store.YES, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("filename", file.getName(),
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("fullpath", file.getAbsolutePath(),
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            System.out.println("Indexed " + file.getAbsolutePath());
        } catch (Exception e) {
            System.out.println("Error indexing " + file.getAbsolutePath());
        }
    }

    /**
     * This method indexes text files.
     * @param file
     * @throws CorruptIndexException
     * @throws IOException
     */
    private void indexTextFiles(File file) throws CorruptIndexException, IOException {
        Document doc = new Document();
        doc.add(new Field("content", new FileReader(file)));
        doc.add(new Field("filename", file.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fullpath", file.getAbsolutePath(),
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    /**
     * This method returns the total number of documents indexed.
     * @return total number of documents indexed.
     */
    private int TotalDocumentsIndexed() {
        try {
            IndexReader reader = IndexReader.open(FSDirectory.open(indexDirectory));
            return reader.maxDoc();
        } catch (Exception ex) {
            System.out.println("Sorry no index found");
        }
        return 0;
    }

    /**
     *  closes the IndexWriter
     */
    private void closeIndexWriter() {
        try {
            writer.optimize();
            writer.close();
        } catch (Exception e) {
            System.out.println("Indexer Cannot be closed");
        }
    }

   /**
     *  Main method.
     */

    public static void main(String arg[]) {
        try {
            new Indexer();
        } catch (Exception ex) {
            System.out.println("Cannot Start :(");
        }
    }
}
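The extension dispatch inside checkFileValidity() is easy to pull out and test on its own. Here is a minimal, self-contained sketch of that logic (the class and method names are mine, not part of the original code):

```java
public class FileKindDemo {

    // Classify a file name the same way checkFileValidity() does:
    // .txt files are indexed directly, .doc/.pdf files go through a parser,
    // everything else is skipped.
    static String fileKind(String name) {
        if (name.endsWith(".txt")) {
            return "index-directly";
        } else if (name.endsWith(".doc") || name.endsWith(".pdf")) {
            return "parse-then-index";
        }
        return "skip";
    }

    public static void main(String[] args) {
        System.out.println(fileKind("notes.txt"));   // index-directly
        System.out.println(fileKind("report.doc"));  // parse-then-index
        System.out.println(fileKind("paper.pdf"));   // parse-then-index
        System.out.println(fileKind("image.png"));   // skip
    }
}
```

Keeping the decision in one small method like this makes it trivial to add .docx or .pptx support later: one more branch, one more parser call.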

You all can download the whole project from here.
Apache POI can now parse text from .docx, .xlsx and .pptx files, so you can index those files too. Parse text from .docx, .xlsx and .pptx shows how to parse text from those formats. If you want to index them, simply write a function that returns their text and index it, using the code sample given above. If you need any help indexing .docx, .xlsx and .pptx files, please don't hesitate to reach out to me.

Sunday, December 30, 2012

Apache Lucene--How to parse texts from PDF files?





In my previous post I showed you how to parse text from MS Word files. In this post I will show you how to parse text from PDF files.
For parsing PDF files I am using the PDFBox library, a free open-source tool. There is another library called Aspire PDF, but it is a commercial product; the free version only parses a limited number of words per page. Aspire can also be used for OCR. I won't go into detail about these libraries here.
Here is sample code to parse text from PDF files:
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package com.blogspot.computergodzilla.parsers;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

/**
 * This class parses a PDF file, i.e. it returns the text from the PDF file.
 * @author Mubin Shrestha
 */
public class PdfFileParser {
   /**
    * This method returns the pdf content in text form.
    * @param pdffilePath : pdf file path of which you want to parse text
    * @return : texts from pdf file
    * @throws FileNotFoundException
    * @throws IOException 
    */
    public String PdfFileParser(String pdffilePath) throws FileNotFoundException, IOException
    {
        String content;
        FileInputStream fi = new FileInputStream(new File(pdffilePath));
        PDFParser parser = new PDFParser(fi);
        parser.parse();
        COSDocument cd = parser.getDocument();
        PDFTextStripper stripper = new PDFTextStripper();
        content = stripper.getText(new PDDocument(cd));
        cd.close();
        fi.close(); //release the file handle
        return content;
    }
    
    /**
     * Main method.
     * @param args
     * @throws FileNotFoundException
     * @throws IOException 
     */
    public static void main(String args[]) throws FileNotFoundException, IOException
    {
        String filepath = "H:/lab.pdf";
        System.out.println(new PdfFileParser().PdfFileParser(filepath));    
    }
}


This is it. Enough with parsing and blah blah. In my next post I will show you how to index these parsed files. 

Apache Lucene- How to parse texts from .doc, .xls and .ppt files?

In my previous post I said we have to parse text from .doc and other files to make them usable for indexing in a Lucene index. In this post I will show you how to parse text from .doc, .xls and .ppt files. Lucene has no built-in mechanism for parsing text from .doc files. If you had tried doing this:
public class Indexer {

    private final String sourceFilePath = "H:/FolderToIndex/abc.doc"; 

and had tried to run it, you must have gone crazy seeing the result: .doc is a binary format, so feeding it straight to Lucene produces garbage. That's why I am using a third-party library to parse text from those files. For this I am using the Apache POI library, which you can download from here. This is a wonderful free open-source library that can parse text from the following file extensions and even more:
  • .doc
  • .xls
  • .ppt
  • .docx
  • .pptx
  • .xlsx
  • and many more..
Here is the sample code, which of course you can download from here.
//DocFileParser.java
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package com.blogspot.computergodzilla.parsers;

import java.io.FileInputStream;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

/**
 * This class parses the older Microsoft Office formats (.doc, .xls, .ppt).
 * It does not handle the newer formats (.docx, .xlsx, .pptx).
 *
 * @author Mubin Shrestha
 */
public class DocFileParser {
    
   /**
    * This method parses the content of the .doc file.
    * i.e. this method will return all the text of the file passed to it.
    * @param fileName : the file whose content you want.
    * @return : returns the content of the file
    */
    public String DocFileContentParser(String fileName) {
        try {
            //open the file once; all three extractors read from the same POIFSFileSystem
            POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(fileName));

            if (fileName.endsWith(".xls")) { //if the file is an Excel file
                ExcelExtractor ex = new ExcelExtractor(fs);
                return ex.getText(); //returns the text of the Excel file
            } else if (fileName.endsWith(".ppt")) { //if the file is a PowerPoint file
                PowerPointExtractor extractor = new PowerPointExtractor(fs);
                return extractor.getText(); //returns the text of the PowerPoint file
            }

            //otherwise treat it as a .doc file
            HWPFDocument doc = new HWPFDocument(fs);
            WordExtractor we = new WordExtractor(doc);
            return we.getText();
        } catch (Exception e) {
            System.out.println("Document file can't be parsed: " + fileName);
        }
        return "";
    }

    /**
     * Main method.
     * @param args 
     */
    public static void main(String args[])
    {
        String filepath = "H:/Filtering.ppt";
        System.out.println(new DocFileParser().DocFileContentParser(filepath));
        
    }
}
Here is a sample video demonstrating the parsing from above code:



The post Parse text from .docx, .pptx and .xlsx files shows how to parse text from these newer formats.

If you know of other third-party open-source tools, please feel free to share. In my next post I will show how to parse text from PDF files.

Apache Lucene--How to index .doc and .pdf files?



In my previous blog I showed you how to index text files. Some of you may be thinking, "Why does this guy index only text files? Why not .doc and .pdf files?". This post is dedicated to those who are wondering how to do that. The overall mechanism for indexing .doc and .pdf files will be presented as a three-post series. Before we move on to the real topic, let me make clear that Apache Lucene can only index text, so we first have to parse text out of unsupported files (.doc, .pdf, .xls, etc.). In the first post we will parse text from .doc files, in the second we will parse .pdf files, and in the third we will index both .doc and .pdf files. It's going to be long, so be ready!

Tuesday, December 25, 2012

How to search in a Lucene Index




In my previous blog I showed you how to index text files using Apache Lucene. In this post let me show you how to search the index we made.
Lots of students and newbies who try Lucene go straight to indexing some document and then want to search that index. For example, you create a text document named "Gangnam Style", fill it with the lyrics, and surely some of you will want to search for the term "Ooppaaaaaaaa". So to make sure you don't get lost, here is a nice little code sample for searching a Lucene index.

The code sample shown here uses Apache Lucene 3.4.0, which you can download from here. First make sure you have read my previous blog on how to build a Lucene index.

So now, without further ado, here is the code:

//Searcher.java

package com.blogspot.computergodzilla;

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 *
 * @author Mubin Shrestha
 */
public class Searcher
{
 private static final String FIELD_CONTENTS = "content";   //field names used when the index was built
 private static final String FIELD_FILENAME = "filename";

 /**
  * @param instring : the query string to search for.
  */
 public void searchIndex(String instring) throws IOException, ParseException
 {
  System.out.println("Searching for '" + instring + "'");
  IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("INDEX_DIRECTORY")));
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
  QueryParser queryParser = new QueryParser(Version.LUCENE_34, FIELD_CONTENTS, analyzer);
  QueryParser queryParserFilename = new QueryParser(Version.LUCENE_34, FIELD_FILENAME, analyzer);
  Query query = queryParser.parse(instring);
  Query queryFilename = queryParserFilename.parse(instring);

  TopDocs hits = searcher.search(query, 100);
  ScoreDoc[] documents = hits.scoreDocs;
  System.out.println("Total no of hits for content: " + hits.totalHits);
  for (int i = 0; i < documents.length; i++)
  {
   Document doc = searcher.doc(documents[i].doc);
   System.out.println(doc.get("fullpath"));
  }

  TopDocs hitsFilename = searcher.search(queryFilename, 100);
  ScoreDoc[] documentsFilename = hitsFilename.scoreDocs;
  System.out.println("Total no of hits for file name: " + hitsFilename.totalHits);
  for (int i = 0; i < documentsFilename.length; i++)
  {
   Document doc = searcher.doc(documentsFilename[i].doc);
   System.out.println(doc.get("filename"));
  }
  searcher.close();
 }

 public static void main(String args[])
 {
  try {
   new Searcher().searchIndex("hello");
  } catch (Exception e) {
   System.out.println("Search failed: " + e.getMessage());
  }
 }
}
This code will only run if you have created the index with the fields I specified in my earlier post.

In my next blog I will show you how to index .pdf, .doc, .html, .xls, .ppt and other files.

Monday, December 24, 2012

Apache Lucene -- How to use Apache Lucene 3.4.0 to index text files in Java?






You can copy the code below, or download it from here.

//Indexer.java
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package com.blogspot.computergodzilla;


/*The following code snippet uses APACHE LUCENE 3.4.0*/
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * This class builds an index of the source location specified into the
 * destination specified. This class will only index .txt files.
 *
 * @author Mubin Shrestha
 */
public class Indexer {

    private final String sourceFilePath = "H:/FolderToIndex";    //give the location of the source files location here
    private final String indexFilePath = "H:/INDEXDIRECTORY";   //give the location where you guys want to create index
    private IndexWriter writer = null;
    private File indexDirectory = null;

    /**
     * Constructor
     * @throws FileNotFoundException
     * @throws CorruptIndexException
     * @throws IOException
     */
    private Indexer() throws FileNotFoundException, CorruptIndexException, IOException {
        try {
            long start = System.currentTimeMillis();
            createIndexWriter();
            checkFileValidity();
            closeIndexWriter();
            long end = System.currentTimeMillis();
            System.out.println("Total Document Indexed : " + TotalDocumentsIndexed());
            System.out.println("Total time: " + (end - start) / 1000 + " s");
        } catch (Exception e) {
            System.out.println("Sorry task cannot be completed");
        }
    }

    /**
     * Creates the IndexWriter, which writes the data to the index.
     * A StandardAnalyzer is used; it filters out English stop words.
     */
    private void createIndexWriter() {
        try {
            indexDirectory = new File(indexFilePath);
            if (!indexDirectory.exists()) {
                indexDirectory.mkdir();
            }
            FSDirectory dir = FSDirectory.open(indexDirectory);
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
            writer = new IndexWriter(dir, config);
        } catch (Exception ex) {
            System.out.println("Sorry cannot get the index writer");
        }
    }

    /**
     * Filters out the files that can be indexed.
     */
    private void checkFileValidity() {

        File[] filesToIndex = new File(sourceFilePath).listFiles();
        for (File file : filesToIndex) {
            try {
                //check whether the file is readable or not
                if (!file.isDirectory()
                        && !file.isHidden()
                        && file.exists()
                        && file.canRead()
                        && file.length() > 0.0
                        && file.isFile() && file.getName().endsWith(".txt")) {
                    System.out.println();
                    System.out.println("INDEXING FILE " + file.getAbsolutePath() + "......");
                    indexTextFiles(file);
                    System.out.println("INDEXED FILE " + file.getAbsolutePath() + " :-) ");
                }
            } catch (Exception e) {
                System.out.println("Sorry cannot index " + file.getAbsolutePath());
            }
        }
    }

    /**
     * writes file to index
     * @param file : file to index
     * @throws CorruptIndexException
     * @throws IOException
     */
    private void indexTextFiles(File file) throws CorruptIndexException, IOException {
        Document doc = new Document();
        doc.add(new Field("content", new FileReader(file)));
        doc.add(new Field("filename", file.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fullpath", file.getAbsolutePath(),
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    /**
     *
     * @return : total number of documents in the index
     */
    private int TotalDocumentsIndexed() {
        try {
            IndexReader reader = IndexReader.open(FSDirectory.open(indexDirectory));
            return reader.maxDoc();
        } catch (Exception ex) {
            System.out.println("Sorry no index found");
        }
        return 0;
    }

    /**
     * Closes the IndexWriter
     */
    private void closeIndexWriter() {
        try {
            writer.optimize();
            writer.close();
        } catch (Exception e) {
            System.out.println("Indexer Cannot be closed");
        }
    }
     
     /**
      * Main method
      */
    public static void main(String arg[]) {
        try {
            new Indexer();
        } catch (Exception ex) {
            System.out.println("Cannot Start :(");
        }
    }
}





Thursday, December 20, 2012

What is Cosine Similarity?

Get real-time news updates from your favorite websites.
Don't miss any news about your favorite topics.
Personalize your app.

Check out NTyles.


Get it on....

NTyles-App




This tutorial is for newbies who are trying to do something with similarity measures between two documents or sentences and have found nothing on the web.
Cosine Similarity measures the similarity between two sentences or documents as a value in the range [-1, 1] (for term-frequency vectors, which are non-negative, it lies in [0, 1]). That's all, that is Cosine Similarity. Let me clarify Cosine Similarity with an example.

Let's consider two sentences:
1. Xeon goes to marry Xeonian girl, a girl.

2. Leon goes to forest to find Xeon.

From the first sentence, calculating the terms and their respective frequencies :
Term      Frequency
Xeon      1
goes      1
to        1
marry     1
Xeonian   1
girl      2
a         1

If we do the same for the second sentence:

Term      Frequency
Leon      1
goes      1
to        2
forest    1
find      1
Xeon      1

In the tables above, the total number of terms in sentence 1 is 8 and in sentence 2 is 7.
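If you want to check these counts programmatically, a few lines of plain Java reproduce both tables (no Lucene involved; the class and method names are mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermFreqDemo {

    // Lower-case the sentence, strip punctuation, split on whitespace,
    // and count how often each term occurs (insertion order preserved).
    static Map<String, Integer> termFrequencies(String sentence) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String term : sentence.toLowerCase().replaceAll("[^a-z\\s]", "").split("\\s+")) {
            if (!term.isEmpty()) {
                freq.merge(term, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("Xeon goes to marry Xeonian girl, a girl."));
        // {xeon=1, goes=1, to=1, marry=1, xeonian=1, girl=2, a=1}
        System.out.println(termFrequencies("Leon goes to forest to find Xeon."));
        // {leon=1, goes=1, to=2, forest=1, find=1, xeon=1}
    }
}
```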

Now grab a coffee and take a break; when there's mathematics, you should always take a break. Ready for some mathematics? Here it goes. Recall vectors: let's suppose vector a = [2, 2] and vector b = [0, 1].
Then the cosine of the angle between a and b is:

cos(a, b) = (a · b) / (|a| × |b|)

i.e. for the example above it will be:

cos(a, b) = (2×0 + 2×1) / (√8 × √1) = 2 / 2.83 ≈ 0.71

That's all.

Now let's move on to our topic. Cosine Similarity !!

Now, assuming you all know what the dot product is, look at what I am doing with the terms in sentence 1 and sentence 2.

Term      Freq. in 1   Freq. in 2
Xeon      1            1
goes      1            1
to        1            2
marry     1            0
Xeonian   1            0
girl      2            0
Leon      0            1
forest    0            1
find      0            1


Then let: vec1 = [1,1,1,1,1,2,0,0,0] and vec2 = [1,1,2,0,0,0,1,1,1].
Therefore we finally get:

cosine similarity(vec1, vec2) = (vec1 · vec2) / (|vec1| × |vec2|) = 4 / (3 × 3) = 4/9 ≈ 0.444
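If you'd rather let the machine do the arithmetic, cosine similarity fits in a few lines of plain Java (the class name is mine):

```java
public class CosineDemo {

    // Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] vec1 = {1, 1, 1, 1, 1, 2, 0, 0, 0};
        double[] vec2 = {1, 1, 2, 0, 0, 0, 1, 1, 1};
        System.out.println(cosine(vec1, vec2)); // ≈ 0.444 (exactly 4/9)
    }
}
```

It also works for the warm-up vectors a = [2, 2] and b = [0, 1], giving about 0.707.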

Further readings:

Personally, in my projects I use Lucene. Lucene is very cool: you can treat the index made by Lucene as a database, do searches very fast, and it supports various queries. If you need code to calculate similarity using Lucene 3.x in Java, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY USING LUCENE 3.x.
And if you need code for calculating cosine similarity in Java using Lucene 4 or greater, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY USING LUCENE 4.X.

To calculate cosine similarity using tf-idf in Java without using Lucene, please refer to:
JAVA CODE FOR CALCULATING COSINE SIMILARITY, TF-IDF

Fire away if you need any help!!

How to add a tagcloud in your blog? --From Tagxedo




Most of you may be wondering, "How did this blogger embed the tag cloud in his blog?". Well, here's the solution: create your tag cloud on Tagxedo.

You will then be welcomed by the following screen:

[Screenshots of the Tagxedo steps were shown here.]

Then paste the code it gives you into your blog's HTML view (Template -> Edit HTML). What I did for mine is shown in the following figures:

[Screenshots of the template edit were shown here.]

Wednesday, December 19, 2012

Lil' about me.

There was a subject called 'Nepali' (yep, that's right, I am from Nepal) during my secondary school years.


"I was a perfectionist, my handwriting was gr8 and was my favorite subject.....regarding 'Nepali' ".


If anybody hear me saying that please, you're always welcome, PLEASE do kick me in my ***.

This was the subject that I always feared. If there was a question like:
Q. The name of my country is ___________ ?
My answer would always be: The name of my country is ___________ ?
Well, most of you have guessed right, I was the best student of my Nepali teacher. Was I? For the first time in my life I failed, but luckily I was upgraded. I tried to improve my 'Nepali' in my succeeding classes, and thank God I barely passed at every level until I reached the higher level, i.e. standard 11.

Here the story continues... I topped classes XI and XII in aggregate score. I was the physical science group topper in my college (KIST College)... the reason being that there was no subject like 'Nepali' (there was in class XII, but I passed it). Then I got admitted to the Institute of Engineering, in the Computer Engineering department, and completed my bachelor's degree there.
After graduation I worked as a Software Engineer at Verisk Information Technologies, a wing of Verisk Analytics, USA, for a year. Then I left Verisk to work at a startup company, Viveka. After a year at Viveka, I got bored and started working on an app for about 4 months.
NTyles is what I created in those 4 months.

Get it on....

NTyles-App
Read more about NTyles.

Currently I am working as a Project Lead @ Elite Networks.

For complete details of my profile, don't hesitate to view my :
See, I was a damn bright student; the only reason I ever failed..... there was a subject called....well, you all must be getting bored. But lots of interesting facts will reach your computers soon. :)

Cheers to all. Have a great and wonderful life ahead. Good Luck!!