Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Saturday, November 4, 2017

information retrieval document search using vector space model in R


Introduction:

In this post, we learn about building a basic search engine or document retrieval system using Vector space model. This use case is widely used in information retrieval systems. Given a set of documents and search term(s)/query we need to retrieve relevant documents that are similar to the search query.

Problem statement:

The problem statement explained above is represented as in below image.
Document retrieval system




Before we get into building the search engine, we will learn briefly about different concepts we use in this post:

Vector Space Model:

A vector space model is an algebraic model, involving two steps, in first step we represent the text documents into vector of words and in second step we transform to numerical format so that we can apply any text mining techniques such as information retrieval, information extraction,information filtering etc.

Let us understand with an example. consider below statements and a query term. The statements are referred as documents hereafter.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

Document vectors representation:


In this step includes breaking each document into words, applying preprocessing steps such as removing stopwords, punctuations, special characters etc. After preprocessing the documents we represent them as vectors of words.
Below is a sample representation of the document vectors.
Document 1: (cat, runs, behind, rat)
Document 2: (Dog, runs, behind, cat)
Query: (rat)

the relevant document to Query = greater of (similarity score between (Document1, Query), similarity score between (Document2, Query)

Next step is to represent the above created vectors of terms to numerical format known as term document matrix.

Term Document Matrix:

A term document matrix is a way of representing documents vectors in a matrix format in which each row represents term vectors across all the documents and columns represent document vectors across all the terms. The cell values frequency counts of each term in corresponding document. If a term is present in a document, then the corresponding cell value contains 1 else if the term is not present in the document then the cell value contains 0.

After creating the term document matrix, we will calculate term weights for all the terms in the matrix across all the documents. It is also important to calculate the term weightings because we need to find out terms which uniquely define a document.

We should note that a word which occurs in most of the documents might not contribute to represent the document relevance whereas less frequently occurred terms might define document relevance. This can be achieved using a method known as term frequency - inverse document frequency (tf-idf), which gives higher weights to the terms which occurs more in a document but rarely occurs in all other documents, lower weights to the terms which commonly occurs within and across all the documents.
Tf-idf = tf X idf
tf = term frequency is the number of times a term occurs in a document
idf = inverse of the document frequency, given as below
idf = log(N/df), where df is the document frequency-number of documents containing a term

total number of documents

term document matrix
inverse document frequency

Note: idf is calculated using logarithm of inverse fraction between document count and document frequency
tf-idf calculation

Note: Tf-idf weightage is calculated using tf X idf

Note, there are many variations in the way we calculate the term-frequency(tf) and inverse document frequency (idf), in this post we have seen one variation. Below images show as the other recommended variations of tf and idf, taken from wiki.
term frequency variations

inverse document frequency variations

Similarity Measures: cosine similarity


Mathematically, closeness between two vectors is calculated by calculating the cosine angle between two vectors. In similar lines, we can calculate cosine angle between each document vector and the query vector to find its closeness. To find relevant document to the query term , we may calculate the similarity score between each document vector and the query term vector by applying cosine similarity . Finally, whichever documents having high similarity scores will be considered as relevant documents to the query term.

When we plot the term document matrix, each document vector represents a point in the vector space. In the below example query, Document 1 and Document 2 represent 3 points in the vector space. We can now compare the query with each of the document by calculating the cosine angle between them.

cosine similarity

Apart from cosine similarity, we have other variants for calculating the similarity scores and are shown below:
  • Jaccard distance
  • Kullback-Leibler divergence
  • Euclidean distance

Now that we have learnt the important concepts required for implementing our problem statement, we now look at the data which will be used in this post and its implementation in R programming language.

Data description:


For this post, we use 9 text files containing news articles and a query file containing search queries. Our task is to find top-3 news articles relevant to each of the query in the queries files.
The dataset which we will be using is uploaded to GitHub and is located at below location:
https://github.com/sureshgorakala/machinelearning/tree/master/data

The news articles data is available in txt files as shown in below image:
Below is the snippet of news article in the first txt file baract_hussein_obama.txt,
“Barack Hussein Obama II (US Listeni/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/;[1][2] born August 4, 1961) is the 44th and current President of the United States. He is the first African American to hold the office and the first president born outside the continental United States. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney and taught constitutional law at the University of Chicago Law School between 1992 and 2004. While serving three terms representing the 13th District in the Illinois Senate from 1997 to 2004, he ran unsuccessfully in the Democratic primary for the United States House of Representatives in 2000 against incumbent Bobby Rush ….”
Below are the sample queries to which we will extract relevant documents, available in query.txt file, is shown below:
“largest world economy
barack obama
united state president
donald trump and united state
donald trump and barack obama
current President of the United States”
our task is to create a system in which for each of the query terms retrieve top-3 relevant documents.

High level system design:


In this section we show the high-level design implementation. The implementation steps are as follows:

  • Load documents and search queries into the R programming environment as list objects. 
  • Preprocess the data by creating a corpus object with all the documents and query terms, removing stop words, punctuations using tm package.

high level information retrieval system

  • Creating a term document matrix with tf-idf weight setting available in TermDocumentMatrix() method.
  • Separate the term document matrix into two parts- one containing all the documents with term weights and other containing all the queries with term weights.
  • Now calculate cosine similarity between each document and each query.
  • For each query sort the cosine similarity scores for all the documents and take top-3 documents having high scores.

Full code implementation:


20 comments:

  1. I loved as much as you'll receive carried out right here. The sketch is attractive, your authored subject matter stylish. nonetheless, you command get bought an edginess over that you wish be delivering the following. unwell unquestionably come further formerly again since exactly the same nearly a lot often inside case you shield this hike. Best math tutor

    ReplyDelete
  2. It is truly a great and helpful piece of info. I am satisfied that you shared this useful info with us. Please stay us informed like this. Thanks for sharing.home cooked dog food

    ReplyDelete
  3. Hi there! I know this is kinda off topic however , I'd figured I'd ask. Would you be interested in exchanging links or maybe guest writing a blog post or vice-versa? My blog discusses a lot of the same subjects as yours and I think we could greatly benefit from each other. If you might be interested feel free to send me an e-mail. I look forward to hearing from you! Excellent blog by the way! enrichment classes

    ReplyDelete
  4. I have been exploring for a little for any high-quality articles or blog posts on this sort of area . Exploring in Yahoo I at last stumbled upon this web site. Reading this information So i’m happy to convey that I've a very good uncanny feeling I discovered just what I needed. I most certainly will make sure to don’t forget this site and give it a glance on a constant basis. ecommerce web development company

    ReplyDelete
  5. Once I originally commented I clicked the -Notify me when new feedback are added- checkbox and now each time a remark is added I get four emails with the same comment. Is there any way you'll be able to take away me from that service? Thanks! 5 star hotel singapore

    ReplyDelete
  6. Generally I do not read post on blogs, but I wish to say that this write-up very forced me to try and do it! Your writing style has been surprised me. Thanks, very nice article.LinkedIn

    ReplyDelete
  7. Normally I don't read post on blogs, but I wish to say that this write-up very forced me to try and do so! Your writing style has been amazed me. Thanks, very nice post.halal buffet catering Singapore

    ReplyDelete
  8. Howdy this is somewhat of off topic but I was wondering if blogs use WYSIWYG editors or if you have to manually code with HTML. I'm starting a blog soon but have no coding knowledge so I wanted to get advice from someone with experience. Any help would be enormously appreciated!
    singapore divorce lawyer free consultation

    ReplyDelete
  9. I just couldn't depart your website prior to suggesting that I extremely enjoyed the standard info a person provide for your visitors? Is gonna be back often in order to check up on new posts
    ecommerce website development

    ReplyDelete
  10. Incredible! This blog looks exactly like my old one! It's on a completely different topic but it has pretty much the same page layout and design. Great choice of colors!
    travel insurance

    ReplyDelete
  11. buddy how can we attach the files with the code

    ReplyDelete
  12. Greetings! This is my 1st comment here so I just wanted to give a quick shout out and say I genuinely enjoy reading through your blog posts. Can you recommend any other blogs/websites/forums that deal with the same topics? Appreciate it!
    marketing strategy

    ReplyDelete
  13. Youre so cool! I dont suppose Ive read anything like this before. So nice to seek out someone with some original thoughts on this subject. realy thank you for beginning this up. this web site is one thing that's wanted on the web, someone with slightly originality. useful job for bringing something new to the web!
    role of a professional web designer

    ReplyDelete
  14. Hey There. I found your blog using msn. This is a really well written article. I will make sure to bookmark it and return to read more of your useful information. Thanks for the post. I will definitely return.
    dominate in SERPs

    ReplyDelete
  15. The very core of your writing whilst sounding agreeable in the beginning, did not really settle well with me personally after some time. Someplace within the sentences you actually managed to make me a believer unfortunately just for a while. I however have got a problem with your jumps in assumptions and one might do nicely to help fill in those gaps. When you can accomplish that, I would definitely be fascinated.local seo in singapore

    ReplyDelete
  16. Good day! This is my 1st comment here so I just wanted to give a quick shout out and tell you I truly enjoy reading your articles. Can you suggest any other blogs/websites/forums that cover the same topics? Thanks a lot! seo agencies singapore

    ReplyDelete
  17. What’s Happening i am new to this, I stumbled upon this I have found It positively helpful and it has helped me out loads. I hope to contribute & aid other users like its helped me. Great job. web design agency singapore

    ReplyDelete
  18. We do the vital hand-holding until you are set. Our master mentors will assist you with upskilling the ideas, to finish the assignments and live tasks.
    data science course in pune

    ReplyDelete
  19. Appreciating the commitment you put into your site and detailed information you present. It's nice to come across a blog every once in a while that isn't the same out of date rehashed information. Fantastic read! I've saved your site and I'm including your RSS feeds to my Google account.
    hire a Singapore website designer

    ReplyDelete
  20. Write more, thats all I have to say. Literally, it seems as though you relied on the video to make your point. You obviously know what youre talking about, why throw away your intelligence on just posting videos to your blog when you could be giving us something informative to read?
    Focus on credibility

    ReplyDelete