Blog posts on Data Science, Machine Learning, Data Mining, Artificial Intelligence, Spark Machine Learning

Saturday, November 4, 2017

information retrieval document search using vector space model in R


In this post, we learn about building a basic search engine or document retrieval system using Vector space model. This use case is widely used in information retrieval systems. Given a set of documents and search term(s)/query we need to retrieve relevant documents that are similar to the search query.

Problem statement:

The problem statement explained above is represented as in below image.
Document retrieval system

Before we get into building the search engine, we will learn briefly about different concepts we use in this post:

Vector Space Model:

A vector space model is an algebraic model, involving two steps, in first step we represent the text documents into vector of words and in second step we transform to numerical format so that we can apply any text mining techniques such as information retrieval, information extraction,information filtering etc.

Let us understand with an example. consider below statements and a query term. The statements are referred as documents hereafter.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

Document vectors representation:

In this step includes breaking each document into words, applying preprocessing steps such as removing stopwords, punctuations, special characters etc. After preprocessing the documents we represent them as vectors of words.
Below is a sample representation of the document vectors.
Document 1: (cat, runs, behind, rat)
Document 2: (Dog, runs, behind, cat)
Query: (rat)

the relevant document to Query = greater of (similarity score between (Document1, Query), similarity score between (Document2, Query)

Next step is to represent the above created vectors of terms to numerical format known as term document matrix.

Term Document Matrix:

A term document matrix is a way of representing documents vectors in a matrix format in which each row represents term vectors across all the documents and columns represent document vectors across all the terms. The cell values frequency counts of each term in corresponding document. If a term is present in a document, then the corresponding cell value contains 1 else if the term is not present in the document then the cell value contains 0.

After creating the term document matrix, we will calculate term weights for all the terms in the matrix across all the documents. It is also important to calculate the term weightings because we need to find out terms which uniquely define a document.

We should note that a word which occurs in most of the documents might not contribute to represent the document relevance whereas less frequently occurred terms might define document relevance. This can be achieved using a method known as term frequency - inverse document frequency (tf-idf), which gives higher weights to the terms which occurs more in a document but rarely occurs in all other documents, lower weights to the terms which commonly occurs within and across all the documents.
Tf-idf = tf X idf
tf = term frequency is the number of times a term occurs in a document
idf = inverse of the document frequency, given as below
idf = log(N/df), where df is the document frequency-number of documents containing a term

total number of documents

term document matrix
inverse document frequency

Note: idf is calculated using logarithm of inverse fraction between document count and document frequency
tf-idf calculation

Note: Tf-idf weightage is calculated using tf X idf

Note, there are many variations in the way we calculate the term-frequency(tf) and inverse document frequency (idf), in this post we have seen one variation. Below images show as the other recommended variations of tf and idf, taken from wiki.
term frequency variations

inverse document frequency variations

Similarity Measures: cosine similarity

Mathematically, closeness between two vectors is calculated by calculating the cosine angle between two vectors. In similar lines, we can calculate cosine angle between each document vector and the query vector to find its closeness. To find relevant document to the query term , we may calculate the similarity score between each document vector and the query term vector by applying cosine similarity . Finally, whichever documents having high similarity scores will be considered as relevant documents to the query term.

When we plot the term document matrix, each document vector represents a point in the vector space. In the below example query, Document 1 and Document 2 represent 3 points in the vector space. We can now compare the query with each of the document by calculating the cosine angle between them.

cosine similarity

Apart from cosine similarity, we have other variants for calculating the similarity scores and are shown below:
  • Jaccard distance
  • Kullback-Leibler divergence
  • Euclidean distance

Now that we have learnt the important concepts required for implementing our problem statement, we now look at the data which will be used in this post and its implementation in R programming language.

Data description:

For this post, we use 9 text files containing news articles and a query file containing search queries. Our task is to find top-3 news articles relevant to each of the query in the queries files.
The dataset which we will be using is uploaded to GitHub and is located at below location:

The news articles data is available in txt files as shown in below image:
Below is the snippet of news article in the first txt file baract_hussein_obama.txt,
“Barack Hussein Obama II (US Listeni/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/;[1][2] born August 4, 1961) is the 44th and current President of the United States. He is the first African American to hold the office and the first president born outside the continental United States. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney and taught constitutional law at the University of Chicago Law School between 1992 and 2004. While serving three terms representing the 13th District in the Illinois Senate from 1997 to 2004, he ran unsuccessfully in the Democratic primary for the United States House of Representatives in 2000 against incumbent Bobby Rush ….”
Below are the sample queries to which we will extract relevant documents, available in query.txt file, is shown below:
“largest world economy
barack obama
united state president
donald trump and united state
donald trump and barack obama
current President of the United States”
our task is to create a system in which for each of the query terms retrieve top-3 relevant documents.

High level system design:

In this section we show the high-level design implementation. The implementation steps are as follows:

  • Load documents and search queries into the R programming environment as list objects. 
  • Preprocess the data by creating a corpus object with all the documents and query terms, removing stop words, punctuations using tm package.

high level information retrieval system

  • Creating a term document matrix with tf-idf weight setting available in TermDocumentMatrix() method.
  • Separate the term document matrix into two parts- one containing all the documents with term weights and other containing all the queries with term weights.
  • Now calculate cosine similarity between each document and each query.
  • For each query sort the cosine similarity scores for all the documents and take top-3 documents having high scores.

Full code implementation:


  1. I loved as much as you'll receive carried out right here. The sketch is attractive, your authored subject matter stylish. nonetheless, you command get bought an edginess over that you wish be delivering the following. unwell unquestionably come further formerly again since exactly the same nearly a lot often inside case you shield this hike. Best math tutor

  2. It is truly a great and helpful piece of info. I am satisfied that you shared this useful info with us. Please stay us informed like this. Thanks for sharing.home cooked dog food

  3. Hi there! I know this is kinda off topic however , I'd figured I'd ask. Would you be interested in exchanging links or maybe guest writing a blog post or vice-versa? My blog discusses a lot of the same subjects as yours and I think we could greatly benefit from each other. If you might be interested feel free to send me an e-mail. I look forward to hearing from you! Excellent blog by the way! enrichment classes

  4. I have been exploring for a little for any high-quality articles or blog posts on this sort of area . Exploring in Yahoo I at last stumbled upon this web site. Reading this information So i’m happy to convey that I've a very good uncanny feeling I discovered just what I needed. I most certainly will make sure to don’t forget this site and give it a glance on a constant basis. ecommerce web development company

  5. Once I originally commented I clicked the -Notify me when new feedback are added- checkbox and now each time a remark is added I get four emails with the same comment. Is there any way you'll be able to take away me from that service? Thanks! 5 star hotel singapore

  6. Generally I do not read post on blogs, but I wish to say that this write-up very forced me to try and do it! Your writing style has been surprised me. Thanks, very nice article.LinkedIn

  7. Normally I don't read post on blogs, but I wish to say that this write-up very forced me to try and do so! Your writing style has been amazed me. Thanks, very nice post.halal buffet catering Singapore

  8. Howdy this is somewhat of off topic but I was wondering if blogs use WYSIWYG editors or if you have to manually code with HTML. I'm starting a blog soon but have no coding knowledge so I wanted to get advice from someone with experience. Any help would be enormously appreciated!
    singapore divorce lawyer free consultation

  9. I just couldn't depart your website prior to suggesting that I extremely enjoyed the standard info a person provide for your visitors? Is gonna be back often in order to check up on new posts
    ecommerce website development

  10. Incredible! This blog looks exactly like my old one! It's on a completely different topic but it has pretty much the same page layout and design. Great choice of colors!
    travel insurance

  11. buddy how can we attach the files with the code

  12. Greetings! This is my 1st comment here so I just wanted to give a quick shout out and say I genuinely enjoy reading through your blog posts. Can you recommend any other blogs/websites/forums that deal with the same topics? Appreciate it!
    marketing strategy

  13. Youre so cool! I dont suppose Ive read anything like this before. So nice to seek out someone with some original thoughts on this subject. realy thank you for beginning this up. this web site is one thing that's wanted on the web, someone with slightly originality. useful job for bringing something new to the web!
    role of a professional web designer

  14. Hey There. I found your blog using msn. This is a really well written article. I will make sure to bookmark it and return to read more of your useful information. Thanks for the post. I will definitely return.
    dominate in SERPs

  15. The very core of your writing whilst sounding agreeable in the beginning, did not really settle well with me personally after some time. Someplace within the sentences you actually managed to make me a believer unfortunately just for a while. I however have got a problem with your jumps in assumptions and one might do nicely to help fill in those gaps. When you can accomplish that, I would definitely be fascinated.local seo in singapore

  16. Good day! This is my 1st comment here so I just wanted to give a quick shout out and tell you I truly enjoy reading your articles. Can you suggest any other blogs/websites/forums that cover the same topics? Thanks a lot! seo agencies singapore

  17. What’s Happening i am new to this, I stumbled upon this I have found It positively helpful and it has helped me out loads. I hope to contribute & aid other users like its helped me. Great job. web design agency singapore

  18. We do the vital hand-holding until you are set. Our master mentors will assist you with upskilling the ideas, to finish the assignments and live tasks.
    data science course in pune

  19. Appreciating the commitment you put into your site and detailed information you present. It's nice to come across a blog every once in a while that isn't the same out of date rehashed information. Fantastic read! I've saved your site and I'm including your RSS feeds to my Google account.
    hire a Singapore website designer

  20. Write more, thats all I have to say. Literally, it seems as though you relied on the video to make your point. You obviously know what youre talking about, why throw away your intelligence on just posting videos to your blog when you could be giving us something informative to read?
    Focus on credibility

  21. thanks for sharing this information with us .

  22. Wow, superb blog layout! How long have you been blogging for? you make blogging look easy. The overall look of your website is excellent, let alone the content!
    singapore social media influencers

  23. My brother recommended I might like this web site. He was entirely right. This post actually made my day. You can not imagine simply how much time I had spent for this info! Thanks! social media marketing agency Singapore

  24. Study Business Analytics Courses in Bangalore with ExcelR where you get a great experience and better knowledge. Business Analytics Courses in Bangalore

  25. Now let us know the importance and the advantages of Digital Marketing. Digital Marketing

  26. Very interesting to read this article.I would like to thank you for the effortsData Science Course in Hyderabad

  27. I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Simple Linear Regression
    Correlation vs covariance
    data science interview questions
    KNN Algorithm
    Logistic Regression explained

  28. Thanks for the amazing info that you provided on about building a basic search engine or document retrieval system using Vector space model. Loved it. I hope you will post more meaningful articles in the near future. You are amazing. Keep up the good work. Because it influences a lot of people in a good way. Thanks again. Have a nice day!! We help the users who are not able to access AOL email through login page. AOL mail login page

  29. You have a flair for informational writing. Your selection of topic is very good and also well written. I feel you have good knowledge on this topic. Also, we assist you in providing the solutions to the SBCGlobal Email issues coming while trying to login the email account on Apple device. The solutions provided on our site are written by experts, so we guarantee you that your problem will be resolved.

  30. Great post...You have done good information on this article and i feel you have good knowledge on this topic. Visit our website to know how you can start a raid in Minecraft ( Minecraft Raid ) in simple steps. For more information visit our website. Campfire Minecraft

  31. Keep sharing stuff like this actually helps a lot. Need a logo for your business just click the link below:
    custom logo design services

  32. Therefore, if you are one of the people who do not believe that consultants are useful, think again. There are several unique uses and benefits these professionals can provide for your business! salesforce training in noida

  33. We provide training for those who want to work as high-conflict divorce coaches, consultants or advocates. There is an overwhelming need for professionals in this industry and the HCDCCP’s eight-week certification program provides guidance, training and hands-on experience. For more information about Certified Divorce Coach visit our official website.

  34. Test strips – These are small, single-use strips that change color to indicate the concentration of a specific chemical. Depending on the particular test, the user “activates” the paper or plastic strip by dipping it into the water sample and swishing it around, or by holding the strip in a stream of water.
    To know more about Water Quality Testing Company visit our official website.

  35. Nice article. I liked very much. All the informations given by you are really helpful. Also, PlantNeeds products are the world’s best soil conditioners and soil nutrients, which can fit in to all crops with various growing conditions. For more information visit our site.
    Neem Cake Product Manufacturer

  36. The information you have posted is very useful. The sites you have referred to were good. Thanks for sharing.
    data scientist training and placement in hyderabad

  37. Howdy, i read your blog from time to time and i own a similar one and i was just curious if you get a lot of spam responses? If so how do you prevent it, any plugin or anything you can recommend? I get so much lately it's driving me crazy so any help is very much appreciated.
    This design is wicked! You certainly know how to keep a reader entertained. Between your wit and your videos, I was almost moved to start my own blog (well, almost...HaHa!) Wonderful job. I really loved what you had to say, and more than that, how you presented it. Too cool! psychotherapy singapore

  38. We stumbled over here from a different website and thought I should check things out. I like what I see so now i am following you. Look forward to going over your web page again. bunion corrector

  39. Very amazing information its helpful for me Thanks for sharing this nice information keep going on in future

    quickbooks customer service

  40. Thanks for such a great post and the review, I am totally impressed! Keep stuff like this coming.
    data scientist training and placement in hyderabad

  41. Stupendous blog huge applause to the blogger and hoping you to come up with such an extraordinary content in future. Surely, this post will inspire many aspirants who are very keen in gaining the knowledge. Expecting many more contents with lot more curiosity further.

    data science course in varanasi

  42. In case you have recently purchased Arlo Pro 2 camera. And you don’t know how to log in. Don’t worry, we will help you with Arlo Pro 2 Login. We will help you by recovering your details. However, If you need any information about how to login in Arlo Pro 2 camera, You can take a look at our website. We have provided the written instruction on our website.

  43. We can abet you with expert guidance related to your TWC account. If you have set up your account on the new device or you are a new Twc user, then you will probably face login issues. So, for the proper Twc Login Email procedures, visit our website. Also, get in touch with us by our contact, if needed.

  44. We deliver the best solutions for the Charter Email. So, if you need any help, reach out to our webpage and do away with all the glitches in your mail account. Our webpage can help you with  Charter Email Login, set up, and Twc email login, and many more. So to get access to all of this information, please consider reaching out to our webpage. 

  45. Multi Cuisine Family Restaurant in Malpe. Malpe Lunch Home is Famous Local
    Favourite Restaurant for seafood, Near Malpe Beach Udupi.

  46. I recently found a lot of useful information on your website, especially this blog page. Among the lots of comments on your articles. Thanks for sharing.
    best aws training in hyderabad

  47. Very nice blog. You really provide such a good information. We assure you that we provide the best roadrunner customer service.

  48. I have read your blog and it is very informative. I personally really like it. The correct page is hbomax tv sign in, and if you are trying to sign in on this page, but couldn’t do so, then contact HBOMax experts.

  49. 5 easy ways to finally quit porn
    Let’s be real: You watch too much porn. This is the plan I used to quit it entirely. It’s a real, battle-tested, science-backed, reality-based plan to break up with porn and the embarrassment that comes with it.
    This is the plan I used to quit porn.
    1. Delete your browser history
    2. Install an adult site blocker
    3. Don’t touch your private part
    4. Get physical and get away from the computer
    5. Talk to some women
    To know more about Quit Porn Addiction, visit our website at

  50. Are you looking Setup Arlo Camera help? Get an expert now to get a free consultation. Contact us for your Arlo Camera issues.

  51. Really awesome blog, useful information. If you want to become a data scientist, here is the best option for you. For more information click the below link.
    Data Science Course in Hyderabad

  52. I just google and came up your post. Interesting..and thanks for sharing

  53. Wonderful information thanks for sharing such a useful information. Also, We offer wide range of courses for students who aspire to study abroad. Our areas of consultancy services includes admission process, course selection, scholarship and visa guidance. With our experience and expertise we have successfully helped students to get admission in world’s top universities and colleges in Australia.

    study for australia

  54. Excellent Content. I enjoy how you present. Continue to impart more information.

  55. If you are looking for where to download modded apps and games, try apkframe all is free, check it out! real drift car racing lite mod apk