# Information retrieval: document search using the vector space model in R

### Introduction:

In this post, we learn how to build a basic search engine, or document retrieval system, using the vector space model. This technique is widely used in information retrieval systems: given a set of documents and a search query, we need to retrieve the documents that are most similar to the query.

### Problem statement:

The figure below illustrates the problem statement described above.
[Figure: Document retrieval system]

Before we build the search engine, we briefly cover the concepts used in this post:

### Vector Space Model:

A vector space model is an algebraic model that involves two steps. In the first step, we represent each text document as a vector of words. In the second step, we convert those vectors to a numerical format so that we can apply text mining techniques such as information retrieval, information extraction, and information filtering.

Let us understand this with an example. Consider the statements and the query term below. The statements are referred to as documents from here on.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

### Document vectors representation:

This step includes breaking each document into words and applying preprocessing steps such as removing stop words, punctuation, and special characters. After preprocessing, we represent the documents as vectors of words.
Below is a sample representation of the document vectors.
Document 1: (cat, runs, behind, rat)
Document 2: (dog, runs, behind, cat)
Query: (rat)
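
The step above can be sketched in base R. This is a minimal illustration, not the post's own code: `tokenize` is a hypothetical helper that lowercases the text, strips punctuation, and splits on whitespace. Stop word removal is left out because the toy documents contain none worth removing.

```r
# Toy documents and query from the example above.
docs <- c(doc1  = "Cat runs behind rat",
          doc2  = "Dog runs behind cat",
          query = "rat")

# Hypothetical helper: lowercase, replace punctuation with spaces,
# split on whitespace, and drop empty tokens.
tokenize <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[[:punct:]]", " ", text)
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens[tokens != ""]
}

# One word vector per document; lapply keeps the names of `docs`.
doc_vectors <- lapply(docs, tokenize)
doc_vectors$doc1
```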

The document most relevant to the query is the one with the greater similarity score:
relevant document = greater of (similarity(Document 1, Query), similarity(Document 2, Query))

The next step is to convert the term vectors created above into a numerical format known as a term document matrix.

### Term Document Matrix:

A term document matrix represents the document vectors in matrix form: each row is a term vector across all the documents, and each column is a document vector across all the terms. Each cell holds the frequency count of a term in the corresponding document. In the simplest (binary) case, the cell contains 1 if the term is present in the document and 0 if it is not.
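
As an illustration, the matrix for the toy example can be built in base R as follows. This is a sketch under the assumption that the documents are already tokenized; the variable names are made up for this example.

```r
# Word vectors from the toy example (already preprocessed).
doc_vectors <- list(
  doc1  = c("cat", "runs", "behind", "rat"),
  doc2  = c("dog", "runs", "behind", "cat"),
  query = c("rat")
)

# Vocabulary: every distinct term, sorted for a stable row order.
terms <- sort(unique(unlist(doc_vectors)))

# One column per document; each cell counts how often the term occurs.
tdm <- sapply(doc_vectors, function(tokens) {
  as.vector(table(factor(tokens, levels = terms)))
})
rownames(tdm) <- terms
tdm
```

Rows are terms and columns are documents, so `tdm["rat", "doc1"]` is 1 and `tdm["rat", "doc2"]` is 0.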

After creating the term document matrix, we calculate a weight for every term in every document. These term weights matter because we need to find the terms that distinguish one document from the others.

Note that a word which occurs in most of the documents contributes little to a document's relevance, whereas a term that occurs less frequently across the collection may define it. This is captured by a method known as term frequency - inverse document frequency (tf-idf), which gives higher weights to terms that occur often in a document but rarely in the other documents, and lower weights to terms that occur commonly across all the documents.
tf-idf = tf × idf
tf = term frequency, the number of times a term occurs in a document
idf = inverse document frequency, defined as
idf = log(N/df), where df is the document frequency (the number of documents containing the term) and N is the total number of documents

[Figure: term document matrix]
[Figure: inverse document frequency]

Note: idf is the logarithm of the ratio of the document count to the document frequency.
[Figure: tf-idf calculation]

Note: the tf-idf weight is the product tf × idf.
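
To make the calculation concrete, here is a small base R sketch for the two toy documents. It assumes the natural logarithm (the base is not fixed by the formula and only rescales the weights) and computes df over the two documents only, leaving the query out.

```r
# Term counts for the two toy documents (rows = terms, columns = documents).
tf <- matrix(
  c(1, 1,   # behind
    1, 1,   # cat
    0, 1,   # dog
    1, 0,   # rat
    1, 1),  # runs
  ncol = 2, byrow = TRUE,
  dimnames = list(c("behind", "cat", "dog", "rat", "runs"),
                  c("doc1", "doc2"))
)

N   <- ncol(tf)          # total number of documents
df  <- rowSums(tf > 0)   # number of documents containing each term
idf <- log(N / df)       # natural log is an assumption

# idf has one entry per row, so each row is scaled by its term's idf.
tfidf <- tf * idf
round(tfidf, 3)
```

Terms that occur in both documents ("behind", "cat", "runs") get a weight of 0, while "rat" and "dog", which each occur in only one document, get log(2).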

Note that there are many variations in how term frequency (tf) and inverse document frequency (idf) are calculated; this post uses only one of them. The images below show other recommended variations of tf and idf, taken from Wikipedia.
[Figure: term frequency variations]

[Figure: inverse document frequency variations]

### Similarity Measures: cosine similarity

Mathematically, the closeness of two vectors can be measured by the cosine of the angle between them. In the same way, we can calculate the cosine of the angle between each document vector and the query vector to measure how close they are. To find the documents relevant to a query, we calculate the cosine similarity between each document vector and the query vector. Finally, the documents with the highest similarity scores are considered relevant to the query.

When we plot the term document matrix, each document vector is a point in the vector space. In the example below, the query, Document 1, and Document 2 are three points in the vector space. We can now compare the query with each document by calculating the cosine of the angle between them.

[Figure: cosine similarity]
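
For two vectors a and b, cosine similarity is the dot product divided by the product of the vector lengths: cos(θ) = (a · b) / (‖a‖ ‖b‖). The sketch below applies this to the toy example using raw term counts rather than tf-idf weights, to keep the arithmetic easy to follow. The function name `cosine_similarity` is chosen for this illustration, and the function is undefined (NaN) when either vector is all zeros.

```r
# Cosine of the angle between two numeric vectors of equal length.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Term order: behind, cat, dog, rat, runs
doc1  <- c(1, 1, 0, 1, 1)
doc2  <- c(1, 1, 1, 0, 1)
query <- c(0, 0, 0, 1, 0)

scores <- c(doc1 = cosine_similarity(doc1, query),
            doc2 = cosine_similarity(doc2, query))
scores

# The document with the highest score is the most relevant one.
names(which.max(scores))
```

Document 1 contains "rat" and scores 1 / (2 × 1) = 0.5, while Document 2 shares no term with the query and scores 0, so Document 1 is returned.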

Apart from cosine similarity, other measures can be used to compare vectors, including:
• Jaccard distance
• Kullback-Leibler divergence
• Euclidean distance
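
For comparison, two of these measures can be sketched in base R on the toy example. Both are distances, so smaller values mean closer vectors, the opposite of cosine similarity. The function names are made up for this illustration, and Kullback-Leibler divergence is left out because it needs probability distributions and a smoothing choice.

```r
# Jaccard distance on sets of terms: 1 - |intersection| / |union|.
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Euclidean distance between two numeric vectors of equal length.
euclidean_distance <- function(x, y) {
  sqrt(sum((x - y)^2))
}

doc1_terms  <- c("cat", "runs", "behind", "rat")
query_terms <- c("rat")
jd <- jaccard_distance(doc1_terms, query_terms)   # 1 - 1/4

# Term order: behind, cat, dog, rat, runs
doc1_counts  <- c(1, 1, 0, 1, 1)
query_counts <- c(0, 0, 0, 1, 0)
ed <- euclidean_distance(doc1_counts, query_counts)   # sqrt(3)

c(jaccard = jd, euclidean = ed)
```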

Now that we have covered the concepts needed for our problem statement, we look at the data used in this post and at the implementation in the R programming language.

### Data description:

For this post, we use 9 text files containing news articles and a query file containing search queries. Our task is to find the top 3 news articles relevant to each query in the query file.
The dataset is uploaded to GitHub at the location below:
https://github.com/sureshgorakala/machinelearning/tree/master/data

The news articles are available as txt files.
Below is a snippet of the news article in the first txt file, baract_hussein_obama.txt:
“Barack Hussein Obama II (US Listeni/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/;[1][2] born August 4, 1961) is the 44th and current President of the United States. He is the first African American to hold the office and the first president born outside the continental United States. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney and taught constitutional law at the University of Chicago Law School between 1992 and 2004. While serving three terms representing the 13th District in the Illinois Senate from 1997 to 2004, he ran unsuccessfully in the Democratic primary for the United States House of Representatives in 2000 against incumbent Bobby Rush ….”
The sample queries for which we will retrieve relevant documents are available in the query.txt file and are shown below:
“largest world economy
barack obama
united state president
donald trump and united state
donald trump and barack obama
current President of the United States”
Our task is to create a system that retrieves the top 3 relevant documents for each of these queries.

### High level system design:

This section outlines the high-level design of the implementation. The steps are as follows:

• Load the documents and search queries into the R environment as list objects.
• Preprocess the data by creating a corpus object from all the documents and queries, then removing stop words and punctuation using the tm package.
• Create a term document matrix with tf-idf weighting, an option available in the TermDocumentMatrix() method.
• Split the term document matrix into two parts: one containing all the documents with their term weights, and the other containing all the queries with their term weights.
• Calculate the cosine similarity between each document and each query.
• For each query, sort the cosine similarity scores of all the documents and take the top 3 documents with the highest scores.

[Figure: high level information retrieval system]
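
The steps above can be sketched end to end as follows. This is a minimal, self-contained illustration and not the post's original code: it uses base R instead of the tm package so that it runs without extra dependencies, and the in-memory `docs`, `queries`, and `stop_words` are hypothetical stand-ins for the news article files, query.txt, and tm's stop word list.

```r
# Step 1: hypothetical stand-ins for the article files and query.txt.
docs <- c(
  obama   = "Barack Obama is the 44th President of the United States",
  trump   = "Donald Trump is an American businessman",
  economy = "The United States has the largest economy in the world"
)
queries <- c(q1 = "barack obama", q2 = "largest world economy")

# Tiny illustrative stop word list (an assumption, not tm's list).
stop_words <- c("is", "the", "of", "an", "has", "in", "and")

# Step 2: preprocess documents and queries together.
tokenize <- function(text) {
  clean  <- gsub("[[:punct:]]", " ", tolower(text))
  tokens <- strsplit(clean, "[[:space:]]+")[[1]]
  tokens[tokens != "" & !(tokens %in% stop_words)]
}
tokens <- lapply(c(docs, queries), tokenize)
terms  <- sort(unique(unlist(tokens)))

# Step 3: term document matrix with tf-idf weights.
tf <- sapply(tokens, function(tk) as.vector(table(factor(tk, levels = terms))))
rownames(tf) <- terms
idf   <- log(ncol(tf) / rowSums(tf > 0))
tfidf <- tf * idf

# Step 4: split the matrix into document columns and query columns.
doc_mat   <- tfidf[, names(docs), drop = FALSE]
query_mat <- tfidf[, names(queries), drop = FALSE]

# Step 5: cosine similarity. Scale every column to unit length, then
# the cross product gives one score per (document, query) pair.
unit_cols <- function(m) sweep(m, 2, sqrt(colSums(m^2)), "/")
sims <- t(unit_cols(doc_mat)) %*% unit_cols(query_mat)

# Step 6: for each query, keep the top_k highest-scoring documents.
top_k <- 3
ranking <- lapply(colnames(sims), function(q) {
  head(sort(sims[, q], decreasing = TRUE), top_k)
})
names(ranking) <- colnames(sims)
ranking
```

With the tm package, the preprocessing and weighting steps would instead go through a corpus object and `TermDocumentMatrix()` with its tf-idf weighting option, as the list describes. The splitting, similarity, and ranking steps would stay essentially the same.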

### Full code implementation:
