**Curse of Dimensionality:**

One of the most common problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional, sparse data. We often face a situation where we have a large set of features but relatively few data points, or feature vectors of very high dimension. Fitting a model to such a dataset tends to result in lower predictive power. This scenario is often called the curse of dimensionality. In general, adding more data points or shrinking the feature space, also known as dimensionality reduction, reduces its effects.

In this post, we discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Consider the following scenario:

The data we want to work with is in the form of an M × N matrix A, shown below, where A(i, j) is the value of the j-th variable for the i-th observation.

Thus the matrix holds M observations (rows), each of which is an N-dimensional vector with one entry per variable. If N is very large, it is often desirable to reduce the number of variables to a smaller number k, as in the image below, while losing as little information as possible.

Mathematically, PCA is a linear orthogonal transformation that maps the data to a new coordinate system such that the direction of greatest variance of the data lies along the first coordinate (called the first principal component), the direction of second greatest variance along the second coordinate, and so on.
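In symbols, a sketch of the same statement, assuming the columns of the data matrix $X$ have been centred and writing $\Sigma$ for their sample covariance matrix (this notation is introduced here and is not part of the original post): the first principal direction $w_1$ is the unit vector that maximises the projected variance, and each later direction solves the same problem restricted to vectors orthogonal to the earlier ones.

```latex
w_1 = \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(Xw)
    = \arg\max_{\lVert w \rVert = 1} w^{\top} \Sigma\, w,
\qquad
w_j = \arg\max_{\substack{\lVert w \rVert = 1 \\ w \perp w_1, \dots, w_{j-1}}} w^{\top} \Sigma\, w .
```

The maximisers are the eigenvectors of $\Sigma$ taken in order of decreasing eigenvalue, and the $j$-th eigenvalue is the variance of the data along the $j$-th principal component.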

When applied, the algorithm linearly transforms the N-dimensional input space into a k-dimensional output space (k < N), with the objective of minimizing the amount of information (variance) lost by discarding the remaining N - k dimensions. PCA thus allows us to discard the directions (principal components) along which the data vary least.
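As a rough illustration of this reduction, the sketch below uses R's `prcomp()` on the built-in `crimtab` data (the function and dataset used in the walkthrough later in this post). The choice `k = 2` is arbitrary, and converting the table to a plain matrix with `unclass()` is a choice made for this sketch, not something the post does.

```r
# Sketch: keep only the first k principal components and measure what is lost.
X <- unclass(crimtab)            # plain 42 x 22 matrix (drops the "table" class)
pca <- prcomp(X)                 # centres each column by default, no rescaling
k <- 2                           # arbitrary illustrative choice

reduced <- pca$x[, 1:k, drop = FALSE]                # 42 x k matrix of scores
retained <- sum(pca$sdev[1:k]^2) / sum(pca$sdev^2)   # share of variance kept

# Map the k-dimensional scores back to the original 22 columns.
approx <- reduced %*% t(pca$rotation[, 1:k, drop = FALSE])
approx <- sweep(approx, 2, pca$center, "+")

# Squared reconstruction error relative to the total (centred) variation.
rel_error <- sum((X - approx)^2) / sum(scale(X, scale = FALSE)^2)

dim(reduced)
retained
rel_error    # matches 1 - retained up to rounding
```

The relative reconstruction error is exactly the share of variance carried by the discarded components, which is why keeping the highest-variance components loses the least information.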

Technically speaking, PCA uses an orthogonal transformation to convert a set of possibly highly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, that is, it accounts for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to, and therefore uncorrelated with, all preceding components.

In the above image, u1 and u2 are principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest variance and is orthogonal to u1.
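These properties can be checked numerically. The following is a minimal sketch, not the method the post itself uses: it computes the components "by hand" from an eigendecomposition of the covariance matrix of R's built-in `crimtab` data (the dataset used later in this post). All variable names are illustrative.

```r
# Sketch: PCA via eigendecomposition of the sample covariance matrix.
X <- unclass(crimtab)                          # plain 42 x 22 matrix
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
eig <- eigen(cov(Xc), symmetric = TRUE)        # eigenvectors = principal directions

scores <- Xc %*% eig$vectors                   # data in the new coordinate system
comp_var <- apply(scores, 2, var)              # variance along each component

round(eig$values[1:3], 2)                      # eigenvalues, in decreasing order
round(comp_var[1:3], 2)                        # same numbers: variance per component
round(cor(scores[, 1], scores[, 2]), 10)       # close to 0: components are uncorrelated
```

The eigenvalues come out in decreasing order, they equal the variances of the projected data, and the projected columns are uncorrelated, which is what the description above asserts.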

**PCA implementation in R:**

    > head(crimtab)
        142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
    9.4      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
    9.5      0      0      0      0     0      1      0      0      0     0      0      0      0      0     0      0
    9.6      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
    9.7      0      0      0      0     0      0      0      0      0     0      0      0      0      0     0      0
    9.8      0      0      0      0     0      0      1      0      0     0      0      0      0      0     0      0
    9.9      0      0      1      0     1      0      1      0      0     0      0      0      0      0     0      0
        182.88 185.42 187.96 190.5 193.04 195.58
    9.4      0      0      0     0      0      0
    9.5      0      0      0     0      0      0
    9.6      0      0      0     0      0      0
    9.7      0      0      0     0      0      0
    9.8      0      0      0     0      0      0
    9.9      0      0      0     0      0      0

    > dim(crimtab)
    [1] 42 22

    > str(crimtab)
     'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
     - attr(*, "dimnames")=List of 2
      ..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
      ..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

    > sum(crimtab)
    [1] 3000

    > colnames(crimtab)
     [1] "142.24" "144.78" "147.32" "149.86" "152.4"  "154.94" "157.48" "160.02" "162.56" "165.1"  "167.64" "170.18" "172.72" "175.26" "177.8"  "180.34"
    [17] "182.88" "185.42" "187.96" "190.5"  "193.04" "195.58"

Let us use apply() on the crimtab dataset column-wise to calculate the variance and see how much each variable varies.

    apply(crimtab, 2, var)
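Rather than scanning the 22 variances by eye, the largest ones can be picked out directly. A small sketch (the name `col_var` is just illustrative):

```r
# Sketch: rank the columns of crimtab by their variance.
col_var <- apply(crimtab, 2, var)        # MARGIN = 2 means "per column"

sort(col_var, decreasing = TRUE)[1:5]    # the five most variable columns
names(which.max(col_var))                # name of the column with the largest variance
```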

We observe that column “165.1” has the largest variance in the data. Next, we apply PCA using prcomp().

    pca <- prcomp(crimtab)
    pca

Note: printing the pca object from the above code shows its standard deviations and its rotation. From the standard deviations we can see that the first principal component explains most of the variation, followed by the remaining components in decreasing order. Rotation is the matrix of principal component loadings, which gives the weight of each variable along each principal component.

Let’s plot all the principal components and see how much of the variance each component accounts for.
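The standard deviations can be turned into a share of variance per component. A minimal sketch, assuming the table is first converted to a plain matrix with `unclass()` (a choice made here, not in the post); note that `prcomp()` centres each column by default but does not rescale it (`center = TRUE`, `scale. = FALSE`):

```r
# Sketch: proportion of variance explained by each principal component.
X <- unclass(crimtab)
pca <- prcomp(X)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- colnames(pca$rotation)     # "PC1", "PC2", ...
round(var_explained[1:5], 3)

# summary() reports the same quantity (rounded) in its importance table.
summary(pca)$importance["Proportion of Variance", 1:5]
```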

    par(mar = rep(2, 4))
    plot(pca)

Clearly, the first principal component accounts for the most variance.

Let us interpret the results of PCA using a biplot. A biplot shows how strongly each variable loads on two principal components.
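The same information can be read numerically as a cumulative share of variance, which also suggests how many components to keep. In the sketch below the 90% threshold is a hypothetical cut-off chosen for illustration, not a value from the post.

```r
# Sketch: smallest number of components whose cumulative variance reaches a threshold.
X <- unclass(crimtab)
pca <- prcomp(X)

cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
threshold <- 0.90                        # hypothetical cut-off
k <- which(cum_var >= threshold)[1]      # first component count that reaches it

k
round(cum_var[1:k], 3)
```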

    # The next two lines flip the direction of the biplot; if we do not include
    # them, the plot will be a mirror image of the one below.
    pca$rotation <- -pca$rotation
    pca$x <- -pca$x
    biplot(pca, scale = 0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the first two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the original features vary along the principal component axes.

From the plot, we can see that the first principal component, PC1, places more or less equal weight on three features: 165.1, 167.64, and 170.18. This suggests that these three features are more correlated with each other than with the 160.02 and 162.56 features.

The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.
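The sign flip applied before plotting is purely cosmetic: the sign of a principal component is arbitrary, so negating both the loadings and the scores describes the same fit. A small sketch that checks this (the name `flipped` is illustrative, and `unclass()` is used here only to get a plain matrix):

```r
# Sketch: negating rotation and scores together leaves the fit unchanged.
X <- unclass(crimtab)
pca <- prcomp(X)

flipped <- pca
flipped$rotation <- -pca$rotation
flipped$x <- -pca$x

# Rebuild the centred data from scores and loadings in both cases.
recon_orig <- pca$x %*% t(pca$rotation)
recon_flip <- flipped$x %*% t(flipped$rotation)

all.equal(recon_orig, recon_flip)      # TRUE: same reconstruction
identical(pca$sdev, flipped$sdev)      # TRUE: variances are untouched
```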
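The weights read off the plot can also be inspected directly in the loadings matrix. A minimal sketch for the five columns discussed above (no output is shown here, and the signs depend on whether the flip before the biplot has been applied):

```r
# Sketch: loadings of the five discussed columns on PC1 and PC2.
X <- unclass(crimtab)
pca <- prcomp(X)

cols <- c("160.02", "162.56", "165.1", "167.64", "170.18")
loadings <- pca$rotation[cols, c("PC1", "PC2")]
round(loadings, 3)
```

Columns with similar values in the PC1 column are weighted similarly by the first component; the PC2 column shows which of them the second component emphasises.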

**Complete Code for PCA implementation in R:**
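No listing appears under this heading in the text as it stands. As a sketch, the commands used in the walkthrough above can be gathered into a single script in the same order; the only change is using `<-` for assignment, which is a style choice.

```r
# Explore the built-in crimtab dataset
head(crimtab)
dim(crimtab)
str(crimtab)
sum(crimtab)
colnames(crimtab)

# Variance of each column
apply(crimtab, 2, var)

# Principal component analysis
pca <- prcomp(crimtab)
pca

# Variance accounted for by each component
par(mar = rep(2, 4))
plot(pca)

# Flip signs so the biplot is not mirrored, then draw it
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)
```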

By now we have seen how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? We will answer these questions in our next post.

Are you sure your data are centered and scaled? I mean, before doing PCA.