Text Analytics Using R - Part A: Extraction of Samsung Galaxy S4 product reviews from Flipkart
Hi Folks!!
This is the first blog in a series where I would love to share my experimentation with text analytics using R. In the initial three parts we will concentrate our efforts on scraping, i.e. extracting, reviews of the Samsung Galaxy S4 from Flipkart.
We will walk through the code bit by bit to get the hang of it and make it simple to follow for a novice who is just getting started.
When a line starts with -- it is R code.
1) R is an open-source platform; to get added functionality you load the relevant libraries:
--library(RCurl)
--library(XML)
RCurl: Provides functions that allow one to compose general HTTP requests, plus convenient functions to fetch URIs, get and post forms, etc., and process the results returned by the web server. (Ref: https://cran.r-project.org/web/packages/RCurl/index.html)
XML: This collection of functions allows us to add, remove and replace children of an XML node and to add and remove attributes on an XML node. These are generic functions that work on both internal C-level XMLInternalElementNode objects and regular R-level XMLNode objects. (Ref: https://cran.r-project.org/web/packages/XML/XML.pdf)
Both may sound alien at first, but together they are what we use for scraping websites.
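If the two packages are not installed yet, a one-time install is needed before they can be loaded:
--install.packages(c("RCurl", "XML"))   # only needed the first time; skip if already installed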
2)
--init <- "http://www.flipkart.com/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG"
--crawlCandicate <- "start="
--base = "http://www.flipkart.com"
a) init refers to the page we will extract the reviews from, here the Samsung Galaxy S4 product reviews on Flipkart.
b) Go to the page, right-click and select Inspect Element. If you put your cursor on the page-2 link you will see the result below:
"href="/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG?pid=MOBDK7U9FFPUAGPZ&rating=1,2,3,4,5&reviewers=all&type=top&sort=most_helpful&start=10"
For page 3 it is "href="/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG?pid=MOBDK7U9FFPUAGPZ&rating=1,2,3,4,5&reviewers=all&type=top&sort=most_helpful&start=20"
As we can see, start= is common to these links, so we have assigned it as the crawl candidate (the variable crawlCandicate).
c) base refers to the base website, here http://www.flipkart.com.
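As a quick sanity check (not part of the original script), you can fetch the initial page once and inspect what getURL() returns, a single character string of raw HTML; firstPage is just an illustrative name:
--firstPage <- getURL(init)
--nchar(firstPage)            # a large number if the download worked
--substr(firstPage, 1, 60)    # peek at the start of the HTML text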
3)
--num = 3
--doclist = list()
--anchorlist = vector()
--j <- 0
--while(j < num){
--  if(j == 0){
--    doclist[j+1] <- getURL(init)
--  }else{
--    doclist[j+1] <- getURL(paste(base, anchorlist[j+1], sep = ""))
--  }
--  doc <- htmlParse(doclist[[j+1]])
--  anchor <- getNodeSet(doc, "//a")
--  anchor <- sapply(anchor, function(x) xmlGetAttr(x, "href"))
--  anchor <- anchor[grep(crawlCandicate, anchor)]
--  anchorlist = c(anchorlist, anchor)
--  anchorlist = unique(anchorlist)
--  j = j + 1
--}
a) num refers to the number of pages to be extracted from the website.
b) doclist and anchorlist are initialized, the former as a list and the latter as a vector. j is the loop counter, initialized to 0.
c) We run a while loop that compares j with the number of pages; for j equal to 0, 1 and 2 we enter the if-else block. When j is 0, doclist[1] is assigned the content of the initial URL, i.e. the first page of product reviews; otherwise doclist[2] and doclist[3] are fetched from the combination of base and an entry of anchorlist (explained in the coming points).
We specify base because, as Inspect Element shows, the href attribute holds only a relative link; the site itself is missing:
<a class="nav_bar_link" href="/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG?pid=MOBDK7U9FFPUAGPZ&rating=1,2,3,4,5&reviewers=all&type=top&sort=most_helpful&start=10">2</a>
Prefixing base turns it into the full link:
http://www.flipkart.com/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG?pid=MOBDK7U9FFPUAGPZ&rating=1,2,3,4,5&reviewers=all&type=top&sort=most_helpful&start=10
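A minimal sketch of what the paste() call inside the loop does with such a relative link (pg2 is just an illustrative name, not part of the original script):
--pg2 <- "/samsung-galaxy-s4/product-reviews/ITME7RSP2RYWWNPG?pid=MOBDK7U9FFPUAGPZ&rating=1,2,3,4,5&reviewers=all&type=top&sort=most_helpful&start=10"
--paste(base, pg2, sep = "")   # gives "http://www.flipkart.com/samsung-galaxy-s4/product-reviews/..."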
d) doc <- htmlParse(doclist[[j+1]]): getURL returns the page as plain text, so we use htmlParse to turn it back into a parsed HTML document, or better said, to recover the HTML structure that is lost when the page is stored as text.
e) anchor <- getNodeSet(doc, "//a")
anchor <- sapply(anchor, function(x) xmlGetAttr(x, "href"))
getNodeSet: These functions provide a way to find XML nodes that match a particular criterion. It uses the XPath syntax and allows very powerful expressions to identify nodes of interest within a document both clearly and efficiently. (Ref: help("getNodeSet"))
The first line collects all the <a> tags I am interested in; the second captures the href attribute of each of them.
f) anchor <- anchor[grep(crawlCandicate,anchor)]
This keeps only those links that match the crawl candidate, here start=.
g) anchorlist = c(anchorlist,anchor)
This appends the newly found links to the external container anchorlist.
h) anchorlist = unique(anchorlist)
This keeps only the unique elements of anchorlist, with duplicates removed.
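To see how getNodeSet() and xmlGetAttr() behave on their own, here is a tiny self-contained sketch on a made-up HTML snippet (the snippet and the toy* names are only for illustration, not from the Flipkart page):
--toy <- "<html><body><a href='?start=10'>2</a><a href='?start=20'>3</a><a href='/home'>Home</a></body></html>"
--toyDoc <- htmlParse(toy, asText = TRUE)
--toyAnchors <- getNodeSet(toyDoc, "//a")
--toyHrefs <- sapply(toyAnchors, function(x) xmlGetAttr(x, "href"))
--toyHrefs[grep("start=", toyHrefs)]   # keeps only the links containing start=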
4) Now we focus on extracting the reviews:
--reviews = c()
--for (i in 1:3){
--  doc = htmlParse(doclist[[i]])
--  l = getNodeSet(doc, "//div/p[@class = 'line bmargin10']")
--  l1 = sapply(l, xmlValue)
--  reviews = c(reviews, l1)
--}
In point 3 we extracted the pages; now our aim is to get the reviews. We create an empty reviews vector and run a for loop over the number of pages specified earlier, here 3. The reviews sit under the node //div/p[@class = 'line bmargin10'], so getNodeSet returns exactly the nodes matching this criterion. In simple words, each review is inside a <p> tag with class 'line bmargin10' under a <div>.
xmlValue: This function provides access to the raw text content of the nodes. (Ref: help(xmlValue))
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. (Ref: help(lapply))
sapply is a user-friendly version and wrapper of lapply, by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). (Ref: help(sapply))
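A quick toy illustration of the difference (the numbers have nothing to do with the scrape):
--lapply(1:3, function(x) x^2)   # a list of length 3 holding 1, 4, 9
--sapply(1:3, function(x) x^2)   # the same values simplified to a numeric vector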
reviews = c(reviews,l1) appends each page's reviews to the container, building up the final reviews vector.
5)
--strlist = c()
--for (i in 1:3){
--  doc = htmlParse(doclist[[i]])
--  str = getNodeSet(doc, "//div[@class='fk-stars']")
--  strl <- sapply(str, function(x) xmlGetAttr(x, "title"))
--  strlist <- c(strlist, strl)
--}
The code is similar to that in step 4, but here we are extracting the star rating.
xmlGetAttr: This is a convenience function that retrieves the value of a named attribute in an XML node, taking care of checking for its existence. It also allows the caller to provide a default value to use as the return value if the attribute is not present. Here the star rating is stored in the title attribute, which is why we ask for "title". (Ref: help("xmlGetAttr"))
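One convenient next step, not in the original post, is to keep each review together with its star rating. A minimal sketch, assuming the reviews and strlist vectors line up one-to-one (worth checking first, since a page can contain a different number of review and rating nodes); reviewData is just an illustrative name:
--length(reviews); length(strlist)    # check the two vectors match before pairing
--reviewData <- data.frame(rating = strlist, review = reviews, stringsAsFactors = FALSE)
--head(reviewData)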
Thanks for being part of Part A. This will be followed by the other parts.
Enjoy!!!!