Text Analytics Using R - Part B: Extracting Reviews of the Movie The Martian from the IMDb Website

The main aim of this post is to show how to extract reviews of the movie The Martian from the IMDb website. IMDb is well known for its movie ratings.

As before, a line that starts with -- is a line of code.

1) --library(rvest)
--library(RCurl)
--library(XML)
--library(dplyr)

The code above loads the required libraries. RCurl provides getURL for downloading pages, and XML provides htmlParse, getNodeSet, xmlGetAttr and xmlValue for parsing them.

rvest: Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML. (Ref: https://cran.r-project.org/web/packages/rvest/rvest.pdf)

dplyr: A fast, consistent tool for working with data frame like objects, both in memory and out of memory. (Ref: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)

2) --init <- "http://www.imdb.com/title/tt3659388/reviews?filter=best"
--crawlCandicate <- "reviews\\?filter=best"
--base = "www.imdb.com/title/tt3659388/"
--num = 10
--doclist = list()
--anchorlist = vector()
--j <- 0
--while(j <num){
-- if(j==0){
--   doclist[j+1] <- getURL(init)
--  }else{
--   doclist[j+1] <-  getURL(paste(base,anchorlist[j+1],sep = ""))
--  }
--  doc <- htmlParse(doclist[[j+1]]) 
--  anchor <- getNodeSet(doc,"//a") 
--  anchor <- sapply(anchor,function(x)xmlGetAttr(x,"href"))
--  anchor <- anchor[grep(crawlCandicate,anchor)]
--  anchorlist = c(anchorlist,anchor)
--  anchorlist = unique(anchorlist)
--  j = j+1
--}


The explanation of this part of code is similar to Part A of the text analytics blog series.


3) Extracting the star ratings, reviewer names and reviews

--reviewDataFrame1=data.frame(Reviews=character(),Reviewer=character(),Ratings=character())

This line creates an empty data frame with three character columns: Reviews, Reviewer and Ratings.

#Part 0
--for(i in 1:10){
--  doc=htmlParse(doclist[[i]])
--  y=getNodeSet(doc,"//div[@id='tn15content']/div")

Explanation:
We loop over i from 1 to 10, one pass per downloaded page. Keep in mind that the structure of one website differs from another, so the XPath has to be worked out for each site. On these IMDb pages the reviewer names, reviews and star ratings all sit under the XPath "//div[@id='tn15content']"; the div with id tn15content holds all the content to be scraped.


#Part 1
 -- reviewers=sapply(1:10,function(x)getNodeSet(y[[x]],
                                              paste("//div[@id='tn15content']/div[",2*x-1,"]/a[2]",sep = ""))) %>% 
 --   sapply(.,xmlValue) 

Explanation: This code extracts the reviewer names. Here y holds the div nodes found under tn15content.

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f). (Ref: help(sapply))
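A small sketch of the difference between the two functions may help. The input vector below is made up for illustration and has nothing to do with the scraped pages.

```r
# Toy input: three made-up strings
words <- c("mars", "potato", "rover")

# lapply always returns a list, one element per input
as_list <- lapply(words, nchar)

# sapply simplifies the same result to a plain vector when it can
as_vec <- sapply(words, nchar, USE.NAMES = FALSE)

# With simplify = FALSE and USE.NAMES = FALSE, sapply behaves like lapply
same <- identical(sapply(words, nchar, simplify = FALSE, USE.NAMES = FALSE), as_list)

print(as_vec)
print(same)
```

This is why the scraping code can pipe the output of one sapply straight into another: each step hands back a vector or list with one entry per review.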

Another important point: of the 20 div nodes on a page, the odd-numbered ones (positions 2*x-1) hold the reviewer names.
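To see how the odd positions are picked, here is a minimal sketch that only builds the XPath strings, without touching any web page. The variable name reviewer_xpaths is made up for this illustration.

```r
# For x = 1, 2, 3 the expression 2*x-1 gives the odd positions 1, 3, 5
x <- 1:3
reviewer_xpaths <- paste("//div[@id='tn15content']/div[", 2 * x - 1, "]/a[2]", sep = "")

# One XPath per review, each pointing at the second <a> tag of an odd-numbered div
print(reviewer_xpaths)
```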

For comparison, the related function apply returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. (Ref: help(apply))

The reviewer names sit under tn15content, inside the <div> tags and then the <a> tags. This path was found by right-clicking on a reviewer's name and choosing the Inspect Element option.

xmlValue: This function provides access to the raw contents of an XML node. (Ref: help(xmlValue))
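The reviewer extraction can be tried offline on a tiny page. The HTML string below is a made-up miniature of the layout described above (the names Alice and Bob, the hrefs and the "helper row" divs are all assumptions, and the real IMDb markup may differ). The sketch assumes the XML package is installed.

```r
library(XML)

# Hypothetical miniature of the review page: reviewer divs sit at odd positions
html <- "<html><body><div id='tn15content'>
  <div><a href='/user/1'>avatar</a><a href='/user/1'>Alice</a></div>
  <p>Great film.</p>
  <div>helper row</div>
  <div><a href='/user/2'>avatar</a><a href='/user/2'>Bob</a></div>
  <p>Too long.</p>
  <div>helper row</div>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Build one XPath per review: div[1], div[3], ... and take the second <a> tag
n <- 2
xpaths <- paste("//div[@id='tn15content']/div[", 2 * (1:n) - 1, "]/a[2]", sep = "")

# getNodeSet returns the matching nodes; xmlValue pulls out their text
reviewers <- sapply(xpaths, function(p) xmlValue(getNodeSet(doc, p)[[1]]), USE.NAMES = FALSE)
print(reviewers)
```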

#Part 2

--  y1=lapply(1:10,function(x)getNodeSet(y[[x]],paste("//div[@id='tn15content']/div[",2*x-1,"]/img",sep = "")))  
--  ratings=sapply(1:10,function(j)tryCatch(xmlGetAttr(y1[[j]][[1]],
                                                     "alt",default = NA),
                                          error=function(e)return(NA))) 

Explanation: The rating is stored in the alt attribute of an <img> tag, which is located the same way as the reviewer name. Not every reviewer gives a rating, so the call is wrapped in tryCatch: where there is no rating, NA is returned.
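The tryCatch pattern can be sketched without any web page. The list img_alts below is a made-up stand-in for y1: each element holds the alt texts found for one review, and the second review has none.

```r
# Made-up stand-in for y1: the second review has no rating image
img_alts <- list(list("8/10"), list(), list("10/10"))

ratings <- sapply(seq_along(img_alts), function(j) {
  # Taking [[1]] of an empty list raises an error, which is turned into NA
  tryCatch(img_alts[[j]][[1]],
           error = function(e) NA)
})

print(ratings)
```

Without tryCatch the loop would stop at the first review that has no rating.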


#Part 3 
--  reviews=sapply(1:10,function(x)getNodeSet(y[[x]],paste("//div[@id='tn15content']/p[",x,"]",sep = ""))) %>% 
--    sapply(.,xmlValue) 

Explanation: The important point here is that the reviews sit in the <p> tags under tn15content.


#Part 4 
--  reviewDataFrame1=rbind(reviewDataFrame1,
                        data.frame(Reviews=reviews,
                                   Reviewer=reviewers,
                                   Ratings=ratings))
--}


Explanation: On each pass of the loop, rbind appends the reviews, reviewers and ratings from the current page as new rows of reviewDataFrame1.
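The accumulate-with-rbind pattern can be sketched on made-up data. The list pages and the name all_reviews are hypothetical stand-ins for the scraped pages and for reviewDataFrame1.

```r
# Start from an empty data frame with three character columns
all_reviews <- data.frame(Reviews = character(), Reviewer = character(),
                          Ratings = character(), stringsAsFactors = FALSE)

# Made-up data standing in for two scraped pages
pages <- list(
  list(reviews = c("Great film.", "Too long."), reviewers = c("Alice", "Bob"),
       ratings = c("8/10", NA)),
  list(reviews = c("Loved it."), reviewers = c("Carol"), ratings = c("10/10"))
)

# Each pass appends the rows for one page
for (page in pages) {
  all_reviews <- rbind(all_reviews,
                       data.frame(Reviews = page$reviews,
                                  Reviewer = page$reviewers,
                                  Ratings = page$ratings,
                                  stringsAsFactors = FALSE))
}

print(all_reviews)
```

Growing a data frame with rbind inside a loop is fine for 10 pages; for much larger crawls it may be faster to collect the per-page data frames in a list and bind them once at the end.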



Finally, for the 10 pages you have a set of ratings, reviewer names and reviews.









