How to process data using Hive Batting Example :Big Data Assignment Part 3 for Praxis Business School
As part of the exercise in learning about the hive query language we completed one sample exercise on hive .
The sample exercise has been taken from the following website
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
Hive provides a platform to run SQL queries. This is a more familiar with the programmers of SQL background. It is a known and learned fact that sql queries are similar to comprehend and understand
1) Data Downloading:
The csv file can be downloaded from the following zip file.
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip
2) Uploading of the data:
Although the link from where we are solving the program is an easy guide. We would be using the hue to upload the files for execution of the program.
Steps to upload:
a) Start VM Box and open SSH Terminal
b) Log on to the address http://127.0.0.1:8000/
c) Click on file browser option followed by view then click on hue
d) Upload 2 files batting.csv and master.csv.

3) Lets look into the programme:
create table temp_batting (col_value STRING);
This would create the table to store the data. Remember we have to save the queries with a name create and also press execute once it is done
Next we would load the batting.csv into the temp_batting we created. Refer snapshot below:
Load data Inpath '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;
As in sql for looking at the data we would write the select command . Here also we follow suite
Select * from batting_temp;
The output is given below:
create table batting ( player_id STRING , year INT ,runs INT);
Through regex operation we would copy the data from temp_batting and copy it into batting. For this we need to create the table batting with 3 fields player_id , years and run. After this we would group the data by year to highlight the maximum runs obtained for each year
Select year , max(runs) FROM batting GROUP BY year;
Last step we would find the player id with maximum runs. For this refer the query below:
Select a. year , a. player_ID , a.runs from batting a
JOIN (SELECT year , max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b. year AND a.runs = b.runs);
This was a simple yet engaging exercise to learn about the HQL. Interesting part we found the batting.csv file vanishes from the view of file browser. Refer Below:
Through regex operation we would copy the data from temp_batting and copy it into batting. For this we need to create the table batting with 3 fields player_id , years and run. After this we would group the data by year to highlight the maximum runs obtained for each year
Select year , max(runs) FROM batting GROUP BY year;
Last step we would find the player id with maximum runs. For this refer the query below:
Select a. year , a. player_ID , a.runs from batting a
JOIN (SELECT year , max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b. year AND a.runs = b.runs);
This was a simple yet engaging exercise to learn about the HQL. Interesting part we found the batting.csv file vanishes from the view of file browser. Refer Below:
For further insights one can look at regex functions
ReplyDelete