How to process data using Hive Batting Example :Big Data Assignment Part 3 for Praxis Business School

December 09, 2015

As part of the exercise in learning about the hive query language we completed one sample exercise on hive .

The sample exercise has been taken from the following website

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

Hive provides a platform to run SQL queries. This is a more familiar with the programmers of SQL background. It is a known and learned fact that sql queries are similar to comprehend and understand

1) Data Downloading:

The csv file can be downloaded from the following zip file.

http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

2) Uploading of the data:

Although the link from where we are solving the program is an easy guide. We would be using the hue to upload the files for execution of the program.

Steps to upload:

a) Start VM Box and open SSH Terminal
b) Log on to the address http://127.0.0.1:8000/
c) Click on file browser option followed by view then click on hue
d) Upload 2 files batting.csv and master.csv.

3) Lets look into the programme:

create table temp_batting (col_value STRING);

This would create the table to store the data. Remember we have to save the queries with a name create and also press execute once it is done

Next we would load the batting.csv into the temp_batting we created. Refer snapshot below:

Load data Inpath '/user/admin/Batting.csv' OVERWRITE INTO TABLE temp_batting;

As in sql for looking at the data we would write the select command . Here also we follow suite

Select * from batting_temp;

The output is given below:

create table batting ( player_id STRING , year INT ,runs INT);

Through regex operation we would copy the data from temp_batting and copy it into batting. For this we need to create the table batting with 3 fields player_id , years and run. After this we would group the data by year to highlight the maximum runs obtained for each year

Select year , max(runs) FROM batting GROUP BY year;

Last step we would find the player id with maximum runs. For this refer the query below:

Select a. year , a. player_ID , a.runs from batting a
JOIN (SELECT year , max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b. year AND a.runs = b.runs);

This was a simple yet engaging exercise to learn about the HQL. Interesting part we found the batting.csv file vanishes from the view of file browser. Refer Below: