Apache Pig: A tutorial to learn a small exercise on how to run a Pig program. Big Data Assignment Part 2 for Praxis Business School


Welcome to the second part of the blog, where we will learn how to run a simple Pig program. This blog has been written to complete an assignment on big data with Praxis Business School.


The sample exercise has been taken from the following link:

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/






About Pig:

Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete, in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages like JRuby, Jython and Java.

Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. As part of the translation, the Pig interpreter performs optimizations to speed up execution on Apache Hadoop.

Let's solve a simple exercise on how to run a Pig script on the Hadoop platform:

1) Data Downloading:

The CSV files can be downloaded from the following zip archive:

http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip


2) Uploading of the data:

Although the link above is an easy guide in itself, we will be using Hue to upload the files and execute the program.

Steps to upload:

a) Start the VM in VirtualBox and open an SSH terminal 
b) Log on to the address http://127.0.0.1:8000/
c) Click on the File Browser option, followed by View, then click on Hue
d) Upload the two files, Batting.csv and Master.csv.

A screenshot is attached below:






3) Write the Pig script and execute the program:



Click on the Pig icon in the header and a new script will open. Give the script any name; here it is T2, and the code is written in the notepad-style editor.

Code: 

    1) batting = LOAD 'Batting.csv' USING PigStorage(',');

This line loads the file Batting.csv into a relation named batting. Through PigStorage we are passing the comma as the delimiter.
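PigStorage(',') simply treats the input as comma-delimited text. As a rough sketch (not part of the Pig tutorial, and using a made-up two-line sample rather than the real Batting.csv), the same split can be seen with Python's csv module:

```python
import csv
import io

# A toy stand-in for Batting.csv (the real file has many more columns);
# column 0 is playerID, column 1 is yearID, column 8 is runs.
sample = "playerID,yearID,c2,c3,c4,c5,c6,c7,R\naardsda01,2004,x,x,x,x,x,x,0\n"

# Like PigStorage(','), csv.reader splits each line on commas:
rows = list(csv.reader(io.StringIO(sample)))
print(rows[1][0], rows[1][1], rows[1][8])  # aardsda01 2004 0
```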

     2)  raw_runs = FILTER batting BY $1 > 0;

This line filters out rows where the second field (the year) is not greater than zero, which removes the header row.

     3) runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;

FOREACH iterates through the batting data object. GENERATE pulls out the selected fields and assigns them names.



     4)  grp_data = GROUP runs BY (year);

This line groups runs by year.
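To see what GROUP does, here is a small Python sketch (with invented sample tuples, not the real data) that builds the same year-to-rows mapping:

```python
from collections import defaultdict

# runs is a list of (playerID, year, runs) tuples, mimicking the relation above
runs = [("a", 2010, 50), ("b", 2010, 75), ("c", 2011, 60)]

# GROUP runs BY year: build a year -> list-of-tuples mapping
grp_data = defaultdict(list)
for rec in runs:
    grp_data[rec[1]].append(rec)

print(sorted(grp_data.keys()))  # [2010, 2011]
```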


     5) max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;

This line finds the maximum number of runs for each year.

     6) join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs); 
        join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs; 
        DUMP join_data;


We are joining the per-year maximums back to the original runs to recover the player ID, generating the dataset (year, playerID, runs), and dumping the output to the screen.
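To make the join step concrete, here is a small Python sketch of the whole max-and-join logic, using invented sample rows in (playerID, year, runs) form rather than the real Batting.csv:

```python
# Hypothetical sample rows, as they would look after the FILTER and FOREACH steps
runs = [("a", 2010, 50), ("b", 2010, 75), ("c", 2011, 60), ("d", 2011, 40)]

# GROUP ... / MAX(runs.runs): maximum runs per year
max_runs = {}
for player, year, r in runs:
    max_runs[year] = max(r, max_runs.get(year, 0))

# JOIN max_runs BY (year, max) with runs BY (year, runs):
# keep only the player(s) whose runs equal the yearly maximum
join_data = [(year, player, r) for player, year, r in runs
             if max_runs[year] == r]

print(join_data)  # [(2010, 'b', 75), (2011, 'c', 60)]
```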

4) Click on Save and Execute to compile and run the program.

You can check the progress in the Job Browser Window.

This would take around 10-15 minutes to complete.

5) Finally, the output will be shown as below:






Happy solving! Please keep following for further posts.


References: Hadoop image from Google.
The website from which the exercise is taken has been mentioned at the start.
