Posts

Showing posts from November, 2015

Apache Pig: A tutorial exercise on how to run a Pig program (Big Data Assignment Part 2 for Praxis Business School)

Welcome to the second part of this blog, where we learn how to run a simple Pig program. This post was written to complete a big data assignment for Praxis Business School. The sample exercise is taken from the following link: "http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/". About Pig: Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages, such as JRuby, Jython and Java. Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. As part of the translation, the Pig interpreter performs optimizations to speed execution on Apache Hadoop. Let's solve a simple exercise on how ...
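
To give a feel for what "describing a problem as a data flow" means, here is a small sketch in plain Python rather than Pig Latin. It is not the exercise from the linked tutorial and it does not run on Hadoop; the records, field names and function names are all made up for illustration. Each function stands in for one step of a typical flow: load, filter, group, aggregate.

```python
from collections import defaultdict

# Hypothetical sample records: (player, year, runs). Invented for this sketch.
records = [
    ("p1", 2001, 10),
    ("p2", 2001, 25),
    ("p3", 2002, 7),
    ("p4", 2002, 31),
    ("p5", 2002, 0),
]

def load(rows):
    """Load step: give each raw tuple named fields."""
    return [{"player": p, "year": y, "runs": r} for p, y, r in rows]

def filter_rows(rows):
    """Filter step: keep only rows with a positive run count."""
    return [row for row in rows if row["runs"] > 0]

def group_by_year(rows):
    """Group step: collect rows that share the same year."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["year"]].append(row)
    return dict(groups)

def max_runs_per_year(groups):
    """Aggregate step: compute the highest run count within each group."""
    return {year: max(r["runs"] for r in rows) for year, rows in groups.items()}

# The whole analysis is a chain of steps, each feeding the next.
result = max_runs_per_year(group_by_year(filter_rows(load(records))))
print(result)
```

In Pig, a flow like this would be written as a handful of statements, and the interpreter would decide how to turn them into MapReduce jobs.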

Welcome to the world of Hadoop: A small baby elephant (Big Data Assignment for Praxis Business School Part 1)

Big Data has attracted the attention of many corporations, individuals and big honchos in the field of analytics. Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. A small baby elephant named Hadoop comes to the rescue. Hadoop is a tool that helps solve the problem of processing large amounts of data (terabytes) by combining a number of computers. Before proceeding with the objective of this blog, we must understand two principal concepts of Hadoop: a) HDFS: the Hadoop Distributed File System distributes large files across multiple machines in a way that is invisible to the user. b) MapReduce: this is the crux of Hadoop. It can be broken down into two independent tasks map an...
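
As a rough illustration of the map and reduce idea, here is a single-machine word count in plain Python. It is only a sketch of the concept, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions made for this example, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could live on a different machine.
lines = ["big data big elephant", "small elephant"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, the work can be split across machines without the pieces needing to talk to each other.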