Welcome to the world of Hadoop, a small baby elephant: Big Data Assignment for Praxis Business School, Part 1



Big Data has attracted the attention of corporates, individuals and big honchos in the field of analytics. Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so massive that it is difficult to process using traditional database and software techniques.

A small baby elephant by the name of Hadoop comes to the rescue.


Hadoop is a tool that solves the problem of processing large amounts of data (terabytes and beyond) by combining a number of computers into a cluster.

Before proceeding with the objective of this blog, we must understand two principal concepts of Hadoop:

a) HDFS: the Hadoop Distributed File System distributes large files across multiple machines in a way that is invisible to the user.

b) MapReduce: This is the crux of Hadoop. A job can be broken down into two independent tasks, map and reduce. Let's take a small example: a room contains coloured balls, and we need to count the balls of each colour. The map program would identify the colour of each ball and attach a post-it for it on the table, whereas the reduce program would add up the number of post-its per colour.
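For the word count problem the flow is the same idea. A rough sketch, assuming a tiny two-line input, looks like this:

Input lines:              "hadoop is fun"   and   "hadoop is big"
Map output (word, 1):     (hadoop,1) (is,1) (fun,1) (hadoop,1) (is,1) (big,1)
Shuffle (group by word):  (big,[1]) (fun,[1]) (hadoop,[1,1]) (is,[1,1])
Reduce output (totals):   (big,1) (fun,1) (hadoop,2) (is,2)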

I am an analytics enthusiast who got a chance to work on the word count program on the Hadoop platform. This blog is meant as a simple guide to running a word count program.


Word Count Program using Hadoop



1) Start the Hortonworks Data Platform (HDP) sandbox through Oracle VirtualBox

HDP provides an enterprise-ready data platform that enables organizations to adopt a modern data architecture. The sandbox is a single-node Hadoop cluster bundled with Pig, Hive and other applications. VirtualBox is a powerful x86 and AMD64/Intel64 virtualization product for enterprise as well as home use. Once you start the virtual machine, the output is as shown above.



2) Open the SSH terminal in VirtualBox



After this screen we need to go to the stated address to open the shell. After navigating to http://127.0.0.1:4200/, the sandbox prompts for a user name and password: the user name is root and the password is hadoop.
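If you prefer a native terminal to the web shell, the sandbox can usually also be reached over SSH on the port that VirtualBox forwards for it (2222 in the default sandbox setup; check your VM's port-forwarding settings if this differs):

ssh root@127.0.0.1 -p 2222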
Look at the output below:




3) Create a directory for the Java classes

To work in the shell we need to know a few Unix commands. Create a single directory to hold the compiled classes:

mkdir WCclasses

4) Copy the three Java programs

There are three programs: SumReducer, WordMapper and WordCount. Find the programs attached below.


SumReducer program:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;



public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reusable writable holding the total count for the current word
    private IntWritable totalWordCount = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context_W)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word
        int wordCount = 0;
        Iterator<IntWritable> it = values.iterator();
        while (it.hasNext()) {
            wordCount += it.next().get();
        }
        // Emit (word, total count)
        totalWordCount.set(wordCount);
        context_W.write(key, totalWordCount);
    }
}


WordCount program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: [input] [output]");
            System.exit(-1);
        }

        // Create the job and declare the types of the final output key/value pair
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Wire in the mapper and reducer classes
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        // Plain text in, plain text out
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Tell Hadoop which jar contains the job classes
        job.setJarByClass(WordCount.class);

        // Submit the job to the cluster; this returns without waiting for completion
        job.submit();
    }
}
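One remark on the driver: job.submit() hands the job to the cluster and returns immediately, so the program exits without waiting for the result; that is why we check progress later in the Hue job browser. If you want the driver to block until the job finishes and report success or failure, the Job API also provides waitForCompletion(true), which returns true only when the job succeeds.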

WordMapper program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reusable writables: the current word and the constant count of 1
    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context_W) throws IOException, InterruptedException {
        // Convert Text to String
        String line = value.toString();
        // Clean up the string - remove all non-alphanumeric characters and lowercase it
        line = line.replaceAll("[^a-zA-Z0-9\\s]", "").toLowerCase();
        // Break the line into words for processing
        StringTokenizer wordList = new StringTokenizer(line);
        while (wordList.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line
            word.set(wordList.nextToken());
            context_W.write(word, one);
        }
    }
}

These programs can be created by opening each file with vi in the shell and pasting the code in by right-clicking and selecting the option 'Paste from browser'.
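For example, a hypothetical session for this step (the file names are my assumption; in Java each file name simply has to match the public class it contains) might look like:

cd ~
vi SumReducer.java
vi WordCount.java
vi WordMapper.java
ls *.java

Inside vi, press i to enter insert mode, right-click and choose 'Paste from browser', then press Esc and type :wq to save and quit.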
                                     
5) Copy the following shell scripts for compiling and running the Java program





The code is given below:
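What follows is only a sketch of what the compile and run scripts could look like on the HDP sandbox. The directory, jar and path names (WCclasses, wordcount.jar, /user/hue/wc, /user/hue/wc-out2) are assumptions based on the names used elsewhere in this post; adjust them to your own setup.

Compile script (compile.sh):

#!/bin/bash
# Compile the three sources against the Hadoop libraries shipped with the sandbox,
# placing the .class files in WCclasses
javac -cp $(hadoop classpath) -d WCclasses SumReducer.java WordCount.java WordMapper.java
# Package the compiled classes into a jar
jar -cvf wordcount.jar -C WCclasses .

Run script (run.sh):

#!/bin/bash
# Remove any previous output directory, then submit the word count job
hdfs dfs -rm -r /user/hue/wc-out2
hadoop jar wordcount.jar WordCount /user/hue/wc /user/hue/wc-out2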



6) Check how the files are reflected in HDFS on the HDP distribution


The commands are given below:

hdfs dfs -rm -r /user/hue/wc-out2
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc

7) Use Hue to upload the data files into an appropriate directory and modify the shell scripts to point to the correct input and output directories
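If you prefer the command line to Hue for this step, the input file can also be uploaded from the shell. A sketch, assuming a local file called words.txt (the file name is hypothetical) and the /user/hue/wc input directory used above:

hdfs dfs -mkdir -p /user/hue/wc
hdfs dfs -put words.txt /user/hue/wc/
hdfs dfs -ls /user/hue/wc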





The screenshots show the job browser reporting the success of the program and the creation of the output directory. As the last step, I am attaching the screenshots of the output.



8) Output
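As an alternative to the Hue file browser, the final word counts can also be inspected directly from the shell once the job has finished; a sketch, assuming the output directory used in the run script above:

hdfs dfs -ls /user/hue/wc-out2
hdfs dfs -cat /user/hue/wc-out2/part-r-00000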







References:

Hadoop images: Google

