Welcome to the world of Hadoop, a small baby elephant: Big Data Assignment for Praxis Business School, Part 1



Big Data has attracted the attention of corporates, individuals and big honchos in the field of analytics. Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so massive that it is difficult to process using traditional database and software techniques.

A small baby elephant by the name of Hadoop comes to the rescue.


Hadoop is a tool that solves the problem of processing large amounts of data (terabytes and beyond) by combining a number of computers into a cluster.

Before proceeding with the objective of this blog, we must understand two principal concepts of Hadoop:

a) HDFS: the Hadoop Distributed File System distributes large files across multiple machines in a way that is invisible to the user.

b) MapReduce: This is the crux of Hadoop. A job can be broken down into two independent tasks, map and reduce. Let's take a small example: a room contains coloured balls, and we need to count the balls of each colour. The map program would identify the colour of each ball and attach a post-it for it on the table, whereas the reduce program would add up the number of post-its per colour.
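For the word count problem the flow is the same idea. A rough sketch, assuming a tiny two-line input, looks like this:

Input lines:              "hadoop is fun"   and   "hadoop is big"
Map output (word, 1):     (hadoop,1) (is,1) (fun,1) (hadoop,1) (is,1) (big,1)
Shuffle (group by word):  (big,[1]) (fun,[1]) (hadoop,[1,1]) (is,[1,1])
Reduce output (totals):   (big,1) (fun,1) (hadoop,2) (is,2)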

I am an analytics enthusiast who got a chance to work on the word count program on the Hadoop platform. This blog is meant as a simple guide to running a word count program.


Word Count Program using Hadoop



1) Start the Hortonworks Data Platform (HDP) sandbox through Oracle VirtualBox

HDP provides an enterprise-ready data platform that enables organizations to adopt a modern data architecture. The sandbox is a single-node Hadoop cluster bundled with Pig, Hive and other applications. VirtualBox is a powerful x86 and AMD64/Intel64 virtualization product for enterprise as well as home use. Once you start the virtual machine, the output is as shown above.



2) Open the SSH terminal in VirtualBox



After this screen we need to go to the stated address to open the shell. After navigating to http://127.0.0.1:4200/, the sandbox prompts for a user name and password: the user name is root and the password is hadoop.
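If you prefer a native terminal to the web shell, the sandbox can usually also be reached over SSH on the port that VirtualBox forwards for it (2222 in the default sandbox setup; check your VM's port-forwarding settings if this differs):

ssh root@127.0.0.1 -p 2222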
Look at the output below:




3) Create a directory for the Java classes

To work in the shell we need to know a few Unix commands. Create a single directory to hold the compiled classes:

mkdir WCclasses

4) Copy the three Java programs

There are three programs: SumReducer, WordMapper and WordCount. Find the programs attached below.


SumReducer program:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;



public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reusable writable holding the total count for the current word
    private IntWritable totalWordCount = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context_W)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word
        int wordCount = 0;
        Iterator<IntWritable> it = values.iterator();
        while (it.hasNext()) {
            wordCount += it.next().get();
        }
        // Emit (word, total count)
        totalWordCount.set(wordCount);
        context_W.write(key, totalWordCount);
    }
}


WordCount program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: [input] [output]");
            System.exit(-1);
        }

        // Create the job and declare the types of the final output key/value pair
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Wire in the mapper and reducer classes
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        // Plain text in, plain text out
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Tell Hadoop which jar contains the job classes
        job.setJarByClass(WordCount.class);

        // Submit the job to the cluster; this returns without waiting for completion
        job.submit();
    }
}
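One remark on the driver: job.submit() hands the job to the cluster and returns immediately, so the program exits without waiting for the result; that is why we check progress later in the Hue job browser. If you want the driver to block until the job finishes and report success or failure, the Job API also provides waitForCompletion(true), which returns true only when the job succeeds.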

WordMapper program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reusable writables: the current word and the constant count of 1
    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context_W) throws IOException, InterruptedException {
        // Convert Text to String
        String line = value.toString();
        // Clean up the string - remove all non-alphanumeric characters and lowercase it
        line = line.replaceAll("[^a-zA-Z0-9\\s]", "").toLowerCase();
        // Break the line into words for processing
        StringTokenizer wordList = new StringTokenizer(line);
        while (wordList.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line
            word.set(wordList.nextToken());
            context_W.write(word, one);
        }
    }
}

These programs can be created by opening each file with vi in the shell and pasting the code in by right-clicking and selecting the option 'Paste from browser'.
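For example, a hypothetical session for this step (the file names are my assumption; in Java each file name simply has to match the public class it contains) might look like:

cd ~
vi SumReducer.java
vi WordCount.java
vi WordMapper.java
ls *.java

Inside vi, press i to enter insert mode, right-click and choose 'Paste from browser', then press Esc and type :wq to save and quit.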
                                     
5) Copy the following shell scripts for compiling and running the Java program





The code is given below:
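What follows is only a sketch of what the compile and run scripts could look like on the HDP sandbox. The directory, jar and path names (WCclasses, wordcount.jar, /user/hue/wc, /user/hue/wc-out2) are assumptions based on the names used elsewhere in this post; adjust them to your own setup.

Compile script (compile.sh):

#!/bin/bash
# Compile the three sources against the Hadoop libraries shipped with the sandbox,
# placing the .class files in WCclasses
javac -cp $(hadoop classpath) -d WCclasses SumReducer.java WordCount.java WordMapper.java
# Package the compiled classes into a jar
jar -cvf wordcount.jar -C WCclasses .

Run script (run.sh):

#!/bin/bash
# Remove any previous output directory, then submit the word count job
hdfs dfs -rm -r /user/hue/wc-out2
hadoop jar wordcount.jar WordCount /user/hue/wc /user/hue/wc-out2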



6) Check how the files are reflected in HDFS on the HDP distribution


The commands are given below:

hdfs dfs -rm -r /user/hue/wc-out2
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc

7) Use Hue to upload the data files into an appropriate directory and modify the shell scripts to point to the correct input and output directories
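If you prefer the command line to Hue for this step, the input file can also be uploaded from the shell. A sketch, assuming a local file called words.txt (the file name is hypothetical) and the /user/hue/wc input directory used above:

hdfs dfs -mkdir -p /user/hue/wc
hdfs dfs -put words.txt /user/hue/wc/
hdfs dfs -ls /user/hue/wc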





The screenshots show the job browser reporting the success of the program and the creation of the output directory. As the last step, I am attaching the screenshots of the output.



8) Output
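As an alternative to the Hue file browser, the final word counts can also be inspected directly from the shell once the job has finished; a sketch, assuming the output directory used in the run script above:

hdfs dfs -ls /user/hue/wc-out2
hdfs dfs -cat /user/hue/wc-out2/part-r-00000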







References:

Hadoop images: Google

