Developing Java Map-Reduce on local machine to run on Hadoop Cluster

Big Data & Hadoop

Introduction

In this post, I explain how to develop Hadoop jobs in Java and export a JAR to run on a Hadoop cluster. Most articles on the internet talk about installing the Eclipse Hadoop plugin and using Maven or Ant to build the JAR. To install the Eclipse plugin for Hadoop, you need Eclipse on the same Linux machine where Hadoop is installed. But what do you do if you have installed Apache Hadoop on a VirtualBox Linux VM or a CLI-only Amazon EC2 Linux server?

In this article, I explain the easiest way: develop your Java program and export the JAR directly from Eclipse to the local file system. Beginners can start developing with basic knowledge of core Java.

Note: Basic knowledge of the Eclipse platform is required.

Steps to set up Eclipse for MapReduce

  1. Install Eclipse on your local Windows machine.
  2. Download and save the same version of the Apache Hadoop tarball as you have installed on your Linux box.
  3. Extract the tarball using WinZip or any other compression tool you have.
  4. Open Eclipse and create your project.
  5. Import the Hadoop libraries into your Eclipse project.

Right-click your project -> Build Path -> Configure Build Path

6.  Select Add External JARs.

Navigate to the extracted Hadoop folder (step 2) and add all the JAR files from the locations below.

If the tarball is version 1.2.1:

    ..\hadoop-1.2.1\lib
    ..\hadoop-1.2.1

If it is version 2.2.0:

    ..\hadoop-2.2.0\share\hadoop\common\lib
    ..\hadoop-2.2.0\share\hadoop\common

For MapReduce:

    ..\hadoop-2.2.0\share\hadoop\mapreduce\lib
    ..\hadoop-2.2.0\share\hadoop\mapreduce

For YARN:

    ..\hadoop-2.2.0\share\hadoop\yarn\lib
    ..\hadoop-2.2.0\share\hadoop\yarn

Your project explorer should look like this after adding all the required JARs.


7.   Start developing the MapReduce code for your class files and save.
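As a starting point, here is a sketch of the classic WordCount job against the Hadoop `org.apache.hadoop.mapreduce` API (it compiles against the JARs added in step 6; class and path names are illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: this is the main class that goes into the JAR manifest (step 9)
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // on 2.x, Job.getInstance(conf, "word count")
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The driver's `main` method takes the input and output HDFS paths as its two arguments; the combiner reuses the reducer since summation is associative.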

 8.   Now export your program as a JAR to execute on the Hadoop VM.

Right-click the project -> Export -> Java -> JAR file

You will see the window below.


Select the first option: Export generated class files and resources.

Browse and select the destination to save the JAR on the local disk.

9.   Proceed through the next screens with the default settings.


This step is very important: specify the name of the main driver class here. It is saved in the manifest as the entry point of the program.

If you don't specify it here, you will have to pass the fully qualified driver class name as a command-line argument when executing the job.
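For reference, the manifest inside the exported JAR (`META-INF/MANIFEST.MF`) would then contain an entry like the following, assuming a driver class named `WordCount` in the default package (an illustrative name):

```
Manifest-Version: 1.0
Main-Class: WordCount
```

With this entry present, `hadoop jar` can locate the entry point without the class name being given on the command line.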

 10.    Click Finish and your JAR file will be saved at the chosen destination on the local disk.

11.     Move the JAR file to the Linux VM / AWS EC2 instance using WinSCP.

Now the job is on your cluster machine. Execute it from the command line with `hadoop jar`, for example `hadoop jar MyJob.jar /input/path /output/path` (add the driver class name right after the JAR name if it is not set in the manifest).

