Hadoop MapReduce v2: Setting Up a Single Node Cluster

Supported Platforms

Gnu/Linux

Unlike Hadoop v1, Hadoop v2 also supports Windows. For more details, see the Apache Hadoop wiki page.

Requirements

  • Java
    Hadoop runs on the Java runtime.
  • ssh
    ssh is needed to manage remote Hadoop daemons.

Downloads

Download the latest version of Hadoop from an Apache Hadoop mirror site.

The latest stable version of Hadoop at the time of this post is 2.4.1.

Once downloaded, unpack the tarball.
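For example (the file name assumes the 2.4.1 release mentioned above):

    $ tar xzf hadoop-2.4.1.tar.gz
    $ cd hadoop-2.4.1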

Setting Up Global Configurations

Open the file at ${HADOOP_INSTALL}/etc/hadoop/hadoop-env.sh, then edit the environment variables as follows:
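At minimum, set JAVA_HOME. The paths below are only examples; point them at your own JDK and Hadoop install:

    # adjust to the root of your Java installation (example path)
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

    # adjust to where you unpacked Hadoop (example path, optional)
    export HADOOP_PREFIX=/usr/local/hadoop-2.4.1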

 

When finished, test it as follows:
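Running the hadoop script without arguments should print its usage documentation:

    $ bin/hadoop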

 

 

Standalone Operation

By default, Hadoop runs in a non-distributed mode, as a single Java process.
Let’s test a sample MR job as follows:
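The commands below follow the official guide; the jar name embeds the release version, so adjust 2.4.1 to your download:

    $ mkdir input
    $ cp etc/hadoop/*.xml input
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep input output 'dfs[a-z.]+'
    $ cat output/*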

 

 

First, prepare the input data to process: just copy some configuration files into the input/ directory.

Then, run the sample Hadoop jar, which finds the lines matching the pattern ‘dfs[a-z.]+’.

Finally, check the result in the output/ directory.

Exploring the sample MR job

As shown in the command for running the sample Hadoop job, no main class is given, i.e., there is no obvious starting point. After inspecting the sample code, I found the starting point, org.apache.hadoop.examples.ExampleDriver.java:
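Abridged, with only the grep entry kept (the real class registers every example the same way), it looks roughly like this:

    import org.apache.hadoop.util.ProgramDriver;

    // Abridged sketch of org.apache.hadoop.examples.ExampleDriver
    public class ExampleDriver {

      public static void main(String[] argv) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
          // each example is registered under the name used on the command line
          pgd.addClass("grep", Grep.class,
              "A map/reduce program that counts the matches of a regex in the input.");
          // ... wordcount, sort, pi, and the other examples are added the same way
          pgd.driver(argv);
          exitCode = 0;
        } catch (Throwable e) {
          e.printStackTrace();
        }
        System.exit(exitCode);
      }
    }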

Depending on the argument, it runs the appropriate driver class, in this case Grep.java.
Now let’s look at the Grep.java class:
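An abridged sketch of the class; error handling and the optional match-group argument are trimmed:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
    import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Abridged sketch of org.apache.hadoop.examples.Grep:
    // two chained jobs, a "search" job followed by a "sort" job.
    public class Grep extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set(RegexMapper.PATTERN, args[2]);   // the regex given on the command line

        Path tempDir = new Path("grep-temp-" + new Random().nextInt(Integer.MAX_VALUE));

        // job 1: find matches of the pattern and count them per match
        Job grepJob = Job.getInstance(conf, "grep-search");
        FileInputFormat.setInputPaths(grepJob, args[0]);
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(grepJob, tempDir);
        grepJob.waitForCompletion(true);

        // job 2: swap (match, count) to (count, match) and sort by count, descending
        Job sortJob = Job.getInstance(conf, "grep-sort");
        FileInputFormat.setInputPaths(sortJob, tempDir);
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        sortJob.setMapperClass(InverseMapper.class);
        sortJob.setNumReduceTasks(1);
        sortJob.setSortComparatorClass(LongWritable.DecreasingComparator.class);
        FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
        sortJob.waitForCompletion(true);

        FileSystem.get(conf).delete(tempDir, true);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Grep(), args));
      }
    }

The second job exists only to present the matches sorted by descending count, which is why the intermediate directory is written as SequenceFiles and deleted at the end.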

As in MR version 1, the driver class extends the Configured class and implements the Tool interface. The interesting part of this class is the sortJob and its mapper class, InverseMapper. As you can see from the class description, it just swaps the key and the value.
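Paraphrasing the Hadoop source, the whole mapper boils down to this:

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Mapper;

    // Paraphrase of org.apache.hadoop.mapreduce.lib.map.InverseMapper
    public class InverseMapper<K, V> extends Mapper<K, V, V, K> {

      @Override
      public void map(K key, V value, Context context)
          throws IOException, InterruptedException {
        context.write(value, key);  // swap key and value
      }
    }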

Also note that the Grep driver class uses RegexMapper.java and LongSumReducer.java. In MR version 1, I think I had to write all such mappers and reducers myself.

RegexMapper matches the input against the pattern and emits each match to the output:
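Paraphrased from the Hadoop source, with the optional match-group handling omitted:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Paraphrase of org.apache.hadoop.mapreduce.lib.map.RegexMapper
    public class RegexMapper<K> extends Mapper<K, Text, Text, LongWritable> {

      public static String PATTERN = "mapreduce.mapper.regex";

      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // the regex is passed in through the job configuration
        pattern = Pattern.compile(context.getConfiguration().get(PATTERN));
      }

      @Override
      public void map(K key, Text value, Context context)
          throws IOException, InterruptedException {
        // emit (match, 1) for every occurrence of the pattern in the line
        Matcher matcher = pattern.matcher(value.toString());
        while (matcher.find()) {
          context.write(new Text(matcher.group()), new LongWritable(1));
        }
      }
    }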

The LongSumReducer class sums up the input values per key:
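Again paraphrased from the Hadoop source:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Paraphrase of org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer
    public class LongSumReducer<KEY> extends Reducer<KEY, LongWritable, KEY, LongWritable> {

      private LongWritable result = new LongWritable();

      @Override
      public void reduce(KEY key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : values) {
          sum += val.get();        // add up all counts for this key
        }
        result.set(sum);
        context.write(key, result);
      }
    }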

Pseudo-Distributed Operation

Hadoop can also run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

Configuration
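The two files to edit are etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml, roughly as in the official single-node guide. core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

and hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>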

* In version 1, the property name was fs.default.name

* To run in pseudo-distributed mode, set the value of dfs.replication to 1

Setting up passphraseless ssh

Check that you can ssh to localhost without a passphrase.
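For example:

    $ ssh localhost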

If the ssh fails or asks for a passphrase, do the following:
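Generate a key without a passphrase and authorize it; the key type here is just an example:

    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 0600 ~/.ssh/authorized_keys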

Execution – HDFS

To run HDFS:

1. Format the file system:
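From the Hadoop install directory:

    $ bin/hdfs namenode -format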

 

 

2. Start the NameNode and a single DataNode daemon:
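The daemons are started with the bundled script:

    $ sbin/start-dfs.sh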

* Note that the command directories differ: the format command “hdfs” lives in the bin/ directory, while the daemon scripts “start-*.sh” live in the sbin/ directory

* You can check daemons as follows:
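One simple way is jps from the JDK (the screenshot below grepped the process list instead):

    $ jps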

[Screenshot: hdfs_node_process_grep (process list of the running HDFS daemons)]
As you can see, in addition to the NameNode and DataNode, a SecondaryNameNode is also running by default.

* As in the previous version, an HDFS web interface is available at http://localhost:50070/

[Screenshot: hdfs_web_console (the new HDFS web interface)]
It’s beautiful compared to the previous text-based web interface. Moreover, it uses Twitter Bootstrap, so it is responsive.

Running a sample job

Now let’s run the above sample job locally on the pseudo-distributed HDFS.

1. Make the HDFS directories required to execute MapReduce jobs
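As in the official guide (replace <username> with your own user name):

    $ bin/hdfs dfs -mkdir /user
    $ bin/hdfs dfs -mkdir /user/<username>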

 

2. Copy the input files for the sample job into HDFS
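For example:

    $ bin/hdfs dfs -put etc/hadoop input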

* You can check the filesystem through a console
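For instance:

    $ bin/hdfs dfs -ls input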

 

or through the web interface: Utilities > Browse the file system

[Screenshot: hdfs_web_console_utilites (browsing the filesystem in the web interface)]
3. Run the sample MapReduce job locally (that is, not on YARN)
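Same jar and arguments as in the standalone run; only the input and output paths now live in HDFS (adjust the version in the jar name to your release):

    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep input output 'dfs[a-z.]+'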

 

* You can check the result through the console or the web interface
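For example, from the console:

    $ bin/hdfs dfs -cat output/*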


 

YARN on a Single Node

Let’s run the sample MapReduce job on YARN in pseudo-distributed mode.

Configuration
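etc/hadoop/mapred-site.xml, roughly as in the official single-node guide:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>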

* It specifies only the MapReduce framework name, which implies there can be frameworks other than YARN
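And etc/hadoop/yarn-site.xml, again roughly as in the official guide:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>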

 

* Right now, I have no idea what mapreduce_shuffle means. Maybe this page would be helpful.

Start the ResourceManager and NodeManager daemons
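Again with the bundled script:

    $ sbin/start-yarn.sh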

 

* You can check daemons as follows:
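Running jps again should now show the ResourceManager and NodeManager in addition to the HDFS daemons (the screenshot below used a process grep instead):

    $ jps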

[Screenshot: mapreduce_node_process_grep (process list of the running YARN daemons)]
* From the resource-management point of view, it is easy to think of the ResourceManager as the JobTracker and the NodeManager as the TaskTracker.
You can see the web interface for the ResourceManager at http://localhost:8088/

[Screenshot: jobtracker_web_console (the ResourceManager web interface)]

For a fully-distributed cluster setup, see here.

References

Hadoop MapReduce Next Generation – Setting up a Single Node Cluster
