Sunday, January 5, 2014

Big Data: Hadoop 2.x/YARN Multi-Node Cluster Installation

Apache Hadoop 2/YARN/MR2 Multi-node Cluster Installation for Beginners:
In this blog ,I will describe the steps for setting up a distributed, multi-node Hadoop cluster running on Red Hat Linux/CentOS Linux distributions.Now we are comfortable with installation and  execution of MapReduce applications on  Single node in  Pseudo-distributed Mode. [ Click here for the details on single node installation  ].Let us move one step forward  to deploy multi-node cluster .

Whats Big data ? 
Whats Hadoop ?

Hadoop Cluster:
Hadoop Cluster is designed for distributed processing of large data sets across group of commodity machines (low-cost servers). The Data could be unstructured, semi-structured and also could be structured data.It is designed to scale up to thousands of machines, with a  high degree of fault tolerance and software has the intelligence to detect & handle the failures at the application layer. 

Thre are 3 types of machines based on their specific roles  in Hadoop cluster environment

1] Client machines :
   - Loading  the data (input files) into the cluster
   - Submission of jobs (in our case - its a MapReduce Job)
   - Collect the result and view the analytics 

2] Master nodes :

  - The Name Node coordinates the data storage function (HDFS) keeping the Meta data information 

- The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application.

3] Slave nodes :
Major part of cluster consists of Slave Nodes to perform computation .
The NodeManager manages each node within a YARN cluster. The NodeManager provides per-node services within the cluster - management of a container over its life cycle to monitoring resources and tracking the health of its node.  

Container represents an allocated resource in the cluster. The resource Manager is the sole authority to allocate any container to applications. The allocated container is always on a single node and has unique containerID. It has a specific amount of resource allocated. Typically, an ApplicationMaster receive the container from the ResourceManager during resource negotiation and then talks to the NodeManager to start/stop container. Resource models a set of computer resources. Currently it only models Memeory [may be in future other resources like CPUs will be added ]. 

YARN Architecture [More details available@ source]

                                          Block Diagram Representation

                                             Terminology  and Architecture

MRv2 Architecture [click for source]


 My Cluster setup  has single master node and a slave-node .

1]  Lets have 2 Machines (or   two VMs )  with sufficient resources to run MapReduce application.

2] Both machines are installed with hadoop 2.x as described in link here.
    Keep all configurations and paths same across all nodes in the cluster.

3] Bring down all daemons running  on those machines.


NOTE:  There are  2 nodes: "spb-master"  as a master  &   "spb-slave"  as a slave
                               spb-master's IP address is
                               spb-slave's  IP address  is


Step 1:  First thing here  is to establish a network between master node and slave node.
Assign  IP address to eth0 interface  of node1 and node 2  and include those IP address  and hostname to /etc/hosts file  as shown here.

NODE1 : spb-master

NODE2 : spb-slave


Step 2 : Establish password-less SSH session between master and slave nodes.

Verify password-less session from master to slave node :

Verify password-less session from Slave to master  node:

Now  both Machines are almost ready and communicating without prompting for password .

Step 3: Additional configuration required  at Master node :

HADOOP_PREFIX/etc/hadoop/slaves  file should contain  list of all slave nodes .

[root@spb-master hadoop]# cat slaves
[root@spb-master hadoop]#

NOTE: Here setup configured to have DataNode on master also (dual role).

If you have  many number of slave nodes you could list then like

cat HADOOP_PREFIX/etc/hadoop/slaves
spb-slave 8

Step 4 : Other configurations remain same as copied below across all the  nodes in cluster.

[root@spb-master hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



[root@spb-master hadoop]# cat
# Copyright 2011 The Apache Software Foundation
export JAVA_HOME=/usr/java/default
export HADOOP_PREFIX=/opt/hadoop-2.2.0
export HADOOP_HDFS_HOME=/opt/hadoop-2.2.0
export HADOOP_COMMON_HOME=/opt/hadoop-2.2.0
export HADOOP_MAPRED_HOME=/opt/hadoop-2.2.0
export HADOOP_YARN_HOME=/opt/hadoop-2.2.0
export HADOOP_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop/
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_LOG_DIR=/opt/hadoop-2.2.0/logs
# The maximum amount of heap to use, in MB. Default is 1000.

[root@spb-master hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

[root@spb-master hadoop]#
[root@spb-master hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
[root@spb-master hadoop]#
[root@spb-master hadoop]# cat

export JAVA_HOME=/usr/java/default
export HADOOP_PREFIX=/opt/hadoop-2.2.0
export HADOOP_HDFS_HOME=/opt/hadoop-2.2.0
export HADOOP_COMMON_HOME=/opt/hadoop-2.2.0
export HADOOP_MAPRED_HOME=/opt/hadoop-2.2.0
export HADOOP_YARN_HOME=/opt/hadoop-2.2.0
export HADOOP_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop/
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"


# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately

[root@spb-master hadoop]# cat yarn-site.xml
<?xml version="1.0"?>


 Step 5 : Format the Namenode .

[hdfs@localhost bin]$ ./hdfs namenode -format
14/01/04 05:34:42 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = spb-master/
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop.
14/01/04 05:34:45 INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.
14/01/04 05:34:45 INFO namenode.FSImage: Saving image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
14/01/04 05:34:45 INFO namenode.FSImage: Image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
14/01/04 05:34:45 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/01/04 05:34:45 INFO util.ExitUtil: Exiting with status 0
14/01/04 05:34:45 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at spb-master/
[hdfs@localhost bin]$
Step 6: Start the Hadoop  and Yarn daemons 

 [hdfs@spb-master sbin]$ ./
This script is Deprecated. Instead use and
Starting namenodes on [spb-master]
spb-master: starting namenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-namenode-spb-master.out
spb-master: starting datanode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-datanode-spb-master.out
spb-slave: starting datanode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-datanode-spb-slave.out
Starting secondary namenodes [] starting secondarynamenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-spb-master.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out
spb-slave: starting nodemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-spb-slave.out
spb-master: starting nodemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-spb-master.out
[hdfs@spb-master sbin]$

Pictorial representation of a YARN cluster[Click here for Source].
Step 7 : Daemons started on Master node :

[hdfs@spb-master sbin]$ jps
19296 Jps
18549 NameNode
18952 ResourceManager
18642 DataNode
19050 NodeManager
18818 SecondaryNameNode
[hdfs@spb-master sbin]$
Step 8: Daemons Started on Slave node:

[hdfs@spb-slave nn]$
[hdfs@spb-slave nn]$ jps
25775 DataNode
25891 NodeManager
25992 Jps
[hdfs@spb-slave nn]$
Step 9:  Web Interface  to check the cluster Nodes
Step 10 : Web Interface to view  the Health of Cluster:
Step  11: Check the logfiles  for successful launching of each daemons .

 Step 12 :  Run few POSIX commands on HDFS File system :

 Create a hellofile and hellofile2 in local filesystem and then copy them to HDFS as show below:

[@spb-master bin]# ./hadoop fs -put hellofile /users/hellofile
[@spb-master bin]# ./hadoop fs -put hellofile /users/hellofile2
 Step 13 : Make sure that our input files are available on HDFS to run MapReduce Application
[@spb-master bin]# ./hadoop fs -ls /users
Found 2 items
-rw-r--r--   1 root supergroup     262705 2014-01-06 04:43 /users/hellofile
-rw-r--r--   1 root supergroup      25696 2014-01-06 04:46 /users/hellofile2
[@spb-master bin]#
Step 14 : Run application program "WordCount"  from hadoop-mapreduce-examples-2.2.0.jar

WordCount Example:
WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab.Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>

input file @ in-dir =/users/hellofile2
output-file @out-dir=/users/Out_hello
 [hdfs@spb-master bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /users/hellofile2 /users/Out_hello
14/01/06 04:54:39 INFO client.RMProxy: Connecting to ResourceManager at spb-master/
14/01/06 04:54:42 INFO input.FileInputFormat: Total input paths to process : 1
14/01/06 04:54:44 INFO mapreduce.JobSubmitter: number of splits:1
14/01/06 04:54:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1389011868970_0001
14/01/06 04:54:47 INFO impl.YarnClientImpl: Submitted application application_1389011868970_0001 to ResourceManager at spb-master/
14/01/06 04:54:47 INFO mapreduce.Job: The url to track the job: http://spb-master:8088/proxy/application_1389011868970_0001/
14/01/06 04:54:47 INFO mapreduce.Job: Running job: job_1389011868970_0001
14/01/06 04:56:12 INFO mapreduce.Job: Job job_1389011868970_0001 running in uber mode : false
14/01/06 04:56:12 INFO mapreduce.Job:  map 0% reduce 0%
14/01/06 05:03:01 INFO mapreduce.Job:  map 100% reduce 0%
14/01/06 05:03:28 INFO mapreduce.Job:  map 100% reduce 100%
14/01/06 05:03:29 INFO mapreduce.Job: Job job_1389011868970_0001 completed successfully
14/01/06 05:03:31 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=152
        FILE: Number of bytes written=158433
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=25800
        HDFS: Number of bytes written=120
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=398307
        Total time spent by all reduces in occupied slots (ms)=12576
    Map-Reduce Framework
        Map input records=689
        Map output records=4817
        Map output bytes=44504
        Map output materialized bytes=152
        Input split bytes=104
        Combine input records=4817
        Combine output records=13
        Reduce input groups=13
        Reduce shuffle bytes=152
        Reduce input records=13
        Reduce output records=13
        Spilled Records=26
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=236
        CPU time spent (ms)=6320
        Physical memory (bytes) snapshot=292528128
        Virtual memory (bytes) snapshot=2350538752
        Total committed heap usage (bytes)=132321280
    Shuffle Errors
    File Input Format Counters
        Bytes Read=25696
    File Output Format Counters
        Bytes Written=120
[hdfs@spb-master bin]$ 

Step 15:Web Interface  to view the running application on cluster.
Status of Running job and Cluster Matrics
Step 16: Verify the status of FINISHED application :

Step 17:Web Interface  to view the Output file

 Step 18:  You could also verify using POSIX commands :

[@spb-master bin]# ./hadoop fs -ls /users
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2014-01-06 05:03 /users/Out_hello
-rw-r--r--   1 root supergroup     262705 2014-01-06 04:43 /users/hellofile
-rw-r--r--   1 root supergroup      25696 2014-01-06 04:46 /users/hellofile2
[@spb-master bin]# __________________________________________________________________

Step 19:Run another sample application program "pi"  from hadoop-mapreduce-examples-2.2.0.jar and view the status "RUNNING" 

Step 20: List of default PORTs  for each components of Hadoop Ecosystem .
Example :

Similarly you could find  lots of information about WI ports, default port at would clarify your doubts on using the port numbers in Configuration Files.
2)  Hadoop: The Definitive Guide by Tom White:


That’s quite a few points! I hope this quick overview on Hadoop Cluster has been helpful.  :)

Click here : Overview of  apache Hadoop .
Click here:  Big Data Revolution and Vision ........!!!
Click here :  Single-node Cluster setup and Implementation.

Click here : Big Data : Watson - Era of cognitive computing !!!