LINUX & HPC : Advanced Large Scale Computing at a Glance !: Big Data: Hadoop 2.x/YARN Multi-Node Cluster Installation

Apache Hadoop 2/YARN/MR2 Multi-node Cluster Installation for Beginners:
In this blog ,I will describe the steps for setting up a distributed, multi-node Hadoop cluster running on Red Hat Linux/CentOS Linux distributions.Now we are comfortable with installation and execution of MapReduce applications on Single node in Pseudo-distributed Mode. [ Click here for the details on single node installation ].Let us move one step forward to deploy multi-node cluster .

Whats Big data ?
Whats Hadoop ?

Hadoop Cluster:
Hadoop Cluster is designed for distributed processing of large data sets across group of commodity machines (low-cost servers). The Data could be unstructured, semi-structured and also could be structured data.It is designed to scale up to thousands of machines, with a high degree of fault tolerance and software has the intelligence to detect & handle the failures at the application layer.

Thre are 3 types of machines based on their specific roles in Hadoop cluster environment

1] Client machines :
   - Loading the data (input files) into the cluster
   - Submission of jobs (in our case - its a MapReduce Job)
   - Collect the result and view the analytics

2] Master nodes :
- The Name Node coordinates the data storage function (HDFS) keeping the Meta data information
- The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application.

3] Slave nodes :
Major part of cluster consists of Slave Nodes to perform computation .
The NodeManager manages each node within a YARN cluster. The NodeManager provides per-node services within the cluster - management of a container over its life cycle to monitoring resources and tracking the health of its node.
Container represents an allocated resource in the cluster. The resource Manager is the sole authority to allocate any container to applications. The allocated container is always on a single node and has unique containerID. It has a specific amount of resource allocated. Typically, an ApplicationMaster receive the container from the ResourceManager during resource negotiation and then talks to the NodeManager to start/stop container. Resource models a set of computer resources. Currently it only models Memeory [may be in future other resources like CPUs will be added ].

YARN Architecture [More details available@ source]

Block Diagram Representation

Terminology and Architecture

MRv2 Architecture [click for source]

Prerequisites:

My Cluster setup has single master node and a slave-node .

1] Lets have 2 Machines (or   two VMs ) with sufficient resources to run MapReduce application.

2] Both machines were installed with hadoop 2.x as described in link here.
Keep all configurations and paths same across all nodes in the cluster.

3] Bring down all daemons running on those machines.

    HADOOP_PREFIX/sbin/stop-all.sh

NOTE: There are 2 nodes: "spb-master" as a master & "spb-slave" as a slave
                               spb-master's IP address is   192.X.Y.Z
                               spb-slave's IP address is   192.A.B.C

--------------------------------------------------------------------------------------------------------

Step 1: First thing here is to establish a network between master node and slave node.
Assign IP address to eth0 interface of node1 and node 2 and include those IP address and hostname to /etc/hosts file as shown here.

NODE1 : spb-master

NODE2 : spb-slave
___________________________________________________________________

Step 2 : Establish password-less SSH session between master and slave nodes.

Verify password-less session from master to slave node :

Verify password-less session from Slave to master node:

Now both Machines are almost ready and communicating without prompting for password .
______________________________________________________________________________

Step 3: Additional configuration required at Master node :

HADOOP_PREFIX/etc/hadoop/slaves file should contain list of all slave nodes .

[root@spb-master hadoop]# cat slaves
spb-master
spb-slave
[root@spb-master hadoop]#

NOTE: Here setup configured to have DataNode on master also (dual role).

If you have many number of slave nodes you could list then like

cat HADOOP_PREFIX/etc/hadoop/slaves
spb-slave1
spb-slave2
spb-slave3
spb-slave4
spb-slave5
spb-slave6
spb-slave7
spb-slave 8

___________________________________________________________

Step 4 : Other configurations remain same as copied below across all the nodes in cluster.

[root@spb-master hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://spb-master:9000</value>
        </property>
</configuration>

-------------------------------------------------

[root@spb-master hadoop]# cat hadoop-env.sh

export JAVA_HOME=$BIN/java/default

export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

export HADOOP_LOG_DIR=$PACKAGE_HOME/hadoop-2.2.0/logs

# The maximum amount of heap to use, in MB. Default is 1000.

export HADOOP_HEAPSIZE=500

export HADOOP_NAMENODE_INIT_HEAPSIZE="500"

export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="200"

----------------------------------------------------------------

[root@spb-master hadoop]# cat hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<name>dfs.replication</name>

</property>

<name>dfs.namenode.name.dir</name>

<value>file:$DATA_DIR/data/hadoop/hdfs/nn</value>

</property>

<name>dfs.datanode.data.dir</name>

<value>file:$DATA_DIR/data/hadoop/hdfs/dn</value>

</property>

<name>dfs.permissions</name>

<value>false</value>

</property>

</configuration>

[root@spb-master hadoop]#

----------------------------------------------------------------------------------
[root@spb-master hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
[root@spb-master hadoop]#
----------------------------------------------------------------------------------------

[root@spb-master hadoop]# cat yarn-env.sh

export JAVA_HOME=$BIN/java/default

export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

JAVA=$JAVA_HOME/bin/java

JAVA_HEAP_MAX=-Xmx500m

# For setting YARN specific HEAP sizes please use this

# Parameter and set appropriately

YARN_HEAPSIZE=500

------------------------------------------------------------------------------------------------

[root@spb-master hadoop]# cat yarn-site.xml
<?xml version="1.0"?>

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>spb-master:8025</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>spb-master:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>spb-master:8040</value>
        </property>
</configuration>

Step 5 : Format the Namenode .

[hdfs@localhost bin]$ ./hdfs namenode -format
14/01/04 05:34:42 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = spb-master/9.124.45.125
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = $PACKAGE_DIR/hadoop-2.2.0/etc/hadoop/:$PACKAGE_DIR/hadoop-2.2.0/share/hadoop.
.
.
.
.
14/01/04 05:34:45 INFO common.Storage: Storage directory $DATA_DIR/data/hadoop/hdfs/nn has been successfully formatted.
14/01/04 05:34:45 INFO namenode.FSImage: Saving image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
14/01/04 05:34:45 INFO namenode.FSImage: Image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
14/01/04 05:34:45 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/01/04 05:34:45 INFO util.ExitUtil: Exiting with status 0
14/01/04 05:34:45 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at spb-master/9.124.45.125
************************************************************/
[hdfs@localhost bin]$
_____________________________________________________________________
Step 6: Start the Hadoop and Yarn daemons

[hdfs@spb-master sbin]$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [spb-master]
spb-master: starting namenode, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/hadoop-hdfs-namenode-spb-master.out
spb-master: starting datanode, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/hadoop-hdfs-datanode-spb-master.out
spb-slave: starting datanode, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/hadoop-hdfs-datanode-spb-slave.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-spb-master.out
starting yarn daemons
starting resourcemanager, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out
spb-slave: starting nodemanager, logging to $PACKAGE_DIR/$PACKAGE_DIR/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-spb-slave.out

spb-master: starting nodemanager, logging to $PACKAGE_DIR/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-spb-master.out
[hdfs@spb-master sbin]$

Pictorial representation of a YARN cluster[Click here for Source].

_____________________________________________________________________
Step 7 : Daemons started on Master node :

[hdfs@spb-master sbin]$ jps
19296 Jps
18549 NameNode
18952 ResourceManager
18642 DataNode
19050 NodeManager
18818 SecondaryNameNode
[hdfs@spb-master sbin]$
________________________________________________________
Step 8: Daemons Started on Slave node:

[hdfs@spb-slave nn]$
[hdfs@spb-slave nn]$ jps
25775 DataNode
25891 NodeManager
25992 Jps
[hdfs@spb-slave nn]$
__________________________________________________________
Step 9: Web Interface to check the cluster Nodes

Step 10 : Web Interface to view the Health of Cluster:

Step 11: Check the logfiles for successful launching of each daemons .

_______________________________________________________________________
Step 12 : Run few POSIX commands on HDFS File system :

Create a hellofile and hellofile2 in local filesystem and then copy them to HDFS as show below:

[@spb-master bin]# ./hadoop fs -put hellofile /users/hellofile
[@spb-master bin]# ./hadoop fs -put hellofile /users/hellofile2
___________________________________________________________________
Step 13 : Make sure that our input files are available on HDFS to run MapReduce Application
[@spb-master bin]# ./hadoop fs -ls /users
Found 2 items
-rw-r--r--   1 root supergroup     262705 2014-01-06 04:43 /users/hellofile
-rw-r--r--   1 root supergroup      25696 2014-01-06 04:46 /users/hellofile2
[@spb-master bin]#
__________________________________________________________________
Step 14 : Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar

WordCount Example:
WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab.Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>
where
input file @ in-dir =/users/hellofile2
output-file @out-dir=/users/Out_hello
-----------------------------------------------
[hdfs@spb-master bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /users/hellofile2 /users/Out_hello
14/01/06 04:54:39 INFO client.RMProxy: Connecting to ResourceManager at spb-master/192.168.1.3:8040
14/01/06 04:54:42 INFO input.FileInputFormat: Total input paths to process : 1
14/01/06 04:54:44 INFO mapreduce.JobSubmitter: number of splits:1
14/01/06 04:54:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1389011868970_0001
14/01/06 04:54:47 INFO impl.YarnClientImpl: Submitted application application_1389011868970_0001 to ResourceManager at spb-master/192.168.1.3:8040
14/01/06 04:54:47 INFO mapreduce.Job: The url to track the job: http://spb-master:8088/proxy/application_1389011868970_0001/
14/01/06 04:54:47 INFO mapreduce.Job: Running job: job_1389011868970_0001
14/01/06 04:56:12 INFO mapreduce.Job: Job job_1389011868970_0001 running in uber mode : false
14/01/06 04:56:12 INFO mapreduce.Job: map 0% reduce 0%
14/01/06 05:03:01 INFO mapreduce.Job: map 100% reduce 0%
14/01/06 05:03:28 INFO mapreduce.Job: map 100% reduce 100%
14/01/06 05:03:29 INFO mapreduce.Job: Job job_1389011868970_0001 completed successfully
14/01/06 05:03:31 INFO mapreduce.Job: Counters: 43
   File System Counters
       FILE: Number of bytes read=152
       FILE: Number of bytes written=158433
       FILE: Number of read operations=0
       FILE: Number of large read operations=0
       FILE: Number of write operations=0
       HDFS: Number of bytes read=25800
       HDFS: Number of bytes written=120
       HDFS: Number of read operations=6
       HDFS: Number of large read operations=0
       HDFS: Number of write operations=2
   Job Counters
       Launched map tasks=1
       Launched reduce tasks=1
       Rack-local map tasks=1
       Total time spent by all maps in occupied slots (ms)=398307
       Total time spent by all reduces in occupied slots (ms)=12576
   Map-Reduce Framework
       Map input records=689
       Map output records=4817
       Map output bytes=44504
       Map output materialized bytes=152
       Input split bytes=104
       Combine input records=4817
       Combine output records=13
       Reduce input groups=13
       Reduce shuffle bytes=152
       Reduce input records=13
       Reduce output records=13
       Spilled Records=26
       Shuffled Maps =1
       Failed Shuffles=0
       Merged Map outputs=1
       GC time elapsed (ms)=236
       CPU time spent (ms)=6320
       Physical memory (bytes) snapshot=292528128
       Virtual memory (bytes) snapshot=2350538752
       Total committed heap usage (bytes)=132321280
   Shuffle Errors
       BAD_ID=0
       CONNECTION=0
       IO_ERROR=0
       WRONG_LENGTH=0
       WRONG_MAP=0
       WRONG_REDUCE=0
   File Input Format Counters
       Bytes Read=25696
   File Output Format Counters
       Bytes Written=120
[hdfs@spb-master bin]$
___________________________________________________
Step 15:Web Interface to view the running application on cluster.

Status of Running job and Cluster Matrics

Step 16: Verify the status of FINISHED application :

Job-History

Step 17:Web Interface to view the Output file

Step 18: You could also verify using POSIX commands :

[@spb-master bin]# ./hadoop fs -ls /users
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2014-01-06 05:03 /users/Out_hello
-rw-r--r--   1 root supergroup     262705 2014-01-06 04:43 /users/hellofile
-rw-r--r--   1 root supergroup      25696 2014-01-06 04:46 /users/hellofile2
[@spb-master bin]# __________________________________________________________________

Step 19:Run another sample application program "pi" from hadoop-mapreduce-examples-2.2.0.jar and view the status "RUNNING"

___________________________________________________________________________
Step 20: List of default PORTs for each components of Hadoop Ecosystem .
Example :
http://master:50070/dfshealth.jsp
http://master:8088/cluster

Similarly you could find lots of information about WI ports, default port at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.5.0/bk_reference/content/reference_chap2.html.This would clarify your doubts on using the port numbers in Configuration Files.
_________________________________________________________________
References:
1) http://hadoop.apache.org/
2) Hadoop: The Definitive Guide by Tom White:

3) http://hortonworks.com/hadoop/
4) http://www.cloudera.com/content/cloudera/en/home.html
5) http://www.meetup.com/lspe-in/pages/7th_Event_-_Hadoop_Hands_on_Session/
____________________________________________________________________

That’s quite a few points! I hope this quick overview on Hadoop Cluster has been helpful. :)

Click here : Overview of apache Hadoop .
Click here: Big Data Revolution and Vision ........!!!
Click here : Single-node Cluster setup and Implementation.
Click here : Big Data : Watson - Era of cognitive computing !!!

LINUX & HPC : Advanced Large Scale Computing at a Glance !

Sunday, January 5, 2014

Big Data: Hadoop 2.x/YARN Multi-Node Cluster Installation

No comments:

Post a Comment

Popular Posts

Translate