Apache Hadoop 2 / YARN / MR2 Installation for Beginners

Background:
Big Data spans three dimensions: Volume, Velocity and Variety (IBM adds a fourth dimension, Veracity). Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets (Big Data) across clusters of commodity machines (low-cost servers). It is designed to scale up to thousands of machines with a high degree of fault tolerance: the software itself detects and handles failures at the application layer.

NOTE: More details are available at http://hadoop.apache.org/docs/stable/

Apache Hadoop 2 introduces two terms that are new to Hadoop 1.0 users: MapReduce2 (MR2) and YARN.
Apache Hadoop YARN is the next-generation Hadoop framework, designed to take Hadoop beyond MapReduce for data processing. It gives better cluster utilization, which lets Hadoop scale to accommodate more and larger jobs.
This blog provides information for users to migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x.

https://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html

Steps to install Hadoop 2.x on CentOS/RHEL 6 as a single-node cluster:

Step 1: Install Java from: http://www.oracle.com/technetwork/java/javase/downloads/index.html
        Set the environment variable $JAVA_HOME to point at the Java installation.

NOTE: java-1.6.0-openjdk, or one of the other Java versions listed at the link below, is preferable.

http://wiki.apache.org/hadoop/HadoopJavaVersions

Step 2: Download Apache Hadoop 2.2 to the folder $PACKAGE_HOME from: http://hadoop.apache.org/releases.html#Download

Step 3: Add the Hadoop and Java environment variables to the .bashrc file.

Example :
                 Configure $HOME/.bashrc
                         - HADOOP_HOME
                         - JAVA_PATH
                         - PATH
                         - HADOOP_HDFS_HOME
                         - HADOOP_YARN_HOME
                         - HADOOP_MAPRED_HOME
                         - HADOOP_CONF_DIR
                         - YARN_CLASS_PATH
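A minimal sketch of what these .bashrc entries might look like. The install locations (/opt and /usr/java/default) are assumptions, not values from this post, and the Java variable is written as JAVA_HOME to match Step 1. YARN_CLASS_PATH is left out because its value is not given here.

```shell
# Assumed locations: adjust both defaults to your own system.
export PACKAGE_HOME="${PACKAGE_HOME:-/opt}"
export JAVA_HOME="${JAVA_HOME:-/usr/java/default}"

# All Hadoop components live in the same unpacked tarball on a single node.
export HADOOP_HOME="$PACKAGE_HOME/hadoop-2.2.0"
export HADOOP_HDFS_HOME="$HADOOP_HOME"
export HADOOP_YARN_HOME="$HADOOP_HOME"
export HADOOP_MAPRED_HOME="$HADOOP_HOME"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Make the java, hadoop/hdfs and start/stop scripts reachable by name.
export PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
```

After editing .bashrc, run "source ~/.bashrc" (or log in again) so the variables take effect in the current shell.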
------------------------------------------------------------------------------------------
Step 4 : Create a separate Group for Hadoop setup
# groupadd hadoop

Step 5: Add three user accounts to the group "hadoop"
# useradd -g hadoop yarn
# useradd -g hadoop hdfs
# useradd -g hadoop mapred

NOTE: It is good practice to run each daemon under the account that matches its service.

Step 6: Create data directories for the NameNode, DataNode and Secondary NameNode
# mkdir -p $CONFIG/data/hadoop/hdfs/nn
# mkdir -p $CONFIG/data/hadoop/hdfs/dn
# mkdir -p $CONFIG/data/hadoop/hdfs/snn

Step 7: Give the "hdfs" account ownership of the data directories

# chown -R hdfs:hadoop $CONFIG/data/hadoop/hdfs

Step 8: Create Log Directories
# mkdir -p $CONFIG/log/hadoop/yarn

Then, inside the installation directory, create a "logs" directory (for example $PACKAGE_HOME/hadoop-2.2.0/logs):
# mkdir logs

Step 9: Set ownership to yarn

# chown -R yarn:hadoop $CONFIG/log/hadoop/yarn

Go to the Hadoop directory "$PACKAGE_HOME/hadoop-2.2.0/"

# chmod g+w logs
# chown yarn:hadoop . -R
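Steps 6 to 9 can be collected into one small script. This is only a sketch: it defaults CONFIG to a scratch directory so it is safe to try, and the helper set_owner is a hypothetical name that changes ownership only when run as root on a machine where the accounts from Steps 4 and 5 exist.

```shell
# Scratch root by default; set CONFIG to the real location before using this for real.
CONFIG="${CONFIG:-$(mktemp -d)}"

# Data directories (NameNode, DataNode, Secondary NameNode) and the YARN log directory.
for d in data/hadoop/hdfs/nn data/hadoop/hdfs/dn data/hadoop/hdfs/snn log/hadoop/yarn; do
    mkdir -p "$CONFIG/$d"
done

# Hypothetical helper: chown only if we are root and the user and group both exist.
set_owner() {
    owner=$1
    target=$2
    if [ "$(id -u)" -eq 0 ] \
        && id "${owner%%:*}" >/dev/null 2>&1 \
        && getent group "${owner##*:}" >/dev/null 2>&1; then
        chown -R "$owner" "$target"
    else
        echo "skipped: chown -R $owner $target"
    fi
}

set_owner hdfs:hadoop "$CONFIG/data/hadoop/hdfs"
set_owner yarn:hadoop "$CONFIG/log/hadoop/yarn"
```

Because mkdir -p and chown -R are both safe to repeat, the script can be re-run after a partial failure.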

Step 10: Configure the files listed below in $HADOOP_PREFIX/etc/hadoop

------------------------------------------------------------------------------------------------------------------
i) core-site.xml
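The contents of core-site.xml are not shown in this post. As a hedged sketch, a single-node setup usually needs only fs.defaultFS, which tells clients where the NameNode listens. The host, the port 9000 and the OUT_DIR scratch directory below are assumptions; review the file and copy it to $HADOOP_PREFIX/etc/hadoop yourself.

```shell
# Hypothetical scratch directory so nothing in a live install is overwritten.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# fs.defaultFS is the URI of the default file system (the NameNode).
# localhost:9000 is an assumed value, not one taken from this post.
cat > "$OUT_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

echo "wrote $OUT_DIR/core-site.xml"
```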

---------------------------------------------------------------------------------------------------------------------
ii) hadoop-env.sh

[root@spb-master hadoop]# cat hadoop-env.sh

export JAVA_HOME=$BIN/java/default

export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

export HADOOP_LOG_DIR=$PACKAGE_HOME/hadoop-2.2.0/logs

# The maximum amount of heap to use, in MB. Default is 1000.

export HADOOP_HEAPSIZE=500

export HADOOP_NAMENODE_INIT_HEAPSIZE="500"

export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="200"

------------------------------------------------------------------------------------------------------------------------
iii) hdfs-site.xml

[root@spb-master hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:$DATA_DIR/data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:$DATA_DIR/data/hadoop/hdfs/dn</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
[root@spb-master hadoop]#

------------------------------------------------------------------------------------------------------------------------
iv) mapred-site.xml
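The contents of mapred-site.xml are not shown in this post. The job logs later on show MapReduce jobs being submitted to the ResourceManager, which is what mapreduce.framework.name=yarn selects, so a minimal file probably looked like the sketch below. OUT_DIR is a hypothetical scratch directory; copy the reviewed file to $HADOOP_PREFIX/etc/hadoop.

```shell
# Hypothetical scratch directory so nothing in a live install is overwritten.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# mapreduce.framework.name=yarn makes MapReduce jobs run on YARN
# instead of the local job runner.
cat > "$OUT_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

echo "wrote $OUT_DIR/mapred-site.xml"
```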

---------------------------------------------------------------------------------------------------------------------
v) yarn-env.sh

[root@spb-master hadoop]# cat yarn-env.sh

export JAVA_HOME=$BIN/java/default

export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0

export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

JAVA=$JAVA_HOME/bin/java

JAVA_HEAP_MAX=-Xmx500m

# For setting YARN specific HEAP sizes please use this

# Parameter and set appropriately

YARN_HEAPSIZE=500

------------------------------------------------------------------------------------------------------------------------
vi) yarn-site.xml
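The contents of yarn-site.xml are not shown in this post. For MapReduce on YARN the NodeManager has to load the shuffle auxiliary service, so a minimal file for Hadoop 2.2.0 would plausibly contain the two properties below. Treat this as a sketch under that assumption; OUT_DIR is a hypothetical scratch directory.

```shell
# Hypothetical scratch directory so nothing in a live install is overwritten.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# The NodeManager serves map output to reducers through the
# mapreduce_shuffle auxiliary service, implemented by ShuffleHandler.
cat > "$OUT_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF

echo "wrote $OUT_DIR/yarn-site.xml"
```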

---------------------------------------------------------------------------------------------------------------
Step 11: Set up passwordless ssh for the "hdfs" user account:
   # su - hdfs
   hdfs@localhost$ ssh-keygen -t rsa
   hdfs@localhost$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   hdfs@localhost$ chmod 0600 ~/.ssh/authorized_keys

On a multi-node cluster, also copy the public key to every other host, for example:
ssh-copy-id -i /home/user1/.ssh/id_rsa.pub hostname1
ssh-copy-id -i /home/user1/.ssh/id_rsa.pub hostname2
ssh-copy-id -i /home/user1/.ssh/id_rsa.pub hostname3

NOTE: The user's home directory (/home/USER) must have mode 700 or 755, for example:

[root@ibmgpu01 ~]# chmod 755 /pmpi2/smpici
---------------------------------------------------------------------------------
Step 12: Verify that you can now log in without being prompted for a password:

[hdfs@localhost]$ ssh localhost
Last login: Sun Dec 29 04:31:44 2013 from localhost
[hdfs@localhost ~]$

---------------------------------------------------------------------------------------------------------------

Step 13: Format the Hadoop file system:
Format the NameNode directory as the HDFS superuser (the "hdfs" user account)
# su - hdfs
$ cd $PACKAGE_HOME/hadoop-2.2.0/bin
$ ./hdfs namenode -format

It should report that $CONFIG/data/hadoop/hdfs/nn has been successfully formatted, as shown below:

[hdfs@localhost bin]$ ./hdfs namenode -format
13/12/29 02:36:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.x
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.2.0
STARTUP_MSG:   classpath = $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/jetty-6.1.26.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/commons-el-1.0.jar:

STARTUP_MSG:   java = 1.7.0_45
************************************************************/
13/12/29 02:36:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library //hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
13/12/29 02:36:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-d47a364a-edc6-455f-b3c8-4d2ba54458d5
13/12/29 02:36:54 INFO namenode.HostFileManager: read includes:
HostSet(
)
13/12/29 02:36:54 INFO namenode.HostFileManager: read excludes:
HostSet(
)
13/12/29 02:36:54 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map BlocksMap
13/12/29 02:36:54 INFO util.GSet: VM type       = 64-bit
13/12/29 02:36:54 INFO util.GSet: 2.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity      = 2^18 = 262144 entries
13/12/29 02:36:54 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: defaultReplication         = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplication             = 512
13/12/29 02:36:54 INFO blockmanagement.BlockManager: minReplication             = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
13/12/29 02:36:54 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
13/12/29 02:36:54 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
13/12/29 02:36:54 INFO namenode.FSNamesystem: fsOwner             = hdfs (auth:SIMPLE)
13/12/29 02:36:54 INFO namenode.FSNamesystem: supergroup          = supergroup
13/12/29 02:36:54 INFO namenode.FSNamesystem: isPermissionEnabled = true
13/12/29 02:36:54 INFO namenode.FSNamesystem: HA Enabled: false
13/12/29 02:36:54 INFO namenode.FSNamesystem: Append Enabled: true
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map INodeMap
13/12/29 02:36:54 INFO util.GSet: VM type       = 64-bit
13/12/29 02:36:54 INFO util.GSet: 1.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity      = 2^17 = 131072 entries
13/12/29 02:36:54 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map Namenode Retry Cache
13/12/29 02:36:54 INFO util.GSet: VM type       = 64-bit
13/12/29 02:36:54 INFO util.GSet: 0.029999999329447746% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity      = 2^12 = 4096 entries
13/12/29 02:36:55 INFO common.Storage: Storage directory $CONFIG/data/hadoop/hdfs/nn has been successfully formatted.
13/12/29 02:36:56 INFO namenode.FSImage: Saving image file $CONFIG/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
13/12/29 02:36:56 INFO namenode.FSImage: Image file $CONFIG/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
13/12/29 02:36:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/12/29 02:36:56 INFO util.ExitUtil: Exiting with status 0
13/12/29 02:36:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.x
************************************************************/
[hdfs@localhost bin]$
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 14: Start HDFS service - Namenode Daemon process

[hdfs@localhost bin]$ cd ../sbin/

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start namenode

starting namenode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-namenode-localhost.localdomain.out

Step 15: Check the status of namenode daemon

[hdfs@localhost ]$ jps

4537 Jps

4300 NameNode =====> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java

hdfs 4300 1 11 02:38 pts/1 00:00:04 $BIN/java/default/bin/java -Dproc_namenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-namenode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

_______________________________________________________________________________

Step 16 : Start HDFS service - Secondary Namenode Daemon process

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start secondarynamenode

starting secondarynamenode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 17 : Check the status of Secondarynamenode daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4913

hdfs 4913 1 7 02:46 pts/1 00:00:04 $BIN/java/default/bin/java -Dproc_secondarynamenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

_____________________________________________________________________________________________________________

Step 18: Start HDFS service - DataNode Daemon process

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start datanode

starting datanode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-datanode-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 19: Check the status of Datanode daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode

4949 Jps

4373 DataNode ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4373

hdfs 4373 1 34 02:39 pts/1 00:00:06 $BIN/java/default/bin/java -Dproc_datanode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-datanode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode

___________________________________________________________________

Step 20:Start YARN service - resourcemanager Daemon process

[hdfs@localhost sbin]$ ./yarn-daemon.sh start resourcemanager

starting resourcemanager, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out

Step 21 : Check the status of ResourceManager daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode

4949 Jps

4373 DataNode

4500 ResourceManager ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4500

hdfs 4500 1 3 02:41 pts/1 00:00:08 $BIN/java/default/bin/java -Dproc_resourcemanager -Xmx200m -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -classpath $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop//rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

_____________________

Step 22:Start YARN service - NodeManager Daemon process

[hdfs@localhost sbin]$ ./yarn-daemon.sh start nodemanager

starting nodemanager, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 23 : Check the status of Nodemanager daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4744 NodeManager ======> started successfully

4913 SecondaryNameNode

4949 Jps

4373 DataNode

4500 ResourceManager

[root@localhost bin]#

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4744

hdfs 4744 1 2 02:42 pts/1 00:00:03 $BIN/java/default/bin/java -Dproc_nodemanager -Xmx200m -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -classpath $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop//nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager ________________________________________________________
Step 24: Report the state of the HDFS file system

[hdfs@localhost bin]$ ./hadoop dfsadmin -report

Configured Capacity: 16665448448 (15.52 GB)
Present Capacity: 12396371968 (11.55 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.x:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 16665448448 (15.52 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 4269076480 (3.98 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.38%
Last contact: Sun Dec 29 03:11:02 PST 2013
[hdfs@localhost bin]$
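The figures in this report are related by simple arithmetic, which is a handy sanity check when reading it. The sketch below recomputes three of them from the values printed above. The relationships used (present capacity is configured capacity minus non-DFS usage, remaining is present minus DFS used) are consistent with these numbers but are stated here as an observation, not quoted from the Hadoop documentation.

```shell
# Values copied from the dfsadmin report above (bytes).
configured=16665448448
non_dfs_used=4269076480
dfs_used=24576

# Space HDFS can actually use once non-HDFS files on the same disk are subtracted.
present=$((configured - non_dfs_used))

# Space still free for HDFS blocks.
remaining=$((present - dfs_used))

# Remaining space as a percentage of the configured capacity.
pct=$(awk -v r="$remaining" -v c="$configured" 'BEGIN { printf "%.2f", 100 * r / c }')

echo "Present Capacity: $present"
echo "DFS Remaining: $remaining"
echo "DFS Remaining%: $pct%"
```

The three printed values match the "Present Capacity", "DFS Remaining" and "DFS Remaining%" lines of the report.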
________________________________________________________

Step 25: Stop all the services by running "stop-all.sh" (deprecated in 2.x; as its own output says, use stop-dfs.sh and stop-yarn.sh instead)

[hdfs@localhost sbin]$ ./stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
[hdfs@localhost sbin]$
________________________________________________________
Step 26: Start all the services by running "start-all.sh"

Figure: YARN architecture block diagram, showing which component hosts each daemon.

[hdfs@localhost sbin]$ ./start-all.sh

Check the status of all services:

[hdfs@localhost sbin]$ jps
6161 NameNode
6260 DataNode
6719 NodeManager
6750 Jps
6355 SecondaryNameNode
6429 ResourceManager
[root@localhost bin]#
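Scanning the jps output by eye gets tedious after a few restarts. The sketch below reads jps-style output on standard input and prints the name of every expected daemon that is missing. The function name missing_daemons is hypothetical; the daemon list is the five processes started in Steps 14 to 22.

```shell
# Read "PID Name" lines on stdin and print each expected daemon that is absent.
missing_daemons() {
    # Keep only the process names (second column of jps output).
    seen=$(awk '{ print $2 }')
    for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # -x matches the whole line, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$seen" | grep -qx "$daemon" || echo "$daemon"
    done
}
```

Use it as "jps | missing_daemons". No output means all five daemons are running.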

Job definition and control flow between Hadoop/YARN components:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31822268
________________________________________________________
Step 27: Run sample application program "pi" from hadoop-mapreduce-examples-2.2.0.jar

As a first test, run one of the example programs that ship with Hadoop: launch it, monitor its progress, and read and write files on HDFS. The "pi" program estimates the value of pi in parallel, here with 2 maps of 10 samples each. The general form of the command is:

$ hadoop jar $BIN/lib/hadoop/hadoop-examples.jar pi 2 10

[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/12/29 04:33:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:13 INFO input.FileInputFormat: Total input paths to process : 2
13/12/29 04:33:13 INFO mapreduce.JobSubmitter: number of splits:2
13/12/29 04:33:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0001
13/12/29 04:33:15 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0001 to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0001/
13/12/29 04:33:15 INFO mapreduce.Job: Running job: job_1388320369543_0001
13/12/29 04:33:38 INFO mapreduce.Job: Job job_1388320369543_0001 running in uber mode : false
13/12/29 04:33:38 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:35:22 INFO mapreduce.Job: map 83% reduce 0%
13/12/29 04:35:23 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:36:10 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:36:16 INFO mapreduce.Job: Job job_1388320369543_0001 completed successfully
13/12/29 04:36:16 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=50
        FILE: Number of bytes written=238681
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=528
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=208977
        Total time spent by all reduces in occupied slots (ms)=39840
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=36
        Map output materialized bytes=56
        Input split bytes=292
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=56
        Reduce input records=4
        Reduce output records=0
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=1712
        CPU time spent (ms)=3320
        Physical memory (bytes) snapshot=454049792
        Virtual memory (bytes) snapshot=3515953152
        Total committed heap usage (bytes)=268247040
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=236
    File Output Format Counters
        Bytes Written=97
Job Finished in 184.356 seconds
Estimated value of Pi is 3.80000000000000000000
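The estimate of 3.8 is coarse because only 2 x 10 = 20 sample points were used. To my understanding, the example places quasi-random points (a Halton sequence) in a unit square and counts how many fall inside the inscribed circle; the estimate is 4 x inside / total. The sketch below shows that idea in awk. It is not the Hadoop source code, and the function names halton_pi and radical_inverse are hypothetical.

```shell
# Estimate pi from the first N points of a 2-D Halton sequence (bases 2 and 3).
halton_pi() {
    awk -v n="$1" '
    # Radical inverse: mirror the base-b digits of i around the decimal point.
    function radical_inverse(i, base,    f, r) {
        f = 1 / base
        r = 0
        while (i > 0) {
            r += f * (i % base)
            i = int(i / base)
            f /= base
        }
        return r
    }
    BEGIN {
        inside = 0
        for (k = 1; k <= n; k++) {
            # Shift the point so the circle is centred on the origin, radius 0.5.
            x = radical_inverse(k, 2) - 0.5
            y = radical_inverse(k, 3) - 0.5
            if (x * x + y * y <= 0.25) inside++
        }
        printf "%.4f\n", 4 * inside / n
    }'
}

halton_pi 20
```

With 20 points, 19 land inside the circle, so the sketch prints 3.8000, the same coarse value as the run above. Larger sample counts, for example "pi 10 1000" on the cluster, move the estimate closer to 3.14.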
[hdfs@localhost bin]$
________________________________________________________________
Step 28: Verify the running services using the web interface:

The web interface of the ResourceManager is available at
http://localhost:8088

Screenshot: the running application on the single-node cluster

Screenshot: Application Overview, Final Status (FINISHED)

__________________________________________________________________

Step 29: Create a directory on HDFS

[hdfs@localhost bin]$ ./hadoop fs -mkdir /test1
-------------------------------------------------------------------------
Step 30: Put the local file "hellofile" into HDFS (/test1)

[hdfs@localhost bin]$ ./hadoop fs -put hellofile /test1
-------------------------------------------------------------------------

Step 31: Check the input file "hellofile" on HDFS

[hdfs@localhost bin]$ ./hadoop fs -ls /test1
Found 1 items
-rw-r--r--   1 hdfs supergroup        113 2013-12-29 04:56 /test1/hellofile
[hdfs@localhost bin]$
___________________________________________________________
Step 32: Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar

WordCount Example:
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words, then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and the sum.

To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>

All of the files in the input directory (called in-dir in the command line above) are read, and the counts of words in the input are written to the output directory (called out-dir above). Both inputs and outputs are assumed to be stored in HDFS. If your input is not already in HDFS but sits in a local file system, copy the data into HDFS first, as shown in Steps 29-31 above.
NOTE: In the same way you can process bigger data files (weather data, healthcare data, machine log data, etc.).
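The map, shuffle and reduce phases described above have a rough single-machine analogue in a plain shell pipeline, which can help to predict what the Hadoop job should produce for a small file. This is only an illustration of the data flow, not how Hadoop runs the job, and the function name wordcount_local is hypothetical.

```shell
# Count words on stdin and print "word<TAB>count", sorted by word.
# tr      plays the mapper: it breaks the input into one word per line.
# sort    plays the shuffle: it brings identical words together (byte order).
# uniq -c plays the reducer: it sums the occurrences of each word.
wordcount_local() {
    tr -s '[:space:]' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ printf "%s\t%s\n", $2, $1 }'
}
```

For example, "wordcount_local < hellofile" on the local copy of the input should list the same words and counts as the part-r-00000 file written by the job.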

[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /test1/hellofile /test1/output
13/12/29 04:57:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:57:53 INFO input.FileInputFormat: Total input paths to process : 1
13/12/29 04:57:53 INFO mapreduce.JobSubmitter: number of splits:1
13/12/29 04:57:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0002
13/12/29 04:57:55 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0002 to ResourceManager at /0.0.0.0:8032
13/12/29 04:57:55 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0002/
13/12/29 04:57:55 INFO mapreduce.Job: Running job: job_1388320369543_0002
13/12/29 04:58:06 INFO mapreduce.Job: Job job_1388320369543_0002 running in uber mode : false
13/12/29 04:58:06 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:58:17 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:58:41 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:58:42 INFO mapreduce.Job: Job job_1388320369543_0002 completed successfully
13/12/29 04:58:42 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=152
        FILE: Number of bytes written=158589
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=215
        HDFS: Number of bytes written=94
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=9934
        Total time spent by all reduces in occupied slots (ms)=19948
    Map-Reduce Framework
        Map input records=4
        Map output records=21
        Map output bytes=194
        Map output materialized bytes=152
        Input split bytes=102
        Combine input records=21
        Combine output records=13
        Reduce input groups=13
        Reduce shuffle bytes=152
        Reduce input records=13
        Reduce output records=13
        Spilled Records=26
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=148
        CPU time spent (ms)=1520
        Physical memory (bytes) snapshot=298029056
        Virtual memory (bytes) snapshot=2346151936
        Total committed heap usage (bytes)=143855616
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=113
    File Output Format Counters
        Bytes Written=94
[hdfs@localhost bin]$

Verify the running services using the web interface (screenshots: All Applications view, Scheduler view).

_______________________________________________________
Step 33: View the output file of the WordCount application program (the listing below is from a run whose output directory was /test1/output1):

[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1
Found 2 items
-rw-r--r-- 1 hdfs supergroup 0 2013-12-29 05:09 /test1/output1/_SUCCESS
-rw-r--r-- 1 hdfs supergroup 120 2013-12-29 05:09 /test1/output1/part-r-00000

[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1/part-r-00000
Found 1 items
-rw-r--r--   1 hdfs supergroup        120 2013-12-29 05:09 /test1/output1/part-r-00000
[hdfs@localhost bin]$ ./hadoop fs -cat /test1/output1/part-r-00000
Hello    693
Others    231
all    231
and    462
are    231
dear    231
everyone    462
friends    231
here    462
my    231
there    462
to    693
who    231
[hdfs@localhost bin]$

----------------------------------------------------------------------------------------------------------
This is a small effort to familiarize readers with a Hadoop YARN setup: running some MapReduce applications, executing POSIX-style commands in the HDFS environment, and verifying the output for data analytics. There are many other settings (history server, checkpointing, scheduler type, and so on) that are very much required in a production environment; those will be documented separately.


End of YARN Single node Installation . :)

LINUX & HPC : Advanced Large Scale Computing at a Glance !

Monday, December 30, 2013

Big Data: Hadoop 2.x (YARN/MRv2) - Single Node Installation
