Sunday, May 12, 2019

Apache Hadoop 3.x installation on a multi-node RHEL 7 cluster (ppc64le)


Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets, distributed across networked nodes. It’s composed of the Hadoop Distributed File System (HDFS™) that handles scalability and redundancy of data across nodes, and Hadoop YARN: a framework for job scheduling that executes data processing tasks on all nodes.

Hadoop allows for the storage and processing of large data sets across clusters of computers. Hadoop was developed by the Apache Software Foundation and has become a popular tool for big data processing and analysis.

Some of the key use cases of Hadoop are:

Data storage: Hadoop Distributed File System (HDFS) is a highly scalable and fault-tolerant distributed file system that is used to store large amounts of data across multiple nodes in a cluster. Hadoop is often used to store and manage large amounts of unstructured and semi-structured data, such as log files, sensor data, and social media data.

Batch processing: Hadoop provides a powerful framework for batch processing of large data sets. This is typically done using the MapReduce programming model, which allows for the parallel processing of data across multiple nodes in a cluster. Hadoop is often used for tasks such as data cleansing, data transformation, and data aggregation.

Data analysis: Hadoop is often used for data analysis and machine learning tasks. Hadoop provides a number of tools for processing and analyzing large data sets, including Apache Pig and Apache Hive. These tools allow users to query and analyze large data sets using SQL-like commands.

Real-time processing: Hadoop can also be used for real-time processing of data, using tools such as Apache Spark and Apache Flink. These tools allow for the processing of data streams in real-time, enabling applications such as fraud detection, real-time recommendations, and IoT data processing.

Apache Hadoop 3.x Benefits

  •     Supports multiple standby NameNodes.
  •     Supports multiple NameNodes for multiple namespaces.
  •     Reduces storage overhead from 200% to 50% (with HDFS erasure coding).
  •     Supports GPUs.
  •     Provides intra-node disk balancing.
  •     Supports opportunistic containers and distributed scheduling.
  •     Supports the Microsoft Azure Data Lake and Aliyun Object Storage System filesystems.

Architecture of Hadoop Cluster: 

Apache Hadoop has two core components:
1) HDFS - for storage
2) YARN - for computation

The HDFS and YARN architectures are shown below:


HDFS ARCHITECTURE

 
YARN ARCHITECTURE

Before configuring the master and worker nodes, it's good to understand the different components of a Hadoop cluster. A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. The master node (hadoopNode1 in this guide) handles this role and hosts two daemons:
•    The NameNode: manages the distributed file system and knows where data blocks are stored inside the cluster.
•    The ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on worker nodes.
Worker nodes store the actual data and provide the processing power to run the jobs; each hosts two daemons:
•    The DataNode: manages the actual data physically stored on the node.
•    The NodeManager: manages the execution of tasks on the node.



Prerequisites for Implementing Hadoop

  •     Operating system – RHEL 7.6
  •     Hadoop – the Hadoop 3.x package
  •     Passwordless SSH connections between the nodes in the cluster (a setup sketch follows this list)
  •     Firewall settings on the machines in the cluster that allow the Hadoop ports (covered in the same sketch)
  •     Machine details:
    Master node : hadoopNode1 (Power8 server with K80 GPUs running RHEL 7)
    Worker nodes: hadoopNode1, hadoopNode2 (Power8 servers with K80 GPUs running RHEL 7)
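
The passwordless SSH and firewall prerequisites above are one-time host preparation. A minimal sketch is shown below; the key type, user name, and port list are illustrative assumptions (the ports correspond to the NameNode RPC/web UI, DataNode, and ResourceManager ports used later in this guide), so adjust them to your environment.

-----------------------------------------------------------------------------
# On the master node (hadoopNode1): generate a key pair if one does not exist
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Copy the public key to every node in the cluster (including the master itself)
ssh-copy-id sachinpb@hadoopNode1
ssh-copy-id sachinpb@hadoopNode2

# Verify that no password prompt appears
ssh sachinpb@hadoopNode2 hostname

# Firewall: either stop firewalld on a trusted private network ...
sudo systemctl stop firewalld && sudo systemctl disable firewalld

# ... or open the Hadoop ports used in this guide (run on every node)
for port in 9000 9870 9866 8088 8032; do
  sudo firewall-cmd --permanent --add-port=${port}/tcp
done
sudo firewall-cmd --reload
-----------------------------------------------------------------------------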

-------------------------------------  ------------------------------------------------------------------

Hadoop Installation Steps:

Step 1: Download the Java 8 package and save the file in your home directory.

Java is the primary requirement for running Hadoop on any system. The Hadoop 3.x jar files are compiled against the Java 8 runtime, so you must install Java 8 to use Hadoop 3.x; users on JDK 7 have to upgrade to JDK 8.

If your machine uses the IBM Power architecture (ppc64le), get the IBM Java package from the link below:

Download link: https://developer.ibm.com/javasdk/downloads/sdk8/

Step 2: Extract the Java tar file.
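
A minimal extraction sketch, assuming the IBM SDK was downloaded into the home directory as a gzipped tar archive (the archive name below is illustrative; use the file you actually downloaded) and is unpacked under /opt/ibm to match the JAVA_HOME used later in this guide:

-----------------------------------------------------------------------------
# Unpack the SDK under /opt/ibm (archive name is an assumption)
sudo mkdir -p /opt/ibm
sudo tar xzvf ~/ibm-java-sdk-8.0-ppc64le-archive.tar.gz -C /opt/ibm

# The extracted directory becomes JAVA_HOME, e.g. /opt/ibm/java-ppc64le-80
ls /opt/ibm
-----------------------------------------------------------------------------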

Step 3: Download the Hadoop 3.x package.

Download a stable version of Hadoop:
wget http://apache.spinellicreations.com/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz

Step 4: Extract the Hadoop tar file.
Extract the files at /home/users/sachinpb/sachinPB/:

 tar xzvf hadoop-3.2.0.tar.gz

At the top level of /home/users/sachinpb/sachinPB/hadoop-3.2.0 you will see the following directories:

├── bin
│   ├── container-executor
│   ├── hadoop
│   ├── hadoop.cmd
│   ├── hdfs
│   ├── hdfs.cmd
│   ├── mapred
│   ├── mapred.cmd
│   ├── oom-listener
│   ├── test-container-executor
│   ├── yarn
│   └── yarn.cmd
├── etc
│   └── hadoop
│       ├── core-site.xml
│       ├── hadoop-env.sh
│       ├── hdfs-site.xml
│       ├── log4j.properties
│       ├── mapred-site.xml
│       ├── workers
│       ├── yarn-env.sh
│       └── yarn-site.xml
├── include
├── lib
│   └── native
│       ├── examples
│       ├── libhadoop.a
│       ├── libhadooppipes.a
│       ├── libhadoop.so -> libhadoop.so.1.0.0
│       ├── libhadoop.so.1.0.0
│       ├── libhadooputils.a
│       ├── libnativetask.a
│       ├── libnativetask.so -> libnativetask.so.1.0.0
│       └── libnativetask.so.1.0.0
├── logs
│ 
├── sbin
│   ├── hadoop-daemon.sh
│   ├── httpfs.sh
│   ├── mr-jobhistory-daemon.sh
│   ├── refresh-namenodes.sh
│   ├── start-all.sh
│   ├── start-balancer.sh
│   ├── start-dfs.sh
│   ├── start-secure-dns.sh
│   ├── start-yarn.sh
│   ├── stop-all.cmd
│   ├── stop-all.sh
│   ├── stop-balancer.sh
│   ├── stop-dfs.sh
│   ├── stop-secure-dns.sh
│   ├── stop-yarn.sh
│   ├── workers.sh
│   ├── yarn-daemon.sh
│ 
└── share
    ├── doc
    │   └── hadoop
    └── hadoop
        ├── client
        ├── common
        ├── hdfs
        ├── mapreduce
        ├── tools
        └── yarn


 Step 5: Add the Hadoop and Java paths to the bash file (.bashrc).

Update ~/.bashrc:
export HADOOP_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_CONF_DIR=$HOME/sachinPB/hadoop-3.2.0/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_COMMON_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_HDFS_HOME=$HOME/sachinPB/hadoop-3.2.0
export YARN_HOME=$HOME/sachinPB/hadoop-3.2.0
export PATH=$PATH:$HOME/sachinPB/hadoop-3.2.0/bin

#set Java Home

export JAVA_HOME=/opt/ibm/java-ppc64le-80
export PATH=$PATH:/opt/ibm/java-ppc64le-80/bin

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH

source ~/.bashrc

 Step 6: Edit the Hadoop configuration files as required by your applications.

              The configuration files are located at $HOME/sachinPB/hadoop-3.2.0/etc/hadoop; the next steps walk through each file.

 

Step 7: Open core-site.xml and edit the property below inside the <configuration> tag. (fs.default.name is the older, deprecated alias of fs.defaultFS; both work, but fs.defaultFS is the preferred name in Hadoop 3.)

SET NAMENODE LOCATION

core-site.xml
-----------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoopNode1:9000</value>
</property>
</configuration>
-----------------------------------------------------------------------------

Step 8: Open hdfs-site.xml and edit the properties below inside the <configuration> tag. The local directories referenced by dfs.namenode.name.dir and dfs.datanode.data.dir must exist on every node before the NameNode is formatted (see the sketch after this config).

SET PATH FOR HDFS

hdfs-site.xml
-----------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:$DATA_DIR/hadoop/hdfs/nn</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:$DATA_DIR/hadoop/hdfs/dn</value>
 </property>
 <property>
   <name>dfs.permissions</name>
   <value>false</value>
 </property>
</configuration>
----------------------------------------------------------------------------
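
The file: URIs above point at local directories (written here with a $DATA_DIR placeholder). They must exist and be writable by the Hadoop user on every node before the NameNode is formatted. A minimal sketch, assuming $DATA_DIR is exported in the shell:

-----------------------------------------------------------------------------
# Run on every node: the NameNode dir is used on the master, the DataNode dir on all workers
mkdir -p "$DATA_DIR/hadoop/hdfs/nn" "$DATA_DIR/hadoop/hdfs/dn"
chmod 750 "$DATA_DIR/hadoop/hdfs/nn" "$DATA_DIR/hadoop/hdfs/dn"
-----------------------------------------------------------------------------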

Step 9: Open mapred-site.xml and edit the properties below inside the <configuration> tag.

SET YARN AS JOB SCHEDULER

mapred-site.xml
-------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
</configuration>
-----------------------------------------------------------------------------

Step 10: Open yarn-site.xml and edit the property below inside the <configuration> tag.

CONFIGURE YARN

yarn-site.xml
----------------------------------------------------------------
<?xml version="1.0"?>

<configuration>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
</configuration>
-----------------------------------------------------------------

Step 11: Edit hadoop-env.sh and yarn-env.sh and add the Java path as shown below.

 export JAVA_HOME=$JAVA_PPC64LE_PATH
 export HADOOP_HOME=$HOME/sachinPB/hadoop-3.2.0

Step 12: The workers file is used by the startup scripts to start the required daemons on all nodes; in Hadoop 3 it replaces the slaves file used in Hadoop 2. Make sure every hostname listed here resolves on every node (see the /etc/hosts sketch after this file).

CONFIGURE WORKERS
workers
-------------------------------
hadoopNode1
hadoopNode2
---------------------------------
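
The hostnames in workers (and in core-site.xml) must resolve identically on every node. If DNS does not already handle this, a minimal /etc/hosts sketch is shown below; the IP addresses are illustrative placeholders, not the real addresses of this cluster:

---------------------------------
# Append to /etc/hosts on every node (run as root); replace with your real IPs
cat >> /etc/hosts <<'EOF'
192.0.2.11   hadoopNode1
192.0.2.12   hadoopNode2
EOF
---------------------------------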

Check the Java and Hadoop versions:

[sachinpb@hadoopNode1 hadoop]$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)


Check Hadoop version:
[sachinpb@hadoopNode1 hadoop]$ hadoop version
Hadoop 3.2.0
This command was run using $HOME/sachinPB/hadoop-3.2.0/share/hadoop/common/hadoop-common-3.2.0.jar
[sachinpb@hadoopNode1 hadoop]$

-----------------------------------------------

Step 13: Next, format the NameNode.

HDFS needs to be formatted like any classical file system. On the master node (hadoopNode1), run the following command:  "hdfs namenode -format"

[sachinpb@hadoopNode1]$ hdfs namenode -format -clusterId CID***-XYZ
2019-05-07 03:44:04,380 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoopNode1/$HOST1_IPADDRESS
STARTUP_MSG:   args = [-format, -clusterId, CID***-XYZ]
STARTUP_MSG:   version = 3.2.0
STARTUP_MSG:   classpath = $HOME/sachinpb/sachinPB/hadoop-3.2.0/etc/hadoop:$HOME/sachinpb/sachinPB/hadoop-.
.
.
.
.

 [$DATA_DIR/hadoop/hdfs/nn/current/VERSION, $DATA_DIR/hadoop/hdfs/nn/current/seen_txid, $DATA_DIR/hadoop/hdfs/nn/current/fsimage_0000000000000000000.md5, $DATA_DIR/hadoop/hdfs/nn/current/fsimage_0000000000000000000, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000001-0000000000000000002, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000003-0000000000000000004, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000005-0000000000000000006, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000007-0000000000000000008, $DATA_DIR/hadoop/hdfs/nn/current/edits_inprogress_0000000000000000009]
2019-05-07 03:44:08,926 INFO common.Storage: Storage directory $DATA_DIR/hadoop/hdfs/nn has been successfully formatted.
2019-05-07 03:44:08,937 INFO namenode.FSImageFormatProtobuf: Saving image file $DATA_DIR/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
2019-05-07 03:44:09,063 INFO namenode.FSImageFormatProtobuf: Image file $DATA_DIR/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2019-05-07 03:44:09,089 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2019-05-07 03:44:09,104 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoopNode1/$HOST1_IPADDRESS
************************************************************/
[sachinpb@hadoopNode1 logs]$

------------------------------------------

Step 14: Once the NameNode is formatted, go to the hadoop-3.2.0/sbin directory and start all the daemons.

Your Hadoop installation is now configured and ready to run big data applications.

Step 15: Start the HDFS and YARN daemons:

From directory: $HOME/sachinPB/hadoop-3.2.0/sbin

NOTE: Copy the Hadoop home directory to every node in your cluster if it is not on a shared filesystem (a sketch follows).
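
A minimal copy sketch, assuming rsync over SSH and identical home directory paths on both nodes (also append the Step 5 export lines to ~/.bashrc on the worker):

-----------------------------------------------------------------------------
# From hadoopNode1: replicate the configured Hadoop directory to the worker node
ssh sachinpb@hadoopNode2 mkdir -p $HOME/sachinPB
rsync -a $HOME/sachinPB/hadoop-3.2.0/ sachinpb@hadoopNode2:$HOME/sachinPB/hadoop-3.2.0/
-----------------------------------------------------------------------------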

[sachinpb@hadoopNode1 sbin]$ ./start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as sachinpb in 10 seconds..
Starting namenodes on [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode1: namenode is running as process 146418. 
Starting datanodes
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode2: Welcome to hadoopNode2!
hadoopNode2:
hadoopNode1: datanode is running as process 146666.
hadoopNode2: datanode is running as process 112502. 
Starting secondary namenodes [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode1: secondarynamenode is running as process 147091.
[sachinpb@hadoopNode1 sbin]$

Step 16: All the Hadoop services are now up and running. [On other platforms you can use the jps command to see the Hadoop daemons; IBM Java does not provide jps or jstat, so check the Hadoop processes with the ps command instead.]

[sachinpb@hadoopNode1 sbin]$ ps -ef | grep NameNode
sachinpb   105015      1  0 02:59 ?        00:00:27 $JAVA_PPC64LE_PATH/bin/java -Dproc_namenode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-namenode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-namenode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode
-----
sachinpb   105713      1  0 02:59 ?        00:00:12 $JAVA_PPC64LE_PATH/bin/java -Dproc_secondarynamenode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-secondarynamenode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-secondarynamenode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
 [sachinpb@hadoopNode1 sbin]$ ps -ef | grep DataNode
sachinpb   105268      1  0 02:59 ?        00:00:19 $JAVA_PPC64LE_PATH/bin/java -Dproc_datanode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-datanode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-datanode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode
------
 [sachinpb@hadoopNode1 sbin]$ ps -ef | grep ResourceManager
sachinpb   106257      1  1 02:59 pts/3    00:00:50 $JAVA_PPC64LE_PATH/bin/java -Dproc_resourcemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dservice.libdir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/yarn,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/yarn/lib,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/hdfs,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/hdfs/lib,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/common,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/common/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-resourcemanager-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-resourcemanager-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

[sachinpb@hadoopNode1 sbin]$ ps -ef | grep NodeManager
sachinpb   106621      1  1 02:59 ?        00:01:08 $JAVA_PPC64LE_PATH/bin/java -Dproc_nodemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-nodemanager-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-nodemanager-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.nodemanager.NodeManager

Similarly, check the status of the Hadoop daemons on the other worker node [hadoopNode2]:

[sachinpb@hadoopNode2 ~]$ ps -ef | grep hadoop
sachinpb    77718      1  7 21:52 ?        00:00:07 $JAVA_PPC64LE_PATH/bin/java -Dproc_datanode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-datanode-hadoopNode2.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-datanode-hadoopNode2.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode
sachinpb    78006      1 12 21:52 ?        00:00:11 $JAVA_PPC64LE_PATH/bin/java -Dproc_nodemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-nodemanager-hadoopNode2.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-nodemanager-hadoopNode2.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.nodemanager.NodeManager
[sachinpb@hadoopNode2 ~]$

 Step 17: Now open a browser and go to localhost:9870/dfshealth.html (on the NameNode host) to check the NameNode web interface.

NOTE: In Hadoop 2.x the web UI port is 50070; in Hadoop 3.x it moved to 9870, so the HDFS web UI is reachable at localhost:9870.
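
If a desktop browser cannot reach the cluster directly, the same endpoints can be checked from the command line. A quick sketch; 9870 is the NameNode web UI port discussed above and 8088 is the default YARN ResourceManager web UI port:

-----------------------------------------------------------------------------
# NameNode web UI (Hadoop 3.x default port 9870)
curl -s http://hadoopNode1:9870/dfshealth.html | head

# YARN ResourceManager web UI
curl -s http://hadoopNode1:8088/cluster | head
-----------------------------------------------------------------------------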


Step 18: Run a Hadoop application, for example the wordcount MapReduce program. The job reads its input from HDFS, so upload an input file first (a sketch follows).
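
A minimal input-preparation sketch; the sample text is illustrative, while the /user/sachinpb/helloworld path matches the wordcount command that follows:

-----------------------------------------------------------------------------
# Create the input directory in HDFS and upload a small text file
hdfs dfs -mkdir -p /user/sachinpb/helloworld
echo "hello world hello" > /tmp/sample.txt
hdfs dfs -put /tmp/sample.txt /user/sachinpb/helloworld/

# Confirm the file is in place
hdfs dfs -ls /user/sachinpb/helloworld
-----------------------------------------------------------------------------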

[sachinpb@hadoopNode1 hadoop-3.2.0]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /user/sachinpb/helloworld /user/sachinpb/helloworld_out
2019-05-07 04:04:35,044 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2019-05-07 04:04:36,137 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: $MY_DIR/hadoop-yarn/staging/sachinpb/.staging/job_1557225898252_0003
2019-05-07 04:04:36,374 INFO input.FileInputFormat: Total input files to process : 1
2019-05-07 04:04:36,486 INFO mapreduce.JobSubmitter: number of splits:1
2019-05-07 04:04:36,536 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-05-07 04:04:36,728 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1557225898252_0003
2019-05-07 04:04:36,729 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-05-07 04:04:36,939 INFO conf.Configuration: resource-types.xml not found
2019-05-07 04:04:36,939 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-05-07 04:04:36,997 INFO impl.YarnClientImpl: Submitted application application_1557225898252_0003
2019-05-07 04:04:37,029 INFO mapreduce.Job: The url to track the job: http://hadoopNode1:8088/proxy/application_1557225898252_0003/
2019-05-07 04:04:37,030 INFO mapreduce.Job: Running job: job_1557225898252_0003
2019-05-07 04:04:45,137 INFO mapreduce.Job: Job job_1557225898252_0003 running in uber mode : false
2019-05-07 04:04:45,138 INFO mapreduce.Job:  map 0% reduce 0%
2019-05-07 04:04:51,189 INFO mapreduce.Job:  map 100% reduce 0%
2019-05-07 04:04:59,223 INFO mapreduce.Job:  map 100% reduce 100%
2019-05-07 04:04:59,232 INFO mapreduce.Job: Job job_1557225898252_0003 completed successfully
2019-05-07 04:04:59,348 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=41
                FILE: Number of bytes written=443547
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=126
                HDFS: Number of bytes written=23
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3968
                Total time spent by all reduces in occupied slots (ms)=4683
                Total time spent by all map tasks (ms)=3968
                Total time spent by all reduce tasks (ms)=4683
                Total vcore-milliseconds taken by all map tasks=3968
                Total vcore-milliseconds taken by all reduce tasks=4683
                Total megabyte-milliseconds taken by all map tasks=4063232
                Total megabyte-milliseconds taken by all reduce tasks=4795392
        Map-Reduce Framework
                Map input records=1
                Map output records=3
                Map output bytes=29
                Map output materialized bytes=41
                Input split bytes=109
                Combine input records=3
                Combine output records=3
                Reduce input groups=3
                Reduce shuffle bytes=41
                Reduce input records=3
                Reduce output records=3
                Spilled Records=6
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=288
                CPU time spent (ms)=4030
                Physical memory (bytes) snapshot=350552064
                Virtual memory (bytes) snapshot=3825860608
                Total committed heap usage (bytes)=177668096
                Peak Map Physical memory (bytes)=226557952
                Peak Map Virtual memory (bytes)=1911750656
                Peak Reduce Physical memory (bytes)=123994112
                Peak Reduce Virtual memory (bytes)=1914109952
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=17
        File Output Format Counters
                Bytes Written=23
[sachinpb@hadoopNode1 hadoop-3.2.0]$

------------------------


Step 19: Verify the output file in HDFS:

[sachinpb@hadoopNode1 hadoop-3.2.0]$ hdfs dfs -cat /user/sachinpb/helloworld_out/part-r-00000
---------------------
2019    4
hello    6
world   7
---------------------

Step 20:  MONITOR YOUR HDFS CLUSTER

[sachinpb@hadoopNode1]$ hdfs dfsadmin -report
Configured Capacity: 1990698467328 (1.81 TB)
Present Capacity: 1794297528320 (1.63 TB)
DFS Remaining: 1794297511936 (1.63 TB)
DFS Used: 16384 (16 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: $HOST1_IPADDRESS:9866 (hadoopNode1)
Hostname: hadoopNode1
Decommission Status : Normal
Configured Capacity: 995349233664 (926.99 GB)
DFS Used: 12288 (12 KB)
Non DFS Used: 118666317824 (110.52 GB)
DFS Remaining: 876682903552 (816.47 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.08%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu May 09 23:23:50 PDT 2019
Last Block Report: Thu May 09 23:11:35 PDT 2019
Num of Blocks: 0


Name: $HOST2_IPADDRESS:9866 (hadoopNode2)
Hostname: hadoopNode2
Decommission Status : Normal
Configured Capacity: 995349233664 (926.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 77734621184 (72.40 GB)
DFS Remaining: 917614608384 (854.60 GB)
DFS Used%: 0.00%
DFS Remaining%: 92.19%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu May 09 23:23:49 PDT 2019
Last Block Report: Thu May 09 23:20:58 PDT 2019
Num of Blocks: 0

NOTE: The report shows two live DataNodes (hadoopNode1 and hadoopNode2) in this cluster, with details of configured capacity, DFS usage, cache, and block counts. This is how you can check the health of the Hadoop cluster; the wordcount application run above confirms that it can execute jobs end to end.

Step 21: How to stop the Hadoop daemons in the cluster environment:

 cd to $HOME/sachinpb/sachinPB/hadoop-3.2.0/sbin

[sachinpb@hadoopNode1 sbin]$ ./stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as sachinpb in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
Stopping datanodes
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode2: Welcome to hadoopNode2!
hadoopNode2:
Stopping secondary namenodes [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
Stopping nodemanagers
Stopping resourcemanagers on []
[sachinpb@hadoopNode1 sbin]$

I hope this blog helped in understanding how to install Hadoop 3.x on a multi-node cluster and how to perform operations on HDFS files. Overall, Hadoop is a powerful tool for big data processing and analysis, with a wide range of use cases in industries such as finance, healthcare, retail, and telecommunications.

----------------------------------------END-------------------------------------------
