Monday, September 1, 2014

Integrating GPFS with Apache Hadoop on IBM Big Data Platform

Enterprises are rapidly discovering the power of Apache Hadoop and big data to drive data-driven insights and to better position their organizations in an increasingly dynamic and competitive economy. Hadoop's innovations in scalability, cost efficiency, and flexibility give enterprises an unprecedented ability to handle greater volume, velocity, and variety of data than traditional data management technologies can. For these information-driven enterprises, the next step in the evolution of data management is an enterprise data hub (EDH): a single, unified data management system that combines distributed storage and computation. An EDH can expand indefinitely and store any amount of data of any type, while bringing a wide range of processing, computation, and query engines and third-party applications directly to the data, thus reversing the traditional flow of moving data to computing environments. The Hadoop Distributed File System (HDFS) is a key enabler of this paradigm shift of bringing the application to the data, which avoids network bottlenecks.

General Parallel File System

GPFS is a high-performance, enterprise-class parallel file system for clusters. It is an IBM product, first released in 1998, and has been available on IBM's AIX since 1998, on Linux since 2001, and on Microsoft Windows Server since 2008. Over the years it has evolved to support a variety of workloads and can scale to thousands of nodes. GPFS has been deployed in many enterprise production environments to support mission-critical applications.

Why - GPFS on Hadoop

Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications. Some Hadoop users have strict demands around performance, availability, and enterprise-grade features, while others aren't keen on its direct-attached storage (DAS) architecture.
The Hadoop Distributed File System (HDFS) is a component of the Apache Hadoop package, but it has several shortcomings that can be overcome by replacing HDFS with another file system. One such approach, offered by IBM with BigInsights, is the IBM General Parallel File System (GPFS).

GPFS was developed long before MapReduce, the distributed computing paradigm of the Hadoop framework. GPFS by itself had no MapReduce capability, because in a traditional GPFS deployment the storage nodes are distinct from the compute nodes. Simply mounting GPFS on all Hadoop nodes is not effective, because the data is remote from every compute node, i.e. no data locality is achieved. In 2010, GPFS was extended to work seamlessly with Hadoop as the GPFS Shared Nothing Cluster (GPFS-SNC) architecture, which is now available under the name GPFS File Placement Optimizer (FPO). FPO allows complete control over data placement for all replicas, if the application so desires.

 Hadoop installation with GPFS:
 • Eliminates the single point of failure of the NameNode without requiring a costly high-availability design
 • Provides tiered storage to place data on the right type of storage (SSD, 15k SAS, 7200 RPM NL-SAS, or any other)
 • Uses native InfiniBand RDMA for better throughput, lower latency, and more CPU cycles for your application


IBM General Parallel File System
IBM has been selling its General Parallel File System to high-performance computing customers for years (including for some of the world's fastest supercomputers), and in 2010 it tuned GPFS for Hadoop. IBM claims the GPFS-SNC (Shared Nothing Cluster) edition is considerably faster than HDFS, in part because it runs at the kernel level rather than on top of the operating system as HDFS does. The IBM BigInsights V2.1 release introduced support for GPFS File Placement Optimizer (FPO), a set of features that allows GPFS to support MapReduce applications on clusters with no shared disk.
1) InfoSphere BigInsights now supports IBM General Parallel File System (GPFS), an enterprise file system that is an alternative to HDFS. GPFS provides hierarchical storage management, high performance support for MapReduce and traditional applications, high availability, and other benefits over HDFS. 

2) This new capability provides enterprise-class distributed file system support that is Portable Operating System Interface (POSIX) compliant. GPFS brings established big data distributed file system capabilities to the Hadoop and MapReduce environment. The GPFS distributed metadata feature eliminates any single point of failure in the file system.

3) As you move from the sandbox to integrated rollout, your Hadoop application deals with files that might be on different systems, with different operating systems, and possibly with different applications in addition to MapReduce. With full POSIX compliance and the ability to work with more traditional applications that require read/write capabilities, GPFS can serve all of your file-based analytics requirements.

Tuning GPFS configuration for FPO:
GPFS-FPO clusters have a different architecture than traditional GPFS deployments. Therefore, the default values of many configuration parameters are not suitable for FPO deployments and should be changed as per the details at Link.

Network shared disk: When a LUN provided by a storage subsystem is configured for use by GPFS, it is referred to as a network shared disk (NSD). In a traditional GPFS installation, a LUN is typically made up of multiple physical disks provided by a RAID device. Once the LUN is defined as an NSD, GPFS allows all cluster nodes to have direct access to it. Alternatively, a subset of the nodes connected to the disks may act as NSD servers and provide access to the LUNs for the other nodes. In GPFS-FPO deployments, a physical disk and an NSD can have a 1:1 mapping. In this case, each node in the cluster is an NSD server that gives the rest of the cluster access to its disks. FPO introduced a multipart failure group concept to convey further topology information about the nodes in the cluster, which GPFS can exploit when making data placement decisions. Data block placement decisions in FPO environments are affected by the level of replication and by the values of the write-affinity depth and write-affinity failure group parameters.
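
To make the interplay of replication level, write-affinity depth, and failure groups more concrete, here is a minimal Python sketch. It is a simplified illustration and not GPFS's actual placement algorithm. It assumes a multipart failure group can be read as a (rack, position, node) tuple, that a write-affinity depth of 1 or more keeps the first replica on the writing node, and that the remaining replicas prefer racks not used yet. The node names and the function place_replicas are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a multipart failure group is modeled as (rack, position, node).
FailureGroup = Tuple[int, int, int]

def place_replicas(writer: str,
                   topology: Dict[str, FailureGroup],
                   replicas: int,
                   write_affinity_depth: int = 1) -> List[str]:
    """Pick nodes for each replica of a block (simplified, illustrative only)."""
    chosen: List[str] = []
    # With write affinity enabled, the first copy stays on the writing node.
    if write_affinity_depth >= 1 and replicas >= 1:
        chosen.append(writer)
    candidates = sorted(n for n in topology if n not in chosen)
    while len(chosen) < replicas and candidates:
        used_racks = {topology[n][0] for n in chosen}
        # Prefer a node in a rack that holds no replica yet; otherwise
        # fall back to the first remaining node.
        pick = next((n for n in candidates if topology[n][0] not in used_racks),
                    candidates[0])
        chosen.append(pick)
        candidates.remove(pick)
    return chosen

# Hypothetical four-node cluster spread over three racks.
TOPOLOGY = {
    "n1": (1, 0, 1),
    "n2": (1, 0, 2),
    "n3": (2, 0, 1),
    "n4": (3, 0, 1),
}

print(place_replicas("n2", TOPOLOGY, replicas=3))  # ['n2', 'n3', 'n4']
```

The first replica lands on the writer (good for write bandwidth and later local reads), while the other two are spread over different racks so that a single rack failure cannot remove every copy.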

In a typical FPO cluster, nodes have direct-attached disks. These disks are not shared between nodes as in a traditional GPFS cluster, so if the node is inaccessible, the associated disks are also inaccessible. GPFS provides ways to automatically recover from these and similar common disk failure situations.

In FPO environments, automated recovery from disk failures can be enabled using the restripeOnDiskFailure=yes configuration option.

In GPFS-FPO environments, consider snapshot and placement policy requirements as you plan the GPFS file system layout. For example, temporary and transient data that does not need to be replicated can be placed in a fileset limited to one replica. Similarly, data that cannot be easily re-created may require frequent snapshots and therefore can reside in an independent fileset.

In a GPFS-FPO cluster, you will have a minimum of two storage pools: one for metadata (with standard block allocation) and one for data that is FPO-enabled. By default, GPFS places all data in the system pool unless specified otherwise. Best practices for FPO require separating data and metadata disks into different storage pools. This practice allows the use of a smaller block size for metadata and a larger block size for data. In the sample configuration, a different class of storage (SSD) is used for metadata, so it is best to create a separate pool. When you have more than one pool, you need to define a placement policy.
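
The idea behind a placement policy can be sketched as an ordered list of rules in which the first match wins. The Python model below is only an illustration of that logic, not GPFS policy syntax. The pool name "datapool", the fileset name "scratch", and the replica counts are assumptions chosen to mirror the fileset example above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class PlacementRule:
    name: str
    pool: str                      # target storage pool for file data
    replicas: int                  # number of data replicas to keep
    fileset: Optional[str] = None  # None makes this the catch-all default rule

def place(fileset: str, rules: List[PlacementRule]) -> Tuple[str, int]:
    """Return (pool, replicas) from the first rule that matches the fileset."""
    for rule in rules:
        if rule.fileset is None or rule.fileset == fileset:
            return rule.pool, rule.replicas
    raise LookupError(f"no placement rule matches fileset {fileset!r}")

# Hypothetical policy: transient data keeps one replica, everything else three.
RULES = [
    PlacementRule(name="scratch-data", pool="datapool", replicas=1, fileset="scratch"),
    PlacementRule(name="default", pool="datapool", replicas=3),
]

print(place("scratch", RULES))    # ('datapool', 1)
print(place("warehouse", RULES))  # ('datapool', 3)
```

Keeping the catch-all rule last matters: if it came first, it would shadow the more specific rule for the scratch fileset.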

GPFS-FPO benefits are listed below:

• Locality awareness so compute jobs can be scheduled on nodes containing the data
• Chunks that allow large and small block sizes to coexist in the same file system to make the
   most of data locality
• Write affinity allows applications to dictate the layout of files on different nodes to maximize
   both write and read bandwidth
• Distributed recovery to minimize the effect of failures on ongoing computation

GPFS Hadoop connector:

All big data applications run seamlessly with GPFS; no application changes are required. GPFS integrates with Hadoop applications through a Hadoop connector. A GPFS Hadoop connector is shipped with GPFS, and a customized version is shipped with InfoSphere BigInsights. Platform Symphony uses the GPFS connector shipped with the GPFS product. The GPFS Hadoop connector implements the Hadoop file system APIs using GPFS APIs to direct file system access requests to GPFS. This redirection of data access from HDFS to GPFS is enabled by changing Hadoop XML configuration files.
Figure: GPFS Hadoop connector (source)
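
As a rough illustration of what such a configuration change might look like, the Python sketch below generates a core-site.xml style fragment. Hadoop selects its default file system through the fs.defaultFS property and maps a URI scheme to an implementation class through fs.<scheme>.impl. The "gpfs" scheme and the class name used here are placeholders, not the real connector values; take the actual settings from the connector documentation.

```python
import xml.etree.ElementTree as ET
from typing import Dict

def build_core_site(properties: Dict[str, str]) -> str:
    """Render Hadoop-style <configuration> XML from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

EXAMPLE_PROPERTIES = {
    # Assumption: the connector is addressed through a "gpfs" URI scheme.
    "fs.defaultFS": "gpfs:///",
    # Placeholder class name, NOT the real connector class.
    "fs.gpfs.impl": "com.example.HypotheticalGpfsFileSystem",
}

print(build_core_site(EXAMPLE_PROPERTIES))
```

Because applications only see the Hadoop file system API, swapping these properties is what lets existing jobs run unchanged on a different file system.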

Related products to GPFS are:

• IBM DS and DCS disk arrays
• Tivoli® Storage Manager
• Tivoli Provisioning Manager
• HPSS (High Performance Storage System)
• IBM System Storage® SAN Volume Controller


The GPFS FPO license permits the licensed node to perform NSD server functions for
sharing GPFS data with other nodes that have a GPFS FPO or GPFS Server license.
This license cannot be used to share data with nodes that have a GPFS Client license
or non-GPFS nodes. Nodes running the GPFS File Placement Optimizer feature cannot coexist or interoperate with nodes running GPFS V3.4.

NOTE: The File Placement Optimizer (FPO) feature, available with GPFS V3.5 for Linux, extends GPFS for a new class of data-intensive big data applications.

GPFS provided better performance than HDFS on benchmark systems:

To provide a basis of comparison, tests were run comparing GPFS with the Apache Hadoop Distributed File System (HDFS). Benchmark results show that GPFS improves application performance over HDFS by 35% on the analytics benchmark (Terasort), 35% on write tests, and 50% on read tests using the Enhanced TestDFSIO benchmark.

Figure: Write performance (higher numbers are better). Source: Link

Figure: Read performance (higher numbers are better). Source: Link

NOTE: TestDFSIO is a distributed I/O benchmark that is part of the Hadoop distribution. For MR2 it can be found in "hadoop-mapreduce-client-jobclient-*-tests.jar".

Example 1:
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt
Example 2:
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOread.txt
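
Before starting such a run it helps to know how much data it will write. The short Python sketch below works that out from the -nrFiles and -fileSize arguments of the commands above. It assumes binary units (1 GB = 1024^3 bytes) and that a size without a suffix means megabytes. The helper names parse_size and total_bytes are made up for this example.

```python
import shlex

# Assumption: sizes use binary multiples.
UNITS = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}

def parse_size(text: str) -> int:
    """Convert a size such as '16GB' into bytes."""
    text = text.upper()
    # Longer suffixes are checked first so '16GB' is not read as '16G' + 'B'.
    for suffix in ("TB", "GB", "MB", "KB", "B"):
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * UNITS[suffix]
    return int(text) * UNITS["MB"]  # assumption: a bare number means MB

def total_bytes(command: str) -> int:
    """Total volume implied by -nrFiles and -fileSize in a TestDFSIO command."""
    args = shlex.split(command)
    files = int(args[args.index("-nrFiles") + 1])
    size = parse_size(args[args.index("-fileSize") + 1])
    return files * size

WRITE_COMMAND = ("hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO "
                 "-write -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt")

print(total_bytes(WRITE_COMMAND) / 1024**4)  # 1.0 (TB)
```

So the two examples each move 64 x 16 GB = 1 TB of logical data, before any replication is counted.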


The concept of "moving the application workload to the data" means that the job is moved to the data, as opposed to moving the data to the job. For example, if a cluster has 100 servers across 3 racks and GPFS-FPO knows that a copy of the data resides on the 50th server, the job is sent to that server. This reduces network traffic, since the data does not need to be moved, thereby improving performance and efficiency.
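
The scheduling preference described here can be sketched in a few lines of Python. This is an illustrative model of locality-aware task placement, not the code of any real scheduler: it prefers a free node that holds a replica, then a free node in the same rack as a replica, then any free node. The server and rack names are hypothetical.

```python
from typing import Dict, List, Optional, Set

def pick_node(replica_nodes: List[str],
              free_nodes: Set[str],
              rack_of: Dict[str, str]) -> Optional[str]:
    """Choose where to run a task that reads one block (simplified)."""
    # 1. Node-local: a free node that already stores a replica.
    for node in replica_nodes:
        if node in free_nodes:
            return node
    # 2. Rack-local: a free node sharing a rack with some replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in sorted(free_nodes):
        if rack_of[node] in replica_racks:
            return node
    # 3. Remote: any free node, or None when the cluster is fully busy.
    return min(free_nodes) if free_nodes else None

# Hypothetical slice of the 100-server, 3-rack cluster from the text.
RACK_OF = {"server49": "rack2", "server50": "rack2", "server90": "rack3"}

print(pick_node(["server50"], {"server49", "server50", "server90"}, RACK_OF))  # server50
print(pick_node(["server50"], {"server49", "server90"}, RACK_OF))              # server49
```

Only the last fallback moves the block across racks, which is the case that locality awareness tries to make rare.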

IBM General Parallel File System is a proven, enterprise-class file system for your big data applications. The advantages of using GPFS include important enterprise-class functionality such as access control security; proven scalability and performance; built-in file system monitoring; pre-integrated backup and recovery support; pre-integrated information lifecycle management; and file system quotas to restrict abuse, as well as immutability and AppendOnly features to protect the team from accidentally destroying critical data. Through these features, GPFS helps improve the time-to-value of the big data investment, allowing the enterprise to focus on resolving business problems.