LINUX & HPC : Advanced Large Scale Computing at a Glance !: September 2014

Sunday, September 14, 2014

IBM - Blue Gene Q Supercomputer

Blue Gene is an IBM project aimed at designing supercomputers that can reach operating speeds in the PFLOPS (petaFLOPS) range with low power consumption. The project created three generations of supercomputers: Blue Gene/L, Blue Gene/P, and Blue Gene/Q.

Blue Gene systems have often led the TOP500 and Green500 rankings of the most powerful and most power efficient supercomputers, respectively. Blue Gene systems have also consistently scored top positions in the Graph500 list.

The third supercomputer design in the Blue Gene series, Blue Gene/Q, has a peak performance of 20 petaflops and reaches a LINPACK benchmark performance of about 17 petaflops. A United States supercomputer built on this design sat atop the TOP500 list in June 2012: named Sequoia, the IBM Blue Gene/Q system installed at the Department of Energy’s Lawrence Livermore National Laboratory achieved 16.32 petaflop/s running the Linpack benchmark using 1,572,864 cores. Sequoia was the first system to be built using more than one million cores.

Sequoia is primarily water cooled and consists of 96 racks; 98,304 compute nodes; 1.6 million cores; and 1.6 petabytes of memory. Relative to the peak speeds of these systems, Sequoia is roughly 90 times more power efficient than Purple and about eight times more than BG/L. Sequoia is dedicated to NNSA's Advanced Simulation and Computing (ASC) program.

Blue Gene/Q hardware Overview :

The Blue Gene/Q Compute chip is an 18 core chip. The 64-bit PowerPC A2 processor cores are 4-way simultaneously multithreaded, and run at 1.6 GHz. Each processor core has a SIMD Quad-vector double precision floating point unit (IBM QPX). 16 Processor cores are used for computing, and a 17th core for operating system assist functions such as interrupts, asynchronous I/O, MPI pacing and RAS. The 18th core is used as a redundant spare, used to increase manufacturing yield. The spared-out core is shut down in functional operation.

The processor cores are linked by a crossbar switch to a 32 MB eDRAM L2 cache, operating at half core speed. The L2 cache is multi-versioned, supporting transactional memory and speculative execution, and has hardware support for atomic operations. L2 cache misses are handled by two built-in DDR3 memory controllers running at 1.33 GHz. The chip also integrates logic for chip-to-chip communications in a 5D torus configuration, with 2 GB/s chip-to-chip links. The Blue Gene/Q chip is manufactured on IBM's copper SOI process at 45 nm. It delivers a peak performance of 204.8 GFLOPS at 1.6 GHz, drawing about 55 watts. The chip measures 19×19 mm (359.5 mm²) and comprises 1.47 billion transistors. The chip is mounted on a compute card along with 16 GB DDR3 DRAM (i.e., 1 GB for each user processor core). A Q32 compute drawer holds 32 compute cards, each water cooled.
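The 204.8 GFLOPS figure can be reproduced from the numbers above. The following is a minimal sketch, assuming that each QPX unit completes one 4-wide fused multiply-add per cycle and that an FMA counts as two floating-point operations; the constant and function names are illustrative only.

```python
# Peak-performance arithmetic for one Blue Gene/Q compute chip.
USER_CORES = 16      # cores used for computing (from the text)
CLOCK_HZ = 1.6e9     # 1.6 GHz core clock (from the text)
SIMD_LANES = 4       # QPX is a quad-vector double-precision unit
FLOPS_PER_FMA = 2    # assumption: one fused multiply-add counts as 2 flops


def chip_peak_gflops() -> float:
    """Peak double-precision GFLOPS of a single compute chip."""
    return USER_CORES * CLOCK_HZ * SIMD_LANES * FLOPS_PER_FMA / 1e9


def system_peak_pflops(nodes: int) -> float:
    """Peak PFLOPS of a system with the given number of compute nodes."""
    return nodes * chip_peak_gflops() / 1e6


if __name__ == "__main__":
    print(f"chip peak:    {chip_peak_gflops():.1f} GFLOPS")
    print(f"Sequoia peak: {system_peak_pflops(98_304):.2f} PFLOPS")
```

With Sequoia's 98,304 compute nodes this gives roughly 20 petaflops, which matches the peak figure quoted earlier.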

A "midplane" (crate) of 16 compute drawers has a total of 512 compute nodes, electrically interconnected in a 5D torus configuration (4x4x4x4x2). Beyond the midplane level, all connections are optical. Racks have two midplanes, thus 32 compute drawers, for a total of 1024 compute nodes, 16,384 user cores and 16 TB RAM. Separate I/O drawers, placed at the top of a rack or in a separate rack, are air cooled and contain 8 compute cards and 8 PCIe expansion slots for Infiniband or 10 Gigabit Ethernet networking.
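The midplane and rack totals follow directly from the torus shape. As a rough sketch (the function names are made up for illustration), the snippet below counts the nodes in a 4x4x4x4x2 torus, derives the rack totals, and lists the distinct neighbours of a node with wrap-around in every dimension.

```python
from math import prod

MIDPLANE_DIMS = (4, 4, 4, 4, 2)   # 5D torus shape of one midplane (from the text)
USER_CORES_PER_NODE = 16
GB_PER_NODE = 16


def rack_totals(midplanes_per_rack: int = 2) -> dict:
    """Node, user-core and memory totals for one rack."""
    nodes = prod(MIDPLANE_DIMS) * midplanes_per_rack
    return {
        "nodes": nodes,
        "user_cores": nodes * USER_CORES_PER_NODE,
        "memory_tb": nodes * GB_PER_NODE / 1024,
    }


def torus_neighbors(coord, dims=MIDPLANE_DIMS):
    """Distinct neighbours of coord, wrapping around in each dimension."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(n))
    neighbors.discard(tuple(coord))
    return sorted(neighbors)


if __name__ == "__main__":
    print(prod(MIDPLANE_DIMS), "nodes per midplane")
    print(rack_totals())
    print(torus_neighbors((0, 0, 0, 0, 0)))
```

In the dimension of size 2, stepping forward or backward reaches the same node, so the sketch reports one distinct neighbour there rather than two.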

Blue Gene Q Architecture

Compiler Invocation on Blue Gene/Q

To run a code on the compute nodes you must compile and link it using a "cross-compiler" on the front-end (login node). A cross-compiler produces an executable that will run on the compute nodes.

If you instead use a native compiler, the resulting executable will run on the front-end but not on the remote nodes. Also, you can use the native compilers to compile and link only serial code because there are no mpich libraries on the front-end.

Currently you should run your applications on the compute nodes only. The native compilers can be used for compiling and linking the serial portion of your parallel application.

The locations and names of the "unwrapped" cross compilers appear below. For each one there is a description of the corresponding mpich-wrapper cross compiler, which is just a wrapper script that makes the cross compiler a bit easier to invoke.

If there is a thread-safe version of a compiler, its invocation name has an _r suffix; for example, bgxlc_r is the thread-safe version of the unwrapped IBM C cross compiler bgxlc.

Simple Example: Compile, Link, Run

Step 1 : Compile and link file hello.c which contains a C language MPI program:
mpixlc_r -o helloc hello.c
Step 2 : Submit the job interactively:
runjob --block BLOCKID --exe /home/spb/helloc -p 16 --np 2048 --env-all --cwd /home/spb > job.output 2>&1
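When the same job is submitted with different sizes, it can help to assemble the command from its parts. The sketch below is a hypothetical helper, not part of any Blue Gene tooling; it uses only the flags shown in Step 2 and assumes that -p gives the number of ranks per node, so that 2048 ranks at 16 per node occupy 128 compute nodes.

```python
import shlex


def runjob_command(block, exe, ranks_per_node, np, cwd):
    """Build the runjob argument list using only the flags shown above."""
    return [
        "runjob", "--block", block, "--exe", exe,
        "-p", str(ranks_per_node), "--np", str(np),
        "--env-all", "--cwd", cwd,
    ]


def nodes_needed(np, ranks_per_node):
    """Compute nodes occupied, assuming -p means ranks per node."""
    return -(-np // ranks_per_node)   # ceiling division


if __name__ == "__main__":
    cmd = runjob_command("BLOCKID", "/home/spb/helloc", 16, 2048, "/home/spb")
    print(shlex.join(cmd))
    print(nodes_needed(2048, 16), "compute nodes")
```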

Applications
Record-breaking science applications have been run on the BG/Q, the first to cross 10 petaflops of sustained performance. The cosmology simulation framework HACC achieved almost 14 petaflops with a 3.6 trillion particle benchmark run, while the Cardioid code, which models the electrophysiology of the human heart, achieved nearly 12 petaflops with a near real-time simulation, both on Sequoia.

Examples of applications running on Blue Gene

A typical supercomputer consumes large amounts of electrical power, almost all of which is converted into heat. Packing thousands of processors together inevitably produces a heat density that has to be dealt with. In the Blue Gene system, IBM deliberately used low-power processors to deal with heat density, and hot-water cooling to achieve energy efficiency. The energy efficiency of computer systems is generally measured in "FLOPS per watt". In November 2010 IBM's Blue Gene/Q reached 1684 MFLOPS/W. In June 2011 the top two spots on the Green500 list were occupied by Blue Gene machines achieving 2097 MFLOPS/W.
___________________________________________________________________________________________________________________________

Reference:

http://www.top500.org/featured/systems/sequoia-lawrence-livermore-national-laboratory/

http://www.bnl.gov/bluegene/content/guide/bgq/compileinvokeq.shtml

http://www.bnl.gov/bluegene/content/guide/bgq/index.php

Open-Source High-Availability with MySQL Database Fabric

MySQL Fabric is an integrated system for managing a collection of MySQL servers and is the framework on which high availability and sharding are built. MySQL Fabric is open-source and is intended to be extensible, easy to use, and able to support procedure execution even in the presence of failure, an execution model called resilient execution.

MySQL (often pronounced "My Sequel") is one of the world's most widely used open-source relational database management systems (RDBMS). It ships with no GUI tools for administering MySQL databases or managing the data they contain. The official set of MySQL front-end tools, MySQL Workbench, is actively developed by Oracle and is freely available for use.

Though MySQL began as a low-end alternative to more powerful proprietary databases, it has gradually evolved to support higher-scale needs as well. There are, however, limits to how far performance can scale on a single server ('scaling up'), so at larger scales, multi-server MySQL ('scaling out') deployments are required to provide improved performance and reliability. A typical high-end configuration can include a powerful master database which handles data write operations and is replicated to multiple slaves that handle all read operations. The master server synchronizes continually with its slaves so that in the event of failure a slave can be promoted to become the new master, minimizing downtime. Further improvements in performance can be achieved by caching the results of database queries in memory using memcached, or by breaking a database down into smaller chunks called shards, which can be spread across a number of distributed server clusters.

Ensuring high availability requires a certain amount of redundancy in the system. For database systems, the redundancy traditionally takes the form of having a primary server acting as a master, and using replication to keep secondaries available to take over in case the primary fails. This means that the "server" that the application connects to is in reality a collection of servers, not a single server. In a similar manner, if the application is using a sharded database, it is in reality working with a collection of servers, not a single server. In this case, a collection of servers is usually referred to as a farm. MySQL Fabric is an integrated framework for managing farms of MySQL servers with support for both high availability and sharding.

MySQL Fabric is an extensible framework for managing farms of MySQL Servers. Two features have been implemented - High Availability (HA) and scaling out using data sharding. These features can be used in isolation or in combination.

Introduction to Fabric :

To take advantage of Fabric, an application requires an augmented version of a MySQL connector which accesses Fabric using the XML-RPC protocol. MySQL Connectors are used by the application code to access the database(s), converting instructions from a specific programming language to the MySQL wire protocol, which is used to communicate with the MySQL Server processes. A ‘Fabric-aware’ connector stores a cache of the routing information that it has received from the mysqlfabric process and then uses that information to send transactions or queries to the correct MySQL Server. Currently the three supported Fabric-aware MySQL connectors are for PHP, Python and Java (and in turn the Doctrine and Hibernate Object-Relational Mapping frameworks).

Fabric manages sets of MySQL servers that have Global Transaction Identifiers (GTIDs) enabled to check and maintain consistency among servers. Sets of servers are called high-availability groups. Information about all of the servers and groups is managed by a separate MySQL instance, which cannot be a member of the Fabric high-availability groups. This server instance is called the backing store.

Fabric organizes servers in high-availability groups for managing different shards. For example, if standard asynchronous replication is in use, Fabric may be configured to automatically monitor the status of servers in a group. If the current master in a group fails, it elects a new one if a server in the group can become a master.

Besides the high-availability operations such as failover and switchover, Fabric also permits shard operations such as shard creation and removal.

Fabric is written in Python and includes a special library that implements all of the functionality provided. To interact with Fabric, a special utility named mysqlfabric provides a set of commands you can use to create and manage groups, define and manipulate sharding, and much more.

Both features are implemented in two layers:

The mysqlfabric process which processes any management requests. When using the HA feature, this process can also be made responsible for monitoring the master server and initiating failover to promote a slave to be the new master should it fail.

MySQL Fabric-aware connectors, which store a cache of the routing information that they have fetched from MySQL Fabric and then use that information to send transactions or queries to the correct MySQL Server.

MySQL Fabric provides high availability and database sharding for MySQL Servers

High Availability

HA Groups are formed from pools of two or more MySQL Servers; at any point in time, one of those servers is the Primary (MySQL Replication master) and the others are Secondaries (MySQL Replication slaves). The role of a HA Group is to ensure that access to the data held within that group is always available.

While MySQL Replication allows the data to be made safe by duplicating it, for a HA solution two extra components are needed and MySQL Fabric provides these:
1) Failure detection and promotion - MySQL Fabric monitors the Primary within the HA group and should that server fail then it selects one of the Secondaries and promotes it to be the Primary

2) Routing of database requests - The routing of writes to the Primary and load balancing reads across the slaves is transparent to the application, even when the topology changes during failover.
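The two components above can be illustrated with a toy model. This is only a sketch of the idea, not MySQL Fabric's implementation or API: the class names are made up, and the promotion rule (first healthy Secondary wins) and round-robin read balancing are assumptions chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    alive: bool = True


@dataclass
class HAGroup:
    primary: Server
    secondaries: list = field(default_factory=list)
    _next_read: int = 0

    def check_and_failover(self) -> Server:
        """Promote the first healthy Secondary if the Primary has failed."""
        if self.primary.alive:
            return self.primary
        for candidate in self.secondaries:
            if candidate.alive:
                self.secondaries.remove(candidate)
                self.primary = candidate
                return candidate
        raise RuntimeError("no healthy Secondary available for promotion")

    def route(self, is_write: bool) -> Server:
        """Send writes to the Primary; balance reads across healthy Secondaries."""
        healthy = [s for s in self.secondaries if s.alive]
        if is_write or not healthy:
            return self.primary
        server = healthy[self._next_read % len(healthy)]
        self._next_read += 1
        return server


if __name__ == "__main__":
    group = HAGroup(Server("db1"), [Server("db2"), Server("db3")])
    print(group.route(is_write=True).name)
    group.primary.alive = False            # simulate a Primary failure
    print(group.check_and_failover().name)
```

The application keeps calling route() in the same way before and after the failure, which is the sense in which routing stays transparent when the topology changes.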

Adding MySQL Servers to Create a HA Farm: At this point, MySQL Fabric is up and running but it has no MySQL Servers to manage. This figure shows what the configuration will look like once MySQL Servers have been added to create a HA server farm.

Three MySQL Servers will make up the managed HA group – each running on a different machine

Sharding - Scaling out

When nearing the capacity or write performance limit of a single MySQL Server (or HA group), MySQL Fabric can be used to scale-out the database servers by partitioning the data across multiple MySQL Server "groups". Note that a group could contain a single MySQL Server or it could be a HA group.

The administrator defines how data should be sharded between these servers; indicating which table columns should be used as shard keys and whether HASH or RANGE mappings should be used to map from those keys to the correct shard as described below:

1) HASH: A hash function is run on the shard key to generate the shard number. If values held in the column used as the sharding key don’t tend to have too many repeated values then this should result in an even partitioning of rows across the shards.

2) RANGE: The administrator defines an explicit mapping between ranges of values for the sharding key and shards. This gives maximum control to the user of how data is partitioned and which rows should be co-located.

When the application needs to access the sharded database, it sets a property for the connection that specifies the sharding key – the Fabric-aware connector will then apply the correct range or hash mapping and route the transaction to the correct shard.
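The two mapping types can be sketched in a few lines. This is an illustration under stated assumptions, not Fabric's actual algorithm: MD5 is an arbitrary choice of hash function, the RANGE table is expressed as lower bounds, and the group names and key values are made up.

```python
import bisect
import hashlib


class HashMapping:
    """HASH: hash the shard key and pick a group from the result."""

    def __init__(self, groups):
        self.groups = list(groups)

    def lookup(self, key):
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.groups[int(digest, 16) % len(self.groups)]


class RangeMapping:
    """RANGE: each shard owns keys from its lower bound up to the next bound."""

    def __init__(self, lower_bounds):
        self.bounds = sorted(lower_bounds)
        self.groups = [lower_bounds[b] for b in self.bounds]

    def lookup(self, key):
        idx = bisect.bisect_right(self.bounds, key) - 1
        if idx < 0:
            raise KeyError(f"no shard covers key {key!r}")
        return self.groups[idx]


if __name__ == "__main__":
    by_range = RangeMapping({1: "group_id-1", 10000: "group_id-2"})
    print(by_range.lookup(500), by_range.lookup(12345))
    by_hash = HashMapping(["group_id-1", "group_id-2"])
    print(by_hash.lookup(500))
```

RANGE makes it easy to predict which rows end up together, while HASH spreads keys without the administrator having to choose boundaries.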

If further shards/groups are needed then MySQL Fabric can split an existing shard into two and then update the state-store and the caches of routing data held by the connectors. Similarly, a shard can be moved from one HA group to another.

The steps that follow evolve that configuration into one containing two shards as shown in the following figure.

Another HA group (group_id-2) is created from three newly created MySQL Servers, and one of the servers is then promoted to be the Primary. At this point, the new HA group exists but is missing the application schema and data. Before a shard is allocated to the group, RESET MASTER needs to be executed on the group's Primary (this is required because changes have already been made on that server; if nothing else, to grant permissions for one or more users to connect remotely). The mysqlfabric group lookup_server command is used first to check which of the three servers is currently the Primary. The next step is to split the existing shard, specifying the shard id and the name of the HA group where the new shard will be stored. Python code then adds some new rows to the subscribers table. The tables property for the connection is set to that table and the key property to the value of the sub_no column, which is enough information for the Fabric-aware connector to choose the correct shard/HA group. Setting the mode property to fabric.MODE_READWRITE further tells the connector that the transaction should be sent to the Primary within that HA group. The mysql client can then be used to confirm that the new data has been partitioned between the two shards/HA groups.

Current Limitations

1) Sharding is not completely transparent to the application. While the application need not be aware of which server stores a set of rows and it doesn’t need to be concerned when that data is moved, it does need to provide the sharding key when accessing the database.

2) All transactions and queries need to be limited in scope to the rows held in a single shard, together with the global (non-sharded) tables. For example, joins involving multiple shards are not supported. Because the connectors perform the routing function, the extra latency involved in proxy-based solutions is avoided, but it does mean that Fabric-aware connectors are required; at the time of writing these exist for PHP, Python and Java.

3) The MySQL Fabric process itself is not fault-tolerant and must be restarted in the event of it failing. Note that this does not represent a single-point-of-failure for the server farm (HA and/or sharding) as the connectors are able to continue routing operations using their local caches while the MySQL Fabric process is unavailable.
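The third point can be pictured with a small cache in front of a lookup function. This is a hedged sketch rather than how the real connectors are written: CachedRouter and the fetch callable are hypothetical, and cache expiry is ignored.

```python
class CachedRouter:
    """Keep routing answers locally so known keys survive a Fabric outage."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable: shard key -> group name
        self._cache = {}

    def lookup(self, key):
        if key not in self._cache:
            # Only a cache miss needs the Fabric process; this may raise.
            self._cache[key] = self._fetch(key)
        return self._cache[key]


if __name__ == "__main__":
    fabric_up = {"value": True}

    def fetch(key):
        if not fabric_up["value"]:
            raise ConnectionError("Fabric unavailable")
        return "group_id-1"

    router = CachedRouter(fetch)
    print(router.lookup(42))       # fetched from Fabric, then cached
    fabric_up["value"] = False
    print(router.lookup(42))       # still answered from the local cache
```

Keys that were never looked up before the outage cannot be routed until the Fabric process is back, which is why a restart is still needed.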

_____________________________________________________________________________

Reference:

1) http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric-intro.html

2) http://en.wikipedia.org/wiki/MySQL

3) http://mysqlmusings.blogspot.in/2014/05/mysql-fabric-musings-release-1.4.3.html

4) http://mysqlmusings.blogspot.in/2013/09/brief-introduction-to-mysql-fabric.html

5) http://www.mysql.com/products/enterprise/fabric.html

6) http://www.paranet.com/blog/bid/133845/Difference-between-Synchronous-Asynchronous-Replication-Table

7) http://www.clusterdb.com/mysql-fabric/mysql-fabric-adding-high-availability-and-scaling-to-mysql

8) http://www.evidian.com/products/high-availability-software-for-application-clustering/shared-nothing-cluster-vs-shared-disk-cluster/

Monday, September 8, 2014

Apache Mesos - Open Source Datacenter Computing

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open-source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications efficiently in large-scale clustered environments. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel, such as “cgroups” in Linux and “zones” in Solaris, to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to make a large collection of heterogeneous resources look like a single shared pool. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various Web servers, databases and application servers.

Mesos - Node Abstraction

In the same way that a PC OS manages access to the resources on a desktop computer, Mesos ensures applications have access to the resources they need in a cluster. Instead of setting up numerous server clusters for different parts of an application, Mesos allows you to share a pool of servers that can all run different parts of your application without interfering with each other, and with the ability to dynamically allocate resources across the cluster as needed. That means it could easily switch resources away from framework1 [big-data analysis] and allocate them to framework2 [web server] if there is heavy network traffic. It also removes a lot of the manual steps in deploying applications and can shift workloads around automatically to provide fault tolerance and keep utilization rates high.

Resource sharing across the cluster increases throughput and utilization

Mesos ==> " Data Center Kernel "

Mesos - One large pool of resources

Mesos is essentially a data center kernel, which means it is the software that actually isolates the running workloads from each other. It still needs additional tooling to let engineers get their workloads running on the system and to manage when those jobs actually run. Otherwise, some workloads might consume all the resources, or important workloads might get bumped by less-important workloads that happen to require more resources. Hence Mesos needs more than just a kernel. One such tool is the Chronos scheduler, a cron replacement that runs on top of Mesos and automatically starts and stops services (and handles failures). The other is Marathon, which provides an API for starting, stopping and scaling services (and Chronos could be one of those services).

Architecture:

Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves. The master implements fine-grained sharing across frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves. The master decides how many resources to offer to each framework according to an organizational policy, such as fair sharing or priority. To support a diverse set of inter-framework allocation policies, Mesos lets organizations define their own policies via a pluggable allocation module.

Mesos Architecture with two running Frameworks (Hadoop and MPI)

Each framework running on Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework’s tasks. While the master determines how many resources to offer to each framework, the frameworks’ schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them.

Resource Offer

The figure shows an example of how a framework gets scheduled to run tasks. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered all available resources. In step (2), the master sends a resource offer describing these resources to framework 1. In step (3), the framework’s scheduler replies to the master with information about two tasks to run on the slave, using 2 CPUs and 1 GB RAM for the first task, and 1 CPU and 2 GB RAM for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted borders). Because 1 CPU and 1 GB of RAM are still free, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.
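The bookkeeping in this walkthrough is simple enough to check in code. The following is a toy model with made-up names (Resources, launch), not the Mesos API; it only tracks CPUs and memory the way the example does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int

    def fits(self, other: "Resources") -> bool:
        return other.cpus <= self.cpus and other.mem_gb <= self.mem_gb

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)


def launch(offer: Resources, tasks: list) -> Resources:
    """Deduct each accepted task from the offer; return what can be re-offered."""
    remaining = offer
    for task in tasks:
        if not remaining.fits(task):
            raise ValueError(f"task {task} does not fit in {remaining}")
        remaining = remaining.minus(task)
    return remaining


if __name__ == "__main__":
    offer = Resources(cpus=4, mem_gb=4)                  # steps (1) and (2)
    tasks = [Resources(2, 1), Resources(1, 2)]           # step (3)
    print(launch(offer, tasks))                          # left over after step (4)
```

The leftover 1 CPU and 1 GB are what the allocation module may then offer to framework 2.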

The thin interface provided by Mesos allows it to scale and allows the frameworks to evolve independently. A framework will reject the offers that do not satisfy its constraints and accept the ones that do. In particular, the Mesos authors report that a simple policy called delay scheduling, in which frameworks wait for a limited time to acquire nodes storing the input data, yields nearly optimal data locality.
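Delay scheduling can be sketched as a small decision loop. This is a simplified illustration, assuming offers arrive as (time, node) pairs in time order and that a single wait limit applies; the function name and parameters are hypothetical.

```python
def delay_schedule(offers, preferred_nodes, max_wait):
    """Pick a node for one task.

    Decline non-local offers until max_wait has elapsed, then accept
    whatever is offered. Returns (node, is_local), or (None, False) if
    the offers run out first.
    """
    for time, node in offers:
        if node in preferred_nodes:
            return node, True       # local offer: accept immediately
        if time >= max_wait:
            return node, False      # waited long enough: give up on locality
    return None, False


if __name__ == "__main__":
    offers = [(0, "n1"), (1, "n2"), (2, "n3")]
    # The input data lives on n3; waiting up to 5 time units gets a local slot.
    print(delay_schedule(offers, {"n3"}, max_wait=5))
    # With a shorter limit the task settles for a non-local node.
    print(delay_schedule(offers, {"n3"}, max_wait=1))
```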

Mesos Scheduler APIs :
The API consists of two primitives, calls and events, which are low-level, unreliable, one-way message passing.

"Calls" are messages sent to Mesos.

Life cycle management (start, failover, stop) - Register, Reregister, Unregister
Resource allocation - Request, Decline, Revive
Task management - Launch, Kill, Acknowledgement, Reconcile

"Events" are messages that the framework receives.

Life cycle management - Registered, Reregistered
Resource allocation - Offers, Rescind
Task management - Update

Scheduler communication with Mesos

The scheduler sends a REGISTER call to the Mesos master.
The Mesos master responds with an acknowledgement that the scheduler is REGISTERED.
Optionally, the scheduler sends a REQUEST for specific resources.
The master allocates some resources for the scheduler, which show up as an OFFER.
The scheduler can use the offered resources to run tasks, once it decides what tasks it wants to run.
The master then launches the tasks on the specified slaves. The scheduler might receive further OFFERs for more resources. Later, it may get an UPDATE with the state of a task (updates are asynchronous in nature).
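The exchange above can be mimicked with a toy master that turns calls into events. This is a rough simulation for illustration only, not the Mesos scheduler API: ToyMaster is made up, only CPUs are tracked, events are returned synchronously even though real updates are asynchronous, and the "TASK_RUNNING" payload is used here simply as an example task state.

```python
from enum import Enum, auto


class Call(Enum):
    REGISTER = auto()
    DECLINE = auto()
    LAUNCH = auto()


class Event(Enum):
    REGISTERED = auto()
    OFFERS = auto()
    UPDATE = auto()


class ToyMaster:
    """Answers scheduler calls with the events described in the steps above."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.registered = False

    def handle(self, call, cpus=0):
        if call is Call.REGISTER:
            self.registered = True
            return [(Event.REGISTERED, None), (Event.OFFERS, self.free_cpus)]
        if not self.registered:
            raise RuntimeError("scheduler must REGISTER first")
        if call is Call.LAUNCH:
            if cpus > self.free_cpus:
                raise ValueError("task needs more CPUs than were offered")
            self.free_cpus -= cpus
            return [(Event.UPDATE, "TASK_RUNNING")]
        if call is Call.DECLINE:
            return []
        raise ValueError(f"unsupported call: {call}")


if __name__ == "__main__":
    master = ToyMaster(free_cpus=4)
    print(master.handle(Call.REGISTER))
    print(master.handle(Call.LAUNCH, cpus=3))
    print(master.free_cpus, "CPU left to offer")
```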

Task/Executor isolation:
To get more control over task management, executors are used. The executor decides how it actually wants to run the task. An executor can run one or more tasks (like threads in a distributed system), so multiple tasks can be assigned to the same executor. An interesting point to note here is isolation, i.e. the allocation of containers both for the executor tree with multiple tasks and for individual tasks. An executor's resource needs change over time, and the resources per container are adjusted dynamically. In summary, Mesos gives elasticity on a per-node basis or across the cluster, so containers can grow or shrink dynamically. Applications like Spark benefit in performance for the same reason.

Features of Mesos :

Fault-tolerant replicated master using ZooKeeper

Scalability to 10,000s of nodes

Isolation between tasks with Linux Containers

Multi-resource scheduling (memory and CPU aware)

Java, Python and C++ APIs for developing new parallel applications

Web UI for viewing cluster state

Running at container level increases the performance

Software projects built on Mesos :

Long Running Services:
1) Aurora is a service scheduler that runs on top of Mesos, enabling you to run long-running services that take advantage of Mesos' scalability, fault-tolerance, and resource isolation.

2) Marathon is a private PaaS built on Mesos. It automatically handles hardware or software failures and ensures that an app is “always on”.

3) Singularity is a scheduler (HTTP API and web interface) for running Mesos tasks: long running processes, one-off tasks, and scheduled jobs.

4) SSSP is a simple web application that provides a white-label “Megaupload” for storing and sharing files in S3.

Big Data Processing :
1) Cray Chapel is a productive parallel programming language. The Chapel Mesos scheduler lets you run Chapel programs on Mesos.

2) Dpark is a Python clone of Spark, a MapReduce-like framework written in Python, running on Mesos.

3) Exelixi is a distributed framework for running genetic algorithms at scale.

4) Hadoop: Running Hadoop on Mesos distributes MapReduce jobs efficiently across an entire cluster.

5) Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms.

6) MPI is a message-passing system designed to function on a wide variety of parallel computers.

7) Spark is a fast and general-purpose cluster computing system which makes parallel jobs easy to write.

8) Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

Batch Scheduling:
1) Chronos is a distributed job scheduler that supports complex job topologies. It can be used as a more fault-tolerant replacement for Cron.

2) Jenkins is a continuous integration server. The mesos-jenkins plugin allows it to dynamically launch workers on a Mesos cluster depending on the workload.

3) JobServer is a distributed job scheduler and processor which allows developers to build custom batch processing Tasklets using a point-and-click web UI.

4) Torque is a distributed resource manager providing control over batch jobs and distributed compute nodes.

Data Storage:
1) Cassandra is a highly available distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

2) ElasticSearch is a distributed search engine. Mesos makes it easy to run and scale.

3) Hypertable is a high performance, scalable, distributed storage and processing system for structured and unstructured data.

-----------------------------------------------------------------------------------------------------------------------------------------------

Conclusion:

Trends such as cloud computing and big data are moving organizations away from consolidation and into situations where they might have multiple distributed systems dedicated to specific tasks. With the help of the Docker executor for Mesos, Mesos can run and manage Docker containers in conjunction with the Chronos and Marathon frameworks. Docker containers provide a consistent, compact and flexible means of packaging application builds. Delivering applications with Docker on Mesos promises a truly elastic, efficient and consistent platform for delivering a range of applications on premises or in the cloud.

______________________________________________________________________________________________
References :
1) http://mesos.apache.org/documentation/latest/
2) http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf
3) http://howtojboss.com/2013/09/04/ampd-for-hadoop-alternatives/
4) http://typesafe.com/blog/play-framework-grid-deployment-with-mesos
5) https://mesosphere.io/2013/09/26/docker-on-mesos/

Monday, September 1, 2014

Integrating GPFS with Apache Hadoop on IBM Big Data Platform

Enterprises are rapidly discovering the power of Apache Hadoop and big data to drive more powerful, data-driven insights and better position their organizations in an increasingly dynamic and competitive economy. Hadoop’s innovation in scalability, cost efficiency, and flexibility affords enterprises an unprecedented ability to handle greater volume, greater velocity, and greater variety of data than traditional data management technologies. For these information-driven enterprises, the next step in the evolution of data management is an enterprise data hub (EDH): a single, unified data management system that combines distributed storage and computation. It can expand indefinitely and store any amount of data of any type, while bringing a wide range of processing, computation, and query engines and third-party applications directly to the data, thus reversing the traditional flow of moving data to computing environments. The Hadoop Distributed File System (HDFS) is a key enabler of this paradigm shift of bringing the application to the data, thereby avoiding network bottlenecks.

General Parallel File System

GPFS is a high performance, enterprise-class parallel file system for clusters. It is an IBM product which was first released in 1998. GPFS has been available on IBM's AIX since 1998, on Linux since 2001 and on Microsoft Windows Server since 2008. Over the years it has evolved to support a variety of workloads and can scale to thousands of nodes. GPFS has been deployed in many enterprise customer production environments to support mission-critical applications.

Why - GPFS on Hadoop

Hadoop is on its way to becoming the de facto platform for the next-generation of data-based applications. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren’t keen of its direct-attached storage (DAS) architecture.

While Hadoop Distributed File System (HDFS) is a component of the Apache Hadoop package. It has several short-comings which can be overcome by replacing HDFS with another file system. One such approach offered by IBM with BigInsights is the IBM General Parallel File System (GPFS).

GPFS was developed well before MapReduce, the distributed computing paradigm of the Hadoop framework. GPFS by itself had no MapReduce capability, because its storage nodes are distinct from its compute nodes. Simply mounting GPFS on all Hadoop nodes is not effective, because all data is remote from the compute nodes, i.e. no data locality is achieved. In 2010, GPFS was extended to work seamlessly with Hadoop as the GPFS Shared Nothing Cluster (GPFS-SNC) architecture, which is now available under the name GPFS File Placement Optimizer (FPO). FPO allows complete control over data placement for all replicas, if the application so desires.

Hadoop Installation with GPFS :

• Eliminates the single point of failure of the NameNode without requiring a costly high availability design
• Provides tiered storage to place data on the right type of storage (SSD, 15k SAS, 7200 RPM NL-SAS or any other)
• Uses native InfiniBand RDMA for better throughput, lower latency and more CPU cycles for your application


IBM General Parallel File System

IBM has been selling its General Parallel File System to high-performance computing customers for years (including within some of the world's fastest supercomputers), and in 2010 it tuned GPFS for Hadoop. IBM claims the GPFS-SNC (Shared Nothing Cluster) edition is faster than HDFS in part because it runs at the kernel level rather than on top of the operating system, as HDFS does. The IBM BigInsights V2.1 release introduced support for GPFS File Placement Optimizer (FPO). GPFS FPO is a set of features that allow GPFS to support MapReduce applications on clusters with no shared disk.

1) InfoSphere BigInsights now supports IBM General Parallel File System (GPFS), an enterprise file system that is an alternative to HDFS. GPFS provides hierarchical storage management, high performance support for MapReduce and traditional applications, high availability, and other benefits over HDFS.

2) This new capability provides enterprise-class distributed file system support that is Portable Operating System Interface (POSIX) compliant. GPFS brings established big data distributed file system capabilities to the Hadoop and MapReduce environment. The GPFS distributed metadata feature eliminates any single point of failure in the file system.

3) As you move from the sandbox to integrated rollout, your Hadoop application deals with files that might be on different systems, with different operating systems, and possibly used by different applications in addition to MapReduce. With full POSIX compliance and the ability to work with more traditional applications that require read/write capabilities, GPFS can serve all of your file-based analytics requirements.

Tuning GPFS configuration for FPO :
GPFS-FPO clusters have a different architecture than traditional GPFS deployments. Therefore, the default values of many configuration parameters are not suitable for FPO deployments and should be changed as described in the IBM GPFS FPO settings documentation (see reference 4).

Network shared disk: When a LUN provided by a storage subsystem is configured for use by GPFS, it is referred to as a network shared disk (NSD). In a traditional GPFS installation, a LUN is typically made up of multiple physical disks provided by a RAID device. Once the LUN is defined as an NSD, GPFS allows all cluster nodes to have direct access to the NSDs. Alternatively, a subset of the nodes connected to the disks may provide access to the LUNs as NSD servers.

In GPFS-FPO deployments, a physical disk and an NSD can have a 1:1 mapping. In this case, each node in the cluster is an NSD server that provides access to its disks to the rest of the cluster. FPO introduced the multipart failure group concept to convey topology information about the nodes in the cluster, which GPFS can exploit when making data placement decisions. Data block placement decisions in FPO environments are affected by the level of replication and by the values of the write-affinity depth and write-affinity failure group parameters.
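To make the idea of topology-aware placement concrete, here is a minimal Python sketch. It is a toy model, not the GPFS-FPO algorithm: the node names, the (rack, position, node) tuples and the function place_replicas are all hypothetical, and the only rule it implements is "first copy on the writing node, further copies on racks not yet used where possible".

```python
from typing import Dict, List, Tuple

# Hypothetical topology: node name -> (rack, position, node index).
# The three-part tuple only imitates the idea of a multipart failure group.
Topology = Dict[str, Tuple[int, int, int]]

EXAMPLE_TOPOLOGY: Topology = {
    "n1": (1, 0, 1),
    "n2": (1, 0, 2),
    "n3": (2, 0, 1),
    "n4": (3, 0, 1),
}


def place_replicas(writer: str, topology: Topology, replicas: int) -> List[str]:
    """Toy placement: first copy on the writer, then spread over unused racks."""
    chosen = [writer]
    used_racks = {topology[writer][0]}
    # Prefer nodes in racks that do not hold a copy yet.
    for node in sorted(topology):
        if len(chosen) == replicas:
            break
        rack = topology[node][0]
        if node not in chosen and rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
    # Fewer racks than replicas: fall back to any remaining node.
    for node in sorted(topology):
        if len(chosen) == replicas:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen
```

With the example topology, a block written on n2 with three replicas keeps one copy on n2 and puts the other two in racks 2 and 3. In a real cluster the outcome depends on the replication level and the write-affinity parameters mentioned above.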

In a typical FPO cluster, nodes have direct-attached disks. These disks are not shared between nodes as in a traditional GPFS cluster, so if the node is inaccessible, the associated disks are also inaccessible. GPFS provides ways to automatically recover from these and similar common disk failure situations.

In FPO environments, automated recovery from disk failures can be enabled using the restripeOnDiskFailure=yes configuration option.

In GPFS-FPO environments, consider snapshot and placement policy requirements as you plan the GPFS file system layout. For example, temporary and transient data that does not need to be replicated can be placed in a fileset limited to one replica. Similarly, data that cannot be easily re-created may require frequent snapshots and therefore can reside in an independent fileset.

In a GPFS-FPO cluster, you will have a minimum of two storage pools: one for metadata (with standard block allocation) and one for data that is FPO-enabled. By default, GPFS places all data in the system pool unless a placement policy specifies otherwise. Best practices for FPO require separating data and metadata disks into different storage pools. This practice allows the use of a smaller block size for metadata and a larger block size for data. When a different class of storage (for example, SSD) is used for metadata, it is best to create a separate pool for it. When you have more than one pool, you need to define a placement policy.
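The role of a placement policy can be illustrated with a small Python sketch. This is not GPFS policy syntax: the pool names, the file-name patterns and the function choose_pool are hypothetical, and the sketch only shows the general idea of an ordered rule list in which the first matching rule decides the pool and a catch-all rule acts as the default.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical rules: (file-name pattern, target pool). Pool names are made up.
RULES: List[Tuple[str, str]] = [
    ("*.tmp", "scratchpool"),  # transient data goes to its own pool
    ("*", "datapool"),         # catch-all default for everything else
]


def choose_pool(file_name: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Return the pool of the first rule whose pattern matches the file name."""
    for pattern, pool in rules:
        if fnmatchcase(file_name, pattern):
            return pool
    raise ValueError("no placement rule matches " + file_name)
```

For example, choose_pool("job1.tmp") selects the scratch pool, while any other file name falls through to the default data pool.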

GPFS-FPO Benefits are listed below :

• Locality awareness so compute jobs can be scheduled on nodes containing the data
• Chunks that allow large and small block sizes to coexist in the same file system to make the most of data locality
• Write affinity allows applications to dictate the layout of files on different nodes to maximize both write and read bandwidth
• Distributed recovery to minimize the effect of failures on ongoing computation

GPFS Hadoop connector :

Hadoop applications run on GPFS without application changes. GPFS integrates with Hadoop through a Hadoop connector: one version is shipped with GPFS, and a customized version is shipped with InfoSphere BigInsights. Platform Symphony uses the GPFS connector shipped with the GPFS product. The GPFS Hadoop connector implements the Hadoop file system APIs using GPFS APIs, so that file system access requests are directed to GPFS. This redirection of data access from HDFS to GPFS is enabled by changing the Hadoop XML configuration files.
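As a rough illustration of what such a configuration change looks like, the Python sketch below renders Hadoop-style property XML from a dictionary. The helper render_hadoop_site is hypothetical. fs.defaultFS and the fs.<scheme>.impl naming pattern are standard Hadoop conventions, but the gpfs:/// URI and the connector class name used here are assumptions that should be checked against the documentation of the connector version in use.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_hadoop_site(properties: Dict[str, str]) -> str:
    """Render name/value pairs in the layout used by Hadoop *-site.xml files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values for illustration only; verify them before use.
example_xml = render_hadoop_site({
    "fs.defaultFS": "gpfs:///",
    "fs.gpfs.impl": "org.apache.hadoop.fs.gpfs.GeneralParallelFileSystem",
})
print(example_xml)
```

The point of the sketch is that the switch happens purely in configuration: the application keeps calling the Hadoop file system API, and the configured scheme decides which implementation serves the request.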

Figure: GPFS Hadoop connector

Related products to GPFS are:

• IBM DS and DCS disk arrays
• Tivoli® Storage Manager
• Tivoli Provisioning Manager
• HPSS (High Performance Storage System)
• IBM System Storage® SAN Volume Controller

Compatibility

The GPFS FPO license permits the licensed node to perform NSD server functions for sharing GPFS data with other nodes that have a GPFS FPO or GPFS Server license. This license cannot be used to share data with nodes that have a GPFS Client license or with non-GPFS nodes. Nodes running the GPFS File Placement Optimizer feature cannot coexist or interoperate with nodes running GPFS V3.4.

NOTE: The File Placement Optimizer (FPO) feature, available with GPFS V3.5 for Linux, extends GPFS for a new class of data-intensive big data applications.

GPFS provided better performance than HDFS on benchmark Systems :

To provide a basis of comparison, tests were run comparing the use of GPFS with Apache Hadoop Distributed File System (HDFS). Benchmark results show GPFS improves application performance over HDFS by 35% on the analytics benchmark (Terasort benchmark), 35% on write tests and 50% on read tests using the Enhanced TestDFSIO benchmark.

Figure: Write performance (higher numbers show better results)

Figure: Read performance (higher numbers show better results)

NOTE: DFSIO is a distributed I/O benchmark that is part of the Hadoop distribution. For MR2 it can be found in "hadoop-mapreduce-client-jobclient-*-tests.jar".

Example 1 :
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt
Example 2:
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOread.txt
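The result files written by the two examples can be compared with a few lines of Python. The sketch below is only an illustration: it assumes the result file consists of "key: value" lines (such as "Throughput mb/sec"), the helper names are hypothetical, and the sample text and numbers are made up rather than measured results.

```python
from typing import Dict


def parse_dfsio_result(text: str) -> Dict[str, str]:
    """Split the 'key: value' lines of a TestDFSIO result file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate - baseline) * 100.0 / baseline


# Made-up sample in the assumed layout, not a real measurement.
SAMPLE_RESULT = """\
----- TestDFSIO ----- : write
       Number of files: 64
     Throughput mb/sec: 100.0
"""

sample_fields = parse_dfsio_result(SAMPLE_RESULT)
```

Given a throughput of 100 MB/s on one file system and 135 MB/s on the other, improvement_pct(100.0, 135.0) yields 35 percent, which is how a figure such as "35% on write tests" is to be read.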
.............................................................................................................................

The concept of "moving the application workload to the data" means that the job is moved to the data, as opposed to moving the data to the job. For example, if a cluster has 100 servers spread across 3 racks and GPFS-FPO knows that a copy of the data resides on the 50th server, the job can be sent to that server. This reduces network traffic because the data does not need to be moved, thereby improving performance and efficiency.
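The scheduling decision described here can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler used by Hadoop or GPFS-FPO: the function pick_node, the server names and the slot counts are hypothetical. It prefers a node that already holds a copy of the data and only falls back to a remote node when no local node has capacity.

```python
from typing import Dict, Optional, Set


def pick_node(block_locations: Set[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Prefer a node that holds the data; otherwise any node with a free slot."""
    local = sorted(n for n in block_locations if free_slots.get(n, 0) > 0)
    if local:
        return local[0]  # data-local: nothing has to cross the network
    remote = sorted(n for n, slots in free_slots.items() if slots > 0)
    return remote[0] if remote else None  # remote read, or no capacity at all


# Hypothetical cluster of 100 servers with one free task slot each.
cluster_slots = {"server%03d" % i: 1 for i in range(1, 101)}
```

With a copy of the data on server050, pick_node({"server050"}, cluster_slots) selects that server. If server050 has no free slot, the sketch falls back to another server, and the data then has to travel over the network.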

IBM General Parallel File System is a proven, enterprise-class file system for your big data applications. The advantages of using GPFS include important enterprise-class functionality such as access control security; proven scalability and performance; built-in file system monitoring; pre-integrated backup and recovery support; pre-integrated information lifecycle management; and file system quotas to restrict abuse, as well as immutability and AppendOnly features to protect the team from accidentally destroying critical data. Through these features, GPFS helps improve the time-to-value of the big data investment, allowing the enterprise to focus on resolving business problems.

_______________________________________________________________________________________________
References:
1) http://www.mellanox.com/related-docs/whitepapers/WP-Big-Insights-with-Mellanox-Infiniband-RDMA.pdf
2) http://public.dhe.ibm.com/common/ssi/ecm/en/dcw03045usen/DCW03045USEN.PDF
3) http://www.nist.gov/itl/ssd/is/upload/NIST-BD-Platforms-01-Pednault-BigData-NIST.pdf
4) http://www-01.ibm.com/support/knowledgecenter/SSFKCN_3.5./com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_fposettings.htm
5) https://support.pivotal.io/hc/en-us/articles/200864057-Running-DFSIO-mapreduce-benchmark-test
6) https://mrt.presalesadvisor.com/PowerStoragePDF/520aixgpfsdeployingbigdata.pdf
7) http://www.cloudera.com
_______________________________________________________________________________________________
