Monday, September 8, 2014

Apache Mesos - Open Source Datacenter Computing

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is a open-source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments more efficiently. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel – “cgroups” in Linux, “zones” in Solaris – to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to make a large collection of heterogeneous resources. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources.The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memecached, Ruby on Rails, Storm, JBoss Data Grid,MPI, Spark and Node.js, as well as various Web servers, databases and application servers.

Mesos - Node Abstraction source

 In a similar way that a PC OS manages access to the resources on a desktop computer, Mesos ensures applications have access to the resources they need in a cluster. Instead of setting up numerous server clusters for different parts of an application, Mesos allows you to share a pool of servers that can all run different parts of your application without them interfering with each other and with the ability to dynamically allocate resources across the cluster as needed.  That means , it could easily switch resources away from framework1 [ big-data analysis ] and allocate them to  framework2 [web server ] , if  there is a heavy network. It also reduces a lot of the manual steps in deploying applications and can shift workloads around automatically to provide fault tolerance and keep utilization rates high.

Resource sharing across the cluster increases throughput and utilization

                         Mesos  ==>  " Data Center Kernel "
Mesos - One large pool of resources

Mesos is essentially data center  kernel - which means it’s the software that actually isolates the running workloads from each other . It still needs additional tooling to let engineers get their workloads running on the system and to manage when those jobs actually run. Otherwise, some workloads might consume all the resources, or important workloads might get bumped by less-important workloads that happen to require more resources.Hence  Mesos needs more than just a kernel - Chronos scheduler, a cron replacement for automatically starting and stopping services (and handling failures) that runs on top of Mesos. The other part of the Mesos  is Marathon that  provides API for starting, stopping and scaling services (and Chronos could be one of those services). 



Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves. The master implements fine-grained sharing across frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves. The master decides how many resources to offer to each framework according to an organizational policy, such as fair sharing or priority. To support a diverse set of inter-framework allocation policies, Mesos lets organizations define their own policies via a pluggable allocation module.

Mesos Architecture with two running Frameworks (Hadoop and MPI)

 Each framework running on Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework’s tasks. While the master determines how many resources to offer to each framework, the frameworks’ schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them.

Resource Offer
Figure shows an example of how a framework gets scheduled to run tasks. In step (1), slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered all available resources. In step (2), the master sends a resource offer describing these resources to framework 1. In step (3), the framework’s scheduler replies to the master with information about two tasks to run on the slave, using 2 CPUs; 1 GB RAM for the first task, and 1 CPUs; 2 GB RAM for the second task. Finally, in step (4), the master sends the tasks to the slave, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted borders). Because 1 CPU and 1 GB of RAM are still free, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.

While the thin interface provided by Mesos allows it to scale and allows the frameworks to evolve independently. A framework will reject the offers that do not satisfy its constraints and accept the ones that do. In particular, we have found that a simple policy called delay scheduling, in which frameworks wait for a limited time to acquire nodes storing the input data, yields nearly optimal data locality.

Mesos Scheduler APIs :
API cosists of two primitives: calls and events  which are low-vel,unreliable and one way message passing.

"calls" are basically messages sent to mesos.
  •  Life cycle management (start, failover, stop) - Register, Reregister, Unregister
  •  Resource Allocation -Request, Decline,Revive 
  • TaskManagement-Launch, Kill,Acknowledgemnet,Reconcile 

"events" are messages that framework received.
  • Life cycle management - Registered, Reregistered
  • Resource allocation - Offers, Rescind
  • Task Management -Update 

Scheduler communication with Mesos

  • Scheduler sends a REGISTER call to Mesos Master .
  • Mesos master responds with acknowledgement that you got REGISTERED .
  • Offer will be made to  the scheduler with specific requests (optional).
  • Master allocates some resources for scheduler shows up as a OFFER
  • Scheduler can use offered resources to run tasks, once it decides what tasks it might want to run.
  • Then Master launches task on specified slaves. It might receive OFFER for mor resource allocation. Later , it may get update i.e state of task ( asynchronous in nature).

Task/Executor isolation:
To get more control over task management, executors are used. The executor would decide how it actually wants to run the task. Executor can run one or more tasks (like threads of distributed systems). Advantage here is that you could assign multiple tasks to the executor. Interesting point to note here is isolation (i.e allocation of containers for Executor tree with multiple tasks and for individual task) as shown below. The executor's resources change overtime and dynamically adjust the resources per container. In summary , Mesos gives an elasticity on per node basis or across the cluster for containers to grow or shrink dynamically. Applications like Spark would show great performance for the same reason.

Features of Mesos :
  •     Fault-tolerant replicated master using ZooKeeper
  •     Scalability to 10,000s of nodes
  •     Isolation between tasks with Linux Containers
  •     Multi-resource scheduling (memory and CPU aware)
  •     Java, Python and C++ APIs for developing new parallel applications
  •     Web UI for viewing cluster state

Running at container level increases the performance 

Software projects built on Mesos :
Long Running Services:
1)Aurora is a service scheduler that runs on top of Mesos, enabling you to run long-running   services that take advantage of Mesos' scalability, fault-tolerance, and resource isolation.
2)Marathon is a private PaaS built on Mesos. It automatically handles hardware or software failures and ensures that an app is “always on”. 
3)Singularity is a scheduler (HTTP API and web interface) for running Mesos tasks: long running processes, one-off tasks, and scheduled jobs.
4)SSSP is a simple web application that provides a white-label “Megaupload” for storing and sharing files in S3.
Big Data Processing :
1)Cray Chapel is a productive parallel programming language. The Chapel Mesos scheduler lets you run Chapel programs on Mesos.
2)Dpark is a Python clone of Spark, a MapReduce-like framework written in Python, running on Mesos.
3)Exelixi is a distributed framework for running genetic algorithms at scale.
4)Hadoop : Running Hadoop on Mesos distributes MapReduce jobs efficiently across an entire cluster.
5)Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms.
6)MPI is a message-passing system designed to function on a wide variety of parallel computers.
7)Spark is a fast and general-purpose cluster computing system which makes parallel jobs easy to write.
8)Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Batch Scheduling:
1)Chronos is a distributed job scheduler that supports complex job topologies. It can be used as a more fault-tolerant replacement for Cron.
2)Jenkins is a continuous integration server. The mesos-jenkins plugin allows it to dynamically launch workers on a Mesos cluster depending on the workload.
3)JobServer is a distributed job scheduler and processor which allows developers to build custom batch processing Tasklets using point and click web UI.
4)Torque is a distributed resource manager providing control over batch jobs and distributed compute nodes.
Data Storage:
1)Cassandra is a highly available distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
2)ElasticSearch is a distributed search engine. Mesos makes it easy to run and scale.
3)Hypertable is a high performance, scalable, distributed storage and processing system for structured and unstructured data.

  Trends such as cloud computing and big data are moving organizations away from consolidation and into situations where they might have multiple distributed systems dedicated to specific tasks.  With the help of Docker executor for Mesos, Mesos can run and manage Docker containers in conjunction with Chronos and Marathon frameworks. Docker containers provide a consistent, compact and flexible means of packaging application builds. Delivering applications with Docker on Mesos promises a truly elastic, efficient and consistent platform for delivering a range of applications on premises or in the cloud.
References :