Sunday, January 12, 2014

Big Data Revolution and Vision ........!!!

Big Data is the biggest buzzword around at the moment, and big data will definitely change the world. Big Data refers to data sets that are too large to be processed and analyzed by traditional IT technologies.

The Big Data universe is changing right before our eyes and beginning to explode. Big data has the potential to change the way governments, organizations, and academic institutions conduct business and make discoveries, and it is likely to change how everyone lives their day-to-day lives. In the next five years, we will generate more data as humankind than we generated in the previous 5,000 years! Records and data now exist in electronic digital form, generated by everything from mobile communications to surveillance cameras, emails, web sites and transaction receipts, and they can be combined with daily news, social media feeds and videos.
What is big data?
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data. 

Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. According to IBM, 80% of data captured today is unstructured: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.

In other words, big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend to larger data sets is due to the additional information (VALUE) derivable from analysis of a single large set of related data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

What does Hadoop solve?

  • Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
  • However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis.
  • Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.
In 2004, Google published a paper on a process called MapReduce that used such a distributed architecture. The MapReduce framework provides a parallel processing model and an associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was incredibly successful, so others wanted to replicate the algorithm, and an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. Click here to download: MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.
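To make the Map and Reduce steps concrete, here is a minimal command-line sketch (assuming a working Hadoop 2.x installation under $HADOOP_PREFIX and the bundled 2.2.0 examples jar; the HDFS paths are placeholders) that runs the classic WordCount job, the "hello world" of MapReduce:

# Load a few local text files into HDFS as the job input
hdfs dfs -mkdir -p /user/$(whoami)/wordcount/input
hdfs dfs -put ./*.txt /user/$(whoami)/wordcount/input

# Run the bundled WordCount example: the Map step tokenizes each input split into
# (word, 1) pairs in parallel across the nodes, and the Reduce step gathers the
# pairs by key and sums the counts
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
    wordcount /user/$(whoami)/wordcount/input /user/$(whoami)/wordcount/output

# Inspect the reduced result
hdfs dfs -cat /user/$(whoami)/wordcount/output/part-r-00000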

Big data spans four dimensions, the 4 Vs that characterize big data:

  • Volume – the vast amounts of data generated every second. Examples: terabytes, records, transactions, tables and files.
  • Velocity – the speed at which new data is generated and moves around (credit card fraud detection is a good example, where millions of transactions are checked for unusual patterns in almost real time). Examples: batch, near time, real time and streams.
  • Variety – the increasingly different types of data (from financial data to social media feeds, from photos to sensor data, from video capture to voice recordings). Examples: structured, unstructured, semi-structured, and combinations of all three.
  • Veracity – the messiness of the data (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech).
How the Big Data Explosion Is Changing the World

Big data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information. Big data can be comparing utility costs with meteorological data to spot trends and inefficiencies. It can be comparing ambulance GPS information with hospital records on patient outcomes to determine the correlation between response time and survival, and it can also be the tiny device you wear to track your movement, calories and sleep to monitor your own personal health and fitness.

Our daily lives generate an enormous collection of data. Whether you're surfing the web, shopping at the store, driving your smart car around town, boarding an airplane, visiting a doctor or attending class at university, each day you are generating a variety of data. The benefit of the data depends on where you are and to whom you're talking; a lot of the ultimate potential is in the ability to discover potential connections, and to predict potential outcomes, in a way that wasn't really possible before. With more data than ever available in digital form, progressively inexpensive data storage, and more advanced computers at the ready to help process and analyze it all, companies believe that big data has the power to drive practical insights that just weren't possible before. It's about managing all that data and providing tools that enable everyone to answer questions – questions they might not have even known they had.

IBM CEO Ginni Rometty says big data and predictive decisions will reshape organizations, and computers that learn, like Watson, will be tech's next big wave. It's a vision of the future: a hospital uses rapid gene sequencing to stop an outbreak of antibiotic-resistant bacteria, saving lives; a railroad company gets an alert from a train's sensor that a preventative fix is needed, saving the cost and time of removing the train from the tracks later; a university notices a student's activity level has started to drop to a level consistent with dropouts, and reaches out to assist.

Classic use cases and their implementation in real-world scenarios:
----------------------------------------------------------------------------
1) Retailers can exploit the data to track sales and consumer behavior, in store and online; 

2) Health professionals and epidemiologists trying to predict the spread of disease combine data from  health services, border agencies and a variety of other sources.

3) The London Olympics used big data analysis to establish traffic patterns, policing needs and potential terrorist threats. 

4) The finance sector seeks to exploit one of the most valuable mother lodes of data through powerful tools that can make sense of patterns in news, trading activities and other more esoteric sources.

5) India's Unique Identification project (the Aadhaar project), spearheaded by Nandan Nilekani, will collect and process billions of data records to provide identification for each resident across the country, to be used primarily as the basis for efficient delivery of welfare services. It would also act as a tool for effective monitoring of various programs and schemes of the Government.

6) Developing strategies for cricket teams by analyzing bowling patterns and pitch behavior, detecting match-fixing issues, and so on.

7) Predicting crime - Chicago is designing a predictive software platform to identify crime patterns. Beyond the public safety uses, the platform could also help officials make better decisions for city services like restaurant inspections, snow plowing or garbage collection.

Data scientists are building specialized systems that can read through billions of bits of data, analyze them via self-learning algorithms and package the insights for immediate use.
------------------------------------------

In the next few years, millions of big data-related IT jobs will be created worldwide, yet there is a major shortage of the "analytical and managerial talent necessary to make the most of big data." The United States alone faces a shortage of more than 140,000 workers with big data skills, as well as up to 1.5 million managers and analysts needed to analyze and make decisions based on big data findings.
 ---------------------------------------------------------------------
Click here - Overview of Apache Hadoop 
Click here - Watson - Era of Cognitive Computing 

Big Data: Overview of Apache Hadoop

Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. No one knows the curious story behind Hadoop better than Doug Cutting, now chief architect of Cloudera. When he was creating the open-source software that supports the processing of large data sets, Cutting knew the project would need a good name, and fortunately he had one up his sleeve, thanks to his son. Cutting's son, then 2, was just beginning to talk and called his beloved stuffed yellow elephant "Hadoop" (with the stress on the first syllable). The son (who's now 12) is a bit frustrated with this; he's always saying, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this" :)



The Apache Hadoop framework is composed of the following modules:

1] Hadoop Common - contains the libraries and utilities needed by the other Hadoop modules.

2] Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

3] Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

4] Hadoop MapReduce - a programming model for large-scale data processing.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, and others.
For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces, Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
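As a small illustration of Hadoop Streaming (the streaming jar location below assumes a Hadoop 2.2.0 layout, and the HDFS paths are placeholders), the classic example from the Hadoop documentation uses ordinary Unix tools as the map and reduce programs: any executable that reads stdin and writes stdout will do.

# /bin/cat acts as the mapper (passes records through) and /usr/bin/wc as the reducer
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input  /user/$(whoami)/streaming/input \
    -output /user/$(whoami)/streaming/output \
    -mapper  /bin/cat \
    -reducer /usr/bin/wc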

HDFS & MapReduce:
There are two primary components at the core of Apache Hadoop 1.x: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework. Both open-source projects were inspired by technologies created inside Google.
Hadoop Distributed File System:

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single NameNode plus a cluster of DataNodes that together form the HDFS cluster; not every node in the cluster has to run a DataNode. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to talk to the NameNode and DataNodes.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append.
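A quick way to see replication in action is the command line; a minimal sketch (the paths are placeholders):

# Show how a file's blocks are replicated and where the replicas live
hdfs fsck /user/$(whoami)/wordcount/input -files -blocks -locations

# Change the replication factor of an existing file to 3 and wait for it to complete
hdfs dfs -setrep -w 3 /user/$(whoami)/wordcount/input/sample.txt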
HDFS added high-availability capabilities in the 2.x releases, allowing the main metadata server (the NameNode) to be failed over to a standby in the event of failure, either manually or via automatic fail-over.

The HDFS file system includes a so-called secondary namenode, which misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple name-spaces served by separate namenodes.

An advantage of using HDFS is data awareness between the JobTracker and TaskTrackers. The JobTracker schedules map or reduce jobs to TaskTrackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the JobTracker schedules node B to perform map or reduce tasks on (a,b,c) and node A to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer, which can have a significant impact on job-completion times for data-intensive jobs. When Hadoop is used with other file systems this advantage is not always available. HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations.


Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API (which generates a client in the language of the user's choosing: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk or OCaml), the command-line interface, or the HDFS-UI web app over HTTP.
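For day-to-day use, the command-line interface is usually the simplest of these options; a minimal sketch (paths and file names are placeholders), with the NameNode web UI (default port 50070) available for browsing the same namespace:

hdfs dfs -mkdir -p /user/$(whoami)/demo                      # create a directory in HDFS
hdfs dfs -put localfile.txt /user/$(whoami)/demo             # copy a local file into HDFS
hdfs dfs -ls /user/$(whoami)/demo                            # list the directory
hdfs dfs -cat /user/$(whoami)/demo/localfile.txt             # print the file contents
hdfs dfs -get /user/$(whoami)/demo/localfile.txt ./copy.txt  # copy it back to local disk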


JobTracker and TaskTracker: the MapReduce engine:

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.
The Hadoop 1.x MapReduce system is composed of the JobTracker, which is the master, and the per-node slaves, the TaskTrackers.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
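For reference, the Hadoop 1.x engine can also be inspected from the command line; a minimal sketch (the job ID below is hypothetical):

hadoop job -list                           # jobs currently known to the JobTracker
hadoop job -status job_201401051200_0001   # progress and counters for one job
# The JobTracker web UI (default port 50030) and each TaskTracker UI (default port 50060)
# expose the same status information through Jetty in a browser.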

Known limitations of this approach in Hadoop 1.x are:
 

  • The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"), and every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
  • If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.

 Apache Hadoop NextGen MapReduce (YARN): 
 MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce. 
Architectural view of YARN
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

Overview of Hadoop 1.0 and Hadoop 2.0

A next-generation framework for Hadoop data processing

As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management layer, and many organizations are already building applications on YARN in order to bring them into Hadoop. When enterprise data is made available in HDFS, it is important to have multiple ways to process that data. With Hadoop 2.0 and YARN, organizations can use Hadoop for streaming, interactive and a world of other Hadoop-based applications.

What YARN Does

YARN enhances the power of a Hadoop compute cluster in the following ways:

  • Scalability – the processing power in data centers continues to grow quickly. Because the YARN ResourceManager focuses exclusively on scheduling, it can manage those larger clusters much more easily.
  • Compatibility with MapReduce – existing MapReduce applications and users can run on top of YARN without disruption to their existing processes.
  • Improved cluster utilization – the ResourceManager is a pure scheduler that optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and SLAs. Also, unlike before, there are no named map and reduce slots, which helps to better utilize cluster resources.
  • Support for workloads other than MapReduce – additional programming models such as graph processing and iterative modeling are now possible for data processing. These added models allow enterprises to realize near real-time processing and increased ROI on their Hadoop investments.
  • Agility – with MapReduce becoming a user-land library, it can evolve independently of the underlying resource manager layer and in a much more agile manner.

How YARN Works

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:

  • a global ResourceManager,
  • a per-application ApplicationMaster,
  • a per-node slave, the NodeManager, and
  • a per-application Container running on a NodeManager.
The ResourceManager and the NodeManager form the new, and generic, system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to the various running applications according to constraints such as queue capacities and user limits. The scheduler performs its scheduling function based on the resource requirements of the applications. The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager. Each ApplicationMaster has the responsibility of negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
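To watch this in practice on a Hadoop 2.2 cluster, the yarn command line shows the NodeManagers, the running applications and their logs; a minimal sketch (the application ID below is hypothetical, and yarn logs requires log aggregation to be enabled):

yarn node -list                                            # NodeManagers registered with the ResourceManager
yarn application -list                                     # running applications and their ApplicationMasters
yarn application -status application_1389000000000_0001   # details for one application
yarn logs -applicationId application_1389000000000_0001   # aggregated container logs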
------------------------------------------------------------
References:
1)  http://hadoop.apache.org/

2)  http://hortonworks.com
3)  http://www.cloudera.com

--------------------------------------------------------------------------------
 Click here: Single-node Hadoop Cluster setup and Implementation
 Click here: Multi-node Hadoop Cluster setup and Implementation
 Click here: Watson - Era of cognitive computing !!!
 

Big Data : Watson - Era of cognitive computing !!!

         
                         “Can computers replace Human Beings?”

Innovations in exascale computing could imitate the human brain, but they won't replace it!
It is believed that artificial intelligence will take a long time to function like the human brain, though no one is sure how long we need to wait for this revolution. No doubt, computers have brought a revolution to human life. Nowadays computers take over many human activities, can "think", and have problem-solving capabilities. These factors make us believe that computers are likely to replace human beings in the future.

IBM announced a major new initiative aimed at accelerating progress in the era of cognitive computing. Big Blue is using the human brain as a template for breakthrough designs: imagine a supercomputer that is cooled and powered by electronic blood, has a natural healing system, and handles thousands of I/O activities :)

In the era of real-time Big Data, the "old number-crunching" computers are not sufficient anymore; new computers are required that can interact with us the way we want to interact with each other, and that can visualize data the way we humans interact with the world. Although the computers we have to date are capable of handling vast amounts of data, they still do so with separated memory and processing, performing all the steps in sequential order. IBM is developing a new type of computer, the cognitive computer, which can be trained with artificial intelligence and machine-learning algorithms to become more like humans and deal with data the way humans do. Cognitive computing will bring a level of fluidity and appropriateness to the way we interact with computers. The idea of the new cognitive computer that IBM is developing is to facilitate human cognition beyond the current barriers imposed by the ever-increasing volumes of data.

His name is Watson. He's bad with puns, great at math, and he won the game show "Jeopardy!" against real, live, breathing, thinking humans (Brad Rutter and Ken Jennings, two of Jeopardy's champions). The top prize for the Watson showdown was $1 million, with $300,000 for second place and $200,000 for third. Jennings and Rutter planned to donate half their winnings to charity; Watson won the $1 million, and all of its winnings were donated to charity. The contest was reminiscent of IBM's "Deep Blue," a chess-playing computer that, in 1996 and 1997, was pitted against world champion Garry Kasparov. Kasparov beat the first version of Deep Blue in 1996, but was defeated by a revamped program in 1997, with Deep Blue scoring two wins, three draws and one loss in the six-game match. Deep Blue relied heavily on mathematical calculations, while Watson has to interpret human language, a far more difficult task.

IBM's computer system Watson vanquished human contestants on the TV quiz show Jeopardy!. Its combination of machine-learning strategies and an ability to process natural language, or ordinary speech, allowed it to defeat the human contestants. Watson's software runs on IBM Power 750 servers (a supercomputer with 2,880 POWER7 cores, or computing brains, and 15 terabytes of memory) and, according to its developers, is optimized to process complex questions and render answers quickly. The question now: can it defeat the complexities of the real world? IBM is confident; it will combine Watson with other "cognitive computing" technologies and invest a further $1 billion into a business it says will define the future of how companies use data.



IBM has formed the IBM Watson Group, to be headquartered in New York City's Silicon Alley. The organization is unique within IBM, integrating research, software, systems design, services and industry expertise.
This could revolutionize everything from cancer care to call centers. Among IBM's biggest plans for Watson has been creating a system that can read medical records and recommend treatments, particularly for cancer patients. Watson is still a medical student, or about to complete its internship :). Watson is going to work with doctors, helping oncologists treat patients.

Only 20 percent of the knowledge physicians use to make diagnosis and treatment decisions today is evidence based. The result? One in five diagnoses is incorrect or incomplete, and nearly 1.5 million medication errors are made in the US every year. Given the growing complexity of medical decision making, how can health care providers address these problems?

The information medical professionals need to support improved decision making is available. Medical journals publish new treatments and discoveries every day. Patient histories give clues. Vast amounts of electronic medical record data provide deep wells of knowledge. Some would argue that in this information is the insight needed to avoid every improper diagnosis or erroneous treatment. In fact, the amount of medical information available is doubling every five years, and much of this data is unstructured, often in natural language. And physicians simply don't have time to read every journal that could help them keep up to date with the latest advances: 81 percent report that they spend five hours per month or less reading journals. Computers should be able to help, but the limitations of current systems have prevented real advances. Natural language is complex. It is often implicit: the exact meaning is not completely and exactly stated. In human language, meaning is highly dependent on what has been said before, the topic itself, and how it is being discussed: factually, figuratively or fictionally, or a combination.


What Watson can do, given the right data, is pull up relevant literature and consistently recommend the same course of treatment that is suggested in the written medical guidelines that doctors consult. But following guidelines is also something that less sophisticated software can do; Watson can easily duplicate a guideline recommendation. The bigger goal is a machine that doctors can turn to as an adviser and colleague. That system will be able to make recommendations for treating several cancers based on manually organized inputs (structured data) and will also interpret text notes for two cancers, lung and breast, with reasonable accuracy. This is the right time to move forward with a bigger investment.


How Watson can address healthcare challenges


Watson uses natural language capabilities, hypothesis generation, and evidence-based learning to support medical professionals as they make decisions. For example, a physician can use Watson to assist in diagnosing and treating patients. First the physician might pose a query to the system, describing symptoms and other related factors. Watson begins by parsing the input to identify the key pieces of information. The system supports medical terminology by design, extending Watson's natural language processing capabilities.  


Watson then mines the patient data to find relevant facts about family history, current medications and other existing conditions. It combines this information with current findings from tests and instruments and then examines all available data sources to form hypotheses and test them. Watson can incorporate treatment guidelines, electronic medical record data, doctor's and nurse's notes, research, clinical studies, journal articles, and patient information into the data available for analysis. 
Watson will then provide a list of potential diagnoses along with a score that indicates the level of confidence for each hypothesis.

The ability to take context into account during the hypothesis generation and scoring phases of the processing pipeline allows Watson to address these complex problems, helping the doctor — and patient — make more informed and accurate decisions. 


Preparing Watson for Moon Shots:

The University of Texas MD Anderson Cancer Center in Houston ranks as one of the world's most respected centers focused on cancer patient care, research, education and prevention. MD Anderson's Moon Shots Program is an unprecedented and highly concentrated assault against cancer.

IBM’s Watson technology is expected to play a key role within APOLLO, a technology driven “adaptive learning environment” that MD Anderson is developing as part of its Moon Shots program. APOLLO enables iterative and continued learning between clinical care and research by creating an environment that streamlines and standardizes the longitudinal collection, ingestion and integration of patient’s medical and clinical history, laboratory data as well as research data into MD Anderson’s centralized patient data warehouse. Once aggregated, this complex data is linked and made available for deep analyses by advanced analytics to extract novel insights that can lead to improved effectiveness of care and better patient outcomes. 

One of the richest sources of valuable clinical insight trapped within this patient data is the unstructured medical and research notes, and test results, for each cancer patient. Watson's cognitive capability has been shown to be a powerful tool for extracting valuable insight from such complex data, and MD Anderson's Oncology Expert Advisor capability can generate a more comprehensive profile of each cancer patient. This will help physicians better understand the patient's data when evaluating a patient's condition.

By identifying and weighing data-driven connections between the attributes in a patient’s profile and the knowledge corpus of published medical literature, guidelines in Watson, MD Anderson’s Oncology Expert Advisor can provide evidence-based treatment and management options that are personalized to that patient, to aid the physician’s treatment and care decisions. These options can include not only standard approved therapies, but also appropriate investigational protocols. 

”One unique aspect of the MD Anderson Oncology Expert Advisor is that it will not solely rely on established cancer care pathways to recommend appropriate treatment options,” explained Lynda Chin, M.D., professor and chair of Genomic Medicine and scientific director of the Institute for Applied Cancer Science at MD Anderson. “The system was built with the understanding that what we know today will not be enough for many patients. Therefore, our cancer patients will be automatically matched to appropriate clinical trials by the Oncology Expert Advisor. Based on evidence as well as experiences, our physicians can offer our patients a better chance to battle their cancers by participating in clinical trials on novel therapies.”

The MD Anderson Oncology Expert Advisor is expected to help physicians improve the future care of cancer patients by enabling comparison of patients based on a new range of data-driven attributes, previously unavailable for analysis. For example, MD Anderson’s clinical care and research teams can compare groups of patients to identify those patients who responded differently to therapies and discover attributes that may account for their differences. This analysis will then inform the generation of testable hypotheses to help researchers and clinicians to advance cancer care continually. Click here for more Information


Finally, why did they name it Watson?
It's the name of the founder of IBM, Thomas J. Watson

The possibilities that Watson's breakthrough computing capabilities hold for building a smarter planet and helping people in their business tasks and personal lives have stunned everyone!


Click here for Hadoop 2.x single node Installation 
Click here for Hadoop 2.x Multi-node Cluster Installation
Click here : Overview of  apache Hadoop .
Click here:  Big Data Revolution and Vision ........!!!

Sunday, January 5, 2014

Big Data: Hadoop 2.x/YARN Multi-Node Cluster Installation

Apache Hadoop 2/YARN/MR2 Multi-node Cluster Installation for Beginners:
In this blog, I will describe the steps for setting up a distributed, multi-node Hadoop cluster running on Red Hat Linux/CentOS Linux distributions. We are now comfortable with the installation and execution of MapReduce applications on a single node in pseudo-distributed mode [click here for the details on the single-node installation]. Let us move one step forward and deploy a multi-node cluster.

What's Big Data?
What's Hadoop?

Hadoop Cluster:
A Hadoop cluster is designed for distributed processing of large data sets across a group of commodity machines (low-cost servers). The data can be unstructured, semi-structured or structured. The cluster is designed to scale up to thousands of machines with a high degree of fault tolerance, and the software has the intelligence to detect and handle failures at the application layer.

There are three types of machines, based on their specific roles, in a Hadoop cluster environment:

1] Client machines:
   - Load the data (input files) into the cluster
   - Submit the jobs (in our case, a MapReduce job)
   - Collect the results and view the analytics (a minimal sketch of this workflow is shown below)
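A minimal sketch of this client workflow from the command line (my-analysis.jar, its main class and the HDFS paths are hypothetical placeholders):

# 1] Load the input data into the cluster (HDFS)
hdfs dfs -mkdir -p /user/$(whoami)/job1/input
hdfs dfs -put ./input-data/*.log /user/$(whoami)/job1/input

# 2] Submit the MapReduce job
hadoop jar my-analysis.jar com.example.LogAnalysis \
    /user/$(whoami)/job1/input /user/$(whoami)/job1/output

# 3] Collect the result and view the analytics
hdfs dfs -get /user/$(whoami)/job1/output ./job1-results
hdfs dfs -cat /user/$(whoami)/job1/output/part-r-00000 | head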


2] Master nodes:

  - The NameNode coordinates the data storage function (HDFS), keeping the metadata information.

  - The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application.

3] Slave nodes:
The major part of the cluster consists of slave nodes, which perform the computation.
The NodeManager manages each node within a YARN cluster. It provides per-node services, from managing a container over its life cycle to monitoring resources and tracking the health of its node.

A Container represents an allocated resource in the cluster. The ResourceManager is the sole authority that allocates containers to applications. An allocated container is always on a single node and has a unique containerID, with a specific amount of resource allocated. Typically, an ApplicationMaster receives containers from the ResourceManager during resource negotiation and then talks to the NodeManager to start or stop the containers. Resource models a set of computer resources; currently it models only memory (other resources, such as CPU, may be added in the future).

YARN Architecture [more details available @ source]

                                          Block Diagram Representation

                                             Terminology and Architecture

MRv2 Architecture [click for source]

Prerequisites:

My cluster setup has a single master node and one slave node.

1] Let's have two machines (or two VMs) with sufficient resources to run a MapReduce application.

2] Both machines were installed with Hadoop 2.x as described in the link here.
    Keep all configurations and paths the same across all nodes in the cluster.

3] Bring down all daemons running on those machines.

    HADOOP_PREFIX/sbin/stop-all.sh

NOTE:  There are  2 nodes: "spb-master"  as a master  &   "spb-slave"  as a slave
                               spb-master's IP address is   192.X.Y.Z
                               spb-slave's  IP address  is   192.A.B.C

--------------------------------------------------------------------------------------------------------

Step 1: The first thing is to establish a network between the master node and the slave node.
Assign an IP address to the eth0 interface of node 1 and node 2, and add those IP addresses and hostnames to the /etc/hosts file on both machines, as shown below.

NODE1 : spb-master

NODE2 : spb-slave
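A minimal sketch of the /etc/hosts entries to add on BOTH machines (the addresses below are the placeholders used in this post; substitute the real IPs of your nodes):

cat >> /etc/hosts <<'EOF'
192.X.Y.Z    spb-master
192.A.B.C    spb-slave
EOF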
 ___________________________________________________________________

Step 2 : Establish password-less SSH session between master and slave nodes.
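A minimal sketch of the key setup (run as the user that starts the Hadoop daemons; these posts use root), assuming ssh-keygen and ssh-copy-id are available on both nodes:

# On the master:
ssh-keygen -t rsa -P ""                              # generate a key pair with an empty passphrase
ssh-copy-id -i ~/.ssh/id_rsa.pub root@spb-slave      # authorize the key on the slave
ssh-copy-id -i ~/.ssh/id_rsa.pub root@spb-master     # authorize it locally too (the master also acts as a slave)

# Repeat the same commands on the slave (copying towards spb-master), then test:
ssh spb-slave hostname      # from the master: should print spb-slave without a password prompt
ssh spb-master hostname     # from the slave: should print spb-master without a password prompt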

Verify the password-less session from the master to the slave node, and from the slave to the master node, as tested above; neither direction should prompt for a password.

Now both machines are ready and communicating without prompting for a password.
______________________________________________________________________________


Step 3: Additional configuration required at the master node:

The HADOOP_PREFIX/etc/hadoop/slaves file should contain the list of all slave nodes.

[root@spb-master hadoop]# cat slaves
spb-master
spb-slave
[root@spb-master hadoop]#

NOTE: Here the setup is configured to run a DataNode on the master as well (dual role).

If you have many slave nodes, you can list them like this:

cat HADOOP_PREFIX/etc/hadoop/slaves
spb-slave1
spb-slave2
spb-slave3
spb-slave4
spb-slave5
spb-slave6
spb-slave7
spb-slave8
___________________________________________________________

Step 4: The other configuration files remain the same, as copied below, across all the nodes in the cluster.
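Once these files are edited on the master, one convenient way to keep every node identical is to push the master's configuration directory to each slave; a minimal sketch (assuming the same hadoop-2.2.0 install path used throughout these posts):

rsync -av $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/ root@spb-slave:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/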

[root@spb-master hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://spb-master:9000</value>
        </property>
</configuration>

-------------------------------------------------

[root@spb-master hadoop]# cat hadoop-env.sh
# Copyright 2011 The Apache Software Foundation
export JAVA_HOME=$BIN/java/default
export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_LOG_DIR=$PACKAGE_HOME/hadoop-2.2.0/logs
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=500
export HADOOP_NAMENODE_INIT_HEAPSIZE="500"
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="200"

----------------------------------------------------------------
[root@spb-master hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:$DATA_DIR/data/hadoop/hdfs/nn</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:$DATA_DIR/data/hadoop/hdfs/dn</value>
 </property>
 <property>
   <name>dfs.permissions</name>
   <value>false</value>
 </property>
</configuration>
[root@spb-master hadoop]#
----------------------------------------------------------------------------------
[root@spb-master hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
[root@spb-master hadoop]#
----------------------------------------------------------------------------------------
[root@spb-master hadoop]# cat yarn-env.sh

export JAVA_HOME=$BIN/java/default
export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx500m

# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately
 YARN_HEAPSIZE=500
------------------------------------------------------------------------------------------------

[root@spb-master hadoop]# cat yarn-site.xml
<?xml version="1.0"?>

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>spb-master:8025</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>spb-master:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>spb-master:8040</value>
        </property>
</configuration>