Thursday, November 2, 2017

Spectrum LSF multi-cluster Models and Configurations

IBM® Spectrum LSF (formerly IBM® Platform™ LSF®) is a complete workload management solution for demanding HPC environments. Featuring intelligent, policy-driven scheduling and easy to use interfaces for job and workflow management, it helps organizations to improve competitiveness by accelerating research and design while controlling costs through superior resource utilization.

Without a scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. When you have a large cluster and multiple users, no user knows which compute nodes and CPU cores to use, or how many resources are available on each node. To solve this, cluster batch control systems manage jobs on the system using HPC schedulers. They are essential for sequentially queuing jobs, assigning priorities, and distributing, parallelizing, suspending, killing or otherwise controlling jobs cluster-wide. Spectrum LSF is a powerful workload management platform and job scheduler for distributed high performance computing.

Computational multi-clusters are an important emerging class of supercomputing architectures. As multi-cluster systems become more prevalent, techniques for efficiently exploiting these resources become increasingly significant. A critical aspect of exploiting these resources is the challenge of scheduling. In order to maximize job throughput, multi-cluster schedulers must simultaneously leverage the collective computational resources of all participating clusters. By doing so, jobs that would otherwise wait for nodes to become available on a single cluster can potentially run earlier by aggregating disjoint resources throughout the multi-cluster. This can substantially reduce queue waiting times.

Organizations might have multiple LSF clusters managed by different business units. In this scenario, it is useful to share resources across the clusters to reap the benefits of global load sharing. Running separate clusters typically addresses:
  • Ease of administration 
  • Different geographic locations 
  • Scalability
There are two Spectrum LSF multi-cluster models:

Job forwarding Model:

In this model, the cluster that is starving for resources sends jobs over to the cluster that has resources to spare. To work together, two clusters must set up compatible send-jobs and receive-jobs queues.
With this model, scheduling of MultiCluster jobs is a process with two scheduling phases: the submission cluster selects a suitable remote receive-jobs queue, and forwards the job to it; then the execution cluster selects a suitable host and dispatches the job to it. This method automatically favors local hosts; a MultiCluster send-jobs queue always attempts to find a suitable local host before considering a receive-jobs queue in another cluster.
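
The two scheduling phases can be pictured with a small toy model in plain JavaScript. This is only a sketch of the idea, not LSF code: the function names (pickHost, scheduleJob), the freeSlots field and the sample data are all hypothetical, and a real submission cluster chooses a receive-jobs queue without inspecting the remote hosts directly.

```javascript
// Toy model of job forwarding; not LSF code. All names are hypothetical.

// Return the name of the first host with enough free slots, or null.
function pickHost(hosts, slotsNeeded) {
  const host = hosts.find((h) => h.freeSlots >= slotsNeeded);
  return host ? host.name : null;
}

function scheduleJob(job, localHosts, remoteQueues) {
  // A send-jobs queue always tries to find a suitable local host first.
  const local = pickHost(localHosts, job.slots);
  if (local !== null) {
    return { cluster: "local", host: local };
  }
  // Phase 1 (submission cluster): choose a receive-jobs queue and forward the job.
  for (const queue of remoteQueues) {
    // Phase 2 (execution cluster): choose a host and dispatch the job to it.
    const remote = pickHost(queue.hosts, job.slots);
    if (remote !== null) {
      return { cluster: queue.cluster, queue: queue.name, host: remote };
    }
  }
  return null; // no suitable host anywhere: the job stays pending
}

const localHosts = [{ name: "host1", freeSlots: 0 }];
const remoteQueues = [
  {
    name: "receive_queue",
    cluster: "cluster3_x86",
    hosts: [{ name: "host11", freeSlots: 4 }],
  },
];

// host1 has no free slots, so the job is forwarded to cluster3_x86.
console.log(scheduleJob({ slots: 2 }, localHosts, remoteQueues));
```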

For configuring your cluster in job forwarding mode, you can also refer to another blog post.

Resource leasing model

In this model, the cluster that is starving for resources takes resources away from the cluster that has resources to spare. To work together, the provider cluster must “export” resources to the consumer, and the consumer cluster must configure a queue to use those resources. In this model, each cluster schedules work on a single system image, which includes both borrowed hosts and local hosts.

Two clusters agree that one cluster will borrow resources from the other, taking control of the resources. Both clusters must change their configuration to make this possible, and the arrangement, called a “lease”, does not expire, although it might change due to changes in the cluster configuration.
With this model, scheduling of jobs is always done by a single cluster. When a queue is configured to run jobs on borrowed hosts, LSF schedules jobs as if the borrowed hosts actually belonged to the cluster.

  1. Setup:
    • A resource provider cluster “exports” hosts, and specifies the clusters that will use the resources on these hosts.
    • A resource consumer cluster configures a queue with a host list that includes the borrowed hosts.
  2. To establish a lease:
    1. Configure two clusters properly (the provider cluster must export the resources, and the consumer cluster must have a queue that requests remote resources).
    2. Start up the clusters.
    3. In the consumer cluster, submit jobs to the queue that requests remote resource.
    4. At this point, a lease is established that gives the consumer cluster control of the remote resources.
    • If the provider did not export the resources requested by the consumer, there is no lease. The provider continues to use its own resources as usual, and the consumer cannot use any resources from the provider.
    • If the consumer did not request the resources exported to it, there is no lease. However, when entire hosts are exported the provider cannot use resources that it has exported, so neither cluster can use the resources; they will be wasted.
  3. Changes to the lease:
    • The lease does not expire. To modify or cancel the lease, you should change the export policy in the provider cluster.
    • If you export a group of workstations allowing LSF to automatically select the hosts for you, these hosts do not change until the lease is modified. However, if the original lease could not include the requested number of hosts, LSF can automatically update the lease to add hosts that become available later on.
    • If the configuration changes and some resources are no longer exported, jobs from the consumer cluster that have already started to run using those resources will be killed and requeued automatically.
    If LSF selects the hosts to export, and the new export policy allows some of the same hosts to be exported again, then LSF tries to re-export the hosts that already have jobs from the consumer cluster running on them (in this case, the jobs continue running without interruption). If LSF has to kill some jobs from the consumer cluster to remove some hosts from the lease, it selects the hosts according to job run time, so it kills the most recently started jobs.
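
The conditions for establishing a lease (step 2 above) boil down to two yes/no questions. The following JavaScript is a toy model of those rules only, not LSF code; the function name leaseOutcome and the result fields are hypothetical, and the "nobody" case assumes that entire hosts were exported.

```javascript
// Toy model of the lease conditions; not LSF code. All names are hypothetical.
function leaseOutcome(providerExports, consumerRequests) {
  if (providerExports && consumerRequests) {
    // Both sides are configured: the consumer controls the exported resources.
    return { lease: true, usableBy: "consumer" };
  }
  if (providerExports) {
    // Entire hosts were exported but never requested: the resources are wasted.
    return { lease: false, usableBy: "nobody" };
  }
  // Nothing was exported: the provider keeps using its own resources.
  return { lease: false, usableBy: "provider" };
}

console.log(leaseOutcome(true, true));
console.log(leaseOutcome(true, false));
console.log(leaseOutcome(false, true));
```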

Selection of Model: 

Consider your own goals and priorities when choosing the best resource-sharing model for your site.

  • The job forwarding model can make resources available to jobs from multiple clusters. This flexibility allows maximum throughput when each cluster’s resource usage fluctuates. The resource leasing model can allow one cluster exclusive control of a dedicated resource, which can be more efficient when there is a steady amount of work.
  • The lease model is the most transparent to users and supports the same scheduling features as a single cluster.
  • The job forwarding model has a single point of administration, while the lease model shares administration between provider and consumer clusters.

In this blog, you can follow both the lease and the job forwarding model configurations for a Spectrum LSF cluster.

[sachin@host1 ~]$ lsid
IBM Spectrum LSF Standard
My cluster name is cluster1_p8
My master name is host1
[sachin@host1 ~]$

lsclusters : displays configuration information about LSF clusters

bhosts : Displays hosts and their static and dynamic resources in cluster


Configuration Files:


Begin Cluster
ClusterName  Servers
cluster1_p8     (host1)
cluster2_p9     (host6)
cluster3_x86    (host11)
End Cluster

Begin HostExport
PER_HOST     =    host1      # export host list
SLOTS        = 20                   # for each host, export 20 job slots
DISTRIBUTION = ([ cluster2_p9 , 1] [cluster3_x86, 1]) # share distribution for remote clusters
MEM          = 100                 # export 100M mem of each host [optional parameter]
SWP          = 100                 # export 100M swp of each host [optional parameter]
End HostExport
In this example, resources are leased to 2 clusters in an even 1:1 ratio, so each cluster gets 1/2 of the resources. NOTE: This configuration is required only for the lease model.
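
The share arithmetic behind DISTRIBUTION can be checked with a few lines of JavaScript. This is a sketch for illustration only: the helper name shareFractions is hypothetical, and splitting the 20 exported slots in proportion to the shares is an assumption made to show the ratio, not a statement about how LSF allocates individual slots.

```javascript
// Convert share assignments into the fraction of the export each cluster gets.
function shareFractions(distribution) {
  const total = Object.values(distribution).reduce((sum, s) => sum + s, 0);
  const fractions = {};
  for (const [cluster, shares] of Object.entries(distribution)) {
    fractions[cluster] = shares / total;
  }
  return fractions;
}

// Shares taken from the DISTRIBUTION line in the HostExport section.
const distribution = { cluster2_p9: 1, cluster3_x86: 1 };
const fractions = shareFractions(distribution);
console.log(fractions); // each cluster gets 0.5

// Assumption: the exported slots are divided in proportion to the shares.
const exportedSlots = 20;
for (const [cluster, fraction] of Object.entries(fractions)) {
  console.log(cluster, Math.floor(exportedSlots * fraction));
}
```

With unequal shares, for example 3 and 1, the same function returns 0.75 and 0.25.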

Begin Queue
QUEUE_NAME     = send_queue
SNDJOBS_TO     = receive_queue@cluster3_x86
HOSTS          = none
PRIORITY       = 30
NICE           = 20
End Queue

Begin Queue
QUEUE_NAME = leaseq
HOSTS = all allremote
End Queue

Begin Queue
QUEUE_NAME   = cluster1_p8
PRIORITY     = 30
HOSTS        = host1 host2 host3 host4 host5        # hosts on which jobs in this queue can run
DESCRIPTION  = For submission of jobs to P8 machines
End Queue

Begin Queue
QUEUE_NAME   = cluster2_p9
PRIORITY     = 30
HOSTS        = host6 host7 host8 host9 host10       # hosts on which jobs in this queue can run
DESCRIPTION  = For submission of jobs to P9 machines
End Queue

Begin Queue
QUEUE_NAME   = cluster3_x86
PRIORITY     = 30
HOSTS        = host11 host12 host13 host14 host15       # hosts on which jobs in this queue can run
DESCRIPTION  = For submission of jobs to x86 machines
End Queue

For the job forwarding model, you need the following configuration on the remote cluster:

Begin Queue
QUEUE_NAME      = receive_queue
RCVJOBS_FROM    = send_queue@cluster1_p8
HOSTS           =   host11 host12 host13 host14 host15
PRIORITY        = 55
NICE            = 10
DESCRIPTION     = Multicluster Queue
End Queue


Check the job forwarding information and resource lease information by issuing the bclusters command:

Submit LSF job - forwarding mechanism

Submit LSF job - resource leasing mechanism

In this article I wanted to illustrate how someone could get started creating their own LSF multi-cluster setup to run applications that need more computational resources.



Sunday, July 30, 2017

Getting Started with MongoDB

The NoSQL database movement came about to address the shortcomings of relational databases and the demands of modern software development. Much new data is unstructured or semi-structured, so developers also need a database that is capable of efficiently storing it. Unfortunately, the rigidly defined, schema-based approach used by relational databases makes it difficult to quickly incorporate new types of data, and is a poor fit for unstructured and semi-structured data. NoSQL provides a data model that maps better to these needs.

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling across a configurable set of systems that function as storage nodes.
  • database holds a set of collections
  • collection holds a set of documents
  • document is a set of fields
  • field is a key-value pair
  • key is a name (string)
  • value is a basic type (string, integer, float, timestamp, binary, etc.), a document, or an array of values
MongoDB Architecture
 MongoDB stores all data in documents, which are JSON-style data structures composed of field-and-value pairs. MongoDB stores documents on disk in the BSON serialization format. BSON is a binary representation of JSON documents, though it contains more data types than JSON. These documents can be simple documents as above and can also be complex documents such as below:

{
    id: x,
    name: y,
    other: z,
    multipleArray: [
        {lab1: "A",  lab2: "B", lab3:"C"},
        {lab1: "AB", lab2: "BB", lab3:"CB"},
        {lab1: "AC", lab2: "BC", lab3:"CC"}
    ]
}

Document Database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
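
Since MongoDB documents resemble JSON objects, the nesting described here can be tried out in plain JavaScript without a database. The sketch below is only an illustration: the field names address, diseases and visits and all their values are hypothetical and do not come from the collections used later in this post.

```javascript
// A document whose fields hold a nested document, an array,
// and an array of documents. All field names and values are hypothetical.
const patient = {
  name: "John",
  address: { city: "chennai" },        // embedded document
  diseases: ["fever", "cold"],         // array of basic values
  visits: [                            // array of documents
    { date: "2017-07-01", doctor: "A" },
    { date: "2017-07-15", doctor: "B" },
  ],
};

// Nested values are reached by walking through the fields.
console.log(patient.address.city);
console.log(patient.visits.map((v) => v.doctor));

// The whole structure survives a JSON round trip unchanged.
const copy = JSON.parse(JSON.stringify(patient));
console.log(copy.visits.length);
```

Keeping the visits inside the patient document is what lets a single read return everything, which is the "reduced need for joins" advantage listed below.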

The advantages of using documents are:
  • Documents (i.e. objects) correspond to native data types in many programming languages.
  • Embedded documents and arrays reduce need for expensive joins.
  • Dynamic schema supports fluent polymorphism.
Most user-accessible data structures in MongoDB are documents, including:
-> All database records.
-> Query selectors, which define what records to select for read, update, and delete operations.
-> Update definitions, which define what fields to modify during an update.
-> Index specifications, which define what fields to index.
-> Data output by MongoDB for reporting and configuration, such as the output of the server-status and the replica set configuration document.

Joins and Other Aggregation Enhancements in MongoDB 3.2 onwards

How to create a database and collections, with basic query examples

spb@spb-VirtualBox:~$ mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
> show dbs
finance  0.000GB
local    0.000GB
mydb     0.000GB
MongoDB does not provide a command to create a database. You do not need to create one manually, because MongoDB creates the database on the fly the first time you save a value into a collection (the equivalent of a table in SQL) in that database.

> use hospital
switched to db hospital
> db.patient.insert({name:"John",age:"29",gender:"M",disease:"fever",city:"chennai"})
WriteResult({ "nInserted" : 1 })
> db.patient.insert({name:"Ramesh",age:"55",gender:"M",disease:"blood pressure",city:"bengaluru"})
WriteResult({ "nInserted" : 1 })

The remaining 11 patient documents are inserted the same way, and each insert returns WriteResult({ "nInserted" : 1 }).

> db.patient.find()
{ "_id" : ObjectId("597d83f6a9d2632baed3c076"), "name" : "John", "age" : "29", "gender" : "M", "disease" : "fever", "city" : "chennai" }
{ "_id" : ObjectId("597d8457a9d2632baed3c077"), "name" : "Ramesh", "age" : "55", "gender" : "M", "disease" : "blood pressure", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8488a9d2632baed3c078"), "name" : "Harish", "age" : "35", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84baa9d2632baed3c079"), "name" : "Namitha", "age" : "25", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84efa9d2632baed3c07a"), "name" : "Asha", "age" : "15", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d851aa9d2632baed3c07b"), "name" : "Ravi", "age" : "23", "gender" : "M", "disease" : "diabetic", "city" : "chennai" }
{ "_id" : ObjectId("597d8544a9d2632baed3c07c"), "name" : "Lokesh", "age" : "37", "gender" : "M", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d855ca9d2632baed3c07d"), "name" : "Sangeetha", "age" : "37", "gender" : "F", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d8571a9d2632baed3c07e"), "name" : "Apoorva", "age" : "27", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d858ba9d2632baed3c07f"), "name" : "Jijo", "age" : "30", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d859da9d2632baed3c080"), "name" : "Mallik", "age" : "38", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85afa9d2632baed3c081"), "name" : "Parashuram", "age" : "32", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85c7a9d2632baed3c082"), "name" : "Rakesh", "age" : "35", "gender" : "M", "disease" : "cold", "city" : "bengaluru" }
 > show dbs
finance   0.000GB
hospital  0.000GB
local     0.000GB
mydb      0.000GB

To query documents on the basis of some condition, you can use the following operations.

1) Query to get records where disease=fever
> db.patient.find({"disease":"fever"})
{ "_id" : ObjectId("597d83f6a9d2632baed3c076"), "name" : "John", "age" : "29", "gender" : "M", "disease" : "fever", "city" : "chennai" }
{ "_id" : ObjectId("597d8488a9d2632baed3c078"), "name" : "Harish", "age" : "35", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84baa9d2632baed3c079"), "name" : "Namitha", "age" : "25", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84efa9d2632baed3c07a"), "name" : "Asha", "age" : "15", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8544a9d2632baed3c07c"), "name" : "Lokesh", "age" : "37", "gender" : "M", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d855ca9d2632baed3c07d"), "name" : "Sangeetha", "age" : "37", "gender" : "F", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d8571a9d2632baed3c07e"), "name" : "Apoorva", "age" : "27", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d858ba9d2632baed3c07f"), "name" : "Jijo", "age" : "30", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d859da9d2632baed3c080"), "name" : "Mallik", "age" : "38", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85afa9d2632baed3c081"), "name" : "Parashuram", "age" : "32", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
---------------------------------------------- ---------------------
2) Display the results in a formatted way with the pretty() method, for records where disease=fever

> db.patient.find({"disease":"fever"}).pretty()
{
    "_id" : ObjectId("597d83f6a9d2632baed3c076"),
    "name" : "John",
    "age" : "29",
    "gender" : "M",
    "disease" : "fever",
    "city" : "chennai"
}
{
    "_id" : ObjectId("597d8488a9d2632baed3c078"),
    "name" : "Harish",
    "age" : "35",
    "gender" : "M",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d84baa9d2632baed3c079"),
    "name" : "Namitha",
    "age" : "25",
    "gender" : "F",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d84efa9d2632baed3c07a"),
    "name" : "Asha",
    "age" : "15",
    "gender" : "F",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d8544a9d2632baed3c07c"),
    "name" : "Lokesh",
    "age" : "37",
    "gender" : "M",
    "disease" : "fever",
    "city" : "mumbai"
}
{
    "_id" : ObjectId("597d855ca9d2632baed3c07d"),
    "name" : "Sangeetha",
    "age" : "37",
    "gender" : "F",
    "disease" : "fever",
    "city" : "mumbai"
}
{
    "_id" : ObjectId("597d8571a9d2632baed3c07e"),
    "name" : "Apoorva",
    "age" : "27",
    "gender" : "F",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d858ba9d2632baed3c07f"),
    "name" : "Jijo",
    "age" : "30",
    "gender" : "M",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d859da9d2632baed3c080"),
    "name" : "Mallik",
    "age" : "38",
    "gender" : "M",
    "disease" : "fever",
    "city" : "bengaluru"
}
{
    "_id" : ObjectId("597d85afa9d2632baed3c081"),
    "name" : "Parashuram",
    "age" : "32",
    "gender" : "M",
    "disease" : "fever",
    "city" : "bengaluru"
}
3) Query to get records where age=25
> db.patient.find({"age":"25"})
{ "_id" : ObjectId("597d84baa9d2632baed3c079"), "name" : "Namitha", "age" : "25", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
4) Query to get records where age is greater than 25

> db.patient.find({"age":{$gt:"25"}})
{ "_id" : ObjectId("597d83f6a9d2632baed3c076"), "name" : "John", "age" : "29", "gender" : "M", "disease" : "fever", "city" : "chennai" }
{ "_id" : ObjectId("597d8457a9d2632baed3c077"), "name" : "Ramesh", "age" : "55", "gender" : "M", "disease" : "blood pressure", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8488a9d2632baed3c078"), "name" : "Harish", "age" : "35", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8544a9d2632baed3c07c"), "name" : "Lokesh", "age" : "37", "gender" : "M", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d855ca9d2632baed3c07d"), "name" : "Sangeetha", "age" : "37", "gender" : "F", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d8571a9d2632baed3c07e"), "name" : "Apoorva", "age" : "27", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d858ba9d2632baed3c07f"), "name" : "Jijo", "age" : "30", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d859da9d2632baed3c080"), "name" : "Mallik", "age" : "38", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85afa9d2632baed3c081"), "name" : "Parashuram", "age" : "32", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85c7a9d2632baed3c082"), "name" : "Rakesh", "age" : "35", "gender" : "M", "disease" : "cold", "city" : "bengaluru" }
5) Query to get records where age is less than 25
> db.patient.find({"age":{$lt:"25"}})
{ "_id" : ObjectId("597d84efa9d2632baed3c07a"), "name" : "Asha", "age" : "15", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d851aa9d2632baed3c07b"), "name" : "Ravi", "age" : "23", "gender" : "M", "disease" : "diabetic", "city" : "chennai" }
6) Query to get records where age is less than or equal to 25
> db.patient.find({"age":{$lte:"25"}})
{ "_id" : ObjectId("597d84baa9d2632baed3c079"), "name" : "Namitha", "age" : "25", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84efa9d2632baed3c07a"), "name" : "Asha", "age" : "15", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d851aa9d2632baed3c07b"), "name" : "Ravi", "age" : "23", "gender" : "M", "disease" : "diabetic", "city" : "chennai" }
7) Query to get records where age is greater than or equal to 25
> db.patient.find({"age":{$gte:"25"}})
{ "_id" : ObjectId("597d83f6a9d2632baed3c076"), "name" : "John", "age" : "29", "gender" : "M", "disease" : "fever", "city" : "chennai" }
{ "_id" : ObjectId("597d8457a9d2632baed3c077"), "name" : "Ramesh", "age" : "55", "gender" : "M", "disease" : "blood pressure", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8488a9d2632baed3c078"), "name" : "Harish", "age" : "35", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84baa9d2632baed3c079"), "name" : "Namitha", "age" : "25", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8544a9d2632baed3c07c"), "name" : "Lokesh", "age" : "37", "gender" : "M", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d855ca9d2632baed3c07d"), "name" : "Sangeetha", "age" : "37", "gender" : "F", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d8571a9d2632baed3c07e"), "name" : "Apoorva", "age" : "27", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d858ba9d2632baed3c07f"), "name" : "Jijo", "age" : "30", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d859da9d2632baed3c080"), "name" : "Mallik", "age" : "38", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85afa9d2632baed3c081"), "name" : "Parashuram", "age" : "32", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85c7a9d2632baed3c082"), "name" : "Rakesh", "age" : "35", "gender" : "M", "disease" : "cold", "city" : "bengaluru" }
8) Query to get records where age is not equal to 25
> db.patient.find({"age":{$ne:"25"}})
{ "_id" : ObjectId("597d83f6a9d2632baed3c076"), "name" : "John", "age" : "29", "gender" : "M", "disease" : "fever", "city" : "chennai" }
{ "_id" : ObjectId("597d8457a9d2632baed3c077"), "name" : "Ramesh", "age" : "55", "gender" : "M", "disease" : "blood pressure", "city" : "bengaluru" }
{ "_id" : ObjectId("597d8488a9d2632baed3c078"), "name" : "Harish", "age" : "35", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d84efa9d2632baed3c07a"), "name" : "Asha", "age" : "15", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }

{ "_id" : ObjectId("597d851aa9d2632baed3c07b"), "name" : "Ravi", "age" : "23", "gender" : "M", "disease" : "diabetic", "city" : "chennai" }
{ "_id" : ObjectId("597d8544a9d2632baed3c07c"), "name" : "Lokesh", "age" : "37", "gender" : "M", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d855ca9d2632baed3c07d"), "name" : "Sangeetha", "age" : "37", "gender" : "F", "disease" : "fever", "city" : "mumbai" }
{ "_id" : ObjectId("597d8571a9d2632baed3c07e"), "name" : "Apoorva", "age" : "27", "gender" : "F", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d858ba9d2632baed3c07f"), "name" : "Jijo", "age" : "30", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d859da9d2632baed3c080"), "name" : "Mallik", "age" : "38", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85afa9d2632baed3c081"), "name" : "Parashuram", "age" : "32", "gender" : "M", "disease" : "fever", "city" : "bengaluru" }
{ "_id" : ObjectId("597d85c7a9d2632baed3c082"), "name" : "Rakesh", "age" : "35", "gender" : "M", "disease" : "cold", "city" : "bengaluru" }
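
One caveat about the comparison queries: the age values in the patient collection are stored as strings ("29", not 29), so operators such as $gt and $lt compare strings rather than numbers. The results look right only because every age here has two digits. The plain JavaScript sketch below (no MongoDB needed) shows how string and numeric comparison diverge; the ages "9" and "100" are hypothetical additions that are not in the collection.

```javascript
// Ages stored as strings, as in the patient collection.
// "9" and "100" are hypothetical extras that expose the problem.
const ages = ["29", "55", "15", "23", "9", "100"];

// String comparison works character by character.
const asStrings = ages.filter((age) => age > "25");

// Numeric comparison after converting each value.
const asNumbers = ages.filter((age) => Number(age) > 25);

console.log(asStrings); // [ '29', '55', '9' ]
console.log(asNumbers); // [ '29', '55', '100' ]
```

Storing age as a number (age: 29) and querying with a number ({$gt: 25}) avoids this.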
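For CRUD (Create, Read, Update, Delete) operations, the MongoDB shell provides the following commands: db.collection.insert(), db.collection.find(), db.collection.update() and db.collection.remove().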


That’s all for a basic introduction to MongoDB.

Thursday, May 11, 2017

Casbah - Scala toolkit for MongoDB

Casbah is a Scala toolkit for MongoDB. It adds a layer on top of the official mongo-java-driver for better integration with Scala.

The recommended way to get started is with a dependency management system. 

 libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"

Casbah is a MongoDB project and will continue to improve the interaction of Scala + MongoDB.

Add import:
import com.mongodb.casbah.Imports._


You could get the source from :

Then you could modify your build.sbt:
mongoConnector/ScalaCasbahConnections$ cat build.sbt
organization := "com.alvinalexander"

name := "ScalatraCasbahMongo"

version := "0.1.0-SNAPSHOT"

scalaVersion := "2.11.8"

libraryDependencies += "org.mongodb" %% "casbah" % "3.1.1"

libraryDependencies += "com.mongodb.casbah" % "casbah-gridfs_2.8.1" % "2.1.5-1"

libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.24"

resolvers += "Sonatype OSS Snapshots" at ""

mongoConnector/ScalaCasbahConnections$ sbt run
[info] Loading project definition from /home/spb/mongoConnector/ScalaCasbahConnections/project
[info] Set current project to ScalatraCasbahMongo (in build file:/home/spb/mongoConnector/ScalaCasbahConnections/)
[info] Compiling 1 Scala source to /home/spb/mongoConnector/ScalaCasbahConnections/target/scala-2.11/classes...
[warn] there was one deprecation warning; re-run with -deprecation for details
[warn] one warning found
[info] Running casbahtests.MainDriver
debug: a
log4j:WARN No appenders could be found for logger (com.mongodb.casbah.commons.conversions.scala.RegisterConversionHelpers$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.
debug: b
debug: c
debug: d
debug: e
debug: f
debug: g
debug: h
debug: i
debug: j
debug: k
debug: l
debug: m
debug: n
debug: o
debug: p
debug: q
debug: r
debug: s
debug: t
debug: u
debug: v
debug: w
debug: x
debug: y
debug: z
sleeping at the end
  sleeping: 1
  sleeping: 2
  sleeping: 3
  sleeping: 4
  sleeping: 5
  sleeping: 6
  sleeping: 7
  sleeping: 8
  sleeping: 9
  sleeping: 10
  sleeping: 11
  sleeping: 12
  sleeping: 13
  sleeping: 14
  sleeping: 15
  sleeping: 16
  sleeping: 17
  sleeping: 18
  sleeping: 19
  sleeping: 20
  sleeping: 21
  sleeping: 22
  sleeping: 23
  sleeping: 24
  sleeping: 25
  sleeping: 26
  sleeping: 27
  sleeping: 28
  sleeping: 29
  sleeping: 30
game over
[success] Total time: 62 s, completed 13 Mar, 2017 5:37:31 PM
spb@spb-VirtualBox:~/mongoConnector/ScalaCasbahConnections$ sbt package
[info] Loading project definition from /home/spb/mongoConnector/ScalaCasbahConnections/project
[info] Set current project to ScalatraCasbahMongo (in build file:/home/spb/mongoConnector/ScalaCasbahConnections/)
[info] Packaging /home/spb/mongoConnector/ScalaCasbahConnections/target/scala-2.11/scalatracasbahmongo_2.11-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 1 s, completed 13 Mar, 2017 5:54:42 PM

spb@spb-VirtualBox:~/Scala_project$ mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
> show dbs
local  0.000GB
mydb   0.000GB
> show dbs
finance  0.000GB
local    0.000GB
mydb     0.000GB
> show collections
> use finance
switched to db finance
> show collections
> db.stocks.find()
{ "_id" : ObjectId("58cd184edffa1f1829bfbc94"), "name" : "a", "symbol" : "a" }
{ "_id" : ObjectId("58cd184fdffa1f1829bfbc95"), "name" : "b", "symbol" : "b" }
{ "_id" : ObjectId("58cd1850dffa1f1829bfbc96"), "name" : "c", "symbol" : "c" }
{ "_id" : ObjectId("58cd1851dffa1f1829bfbc97"), "name" : "d", "symbol" : "d" }
{ "_id" : ObjectId("58cd1852dffa1f1829bfbc98"), "name" : "e", "symbol" : "e" }
{ "_id" : ObjectId("58cd1853dffa1f1829bfbc99"), "name" : "f", "symbol" : "f" }
{ "_id" : ObjectId("58cd1854dffa1f1829bfbc9a"), "name" : "g", "symbol" : "g" }
{ "_id" : ObjectId("58cd1855dffa1f1829bfbc9b"), "name" : "h", "symbol" : "h" }
{ "_id" : ObjectId("58cd1856dffa1f1829bfbc9c"), "name" : "i", "symbol" : "i" }
{ "_id" : ObjectId("58cd1857dffa1f1829bfbc9d"), "name" : "j", "symbol" : "j" }
{ "_id" : ObjectId("58cd1858dffa1f1829bfbc9e"), "name" : "k", "symbol" : "k" }
{ "_id" : ObjectId("58cd1859dffa1f1829bfbc9f"), "name" : "l", "symbol" : "l" }
{ "_id" : ObjectId("58cd185adffa1f1829bfbca0"), "name" : "m", "symbol" : "m" }
{ "_id" : ObjectId("58cd185bdffa1f1829bfbca1"), "name" : "n", "symbol" : "n" }
{ "_id" : ObjectId("58cd185cdffa1f1829bfbca2"), "name" : "o", "symbol" : "o" }
{ "_id" : ObjectId("58cd185ddffa1f1829bfbca3"), "name" : "p", "symbol" : "p" }
{ "_id" : ObjectId("58cd185edffa1f1829bfbca4"), "name" : "q", "symbol" : "q" }
{ "_id" : ObjectId("58cd185fdffa1f1829bfbca5"), "name" : "r", "symbol" : "r" }
{ "_id" : ObjectId("58cd1860dffa1f1829bfbca6"), "name" : "s", "symbol" : "s" }
{ "_id" : ObjectId("58cd1861dffa1f1829bfbca7"), "name" : "t", "symbol" : "t" }
Type "it" for more


There are two ways of getting data from MongoDB into Apache Spark.
Method 1: Using Casbah (layer on the MongoDB Java driver)
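import com.mongodb.casbah.Imports._

// sc is the SparkContext (for example, the one provided by spark-shell)
val uriRemote = MongoClientURI("mongodb://RemoteURL:27017/")
val mongoClientRemote = MongoClient(uriRemote)
val dbRemote = mongoClientRemote("dbName")
val collectionRemote = dbRemote("collectionName")
// The driver reads the whole collection into a local list, then Spark distributes it
val ipMongo = collectionRemote.find()
val ipRDD = sc.makeRDD(ipMongo.toList)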

Method 2: Using the Spark workers
A better version of the code: the Spark workers and multiple cores read the data in parallel, so it loads in less time.

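import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Requires the mongo-hadoop connector on the classpath; sc is the SparkContext
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://RemoteURL:27017/dbName.collectionName")
val keyClassName = classOf[Object]
val valueClassName = classOf[BSONObject]
val inputFormatClassName = classOf[MongoInputFormat]
val ipRDD = sc.newAPIHadoopRDD(config, inputFormatClassName, keyClassName, valueClassName)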


Wednesday, April 19, 2017

Text Mining(TM) with an example of WordCloud on RStudio

It is estimated that a major part of usable business information is unstructured, often in the form of text data. Text mining provides a collection of methods that help us derive actionable insights from these data.

The main package for text mining tasks in R is tm. The structure for managing documents in tm is the Corpus, representing a collection of text documents. A corpus is a large body of natural language text used for accumulating statistics on natural language text; the plural is corpora. A lexicon is a collection of information about the words of a language and the lexical categories to which they belong. A lexicon is usually structured as a collection of lexical entries, for example the same word used as a verb, a noun and an adjective.

Once we have a corpus we typically want to modify the documents in it, e.g., stemming, stopword removal…etc.  In tm, all this functionality is subsumed into the concept of a transformation. Transformations are done via the tm_map() function which applies (maps) a function to all elements of the corpus. Basically, all transformations work on single text documents and tm_map() just applies them to all documents in a corpus.
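
The idea of a transformation that works on one document and is mapped over the whole corpus can be sketched in a few lines of JavaScript. This is an analogy only, not the tm API: mapCorpus and the three transformation functions are hypothetical stand-ins, the stopword list is a tiny made-up one, and this stripWhitespace also trims the ends for simplicity.

```javascript
// Each transformation works on a single document (a string).
const stripWhitespace = (doc) => doc.replace(/\s+/g, " ").trim();
const toLower = (doc) => doc.toLowerCase();
const removeWords = (words) => (doc) =>
  doc.split(" ").filter((w) => !words.includes(w)).join(" ");

// Analogue of tm_map(): apply one transformation to every document.
const mapCorpus = (corpus, transform) => corpus.map(transform);

let corpus = ["The  quick brown   fox", "  A lazy DOG and the fox "];
corpus = mapCorpus(corpus, stripWhitespace);
corpus = mapCorpus(corpus, toLower);
corpus = mapCorpus(corpus, removeWords(["the", "a", "and"])); // hypothetical stopwords

console.log(corpus); // [ 'quick brown fox', 'lazy dog fox' ]
```

The order matters: lowercasing before stopword removal is what lets "The" match the stopword "the".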

Eliminating Extra Whitespace
> sample <- tm_map(sample, stripWhitespace)

Convert to Lower Case
> sample <- tm_map(sample, content_transformer(tolower))

Remove Stopwords
> sample <- tm_map(sample, removeWords, stopwords("english"))

Stemming is done by:
> sample <- tm_map(sample, stemDocument)
Wordcloud _example_1: 

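Step 1 : Install package "tm" : install.packages("tm")

Step 2 : Install package "RColorBrewer" : install.packages("RColorBrewer")

Step 3 : Install package "wordcloud" : install.packages("wordcloud")

Step 4 : Load libraries : library(tm), library(RColorBrewer), library(wordcloud)

Step 5 : Execute the R script :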
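# Read the input text, one element per line
my_data_file = readLines("/home/spb/data/input.txt")

# Build a corpus in which each line is a document
myCorpus = Corpus(VectorSource(my_data_file))

# Clean up the text; base functions such as tolower need content_transformer()
myCorpus = tm_map(myCorpus, content_transformer(tolower))
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

# Term-document matrix that keeps terms of any length
myTDM = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

m = as.matrix(myTDM)

# Total frequency of each term, most frequent first
v = sort(rowSums(m), decreasing = TRUE)

# Plot only the terms that occur at least 50 times
wordcloud(names(v), v, min.freq = 50)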
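
What the last steps of the script compute (term counts across all documents, sorted by frequency, then cut off at a minimum frequency) can be illustrated in plain JavaScript. This is a sketch of the idea rather than of the tm or wordcloud internals; termFrequencies is a hypothetical helper and the two sample documents are made up.

```javascript
// Count how often each term occurs across all documents.
function termFrequencies(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const term of doc.split(/\s+/).filter((t) => t.length > 0)) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  // Most frequent first, like sort(rowSums(m), decreasing = TRUE).
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const docs = ["fever cold fever", "fever cough"]; // hypothetical cleaned documents
const freqs = termFrequencies(docs);
console.log(freqs); // [ [ 'fever', 3 ], [ 'cold', 1 ], [ 'cough', 1 ] ]

// Keep only terms at or above a minimum frequency, like min.freq in wordcloud().
const minFreq = 2;
console.log(freqs.filter(([, n]) => n >= minFreq)); // [ [ 'fever', 3 ] ]
```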
Step 6 : wordcloud visualization :

Wordcloud _example_2:
wordcloud(names(v), v, min.freq = 50, colors=brewer.pal(7, "Dark2"), random.order = TRUE) 

Wordcloud _example_3: 
wordcloud(names(v), v, min.freq = 50, colors=brewer.pal(7, "Dark2"), random.order = FALSE)