Sunday, March 5, 2017

Sample Spark Application - WordCount with the Simple Build Tool (SBT)

Developing and Running a Spark WordCount Application written in Scala:

Apache Spark runs standalone, on Hadoop YARN, on Apache Mesos, on EC2, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, Hive, Tachyon, S3, and any other Hadoop data source.

The WordCount example reads text files and counts how often words occur. The input is a set of text files and the output is another set of text files, each line of which contains a word and the number of times it occurred.

Simple Build Tool (SBT) is an open source build tool for Scala and Java projects, similar to Java's Maven or Ant. Its main features are native support for compiling Scala code and integration with many Scala frameworks; sbt is the de facto build tool in the Scala community.

This tutorial describes how to write, compile, and run a simple Spark word count application in Scala, one of the languages supported by Spark.

Prerequisites: Download and install the following packages:
1) sbt from: http://www.scala-sbt.org/    or  git clone https://github.com/sbt/sbt.git
2) scala from https://www.scala-lang.org/download/all.html
3) Apache spark from  http://spark.apache.org/downloads.html
 ---------------------------------------------------------------------------------
Step 1: Create the project folder:
               mkdir Scala_wc_project
Step 2: cd Scala_wc_project

Step 3: mkdir -p project src/{main,test}/{scala,java,resources}

spb@spb-VirtualBox:~/Scala_wc_project$ ls
project  src
Step 4: Create the wordcount.scala program in src/main/scala

spb@spb-VirtualBox:~/Scala_wc_project/src/main/scala$ vi wordcount.scala
-----------------------------------
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
    def main(args: Array[String]) {
      val inputFile = args(0)
      val outputFile = args(1)
      val conf = new SparkConf().setAppName("wordCount")
      // Create a Scala Spark Context.
      val sc = new SparkContext(conf)
      // Load our input data.
      val input = sc.textFile(inputFile)
      // Split up into words.
      val words = input.flatMap(line => line.split(" "))
      // Transform into word and count.
      val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
      // Save the word count back out to a text file, causing evaluation.
      counts.saveAsTextFile(outputFile)
    }
}
-----------------------------------------------------------------------------------
Step 5: Create build.sbt at the root of the project directory:
spb@spb-VirtualBox:~/Scala_wc_project$ cat build.sbt
name := "SPB_Word Count"

version := "1.0"

scalaVersion := "2.11.8"

sbtVersion := "0.13.13"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.1.0"
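
Note that the Scala suffix in "spark-core_2.10" does not match scalaVersion 2.11.8; the build happens to work for this small example, but normally you would keep the two in sync, for example by letting sbt pick the suffix automatically (the "provided" scope is optional and just keeps Spark out of the packaged JAR, since spark-submit supplies it at run time):

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"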
-------------------------------------------------------------------------------------
Step 6: Run the sbt commands clean, update, compile, and package.
This creates the JAR file (the package) in the target directory.

spb@spb-VirtualBox:~/Scala_wc_project$ sbt clean
[info] Set current project to SPB_Word Count (in build file:/home/spb/Scala_wc_project/)
[success] Total time: 1 s, completed 5 Mar, 2017 7:29:04 PM
spb@spb-VirtualBox:~/Scala_wc_project$ sbt update
[info] Set current project to SPB_Word Count (in build file:/home/spb/Scala_wc_project/)
[info] Updating {file:/home/spb/Scala_wc_project/}scala_wc_project...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[success] Total time: 18 s, completed 5 Mar, 2017 7:29:33 PM
spb@spb-VirtualBox:~/Scala_wc_project$ sbt compile
[info] Set current project to SPB_Word Count (in build file:/home/spb/Scala_wc_project/)
[info] Compiling 1 Scala source to /home/spb/Scala_wc_project/target/scala-2.11/classes...
[success] Total time: 10 s, completed 5 Mar, 2017 7:29:54 PM
spb@spb-VirtualBox:~/Scala_wc_project$
spb@spb-VirtualBox:~/Scala_wc_project$ sbt package
[info] Set current project to SPB_Word Count (in build file:/home/spb/Scala_wc_project/)
[info] Packaging /home/spb/Scala_wc_project/target/scala-2.11/spb_word-count_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 2 s, completed 5 Mar, 2017 7:30:47 PM
spb@spb-VirtualBox:~/Scala_wc_project$

spb@spb-VirtualBox:~/Scala_wc_project/target/scala-2.11$ ls
classes  spb_word-count_2.11-1.0.jar
spb@spb-VirtualBox:~/Scala_wc_project/target/scala-2.11$
-------------------
Step 7: Project directory structure :
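
Roughly, after the build it looks like this (a sketch; the test and resource folders from Step 3 are omitted):

Scala_wc_project/
    build.sbt
    project/
    src/main/scala/wordcount.scala
    target/scala-2.11/classes/
    target/scala-2.11/spb_word-count_2.11-1.0.jar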

Step 8: Run spb_word-count_2.11-1.0.jar on the Spark framework with spark-submit, as shown here:

 spark-submit --class WordCount --master local spb_word-count_2.11-1.0.jar /home/spb/data/input.txt /home/spb/data/output8

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/05 19:36:32 INFO SparkContext: Running Spark version 2.1.0
17/03/05 19:36:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/05 19:36:34 WARN Utils: Your hostname, spb-VirtualBox resolves to a loopback address: 127.0.1.1; using 192.168.1.123 instead (on interface enp0s3)
17/03/05 19:36:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/03/05 19:36:34 INFO SecurityManager: Changing view acls to: spb
17/03/05 19:36:34 INFO SecurityManager: Changing modify acls to: spb
17/03/05 19:36:34 INFO SecurityManager: Changing view acls groups to:
17/03/05 19:36:34 INFO SecurityManager: Changing modify acls groups to:
17/03/05 19:36:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spb); groups with view permissions: Set(); users  with modify permissions: Set(spb); groups with modify permissions: Set()
17/03/05 19:36:34 INFO Utils: Successfully started service 'sparkDriver' on port 39733.
17/03/05 19:36:34 INFO SparkEnv: Registering MapOutputTracker
17/03/05 19:36:34 INFO SparkEnv: Registering BlockManagerMaster
17/03/05 19:36:34 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/03/05 19:36:34 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/03/05 19:36:35 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8b32665a-9d41-4c8d-bee4-42b6d5891c15
17/03/05 19:36:35 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/03/05 19:36:35 INFO SparkEnv: Registering OutputCommitCoordinator
17/03/05 19:36:35 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/03/05 19:36:35 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.123:4040
17/03/05 19:36:36 INFO SparkContext: Added JAR file:/home/spb/Scala_wc_project/target/scala-2.11/spb_word-count_2.11-1.0.jar at spark://192.168.1.123:39733/jars/spb_word-count_2.11-1.0.jar with timestamp 1488722796080
17/03/05 19:36:36 INFO Executor: Starting executor ID driver on host localhost
17/03/05 19:36:36 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37957.
17/03/05 19:36:36 INFO NettyBlockTransferService: Server created on 192.168.1.123:37957
17/03/05 19:36:36 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/03/05 19:36:36 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.123, 37957, None)
17/03/05 19:36:36 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.123:37957 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.123, 37957, None)
17/03/05 19:36:36 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.123, 37957, None)
17/03/05 19:36:36 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.123, 37957, None)
17/03/05 19:36:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 236.5 KB, free 366.1 MB)
17/03/05 19:36:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 366.0 MB)
17/03/05 19:36:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.123:37957 (size: 22.9 KB, free: 366.3 MB)
17/03/05 19:36:38 INFO SparkContext: Created broadcast 0 from textFile at wordcount.scala:12
17/03/05 19:36:38 INFO FileInputFormat: Total input paths to process : 1
17/03/05 19:36:38 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/05 19:36:38 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/05 19:36:38 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/05 19:36:38 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/05 19:36:38 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/03/05 19:36:38 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/05 19:36:38 INFO SparkContext: Starting job: saveAsTextFile at wordcount.scala:18
17/03/05 19:36:39 INFO DAGScheduler: Registering RDD 3 (map at wordcount.scala:16)
17/03/05 19:36:39 INFO DAGScheduler: Got job 0 (saveAsTextFile at wordcount.scala:18) with 2 output partitions
17/03/05 19:36:39 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at wordcount.scala:18)
17/03/05 19:36:39 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/03/05 19:36:39 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/03/05 19:36:39 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at wordcount.scala:16), which has no missing parents
17/03/05 19:36:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.7 KB, free 366.0 MB)
17/03/05 19:36:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 366.0 MB)
17/03/05 19:36:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.123:37957 (size: 2.7 KB, free: 366.3 MB)
17/03/05 19:36:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/03/05 19:36:39 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at wordcount.scala:16)
17/03/05 19:36:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
17/03/05 19:36:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6043 bytes)
17/03/05 19:36:39 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6043 bytes)
17/03/05 19:36:39 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/05 19:36:39 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/03/05 19:36:39 INFO Executor: Fetching spark://192.168.1.123:39733/jars/spb_word-count_2.11-1.0.jar with timestamp 1488722796080
17/03/05 19:36:40 INFO TransportClientFactory: Successfully created connection to /192.168.1.123:39733 after 110 ms (0 ms spent in bootstraps)
17/03/05 19:36:40 INFO Utils: Fetching spark://192.168.1.123:39733/jars/spb_word-count_2.11-1.0.jar to /tmp/spark-58cba0fb-b20f-46f4-bb17-3201f9dec45b/userFiles-69c444db-7e90-48cb-805b-12225146686f/fetchFileTemp6212103198181174966.tmp
17/03/05 19:36:40 INFO Executor: Adding file:/tmp/spark-58cba0fb-b20f-46f4-bb17-3201f9dec45b/userFiles-69c444db-7e90-48cb-805b-12225146686f/spb_word-count_2.11-1.0.jar to class loader
17/03/05 19:36:40 INFO HadoopRDD: Input split: file:/home/spb/data/input.txt:0+39
17/03/05 19:36:40 INFO HadoopRDD: Input split: file:/home/spb/data/input.txt:39+40
17/03/05 19:36:41 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1819 bytes result sent to driver
17/03/05 19:36:41 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1819 bytes result sent to driver
17/03/05 19:36:41 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1221 ms on localhost (executor driver) (1/2)
17/03/05 19:36:41 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1201 ms on localhost (executor driver) (2/2)
17/03/05 19:36:41 INFO DAGScheduler: ShuffleMapStage 0 (map at wordcount.scala:16) finished in 1.355 s
17/03/05 19:36:41 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/05 19:36:41 INFO DAGScheduler: looking for newly runnable stages
17/03/05 19:36:41 INFO DAGScheduler: running: Set()
17/03/05 19:36:41 INFO DAGScheduler: waiting: Set(ResultStage 1)
17/03/05 19:36:41 INFO DAGScheduler: failed: Set()
17/03/05 19:36:41 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at wordcount.scala:18), which has no missing parents
17/03/05 19:36:41 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 72.8 KB, free 366.0 MB)
17/03/05 19:36:41 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.4 KB, free 365.9 MB)
17/03/05 19:36:41 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.123:37957 (size: 26.4 KB, free: 366.2 MB)
17/03/05 19:36:41 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
17/03/05 19:36:41 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at wordcount.scala:18)
17/03/05 19:36:41 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
17/03/05 19:36:41 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 5827 bytes)
17/03/05 19:36:41 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 5827 bytes)
17/03/05 19:36:41 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
17/03/05 19:36:41 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
17/03/05 19:36:41 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/03/05 19:36:41 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
17/03/05 19:36:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 22 ms
17/03/05 19:36:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 38 ms
17/03/05 19:36:41 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/05 19:36:41 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/03/05 19:36:41 INFO FileOutputCommitter: Saved output of task 'attempt_20170305193638_0001_m_000000_2' to file:/home/spb/data/output8/_temporary/0/task_20170305193638_0001_m_000000
17/03/05 19:36:41 INFO SparkHadoopMapRedUtil: attempt_20170305193638_0001_m_000000_2: Committed
17/03/05 19:36:41 INFO FileOutputCommitter: Saved output of task 'attempt_20170305193638_0001_m_000001_3' to file:/home/spb/data/output8/_temporary/0/task_20170305193638_0001_m_000001
17/03/05 19:36:41 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1977 bytes result sent to driver
17/03/05 19:36:41 INFO SparkHadoopMapRedUtil: attempt_20170305193638_0001_m_000001_3: Committed
17/03/05 19:36:41 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1890 bytes result sent to driver
17/03/05 19:36:41 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 519 ms on localhost (executor driver) (1/2)
17/03/05 19:36:41 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 519 ms on localhost (executor driver) (2/2)
17/03/05 19:36:41 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at wordcount.scala:18) finished in 0.532 s
17/03/05 19:36:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/03/05 19:36:41 INFO DAGScheduler: Job 0 finished: saveAsTextFile at wordcount.scala:18, took 2.968627 s
17/03/05 19:36:42 INFO SparkContext: Invoking stop() from shutdown hook
17/03/05 19:36:42 INFO SparkUI: Stopped Spark web UI at http://192.168.1.123:4040
17/03/05 19:36:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/05 19:36:42 INFO MemoryStore: MemoryStore cleared
17/03/05 19:36:42 INFO BlockManager: BlockManager stopped
17/03/05 19:36:42 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/05 19:36:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/05 19:36:42 INFO SparkContext: Successfully stopped SparkContext
17/03/05 19:36:42 INFO ShutdownHookManager: Shutdown hook called
17/03/05 19:36:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-58cba0fb-b20f-46f4-bb17-3201f9dec45b
spb@spb-VirtualBox:~/Scala_wc_project/target/scala-2.11$
NOTE:
             --class: The entry point of your word count application (the WordCount object)
             --master: The master URL for the cluster, or local to run locally (e.g. spark://23.195.26.187:7077)
---------------------------------------------
Step 9: Verify the word count output directory created in the previous step.
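
For example (a sketch; the actual words and counts depend on your input.txt), the output directory contains one part file per partition plus a _SUCCESS marker, and each line holds a (word, count) pair:

spb@spb-VirtualBox:~/data$ ls output8
_SUCCESS  part-00000  part-00001
spb@spb-VirtualBox:~/data$ cat output8/part-*
(spark,2)
(hello,1)
(wordcount,1)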

These examples give a quick overview of the Spark API. Spark is built on the concept of distributed datasets, which can contain arbitrary Scala, Java, or Python objects. You create a dataset from external data, then apply parallel operations to it.

                                                                                  -----------------THE END -------------------

How to build/package Scala Projects using SBT and run as Apache Spark applications

Simple Build Tool (SBT) is an open source build tool for Scala and Java projects, similar to Java's Maven or Ant. Its main features are native support for compiling Scala code and integration with many Scala frameworks; sbt is the de facto build tool in the Scala community.

Getting started with a simple SBT project :

Prerequisites: Download and install the following packages:
1) sbt from: http://www.scala-sbt.org/    or  git clone https://github.com/sbt/sbt.git
2) scala from https://www.scala-lang.org/download/all.html
3) Apache spark from  http://spark.apache.org/downloads.html

A typical sbt project has the following directory structure.
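
Something like the following (a sketch; project/ and target/ are populated as you work through the steps below):

Scala_project/
    build.sbt
    project/build.properties
    src/main/scala/
    src/main/java/
    src/main/resources/
    src/test/scala/
    src/test/java/
    src/test/resources/
    target/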

Step 1: Create the project folder:
             mkdir Scala_project
Step 2: cd Scala_project

Step 3: mkdir -p project src/{main,test}/{scala,java,resources}

Step 4: Verify the directory structure (for example, with find . -type d)


Step 5: Create the build.sbt file at the root of the project, i.e. Scala_project/
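The exact contents are not shown here; based on the sbt output later in this post (project name "SPB_Hello world", Scala 2.11), it would look something like:

name := "SPB_Hello world"

version := "1.0"

scalaVersion := "2.11.8"
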
Step 6:  Create project/build.properties file
             echo "sbt.version=0.13.13" >project/build.properties

Step 7: Create a Scala program, hello.scala, in the src/main/scala directory
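The source is not reproduced here; judging from the "sbt run" output below, a minimal version would be:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // Print the greeting that appears in the sbt run and spark-submit output below.
    println("Hello world and welcome to scala")
  }
}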


Step 8: You can also compile this file directly with scalac, without sbt:
             scalac hello.scala

Step 9: Go to the project's root directory, i.e. /home/spb/Scala_project, and run the following commands:

1) sbt clean - deletes all previously generated files in the target directory.
-------------------------------------------------------------------------------------------
spb@spb-VirtualBox:~/Scala_project$ pwd
/home/spb/Scala_project
spb@spb-VirtualBox:~/Scala_project$ ls
build.sbt  project  src  target
spb@spb-VirtualBox:~/Scala_project$ sbt clean
[info] Loading project definition from /home/spb/Scala_project/project
[info] Set current project to SPB_Hello world (in build file:/home/spb/Scala_project/)
[success] Total time: 0 s, completed 5 Mar, 2017 1:26:47 PM
spb@spb-VirtualBox:~/Scala_project$
----------------------------------------------------------------------------------------------
2) sbt update - updates external dependencies.

 spb@spb-VirtualBox:~/Scala_project$ sbt update
[info] Loading project definition from /home/spb/Scala_project/project
[info] Set current project to SPB_Hello world (in build file:/home/spb/Scala_project/)
[info] Updating {file:/home/spb/Scala_project/}scala_project...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[success] Total time: 1 s, completed 5 Mar, 2017 1:26:59 PM
spb@spb-VirtualBox:~/Scala_project$

 -----------------------------------------------------------------------------------------------
3) sbt compile - compiles the source files in src/main/scala, src/main/java, and the root directory of the project.

----------------------------------------------------------------------------------------
spb@spb-VirtualBox:~/Scala_project$ sbt compile
[info] Loading project definition from /home/spb/Scala_project/project
[info] Set current project to SPB_Hello world (in build file:/home/spb/Scala_project/)
[info] Compiling 1 Scala source to /home/spb/Scala_project/target/scala-2.11/classes...
[success] Total time: 5 s, completed 5 Mar, 2017 1:27:26 PM
spb@spb-VirtualBox:~/Scala_project$ 
--------------------------------------------------------------------------------------

4) sbt run - compiles your code and runs the main class from your project in the same JVM as sbt. If your project has multiple main methods (or objects that extend App), you'll be prompted to select one to run.
----------------------------------------------------------------------------------------
spb@spb-VirtualBox:~/Scala_project$ sbt run
[info] Loading project definition from /home/spb/Scala_project/project
[info] Set current project to SPB_Hello world (in build file:/home/spb/Scala_project/)
[info] Running HelloWorld
Hello world and welcome to scala
[success] Total time: 1 s, completed 5 Mar, 2017 1:27:55 PM
spb@spb-VirtualBox:~/Scala_project$ 
--------------------------------------------------------------------------------------
5) sbt package - creates a JAR file (or a WAR file for web projects) containing the files in src/main/scala and src/main/java, and the resources in src/main/resources.
----------------------------------------------------------------------------------------
spb@spb-VirtualBox:~/Scala_project$ sbt package
[info] Loading project definition from /home/spb/Scala_project/project
[info] Set current project to SPB_Hello world (in build file:/home/spb/Scala_project/)
[info] Packaging /home/spb/Scala_project/target/scala-2.11/spb_hello-world_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed 5 Mar, 2017 1:27:41 PM
spb@spb-VirtualBox:~/Scala_project$
---------------------------------------------------------------------------------------
Check the target directory for the JAR file that was created.


  
Now you can submit the application to the Apache Spark framework, as shown here:
 ---------------------------------------------------------------------------------
spb@spb-VirtualBox:~/Scala_project/target/scala-2.11$ spark-submit --class HelloWorld spb_hello-world_2.11-1.0.jar
Hello world and welcome to scala
spb@spb-VirtualBox:~/Scala_project/target/scala-2.11$
----------------------------------------------------------------------------------- 

Additionally, you can make use of sbt console, which compiles the source files in the project, puts them on the classpath, and starts the Scala interpreter (REPL). sbt has other features that come in handy during development, such as continuous compilation and testing with triggered execution, and much more.
                                                                                ---THE END---

Tuesday, January 24, 2017

Spectrum LSF: accounting logs for resource usage statistics





The bacct command uses the current lsb.acct file for its output. It displays a summary of accounting statistics for all finished jobs (with a DONE or EXIT status) submitted by the user who invoked the command, across all hosts, projects, and queues in the LSF system. bacct reports on all jobs logged in the current LSF accounting log file: LSB_SHAREDIR/cluster_name/logdir/lsb.acct.

The lsb.acct file is the batch job log file of LSF. The master batch daemon mbatchd generates a record for each job completion or failure and appends it to the job log file lsb.acct.

The file is located at LSB_SHAREDIR/cluster_name/logdir.

[root@c61f4s20 conf]# bacct
/lsf_home/work/CI_cluster1/logdir/lsb.acct: No such file or directory
[root@c61f4s20 conf]#

NOTE: Create a file "lsb.acct" at LSB_SHAREDIR/cluster_name/logdir
and restart the LSF daemons so that the change takes effect.
----------------------------------------
[lsfadmin@localhost logdir]$ pwd
/lsf_home/work/CI_cluster1/logdir
[lsfadmin@localhost logdir]$ ls -alsrt lsb.acct
44 -rw-r--r-- 1 lsfadmin lsfadmin 41182 Jan 25 01:32 lsb.acct
[lsfadmin@localhost logdir]$
----------------------------------------


[lsfadmin@localhost logdir]$ head -2 lsb.acct
"JOB_FINISH" "10.1" 1484136455 106 1002 33554438 1 1484136355 0 0 1484136355 "lsfadmin" "normal" "" "" "" "c61f4s20" "/lsf_home/conf" "" "" "" "1484136355.106" 1 "c712f6n07" 1 "c712f6n07" 64 250.0 "" "sleep 100" 0.505916 0.029839 9088 0 -1 0 0 931 0 0 0 256 -1 0 0 0 18 7 -1 "" "default" 0 1 "" "" 0 12288 352256 "" "" "" "" 0 "" 0 "" -1 "/lsfadmin" "" "" "" -1 "" "" 5136 "" 1484136355 "" "" 2 1032 "0" 1033 "0" 0 -1 0 12288 "select[type == any] order[r15s:pg] " "" -1 "" -1 0 "" 0 0 "" 100 "/lsf_home/conf" 0 "" 0.000000 0.00 0.00 0.00 0.00 1 "c712f6n07" -1 0 0 0
"JOB_FINISH" "10.1" 1484136493 107 1002 33554438 1 1484136393 0 0 1484136393 "lsfadmin" "normal" "" "" "" "c61f4s20" "/lsf_home/conf" "" "" "" "1484136393.107" 1 "localhost" 1 "localhost" 64 250.0 "" "sleep 100" 0.516205 0.018902 9088 0 -1 0 0 932 0 0 0 256 -1 0 0 0 18 7 -1 "" "default" 0 1 "" "" 0 12288 352256 "" "" "" "" 0 "" 0 "" -1 "/lsfadmin" "" "" "" -1 "" "" 5136 "" 1484136393 "" "" 2 1032 "0" 1033 "0" 0 -1 0 12288 "select[type == any] order[r15s:pg] " "" -1 "" -1 0 "" 0 0 "" 100 "/lsf_home/conf" 0 "" 0.000000 0.00 0.00 0.00 0.00 1 "localhost" -1 0 0 0
[lsfadmin@localhost logdir]$

-------------------------------
1) Submit a  job:

[lsfadmin@localhost ~]$ bsub
bsub> sleep 400
bsub> Job <232> is submitted to default queue <normal>.

2) Check status

[lsfadmin@localhost ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
232     lsfadmi RUN   normal     localhost   c712f6n07   sleep 400  Jan 25 01:25

3) Verify the statistics for jobs submitted by the user lsfadmin, using the bacct command:
[lsfadmin@localhost ~]$ bacct

Accounting information about jobs that are:
  - submitted by users lsfadmin,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:      69      Total number of exited jobs:    26
 Total CPU time consumed:     146.0      Average CPU time consumed:     1.5
 Maximum CPU time of a job:     5.0      Minimum CPU time of a job:     0.0
 Total wait time in queues: 1181375.0
 Average wait time in queue:12435.5
 Maximum wait time in queue:681365.0      Minimum wait time in queue:    0.0
 Average turnaround time:     12738 (seconds/job)
 Maximum turnaround time:    681365      Minimum turnaround time:         2
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.02      Minimum hog factor of a job:  0.00
 Average expansion factor of a job:  12435.08 ( turnaround time / run time )
 Maximum expansion factor of a job:  681365.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:     28773      Average Run time consumed:     302
 Maximum Run time of a job:    1000      Minimum Run time of a job:       0
 Total throughput:             0.29 (jobs/hour)  during  330.25 hours
 Beginning time:       Jan 11 07:07      Ending time:          Jan 25 01:22

[lsfadmin@localhost ~]$

NOTE: Custom statistics that bacct does not report, but that are of interest to individual system administrators, can be generated by processing the lsb.acct file directly with awk or perl.
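
For example, a minimal sketch (assuming the stock lsb.acct format shown above, where the first field of every record is the event type in double quotes):

awk '$1 == "\"JOB_FINISH\"" {n++} END {print n " finished jobs"}' lsb.acct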

Reference:
1) http://www.ibm.com/support/knowledgecenter/SSJSMF_9.1.2/lsf/workbookhelp/wb_usecase_lsf.html
2) https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.3/lsf_command_ref/bacct.1.html
3) https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.2/lsf_admin/job_exit_info_view_logged.html
4) http://www-01.ibm.com/support/docview.wss?uid=isg3T1013490

Spectrum LSF 10.1 Installation and Job Submission


Load Sharing Facility (or simply LSF) is a workload management platform, job scheduler, for distributed HPC environments. It can be used to execute batch jobs on networked Unix and Windows systems on many different architectures. LSF was based on the Utopia research project at the University of Toronto.

IBM Platform Computing has been renamed IBM Spectrum Computing to complement IBM's Spectrum Storage family of software-defined offerings. The IBM Platform LSF product is now IBM Spectrum LSF.
IBM Spectrum LSF 10.1 is available as the following offering packages: 
1) IBM Spectrum LSF Community Edition 10.1, 
2) IBM Spectrum LSF Suite for Workgroups 10.1, and 
3) IBM Spectrum LSF Suite for HPC 10.1.

LSF provides a resource management framework that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies.


LSF daemons and processes

Multiple LSF processes run on each host in the cluster. The type and number of processes that are running depends on whether the host is a master host or a compute host.


LSF hosts run various daemon processes, depending on their role in the cluster.


Daemon      Role
mbatchd     Job requests and dispatch
mbschd      Job scheduling
sbatchd     Job execution
res         Job execution
lim         Host information
pim         Job process information
elim        Dynamic load indexes


Installation Steps :

Step 1 : Create installation directory: 
[root@localhost LSF_installation_files]# pwd
/root/LSF_installation_files
[root@localhost LSF_installation_files]#

Step 2 : Untar the package 
----------------
[root@localhost LSF_installation_files]# ls
lsfce10.1-x86_64  lsfce10.1-x86_64.tar.gz
[root@localhost LSF_installation_files]#


[root@localhost LSF_installation_files]# gunzip -c lsfce10.1-x86_64.tar.gz | tar xvf -
lsfce10.1-x86_64/
lsfce10.1-x86_64/pmpi/
lsfce10.1-x86_64/pmpi/platform_mpi-09.01.02.00u.x64.bin
lsfce10.1-x86_64/lsf/
lsfce10.1-x86_64/lsf/lsf10.1_lsfinstall_linux_x86_64.tar.Z
lsfce10.1-x86_64/lsf/lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z
lsfce10.1-x86_64/pac/
lsfce10.1-x86_64/pac/pac10.1_basic_linux-x64.tar.Z
[root@localhost LSF_installation_files]#

/root/LSF_installation_files/lsfce10.1-x86_64/lsf/
[root@localhost lsf]# ls
lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z  lsf10.1_lsfinstall_linux_x86_64.tar.Z
[root@localhost lsf]# zcat lsf10.1_lsfinstall_linux_x86_64.tar.Z | tar xvf -

----------------------

[root@localhost lsf10.1_lsfinstall]# pwd
/root/LSF_installation_files/lsfce10.1-x86_64/lsf/lsf10.1_lsfinstall
[root@localhost lsf10.1_lsfinstall]# ls
conf_tmpl  install.config  lap         lsf_unix_install.pdf  patchlib   README      rpm      slave.config
hostsetup  instlib         lsfinstall  patchinstall          pversions  rhostsetup  scripts
[root@localhost lsf10.1_lsfinstall]#

---------------------------
=========================
2. Use lsfinstall
========================
The installation program for IBM Spectrum LSF Version 10.1 is lsfinstall.
Use the lsfinstall script to install a new LSF Version 10.1 cluster.

------------------------
2.1 Steps
------------------------
1. Edit lsf10.1_lsfinstall/install.config to specify the options
   for your cluster. Uncomment the options you want and replace the
   example values with your own settings.
2. Run lsf10.1_lsfinstall/lsfinstall -f install.config
3. Read the following files generated by lsfinstall:
   o  lsf10.1_lsfinstall/lsf_getting_started.html to find out how
      to set up your LSF hosts, start LSF, and test your new LSF cluster
   o  lsf10.1_lsfinstall/lsf_quick_admin.html to learn more about
      your new LSF cluster

--------------------------------------------------------------------------
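Before running lsfinstall, edit install.config. A minimal example might look like the following (the values are placeholders for this environment; the option names are standard lsfinstall parameters):

LSF_TOP="/usr/share/lsf"
LSF_ADMINS="lsfadmin"
LSF_CLUSTER_NAME="CI_cluster1"
LSF_MASTER_LIST="c61f4s20"
LSF_TARDIR="/root/LSF_installation_files/lsfce10.1-x86_64/lsf"
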
Start the install script:


[root@localhost lsf10.1_lsfinstall]# ./lsfinstall -f install.config


Cron scheduler - cron.daily, cron.weekly, cron.monthly

Cron is a daemon that can be used to schedule the execution of recurring tasks according to a combination of the time, day of the month, month, day of the week, and week.

To check which cron RPM packages are installed:


[root@ ~]# rpm -qa | grep cron
cronie-anacron-1.4.11-14.el7_2.1.ppc64le
crontabs-1.11-6.20121102git.el7.noarch
cronie-1.4.11-14.el7_2.1.ppc64le

[root@~]#
-------------------------------------------------

To check the status of cron daemon:

[root@~]# /sbin/service crond status
Redirecting to /bin/systemctl status  crond.service
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2016-11-30 15:05:45 EST; 1 months 24 days ago
 Main PID: 4633 (crond)
   CGroup: /system.slice/crond.service
           └─4633 /usr/sbin/crond -n

Nov 30 15:05:45 hostname systemd[1]: Started Command Scheduler.
Nov 30 15:05:45 hostname systemd[1]: Starting Command Scheduler...
Nov 30 15:05:45 hostname crond[4633]: (CRON) INFO (RANDOM_DELAY will be scaled with factor 71% if used.)
Nov 30 15:05:45 hostname  crond[4633]: (CRON) INFO (running with inotify support)
[root@ ~]#

------------------------------------------------------
The main configuration file for cron, /etc/crontab, contains the following lines:

[root@~]# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root

# For details see man 4 crontabs

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed


[root@~]#
-------------------------------------------------------------
Each line in the /etc/crontab file represents a task and has the following format:

minute   hour   day   month   dayofweek   command


  • minute — any integer from 0 to 59
  • hour — any integer from 0 to 23
  • day — any integer from 1 to 31 (must be a valid day if a month is specified)
  • month — any integer from 1 to 12 (or the short name of the month such as jan or feb)
  • dayofweek — any integer from 0 to 7, where 0 or 7 represents Sunday (or the short name of the week such as sun or mon)
  • command — the command to execute (the command can either be a command such as ls /proc >> /tmp/proc or the command to execute a custom script)
The run-parts script is used to execute the scripts in the /etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, and /etc/cron.monthly directories on an hourly, daily, weekly, or monthly basis respectively (on RHEL/CentOS 7 this is driven by the /etc/cron.d/0hourly job and anacron rather than by /etc/crontab itself). The files in these directories should be shell scripts.

# record the memory usage of the system every monday 
# at 3:30AM in the file /tmp/meminfo
30 3 * * mon cat /proc/meminfo >> /tmp/meminfo
# run custom script the first day of every month at 4:10AM
10 4 1 * * /root/scripts/backup.sh
If a cron task needs to be executed on a schedule other than hourly, daily, weekly, or monthly, it can be added to the /etc/cron.d directory.
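
A file in /etc/cron.d uses the same format as /etc/crontab, including the user-name field. For example, a hypothetical /etc/cron.d/sync-logs:

# run a custom script every 10 minutes as root
*/10 * * * * root /usr/local/bin/sync-logs.sh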
------------------------------------

How to test your cron jobs instantly :


[root@ jenkins]# crontab -l

* * * * * /etc/cron.weekly/test.sh &>>/tmp/cron_debug_log.log

[root@ jenkins]#

 tail -f /tmp/cron_debug_log.log

This runs the command every minute and appends its output to the log file, so you can watch the results immediately.

---------------------------------
[root@ ]# run-parts /etc/cron.weekly -v
/etc/cron.weekly/test.sh:
[root@ ]#
------------------------------------------

RHEL / CentOS: find out when the /etc/cron.{daily,weekly,monthly}/ jobs run
-----------------------------------------------------------------------------------------------
 cat /etc/anacrontab
# /etc/anacrontab: configuration file for anacron

# See anacron(8) and anacrontab(5) for details.

SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22

#period in days   delay in minutes   job-identifier   command
1       5       cron.daily              nice run-parts /etc/cron.daily
7       25      cron.weekly             nice run-parts /etc/cron.weekly
@monthly 45     cron.monthly            nice run-parts /etc/cron.monthly
------------------------------------------------------------------------------------------------

Cron also offers some special strings, which can be used in place of the five time-and-date fields:
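
@reboot     Run once, at startup
@yearly     Run once a year, equivalent to "0 0 1 1 *" (@annually is the same)
@monthly    Run once a month, equivalent to "0 0 1 * *"
@weekly     Run once a week, equivalent to "0 0 * * 0"
@daily      Run once a day, equivalent to "0 0 * * *" (@midnight is the same)
@hourly     Run once an hour, equivalent to "0 * * * *"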

Reference:

1) https://www.cyberciti.biz/faq/linux-when-does-cron-daily-weekly-monthly-run/
2) https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/3/html/System_Administration_Guide/ch-autotasks.html
3) https://help.ubuntu.com/community/CronHowto
4) https://www.digitalocean.com/community/tutorials/how-to-schedule-routine-tasks-with-cron-and-anacron-on-a-vps