Getting Started with Spark: local / standalone

Recently one of our clients wanted to demo the ML algorithm that we’ve developed on their laptop (with no internet connection), so running things on a Spark cluster was not an option.  Note that Spark’s “standalone” mode actually means running a cluster (Spark’s own cluster manager, with a master and workers), even if it all sits on one machine, while a “self-contained” application running in local mode is the light version (the one we will be using).

Luckily, getting Spark running locally takes just a couple of minutes.  Here are the brief steps.

1. Download Spark (the package pre-built for Hadoop 2.6, so you don’t have to compile it, which saves several minutes) and unzip it in a directory of your choice (e.g. “/opt/”).

2.  Create an sbt project (IntelliJ is great for that), or just do it manually.  Include Spark as a dependency (in build.sbt):

name := "spark-test"

version := "1.0"

scalaVersion := "2.11.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
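
For reference, the %% operator in the dependency line asks sbt to append the project's Scala binary version to the artifact name. Assuming the scalaVersion above (2.11.x), the line should be equivalent to the explicit form sketched here. This only illustrates what sbt resolves; it is not something you need to add:

```scala
// build.sbt (alternative spelling of the same dependency, for illustration)
// %% appends the Scala binary version ("_2.11") to the artifact name,
// so with scalaVersion 2.11.x this points at the same artifact as the %% form.
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.3.1"
```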

3.  Create a sample app:

/* SimpleApp.scala; based on: */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "FULL-PATH-TO-SOME-FILE" // Should be some file on your system
    // setMaster("local") runs Spark inside this JVM, so no cluster is needed
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the file as an RDD with at least 2 partitions and cache it for the two counts below
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
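
The master URL passed to setMaster controls how much of the machine local mode uses: "local" runs with a single worker thread, "local[2]" with two, and "local[*]" with as many threads as there are cores. Here is a minimal sketch that needs no input file; the object name LocalMasterSketch and the helper countEvens are made up for illustration and are not part of the app above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, only meant to illustrate the master URL.
object LocalMasterSketch {
  // Counts the even numbers in 1..n using an RDD split into 4 partitions.
  def countEvens(sc: SparkContext, n: Int): Long =
    sc.parallelize(1 to n, 4).filter(_ % 2 == 0).count()

  def main(args: Array[String]): Unit = {
    // "local[*]": run in-process with one worker thread per available core.
    val conf = new SparkConf().setAppName("Local Master Sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println("Master: " + sc.master)
    println("Even numbers up to 1000: " + countEvens(sc, 1000))
    sc.stop() // release Spark's resources before exiting
  }
}
```

For a laptop demo, "local[*]" is usually the convenient choice, since it uses whatever cores the machine has without any further configuration.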

4. Execute:

sbt run

You will see something like:

15/05/31 12:17:57 INFO BlockManager: Found block rdd_1_1 locally
15/05/31 12:17:57 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 16 ms on localhost (1/2)
15/05/31 12:17:57 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
15/05/31 12:17:57 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
15/05/31 12:17:57 INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:15) finished in 0.020 s
15/05/31 12:17:57 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/05/31 12:17:57 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:15, took 0.034734 s
Lines with a: 13, Lines with b: 4
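
On a small file it can be reassuring to cross-check the two counts without Spark. The sketch below does the same line filtering with plain Scala collections; PlainCount and countLinesContaining are illustrative names, and the path is the same placeholder as in SimpleApp.

```scala
import scala.io.Source

// Hypothetical helper for sanity-checking the Spark counts on a small file.
object PlainCount {
  // Number of lines in the file at `path` that contain `needle`.
  def countLinesContaining(path: String, needle: String): Int = {
    val source = Source.fromFile(path)
    try source.getLines().count(_.contains(needle))
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val logFile = "FULL-PATH-TO-SOME-FILE" // same placeholder as in SimpleApp
    val numAs = countLinesContaining(logFile, "a")
    val numBs = countLinesContaining(logFile, "b")
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
```

If the numbers printed here match the Spark output for the same file, the local setup is most likely reading and filtering the data as intended.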



You may see some exceptions, but those are simply due to the sudden shutdown of Spark’s background threads (e.g. ContextCleaner and SparkListenerBus) after the results are obtained:

 at java.lang.Object.wait(Native Method)
 at java.lang.ref.ReferenceQueue.remove(
 at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:146)
 at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
 at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
 at org.apache.spark.ContextCleaner$$anon$
15/05/31 16:46:34 ERROR Utils: Uncaught exception in thread SparkListenerBus
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(
 at java.util.concurrent.Semaphore.acquire(
 at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:62)
 at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
 at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
 at org.apache.spark.util.AsynchronousListenerBus$$anon$
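
One way that may avoid these traces is to stop the SparkContext explicitly before main returns, so that Spark's background threads are shut down in an orderly way rather than interrupted. Here is a sketch under that assumption (not verified against the exact setup above); SimpleAppWithStop and countLines are hypothetical names, and the file path is still a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant of the sample app that always stops the context.
object SimpleAppWithStop {
  // Counts the lines of the file at `path` that contain `needle`.
  def countLines(sc: SparkContext, path: String, needle: String): Long =
    sc.textFile(path, 2).filter(line => line.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    val logFile = "FULL-PATH-TO-SOME-FILE" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val numAs = countLines(sc, logFile, "a")
      val numBs = countLines(sc, logFile, "b")
      println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    } finally {
      // Runs even if a count fails, so the context is never left dangling.
      sc.stop()
    }
  }
}
```

The try/finally is a design choice rather than a requirement: a plain sc.stop() at the end of main would also do, but it would be skipped if one of the counts threw an exception.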




keywords: apache spark local machine no cluster tutorial embedded

About Neil Rubens
