Spark 101 for Scala Users

A quick hands-on intro to Spark for Scala users.

I’ll format this into a more detailed presentation later (so feel free to check back and bug me if I’m not getting around to it) but here are some immediate things you may be interested in if you saw my Austin Scala Enthusiasts Meetup presentation…

Here’s a link to the PDF of the slides I talked to.

Running Zeppelin via a Docker container

docker run --name zeppelin -p 8080:8080 -p 4040:4040 -v $HOME/spark/data:/data -v $HOME/spark/logs:/logs -v $HOME/spark/notebook:/notebook -e ZEPPELIN_NOTEBOOK_DIR='/notebook' -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_INT_JAVA_OPTS="-Dspark.driver.memory=4G" -e ZEPPELIN_INTP_MEM="-Xmx4g" -d apache/zeppelin:0.9.0 /zeppelin/bin/zeppelin.sh

Running Spark via a Docker container

docker run --name spark -v $HOME/spark/data:/data -p 4040:4040 -it mesosphere/spark bin/spark-shell

For a basic Spark SBT project

build.sbt:

import Dependencies._

ThisBuild / scalaVersion     := "2.12.11"
ThisBuild / version          := "0.1.0-SNAPSHOT"
ThisBuild / organization     := "com.example"
ThisBuild / organizationName := "Meetup Spark Example"
ThisBuild / scalacOptions ++= Seq("-language:higherKinds")

lazy val root = (project in file("."))
  .settings(
    name := "SparkCatScratch",
    libraryDependencies ++= Seq( scalaTest % Test, sparkCore, sparkSQL, catsCore, catsFree, catsMTL)
  )

console / initialCommands :=
  s"""
    |import cats._, cats.data._, cats.implicits._, org.apache.spark.sql.SparkSession
    |val spark = SparkSession.builder().master("local").getOrCreate()
    |""".stripMargin

console / cleanupCommands := "spark.close()"

project/Dependencies.scala:

import sbt._

object Dependencies {

  val sparkVersion = "2.4.5"
  val catsVersion = "2.0.0"

  lazy val scalaTest = "org.scalatest" %% "scalatest" % "3.0.8"
  lazy val sparkCore = "org.apache.spark" %% "spark-core" % sparkVersion
  lazy val sparkSQL = "org.apache.spark" %% "spark-sql" % sparkVersion
  lazy val catsCore = "org.typelevel" %% "cats-core" % catsVersion
  lazy val catsFree = "org.typelevel" %% "cats-free" % catsVersion
  lazy val catsMTL = "org.typelevel" %% "cats-mtl-core" % "0.7.0"
}

Starting Spark in the SBT console:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
val sc = spark.sparkContext

Funny Bug of the Day (Java)

This took a little while to figure out!

Date startDate = new Date();
Date endDate = new Date(startDate.getTime() + (24 * 3600000 * 42));

This was expected to set startDate to right now (Feb 20, 2013 5:17:10 PM) and endDate to six weeks later (Apr 3, 2013 6:17:10 PM), but instead the end date was computed to be earlier than the start date… Feb 13, 2013 in fact!
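The rest of that post is not reproduced here, but the symptom is what you would expect from 32-bit integer overflow: every operand in 24 * 3600000 * 42 is an int, so the product wraps around before it is widened to long and added to getTime(). Here is a minimal sketch of that reading. The class and method names are mine, purely for illustration.

```java
import java.util.Date;

public class DateOverflowDemo {

    // Same arithmetic as the snippet above: all operands are int,
    // so the multiplication is done in 32 bits and wraps around.
    static long brokenOffsetMillis() {
        return 24 * 3600000 * 42;
    }

    // Making the first operand a long promotes the whole product
    // to 64-bit arithmetic, so nothing overflows.
    static long fixedOffsetMillis() {
        return 24L * 3600000 * 42;
    }

    public static void main(String[] args) {
        Date startDate = new Date();
        Date endDate = new Date(startDate.getTime() + fixedOffsetMillis());

        System.out.println("int arithmetic:  " + brokenOffsetMillis() + " ms");
        System.out.println("long arithmetic: " + fixedOffsetMillis() + " ms");
        System.out.println(startDate + " -> " + endDate);
    }
}
```

The intended offset is 3,628,800,000 ms, which is larger than Integer.MAX_VALUE (2,147,483,647). Wrapped to 32 bits it becomes -666,167,296 ms, or roughly 7.7 days in the past, which lines up with the Feb 13 result.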

What’s special about SwitchYard? It’s SCA.

Service Component Architecture (SCA) is the marriage of these two things, SOA and Dependency Injection. It essentially recognizes that some of your building blocks are going to be local, others are going to be external, all of them need to be as reusable as possible, and their internal/external exposure may change over time. In other words, it’s a SOA orchestration of services that is a fluid hybrid between the two worlds. Your code will periodically use a @Service tag to declare a service dependency. And some XML configuration file will help map out the ESB-specifics of (a) exposing necessary services or service compositions to the outside and (b) getting access to outside service dependencies that you’ll need.
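To make the local-versus-external point concrete, here is a rough plain-Java sketch. It deliberately does not use SwitchYard’s annotations or XML. Every name in it (InventoryService, OrderDesk, the transport function) is hypothetical, and the "remote" implementation is just a stand-in for whatever binding the ESB would actually provide.

```java
import java.util.Map;
import java.util.function.Function;

public class ScaSketch {

    // The contract that consumers depend on; it says nothing about
    // where the implementation actually runs.
    interface InventoryService {
        int unitsInStock(String sku);
    }

    // A local building block: ordinary in-process code.
    static class LocalInventoryService implements InventoryService {
        private final Map<String, Integer> stock;

        LocalInventoryService(Map<String, Integer> stock) {
            this.stock = stock;
        }

        public int unitsInStock(String sku) {
            return stock.getOrDefault(sku, 0);
        }
    }

    // Stand-in for an external service. The transport function is a
    // hypothetical placeholder for a real SOAP/REST/JMS call.
    static class RemoteInventoryService implements InventoryService {
        private final Function<String, String> transport;

        RemoteInventoryService(Function<String, String> transport) {
            this.transport = transport;
        }

        public int unitsInStock(String sku) {
            return Integer.parseInt(transport.apply(sku).trim());
        }
    }

    // The consumer only declares that it needs an InventoryService;
    // the wiring decides whether that is local or external.
    static class OrderDesk {
        private final InventoryService inventory;

        OrderDesk(InventoryService inventory) {
            this.inventory = inventory;
        }

        boolean canFulfil(String sku, int quantity) {
            return inventory.unitsInStock(sku) >= quantity;
        }
    }

    public static void main(String[] args) {
        OrderDesk local = new OrderDesk(new LocalInventoryService(Map.of("widget", 5)));
        OrderDesk remote = new OrderDesk(new RemoteInventoryService(sku -> "3"));

        System.out.println(local.canFulfil("widget", 4));
        System.out.println(remote.canFulfil("widget", 4));
    }
}
```

OrderDesk never changes when the inventory service moves from in-process to external. Only the wiring in main changes, and that wiring is the part the paragraph above hands off to annotations and XML configuration.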

Over the past several years I’ve become wildly enthusiastic about certain “trends” in software development that promote clean, agile, distributable code. The primary buzzwords I’m referring to are “SOA” (Service Oriented Architecture) and “Dependency Injection”. Heck, I’ll even throw “SOAP” into the mix, even though that’s a more specific protocol.

If I talk to other developers about these things, they look at me funny or they show disinterest or suspicion. I think some people think I’ve gone crazy over some trendy buzzwords (i.e. pointy-haired-manager syndrome) and am just trying to evangelize some lame trend that will come and go. Or there’s the classic “SOAP? Oh yeah, I tried to do something with that and got so confused and just hated it. *shiver*”

(By the way, SOAP may have some hard-to-parse standards, and the WSDL can be daunting and confusing, but it’s really simple: you are taking your XML request and wrapping it in a simple envelope that is often nothing more than a single parent <soap> tag! There are some provisions for allowing messaging meta-data to be added in a header section, but all that stuff is optional. Anyway, if you are one of those people who is afraid of SOAP, take a second look; it’s not all that bad.)

Dependency Injection and Service Oriented Architecture are interesting topics in that they really should be easier to explain. I’ve read so many books and articles, and authors have written a lot of insightful material, but I haven’t run into anyone who has been able to expose the inherent simplicity behind these things. Well, I might try my hand at the task someday, but not in this blog post. My purpose is a bit more specific… to share an insight about Service Component Architecture.