Spark as a catalyst to learn Scala

There is no doubt that the Apache Spark project has a lot of momentum in the Big Data world right now. Short of curing cancer, Spark appears able to solve every data problem people have. MapReduce-style batch workflows – check, real-time streaming – check, working with graphs – check, machine learning – check. So I embarked on a journey to learn as much as possible about Spark.
The first thing you see when you start looking into Spark is that it supports three languages – Scala, Java and Python – and that Spark itself is written in Scala. While I have limited knowledge of Python, as a Java developer I started exploring Spark with Java. Java and Scala get first-class support in Spark, while Python support keeps getting better with each release.
The Spark documentation provides an easy way to compare code snippets in all three languages side by side. So I started with Java 7.
Here is the drawback of using Java 7 with Spark that turned me off: Java 7 does not have anonymous functions as first-class citizens, so you have to write anonymous classes to pass to Spark, which creates a lot of boilerplate code. Spark's ideas build on functional programming concepts, and Java 7 does not lend itself to functional programming. So I was envious of Python programmers, who can express an idea in one line of code where I have to write ten.
Expressing a simple filter in Scala:

val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()

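These snippets assume logData already holds the lines of a text file. For context, here is a minimal Scala setup sketch – the app name, master URL and file path are placeholders, not part of the original example:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master URL; adjust to your environment
val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
val sc = new SparkContext(conf)
// Cached because we run two actions (the two counts) on it
val logData = sc.textFile("README.md").cache()
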
The same idea in Java 7 – three times more code:
import org.apache.spark.api.java.function.Function;

long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

So the next logical step was to use Java 8. The Spark team developed support for Java 8, and with lambdas the code is not bad at all. See the Cloudera blog post Making Apache Spark Easier to Use in Java with Java 8.
Java 8, not bad at all:

long numAs = logData.filter(s -> s.contains("a")).count();
long numBs = logData.filter(s -> s.contains("b")).count();

The problem for me was: what am I going to do with this Spark code? I want to run it on YARN on my Hadoop cluster, but none of the three major Hadoop distributions is certified for Java 8 yet. So I would be limited to running Spark on a dedicated Spark cluster, which is not what most people want to do. I want to utilize YARN or Mesos and the existing investments and integrations in Hadoop, and run Spark on top of my YARN cluster. Plus, I still feel Java 8 carries too much baggage to be a true functional language with powerful syntax. It comes close, and for those interested, I recommend reading Functional Programming in Java: Harnessing the Power of Java 8 Lambda Expressions by Venkat Subramaniam.
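For reference, once your cluster does support it, submitting a compiled job to YARN is a single spark-submit command; the class name and jar file below are placeholders:

spark-submit --class com.example.MyApp --master yarn-cluster my-spark-app.jar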

Another problem is that Spark provides a shell for Scala and Python, but not for Java, which limited my ability to quickly prototype ideas. So the options left are Python and Scala. Spark itself is written in Scala, so I have a feeling Scala will always be the language in which you can get anything done in Spark. Second, I still love statically typed languages and want my compiler to help me with coding and catch type-related bugs. So I started looking into Scala. I wanted to write my Spark jobs with ease and elegance.
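For example, in the Scala shell (bin/spark-shell) a SparkContext is already available as sc, so you can try a filter idea interactively; the file name below is a placeholder:

scala> val lines = sc.textFile("README.md")
scala> lines.filter(_.contains("Spark")).count()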
I should admit that I tried to learn Scala a long time ago, but I was not persistent enough and dropped it somewhere in the middle. Scala has a reputation as a big, hard-to-learn language. This time my reasoning went along the following lines: to be productive in the limited space of Spark tasks, I probably don't need to master all of Scala; maybe 30–40% will be enough to be productive, and I can learn and master the rest later.
So I started reading Scala books and experimenting with Scala, and I loved it. I should note that I have good knowledge of Groovy, another JVM language with a strong functional flavor, and some constructs in Scala are similar to Groovy, which made my learning easier.
The extensive object-oriented features and the functional way of thinking blew my mind. Thinking in terms of functions as first-class citizens, immutable data, reusable collections machinery, and recursion instead of loops is indeed a lot of fun. Scala syntax, which looked weird at first, started to make a lot of sense. So I was seduced by Scala. Suddenly I was able to write and understand Spark jobs without all the Java 7 boilerplate code.
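To make that concrete, here is a small plain-Scala sketch (no Spark involved) contrasting an imperative loop with the functional style:

// Imperative style: a mutable counter and an explicit loop
var total = 0
for (n <- List(1, 2, 3, 4, 5)) total += n * n

// Functional style: immutable data and higher-order functions
val sumOfSquares = List(1, 2, 3, 4, 5).map(n => n * n).sum

// Recursion instead of a loop
def factorial(n: Int): Long = if (n <= 1) 1L else n * factorial(n - 1)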

I think it is worth investing time in Scala even if your sole goal is writing Spark jobs. But Scala is much bigger than Spark; suddenly a whole new world of functional programming opens up for you, and you will find yourself doing things differently, with more elegance and far fewer bugs.
Here are some resources I found invaluable in my Scala journey:

  1. Scala for the Impatient
  2. Programming Scala: Scalability = Functional Programming + Objects
  3. Functional Programming in Scala
  4. Coursera class – Functional Programming Principles in Scala
  5. Learning Spark: Lightning-Fast Big Data Analysis

And for your entertainment, check out this video from JavaZone, a premier Java developer conference in Europe, in which a Java guy meets a girl named Scala. I hope you get the same feeling – that you have met your new love, Scala.
