“Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives” is a new book by Vijay Agneeswaran on the topic of Big Data.
The author lays out why Hadoop, and especially its MapReduce computational model, is not well suited to a number of use cases.
He divides those cases into three broad categories:
- Real-Time Analytics
- Analytics Involving Iterative Machine Learning
- Processing Large Graphs
Chapter 2 introduces the Berkeley Data Analytics Stack (BDAS). The most famous part of this technology stack is Spark, but attention is also paid to Mesos and to SQL on Spark (Shark). You learn about the motivations for creating Spark and the architecture of the whole stack.
I found the next two chapters the most interesting and somewhat practical: they cover how to implement machine learning algorithms in Spark and, in real time, with Storm. After a very brief introduction to machine learning, the author presents the following topics with some code implementations:
- Logistic Regression Algorithm in Spark
- Support Vector Machine (SVM)
- Predictive Model Markup Language (PMML) support in Spark
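To make the first item in the list above concrete, here is a minimal sketch of logistic regression trained by batch gradient descent. It is plain Python rather than the book's code or the Spark API; the function names, the toy data, and the learning-rate and iteration defaults are all my own assumptions for illustration. The point it shows is that every iteration makes a full pass over the same data set, which is the access pattern that makes iterative machine learning a poor fit for plain MapReduce.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(points, iterations=500, learning_rate=0.5):
    """Batch gradient descent. `points` is a list of (features, label)
    pairs with label 0 or 1. All names here are illustrative only."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        # One full pass over the data per iteration.
        gradient = [0.0] * dim
        for features, label in points:
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        # Step against the averaged gradient.
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: the first feature is a constant 1.0 (bias term).
TOY_POINTS = [
    ([1.0, 0.0], 0),
    ([1.0, 1.0], 0),
    ([1.0, 3.0], 1),
    ([1.0, 4.0], 1),
]

if __name__ == "__main__":
    learned = train_logistic_regression(TOY_POINTS)
    print([predict(learned, features) for features, _ in TOY_POINTS])
```

In a Spark version, the inner loop over `points` would roughly correspond to a map over a cached data set followed by a reduce that sums the per-point gradients, so the data stays in memory between iterations instead of being reread from disk each time.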
The next part of the book describes how to implement some machine learning algorithms in real time. The book provides a brief explanation of Storm concepts and programming constructs. It also has a chapter on Storm design patterns. This book will not teach you Storm or Trident; there are plenty of other books that address that task. The two introductory Storm chapters are there to set the stage for further Storm discussion and to give you a high-level overview of Storm.
Then, as with Spark, the author demonstrates code for the following tasks:
- Implementing logistic regression as a Storm bolt. The author uses the Mahout implementation here.
- Implementing the Support Vector Machine algorithm in Storm
- Naive Bayes PMML Support in Storm
After discussing those implementations, the author gets more practical and shows how to write two real-world applications:
- A classification system for manufacturing logs
- Internet page classification
The first use case is from an electronics manufacturing company. The devices on the shop floor perform tests on the input data and send out logs in the form of unstructured text that records the run of the test and its output. Each log captures the parameters and their values for each run of the test, along with the output. The intention is to understand whether the test passed or failed. The book gives a sample log file so that the reader can understand what has to be processed and analyzed. The trick here is to have an ML algorithm learn the patterns of failure and act upon what it has learned.
There are some code samples but, unlike with the majority of technical books aimed at practitioners, I was not able to find a GitHub repository with all of them. They are scattered throughout the book, and some, but not all, are in Appendix A, titled “The code Sketches”. Some of the code samples are in Java and some are in C++, which makes them hard for Java developers to follow.
Overall, this book provides a good overview of the latest technology stack for Big Data and of how, in principle, to implement machine learning algorithms on Spark, Storm, and graph processing frameworks. Think of this book as a guided tour of these topics. When you need to dig deeper and build production-quality apps using these technologies, you will have to look elsewhere for details and more practical context.