Twitter Storm is an open-source real-time computation engine originally developed at BackType, a company acquired by Twitter in 2011, and now used internally at Twitter. Nathan Marz is the project's main contributor.
Storm can be used with any programming language for applications such as real-time analytics, continuous computation, distributed remote procedure calls, and integration with existing systems. Storm is designed to work with existing queuing and database technologies.
I spent some time researching and developing an application with Storm and wanted to share my overall positive experience.
There are many business cases that call for large-scale real-time processing: real-time data analytics, ETL pipelines, and machine learning applications, to name a few.
Storm allows you to create extremely scalable, fault-tolerant architectures while hiding most of the complexity of managing queues, nodes, serialization, and so on.
From a developer's standpoint, the main concepts are simple. Let's say we have an event-processing application. Events can come from a message queue or from other sources.
The first element, which reads the stream of data, is called a spout. It is simply a reader that listens to the queue and emits messages for processing down the line.
A spout can also perform other important functions: it transforms incoming messages into Storm's tuples, a serializable data structure. A spout also keeps track of messages and can replay those that fail.
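To make this concrete, here is a minimal, framework-free Python sketch of what a spout does: read raw input, turn each record into a tuple, and track in-flight messages so failed ones can be replayed. This is purely illustrative; Storm's actual spout interface is different, and the class and method names here are my own.

```python
class SentenceSpout:
    """Illustrative sketch of a spout (not Storm's actual API)."""

    def __init__(self, source):
        self.source = list(source)   # stands in for a message queue
        self.pending = {}            # msg_id -> tuple, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        """Emit the next message as a (msg_id, tuple) pair, or None."""
        if not self.source:
            return None
        line = self.source.pop(0)
        msg_id = self.next_id
        self.next_id += 1
        tup = (line,)                # a Storm tuple is a list of values
        self.pending[msg_id] = tup   # remember it until it is acked
        return msg_id, tup

    def ack(self, msg_id):
        """Downstream processing succeeded; forget the message."""
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        """Downstream processing failed; put the message back for replay."""
        tup = self.pending.pop(msg_id, None)
        if tup is not None:
            self.source.append(tup[0])
```

The key point is the ack/fail bookkeeping: a message only leaves the spout's memory once downstream processing confirms success, which is what makes replay possible.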
The spout then sends messages to the next element, called a bolt. A bolt can perform further transformations – for example, calculations or saving to a database.
In a map-reduce-style workload, a spout emits messages and multiple bolts carry out the map and reduce steps. The relationships between bolts can form a tree.
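A bolt's job can be sketched in the same framework-free style: receive a tuple, transform it, and emit results to the bolts downstream. Again, the names here are illustrative, not Storm's API; note how a single bolt can feed several downstream bolts, which is how the tree shape arises.

```python
class SplitWordsBolt:
    """Illustrative bolt: receives a line tuple, emits one tuple per word."""

    def __init__(self, downstream):
        self.downstream = downstream      # bolts this one emits to

    def execute(self, tup):
        line = tup[0]
        for word in line.split():
            # a bolt can emit to several bolts, forming a tree
            for bolt in self.downstream:
                bolt.execute((word,))


class CollectBolt:
    """Illustrative sink bolt that just accumulates what it receives."""

    def __init__(self):
        self.seen = []

    def execute(self, tup):
        self.seen.append(tup[0])


sink = CollectBolt()
bolt = SplitWordsBolt([sink])
bolt.execute(("hello storm world",))
print(sink.seen)  # ['hello', 'storm', 'world']
```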
For example, imagine a spout reading from a constant stream of text data. The spout converts the raw text into lines and emits each line to the next bolt for further processing. That bolt is responsible for breaking each line into words. It then emits each word to the next bolt, which in turn keeps track of how many times each word has appeared in the text. Finally, it may pass the data on to a bolt that simply writes it to permanent storage.
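End to end, the word-count pipeline just described can be sketched in plain Python, with each function standing in for one bolt. This is a conceptual illustration only; in Storm these steps would be separate bolts wired into a topology.

```python
from collections import Counter

def split_lines(raw_text):
    """First bolt: turn raw text into lines."""
    return raw_text.splitlines()

def split_words(line):
    """Second bolt: break a line into words."""
    return line.lower().split()

def count_words(lines):
    """Third bolt: keep a running count of word occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    return counts

counts = count_words(split_lines("the quick fox\nthe lazy dog"))
print(counts["the"])  # 2
```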
All of this workflow is organized into a topology, in which you specify the processing graph and how many processes and threads you want to allocate to each task. This is where scalability comes in. For example, if you need to allocate more computing power to a particular bolt, you can do so by setting its parallelism parameter when you create the topology. Storm will allocate additional threads or processes for the task. The best part: developers don't have to know the details of how it's done.
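To illustrate the parallelism idea without Storm itself, here is a toy Python sketch in which a single parameter controls how many workers the counting step gets. This is only an analogy: Storm's parallelism hint allocates executors across the cluster for you, while this sketch fans work out over local threads by hand.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(lines):
    """One worker's share of the counting work."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def run_topology(lines, parallelism=4):
    """Split the stream across `parallelism` workers, then merge results."""
    chunks = [lines[i::parallelism] for i in range(parallelism)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for partial in pool.map(count_partition, chunks):
            total.update(partial)
    return total

print(run_topology(["a b", "a c", "b a"], parallelism=2)["a"])  # 3
```

The result is the same whatever the parallelism value; only the number of workers changes – which is exactly the property that lets you scale a bolt by turning one knob.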
Storm provides interfaces and some abstract classes to make programming quite easy.
I highly recommend investigating this framework further, and using the following resources:
In “Storm: Distributed and Fault-tolerant Real-time Computation” Nathan Marz introduces Storm, his new tool for real-time data stream processing. This January 2011 QCon talk gives you a founder’s level overview of Storm, along with its main use cases.
Very good documentation – cheers to the developers
Active message board
storm-starter project on GitHub
storm-contrib – various Storm components, spouts and bolts, ready for you to use
Presentation on SlideShare
A book by Nathan Marz about Big Data. Not Storm-specific, but very much on topic
A little book titled “Getting Started with Storm” – slightly outdated, but nevertheless a good starting point
Blog post “A storm is coming” – very good overview of Storm
For those on the Amazon cloud, it is easy to deploy Storm with storm-deploy
In the next blog post I will show how to program a simple topology, and then we will look at it from a deployment perspective.