As I wrote a few posts ago, I’m playing a fast game of catch-up in the big data technology arena. My technical work didn’t intersect with big data until recently, and the landscape of tools is sprawling – Sqoop, Hive, Pig, HDFS, HBase, Cassandra, Flume, Impala, Spark, etc. Somewhat comically, ZooKeeper is one tool used to keep all these “animals” under control. But while rummaging around in the zoo, we stumbled upon one of the most compelling technological advances to hit mainstream open source software in several years – Apache Spark. It works with big data, runs in large clusters, has query tools (Spark SQL), machine learning (MLlib), and stream processing (Spark Streaming), and is considered part of the big data tool set. But it enables new capabilities beyond just bigger, faster, and more distributed.
Spark allows rapid integration and interactive analysis of big data, work that was previously confined to longer-running batch jobs on top of the traditional (how many years are required to create tradition?) big data technology stack. Now, without derailing their train of thought, the data scientist, marketer, or business analyst can iterate through a series of related questions, quickly receiving answers and changing their approach as the data starts to make sense to them. Spark provides access to lots of data, and the processing of that data, now – not in 12- to 24-hour batch job cycles.
Programming is required – in Java, Python, or Scala (which all the cool kids are doing). This is great for development teams. But it’s only a matter of time before “value added” providers wrap this and make big, fast analytics available through a more declarative interface, and in applications where the analytics underneath may not even be obvious from the user interface. Big data platform vendors such as Cloudera and Hortonworks are bundling Spark with their platforms. Other companies such as Zoomdata are using it to speed up their analytics. It seems that everyone I talk to in the industry is using Spark. It’s solving real problems – big problems – and that’s creating a snowball effect that we don’t see very often. As more and more companies adopt Spark, they contribute developers. The growing developer community creates more features. More features attract more companies. And the cycle continues.
The first time I sat down to build an iOS application, I was impressed in a way that I hadn’t experienced before – the possibilities were enormous. And we’ve seen that play out in the mobile software world. Similarly, Spark expands the realm of possibility for big data. This is going to be exciting.