Spending on big data infrastructure can only be justified if it brings value to an organization. A component of that value proposition is big data analytics. Simply storing the data is not enough. We also want to decipher the messages contained within the data. Which customers are likely to buy what products? How long will a particular product survive in the market? Where should I open my next retail store? Which insurance claim is likely to be fraudulent? Which DNA mutation is responsible for a particular disease? Which sensor signals are indicators of imminent equipment failure?

What makes an analytic tool successful nowadays is the size of the community behind it. In this regard, open-source has become a must in the modern big data analytics ecosystem.

The recent Spark Summit 2014 conference illustrates this point. All sessions were streamed live, free of charge. From a traditional business point of view, this may appear counterintuitive. Wouldn’t free live streaming negatively affect attendance? Actually the opposite was true. Spark Summit 2014 was sold out. The reason? By being open, it got big.

Hadoop is a prime example of a successful open-source endeavor in the big data area. Hadoop has two main components: HDFS (Hadoop Distributed File System) and MapReduce, its programming framework. While HDFS has been an astounding success, reception of MapReduce has been somewhat mixed. MapReduce was conceived as a batch processing framework, and as more and more users attempted to use it in real-time applications, its inherent latency became an issue.

Enter Spark, a second-generation big data programming framework. On Apache Spark’s website, its header says it all: Lightning-fast cluster computing.

Spark was designed to be as close to real time as possible. It offers advanced execution graphs with in-memory pipelining to speed up end-to-end application performance. It’s essentially a MapReduce on steroids, and users have much better control over when to use disk operations and when to cache intermediate results in memory. In fact, given this new kid on the block, it is hard to see how anyone would continue to use MapReduce. According to Cloudera Co-founder and Chief Strategy Officer Mike Olson, Spark is “light years better” than MapReduce, and he thinks that “it will succeed MapReduce in most instances.”
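To make the pipelining idea concrete, here is a minimal sketch in plain Python. It does not use Spark at all, and the function names (staged, pipelined) are hypothetical, chosen only for illustration. The staged version keeps the full output of every step before the next step starts, roughly the way a chain of batch jobs hands results from one job to the next. The pipelined version streams each record through all the steps and never stores an intermediate result.

```python
def staged(records):
    """Run each step to completion and keep its whole output before moving on."""
    doubled = [r * 2 for r in records]            # intermediate result, fully stored
    kept = [r for r in doubled if r % 3 == 0]     # second intermediate result, fully stored
    stored_records = len(doubled) + len(kept)     # how many intermediate records were held
    return sum(kept), stored_records


def pipelined(records):
    """Stream each record through every step; nothing is stored between steps."""
    doubled = (r * 2 for r in records)            # generator: no work happens yet
    kept = (r for r in doubled if r % 3 == 0)     # still no work happens
    return sum(kept), 0                           # sum() pulls records through one at a time


if __name__ == "__main__":
    print(staged(range(1, 11)))
    print(pipelined(range(1, 11)))
```

Both functions return the same total. The difference is only in how much intermediate data has to be held, which is the cost that pipelining is meant to avoid.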

Just how fast is Spark compared to MapReduce? Spark claims to be up to 100x faster in memory and 10x faster on disk. That should provide some perspective on why many developers are joining the Spark camp. As for installation and usage, Spark is easy to install and comes with an interactive shell that lets novice users start exploring its features right away. Best of all, Spark supports three programming languages: Scala, Java and Python. The inclusion of Python opens access to a wealth of algorithmic libraries such as NumPy, SciPy, SciKit, etc.

The secret of Spark is that it is a distributed computational model based on what are called Resilient Distributed Datasets (RDDs). These datasets hold intermediate computational results and can be stored on disk across the cluster or cached in the RAM of each worker node. The use of lazy evaluation and pipelining means that many processes can be executed concurrently, without the need to save intermediate results for the next step. If a worker node fails, only the affected RDDs are rebuilt. Spark has an advanced execution graph scheduler with built-in lazy evaluation to take maximum advantage of cluster parallelism and in-memory caching, and Spark can automatically adjust its memory caching. The more RAM available, the faster the execution.
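The behavior described above can be sketched with a toy model in plain Python. This is not Spark's API: the ToyRDD class and its evict method are hypothetical, and everything runs in a single process. The sketch only illustrates three ideas: transformations are recorded rather than executed (lazy evaluation), a cached dataset is computed once and reused, and a dataset whose cached copy is lost can be rebuilt from the record of how it was derived.

```python
class ToyRDD:
    """Toy stand-in for an RDD (NOT Spark's API): it remembers how to compute
    itself from its parent instead of computing eagerly."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source       # base data, set only on the root dataset
        self._parent = parent       # dataset this one is derived from
        self._fn = fn               # transformation applied to the parent's rows
        self._want_cache = False    # set by cache(); honored on the next action
        self._cached = None         # in-memory copy of the computed rows
        self.computations = 0       # how many times this dataset was computed

    def map(self, f):
        # Lazy: only records the transformation.
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        # Lazy: only records the transformation.
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._want_cache = True
        return self

    def evict(self):
        # Hypothetical helper: simulates losing the in-memory copy,
        # for example when a worker node fails.
        self._cached = None

    def collect(self):
        # Action: this is the point where work actually happens.
        if self._cached is not None:
            return list(self._cached)
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent.collect())
        self.computations += 1
        if self._want_cache:
            self._cached = rows
        return list(rows)


if __name__ == "__main__":
    squares = ToyRDD(source=range(10)).map(lambda x: x * x).cache()
    evens = squares.filter(lambda x: x % 2 == 0)
    print(squares.computations)   # nothing has been computed yet
    print(evens.collect())        # first action computes squares and caches it
    evens.collect()               # squares is served from the cache
    squares.evict()               # simulate losing the cached copy
    evens.collect()               # squares is rebuilt from its recorded lineage
    print(squares.computations)
```

Real RDDs are partitioned across worker nodes, so only the lost partitions are recomputed. The toy version rebuilds the whole dataset because it has a single partition.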

Notable Spark Utilities

Shark: Like animal names for your software? Spark’s got them. Under Spark, the SQL-like data warehouse infrastructure known as Hive becomes Shark, or “Spark + Hive = Shark,” if you really need an equation. Although Pig Latin is still spoken in the Spark world as a programming language for direct file query, the port of Pig over to Spark is known as Spork, standing for “Spark + Pork.” As for whether a land animal (Impala) or a sea animal (Shark) is faster, opinions differ. Impala is a dedicated, open-source, non-MapReduce data query engine first introduced by Cloudera. Depending on additional optimizations such as the use of a columnar file format, the sophistication of the schema, the complexity of the query, etc., both Impala and Shark have claimed to be faster than the other. Impala, as a framework dedicated to serving data queries, certainly has the potential to remain the leader in this field. But it will be interesting to follow this land-sea competition.

MLlib: It’s not a typo. The first L is capitalized, while the second is lowercase. MLlib stands for Machine Learning Library. It’s a collection of algorithms for machine learning, the current reincarnation of “artificial intelligence.” This is a package of functions that helps data scientists develop computer models to accomplish tasks such as product recommendation, credit card fraud detection, business operation optimization, etc. MLlib has a respectable collection of algorithms and more are coming. For instance, Random Forest is being incorporated into the library at the time of writing.

Cassandra: Apache’s famous NoSQL database is now being integrated with Spark through a joint effort of two companies: DataStax and Databricks.

Genomics: The ADAM (Data Alignment/Map) platform is being developed by UC Berkeley’s AMPLab as a DNA sequencing engine that uses an efficient columnar storage format known as Parquet. Expect to see a production-quality release in late 2014.

SparkR: Spark’s functionality is now accessible from R.

Development of Spark is intense, and this Apache project has long surpassed Hadoop’s HDFS and MapReduce to become one of the most active Apache projects. It’s an exciting time to explore this technology and make it part of your big data environment.
