How Apache Spark Integrates with Apache Hadoop

Published on March 21, 2017

The idea that “big data is the new oil” has been around for more than a decade, but business leaders are just discovering how true that is in unexpected ways. Unrefined data is a sticky mess and can be a liability. However, those who know how to process data with the right tools, such as Apache Hadoop and Apache Spark, have found that processed data is an invaluable driving force for outperforming the market.

Perhaps the biggest difference from oil is that the supply of data is effectively infinite. To manage the volume and velocity unleashed when a company "strikes data," Apache Hadoop has been directing the flow of large data sets for more than a decade. Apache Spark, which became a top-level Apache project in 2014, joined the ecosystem with an emphasis on real-time analysis of live streams and machine learning. Apache Spark and Apache Hadoop perform different but complementary functions, and both are critical in a world that runs on data.

Here’s an overview of what each one is, what it does with data, and how they work together to turn crude data into enterprise “rocket fuel.”

What Is Apache Spark?

Apache defines Spark as "a fast and general engine for large-scale data processing." To be a bit more specific, it is an open source, distributed cluster-computing engine that can handle many different data types and data sources at the same time. Companies like Netflix, eBay, and Yahoo have deployed Apache Spark to process petabytes of data quickly.
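To make that concrete, here is a minimal PySpark sketch of the classic distributed word count. It assumes a working Spark 2.x installation; the application name and HDFS path are placeholders:

    from pyspark.sql import SparkSession

    # The SparkSession is the entry point to Spark since version 2.0.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read a text file into an RDD; the path is a placeholder.
    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")

    # Split lines into words, pair each word with 1, and sum the pairs.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))
    spark.stop()

The same job runs unchanged on a laptop or across a large cluster; only the master configuration differs.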

Comparison to Apache Hadoop

The first statistic many people hear about Apache Spark is that it can process data 100 times faster than Apache Hadoop. However, that comparison is misleading. Apache Hadoop is a much larger ecosystem of multiple Apache projects; what was actually being compared was Apache Spark and Hadoop MapReduce, a distributed batch processing framework designed to complete very large jobs reliably – more on this below. In 2016, Apache Spark won the CloudSort Benchmark by sorting 100 terabytes of data for $144 worth of cloud resources.

Apache Hadoop (i.e., MapReduce, Apache Hive, etc.) also processes data, but the two diverge in many areas. For example, Apache Spark doesn't have its own storage engine for on-premises deployments; it must sit on top of a read-write storage platform like the Hadoop Distributed File System (HDFS). The core Hadoop projects, in addition to HDFS, consist of:

  • Hadoop Common — a collection of utilities, libraries, and modules
  • Hadoop YARN (Yet Another Resource Negotiator) — a node manager agent for balancing resources and workloads on clusters
  • Hadoop MapReduce — a parallel processing framework for running static batch jobs. Before YARN came along, MapReduce was also the core engine for scheduling, monitoring, and restarting tasks. In Hadoop MapReduce, multiple MapReduce jobs are strung together to create a data pipeline, and the data is written to disk at the end of each stage and read back from disk at the start of the next. All of that disk I/O makes the pipeline inefficient. Spark is based on the same MapReduce model, but it saves time by keeping intermediate data in memory between stages, as the sketch after this list illustrates.
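To make the contrast concrete, here is a small PySpark sketch. The call to cache() keeps the parsed records in cluster memory so that both downstream computations reuse them; this is exactly the point where a chain of MapReduce jobs would write to and re-read from HDFS. The path and field layout are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("InMemoryPipeline").getOrCreate()
    sc = spark.sparkContext

    # Stage 1: parse raw comma-separated records (placeholder path).
    records = sc.textFile("hdfs:///logs/events.txt").map(lambda l: l.split(","))

    # Keep the parsed data in memory so later jobs skip the disk round-trip.
    records.cache()

    # Stages 2a and 2b: two computations over the same cached data.
    total = records.count()
    errors = records.filter(lambda fields: fields[0] == "ERROR").count()
    print(total, errors)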

Cloudera offers plenty of good examples of how to access various Hadoop ecosystem components from Spark, such as reading HBase data into Spark or querying Hive tables from Spark. Another place to look for inspiration is this list of Spark integrations with MapR, including batch applications, ETL data pipelines, and advanced analytics.
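As a taste of what those integrations look like, here is a hedged sketch of querying a Hive table from Spark. It assumes a Spark deployment with Hive support and a reachable Hive metastore; the table name web_logs is hypothetical:

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read tables from the Hive metastore.
    spark = (SparkSession.builder
             .appName("HiveFromSpark")
             .enableHiveSupport()
             .getOrCreate())

    # Run ordinary SQL against a (hypothetical) Hive table.
    df = spark.sql("SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
    df.show()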

A 2013 survey by IDC found that 32 percent of enterprises had already adopted Apache Hadoop and another 31 percent planned to deploy it within the coming year. You can view a detailed list of current companies utilizing Apache Hadoop here.

3 Ways Apache Spark Works With Apache Hadoop

There are three main approaches to integrating Apache Spark with the Apache Hadoop ecosystem:

  1. Independence — The two can run separate jobs based on business priorities, with Apache Spark pulling data from HDFS. Due to its simplicity, this is a very common setup.
  2. Speed — If users already have Hadoop YARN running, Spark can run on YARN in place of MapReduce, and its in-memory execution processes data from HDFS much faster. This is particularly valuable for applications with machine learning requirements and similar AI projects.
  3. Accessibility — Imagine outputting data from a Spark Streaming job and visualizing it in Arcadia Data; take a look at how that would work with data from a Connected Car. End users see real-time updates of information and can drill to detail in a single view. A minimal streaming sketch follows this list.
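For a flavor of what such a streaming job looks like, here is a minimal Structured Streaming sketch (available in Spark 2.x). The socket source and console sink are toy stand-ins: a real connected-car feed would more likely arrive via Kafka and land in a sink that a visualization tool such as Arcadia Data could query.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StreamingCounts").getOrCreate()

    # Toy source: read lines from a local socket (run `nc -lk 9999` to feed it).
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Count occurrences of each distinct line as the stream arrives.
    counts = lines.groupBy("value").count()

    # Toy sink: continuously print the updated counts to the console.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()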

Apache Spark vs. Apache Hadoop

Although the consensus is that these two work well together as part of the same ecosystem, some companies do run one without the other. As mentioned above, a key difference is that Apache Spark requires an external storage layer. For those who aren't going to use HDFS, other options include Amazon S3, Apache's NoSQL database Cassandra, or MapR-FS.
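For example, with the hadoop-aws module on the classpath and AWS credentials configured, Spark can read directly from Amazon S3 instead of HDFS; the bucket and file names below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("S3Example").getOrCreate()

    # The s3a:// scheme comes from the hadoop-aws module; bucket and
    # path are placeholders for illustration.
    df = spark.read.csv("s3a://my-bucket/data/events.csv", header=True)
    df.printSchema()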

Another way to look at it is that Apache Hadoop is made for handling massive data collections distributed across thousands of nodes. Because it scales horizontally on commodity hardware, with storage and compute co-located on the same cluster, a business doesn't need to invest in specialized high-end servers. Apache Hadoop is often used for predictive modeling because of its ability to combine inputs from so many different variables and data sources.

Spark, by contrast, is all about increasing the speed of analysis in real time, not storage. Common use cases for Spark include interactive marketing campaigns, managing online recommendations as the user shops, keeping a tighter watch on cybersecurity threats, and monitoring logs to catch exceptions, as sketched below.
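The log-monitoring case is a classic starter example. Here is a minimal sketch that filters log lines for exceptions; the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LogMonitor").getOrCreate()

    # Placeholder path for wherever application logs land in HDFS.
    logs = spark.sparkContext.textFile("hdfs:///logs/app/*.log")

    # Keep only lines that record an exception, then inspect a sample.
    exceptions = logs.filter(lambda line: "Exception" in line)
    print(exceptions.count(), "exception lines found")
    for line in exceptions.take(5):
        print(line)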

What’s Next for Apache Hadoop and Spark

The close integration of these technologies has suggested to some analysts that Apache Spark will eventually replace MapReduce as Apache Hadoop's main processing engine. Eli Collins, Chief Technologist at Cloudera, said, "We see Spark as the future of Hadoop. Spark today is an integrated part of the platform but how do you go from making it … one that's great for specific use cases to being the default engine, not just for MapReduce core workloads but also for partner products."

Because Spark is open source, developers all over the world are working on it every day. Cloudera alone has contributed more than 43,000 lines of code to Spark over the years.

Learn more about how Apache Spark works with Arcadia.


Category: Beginner, How To