Chapter 1:
A Brief Overview of the Big Data Ecosystem (Hadoop, Spark, and Beyond)

In with the New — and the Old, Too

The Hadoop ecosystem doesn’t behave like a rogue set of technologies. It has embraced both SQL, the global standard for querying traditional relational database management systems (RDBMS), and online analytical processing (OLAP), commonly used in multidimensional data analysis. In doing so, Hadoop attracted thousands of IT professionals trained in these traditional approaches.

As a result, demand for these approaches spawned its own sub-ecosystem of open source projects and startups, as well as support from industry-leading software vendors such as Oracle and Microsoft. At last count, there were over 20 different SQL-on-Hadoop offerings, and OLAP-on-Hadoop is quickly catching up.

Four of the better-known SQL-on-Hadoop offerings are:

Apache Hive: Apache Hive is widely regarded in the big data community as the first native SQL-on-Hadoop engine. It is mostly used alongside traditional BI tools for batch data preparation and ETL (extract, transform, load) processes. Hive is supported by every major Hadoop distribution and has a very active open source community under the Apache Software Foundation (ASF).

Apache Impala: Originally developed by Cloudera and then donated to the Apache Software Foundation, Apache Impala is a SQL-on-Hadoop engine that runs directly on top of a Hadoop installation. While Hive conducts its operations in batches, Impala is built for low-latency queries and shines in multi-user interactive BI and analytics operations.

Apache Drill: Apache Drill is recognized as the first distributed SQL query engine to incorporate a schema-free JSON (JavaScript Object Notation) object model. Drill discovers its schema dynamically as data is read (“schema on read”), whereas other engines require a predefined schema. It also has its own execution engine, which includes in-memory processing for fast, interactive ad hoc querying.

Spark SQL: Spark SQL is a key component of Apache Spark. Spark SQL introduced a data abstraction called DataFrames, which supports both structured and semi-structured data. It provides a domain-specific language (DSL) for manipulating DataFrames in Scala, Java, or Python.
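
To make the “schema on read” idea mentioned under Apache Drill more concrete, here is a minimal sketch in plain Python. It does not use Drill or any of its APIs. The sample records and the helper names infer_schema and select are hypothetical, introduced only for illustration. The point is that no schema is declared up front: field names and types are discovered while the JSON records are being read, and records are free to omit fields.

```python
import json

# Hypothetical sample data: newline-delimited JSON with no declared schema.
RAW_RECORDS = """\
{"id": 1, "name": "Ada", "city": "London"}
{"id": 2, "name": "Lin", "tags": ["admin", "ops"]}
{"id": 3, "name": "Sam", "city": null, "score": 9.5}
"""


def infer_schema(lines):
    """Scan records at read time and collect the value types seen per field."""
    schema = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def select(lines, *fields):
    """Project the requested fields; a field missing from a record yields None."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            yield tuple(record.get(field) for field in fields)


if __name__ == "__main__":
    lines = RAW_RECORDS.splitlines()
    for field, types in sorted(infer_schema(lines).items()):
        print(field, sorted(types))
    print(list(select(lines, "name", "city")))
```

A schema-on-write system would instead reject the second and third records unless the table definition had been altered first. A real engine does far more (type reconciliation, nested data, distributed execution), so this should be read only as an illustration of the principle.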

By supporting these traditional approaches, Hadoop offers scalability and flexibility, along with the potential to empower business analysts and other non-IT users to reap the benefits commonly associated with big data analytics.

Make Way for Spark

The big data ecosystem is certainly not limited to Hadoop. As of 2016, Apache Spark is one of the most active open source projects in big data. Spark has captured the attention and imagination of data scientists and, increasingly, business analysts. It is a fast, general-purpose processing engine that is compatible with Hadoop and can access data sources including the Hadoop Distributed File System (HDFS) and Apache HBase. Many industry analysts believe that, over the long term, Spark may be a leading candidate, if not the leader, in providing the most user-friendly, fastest processing solution for advanced big data analytics.

According to the official Spark project page, more than 1,000 organizations use Spark in production environments, most notably Amazon, eBay, and Pinterest. Many of these organizations run Spark on massive clusters spanning thousands of nodes to perform both ETL and data analysis on multi-petabyte data stores.

While traditional MapReduce writes intermediate results to disk between processing stages, Spark can keep working data in memory. This is one key reason why Spark advocates are able to claim it runs up to 100 times faster than MapReduce for some workloads.

Other reasons for its wide-scale adoption include:

Strong language support: Spark has APIs for three of the most popular programming languages used by open source and commercial software developers (Java, Python, Scala), as well as for SQL and R, both of which are heavily used by data scientists and business analysts.

Multiple deployment options: Spark can be deployed on-premises with a storage engine such as HDFS, as well as in the cloud (public, private, or hybrid). Spark provides an interactive shell and can also run in batch mode, so it can be used in almost any setting.

Advanced data operations: Spark not only offers the “map” and “reduce” functions inherent in the MapReduce framework, but also adds support for operations commonly available in databases, such as “filter,” “join,” and “group by.” With such operations, a word count function can be written in Spark in about 4 lines of code, versus roughly 100 in MapReduce.

Graph data support: Spark natively handles graph data (i.e., sets of nodes/vertices/points connected by edges) through its GraphX component. When used in conjunction with data stored as rows and columns, this makes it possible to quickly analyze relationships between entities.
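
The word count mentioned under “Advanced data operations” can be sketched as follows. This is plain Python using only the standard library, not the Spark API, so it runs without a cluster. It mirrors the flatMap, map, and reduceByKey shape of the classic Spark word count, and the function name word_count is an assumption made for illustration. The roughly equivalent PySpark pipeline is shown as a comment only and is not executed here.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

# For comparison only (not run here): with an existing SparkContext `sc`
# and a hypothetical input `path`, the PySpark version is roughly
#   sc.textFile(path).flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # "flatMap": split every line into words and flatten into one stream.
    words = chain.from_iterable(line.lower().split() for line in lines)
    # "map": emit a (word, 1) pair for each word.
    pairs = ((word, 1) for word in words)

    # "reduceByKey": add up the counts for each distinct word.
    def add_pair(counts, pair):
        word, n = pair
        counts[word] += n
        return counts

    return dict(reduce(add_pair, pairs, defaultdict(int)))


if __name__ == "__main__":
    print(word_count(["to be or not to be"]))
```

Unlike this single-process sketch, Spark distributes each step across the nodes of a cluster, which is where the brevity of the API pays off on large data sets.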

Whether the Hadoop ecosystem can continue to evolve to enable greater adoption and use by line-of-business managers and other non-IT staff remains to be seen, but the traction so far is very promising. For now, Spark’s greatest appeal is to data scientists, for whom the framework can offer considerable productivity gains in the development of big data analytics solutions.