Chapter 1: A Brief Overview of the Big Data Ecosystem (Hadoop, Spark, and Beyond)

As mentioned in the introduction, big data offers the greatest opportunity for organizations of all sizes to truly distinguish themselves and forge real competitive advantage. For example, the relative success of one company’s marketing program versus that of a competitor may boil down to which of the two does the best job of leveraging big data analytics on sentiment data scraped from social media feeds on Facebook or Twitter. That is very different from analyzing data warehouse data related to marketing campaigns conducted in the past. In short, big data can be used to glimpse the future of what consumers are likely to do, not just what they have already done.

The key, as will be shown later in this book, is to deploy the analytics tools that enable the enterprise to explore and exploit big data.

The Big Data Ecosystem Starts with Apache Hadoop

According to Alexa Internet, a leading commercial web traffic and analytics company, as of March 2017, three of the most commonly visited websites in the United States are Amazon, Facebook, and LinkedIn. Despite their different missions, one thing that each of these organizations have in common besides their phenomenal success is that they operate and maintain some of the biggest Hadoop clusters in the world. Since publicly launching in 2006, this Java-based framework has extended far beyond its open source / search engine roots to become the premier platform for aggregating, storing, and processing extremely large data sets in distributed environments using commodity hardware.

Ten years later, Hadoop adoption and expansion has continued at a dramatic pace. In fact, leading experts believe that Hadoop will grow at a compound annual growth rate of 59% through 2020. This more than doubling every two years mirrors IDC’s predictions of growth in overall data volume. In fact, Forrester Research predicted that eventually all enterprises will adopt and deploy Hadoop somewhere in their organization. A client survey by Gartner in September 2016 shows that 73% of organizations either have or plan to invest in big data, and that number increases to 86% for large enterprises.

Hadoop’s core strength is capturing and storing multiple data types (unstructured, semi-structured, and structured) in almost limitless amounts, while offering a comprehensive framework for big data analytics.

Apache Hadoop Has Emerged as the De Facto Standard for Data Lakes


Hadoop is an ultra-scalable platform that was designed to exploit the collective power of hundreds if not thousands of clustered computing nodes. Like other successful open source projects such as Linux, Hadoop has an active community and has attracted the attention of open source vendors as well as legacy players offering their monetization strategies.

The source of Hadoop’s capability to store and process enormous data sets is a robust programming model that controls the commodity hardware-based computing nodes. Hadoop runs multiple data nodes and distributes workloads across a typical deployment using the MapReduce engine as one compute option. Hadoop and its ecosystem are highly fault-tolerant because of the built-in redundancy capabilities to prevent data loss should a node fail. Many other processing engines such as Apache Spark have increased in popularity for in-memory performance considerations, and can be resource managed using Apache YARN, adding tremendous flexibility to the big data ecosystem.

Get the PDF version for easy access to read offline or print.