July 10, 2018 - Dale Kim | Big Data Ecosystem

What’s the Difference between Hadoop and a Data Lake

I recently participated in a webinar hosted by DBTA titled Unlocking the Power of the Data Lake, where one of the audience members asked, “Will data lakes be replacing Hadoop in the future?” I think the three speakers sufficiently answered the question on the webinar, but since many others might have similar questions, I thought it was worth addressing here as well.

That question assumes that Hadoop and data lakes fall into the same category of technologies. It actually is not an “apples to apples” comparison, but perhaps more like an “apples to apple pie” comparison. To put it simply, Hadoop is a technology that can be used to build data lakes. A data lake is an architecture, while Hadoop is a component of that architecture. In other words, Hadoop is the platform for data lakes. So the relationship is complementary, not competitive. For the foreseeable future, as data lakes continue to grow in popularity, so will Hadoop.

Perhaps another way to view the question is, “Will data lakes ditch Hadoop in favor of other technologies?” That might eventually happen, but I don’t think that is the key trend in the coming years. Instead, data lakes will incorporate other technologies into a comprehensive architectural stack with Hadoop at the core. That evolution is actually happening right now, which again is more complementary than competitive. For example, in addition to Hadoop, your data lake can include cloud object stores like Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for economical storage of large files. Or you might add Apache Kafka to manage real-time data. Or you can add a NoSQL database for transaction-oriented workloads in your data lake. Adding modern analytic storage engines like Apache Kudu makes sense for other types of large-scale analytic workloads. And last but not least, you should include search index stores that are best suited for textual data and other unstructured formats.
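To make that stack a little more concrete, here is a minimal sketch in Python of how the workload types above map to components. The workload labels, the `pick_component` function, and the idea of falling back to Hadoop as the default are illustrative assumptions for this post, not part of any product or API.

```python
# Hypothetical workload labels mapped to the kinds of components named above.
# The keys and the fallback behavior are assumptions made for illustration.
LAKE_COMPONENTS = {
    "large_files": "cloud object store (Amazon S3, ADLS)",
    "real_time": "Apache Kafka",
    "transactions": "NoSQL database",
    "large_scale_analytics": "Apache Kudu",
    "text_search": "search index store",
}

# Hadoop stays at the core of the stack, so it is the fallback here.
CORE_PLATFORM = "Hadoop"


def pick_component(workload: str) -> str:
    """Return the component suggested for a workload, or the core platform."""
    return LAKE_COMPONENTS.get(workload, CORE_PLATFORM)


if __name__ == "__main__":
    for workload in ("real_time", "text_search", "batch_etl"):
        print(workload, "->", pick_component(workload))
```

The point of the sketch is only that each component serves a specific workload while Hadoop remains the default, which is why the relationship is complementary.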

If this is beginning to sound very complicated, don’t worry. The paradox is that a data lake, despite its many potential disparate components, is a much better solution to handling many of today’s data requirements than traditional technologies like relational databases and data warehouses. With traditional technologies, you have to design a schema just to load the data. And as you add more data sources and formats, you have to do more modeling and remodeling. We all know this to be very time consuming, and we don’t have to bring that overhead to a data lake. Modern data platforms like Hadoop let you load your data immediately and then apply transformations as you explore and discover how you and your users want to use that data.
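The "load first, transform later" idea is often called schema-on-read. Here is a minimal sketch of it in plain Python, assuming made-up sample records and hypothetical helper names (`load_raw`, `read_with_schema`). It is not how Hadoop itself is implemented, just an illustration of the pattern.

```python
import json

# Raw records land in the lake exactly as produced; fields differ per source.
# These sample records are invented for illustration.
RAW_LINES = [
    '{"user": "ana", "amount": "19.99", "ts": "2018-07-01"}',
    '{"user": "ben", "amount": 5, "channel": "web"}',
    '{"user": "cho", "ts": "2018-07-03"}',
]


def load_raw(lines):
    """Load step: parse only. No schema is enforced and nothing is rejected."""
    return [json.loads(line) for line in lines]


def read_with_schema(records, schema):
    """Read step: project each record onto the schema chosen for one analysis.

    schema maps a field name to a (converter, default) pair.
    """
    rows = []
    for record in records:
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


records = load_raw(RAW_LINES)

# The schema is decided at analysis time, not at load time.
spend_schema = {"user": (str, ""), "amount": (float, 0.0)}
spend_rows = read_with_schema(records, spend_schema)
total_spend = sum(row["amount"] for row in spend_rows)
```

A different analysis, say one that cares about `channel` or `ts`, can apply its own schema to the same raw records without reloading or remodeling anything.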

I don’t mean to oversimplify the effort involved in deploying a successful data lake, so I should note it’s not as simple as loading data and running with it. In fact, I believe a lot of early data lake failures were due to the expectation that data lakes would make data management trivial. And while data lakes certainly do alleviate much of the pain of data warehouses, you don’t get that benefit for free. The key is to plan ahead and have clear objectives in mind so you don’t repeat the mistakes of the past. Over the years, challenges with data lakes were commonly blamed on security, data governance, and limited expertise. But one topic that hasn’t gotten enough attention, and is now becoming more top-of-mind, is how to let non-technical business users take advantage of the data lake. (We’re doing data lake research with The Eckerson Group to examine how organizations are doing with regard to opening up data lakes to business users, so feel free to fill out the survey to see how you compare.)

One of the commonly sought-after goals of data lakes is to help promote self-service from the perspective of the business analyst and business user. Unfortunately, this has not been realized universally because the focus with respect to data lakes has been on the data platform, and the data platform alone cannot drive self-service. Data lakes must incorporate other technologies that were designed for data lakes so that IT teams are freed from time-consuming tasks that can otherwise be automated, or at least handled entirely by business analysts.

This is where the traditional BI tools have fallen short when it comes to BI on Hadoop and data lakes. Most traditional BI tools treat the data lake like any other data store. And if you treat a data lake the same way you interface with a data warehouse, then you inherit all of the baggage of the data warehouse and gain very few of the advantages of the data lake. A class of technologies has emerged to solve the BI/Hadoop disconnect via a “middleware” approach, to assist in either query acceleration or query federation (or both), but those also fall short. That’s because you’ve now added more unnecessary complexity to the stack as well as more user interfaces. This means that very sophisticated users need to run the system, which takes away from the goal of achieving self-service BI for business users.

We, and our customers, believe that the right approach is having a complete visual analytics and BI stack that’s entirely designed for big data. This includes integrations with Hadoop, Spark, Kafka, cloud object stores, NoSQL, and search. It’s not enough to just have a generic connection to each of these sources, as a native approach ensures a more seamless experience when building semantic layers, creating dashboards, collaborating, optimizing for performance, and deploying applications to production.

There’s a lot to discuss about this “data-native approach” to BI on Hadoop and data lakes, so contact us and let us know if we can help. Or if you first want to get a feel for our visualization capabilities, download our free Arcadia Instant for browser-based analytics/BI on your desktop.
