April 11, 2018 - Dale Kim | Big Data Ecosystem

Can Your Data Lake Be Saved?

If you are a believer in Betteridge’s Law, then you’re in for a surprise (though perhaps I should have titled this post “Are Data Lakes Doomed to Fail?” so as not to contradict the law). Data lakes can, in fact, be saved. If you are reading this post because the title intrigued you, then I’ll boldly guess that one of the following is true:

  1. You’ve been wanting to build a data lake because you know the potential value it provides, but you’re also worried about the bad rap it’s been getting.
  2. You already have a data lake, but you feel like it has underperformed against your initial expectations.
  3. You have a successful data lake, and are wondering why others struggle with theirs.
  4. You don’t know what a data lake is, but you hoped that Betteridge’s Law foreshadowed an impending doom to some unknown entity.

I can’t help you if you fall into category 4, but those of you in categories 1, 2, or 3 might get some useful information based on recent work Arcadia Data did with Early Adopter Research. Their CEO and lead analyst, Dan Woods, embarked on a mission to help save the data lake, in which he documents all of the main reasons data lakes have failed, and shares some of the strategies for making data lakes successful. A small sampling of outputs from Dan’s research includes a white paper and a recent webinar (which also included our friends at MapR). I’d like to share a few snippets of information from that webinar.

A Data-native Approach Is the Key to Saving the Data Lake

One of the main points that Dan makes is that you should consider a data-native approach to your data lake. In a data-native approach, you analyze the data where it is, right inside the data lake; no extracts are required. This approach empowers business analysts who want to describe their data in a meaningful way (via semantic modeling) so that they, and other analysts, can more easily glean interesting insights from the data. In a data-native use of the data lake, you operationalize the results, meaning that you not only get the answer but can also explore it. You can view data across all endpoints and users to see what’s being used.

Why Have Data Lakes Failed?

No, according to Betteridge’s Law (which clearly doesn’t work with a non-polar question). I should note that an alternate answer was covered in the webinar, in which a few examples were pointed out. Bad planning, bad expectations, and bad use of tools all contribute in some way to failures such as data swamps, endless proofs-of-concept, and inaccessible data. It’s important to recognize that data lakes have failed not because the theory behind them is flawed, but because they are not the slam-dunk data solution that they were hyped to be.

Can data lakes be saved after they have failed? Fortunately, the answer is yes. There are four main methods for transforming your failed data lake into one that can successfully support a wide variety of use cases. These methods, as well as key challenges that data lake users may face, are covered in detail in the webinar.

It All Starts with the Right Architecture

You need the right architecture, one that can deliver data to analytics using all of the capabilities that you associate with a data lake. The architecture should support feeding the data into business processes and making those distilled packages of data available to the microservices and streams that support other applications.

In the webinar, Jack Norris of MapR discussed the data fabric architecture that MapR offers. This architecture gives you a repository on which you can put operationalizable analytics. The underlying data fabric provides the scale and reliability to support a broad set of applications.

The Right Architecture Includes Analytics

The key to any successful data lake is the ability to drive operational processes and transactional processes on the same infrastructure where the data already sits. Arcadia Data enables this by providing the data analytics software that runs within modern data platforms for the scale, performance, and security needed for real-time insights. Users can easily build visual analytics and dashboards directly within their big data environments. Arcadia Data Smart Acceleration enables business users to easily get fast responses for production applications, without forcing IT teams to go through laborious data/performance modeling exercises. The technology makes use of machine learning and algorithms to measure and recommend different ways to create query accelerators (known as “analytical views”).
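To make the idea of a query accelerator a bit more concrete, here is a minimal Python sketch of how one might spot candidates for pre-aggregated views from a query log. To be clear, this is not Arcadia Data’s implementation or API: the Query type, recommend_views, to_sql, and the min_hits threshold are all hypothetical names made up for illustration, and a plain frequency count stands in for whatever machine learning the product actually uses.

```python
from collections import Counter
from typing import NamedTuple


class Query(NamedTuple):
    """Hypothetical summary of one logged aggregate query."""
    table: str
    group_by: frozenset    # dimension columns, e.g. {"region"}
    aggregates: frozenset  # aggregate expressions, e.g. {"SUM(amount)"}


def recommend_views(query_log, min_hits=2):
    """Return (query shape, hit count) pairs seen at least min_hits times.

    Assumption: a shape that recurs often is a reasonable candidate
    for a pre-aggregated view. Results are ordered by hit count.
    """
    counts = Counter(query_log)
    return [(q, n) for q, n in counts.most_common() if n >= min_hits]


def to_sql(q):
    """Render the SELECT that a pre-aggregated view would materialize."""
    dims = ", ".join(sorted(q.group_by))
    aggs = ", ".join(sorted(q.aggregates))
    return f"SELECT {dims}, {aggs} FROM {q.table} GROUP BY {dims}"


if __name__ == "__main__":
    by_region = Query("sales", frozenset({"region"}), frozenset({"SUM(amount)"}))
    by_day = Query("sales", frozenset({"region", "day"}), frozenset({"COUNT(*)"}))
    log = [by_region, by_region, by_region, by_day]

    for shape, hits in recommend_views(log):
        print(hits, to_sql(shape))
```

With the sample log above, only the by-region shape clears the threshold, so it is the one suggested for pre-aggregation. A real recommender would presumably also weigh query cost, data size, and freshness requirements, none of which this sketch attempts.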

Data lakes started as a place for big data discovery and exploration for highly technical users. Over the years, data lakes have evolved to support a wide variety of business use cases, assisted by machine learning and artificial intelligence, visual analytics, and BI tools. However, many attempts at building data lakes have failed due to process and technology mismatches. Hopefully you now have a sense that data lakes can in fact be a valuable tool for your organization, so please check out our webinar to get more details on how you should handle your data lake moving forward.