April 4, 2018 - Steve Wooledge | Big Data Ecosystem

My 4 Key Takeaways on Data Lakes from the Gartner Data and Analytics Summit 2018

As in any hot technology market, people become either enamored of or confused by new terms and acronyms: AI/ML, GDPR, IoT, big data, data hubs, and of course, “data lakes.” A few weeks ago at the Gartner Data and Analytics Summit in Grapevine, TX, I attended 11 sessions, sat in on three analyst inquiries, and talked with dozens of customers and prospects. In this blog post, I’d like to share my key takeaways on data lakes.

  • 1. The “data lake” is a standard design pattern in today’s organizations for dealing with big data.

    I attended three different data lake sessions by Gartner analysts. A number of vendors have shifted to, or gone “all in” on, the data lake concept to describe the design pattern of collecting data “all day for all time” and refining it into useful information. Ted Friedman of Gartner had a nice slide in his session, “Data Hubs, Lakes, and Warehouses: Choosing the Core of Your Digital Platform,” that showed the most common types of questions and data addressed by data lakes versus data warehouses.

Gartner Event Presentation, Data Hubs, Lakes, and Warehouses: Choosing the Core of Your Digital Platform, Ted Friedman, Gartner Data & Analytics Summit, 5-8 March 2018, Grapevine, TX

    Given the hype, the scope of today’s data lake seems narrow, and the sense I got was one of muted excitement around data lakes, despite Gartner’s bullish predictions.

    In fact, Nick Heudecker’s presentation, “From Pointless to Profitable: Using Data Lakes for Sustainable Analytics Innovation” (my favorite title of the conference), underscores a growing sentiment I heard throughout the event and something we hear from our customers and partners: data lakes can be very profitable when designed and implemented well. From what I’ve seen from our customers with mature data lake environments, these organizations are dramatically reducing the costs of legacy systems while unlocking new insights from data with low “business value density” more quickly than was possible before. One of our customers presented recently at a session where he said they had worked with the business to identify one billion dollars in expected value from new use cases on their data lake environment.
  • 2. There are no silver bullets—data lakes must be governed like any other data platform.

    Andrew White’s session, “Adopt a Data Hub Strategy: Stop Blindly Integrating Data and Start With Governing It,” was another favorite of mine. It reminded me that concepts like data governance, master data management (MDM), and the people and processes that support them are required for successful programs.

    Whether you’re using Apache Hadoop and creating a data lake or trying to bring in streaming event data from connected devices in the field using Apache Kafka, it’s clear to me that there are no silver bullets. No magic technology can solve the business requirements around data management and governed analytics.

    A company that takes governed data lakes a step further, with a specific framework for deployment, is Zaloni (“The Data Lake Company”). I was impressed by Zaloni’s presentation, which went into much more depth and included specific client examples of how data lakes are deployed and governed. They have focused their entire company on building governed data lakes and consult with customers on deploying data lakes in “zones”: separate areas within the lake for different workloads, from a landing zone that holds data in its native state, to a “raw zone” for cleansing and transformation, to sandboxes, and on to refined, trusted zones for use cases such as providing a large number of users with standardized information and a single version of the truth (which sounds a lot like a data warehouse).
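A zone-based layout can be thought of as a promotion pipeline: data enters in its native state and is copied forward only after it passes the checks for the next zone. As a rough sketch (the zone names, directory layout, and one-zone-at-a-time rule here are illustrative assumptions, not Zaloni’s actual framework):

```python
from pathlib import Path
import shutil

# Illustrative zone names; real deployments vary.
ZONES = ["landing", "raw", "trusted", "refined"]

def init_lake(root: Path) -> None:
    """Create one directory per zone under the lake root."""
    for zone in ZONES:
        (root / zone).mkdir(parents=True, exist_ok=True)

def promote(root: Path, dataset: str, src: str, dst: str) -> Path:
    """Copy a dataset forward one zone, leaving the original in place
    so lineage back to the source data is preserved."""
    if ZONES.index(dst) != ZONES.index(src) + 1:
        raise ValueError(f"can only promote one zone at a time: {src} -> {dst}")
    src_file = root / src / dataset
    dst_file = root / dst / dataset
    shutil.copy(src_file, dst_file)
    return dst_file
```

In practice, each `promote` call would be gated by zone-specific checks (schema validation into the raw zone, cleansing rules into the trusted zone), and governance metadata would be recorded at each step.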

    So, which is it? Are data lakes only places for highly trained specialists like the data scientist (as Gartner points out), or can they be used for “trusted information” accessed by hundreds of people (as indicated in the talk from Zaloni)? That leads us to point #3:

  • 3. Data lakes are quickly evolving in definition AND capabilities.

    This makes sense—the whole premise of Apache Hadoop, which arguably ushered in the big data era, was to bring the processing to the data. And engineers have! The open nature of the Apache Software Foundation and modern data platforms enable a host of data processing engines (e.g., Apache Hive, Apache Impala, Apache Spark) to run in place on the nodes, where the data resides. This allows developers, data scientists, and data engineers to run data transformations/cleansing (i.e., ELT) and analytics processing (e.g., SQL workloads, procedural functions for machine learning and AI) natively within the cluster. Next-gen commercial software companies such as Trifacta, StreamSets, Waterline Data, and Arcadia Data take advantage of this open processing framework and run the processing as close to the data as possible without moving it.
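The “bring the processing to the data” idea can be sketched in miniature with plain Python standing in for a cluster (this is a toy model under stated assumptions, not how Spark or Hive is implemented): each partition is transformed where it lives, and only the small aggregated results move.

```python
# Toy model of in-place processing: each "node" holds a partition of the
# data and runs the transformation locally; only small per-node summaries
# are shipped back and merged -- the pattern engines like Spark, Hive,
# and Impala implement for real across a cluster.
partitions = {
    "node-1": [{"user": "a", "bytes": 120}, {"user": "b", "bytes": 300}],
    "node-2": [{"user": "a", "bytes": 80},  {"user": "c", "bytes": 50}],
}

def local_aggregate(rows):
    """Runs on each node, next to its partition of the data."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["bytes"]
    return totals

def merge(partials):
    """Only these small per-node summaries cross the network."""
    result = {}
    for partial in partials:
        for user, total in partial.items():
            result[user] = result.get(user, 0) + total
    return result

totals = merge(local_aggregate(rows) for rows in partitions.values())
print(totals)  # {'a': 200, 'b': 300, 'c': 50}
```

The contrast with traditional BI/ETL is that the raw rows never leave their nodes; only the (much smaller) aggregates do, which is what makes running ELT and SQL workloads natively in the cluster practical at scale.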

I am seeing organizations support more casual analytic users on data lakes, which was also evident from the customers and partners, such as Zaloni, whom I spoke with at the Gartner Summit. Gartner also referred to “enterprise data lakes,” which promise that “all of the disparate data sources in the enterprise will make its way into a single piece of infrastructure and serve all analytical needs in the enterprise.” Gartner noted two problems with enterprise data lakes, one of which was that it is “challenging to meet diverse needs for varying performance, governance, and security.” To resolve this, Gartner recommended that organizations “avoid ‘big bang’ scenarios,” “focus on creating business-unit specific data lakes,” and “optimize for relevant data types, data volumes and users.” I agree in particular with starting with a specific business area, which is exactly how I’ve seen successful projects conducted over the past 20 years in the analytics market, regardless of the technology.


Gartner Event Presentation, From Pointless to Profitable: Using Data Lakes for Sustainable Analytics Innovation, Nick Heudecker, Gartner Data & Analytics Summit, 5-8 March 2018, Grapevine, TX

Gartner also had the following recommendations:


You should not assume your existing BI skills and tools transfer to the data lake. Data warehouse-era BI tools weren’t designed for data lakes; a new architecture is required, which leads us to observation #4:

  • 4. Organizations are choosing a new analytic/BI standard for their data lake.

    Of course, people want to make the most of existing investments, but BI and analytics tools designed primarily for data warehouses and relational databases can’t take advantage of the scale and processing agility inherent in modern data platforms like the data lake. Sure, data scientists may not always embrace BI and analytics tools, but what if you want to enable citizen data scientists who are tired of coding and want a UI? What about the business analyst who sits in the department/line of business? Can you enable hundreds of them to get self-service insights from the data lake?

    It doesn’t even have to be a heavy data discovery use case. Consider a CISO or security analyst who wants visibility across all the endpoints, networks, and users in the enterprise’s systems. An analytic application that uses machine learning to detect anomalies and alert an analyst to suspicious activity should also allow that analyst to drill down to the details and inspect the underlying machine and network traffic in question, without waiting for data to be moved into a separate system or, even worse, hopping from one system to another trying to trace and analyze the connections. Why not give them a self-service security data lake analytic application so they don’t have to do “swivel chair analytics”? Information security is the domain of a customer I had dinner with last week, who has a 3,000-node data lake and simply couldn’t meet the needs of his security analyst team without a “native BI” approach.

    Modern BI and analytics designed for the data lake need to be “native” to big data to support a variety of users and use cases without moving data, without flattening complex data, and without requiring heavy data modeling in advance of the analytics/discovery phase. Without it, the data lake is destined to function as a staging/ELT/data scientist system of nominal value.

In this blog post, I’ve shared the four key takeaways on data lakes from my time at the Gartner Data and Analytics Summit. Here’s a question for you: what makes a native analytics and BI platform different from traditional BI tools? Lots. Check out the webinar, “Data Lakes Are Worth Saving,” to learn more.
