Of all the new terms big data has introduced to the language, few continue to cause as much confusion as “data lake.” What is it, and how does it fit into an organization’s data warehousing and analytics strategy?
The term “data lake” is actually a playful variation on data warehouse, a concept that goes back to the 1970s, but the metaphor works. When we think of a warehouse, we think of a large building filled with goods organized according to some sort of structured classification system. A data warehouse is the same idea applied to data. It’s a structured data store used to hold information for analytical purposes.
Most data warehouses are essentially large relational databases that contain extracts of production data, including historical information. This enables analysts to look at changes over time and to experiment with different views and slices of data. Whereas a production database processes transactions, a data warehouse enables exploration.
A data lake is the same concept, but the data is less structured. Data lakes grew out of the revolution in big data enabled by Hadoop, which is based on search engine technology. Unlike relational engines, which require highly structured data, Hadoop enables organizations to store all kinds of data in an unstructured or semi-structured format. The cost of that storage is a tiny fraction of the cost of keeping it in a relational database. This has enabled organizations to hold onto much larger quantities of data than they could before.
Data lakes can be very large and can contain a much greater variety of data than warehouses. Like their namesake bodies of water, they’re inherently unstructured. This is both their greatest appeal and their biggest drawback.
Like a relational database, a data warehouse is built upon labeled rows and columns. This structure – or schema – is put in place before the data is loaded, and changing it after the fact is laborious and expensive. The advantage of structured data is that operations run quickly because there is no ambiguity about what data is being processed. The disadvantages are reduced flexibility and the need to standardize data through an often labor-intensive cleansing and transformation process before loading it into the warehouse.
In a data lake, structure is typically provided in one of two ways: through a metadata store or through self-describing data formats. In the first approach, schemas for the data are kept in a metadata store such as the Hive Metastore used in Hadoop; SQL-based query tools read those schemas to make sense of otherwise unstructured files. (The Hive Metastore can describe data residing outside Apache Hive, so it isn’t limited to Hive itself.) In the second approach, file formats such as Apache Parquet and Apache ORC embed the schema in the file itself, so query tools can see the structure as they read the files.
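The self-describing idea can be sketched with a toy text format that carries its own schema. To be clear, real formats like Parquet and ORC are binary and columnar, and the file layout below is invented purely for illustration; only the principle – the reader learns the structure from the file itself – carries over.

```python
import io
import json

# Toy self-describing file: the first line carries the schema, the rest
# carry data. This is loosely analogous to how Parquet and ORC embed
# schema metadata in the file (the real formats are binary and columnar;
# this line-based layout is invented only to show the idea).
buf = io.StringIO()
buf.write(json.dumps({"columns": ["order_id", "amount"]}) + "\n")
buf.write("1001,42.50\n1002,17.25\n")
buf.seek(0)

schema = json.loads(buf.readline())           # a query tool reads the schema...
records = [dict(zip(schema["columns"], line.strip().split(",")))
           for line in buf]                   # ...then interprets the rows with it
print(records[0])
```

Because the schema travels with the data, any tool that understands the format can query the file without consulting an external catalog.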
The main point is that you can apply the schema after the data is loaded into the data lake. When using a metadata store, you create a schema and point it at the loaded file. For Parquet and ORC, you read a raw file, transform it into one of those formats, and add the schema in the process. This reversal of the load/modeling sequence compared to data warehouses means you can quickly test different schemas against the data in a data lake without worrying about reloading the data if you need to modify the schema.
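The metadata-store path – schema-on-read – can be sketched in a few lines of standard-library Python. The table name, column names, and `read_with_schema` helper below are all invented for illustration; they are not part of any real metastore API.

```python
import csv
import io

# Raw data lands in the lake as-is: no schema was applied at load time.
raw_file = io.StringIO("1001,2024-03-07,42.50\n1002,2024-03-08,17.25\n")

# Toy stand-in for a metadata store (e.g., the role the Hive Metastore
# plays): the schema lives *outside* the data and is applied at read time.
metadata_store = {
    "sales": {"columns": ["order_id", "order_date", "amount"],
              "types": [int, str, float]},
}

def read_with_schema(fileobj, table):
    """Apply a stored schema to raw rows at read time (schema-on-read)."""
    schema = metadata_store[table]
    for row in csv.reader(fileobj):
        yield {name: cast(value)
               for name, cast, value
               in zip(schema["columns"], schema["types"], row)}

rows = list(read_with_schema(raw_file, "sales"))
```

If the schema turns out to be wrong, you edit the `metadata_store` entry and re-read; the raw file in the lake is never reloaded.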
Many organizations use data lakes as a holding area for information that isn’t immediately put to use but may come in handy later. An example of this is archiving for regulatory data-retention requirements. This is one of their greatest virtues. The flexibility and low storage costs of data lakes enable organizations to keep a much larger volume and variety of potentially useful data than they could in traditional relational stores.
Data lakes have gotten some negative publicity from the disappointing results some early adopters experienced, but those results are usually a function of poor planning. Poor planning often includes choosing the wrong tools: traditional BI tools pointed at a data lake can quickly run into scale and performance problems. Without a disciplined, well-thought-out approach to deployment, a data lake can become nothing more than a messy file cabinet.
One of the most common mistakes people make with data lakes is to treat them as they would a warehouse. Dan Woods, CEO of Evolved Media and Chief Analyst of Early Adopter Research, distinguishes between “small data” and “big data” questions:
A small data question would be asking the total revenue for your Northeast Region. A big data question would be asking for detailed customer journey information on high-value customers who have contacted customer service over the past three years and who have churned.
The first question is easily answered by a data warehouse because the data is well-defined. The second is a better candidate for a data lake because the variables aren’t as clear and may include unstructured forms like email messages and audio clips. In fact, such a query would be nearly impossible to answer with a conventional data warehouse.
Data lakes don’t make data warehouses obsolete, but they have changed the way many organizations use them. Data warehouses will likely remain the best option for fast, interactive queries on recent data. And because large data warehouses can get expensive, strategies for data warehouse optimization (DWO) matter. DWO means using the data warehouse for the most important analytical tasks and placing less critical tasks on other technologies, such as the data lake. Resource-intensive extract/load/transform (ELT) jobs can be offloaded from the warehouse onto the lake, freeing warehouse resources for interactive analytical queries. DWO can also entail moving older or less frequently used data to a data lake, where BI-style analytics can be run at large scale and more cost-effectively.
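One common DWO policy – keep recent data in the warehouse, age older history out to the lake – amounts to a simple routing rule. The sketch below is a toy illustration only; the 90-day cutoff and the `storage_tier` function are invented examples, not a recommendation or any product's API.

```python
from datetime import date, timedelta

# Toy DWO policy: recent, frequently queried records stay in the
# warehouse; older history moves to the cheaper data lake. The 90-day
# cutoff is an arbitrary example chosen for illustration.
CUTOFF = timedelta(days=90)

def storage_tier(record_date, today=None):
    """Return which tier a record belongs in under the age-based policy."""
    today = today or date.today()
    return "warehouse" if today - record_date <= CUTOFF else "data_lake"

today = date(2024, 6, 1)
print(storage_tier(date(2024, 5, 20), today))  # recent record -> "warehouse"
print(storage_tier(date(2023, 1, 15), today))  # old history -> "data_lake"
```

In practice the rule might key on access frequency or business criticality rather than age alone, but the principle is the same: route each workload and dataset to the cheapest tier that still meets its performance needs.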
Analysts also may now start with a data lake to test assumptions on massive volumes of data, then extract and load the most useful data into a warehouse for decision-making. Like a real lake, data lakes can be messy under the surface, but they can delight and enlighten when the treasures lurking there are revealed.