In the recent TDWI webinar, Data Management for Big Data, Hadoop, and Data Lakes, TDWI Senior Research Director for Data Management, Philip Russom, joined representatives from Arcadia Data, HVR, Panoply, and Paxata to discuss trends in data lakes, Hadoop, and big data. They described the impact of these modern trends on traditional data ecosystems and added real-world examples of what successful organizations are doing today.
Philip kicked off the webinar by discussing the main trends around Hadoop and data lakes. Data itself is evolving: sources and structures are becoming more diverse, with machine data coming in from sensors, handheld devices, delivery trucks, etc. Because of this diversity, businesses are revising their practices and adopting new tools and platforms. They are doing a lot more data management “on-the-fly” as they access data for the first time. This is markedly different from what they’ve done in the past, which involved the time-intensive task of preparing data for data warehouse storage. In addition, the business use of data is changing, and organizations are diversifying into more analytics, real-time business monitoring, and business transformation.
A 2016 TDWI Hadoop survey found that 20% of data warehouse programs surveyed had Hadoop clusters in production. Note that this number only relates to data warehousing, not the rest of the enterprise. Hadoop is being used primarily to extend data warehouses and to provide both storage and computing power for advanced analytics. The survey also revealed an increase in the use of data lakes, data archiving, marketing data, and a wide variety of operational data on Hadoop. Organizations ran Hadoop on-premises behind the enterprise firewall, in the cloud, or in a combination of both.
A data lake is not a platform you would buy from a vendor or get through open source. Rather, the data lake is a method for organizing large volumes of data. You can’t just quickly create a database; you have to decide page sizes, interfaces, volumes, etc. The data lake helps you create structures for collecting and combining data at very large scale. A data lake can be built on top of Hadoop or on a relational database, or both. Any of these combinations can involve the cloud. We also see many corporate marketing departments setting up data lakes; multi-channel marketing is an early adopter of the data lake.
The real promise of embracing big data, analytics, and data lakes is that data that’s new to your organization always brings the prospect of new insights. For example, if your company developed a new smartphone app that helps your customers manage their accounts, that app also generates behavioral data that can tell you a lot about your customers. When you have new data sources, you broaden the range of visibility into your operations, your customers, your partners, etc.
Many of you have older applications for fraud, risk calculations, or customer segmentation. A lot of those types of analytics benefit from a much larger data sample. Bringing in big data helps you extend those samples for greater granular accuracy. Many organizations want to have self-service access to the new data and do self-service analytics and visualization with it. As you think about how you’re going to embrace this new data, don’t forget that you most likely will have business people who want self-service access, so you’ll need to give them tools that are really attuned for self-service.
Finally, you may be dealing with a lot of streaming, real-time data sources. This is helpful for those of you who want to do real-time business monitoring. You may want to have management dashboards, and those will have metrics that some managers need continuously refreshed throughout the business day. A data lake on Hadoop is one way to capture that data so that you can do more frequent refreshes of these sorts of management-driven data delivery products.
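As a rough illustration of a continuously refreshed dashboard metric, the sketch below counts events in a sliding time window. The `RollingCount` class and its parameters are hypothetical names chosen for this example, and it assumes timestamps arrive in order; it is not taken from the webinar or from any specific product.

```python
from collections import deque


class RollingCount:
    """Count events seen within the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes timestamps arrive in order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Record one incoming event from the stream.
        self.timestamps.append(timestamp)

    def value(self, now):
        # Drop events that have aged out of the window, then report the count.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


# Usage: three events arrive, and the dashboard refreshes at two later times.
metric = RollingCount(window_seconds=60)
for ts in (0, 30, 70):
    metric.add(ts)
count_at_80 = metric.value(now=80)
count_at_95 = metric.value(now=95)
```

Each call to `value` is cheap because expired events are discarded as the window moves, which is what makes frequent refreshes practical.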
Trends Affecting Data Lake Management
Let’s first review just a few of the many trends that affect how you pursue and implement a data lake:
- Data is becoming more dynamic in terms of how we define it. This is causing organizations to struggle with these static data pipelines that they’ve had in place for a long time. In the past, it could take six months and cost upwards of $1 million to change the schema of a report. Your data lake needs to be designed to use numerous distinct schemas, as well as “schemaless” data.
- New formats like JSON allow self-description to be present in the data. These popular new formats are designed for machine readability at a very large scale. You need to be able to serve up insights from that data. This can apply to a lot of different applications across industries such as product analytics from an IoT perspective, streaming analytic data from set-top boxes in an advertising context, trade surveillance in financial services, etc.
- Organizations want access to the data in real time. With BI applications, we need to be able to read data in real time and allow people to explore it. Real-time analytics are a big driver for competitive advantage, and businesses need technologies that can enable immediate insights on a wide variety of data types.
- Technologies have evolved so that customers can now see how they can get value out of their data. We no longer have technological constraints that cap the volume of data that can be collected or that artificially restrict how close to real time the data can be. We now have technologies that can capture and manage data at scale, and we can start combining them to drive business value.
- The only constant is change. As companies build out their data lakes, they are not only adding new sources and data types, but also changing technologies and shifting from one deployment model to another. The option of moving from on-prem to cloud-based storage is just one example of a changing deployment model, and this also requires the ability to support multi-vendor architectures. Businesses need to be open to continually exploring new innovations in the market.
- Organizations are turning to platforms that reduce administrative overhead. Managing a complex data infrastructure to handle big data requirements can be a challenge. Many organizations have migrated to a cloud or hybrid infrastructure as a start. Going further to reduce administrative complexity, more and more companies are turning to self-managing, optimized platforms to more easily maintain and scale their infrastructure.
- Broader groups of end users are adopting self-service tools so that they can turn raw data into useful information. There is a push towards democratizing information across not just the data scientist, developer, and engineering community, but also the average business user inside an organization. Unfortunately, earlier generations of self-service tools can’t be used to connect to large data lakes. Businesses need to explore the latest innovations in self-service BI to get the performance, scale, and granularity they seek.
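To make the points about distinct schemas and self-describing JSON more concrete, here is a minimal sketch of reading mixed records without declaring a schema up front. The sample events, field names, and function names are all hypothetical and chosen only for illustration; a real data lake would use its own storage and query engine.

```python
import json
from collections import defaultdict

# Hypothetical raw events, one JSON document per line, as they might land in a lake.
RAW_EVENTS = """\
{"source": "set_top_box", "device_id": "stb-1", "channel": 7, "watch_seconds": 120}
{"source": "iot_sensor", "device_id": "s-9", "temperature_c": 21.5}
{"source": "set_top_box", "device_id": "stb-2", "channel": 7, "watch_seconds": 45, "ad_id": "a-3"}
"""


def read_events(text):
    """Parse each non-empty line as a JSON record; no schema is declared up front."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def infer_schemas(events):
    """Collect the field names seen per source to show which schemas are present."""
    schemas = defaultdict(set)
    for event in events:
        schemas[event["source"]].update(event.keys())
    return {source: sorted(fields) for source, fields in schemas.items()}


def total_watch_seconds(events):
    """Apply structure at read time: only records carrying the field contribute."""
    return sum(event.get("watch_seconds", 0) for event in events)


events = read_events(RAW_EVENTS)
schemas = infer_schemas(events)
watch_total = total_watch_seconds(events)
```

Because each record carries its own field names, a new field such as `ad_id` can appear without breaking the reader, whereas a fixed pipeline would need a schema change first.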
As you might expect, Arcadia Data customers are right in the middle of these business trends. They see the growing demands around data volume/variety/velocity, real-time, and agility from both an end user and an IT standpoint. The flagship Arcadia Data product, Arcadia Enterprise, addresses many of the trends that data-driven organizations see today. Arcadia Enterprise was architected to handle big data natively. This means you don’t have to create extracts or move data, and thus you don’t lose time and data granularity. The notion of data-native BI tools is about letting you run analytical processing directly on the data within the platform itself. This gives users incredibly fast access to the data in its native format as well as to document stores like Solr or real-time sources like Kafka.
To give an example of this in a real scenario, we can look at Arcadia Data customers in cyber-security. They get real-time instant response alerts when anomalous behavior is flagged, and their security analysts can drill down to the detail across all of the endpoints, networks, and users. They can see who’s connected to that incident, triage it, and take remedial action as quickly as possible. Another example is connected cars, where fleet managers want to track drivers and be alerted if there’s an incident that requires investigation. They can also drill down into a historical analysis to identify aggression patterns of those drivers over time, or identify weather patterns, road surface conditions, etc.
To get more information about the TDWI webinar, you can watch the full webinar here. Learn what else the webinar participants have to say about trends in big data, Hadoop, and data lakes, as well as real-world examples of what successful organizations are doing today.