Aired: March 28, 2018
Data Lakes Are Worth Saving
CEO & Chief Analyst of Early Adopter Research
VP of Marketing
Senior Vice President of Data and Applications
Data lakes started as a place for big data discovery and exploration for highly-technical users. But leading organizations have evolved data lakes to support a wide variety of business use cases, assisted by machine learning and artificial intelligence, visual analytics, BI tools, and other methods.
The data native concept helps explain how big data should be used, particularly for data lakes. But so many attempts at building data lakes have failed due to process and technology mismatches.
Join us for this webinar with industry experts from Evolved Media, Arcadia Data, and MapR to learn:
- Why and how data lakes have failed over the years
- What key analytic and data governance capabilities you should aspire to have in your data lake
- Key challenges to overcome
- Principles to consider for your data lake strategy
More about our presenters:
Dan Woods, CEO of Evolved Media & Chief Analyst of Early Adopter Research
Dan creates ideas about technology products, based on a broad technical understanding. By writing as an analyst in Forbes and working with Evolved Media’s clients, he sees the magic in technology and why it matters to IT buyers.
Steve Wooledge, VP Marketing of Arcadia Data
Steve is responsible for the overall go-to-market strategy and marketing for Arcadia Data. He is a 15-year veteran of enterprise software in both large public companies and early-stage start-ups and has a passion for bringing innovative technology to market.
Jack Norris, Senior Vice President, Data and Applications, MapR
Jack drives understanding and adoption of new applications enabled by data convergence. With over 20 years of enterprise software marketing experience, he has demonstrated success in many areas from defining new markets for small companies to increasing sales of new products for large public companies.
Recorded Voice: Broadcast is now starting. All attendees are in listen only mode.
Steve Wooledge: Hey everyone, welcome to the webinar today. We're going to wait just one minute to get started, let everybody login, and we'll be on our way.
I'm showing one past the top of the hour. My name's Steve Wooledge, I'll be one of the presenters today, but I wanted to welcome you to our webinar: Data Lakes Are Worth Saving. I think there's been a lot of hype in the market around data lakes and our goal is to really educate you on what we're seeing as analysts and vendors in the industry that have been working in the big data space since 2008 and before that.
Just some quick logistics things and I'll pass it over to our moderator and host, Dan Woods, from Evolved Media/Early Adopter. We are going to be taking questions throughout this. We'll have some time at the end for answering those questions, so if you have questions enter them into the questions panel and we'll address those at the end. We are recording today and will make this available. You can see the landing page right there if you want to grab it now and all the slides will be available and distributed to all the attendees. This gives you a view of a couple of screens, if you've got any questions you can click on that button and away we go.
With that, I'll turn things over to Dan Woods.
Dan Woods: Hi, welcome everybody. We're going to have some fun today talking about data lakes, how they work, how they haven't worked, what to do about it. We hope that some of the knowledge we've gathered will be beneficial to you and we also have a group here that is pretty well acquainted with each other, and I think knows something about the topic.
My name is Dan Woods and I am the CEO and chief analyst of Early Adopter Research. I try to focus on helping people in the enterprise space make sense of the kind of a mess that generally is created when you have to use technology to run a business. I try to focus on coming up with ideas that create sustainable advantage by using IT. I call these ideas research missions and you can find out more about it at earlyadopter.com. One of the things I'm very interested in and I have a research mission about is the idea of saving the data lake because the signals that are in big data are too important to go unused. The first wave of technology implementation has gotten us part of the way there, but we need to go all the way there.
We have two other presenters who are very interested in this topic as well. Steve Wooledge, who just spoke, is the VP of marketing for Arcadia Data and he's responsible for telling the story about their technology, which offers some interesting capabilities that we'll go over, about how to actually make use of big data. Then, Jack Norris is VP of data and applications at MapR and he is interested in what makes people successful when they use big data technology in various use cases. He's constantly worried about how things are happening in the real world. I've had the pleasure of working with both Steve and Jack at many of the companies they've worked with and we always have a good time talking together. Now, let's get started.
The first thing we want to talk about is the idea of a fundamental concept that we call the data native approach. One way of thinking about the data native approach is the idea of how can you have data in a data lake that feels like it's right here, not like it's somewhere else? One of the things that has to happen in a data lake is that you need to analyze the data where it is, right inside the data lake. So often, data lakes have limited their usefulness in terms of the ability to explore and provide the power of big data because they have always been extracting chunks of it and then delivering it through other technologies that don't allow you to get to the granularity that you would otherwise get to.
Now, of course, is that a useless technique? Of course, it's not useless. It can be very useful to distill the signals, but once you've distilled the signals you've lost that signal, you've lost that granularity. That becomes quite a problem when you're trying to exploit the full power of the big data that you have. The second part of a data native approach is the idea of empowering business users and that means allowing people who have the need to understand how to package that data up into useful chunks. If you can play with the data, if you can look at it, if you can mess with it, if you can take that capability and expand it across a large number of people, then they're going to find interesting things.
Once they find those interesting things then making them useful means creating a coherent package whether it's a certain customer record, or a record about all the information we know about a product, or some collection of information that then becomes a useful chunk. That is ideally done by the people using the data, not by people who have to hand it off and have it be intermediated.
Now, the next part is really interesting in this context, the idea that in a data native use of the data lake what you're doing is you're trying to operationalize the results. Now, there's tremendous infrastructure out there for operationalizing data lakes in all sorts of ways using a batch paradigm. I'm not sure if any of you guys are familiar with the Netflix Genie project, but that's a way of taking a data lake and allowing you to describe a request and then you pass it off to the genie and the genie grants your wish by giving you the data back. It's a complete batch paradigm. It doesn't allow you to do what you can do, and what makes spreadsheets and other tools so popular, and that is play with the data and get interactive use of it. Operationalizing those results means giving that ability to somebody to not only get the answer, but then explore the answer a bit.
Then, finally, the last part of the data native approach is very crucial to being able to ask big data questions. We're going to cover this in a couple of slides, but one of the problems that we've seen with data lakes is that people are asking small data questions instead of big data questions. When you ask a big data question, when you get a big data answer what you're doing is you're getting all of the granularity of the data informing you about what's going on. It's not just a precise answer to a precise question, and we'll talk a little bit about that.
Before we go a little further what we want to do is we want to find out from you guys where you're at in your data lake journey, so we have a poll that we'd like to run right now. If you can go to your interface and select one of these answers, what we're trying to do here is just get a sense of where everybody's at in their data lake journey. Are you at the gathering knowledge stage, where you're thinking about either Hadoop or other ways to put a large collection of data to work for your business? Are you developing strategy for doing this by looking at tools and architecture? Are you piloting? Many of the data lakes we've seen have started piloting and never ended piloting. It's not because they haven't been able to create useful things, but it's that they haven't been able to create a useful, fully realized data lake. Or are you deployed and have some use cases working in some sense? Or are you fully operational? Please, let's take a little bit to fill these questions out.
What we're trying to do is understand where people are at in this journey. As we go forward, based on your answers, we'll focus our discussion on some of the problems that may be related to where you're at. I think that's enough time. Can we move on and see the results of the poll? Where we're at is that many people are at the beginning of their journey. They're gathering knowledge or developing strategy. Some people are piloting. As we suspected, from our research, some are deployed, and a small number are fully operational and have that broad use of the data lake across the business. This, we believe, is a very accurate representation of what we've seen in the market. We think that some of the information that we will be able to present to you in the rest of this should help you move from pilot to deployment to fully operational and give you some ideas about tactical things you can do.
Let's ask the question, why haven't we been able to get to the places that we wanted to get to with data lakes? In other words, why have data lakes failed? Now, if you look at the ideas, the failure modes if you will that have afflicted data lakes we have identified about six different failure modes that we found to be relatively common. Let's go and advance the first one. I'd be happy to advance myself if you can give me that power back. There we go.
The first one is something called a data swamp. Now, the idea there is very simple. Most data lakes started out with tremendous ambition to be not only a repository for large data sets, but a repository that was pretty wide that included all data sets. In that sense, people had the ambition for one data repository to rule them all and that's perfectly a great ambition and it's a tremendous amount of power to have in one environment. That power is diluted if you don't have the ability to find information that you know is in there, you don't have the ability to curate and create packages of data that are then useful, and so you end up with, what has been called, a data swamp. Meaning, a large amount of data, but it's very hard to actually use it.
The next problem is one that, I think, our poll showed: the idea of pilot purgatory. That means that you do something, and oftentimes because of the tools people have chosen, because of the way they've decided to implement data lakes, there is a highly intermediated process. What happens is you don't have the end user being able to access the data directly. Instead you have multilayered handoffs and you do create something of use. There's plenty of uses of the early stage of data lakes where you created a pipeline to create a certain data set that became very, very useful. The problem is that then getting that into production, allowing the users to run it when they need it, all of that stuff didn't happen, and you end up with a system in which you don't have the ability to operationalize it the same way that you would normally want to.
Another idea, and this happens without people really thinking of it, is that they just start using the data lake as if it were a data warehouse. They put data in it, they use tools that are not exposing the granularity of the big data, they use a highly intermediated process, and so you end up with many of the familiar problems that you've had with data warehouses, but without actually exploiting the advantages of the data lake.
The next problem is related to the pilot purgatory and that is once you get everything running there's a complexity, especially when you get to the machine learning deployments, of making a data lake work and that is that you need to hand it off. In a sense, there's an analytics ops sort of process that's very similar to dev ops and the problem is that it's really hard to have the ops part happen without having the dev part involved. Many data lakes haven't been able to become part of the same infrastructure in a business because they haven't had the ability to hand off to an ops team the data lake.
Now, the problem when you hand off certain types of things to the ops team is that you then need to keep an eye on them to make sure that they're still working. Machine learning models, various types of statistics, they need to be monitored to make sure that they're not wearing out, or the training is not going bad, or the new users are not reacting and have changed their behavior. The ops part is a little bit more complicated than just throwing things into production.
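[Editor's illustration: the monitoring Dan describes can be sketched as a simple check that compares a deployed model's recent accuracy against the accuracy it had when it was handed off. This is a minimal sketch, not anything shown in the webinar; the function names, the 0.05 tolerance, and the sample values are all assumptions made for illustration.]

```python
def rolling_accuracy(predictions, actuals):
    """Fraction of recent predictions that matched what actually happened."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return hits / len(predictions)


def needs_attention(baseline_accuracy, predictions, actuals, tolerance=0.05):
    """Flag a model whose recent accuracy fell more than `tolerance`
    below the accuracy measured at hand-off (hypothetical threshold)."""
    return rolling_accuracy(predictions, actuals) < baseline_accuracy - tolerance


# Invented sample: 3 of the last 4 predictions were correct.
recent_predictions = [1, 0, 1, 1]
recent_actuals = [1, 0, 0, 1]
print(needs_attention(0.90, recent_predictions, recent_actuals))
```

[A check like this is the kind of thing an ops team could run on a schedule without the original developer, which is the hand-off problem described above.]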
Again, the lack of analyst ownership is another lens on what we've been talking about and that is the lack of ability for people to get direct access and play with the data. As I said before, you want the data to be right here for the analyst. You don't want the data to be over there. Then, as we've mentioned earlier, small data questions, this is the idea of asking very precise questions and getting precise answers and not actually trying to get the largest scope of answers. Let's explore that a little bit more as this slide moves forward.
I love that transition. It feels like all of the problems of a data lake are attacking you. I'm getting a black screen, is the regular screen there? There we go. Big data versus small data questions, there we go. The next thing I'm going to talk about is what does it mean to ask a big data question as opposed to a small data question? The idea is that a big data question is a question that provides a rich and broad context as opposed to a precise answer to a precise question as we said. If you're designing your data lake you're going to be putting into it data that provides you a high-resolution model of your business. It provides you with all of the things that were initially claimed for the data lake: the volume and variety. It provides you up-to-date data, velocity, et cetera. There's other V's that people have added, but the idea is that you have this gigapixel view of your business not a megapixel view of the business.
So, if you have a gigapixel view you should probably be asking gigapixel questions, and so what do we mean? What's the difference between a small data question and a big data question? Let's just take sales for an example. What are the sales trends over the last two years? Very good, useful question, but it's a precise question with a precise answer. It's not a big question.
Who's about to churn, what do we know about their journey, and how much more can we find out? This is a much more interesting question. It's also a question that you could actually put in front of a customer service representative, and if you put that in front of a customer service representative, and you give that customer service representative the power to examine the detailed picture you have of the customer's behavior, you will then allow that customer service representative to exercise their intelligence based on more insight into what the customer's been doing. This is the really magical point at which the big data's not just providing more granular answers for machine learning or something. It's providing details so that people can use their brains to do what their brains do really well, which is identify patterns, to have instincts, to make leaps of insight.
Let's look at security, for example. Another reasonable question would be, it's a small data question, what are the login and logout behaviors of contractors who oftentimes can be a security risk? The big data question would be how can we correlate the actions of contractors, particular contractors across multiple sources of behavior, across multiple accounts? This is a much richer answer. For the same reason, if you have a security ops person that's able to look at something suspicious and then very quickly see the whole picture they're going to be able to do a much better job and fix the problems much faster.
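[Editor's illustration: the difference between the two security questions can be sketched in a few lines. The small data question reduces the events to one number per contractor; the big data question keeps every raw event so one contractor's activity can be followed across sources and accounts. The event records, field names, and contractor IDs below are invented for illustration and do not come from the webinar.]

```python
from collections import defaultdict

# Hypothetical raw events from several sources; every value is made up.
EVENTS = [
    {"source": "vpn", "user": "c-101", "account": "jdoe", "action": "login", "hour": 2},
    {"source": "vpn", "user": "c-102", "account": "asmith", "action": "login", "hour": 9},
    {"source": "file_server", "user": "c-101", "account": "jdoe_admin", "action": "bulk_download", "hour": 2},
    {"source": "badge", "user": "c-102", "account": "asmith", "action": "enter", "hour": 9},
    {"source": "email", "user": "c-101", "account": "jdoe", "action": "external_forward", "hour": 3},
]


def login_counts(events):
    """Small data question: how many logins did each contractor have?"""
    counts = defaultdict(int)
    for event in events:
        if event["action"] == "login":
            counts[event["user"]] += 1
    return dict(counts)


def contractor_timeline(events, user):
    """Big data question, part 1: everything one contractor did, in time order."""
    return sorted((e for e in events if e["user"] == user), key=lambda e: e["hour"])


def cross_source_profile(events):
    """Big data question, part 2: which sources and accounts each contractor touched."""
    profile = defaultdict(lambda: {"sources": set(), "accounts": set()})
    for event in events:
        profile[event["user"]]["sources"].add(event["source"])
        profile[event["user"]]["accounts"].add(event["account"])
    return dict(profile)


print(login_counts(EVENTS))
for event in contractor_timeline(EVENTS, "c-101"):
    print(event["hour"], event["source"], event["account"], event["action"])
```

[In this made-up data both contractors look identical on login counts, while the timeline shows one of them using two accounts across three systems in the middle of the night. That is the kind of whole picture a security ops person would want to see quickly.]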
Now, let's go through and talk about some of the positive vision that we have for a data lake. Now, one of the problems that we've had with data lakes up 'til now is that the vision of the data lake has been limited to one repository to rule them all and then generally batch modes of accessing and analyzing that repository. Of course, there's nothing wrong with that. There's been tremendous amounts of signals gathered, but the problem is that the team that you have that can do that is not able to grow. You have a limited size of that team using tools that they can handle that require high skills.
What you need is to create a place where analysts can actually access the data directly the same way they do with the spreadsheets, the same way they do with data discovery tools. You want to have as many people with understanding of the business problem be able to get access to all that data. When they can get access to all that data and then use the tools that were designed to extract the signals from big data, that's when you start getting the rewards that you should be getting from the data lake instead of having a bottlenecked process that's all about batch jobs.
Now, the next thing is that you do want to be able to use data sets that you find in whatever environment they're useful. It's not wrong to want to extract data that turns out to have high amounts of signal and then pass it on to the existing data infrastructure, the existing data discovery tools and everything. You want that to actually be quite easy. You want to be able to move the data to business apps when you need to so that those business apps can be informed by the signal. It's important that this, again, not be a black art, but that these integrations be productized and smooth and, of course, move into operational status without having to be maintained by the original person that created them.
Of course, being able to have data that's fresh, that you can bang away at with the right tools, but also supporting batch: it's clear all three of those paradigms are needed. Then finally, being able to handle all types of data. This is one of the ways big data, even in the early era, has shined. You've gotten access to so many data sources that have so much signal and that's really the frustrating part about it. Most people who collect data and put it in data lakes, they know when they're putting the data in there that it's got tremendous amounts of signal and it's frustrating to think of all that signal being in there and not being able to be unlocked because you cannot expand the team that's moving it forward.
Now, the next thing we want to talk about is the challenges of victory. Let's say that you create a data lake that allows you to get a larger team to have access. What are the challenges you're going to be facing? One of the challenges, and this is difficult no matter how you do it, no matter where you do it, and that is data discovery and data organization. That is, how are you going to have people understand the new data when it arrives? How are you going to have people understand how to package that data up into maybe a set of canonical objects and on top of that a set of objects that are purpose built? Then, when somebody comes to that data, how are you going to have them be able to find it? This is easy to think about, oh we have a data catalog, oh we have a variety of mechanisms for packaging data, but to really have that working so that people feel that they have a data catalog that really works for them that is always a challenge and it's really interesting.
Now, the other part is the interactive use to drive business action. We talked about how having the analysts have interactive use is really important, but when you operationalize it it's not like the data is all just distilled at that point. In the use cases that Steve is going to present you're going to see things where people who are on the front lines of the business, people who would normally use distilled dashboards, are actually using interfaces that allow them to get at the granularity of the big data to answer specific questions. It's as if they have this spotlight that they can immediately shine on just what data they need to answer that question and get that high-resolution view of just what they're interested in. It's not just interactive use to allow the analysts to play around and have the data be right there for them. The data and the ability to use the big data has to be right there for the people on the front lines.
Now, the next part is another aspect of analytic ops. These systems are complex, they do use powerful technology, but if that becomes a problem where you need the equivalent of an advanced DBA-type person to do anything you want to do, that becomes a bottleneck and so it's really important that the systems you choose have as much auto administration as possible. That the configuration and the operations is not a black art. Otherwise, you're going to be keeping yourself in pilot purgatory, as we said. You're going to be keeping the amount of things you can operationalize very limited.
Finally, how can you use the resulting insights in real time and in support of applications? This is related to 2, where you want to provide the ability to use data to drive business action. Here, what you're talking about is how do you incorporate into the business processes the insights? When is it the right time to have that spotlight shone on the data? That's really an important capability.
Then, to make all this work you need an architecture that is able to support both delivery of the data into analytics using all the capabilities that you'd associate with a data lake that we showed here. Then also, the ability of moving that data and operationalizing it because a data lake is part of what you need to do, which is put data into the business processes, but a data fabric is another aspect of this, which is how do you operationalize the data and move it into your ... It's almost like your OLTP layer. How do you make that data, those distilled packages of data available for applications, for use by microservices, for streams that support other applications? How do you replicate it globally?
The data fabric architecture that MapR has been working on is one way that you can have a repository on which you put operationalizable analytics, which is sort of like OLAP. You can also put OLTP operations on top of it as well using a data fabric. I'm going to ask Jack to go over what MapR has done in the realm of the data fabric.
Jack Norris: Thank you Dan. When you look at a data fabric it incorporates the variety of data, the different operations that Dan talked about. The focus here is a fabric that includes not only the data at rest that you associate with a data lake, but also the data in motion because flows are important to make sure that you don't have a swamp that is there unused and un-updated. A fabric is incorporating this data in motion and data at rest. It's not limited to a single rack or building, so it can stretch across locations. It's required to remove the batch historical constraints of analytics. It's not just the ability to report, and explain, and describe what happened, but a data fabric also serves the ability to impact the business as it is happening. As the customer is engaging you're actually optimizing revenue, as threats are occurring you're minimizing risk. As the credit card's being swiped you're determining is it fraudulent or not. As the business is operating you're optimizing the cost of development or delivery of product and service. That's the function of a data fabric.
How do we effectively accomplish this? It's really summarized in two words, and that's architecture matters. On the next slide we've got a representation that the underlying data fabric provides that scale and that reliability to support a broad set of applications. That next layer up has the different operations that are possible: file, database functions, integrated streams. What you get is a general-purpose processing layer that can operate on top of or be integrated into this fabric. Not only is that more efficient. Not only is that easier to manage. Not only does that lower cost, which are all important, but perhaps more importantly that eliminates some of the delays and latency associated with those different operations. That's what's really required to inject the analytics into the operations. If you aren't able to eliminate those delays then by the time you have those analytics ready the business moment has passed. That's really the importance of that underlying architecture.
On top of that, we put and make sure that there's a series of industry standard APIs. That means the interaction with the data fabric goes through traditional NFS, POSIX, or ODBC, or an open JSON interface. That gives customers a lot of flexibility in terms of the types of tools they use. I guess, the important thing to point out here is that not all tools are equal. If you take tools like Arcadia Data that have pushed down the data processing directly into the fabric then you don't run into scale issues by tools trying to suck out data and get overwhelmed by the volume. That's really what gives you the ability to put the spotlight on things, as Dan was pointing out.
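[Editor's illustration: the point about pushing processing down to the data can be sketched with Python's built-in sqlite3 module standing in for the data platform. This is not the MapR or Arcadia Data API; the table, column names, and row count are assumptions chosen only to show how many rows cross the boundary in each approach.]

```python
import sqlite3


def build_store(n_rows=10_000):
    """Create an in-memory table as a stand-in for data sitting in the lake."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
    regions = ["east", "west", "north", "south"]
    rows = [(regions[i % 4], float(i % 10)) for i in range(n_rows)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def extract_then_aggregate(conn):
    """Pull every row out of the store, then total it in the client tool."""
    rows = conn.execute("SELECT region, amount FROM events").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)


def pushdown_aggregate(conn):
    """Ask the store to do the totaling and return only the result rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM events GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


conn = build_store()
extracted_totals, rows_extracted = extract_then_aggregate(conn)
pushed_totals, rows_pushed = pushdown_aggregate(conn)
print("same answer:", extracted_totals == pushed_totals)
print("rows moved:", rows_extracted, "vs", rows_pushed)
```

[Both approaches give the same totals, but the first moves every row to the tool while the second moves one row per region. At data lake volumes that gap is what Jack means by tools getting overwhelmed when they try to suck out the data.]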
The summarization here is to effectively operationalize the data, to do the traditional type of analytics that you think about or start out in a data lake, and then moving to that operational step that's part of that architecture slide that Dan talked through, to do that it really requires a data fabric. When it comes to a data fabric architecture matters, so you can't have a data fabric that's built on a batch write once data system like you see in traditional Hadoop. Or it can't really serve as the mission-critical system of record if it's built on a tool that scales with eventual consistency. That's [crosstalk 00:30:35] summary.
Dan Woods: There's two points I'd like to bring and make stronger here in this. One is that the idea of the data fabric is that it's an underlying layer on top of which your Hadoop or big data infrastructure can sit. Then, once it sits on the data fabric it then allows you to run whatever analytics you want the way we've been talking about running a data lake, but then the insights from that can then be brought into your application and operational infrastructure, so you don't have to create a separate mechanism for allowing a mobile app to use some insights that you may have gotten through your big data. Or to feed a stream of data to some other event processor that is now trying to recognize important events. All of that can happen in an integrated fashion using the same data that you are actually using when the analysts are banging away at their jobs, that's the first thing.
The second thing is, I think, that the fabric has incorporated a variety of different operational mechanisms, so it's not just that you can provide APIs, but you also can provide access through NoSQL databases or through streams where you can use the stream as a database the way Kafka allows you to. You can get all these different views on this data that you have, the views can be on distilled data, or they can be on the raw data. The data fabric, I think, provides a new kind of mechanism for completing that operational loop, not just about operationalizing a batch view of the data, but operationalizing application and production use in other use patterns.
Jack Norris: It's not just the analytic apps because you can take the traditional applications that are accessing file-based data and run those side-by-side. That's what really opens up the possibilities. You can containerize all your applications and it makes it much easier to take that operational step because they're there running on the same fabric.
Dan Woods: Now, let's move to the next part of this where we have Steve take us through some examples about how this works in practice using Arcadia Data. Steve, can you show us what you have going in terms of the demonstration of the product?
Steve Wooledge: Absolutely, and before I get into the product part of it I'd like to just set a little more context. I think, the data fabric idea is fantastic because the data lake, I think, gets a lot of negative press right now because people have invested heavily in the technology and even Gartner has put out some reports about the pressure that analytics leaders are feeling because they've got a lot of unprocessed data in these data lakes. Of course, there's no silver bullet here. You have to have data governance practices, and security, and all the things in place, but I think what's been missing in all this hype around data science, machine learning, et cetera is just simple business access to the information that's in there. That's part of what I'd like to talk about: what's changed in the technology with products like MapR that opens up the processing capabilities and what you can do with that information?
As Jack said, it's not just about the analytics and trying to make a data warehouse on the data lake. That would just be reinventing the wheel, but how do you drive operational processes and transactional processes on that same infrastructure where the data already sits? I think that's the opportunity. The challenges have been, again this is from Gartner, talking about determining the value and what are the skills and capabilities that we have within the organization to handle some of these newer technologies?
A lot of what we, as an industry, have been working on is making it easier to access the technology and the data through user interfaces and things like that, which I'll show in a second. If you think about it, historically Jack and I actually used to work together at a company that was talking about big data before companies had even started Hadoop distributions and things like that. It was always about big data changing the volume, velocity, and variety of that data. As Jack talked about, not just batch, but also interactive real time on that system.
Platforms like MapR came out that can handle documents, and streams, and tables within that common infrastructure and allow us to load and go. Bring the data in and then do the transformation, do the discovery, and then figure out how it gets operationalized in the business. If you look at what's been happening in the BI and analytics market there's a lot of vendors, but they keep reinventing the wheel with the scale up single node types of systems. There hasn't been a new approach to business intelligence and really unlocking the power of data fabrics that have the scale out distributed architectures.
The question I have in this poll, and Sam you can spin it up here, but there's different ways to give end users access to data in your data lake. Whether you're piloting, thinking about it, or in production it'd be great to know are you using development tools and it's focused on data scientists using things like Spark or MapReduce? Are you providing direct SQL access through open source tools like Apache Drill, Hive, Impala, and others? Of course, people have invested in traditional BI tools, that's one approach. Then, machine learning and AI tools. Data science workspaces, and workbenches, tool sets, and things like that. Then, Hadoop-native distributed BI platforms is another approach.
Hopefully, everybody's had time to respond to that. If we could please open up the results when we get a chance here. There we go. Number one, is traditional BI tools, which makes sense. People have invested in those things, the SQL access of open source is second at about 30%. Then, machine learning/AI is definitely a hot topic. More and more people are doing that. Development tools, and then Hadoop-native distributed BI platforms or data-native analytics platforms.
That's what Arcadia is, is that last class and we'll talk more about what that means. It doesn't surprise me that traditional BI is the way people are starting, but my question for you would be is your BI tool going to be really able to stand up to the challenges of big data analytics? If you're Jon Snow, if you're a Game of Thrones fan, you're Superman and of course you can. Most of us mere mortals need a little more power than what you're going to get in a desktop BI tool or something that doesn't scale with the data.
What we're seeing is that large organizations are choosing two BI standards. One for their data warehouse because the relational technology and the hardware requirements at the time made sense for the architecture that was developed where you have a separate tier for the BI server that runs outside the data warehouse. With a data lake it opens up a whole new paradigm and you need a different standard. We have a large financial services organization that brought in Arcadia as the BI standard and it was the first new BI standard they have brought in since Tableau 5 to 10 years previously. There is no BI tool on the market that can really scale, and provide the concurrency, and deal with the complexity of the data in the data lake without taking a new architectural approach.
I agree with Jack, architecture does matter. If you think about the history, I used to work at [inaudible 00:38:18] and I worked at another BI tool back in the day. If you were to try and take a BI server and run it on the relational database system it just isn't going to happen, particularly if you look at something like Teradata where they've highly optimized the software and the hardware to work together. There's no room for other processing engines in that mix. They're fully utilizing the hardware as much as they can to squeeze every ounce of productivity out of it, which was required when there wasn't low-cost commodity hardware that was out there.
What happened was BI server technology became a separate tier outside of that environment and from an analytics perspective you've got to optimize physical models to improve performance both in the data warehouse as well as the BI server, you've got a semantic layer that provides that business view to the table names and columns that people might not understand. You're securing that data, you're loading that data, and you're doing it twice. You're doing it once in the data warehouse, you're taking extracts, and then you're doing it in the BI server in a cube or something like that. If you've got big data requirements where you want to be able to connect natively to complex semi-structured data you need a parallel environment to scale and run with that and you need access to real-time, streaming information. We live in an on-demand world with Twitter, et cetera, why should the enterprise be any different? That really can't be supported in that traditional architecture and that's not to bash it, it's just that was the technology that was available at the time.
If you look at a data lake it's very different because you have an open storage paradigm and an open processing paradigm where there's a distributed file system like MapR's or cloud object storage. You've given the open source community the ability to build different processing engines and run it directly on those data nodes. Arcadia Data took advantage of that approach. We took a BI server technology and pushed it down to every individual data node in the cluster, which means that all of that physical optimization, semantic layer, securing the data only has to happen once and you get parallel real-time capabilities because you're not moving the data, you're not extracting data out to a separate tier, so you get a lot of performance and value from that from the end-user, but also from an IT administration perspective, as Jack said, it's just simplifying everything to keep it all in one place.
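As an illustration of the push-down idea described here, the following is a minimal Python sketch, not Arcadia Data's or MapR's actual implementation. It assumes a hypothetical cluster whose rows are split across three data nodes (`node_partitions`), and it shows each node aggregating its own rows so that only small partial results, rather than raw data, travel to a coordinator. All names and data are invented for the example.

```python
from collections import Counter

# Hypothetical partitions: each inner list stands in for the rows
# stored on one data node, as (city, event_count) pairs.
node_partitions = [
    [("sf", 3), ("oak", 1)],
    [("sf", 2), ("sj", 5)],
    [("oak", 4), ("sf", 1)],
]


def local_aggregate(rows):
    """Runs where the data lives: sum values per key on one node."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def merge_partials(partials):
    """Runs on the coordinator: combine the small per-node results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def pushed_down_sum(partitions):
    """Aggregate on every node, then merge; raw rows never leave a node."""
    return merge_partials(local_aggregate(rows) for rows in partitions)


if __name__ == "__main__":
    print(pushed_down_sum(node_partitions))
```

The contrast with an extract-based design is that the coordinator here only ever sees one small dictionary per node, however many raw rows each node holds.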
One of our joint customers with MapR is a large telecommunications company. Interestingly enough they provide a webinar platform. For their large enterprise clients they have a customer service team that can help them troubleshoot issues that you might be seeing in terms of performance of the network, looking at usage of the people so that they can get licensing straight, and things like that. They had requirements to be able to do a lot of ad hoc query reporting and analysis. They were struggling to do that for a large number of concurrent users. They had over 300 customer service reps and the requirement was to be able to support 30 concurrent users against their data lake and do really complex queries. It was a combination of MapR with Arcadia Data that was able to support the processing requirements that they had.
This is a quick snapshot of the benchmark and the small orange bars represent Arcadia Data running within MapR and bringing result sets back in a very reasonable time frame for 30 concurrent users with a complex workload. We scrunched the scale down because there were queries using legacy BI tools just passing SQL to different open-source SQL engines that couldn't get the concurrency. It's not to bash any projects that are out there, but it's to say that you need a way to do intelligent caching of data and speed up the processing to really go that last mile for the end user, so they can get value from the data lake by having more than just your three data scientists accessing that information. What a waste.
That's really what we focused on and this is just another view of how people do that. If you've got a data warehouse BI tool and you're connecting to data nodes via JDBC and SQL you're just passing SQL, you don't have any sense of how the data's partitioned and stored and how to optimize performance. You've got to bring data out into the BI server, which we talked about.
There's other vendors that have come out with more of a middleware approach where they're going to create a cube inside the cluster. To be gracious, many of them actually create separate servers outside the cluster and it's either passing SQL back and forth or you're copying data into multiple places and you lose the latency, the performance, and all the semantic information about filters, and aggregates, and how to intelligently model that data directly on the data nodes, so that you get performance for high concurrent usage. The data native is all about that, so we really simplify that, we do the performance scaling with the data nodes. Everything's pushed down, not just the processing, but also the semantic model itself.
This is highly optimized for performance and there's no data movement, and it's a single security model. You can actually just give people a web browser, which I'll show in a second, they can access all the data, and you can turn off their ability to pull down data to the local desktop. There's a large healthcare organization, that was their primary use case, just from a data governance perspective: if you're copying data in all these different places and trying to secure it multiple times, and tracking who's got what, it's a nightmare to manage. The more we can simplify that data stack and how data's been moving the better. We call this lossless or high-definition analytics where an end user can go against all the granular data without waiting for an extract or waiting for that to get moved from one place to another.
One of the key innovations within Arcadia is a technology called Smart Acceleration. OLAP cubes have been around forever. The challenge is you've got to build it all in advance, trying to determine what people are going to ask of the data. We said, "Why don't we flip that on its head and enable business users to access the data lake cluster, perform as many ad hoc queries as they like on all the granular data," the arrow on the left. We actually use machine learning and algorithms to measure and recommend different ways to create these, what we call, analytical views, which are aggregates, caching, and there's a tiering of this that we do. Stored back on disk, in this case MapR, there's also use of in-memory.
The next time that query comes in there's essentially a [inaudible 00:44:42] space optimization decision done on, "Hey, can I answer this query faster by going to the analytical view? Or do I need to go against all the granular data?" It's a dynamic way to build OLAP cubes, if you will, based on actual usage and which files and tables the end users are actually going at rather than trying to build it all in advance. It's just a much faster way to improve performance over time.
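The routing decision described here, analytical view versus granular data, can be sketched in a few lines of Python. This is an illustrative assumption rather than how Smart Acceleration is actually implemented: the `AnalyticalView` class, the `choose_source` function, and the use of row count as a stand-in for cost are all hypothetical. A view is treated as usable only if it contains every dimension and measure the query asks for, which also assumes the measures can be re-aggregated (sums or counts, for example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticalView:
    """Hypothetical pre-aggregated view kept alongside the raw data."""
    name: str
    dimensions: frozenset
    measures: frozenset
    row_count: int  # used here as a crude proxy for scan cost


def choose_source(query_dims, query_measures, views, base_row_count):
    """Return the name of the cheapest source able to answer the query.

    Falls back to the granular base table when no view covers the
    requested dimensions and measures.
    """
    best_name, best_cost = "base_table", base_row_count
    for view in views:
        covers = (set(query_dims) <= view.dimensions
                  and set(query_measures) <= view.measures)
        if covers and view.row_count < best_cost:
            best_name, best_cost = view.name, view.row_count
    return best_name


if __name__ == "__main__":
    views = [
        AnalyticalView("by_day_channel",
                       frozenset({"day", "channel"}),
                       frozenset({"viewers"}),
                       10_000),
    ]
    # Covered by the view, so the smaller aggregate is chosen.
    print(choose_source({"day"}, {"viewers"}, views, 1_000_000))
    # "program" is not in the view, so the query goes to granular data.
    print(choose_source({"program"}, {"viewers"}, views, 1_000_000))
```

A real system would build new views from observed query patterns over time; this sketch only covers the per-query choice between sources that already exist.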
We don't have a lot of time here, but I think the big challenge, I would say, with legacy BI tools, which the majority of you plan to use, is you wind up treating the data lake just like a data warehouse. You've got to create the semantic model, you're extracting data to the BI server, you're securing it twice, we've talked about some of the steps, but that analytical discovery process happens after you've done a lot of the performance modeling. How did you know what you needed to model? How did you know what questions you needed to ask? What a lot of our large companies would tell us is they would spend days and weeks going back and forth with business users trying to get them the right extracts into the servers so they could answer the questions they were trying to have, so you have these multiple iterations before you actually get something you want to move into production. That becomes slow, repetitive, and takes a long time to push that out.
We had one customer that literally said it cost them $1 million and 12 months of time to put a new dimension into their model, their physical model. That's why, I think, the promise of open source and data lakes has been interesting. If you have an analytics approach that's native you can do that visual discovery and semantic modeling in an iterative fashion directly against all the raw data and optimize performance incrementally based on need in a much faster time to value. You move that analytic process from stage 6 here, let's say, all the way back to stage 2 and that represents weeks or months of time, truly. It's not just giving, again, business users access to a data lake, but how do we speed up their iteration and discovery process?
We have a lot of customers that we work with that have many different use cases for their data lake. I won't go through all these, but cyber security, Internet of things, analytics, I'll give a demo of a connected car application, and just a lot of ways that you can leverage all the data in one system not just for analytics, but also operationalizing that into real time types of situations.
Let's take a look at what that looks like. Hopefully, my demo system hasn't timed out too much. Actually, let me go here. This is Arcadia Data, just the core product if you will. On the left-hand side we've created some different demo applications and I'm going to select this one, the IoT demo. We launch this app and it strings together all the different visuals that were created for this application. This scenario is a fleet manager for, let's say, a [inaudible 00:47:38] operator that needs to send cars out into the field to repair stations and that sort of a thing.
A fleet manager might want to have a real-time view of what's happening out there. This is fictitious data, so it's not going to represent a true business problem. What we're looking at are different hazardous events or, in this case, illegal lane departures, a collision that has happened with one of our vehicles in the fleet, or some hazard out on the road. We can see in real time if we had a feed from MapR Streams, let's say, coming in what's happening. You've got a geographic map, which you can zoom into, to look for hotspots within San Francisco and where different events are occurring. Then, you've got the individual transactions, if you will, for each vehicle identification number down here at the bottom. Those are updating as things are happening.
As an end-user, you're simply clicking on a button, you pull up that VIN ID and we can see all history for that vehicle over all time. We're doing some other things up here on the right like calculating an aggression score based on accelerometers, looking at how fast this vehicle has been accelerating or braking over time, and looking for some correlations, if we go to the analysis tab then for a vehicle. You can think about things like is their aggression score correlated with accidents? Of course, that makes sense. Then, looking at things like maintenance, so oil changes and how frequently do you need to change oil based on the aggression scores for one of these vehicles and things like that. Just an interesting view, very visual, very interactive, the ability to drill down and look at information. All completely through a web browser, no browser download, and no data being moved to a separate BI server. Nothing to install on your desktop, that sort of thing.
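The talk does not say how the aggression score is computed, so the following Python sketch rests on an assumption: it defines a hypothetical score as the mean absolute acceleration across accelerometer samples, then uses a plain Pearson correlation to ask whether that score moves together with accident counts across a fleet. The function names and any sample values are invented for illustration.

```python
from math import sqrt


def aggression_score(accel_samples):
    """Hypothetical score: mean absolute acceleration (m/s^2).

    Hard braking (negative values) and hard acceleration (positive
    values) both raise the score.
    """
    return sum(abs(a) for a in accel_samples) / len(accel_samples)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator
    would be zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Fictitious per-vehicle accelerometer traces and accident counts.
    traces = [[0.5, -0.4, 0.6], [1.5, -2.0, 1.0], [3.0, -3.5, 2.5]]
    accidents = [0, 1, 3]
    scores = [aggression_score(t) for t in traces]
    print(scores, pearson(scores, accidents))
```

A value near 1 would suggest that vehicles driven more aggressively also have more accidents, though with fictitious data it shows only the mechanics of the calculation.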
The other thing I want to show really quickly is how do we build this kind of stuff? This is another view of Arcadia and I'm just going to go ahead and show you how to connect the data. We've got a bunch of different connections in here. Arcadia can not only connect to things like MapR, but other systems as well. We're going to go ahead and connect to a small sample TV data set that we've got here and we create a new dashboard.
Again, I'm a marketing guy, I'm connecting to data and building stuff. We'll go ahead and let this data load up in here into the canvas. We've got a bunch of dimensions around this TV viewing data. I'm just going to edit it to simplify things a little bit. I just want to look at all the viewers over all time and the number of different shows and programs that have been looked at. Refresh that, simplifies it down. We come with a lot of [inaudible 00:50:23] visual types and sometimes an end user might not know exactly what's the best visual type to use. We've actually used AI to help recommend visuals, so I just click this thing that says 'Explore Visuals,' and it actually looks at the dimensions I chose and recommends some different ways to visualize that. I can just see it laid out here without having to iterate manually. I'm going to grab this interesting looking calendar heat map. This is looking at the number of viewers for different days of the week, and there's hotspots we could drill in to and that sort of thing. That's a nice view, I'll save that away.
Then, the next thing would be to look a little bit more granular at some of the programs and channels that people might be watching. Let me open up another component here. I will edit that and try that again. Here we go. I'm going to look at all the channels, and programs, and records. I'm going to simplify this down to only look at the top 50. We'll refresh that so that simplifies it down again. Not sure which visual type, so I'll use my recommender here. There's vertical bars and scatter plots, et cetera. I'm just going to take this horizontal bar chart and now we can start to look at which channels, which programs. That's interesting. Save and close that. If I want to add a filter, and make this a collaborative thing, get my webinar thing out of the way, click 'Filter.' I'll add a channel filter, I'll add a program filter, save that. Go ahead and view it.
Now, we've got the heat map of all viewers over time and if I wanted to select a particular channel, let's say I'm looking to sell advertising, in this case, on the BET network. I want to know which shows and which days people are viewing stuff, so let's filter it down. I can look at some of the hotspots and if I wanted to, let's say, filter on that particular day I can see which shows are the most popular. This is Big Momma's House and Menace II Society on that particular day. Just a very interactive easy way to do some analysis and remove filters and things like that. Anyway, I just wanted to give you that flyby of what's happening. We took a little bit longer, but I wanted to back up the theory with how it actually works in practice. A pretty simple product.
With that, we'll wrap things up, handle any questions. We've got a whitepaper that Dan Woods has authored with us that gets into more of the details and recommendations for people. We've got joint demos we've built with MapR to look at things like data warehouse optimization. If you want to get started with our Arcadia Instant today, there's a nice simple tutorial and a way to download that.
Dan, any questions that we've had come in?
Dan Woods: Sure. People can use the interface to get some seed questions. There's no problem. You should be able to type your questions and send them to us. The idea that we'd like to talk about is just what do you want to know more about?
One question that just came in was, "How do companies make the change from traditional to data-driven?" That's really interesting in terms of the idea of data playing a new role. I think that, in a sense, there's a cultural change that has to take place where whenever something happens people start asking the question what does the data say? Now, the problem is that in so much of the history of the enterprise you couldn't ask that question because the data wasn't available. There was no data. That's one of the things, the differences between the big Internet web scale companies, is that there's always data available for whatever they're doing because anything they do generates huge amounts of data. I think the ultimate answer is the more data's available the easier it is to be a data-driven culture if you have convenient access to it all.
What patterns, Jack and Steve, have you seen people make in terms of the transition to a data-driven culture?
Jack Norris: I think, what Steve just showed is pretty eye-opening because yes, we've been data-driven in the past, but it was actually we organized analytics based on the questions we knew we wanted to answer. It's a self-fulfilling prophecy. Yes, we had data to understand questions that we had, but it was based on those architectures. What Steve just went through is you have this much greater ability to pursue a train of thought and understand the data because of your ability to drill in ways that you hadn't anticipated before, you're following the path of the data.
Dan Woods: Another question that just came in is about how is AI and ML leveraged in big data analytics, and how can BI tools leverage this? Steve, you just showed an example of using recommendations powered by machine learning to show you what kind of visualizations might be appropriate. It seems like what you're really shooting for in this is to have as much as possible a guided process in which a large number of choices is reduced to an appropriate number of choices using the machine learning or AI. How would you characterize the role of machine learning and AI in supporting an analyst's use of big data?
Steve Wooledge: You're right. I think there's two ways where AI/machine learning apply. One is it's similar to a car where you have power brakes or power steering. AI/machine learning can add that extra boost of productivity, which you saw with the Arcadia Instant Visuals feature. The other is to, from an analyst perspective, provide the output of these algorithms into the analysis.
A simple example, that could be the aggression score that we showed in that IoT demo where something's being calculated and maybe you're going to start to predict part failures and things like that based on those aggression scores and learn over time. We're not an AI/machine learning workbench, but we provide the framework to pull that data, again, because it's all running on that central data fabric. It's super fast and easy to bring it in all on that interface, so that's the second way that we would provide insight around the outputs of that machine learning.
Dan Woods: For our last question, Steve or Jack, could you provide examples of how people use the real granular high resolution data they get from a big data repository like a data lake, like use in the frontline operations of a business process?
Jack Norris: I think one way to think about it is that we've moved away from an 80/20 mass customization where a lot of the analytics were to explain what happened on average, what was the high-level pie chart of the results, to where we're really optimizing at the long tail and figuring out how to personalize things for a particular individual and understanding the risk profile based on an individual. To do that requires just a massive amount of data, and this ability to process it and do detailed segmentation, et cetera, so that the results are highly targeted. That gives huge advantages to the organizations that are able to do that transformation, to better improve the customer engagement and better improve the operational results. I think that Arcadia provides the ability to then visualize that both in the front end in terms of how you're going to explore the data to figure out where you should operationalize and then in the results that Steve pointed out, having the data drive some of the content, so you're able to zero in and spotlight on the things of interest.
Dan Woods: We're at the top of the hour. This has been illuminating to me about the way that Arcadia works and the different patterns we have. I hope people found it useful. There'll be a recording of this and the slides available. Are there any other administrative things to say about the webinar?
Steve Wooledge: I think we're all set.
Dan Woods: Thanks everybody.
Steve Wooledge: Thank you, have a great day.
Jack Norris: Thank you.