Let your business analysts choose the right track
to insights from their data lake
SVP & Research Director of Data and Analytics
CTO & Co-Founder
WEBINAR AIRED: DECEMBER 7, 2017
Give Your Business Analysts Better Choices for BI on Data Lakes
As organizations continue to use more data sources to gain an edge, the data lake concept has gained momentum as a shared enterprise resource for supporting insights across multiple lines of business. A critical, yet often underemphasized step, is to provide secure, self-service BI and visual analytics for end users to get the insights they need.
Sounds easy to address, but is it really? There are four different ways to provide analytics on the data lake. What are the pros and cons of each? Can you leverage existing investments, such as your existing BI tools? What are the best deployment strategies for supporting 100s or 1000s of concurrent users across many business units?
Join this complimentary webinar with industry experts from Ventana Research and Arcadia Data who will cover:
- Survey data from Ventana showing latest trends in data lakes and big data analytics
- The pros and cons of traditional BI tools vs. “data native” modern BI tools
- A live analytics/visualization walk-through
- 5 key requirements for your data lake
Ladies and gentlemen, hello, and welcome back once again to Hot Technologies of 2018. That's exactly right. My name is Eric Kavanagh. I will be your moderator for today's event, "Delivering on the Promise: Business Value from Data Lakes". That's what everybody wants with the data lake, I'll tell you right now.
We have a great lineup for you today. Yours truly at the top there. My good buddy, Wayne Eckerson of Eckerson Group, has dialed in today, as well as my other good friend, Steve Wooledge of Arcadia Data. We're going to dive through a couple presentations and hopefully a demo as well, but I do want to share some other quick thoughts with you.
This webcast is part of a whole program that includes an assessment, designed frankly to help users like yourselves understand where you are in the data lake journey, as they say in the business, and to figure out what your next steps should be based on feedback from yourself and your peers. Many of you will have seen a survey or assessment that popped up when you registered, and this is what you would have seen when you got there: how much value does your data lake provide to business users?
What we've done here is something that Wayne and his team came up with, called Rate My Data. It's a very cool technology. It's a platform, again, for helping companies understand how they size up against other organizations and determine the next best steps forward. Obviously any investment as serious as a data lake requires a lot of thought, a lot of time, a lot of resources, and of course some money thrown in there as well, and you want to make sure you're making the right decisions.
Wayne and his team came up with this concept of developing Rate My Data, which is an actual application. It's a web-based platform for assessment. What happens is, you take the assessment. It only takes about five minutes, and then you get a personalized report. The report can take a number of different directions in terms of what you get and can see. This is one of the things you'll get after you take the assessment. This was a medium score, this is a high score, and you can see how you compare to other companies. The idea is it's going to help you figure out: should you move this way, should you move that way, which direction should you take for your organization to optimize the value you can get?
I'd also like to push a poll, if I could, very quickly. Let me see here. I'm on the road myself today, so I'm going to open this poll. You can see the question is "How do you, or do you plan to, give users access to your data lake?" You have five different options there: A, B, C, D, and E. I'll give this a couple of minutes here. Okay, folks are starting to answer. A is development tools. B is direct SQL access. C is traditional BI tools. D is Hadoop native. E is other, and for other, just go ahead and chat us something if you have something else going on.
This once again is all part of our desire to understand what's going on out there in the marketplace. We are researchers and analysts, Wayne and myself at least, and of course companies like Arcadia are always very curious to understand what's actually happening out there in the real world. What are you folks doing? What are you seeing? We always want to understand your thoughts and your perspective on these things because, like I say, this is a very challenging space. You want to make sure you do things right, and so that's why we have this whole platform for you. I'll give this one more second.
Maybe, Wayne Eckerson, I'll just throw a quick question over to you if you want to talk for a second about Rate My Data or about this particular one. It took you guys some time, working with Arcadia, to put all this together. Any thoughts on what you hope to learn from the survey as you read what people write in their personal assessments?
Right. Data lakes traditionally have been the domain of data scientists, so we wanted to explore how many regular Joes are actually using the data lake and getting value from it. We crafted about 20 questions, but it takes four minutes to complete because of the way we designed them with a Likert scale that doesn't change for each question. It spans, I think, six categories. Each category has about two questions, and then we threw in some filter questions.
The cool thing about this report, we've had over 100 people take the assessment so far, is you can go in and filter it. There's a filter button you can see upper right-hand corner. That's where you can further benchmark yourself against a more targeted mix group based on company size, region, and industry among other things. I think it's a pretty useful tool. Gives you a quick snapshot of where you stand in terms of the data lake usage for a regular Joe.
Okay, good stuff. Let me go ahead and close this poll. We got a good number of responses, and I can tell you the results. Thirty-one percent say traditional BI tools, 12% direct SQL access, 8% Hadoop native, 4% development tools, and 4% other. That includes the entire breakout of folks; no answer is at 42% so far. Thank you very much for taking that, folks. Well done.
Now, let me hop back over here. I'm going to push this next slide forward. I'm actually going to give the keys of the castle to Mr. Eckerson. Wayne, you can share your screen or use the slides in there, whichever you prefer. With that, it looks like you're going to share your screen. Take it away.
Give me a chance to set this up here. I'm on a Mac. It takes a little while. You're seeing one screen or two, Eric?
Just one, looks good.
Okay. To introduce this topic of the data lake and business value, I decided to start from the beginning, comparing traditional data warehouses to data lakes. In many ways, data lakes are a response to the deficiencies of a traditional data warehouse. Just for quick background, this is a traditional data warehouse architecture. The benefit of it, when it was launched in the mid-90s, was one place to go for all your data instead of multiple source systems. This was just one place.
It was designed for queries, not transactions, and designed in a way that simplifies user access and speeds up performance. It provides an enterprise view, so instead of silos of data you get a single, consistent view across all of your functional areas, delivering a single version of the truth with common metrics and standards. What we found was that it's ideal for supporting your core reports and dashboards.
That was the promise of the data warehouse. It's still a relevant promise today, but we hit a lot of speed bumps along the way, and data lakes in many ways are an answer to the data warehouse's problems. Because of schema on write, it takes a long time to model and load the data, so a warehouse takes a long time to build. It's hard to change. It takes an army of people to maintain, so it can be costly. The infrastructure, built on relational databases, tends to be costly as well. It scales well, but certainly not into the petabytes, and it's not really designed for multi-structured data.
What we have found is that the data warehouses are really good for answering known questions, things that traditionally the IT department went out and gathered requirements for and developed reports and dashboards. Had plenty of those in the warehouse, with some drill-down to do, some root cause analysis around those metrics, but it's really less good for answering new questions with new types of data.
In many ways, the problem with the data warehouse was that we were asking it to do more than it was really designed for. Around 2010, Hadoop, on which most data lakes are built, hit the scene hard. In many circles, people advocated wiping the data warehouse away entirely and replacing it with Hadoop-based data lakes, because they solved a lot of the problems of data warehousing. Infinitely scalable on a low-cost, scale-out, distributed architecture. Supports any kind of data thanks to schema on read. Basically, you're just dumping data into a file system; you don't have to model it first.
It gives users, especially those data-hungry power users and data scientists, instant access to data instead of having to wait for the IT department to model it first.
Also, it's built on open source, so the cost for software licenses is an order of magnitude less than a traditional data warehouse. Probably the thing I like most is that once you put data in there, you never have to move it, because you can bring different compute engines, like relational, sorting, and graph engines, into a Hadoop cluster. You never move the data out of the cluster into something else.
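The schema-on-read versus schema-on-write contrast Wayne describes can be sketched in a few lines. This is a minimal, hypothetical illustration in plain Python, not Hadoop itself: the "warehouse" path rejects records that don't match a fixed model, while the "lake" path lands raw JSON as-is and lets a later reader project whatever schema it needs. All field names are invented for the example.

```python
import io
import json

# Schema-on-write (warehouse style): a row must fit a fixed model
# before it can be loaded at all.
SCHEMA = ("user_id", "event", "ts")

def load_warehouse(row: dict) -> tuple:
    # Raises KeyError up front if any modeled field is missing.
    return tuple(row[col] for col in SCHEMA)

# Schema-on-read (lake style): land raw records untouched; apply a
# schema only at read time. io.StringIO stands in for a file in the lake.
raw = io.StringIO()
for rec in [{"user_id": 1, "event": "click", "ts": 100},
            {"user_id": 2, "event": "view"},        # missing field: lands fine
            {"user_id": 3, "extra": {"a": 1}}]:     # new shape: lands fine
    raw.write(json.dumps(rec) + "\n")

# A later reader projects the columns it cares about, tolerating gaps.
raw.seek(0)
rows = [tuple(json.loads(line).get(col) for col in SCHEMA) for line in raw]
print(rows)  # [(1, 'click', 100), (2, 'view', None), (3, None, None)]
```

The point of the sketch: the lake never refuses a record, so loading is instant, and the cost of imposing structure is deferred to whoever reads the data.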
That was the promise of data lakes that we found. There are some liabilities. One, it's still relatively new technology. It's evolving quite fast, very fast as a matter of fact, but it's still working out issues. A lot of the software is built on open source from the Apache Foundation, with a lot of different projects starting, stopping, and overlapping. That's hard to keep track of, which is why we need Hadoop distributions from companies like Cloudera, MapR, and Hortonworks.
We found that they're actually fairly complex to manage, especially the infrastructure and the hardware, which ends up costing a lot more money than people think. The skills, hiring the skills to do it is not cheap either. From a workload processing perspective, it can do real fast full table scans, but it's less good for complex multi-table joins and high volume user concurrency.
Today, and this is a moving target, we're finding that data lakes are really good for data scientists and power users who want instant access to raw data. It's the ultimate data dump. It's also good for offloading ETL workloads and archiving large volumes of detailed data so that you don't have to upgrade the data warehouse, which can be quite expensive.
What you're seeing is that the data lake, somewhat a response to the deficiencies of the warehouse, has its own deficiencies that, in fact, the data warehouse is [inaudible 00:11:30] to address. If you look at the underlying technologies behind the data warehouse and the data lake, specifically the relational database and the distributed file system called Hadoop, in 2010 the attributes of each were almost polar opposites. You can go down the list here and see that on every single characteristic they're completely different. One's interactive, the other's batch. One offers SQL-based queries, the other's Java-based. One's schema on write, the other's schema on read. It just goes on and on, all the way down the list.
These two technologies are playing both friendly and competitor in the ecosystem, and as a result the capabilities are starting to converge. Relational database is starting to take on a lot of capabilities of Hadoop, and vice versa, and both of them are going to the cloud.
I've been trying to help companies figure out what the dividing line is between these two worlds, and it is a little bit difficult, and it is a moving target. What we're seeing is that the data warehouse, running on a relational database, is great for supporting business people, regular Joes as I was saying before, and for specific types of workloads that require complex multi-table joins and large volumes of concurrent users. It's really good for supporting the existing reports and dashboards and doing analysis on those things.
Whereas Hadoop and data lakes are really good for data scientists and power users who want instant access to the raw data or slightly scrubbed, cleaned data. They're really good for big table scans, large batch jobs, ETL offload, data offload, and data science sandboxes. When you put these two together, you realize, hey, you know what? Why should we have one versus the other? What we really want is both.
Then the question becomes, how do you unify these into a coherent ecosystem or architecture? What I figured out is that there are options. There may be more, but this is what I've come up with so far. One is that you create distinct worlds: physically distinct environments, so you have a data lake running on Hadoop, and sitting next to it an integrated data warehouse. You're feeding data back and forth. The data warehouse runs on a relational database, and the data lake runs on Hadoop.
That's how it exists in most companies that have both today, but there are other options as well. For instance, I've seen companies who try to rebuild the data warehouse in a data lake, and they can be fairly successful. I would call these more data marts than data warehouses, but that's where you take some of the SQL-on-Hadoop technology like Cloudera's Impala, build out tables in a dimensional schema of sorts, and then run queries right against those tables in the data lake.
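The data-mart-in-the-lake option is just ordinary dimensional SQL run where the data sits. As a rough sketch, assuming a star schema with one fact and one dimension table, here is the query shape using Python's built-in sqlite3 as a stand-in for a SQL-on-Hadoop engine like Impala (Impala's DDL, file formats, and distribution differ; only the modeling and query pattern is illustrated, and the table and column names are made up):

```python
import sqlite3

# sqlite here just illustrates the dimensional layout; Impala would run
# the same shape of SQL distributed over files in the lake.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# A BI-style aggregate query runs right against the tables in place --
# no extract step into a separate warehouse.
rows = con.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('books', 15.0), ('games', 7.5)]
```

The design choice being made is that the dimensional model lives as tables in the lake itself, so queries and the data never leave the cluster.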
A third option would be to use a BI tool to recreate a dimensional view of data in the lake. In this case, the virtual view, the analytics tool, sits outside of the data lake and queries data inside the lake. It pulls data back into its own cache, where it can optimize that data to ensure consistently fast performance.
Then the other option, the last option, is where the analytics tool actually sits inside the data lake and resides natively on Hadoop and queries the data from there. I believe Steve Wooledge will talk about that approach since that's the one Arcadia takes.
Where are we today? Let's say two years ago, the state of the art was that you would put most of your data in Hadoop or S3, as the cloud started to emerge as the platform of choice for many companies. You would do a lot of your ETL work in Spark, and your data scientists might also use Spark libraries for doing machine learning.
Then once you've refined the data sufficiently through those zones on the left, one, two, three, four, you can push that data into a relational database to support your big data warehouse. The benefits here are that you basically get the scalability, support for multi-structured data, and schema on read that a data lake supports, whether in the cloud or not, but you still have the disadvantage of copying and moving data into a data warehouse. Anytime you duplicate large volumes of data and make a transition between orthogonal technologies, you can run into problems and expense.
What we're seeing today is a slightly different environment, where companies are using big data analytics tools like Arcadia to query the data that's been transformed in the lake, whether it's in Hadoop or in the cloud. Usually that transformation is happening in Spark, oftentimes using Python, or in commercial tools as well, and sometimes reaching way down into the landing area.
That's where we seem to be going in this big data world. The benefits here are the same as the other: scalability, multi-structured data support, schema on read, but you don't copy or duplicate data. You just keep it in one place, in the lake, and give dimensional access to non-data scientists via a native big data analytics tool.
I guess the cons to this approach are that it's new, and that there is no relational database, which might freak some people out. It's something to consider, as some of the bleeding-edge and leading-edge companies are going in this direction. Perhaps Steve can convince us of that.
This is a dirtier picture of that architecture with a little bit more detail, but it's basically saying the same thing. There are a lot of different pipelines that come out of that zone-two data hub, supporting different types of users and applications. This environment, we're seeing, can be built not only by traditional data engineers familiar with SQL and relational technology, but also by big data engineers who prefer open source libraries and tools.
Eric mentioned this assessment that we're running right now. It only takes four minutes of your time. It might give you some interesting insights on how you've progressed with your data lake. With that, I'm going to turn this back over to Eric.
All righty. I'm going to turn it over to Steve Wooledge. Folks, feel free to ask questions. I'll post the slides in just a second. With that, Steve Wooledge. Take it away.
Great, thanks. I'm going to share my screen as well. You see that okay, Eric?
Yes, I can.
Cool. All right. Hey, everyone. My name's Steve. I work for Arcadia Data. Happy to be here. I've worked in the industry along with Eric and Wayne for, gosh, it's probably 15, 18 years now. I've worked at relational database companies. I've worked at business intelligence companies like Business Objects. I've worked at Hadoop vendors like MapR, and now I work for Arcadia Data. It's fun to see how the industry's evolving and how customers are using different technologies in different ways. What I'll talk about is that fourth option Wayne pointed out: business users getting value from data lakes using what we call native BI and analytics.
A quick snapshot: back in 2008, when I was at a small database startup company, everybody was talking about big data. It was all about moving from structured data, if you will, to multi-structured data, things that didn't fit as neatly into rows and columns, things like JSON and IP traffic data off of sensors. Batch workloads within Hadoop, as Wayne talked about, have moved to more interactive and real-time. There's of course big data in terms of volume, but just a lot of complexity there.
We're way past that. The platforms have evolved, and relational databases have evolved, but I think the need for agility on this data remains: people don't want to have to structure it all in advance, as Wayne talked about. They want to be able to analyze things as they lie, without doing a lot of structuring in some cases. You want to be able to query search indices, events, and documents, like document databases, and you don't necessarily want to transform data and have it modeled perfectly into an environment for analysis. You might want to do the transformation in place in the data lake rather than in the data warehouse, or discover the data before you transform it into something more normalized for reporting.
There have been a lot of, I'd say, agility changes because of the nature of hardware and the costs coming down, but our observation at Arcadia has been that there really hasn't been a lot of innovation around BI technology, if you will, from where we've been. SQL is still the language of choice for business users, but BI tools don't necessarily handle the scale or the complexity of big data on the platforms that are out there.
That's really what we set out to do. If you're a "Game of Thrones" fan, the question becomes can you stand up to the big data analytic requirements and the types of data that's out there? For people who don't know, this is Jon Snow, who decided to charge an entire army on his own. Just kind of fun.
Really, we founded the company with a mission of connecting business users to big data. As Wayne says, data lakes today tend to be the realm of data scientists and developer tools, those types of things, where you want to go after the raw data. You don't necessarily want it structured; you don't want to lose any signals in the noise, so to speak. But there's a lot of value in that data lake that business users can get access to as well.
I'd say data lakes today often get treated as a development environment to find and discover information, but if the data's already there and you've found some insights, why not share them with a lot of people from where the data sits? You don't necessarily need to move it into a special-purpose system that handles concurrency and SLAs and dynamic workload management and that kind of stuff.
That's what we do. We've been around since 2012. We've gotten some awards from Gartner and Forrester in different technology areas, like what Forrester would call Hadoop-native BI, which is a different category from traditional BI. We have a lot of big customers with data lakes who are creating a BI standard for the data lake that is different from the one for their data warehouse. This is not replacing data warehousing. These are new use cases, new data, new applications that companies like Citibank or Procter & Gamble are deploying using data lakes with Arcadia Data as the front end for their business users.
What I'd like to do is talk through the reasons why people are choosing two BI standards and the benefits of that, show you an actual product demo, and then we can get into questions and answers.
Again, the premise is that there is a whole host of BI tools that have been around for decades. I used to work for one, and they're optimized to work extremely well on relational technology, but not necessarily optimized for the openness and the scale available within non-relational data lakes. That's not to say you can't build a data lake conceptually on a relational database, but I'm going to talk more about the Hadoop-based and cloud-based object-store data lakes that are out there.
If you think about it, the reason people are choosing two BI standards for their enterprise is that the traditional relational database was highly optimized to take advantage of the hardware available at the time. These are closed environments, and I don't mean closed in a negative way, but I worked for Teradata, and the amount of engineering, the performance you can squeeze out of a relational database, is amazing. The work they do to integrate that with the hardware is fantastic, but you can't take a processing engine and run it on the same hardware where the database nodes are running, because it's just not designed to handle that kind of workload.
If you were to take a BI server and say, let's run it on the data warehouse, you can't really do that. BI servers, growing up over time, follow a tiered model. You've got data that sits on the server or on the desktop. These are scale-up environments for the most part. You can cluster them, but they're not distributed systems, and what winds up happening is you've got to load data once into the warehouse, do some transformation, then load it into the BI server, and secure it at multiple points.
You've got a semantic layer, which maps back to the schema that's been defined in the database. Then you typically will optimize the physical model maybe twice. Once in the data warehouse, if you want to do the optimization there. You can also optimize that performance in the BI server. That's a choice that people make from an architecture perspective, but oftentimes you're doing it in both places.
It becomes a bit of extra work, and there's value in that, but in many cases you don't have a native connection to things like semi-structured data. If you take JSON files as an example, you're going to have to flatten them to put them into a table. The BI tools require data to be in more of a relational format before they can execute queries against it. And these are not parallel environments, as I talked about.
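The JSON-flattening step mentioned above, turning nested objects into flat columns before a traditional BI tool can query them, looks roughly like this. A minimal sketch; the record and field names are invented for illustration, and real pipelines also have to handle arrays and type conflicts:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested objects into dotted column names,
    the kind of tabular shape a relational BI tool expects."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# A nested JSON event as it might land in the lake.
event = {"user": {"id": 42, "geo": {"country": "US"}}, "action": "login"}
print(flatten(event))
# {'user.id': 42, 'user.geo.country': 'US', 'action': 'login'}
```

A tool that handles complex types natively skips this preprocessing entirely and queries the nested structure in place.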
The idea with Arcadia was to take the openness of systems like Apache Hadoop, which allow you to have multiple processing engines running on the nodes where the data sits. The whole idea is bring the processing to the data. Don't take the data to the processing, especially when you're talking about petabytes of data.
We took advantage of that, and we built a BI server essentially that runs on the nodes in the data lake, fully parallel distributed system. There's benefits from performance and things like that that we'll get into.
The backend stuff is also hugely valuable. We inherit the security that's already in place. We do the physical modeling in place. We give you a business semantic layer to access the data and define it with business terms, directly in place. We understand where the data is located, the distribution, the hashing, et cetera, so we can create query plans that are highly optimized for a distributed environment. You only do it once, and you get native connectivity to those data types. We can handle complex types like JSON natively, and it's a fully parallel environment.
You might say, I don't necessarily want to have all my data in a data lake. For sure; at every company, it's mind-boggling. I was at the Gartner show last week, and they showed a graph of people and the number of systems they have. There are hundreds of databases in big organizations, so yes, you can connect other systems into a system like Arcadia.
One of the things we've been innovating with is the Apache Kafka project and our partner, Confluent, who have created a SQL interface to real-time streaming data called KSQL. We've integrated with that, so you can have real-time streaming data coming into your dashboard, which can trigger an alert.
Then you can drill down into detail within the data lake or within your data warehouse environment, or your MongoDB environment, Solr, other types of systems where you store data. It's not just for the data lake, but that's where you get a lot of the performance and the benefits for people that want to discover information, and then also productionize it within one system.
If you contrast that with what's out there: this is another way of saying some of the same things, but data warehouse BI architecture is really a scale-up environment, again, optimized for the technology of the time, but it requires data movement, multiple points of security management, et cetera. There are vendors out there that have come up with a middleware application as a Band-Aid approach to allow traditional data warehouse BI tools to connect to another data store within the cluster, which they put on an edge node or a series of edge nodes.
That works okay, but you've still got multiple points of integration and security. Really, you don't have the semantic knowledge about the data that's down on the data nodes. You're still pulling data out. You lose information about where the filters and aggregates should be applied, and you're essentially passing SQL back and forth between the BI tool and that middleware box, which is interpreting things and pulling data back from the data nodes.
Those cubes are typically built on a nightly batch run, and you've got to build them in advance based on what queries you think people will run. You lose a little bit of the freestyle nature of being able to query ad hoc against the full dataset, versus data-native, or native BI, which pushes down not only the processing but also the semantic knowledge.
What we can do then is build dynamic caches of data based on the actual usage of the people issuing queries. It doesn't all have to be built in advance based on what we think people will query; we learn over time and build ways to accelerate performance based on the actual usage of the cluster, because we have that semantic knowledge and everything else from the queries coming into the system.
We like to call this lossless. Like high-fidelity, high-definition television or audio, you want your analytics to be high definition as well. If you lose the granularity because you're aggregating and pulling data out to handle the low scale of a BI server, you're not going to have that full-fidelity access.
Then the performance is something that really stands out as well. This is a benchmark from one of our customers, who was trying to give business analysts, actually customer service reps for a telecommunications company that runs a webinar platform similar to the one we're on, really high-performance ad hoc queries for 30 concurrent users. They can troubleshoot things like where the bottlenecks are in the webinar platform and what answers need to go back to their customers.
The point of this is not to compare us with a SQL-on-Hadoop engine; we actually leverage SQL-on-Hadoop connectors to the data. But we are putting a proper BI server, if you will, within the data lake, which gives you much better concurrent performance and returns results in a reasonable amount of time for people. That's the kind of performance we see, and again, the way we do it is through some innovative technology we call smart acceleration.
There's some patent-pending technology around this. Again, in terms of agility, we want end users to be able to access the data lake cluster, get granular access to all the data, and ask any question they want. Then we have these analytical views that are recommended by the system based on machine learning. We're looking at which tables are being accessed and which queries are being run on a frequent basis, and we recommend back to the admin: you might want to rearrange and create some aggregate tables that we store back to HDFS or S3, and deploy those out.
The next time that query comes in, we can make a cost-based optimization decision on where to route it for better performance. You get 100X better performance than just scanning the entire data lake and trying to bring back result sets.
That's a big difference, and again, it's incremental. It's dynamic. It's based on actual usage. You don't have to build the entire cube in advance, which is a huge advantage from an admin perspective.
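The routing idea behind this, answer from a pre-built aggregate when it covers the question, fall back to scanning the base data when it doesn't, can be caricatured in a few lines. This is a hypothetical sketch of the decision only, not Arcadia's actual implementation; the data, column names, and covering rule are all invented:

```python
BASE_ROWS = [  # raw events in the lake
    {"region": "east", "device": "ios", "clicks": 3},
    {"region": "east", "device": "web", "clicks": 2},
    {"region": "west", "device": "ios", "clicks": 4},
]

# An "analytical view": clicks pre-aggregated by region, as if it had
# been materialized back to the lake.
AGG_BY_REGION = {"east": 5, "west": 4}

def route_query(group_by: str):
    """Route a simple group-by query to the cheapest source that covers it."""
    if group_by == "region":               # aggregate covers the query
        return "aggregate", dict(AGG_BY_REGION)
    totals = {}                            # otherwise: full scan of base rows
    for row in BASE_ROWS:
        totals[row[group_by]] = totals.get(row[group_by], 0) + row["clicks"]
    return "full_scan", totals

print(route_query("region"))  # ('aggregate', {'east': 5, 'west': 4})
print(route_query("device"))  # ('full_scan', {'ios': 7, 'web': 2})
```

The incremental part described in the talk would be watching which group-bys trigger full scans and materializing new aggregates for the frequent ones.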
Really, the whole premise of the data lake was to provide more data agility. Again, whether it's done on relational or Hadoop, it doesn't really matter. The point is you've got to be able to bring in data and iterate on it quickly. But if you take a data lake and just treat it like another database, the BI server forces you to take data out of the data lake, secure it there, and do the performance modeling in the BI server before you can actually start doing data discovery in a coherent way.
Then, by the way, we forgot to put a dimension into that cube that someone wanted to look at. Now you've got to go back to IT, ask them to add the dimension before you can go back to the second iteration or the nth iteration of analysis that you want to do.
I've lived this in my previous lives, and that takes a lot of time and costs to manage an environment like that. You lose the business agility, which is the whole promise of Hadoop and data lakes in the first place.
We've changed all that, and we allow you to analyze data as it lies, if you will, in its original form. Yes, you do semantic modeling on it, so you can put business terms against it, and you can interpret JSON and look at the schema embedded in the metadata, but go ahead and analyze. Do the discovery before you have to optimize that data structure in a way that allows you to productionize it.
You know what? A lot of business analysts might want to just find some insight, and then go take it and do something with it. They're not necessarily going to deploy this out to 100 concurrent users. That performance modeling step is optional. It gives you a lot of flexibility, faster time to value, and we just move that entire analytic and visual discovery step from step six all the way up to step three. That's delivering on a promise of agility with a data lake.
In summary, that's what we do. We provide business user access to all the data, complex schemas and all, on a native architecture that gives you that tight governance and integrated security on the data lake and allows you to deploy to hundreds and thousands of users in a highly concurrent workload. With that, I will switch it over and give you a demo of what we're talking about.
Pull up my browser here. I've got Arcadia Data running here in a web browser environment. Everything is HTML5 browser based. There's no browser plugins. There's no desktop download. Everything you see is just delivered via the web. The data all sits back in the data lake, which is huge from a governance and a compliance perspective. You don't have to worry about people downloading data to the desktop and auditing that. It's all browser based.
What I'm going to do is show you a very simple demo that gives you a sense of the tool. Then I'll show you a more robust application around cybersecurity, which is a big use case that we have with some of our clients, some of which I can't mention. Some include US agencies like the Department of Agriculture, believe it or not.
All I want to do in this case is show you how to connect to the data and build a simple dashboard. In this case, I'm going to click on data. It pulls up my connections. You can see things like Solr and Kudu and Kafka, relational technologies as well on the left. I've got a very simple dataset that was created around TV viewership data. I call it Eckerson TV. That's your handle for your next TV show, Wayne.
All it's going to do is pull up a palette here or a dashboard. It's going to bring in the tabular data in a way that doesn't necessarily do a lot for me as an analyst. Let me go ahead and look at this a different way. I'm going to edit this, and I want to look at all viewers over time. I'll bring in a date string as my dimension. For measures, I'll bring in a record count.
Let me just refresh this. This is filtering down, looking at the data. Okay, so four different dates over time. I see the number of total viewers at any point in time across a lot of different TV channels. As an advertiser, I might want to know what shows are people watching? What time of day are they watching? Those types of things.
Let's visualize this in a different way. What we've done is we've embedded machine learning not only into the backend for performance optimizations, but also into the frontend to assist people with the right ways to visualize data. I just clicked this button that says explore visuals, and this is actually showing me different visualization types using my data, and I can compare what's the most useful to me. Do I want to do a standard bar chart, a scatter plot, bubble thing, or maybe this calendar heat map would be interesting since we're talking about time?
Here, I'm looking at the total numbers of viewers, and we've got hot spots on things like Sunday when maybe sports are happening or your favorite gospel show. It could be something on Thursday, but I'm not really sure what that is. That's useful, so I'll save that away to my dashboard, and I'll close this. I've got that visual.
Now, I want to look at something a little bit different, which would be to break down things like, let me just zoom out here. The channels and different things that I want to look at. I've just got to zoom out because I can't reach my edit button. There it is. Now, I'm going to look at channel and program as my dimensions. For measures, we'll look at record count again.
Refresh this visual. Now it's going to break it down a little bit more by what are the top channels, and which programs are most popular. Again, I want to visualize that so that it speaks to me a little bit better. This again will recommend some different visual types. You've got your standard bar charts and scatter plots. We've got some things like network graphs down here, which are really interesting and dynamic, but not something I want to necessarily use for television viewing.
I'll just do a traditional horizontal bar chart. You know what? I forgot to put the filter. That's going to take a while to pull back, but the final result then would be this bar chart on the right, which shows the different channels and what shows are popular, and I've added a filter which allows an end user to select something like the BET Network. If I wanted to talk to an advertiser about what are the best days to advertise on BET and which shows, now you can see what those shows are pretty quickly.
You can hover on a hot spot and filter that by, okay, this day, Wednesday. We had a bunch of people. What shows are they actually watching? It was the BET Hip Hop Awards. It was "Death At a Funeral," those types of things. Just a very simple visual to show you what you can do, connecting the data in a very visual way.
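The "explore visuals" step above recommends chart types from the shape of the selected data. As a purely illustrative sketch (the function, field types, and hand-written rules here are hypothetical; Arcadia's actual recommender is ML-driven and not public), a rule-based version might look like:

```python
# Illustrative sketch: score candidate chart types against simple
# characteristics of the chosen dimensions and measures. A learned model
# would replace these hand-written weights.
def recommend_charts(dimensions, measures):
    """dimensions/measures are lists of (name, dtype) pairs."""
    dim_types = [t for _, t in dimensions]
    scores = {}
    if "date" in dim_types:                      # time dimension present
        scores["line chart"] = 3
        scores["calendar heat map"] = 3          # good for daily hot spots
    if "category" in dim_types:                  # e.g. channel, program
        scores["bar chart"] = 2 + len(measures)
    if len(measures) >= 2:                       # measure vs. measure
        scores["scatter plot"] = 2
    scores.setdefault("table", 1)                # always a fallback
    return sorted(scores, key=scores.get, reverse=True)
```

With a date dimension and one measure, this surfaces the time-oriented visuals first, which matches the calendar heat map picked in the demo.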
That's a very simple use case, but the cool part is that big enterprises want to build applications that really help them do things like stop cybersecurity attacks. We've built an application with one of our partners, Cloudera, around the Apache Spot project. This is an open source project, which brings together a community response to the best way to visualize threats from a network end user perspective as well as endpoints in the network.
There's machine learning algorithms that are included as part of this project, and Arcadia's part in this is to contribute visualization types that can help people spot issues, hence the name Spot, in a visual way to not only detect attacks, but to do greenfield threat hunting and things like that. I'm just going to put this in fullscreen mode so it shows a little bit better.
The idea here is you could have something like an executive summary view. This is a dashboard that's created. It's using machine learning to bubble up high-potential threats from an end user perspective or endpoints. There are some ways that you can feed back to that model so it learns over time. This gives you that bird's eye view of what's happening across your entire enterprise.
If you look into the network, if I'm a security analyst, I want to look at NetFlow data over time within my environment, and again machine learning is being used in the bottom left to bubble up suspicious activity. As a security analyst, I know a lot about the systems that are there. I might look at this top threat and say this is a demo environment, so I think this is a pretty low score in terms of a threat, so I'm going to click that.
Some of these other ones are maybe a little bit higher, but that feeds back then to the model, which can learn over time and improve the accuracy of what the machine is doing to detect potential threats. You can do things like pick the time slider here, and it's going to change the network graph. Over here we're looking at the flow of data between endpoints, beginning to end. In this environment, I probably selected too little data.
This is a demo application, but then you can look at the thickness of the line to understand what are the strong connections between systems and if I've identified a specific endpoint, what are the other endpoints that it's connected to. You can drill down into ultimately all the detail that's there, so specific IP address. We can click into that, and it takes me into an exploration window, where now there's some workflows that have been defined, let's say, for the security analyst where they can collect all this data in one place. Click on the user name, proxy actions, et cetera.
I'm not a security analyst by nature, but essentially they can do some analysis here, and then if you want to share that with other people, it's as simple as going up here. You can email that view to somebody, or get the URL, copy and paste that in a case management system. When someone logs in with the right authentication, they can see all this data in context of where the analysis left off in their exploration.
That's the kind of thing that we want to do on a very large scale for organizations around cybersecurity, IoT systems, general marketing applications, and things like that. Hopefully, that gives you a flavor of what we do. Just to wrap things up from my perspective, if you want to learn more about Arcadia, here are some links we can leave up. I'll turn it back to Eric to take any questions that we've got.
Great, and we do have questions. We have a bunch of good questions. Let me just dive right in. You were showing a pretty cool demo there, by the way. I love that Spot. I love the machine learning surfacing what to look at. That's one of the best use cases for machine learning, it seems to me: to help separate wheat from chaff and point you in the right direction. Lots of questions here. One is, the TV viewer data that was shown, is that structured or unstructured? What kind of data was that?
Yeah, I think that was just an open data set. I didn't get the data set myself. I believe that was structured data. We have examples of taking JSON and visualizing it on the fly without flattening it, but that was not the case with the TV viewership data.
Okay. Let's see, one of the users is saying that, according to their company's policies, they have access denied to external file sharing or storage. Do you guys have any ways around that? What would your recommendations be there?
I didn't quite get that, so they are not allowed to do file sharing? Was that the question?
Right, external file sharing or storage.
I'm not sure exactly what they're asking, but I guess the point I would make is that, again, all the data stays in the data lake. You can have the option to allow somebody to download that data to Excel or whatever they want to do with it, but in some cases that's not wanted. We have a very large healthcare organization whose challenge was they had traditional BI tools where people were downloading data to their desktops, and they had to try and keep track of all that from a data governance perspective.
That was one of the reasons they wanted a native BI approach where the data could sit. People could still do their query, reporting, and analysis in one environment, but they could be restricted from pulling that data down to separate systems somewhere.
Okay, good. Here's another. A lot of really good detailed questions, folks, and if we don't get to yours in this event, we will forward them on to our presenters today. Here's a question from an attendee asking what kinds of measures are there in the architecture for data sensitivity like masking data of sensitive information, et cetera? Can you speak to that a bit?
Yeah, from a high level, we will inherit any existing security protocols that are in the underlying data platform, meaning if you've got Apache Sentry or Ranger or some security model within the cloud environment, we inherit those role-based access controls. Now, last I checked, Cloudera and Hortonworks may have developed some more things around masking and such, but a lot of times we'll use a third-party system. I'm forgetting some of the names now, but we've partnered with a lot of those third parties in the security space that would do the data masking and things like that.
We don't provide the full granularity of all those different security protocols within our system, but that's why you have a lot of these third-party providers. Anything that's in a project like Sentry or Ranger, we can leverage.
Here's a really good question, and Wayne, feel free to chime in on this. I'll throw it over to Steve first, and Wayne, if you want to chime in: what concept in the architecture replaces the cube data mart, as in Essbase? I think that's just the massively parallel nature of the technology, right, Steve?
That's a very informed question. I was careful, or I try to be careful not to say the word cube because I think a cube has a very specific notion in people's heads, like Essbase, where again you're building that cube in advance. You can build multiple cubes, and it becomes an IT and overhead burden at some point.
We've tried to minimize that burden. As I was talking about, we call them analytical views, but they're really much more than a view. There is actually a notion of dimensionality and physical data structure and modeling, both on disk, on the file system, as well as some things we do in memory. You could call it a dynamic cube if you want, but we don't force you to build it all in advance. We build it incrementally over time, and we recommend dimensions to add to speed up query performance over time.
Again, we call it an analytical view, and it's part of our smart acceleration process, but yeah, you can call it a cube if you want. It doesn't have some of the legacy baggage of what people think about. I'm not trying to bash Essbase for sure or anything like that, but it was designed for a different purpose.
Right. Wayne, do you want to comment on that real quick?
Yeah. We're definitely moving away from the world of physical cubes, whether in Essbase or a product like that or even out in the cloud. It seems that with all the horsepower that we have in in-memory processing, we can build these dimensional views on the fly or maintain them in a dynamic cache like Steve was saying. There's a lot of vendors who are doing this kind of thing, each with their own twist on it.
Unlike a lot of these vendors, who pull the data out into their own scale-out in-memory cache, with Arcadia, they're building that right where the data sits, so it doesn't get moved. I like what Arcadia's doing because the data doesn't go anywhere. It just stays in Hadoop, not moving outside of the cluster. Yeah, there's lots of ways to skin the cat, but the days of the physical cube seem to be pretty much over.
Yeah, we've got a bunch more good questions here, folks. Thanks for sending these in. One attendee here is asking about metadata. Can you talk about metadata management and what kind of functionality you have there? Obviously, there's some open source projects that have tried to address that, and I know that I've mused with other analysts that at least in the early days of the Hadoop ecosystem it felt like we were all making some of the same old mistakes again by not really focusing on metadata. Steve, can you talk about how metadata is handled with Arcadia?
Sure. Metadata is handled just like you would expect within a BI tool. We have the notion of a semantic layer, as in one example where the business person in finance can name tables and columns within the data lake based on the business terms that they're familiar with. There could be a different term within a different department, let's say, but it will map back to the same data.
We can also leverage any metadata that's been defined. Things that have been set in the Hive metastore and other systems that collect that. We partner with companies like Trifacta and StreamSets to leverage any ingest transformation types of things that they do and data catalogs and things like that with Waterline. There's a robust metadata environment around these things, which we all know is required to have a governed environment.
Early on, Hadoop didn't have as much developed in that area, but I think there's a robust ecosystem all around metadata management and data governance within these data lakes now. We take advantage of that, as you would expect a BI tool to. You can, again, define some of that within our tool and do some lightweight transformation work and naming of metadata and things, but we rely on third parties that specialize in those things, just like you would within a relational environment.
Okay, good, and folks, we will stop right at the top of the hour. Yours truly has a hard stop. I'll try to get to as many more of these questions as I can. Here's a really good one that I think lets you guys shine a bit. The question is around how you get data into Arcadia. Of course, you guys are right inside the cluster there, so the specific question is something like, how does a user define their own data lake? In other words, input to Arcadia Data? That's kind of the problem that you solved out of the box by embedding right inside the cluster, right?
Yeah, exactly. There's no sense of importing and moving data. We're just creating views. It's funny. We actually have a little internal discussion around the naming. If I go to this data tab here, I think I'm still sharing.
Yeah, you are.
What we call data sets are actually what I'd call a semantic layer. It's a view of data that's already in the cluster, so we just define what's in that data set through metadata when you pull it up. I'm not as up to speed on all these different connections and things like that, but yeah, there's no data movement. It's just creating a view essentially on top of the data that's in the environment and defining these different semantic layers, which you can, again, name with different terminology and measures and things like that.
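To illustrate the general idea of a metadata-only semantic layer (the class, table, and column names below are hypothetical, not Arcadia's actual data model), a "dataset" can be just a mapping from business terms to physical columns, rewritten into a query at read time, with no data copied:

```python
# Illustrative sketch: a semantic-layer dataset stores only metadata
# (business name -> physical column) and generates a query against the
# physical table in the lake. No data movement occurs.
class Dataset:
    def __init__(self, physical_table, column_map):
        self.physical_table = physical_table
        self.column_map = column_map  # business name -> physical column

    def to_sql(self, business_columns):
        """Rewrite business-term column names into physical SQL."""
        cols = ", ".join(f"{self.column_map[c]} AS `{c}`"
                         for c in business_columns)
        return f"SELECT {cols} FROM {self.physical_table}"

# Hypothetical example: two departments could define different datasets
# over the same physical table, each with its own terminology.
viewership = Dataset(
    "lake.tv_events",
    {"Channel": "chan_cd", "Program": "prog_name", "Viewers": "rec_cnt"},
)
```

Because only the mapping lives in the BI layer, the data stays in the lake and inherits whatever security and governance the platform already enforces.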
Okay, good. Yeah, Wayne, I'll just throw it over to you real quick. That's a very clever move by Arcadia, it seems to me, to embed right in there. The general movement in the industry is away from moving data. I remember way back when, to date myself here, Foster Hinshaw telling me about the whole concept of putting the processing where the data lives. That's the direction we seem to be going. Obviously, there's going to be a very long tail to the old way of doing things, but I think that's pretty clever, Wayne. What do you think?
Yeah, I've come out with a little manifesto about the 10 characteristics of a modern data architecture, and that's one of them: don't move the data. But we're not there yet. I would say 95% of environments aren't there, because most people are still pushing data out into a BI cache, like some of Arcadia's competitors do, or into a relational data warehouse.
We are still moving data around, so what I like about Arcadia is that it does hold fast to that characteristic of a modern data architecture. People are definitely using the processing power of scale-out in-memory architectures to reduce the need for a lot of backend modeling and preprocessing of the data in a cube or in a database, and spending more of their modeling time on the frontend semantic layer, like you just saw with Arcadia, where they're basically creating views of potentially complex data sets in the backend and simplifying them for end users.
Then relying on the processing power of the platform to pull all that data together in real time. Caching, aggregating where absolutely needed for performance and consistency's sake, but that also gets minimized compared to what we used to do. Everything was pretty aggregated, and there was no access to detail.
That's right. Yeah, it's a straw in the wind. Several other questions here. Can it access a Parquet file? I'm pretty sure the answer there is yes, Steve, right?
Yeah, I'd say our preferred format for the data is Parquet.
I figured that. Let's see. Lots of other questions, folks. I'm going to try to get to as many as possible. Here's a question that came in a while ago. Real-time data, real-time streaming data, does that call for any custom design or mechanisms in the data lake? How do you deal with streaming data?
Yeah, streaming data today, our integration works a couple of different ways. People will talk about Spark Streaming as one mechanism. In that case, we wait for that streaming data to land from Spark into the system, and then we visualize it from there, so it's not really real time, but it's subsecond once it lands, and we can visualize it.
Another really innovative thing is that Kafka, or I should say Confluent, released the KSQL interface to Kafka streams, or Kafka topics, I should say. That's now generally available, and we were one of the early adopters. In fact, we're the only BI tool right now that can visualize on top of KSQL. For us, it's just another connection. We've got a demo up on our website, a video showing that in action, that we can send out later, but that's something in our latest product release that people can download, explore, and try today with Arcadia Instant, connecting it to Kafka and KSQL for real-time streaming within the dashboard, and then being able to take action. If we've got time, I can actually pull up a quick little ...
Yeah, let's do it.
Demo of what that looks like. This was [inaudible 00:55:38]. We've got this IoT demo. I haven't tried running this one in a while, but let me pull it up. This is another one we built, again, with our partner, Cloudera. In this case, this is an environment where you're looking at, let's say, a fleet manager who is managing a fleet of cars, and they want to measure what's happening out in the field with those cars.
You've got an event stream. This is more of the real time information about where a car is located, are there different incidents that are happening? In real time, you're getting that information in here. You can see things updating. In this case, it's just writing data into the file system. Either it's Kudu or [inaudible 00:56:18], Solr index, which are also used in more real time types of applications. It's not true streaming. You're not reading it in-memory, but it's pretty fast.
Then you can drill to detail from here. Click into one of these specific VINs of a car that just got into a hazardous situation or something like that, and go into a detailed view of what's happening. Then you want to look at correlation analysis for that VIN and different things that are happening. Again, I haven't used this demo in a while, so some of the information is not there. The concept is you have a real-time dashboard that can be updated, and then you drill into detail because you've got all the information in one place.
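The continuous queries KSQL runs over a Kafka topic, mentioned earlier, typically take the form of windowed aggregations, for example a tumbling-window count per key. As a conceptual sketch only (pure Python over an in-memory event list, not the actual Kafka/KSQL integration):

```python
# Conceptual sketch: a tumbling-window count per key, the kind of
# continuous aggregation KSQL expresses as COUNT(*) ... WINDOW TUMBLING.
# Here the "stream" is just an in-memory list of (timestamp, key) events.
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Return {(window_start, key): count} for non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_secs)  # snap to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)
```

A real streaming engine maintains these counts incrementally and emits updates as events arrive, which is what lets a dashboard like the fleet demo refresh continuously.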
I love it. I love this stuff. Folks, we're watching the future here. It's that great quote by William Gibson: "The future is here already. It's just not evenly distributed." At least not yet. Like I said earlier, there is going to be a very long tail to the old way of doing things. You heard Wayne say 95% of environments are still dealing with largely batch processes and other ways of getting the job done, but this is the future. This is the direction we're going.
Big thanks to Wayne for his time today, and of course to Steve Wooledge of Arcadia Data. You will get that assessment popup when we close out this WebEx, so by all means, folks, please do take just three to four minutes. Go through that puppy. Let us know what you think. You can always email yours truly, [email protected]
Hope to hear from you tomorrow on DM Radio. We have some big news as well on that front. We're now coast to coast with DM Radio, from Jacksonville to Atlanta and Chicago, all the way out to Los Angeles. Hope to hear you on the show sometime. You can always tweet to me with the hashtag of DM Radio, and with that, we're going to bid you farewell, folks.
We do archive all these webcasts for later listening and viewing. Feel free to come back, share with your colleagues, et cetera. Otherwise, we'll talk to you soon, folks. Take care. Bye-bye.