Find deeper insights
within all your data.
Research Director | 451 Research
CTO & Co-Founder | Arcadia Data
WEBINAR, AIRED: October 25, 2017
Accelerating Data Lakes and Streams with Real-time Analytics
As organizations modernize their data and analytics platforms, the data lake concept has gained momentum as a shared enterprise resource for supporting insights across multiple lines of business. The perception is that data lakes are vast, slow-moving bodies of data, but innovations like Apache Kafka for streaming-first architectures put real-time data flows at the forefront. Combining real-time alerts and fast-moving data with rich historical analysis lets you respond quickly to changing business conditions with powerful data lake analytics to make smarter decisions.
Join this complimentary webinar with industry experts from 451 Research and Arcadia Data who will discuss:
- Business requirements for combining real-time streaming and ad hoc visual analytics.
- Innovations in real-time analytics using tools like Confluent’s KSQL.
- Machine-assisted visualization to guide business analysts to faster insights.
- Elevating user concurrency and analytic performance on data lakes.
- Applications in cybersecurity, regulatory compliance, and predictive maintenance on manufacturing equipment that benefit from streaming visualizations.
Announcement: The broadcast is now starting. All attendees are in listen-only mode.
Steve: Welcome everyone, this is Steve Wooledge. With Arcadia Data. I'll be your host for today. We're gonna get started in just a minute, [00:00:30] we're waitin' for everybody to get on, so we can have the full experience. Just give us a minute and we'll get started. Thank you.
All right. [00:01:00] I think we're ready to go. Again, this is Steve Wooledge, I'm happy to be your host for today. Thank you for joining. I'm also delighted to have our special guest from 451 Research on today's conference. I think there are a lot of exciting things happening in the world of big data, if you will. Data lakes and real-time streaming analytics are common topics that we hear a lot about from analysts like Matt Aslett, and a lot from our customers as well.
Our goal for today is to break down and talk about what's happening in the market, [00:01:30] but also what are some of the ways to approach this, and some takeaways from the deployments that we've seen. Just in terms of administrative stuff, we will do live Q&A. Sorry, not live, but we'll do Q&A at the end. There is a chat window, or a question tab within your webinar browser here, where you can add in those questions as you think of them throughout the session. I'll moderate them at the end. We'll have some time for Q&A at the end.
I'd like to introduce our speakers today. [00:02:00] We've got Matt Aslett, as I mentioned, who's the research director for the Data Platforms and Analytics channel at 451 Research. He has overall responsibility for the data platforms and analytics research coverage, which includes operational and analytical databases, Hadoop, grid and stream processing, search-based data platforms, and a whole bunch of other stuff, including machine learning and advanced analytics. I've personally worked with Matt for, gosh, it's probably been 10 years, done a number of webinars and live presentations. He always brings [00:02:30] a great perspective. He has spoken at a lot of different conferences, including Hadoop World, Hadoop Summit, some of those.
He's also been ranked by AnalyticsWeek as one of the top 200 thought leaders in the field of big data and analytics. Thank you, Matt, for joining today, and he will start us off.
Then Shant Hovsepian is Co-Founder and CTO. He is responsible for the long-term innovation and technical direction of Arcadia Data. He's an expert in high-performance data processing, query [00:03:00] optimization, and distributed systems. He previously worked at Teradata, through the Aster Data acquisition, where he was an early member of the engineering team, working on various features across the stack. Prior to that he worked at Google, where he worked on optimizing the AdWords database. I don't know, Shant, should I talk about the lemonade stand you started when you were four years old? Some of those things.
Very entertaining speakers, both of them. I'm glad to have them today. As I mentioned, we start out [00:03:30] with a perspective from Matt Aslett from 451, covering some of the research he covers. Then we'll turn it over to Shant to talk about some of the things he's seen with customers, and then open it up for Q & A at the end. If there's enough time, we'll have a live demonstration as well. That I'll pass over to Matt Aslett.
Sorry, actually there was a poll first, to set a little context for today, just want to see where people in the audience are in their big data deployments. There should be a live poll popping up here on your screens, [00:04:00] so just check that out. The top of the poll list is, early in your journey, thinking about scale-out platforms like Hadoop or the cloud. Then developing strategy, defining architecture, piloting, and then deployment. We'll give it about half a minute here, for all those responses to come in. We should be able to share that back live with everyone, so that'll set some of the context for the presenters today.
[00:04:30] Sam, I just clicked back on the screen, so I don't know if this will show or not. Are you able to push it live at this point?
Sam: Yeah, give it a couple more seconds for the last few to come in, then I'll push it live.
Steve: Pretty good.
Sam: All right, looks like most everyone voted. Overall it looks like about 21% [00:05:00] are still gathering knowledge, 38% are developing their strategy still, 26% of people have actually deployed their big data, and 15% are still piloting.
Steve: Okay. There, I see it now. The majority of folks are in that developing strategy phase. Although 26% deployed, that's higher than industry average, I would say. Some of the research I've seen is closer to 15% deployed. [00:05:30] I think we're still pretty early in the journey, in terms of replatforming and moving to some of these new architectures. Now, I will officially pass it over to Matt Aslett. Matt, if you're with us, you might be on mute.
Matt Aslett: I am, what a good start. Thanks Steve. Thank you everyone for joining us. Interesting to see that poll, as you say. That's higher than average, or than we would necessarily expect, [00:06:00] in terms of companies that are in production. Perhaps that speaks a little bit to the topic here, in terms of streaming and organizations taking it to that next level.
So, I'm Matt Aslett, I'm the research director of Data Platforms and Analytics at 451 Research. Before jumping into the slides, just to give you a little bit of an introduction to 451, if anybody has not come across the organization before: we are an IT research and advisory company. We were founded in [00:06:30] 2000, with a few hundred employees, including about 120 [inaudible 00:06:35], and 2,000 clients.
That includes technology and service providers, corporate advisory, finance professionals, and IT decision makers. The last thing I'll sort of draw your attention to on this slide is the next number, perhaps the most important ... There are 70,000+ IT professionals, business users, and consumers in our research community. This is something that we've been developing over the last few years, an increasingly important part of [00:07:00] 451 Research. These are people who are not necessarily direct customers of 451 Research, though many of them work for companies that are ... But they're individuals, people, practitioners out there in the field working with technology, using technology as part of their daily working lives.
Of course their home life is increasingly overlapping as well, and they help shape our view on the world by taking part in surveys and interviews. As I say, an increasingly important part of the organization. To start [00:07:30] off today, I want to draw your attention to this quote from George Dyson, the computer historian. He talked about big data being what happened when the cost of keeping information became less than the cost of throwing it away. The reason to start with this is that it's a key quote I first came across a few years ago, and just keep coming back to. It really highlights the fundamental point of big data, I think, for us. From our perspective, it's that it's now more economically [00:08:00] feasible to store unprocessed data that was previously ignored due to the cost and functional limitations of traditional data management technologies.
That really seems to be the starting point for a lot of organizations on their journeys towards big data and advanced analytics, and becoming more data-driven. One of the other key concepts we've seen enabling organizations to try and do this is the idea of the data lake. [00:08:30] The data lake, as a term, is credited to James Dixon, the founder and CTO of Pentaho. He first used it in 2010 in a blog post, and he was talking about basically an environment for storing a large amount of raw data, comparing it to a large body of water, which is fed from various sources.
The point being, from a [00:09:00] data perspective, it should be an environment that can be accessed by multiple users for multiple purposes. In comparison to a data mart, which Dixon talked about as being the equivalent of bottled water: it's been packaged up for a very specific purpose. This idea, we've seen, is very attractive. A lot of companies set out with this goal: we need to create a data [00:09:30] lake. Unfortunately, the concept didn't really come with instructions, in terms of what technologies were involved, or how to go out and create one. A lot of companies sort of set off on this journey in the hope of proving value somewhere down the line, without any real idea of what they would actually do with it, beyond getting all the data into one place.
Without [00:10:00] real clear identification of the requirements, use cases, and underlying technologies. Back in 2014, when you could see this was already happening, we argued that perhaps an alternative analogy, and one that was better thought through in terms of how you go about doing this, was the idea of a data treatment plant. Again, water comes into this environment, and there are industrial [00:10:30] processes required to make it available for various uses. We saw this as essentially what organizations were trying to do with their data lakes.
Of course, the data treatment plant, or the water treatment plant, isn't perhaps as attractive an analogy as a beautiful data lake. So we knew that one was never going to win, but that wasn't really the point. I think we've been somewhat vindicated by what we've seen in more recent years, which is an increasing focus [00:11:00] on those industrial-scale processes ... How you actually go about making data acceptable for multiple desired end uses, and multiple methods of accessing or processing data.
There's a greater focus on this data integration pipeline, from data ingestion, through preparation and delivery, out to discovery and visualization, with particular reference to self-service access. Which introduces another key trend [00:11:30] that we see happening in the industry: a lot of organizations trying to enable self-service data preparation, visualization, and analytics.
Also, the underlying data management and data governance capabilities. What we identified was that it was really those two things, self-service data access and the underlying data governance, that were the important factors in organizations converting this [00:12:00] idea of the data lake from a concept to a reality. They're really key to delivering a functional data lake environment. To the extent that we actually argued organizations should consider those underlying data governance and data management requirements before they even started embarking on data lake projects.
What we've seen, as these kind of environments have evolved [00:12:30] is there are, obviously other important factors as well. Data catalog is one, data lineage is another, and analytics acceleration is also important. The other thing is, this self service data access and underlying data governance, can only really be applied obviously, once you have data in the data lake. Another key aspect that has to be addressed is, that in order to keep that data lake fresh, clearly it needs to be constantly updated.
This is where you get to [00:13:00] the importance of streams, or stream processing, and data ingestion into the data lake environment. Historically, batch-mode data integration has really served us well, in terms of incrementally updating historical data sources, obviously during quiet periods. Increasingly, though, it's entirely unsuitable for driving real-time decision making on live data. That clearly requires live data streams, as well as integration of those [00:13:30] data streams with that historical data, on an ongoing basis. That provides the contextual basis for decisions.
What's really required is an approach that integrates data from multiple sources and makes it available for analysis as it is generated. You could think about this as perhaps being real-time data integration. The term real time, we're always a bit cautious about using, as it clearly lacks an agreed-upon definition ... Depending on which industry [00:14:00] you're in, and what your requirements are, you'll have your own definition of real time.
Also we've seen people talk about streaming integration, which is clearly relevant, and stream processing technologies are clearly important here. What we see as the key element here is not how, or how often, the data is generated or analyzed, but that the approach to integration means the data is [00:14:30] continuously integrated from those various data sources as they're updated. The data might be streamed, or it might not. It might be considered real time, depending on your definition, or it might not. That process of integration, from testing and release, deployment, operation, measurement, etc., has to be continuous.
Some of you obviously might be aware of continuous integration from a development standpoint, and what we've borrowed here is the key terminology [00:15:00] that applies in that space. One of the reasons is to make a distinction: we're not talking about continuous integration, we're talking about continuous data integration. But we think the elements are equally applicable. It's those concepts of a continuous process that are important, and can be applied to the process of developing, deploying, and managing data integration pipelines ... That are responsive to changing business and data processing [00:15:30] requirements.
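To make the continuous data integration idea concrete, here is a minimal sketch in Python. It is a hypothetical illustration, not any vendor's API: records from multiple sources are merged into one integrated view as they arrive, with provenance and arrival time tagged for governance, rather than waiting for a periodic batch load. All names (`ContinuousIntegrator`, the source labels) are made up for the example.

```python
from datetime import datetime, timezone

class ContinuousIntegrator:
    """Merge records from multiple sources into one integrated view
    as they arrive, rather than in periodic batch loads."""

    def __init__(self):
        self.view = {}  # key -> latest integrated record

    def ingest(self, source, key, fields):
        # Upsert: merge the new fields into whatever we already hold,
        # tagging provenance and arrival time for lineage/governance.
        record = self.view.setdefault(key, {"_sources": set()})
        record.update(fields)
        record["_sources"].add(source)
        record["_updated"] = datetime.now(timezone.utc)
        return record

integ = ContinuousIntegrator()
integ.ingest("crm", "cust-42", {"name": "Acme"})
merged = integ.ingest("web", "cust-42", {"last_visit": "2017-10-25"})
```

After the second ingest, the integrated record for `cust-42` holds fields from both sources, which is the "continuously integrated from various data sources as they're updated" behavior Matt describes.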
If you think about it, now we've got data streaming into the data lake and it's a more continuous process, and that's all good. What's interesting is, you know, ingesting that streaming data into the data lake is good. The integration of the data streams with historical data sources, on an ongoing basis, provides that contextual basis for decision making. However, real-time decision making on live data [00:16:00] is even better. That requires analysis of live data streams. It's a tiny sort of little additional line we've added to this chart, but it's an important one. The data can actually be analyzed by analysts as it is ingested into the data lake, not just once it's been ingested into the data lake.
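Analyzing a stream as it is ingested often means maintaining running aggregates over time windows, so results are visible before the raw records land in the lake. A minimal sketch, under the assumption of a simple tumbling window keyed by event type (all names here are hypothetical):

```python
from collections import defaultdict

def tumbling_counts(events, window_secs=60):
    """Count events per (window, key) as they stream past, yielding an
    updated count immediately, before the data is at rest in the lake."""
    counts = defaultdict(int)
    for ts, key in events:                       # ts: epoch seconds
        window = ts - (ts % window_secs)         # start of the window
        counts[(window, key)] += 1
        yield (window, key, counts[(window, key)])

# Three "login" events: two in the first minute, one in the second.
stream = [(0, "login"), (10, "login"), (70, "login")]
results = list(tumbling_counts(stream))
```

Each yielded tuple is the live aggregate an analyst could watch in real time; a production system (Kafka Streams, Spark, KSQL) would handle the same windowing with out-of-order data, state stores, and fault tolerance.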
What we're really talking about there is opening up access to this continuously integrated data, [00:16:30] beyond the developers and data scientists, who obviously have the skills and resources themselves to do that through writing code. Actually to the business analysts, the data analysts, and perhaps to some extent senior executives. There are always caveats to that. Opening that up anyway, in terms of democratization of data. This again is one of the key, sort of broader things we're seeing at play here. We see that enterprises obviously have always analyzed data. [00:17:00] Typically, it's been from the enterprise applications, into a data warehouse, and then IT professionals have been responsible for creating reports and dashboards for decision makers and data analysts.
What we've seen is an expansion in terms of the sources of data, out to web and application interaction data, mobile data, mobile apps. Obviously IoT, devices and sensors, etc. Also in terms of the platforms for storage and processing of that data, obviously [00:17:30] Hadoop. Cloud storage is increasingly relevant, and you've added stream processing and Spark as well. Then also, in terms of the consumption of that data, beyond IT professionals and decision makers, into ... Business users, scientists, [inaudible 00:17:48] even consumers, or customers, through data-driven applications.
Just to sum up, a very brief run [00:18:00] through of this space as we see it, just to set the scene for the rest of the discussion here. Some of the key takeaways: the data lake concept is clearly attractive. We see a lot of organizations trying to build environments of this nature. But the concept didn't explicitly explain what it actually was, or necessarily, specifically from a technology perspective, how to build one. A lot of early adopters set [00:18:30] off without a really clear idea of requirements or use cases. That's now changing, where we see a lot more emphasis being placed on those industrial-scale processes required to turn the data lake concept from theory to reality. To take that data from multiple sources, and process it in such a way that it is available for multiple users, and for multiple purposes.
As I said, self-service data preparation and analytics are [00:19:00] one level, and the underlying data management and governance are key drivers and enablers for that. That integration of data streams with historical data sources, on an ongoing basis, sets the contextual basis for real-time and event-driven decisions. However, that real-time decision making on live data also requires the analysis of live data streams, and that's where we see a lot more emphasis being placed now. In terms of, not just access to data once it's in the lake, but actually [00:19:30] the ability also to analyze data as it's ingested into that environment as well.
With that, thank you for your time and, as I said, if you've got any further questions ... Please contribute those, we'll get to them at the end of the session. For now, I'll hand you back to Steve. Thanks very much.
Steve: Awesome Matt. I love that slide you had with the data lake showing step one, get data in, something mysterious in the middle, and then [00:20:00] I forget what the words were ... Profitability and magic at the end. I think that's part of what we'd like to try and break down ... Is how do people actually get the value from the data lake? I'll pass it here to Shant in a minute, to talk about some approaches that we've seen customers take, around giving access to the end users and incorporating real-time streaming analytics.
Before I do that, we'll do one more poll, and this is a question around, for people that are considering architectures, in pilot phase, or deployed: [00:20:30] how are you planning to give your end users access to the data? A lot of times we see the data scientists being the one or only people going after the data lake and getting applications built, or access to it. There are development tools, or direct SQL access through some of the SQL projects ... Like Hive, Impala, Spark SQL, or Drill.
A lot of people have existing investments in traditional BI tools, that could be a way you're looking to give end users access. There's [00:21:00] also this concept of native distributed BI platforms that run within the system. There's been some interesting research in the market around that. Then there's other things that people do sometimes.
Go ahead and submit your questions, or submit your answers to the questions. Sam, when you feel like we've reached the boiling point, go ahead and flip that live. I'll try not to mess with my screen, so hopefully it shows live once you're ready here.
Sam: Looks [00:21:30] like we're getting a pretty good stream coming in still, give it a few more seconds.
Steve: Okay, great. I didn't know the history of the term data lake. I had heard a big customer use it once when I was at Teradata. I didn't realize it was from 2010, with Pentaho. I always learn something new with you, Matt, thank you. [00:22:00] How are we doing on the poll then, Sam?
Sam: Looking good. Let me share the results.
Steve: Wow, okay. Hands down, traditional BI tools, 61%. Second would be some of the native BI tools that are now available, then SQL access. I'm surprised that fewer people are doing development tools, direct Spark and MapReduce [00:22:30] types of development. But, great context. Thank you for that. I will now pass it over to Shant. Shant, you with us? You may be on mute.
Shant: Yeah thanks Steve. Hi everybody, thanks so much for joining. Thank you Matt for that introduction. I don't know about you guys, but I'm just thinking about waterfalls and lakes, and clear bodies of water. It's very beautiful slides, thank you so much.
[00:23:00] I will focus on squiggly boxes and arrows, because I'm the CTO, and this is how we talk about things. Quickly, to summarize some of the points that Matt was making: we've seen that data lakes are very comprehensive. You have a lot of physical hardware infrastructure, clustered distributed solutions, and really there's an abundance of data sources that you can bring into your data lake. We see everything from videos, to images, to traditional relational data, to individual Excel workbooks, [00:23:30] data warehouses, and next-generation free-text unstructured document data. Getting data into the data lake doesn't seem to be a problem. Data lakes have been really successful as an efficient, scalable storage medium that eventually gives you the opportunity to do analysis if you want to do it.
What we're especially excited about is seeing a lot more streaming sources, as Matt mentioned. Keeping the data in the data lake fresh, at regular intervals, kind of [00:24:00] that continuous development cycle for your data ops, integrated into the system. Of course, once all that data's in there, the biggest challenge, because of the diversity and variety of the data that's in there, but also just the demand from the user population to get access to that data, is really finding and getting insights out of your data. That's one of the biggest challenges that we're seeing out there in the field right now.
Everyone has some sort of data lake strategy in place; you kind of saw the results from that poll. But really getting and [00:24:30] extracting the value out of that, that's the stage where we're seeing a lot of companies try to figure out what the right approach is. I don't blame them for spending so much time doing it. It's a cumbersome problem, it's a little overwhelming. The data lake has such diversity and variety in it, so different from what we're used to seeing in our software.
Let's talk about three ways customers can accelerate value from their data lake. Number one, move beyond batch, enable live, real-time analytics. Of course, addressing business problems requires both [00:25:00] historical and real-time analysis at the same time. So much of our world happens in the now, these days. As this webinar is happening, I can log into Google Analytics and get a real-time view of how many attendees are actually visiting our website. At the same time, over the weekend, a Fantasy Football app gives me live stats as games progress, injury reports, what's going on ... My home has so many connected devices these days that I can get a real-time picture of what's [00:25:30] going on in there, temperature, even see if someone's in there that's not supposed to be in there. So much of our lives has become real-time, and it's time the business world catches up as well. That's essential, really, to competing.
Second, providing direct, interactive analysis to hundreds of users. The days of insights and data analytics being only for executives are over. Every single employee within an organization should have access to data. It makes them [00:26:00] do their jobs better. They get more insights, they get more understanding of why things are happening. The concept that Matt brought up of data democratization, of kind of opening up the why and the how to everyone in the organization, is critical for everyone's job function too. It makes it so much easier for someone to optimize themselves if they have access to the data that they need.
Lastly, let the data do the talking. We're talking about machine-assisted insights. When was the last time, when you were writing an email, [00:26:30] that you stopped and got a dictionary or a thesaurus, looked up the spelling of a word, or had to look up some kind of grammar syntax? We don't do that anymore. Our typing software, more or less, gives us recommendations about how to spell things, and when there are grammar issues. That kind of simple, nicely integrated assistance, built into the technology stack, is critical now at the data layer. As we have these data lakes with such diversity of data, we need the system [00:27:00] to help us along the way. Recommend, suggest things, to kind of make our jobs much easier.
We'll start with moving beyond batch: real-time analytics. These are the types of things we hear from customers, that you hear within your organization all the time: I want to respond faster to recent events. I want to be alerted immediately. Downtime, risks, in these cases, knowing when something is broken is critical. I also want to outperform the competition. Everyone's fighting for the same resources, and having that upper hand is critical.
A lot of what [00:27:30] you hear around why people aren't using real-time analytics: "I don't know how to get started, it's very hard to set up and maintain, I'm still trying to get the basic batch stuff working." We understand that there's a huge need for real-time out there; the world just hasn't really caught up with being able to deliver it. Don't fear the challenges, real-time can be achieved. It can provide real value, and we've seen it.
Let's talk a little bit about how. What comes to mind [00:28:00] when I say real-time visualization? This is a medical monitoring center right here, someone sitting in front of a bunch of screens. Here is a security operations center in a large enterprise, where a bunch of people are looking at a bunch of giant screens on the wall. Here is one of my favorite tidbits from The Simpsons; this is NASA. This was an episode about NASA launching [inaudible 00:28:22]. The director comes in, "How's the spacecraft doing?" Everyone turns around and says, "All of this machinery isn't for monitoring the spacecraft, we just want to know how many people [00:28:30] are watching the launch live on television."
The fact of the matter is, so many of the traditional use cases for real-time visualization were very hard real-time, high-risk scenarios. Really, what's happened is, no one wants to sit there and look at a dashboard and just monitor things. Humans aren't meant for this. Machines are great at it. In a lot of mission-critical systems, we're seeing more and more of what used to be traditional real-time dashboarding translate to automated systems [00:29:00] that just alert, pop up. Even in the case of space shuttles, self-heal. We don't need human beings staring at a wall of monitors for a lot of the real-time analytics that matter.
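The machine-monitoring idea above, letting the system watch the wall of metrics and only surfacing exceptions to humans, can be sketched in a few lines. This is a deliberately minimal, hypothetical example (the metric names and threshold are invented); real systems add rolling baselines, deduplication, and notification routing:

```python
def check_alerts(readings, threshold):
    """Machine-side monitoring: return only the readings that breach
    the threshold, instead of asking a human to watch every metric."""
    return [(name, value) for name, value in readings if value > threshold]

# Two live readings; only "cpu" exceeds the threshold, so only it alerts.
alerts = check_alerts([("cpu", 97), ("mem", 40)], threshold=90)
```

The human's time is then spent investigating the alert interactively, against both live and historical data, rather than staring at dashboards waiting for it.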
We can have our computers, we can have AI systems do that automatically. It's better use of human time, to actually be able to interact and explore the data and visualizations and analytics. Both real-time and historically, as opposed to just sit there and monitor things.
Let's just quickly define visual analytics. You can think of visual analytics as somewhere on a scale between, [00:29:30] on one end, charting and plotting your traditional data science types of workloads, MATLAB or all the newer sort of plotting libraries. Then on the other end, BI reporting: the traditional static report that some executives look at every morning. It gets sent to their cell phones these days; it used to be emailed as PDFs. Visual analytics is somewhere in the middle, where you actually do analysis by interacting directly with the visual. It's way more interactive than what you're used to seeing with static reporting. But [00:30:00] it's not the same as having to generate a plot or a chart on the fly every time you need to use it.
Which not only tends to be way more business-friendly (you don't need to actually write code to get any of these things working), but it's also not as simple as just staring at a pie chart that kind of tells you everything you need. There's a little bit more of that user-driven mentality to assist them, but it's point and click, as opposed to having to write in a different type of expression language.
We can incorporate more sophisticated analysis with [00:30:30] visual analytics. Getting more predictive capabilities and more machine assistance into the system is much easier, because you're going to drive those charts and visuals interactively. In the traditional reporting world, it's much harder to incorporate some of that more complicated analysis and structured data. On the flip side, in the traditional kind of plotting/data science world, it's all about the predictive analytics and insights. That kind of helps frame visual analytics: somewhere in the middle, where you can get the interactivity that you need, but you still get the user-friendliness that you're used to from your [00:31:00] traditional reporting platform.
What makes real-time visualizations challenging currently? Well, one, there's an architectural problem. You require a lot of steps, intermediary stores, transformation processes ... We'll talk a little bit about the architecture later, but it's an involved stack to get things going, to get that data from a message queue, through an intermediary store, onto something that you can analyze. Second [00:31:30] is just the lack of real-time visualization tools. If you look at the medical use cases, or even in financial services ... A lot of the visual systems that can actually update in real time were proprietary systems. They're not the type of traditional tools that we're used to working with in the enterprise.
Of course, these are complicated to set up, and there's a lot of data staging. You've got to do a lot of data modeling. Sometimes you can't actually respond to data changes immediately; you have to poll, which isn't very efficient. IT has to get involved. It's a technology stack, not a business stack.
Now that we have visual analytics and we have some of the real-time systems, can we do streaming visual analytics? Well, not quite yet, there's a little bit of an architecture problem we need to [00:32:30] resolve. The world of real-time had been relegated to proprietary, heavyweight applications. The web has done a lot to change that. Your web browser these days is a marvel of technology. There are very cool new standards that make it much easier to send data live to a screen. Where previously you'd think of a Bloomberg Terminal or some kind of heavyweight application to get real-time analysis, your web browser is, more or less, a Bloomberg Terminal, a window into the real-time [00:33:00] web. Also, our programming models have changed. It's become easier to build applications with things like reactive programming styles, which make it really easy to actually develop software in the context of streams in real time. From an application standpoint, that's become much easier as well.
Strategy number one, this is the traditional one, I'm sure everyone's heard of this: the Lambda Architecture. You have a batch system and a streaming system. It's a well known setup. You mirror your data between both of them, and [00:33:30] you consolidate the data between both systems in application-level logic, or as a batch job that runs periodically. You can get some real-time functionality, but at the end of the day, you still have a single source of truth. It's a complicated system to maintain. There's a lot of data duplication, and there's a huge abundance of application-level clients [inaudible 00:33:52] implemented.
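As a rough illustration of that query-time consolidation (hypothetical data and function names, not Arcadia's implementation), the Lambda pattern can be sketched as:

```python
# Sketch of a Lambda architecture's query-time merge (hypothetical data).
# The batch layer holds precomputed counts up to the last batch run; the
# speed layer holds counts for events that arrived since. Application-level
# logic consolidates both views when a query comes in.

batch_view = {"sensor_a": 1000, "sensor_b": 500}   # recomputed periodically
speed_view = {"sensor_a": 12, "sensor_c": 3}       # real-time increments

def query(key):
    """Serve a count by merging the batch view with the real-time view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("sensor_a"))  # 1012 = 1000 from batch + 12 from the speed layer
```

The duplication the speaker mentions is visible even here: the same events flow into both stores, and every client has to know to merge them.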
The second strategy is more of a Kappa strategy. I call this a staging store. You have your real-time stream; you don't keep a batch copy [00:34:00] of it per se, but you stage it into something that looks like a very fast key-value store. You end up keeping just one copy of the data, and you don't have to consolidate across different data sources. It's a lower latency system too, but you still have to deal with complicated security models, and you have an extra element in the stack to maintain, which is your key-value store. Some of those schema evolution and data change properties become very tricky.
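A minimal sketch of the Kappa idea (with made-up events): the append-only stream is the single copy of the data, and the key-value view is just a fold over it that can always be rebuilt by replaying the log.

```python
# Kappa-style staging store (hypothetical events): the stream itself is the
# one copy of the data; the fast key-value view is derived by replaying it.

log = [
    ("sensor_a", 5),
    ("sensor_b", 2),
    ("sensor_a", 7),   # later event for the same key supersedes the earlier one
]

def materialize(events):
    """Fold the log into a latest-value store; rebuilding means replaying."""
    view = {}
    for key, value in events:
        view[key] = value
    return view

view = materialize(log)
print(view["sensor_a"])  # 7: the latest value wins after replay
```

Schema evolution is tricky in exactly this spot: if the shape of the events changes mid-log, the replay function has to understand every historical version.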
Lastly we have native streaming. This is the ability to do analysis [00:34:30] directly on streams. This gives you linear scalability, and your analysts can ask any question they want of the data. You have a true real-time, low latency system, and it's much easier to maintain. There are fewer moving parts. Of course, this is a brand new horizon on a technology scale. We're really just seeing this type of stuff emerge on the market now, and it'll probably become very exciting at the start of next year.
For an example, at Arcadia we have [00:35:00] great connectivity with Confluent, and Kafka in particular. The ability to do analysis directly on topics and streams in Kafka lets you immediately get rich insights without having to store the data in another system. As for typical streaming capabilities, we mentioned being able to respond to alerts: real-time dashboards, alerting, actions, and being able to pivot from historical analysis into real-time. It's great if I can see in real-time that something is misbehaving. [00:35:30] I want to be able to drill down into the historic data for that system to see, "Is this a one-off? Does this happen every Saturday afternoon? Is there some commonality to the system?" It's critical to be able to do this without having to pivot between a bunch of different tools ... Doing swivel-chair analytics, so to speak, having to log into a different system for each data source.
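As a toy stand-in for this kind of direct stream analysis (in a real deployment this role would be played by something like KSQL over a Kafka topic; the event data here is invented), a windowed aggregation over events as they arrive looks like:

```python
# Toy analysis directly on a stream, with no intermediate store: count events
# per 60-second tumbling window. A streaming SQL engine such as KSQL performs
# this kind of windowed aggregation continuously over a Kafka topic.

from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """events: iterable of (timestamp_seconds, event_type) pairs."""
    counts = defaultdict(int)
    for ts, event_type in events:
        window_start = ts - (ts % window_seconds)   # align to window boundary
        counts[(window_start, event_type)] += 1
    return dict(counts)

events = [(5, "alert"), (42, "alert"), (61, "ok"), (119, "alert")]
print(tumbling_counts(events))
# {(0, 'alert'): 2, (60, 'ok'): 1, (60, 'alert'): 1}
```

The real-time dashboard case is this same computation run incrementally, with the per-window counts pushed to the screen as each window closes.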
Lastly, stream data enrichment. A lot of times the data that comes in, especially in the case of IoT, is very binary and numerical; you may want to augment it with some more [00:36:00] user-friendly data sets. Being able to do that direct analysis is really critical for a lot of these use cases. Some examples: everything from cyber security, where you wanna do risk and incident response immediately, to customer intelligence ... Understanding how your ad campaigns are doing in real-time, or how the latest blog post you've published is doing. Financial services has traditionally had a wealth of use cases, everything from risk modeling to trade surveillance, to stress testing. Then IoT, the world of sensors: looking at industrial [00:36:30] physical machines and predictive maintenance ... Predicting when they're gonna fail and reacting to those changes.
Tip number two, scale to hundreds of users. We call this smart acceleration. Matt talked about his analytics acceleration layer a little bit, and how it's an important piece of any kind of data lake strategy. In order to have hundreds, if not thousands, of users, giving everyone in the [00:37:00] organization access to the data lake, you really need some kind of layer that's aware of all the users ... That can do acceleration, and make the user experience scale out with high concurrency and richness.
For example, in Arcadia, we let you do analysis directly on the raw unstructured data that's sitting in the data lake. But as you're doing this analysis, the system understands what the users are doing, sees what's popular, looks at historical [00:37:30] trends, and tries to model new data structures that can accelerate that behavior. A lot of this we're used to in the traditional world of [inaudible 00:37:41], where you think about cubes and aggregate tables, materialized views, complicated caching mechanisms ... Different ways of trying to understand what a user is doing, and precalculate that expression before the user needs it. In the case of Arcadia, we let you do the raw storage [00:38:00] access, but also dynamically rewrite and retarget a lot of the user workloads, especially if they're shared. Think about data analysis as a Pareto principle, 80/20. A lot of what people are looking at, a lot of the analyses, are along the same dimensions and the same measures, and they're all looking at them at the same time. Being able to leverage the similarity and commonality in the analysis across your user base is critical to getting a high-performance system to [inaudible 00:38:27]
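The query-rewrite idea can be sketched in a few lines (hypothetical data and names, not Arcadia's actual mechanism): if a query is covered by an aggregate that was precomputed from observed workloads, serve it from the aggregate instead of scanning the raw data.

```python
# Sketch of "smart acceleration" via query rewrite (hypothetical data):
# shared, popular queries hit a precomputed aggregate; everything else
# falls back to a scan of the raw data in the lake.

raw_rows = [  # raw fact data sitting in the data lake
    {"region": "west", "product": "a", "sales": 10},
    {"region": "west", "product": "b", "sales": 5},
    {"region": "east", "product": "a", "sales": 7},
]

# Precomputed from observed workloads: total sales by region (the 80/20 case).
aggregate_by_region = {"west": 15, "east": 7}

def total_sales(region):
    """Rewrite the query to hit the aggregate when one covers it."""
    if region in aggregate_by_region:               # fast path: analytical view
        return aggregate_by_region[region]
    # slow path: scan the raw rows for an uncommon query
    return sum(r["sales"] for r in raw_rows if r["region"] == region)

print(total_sales("west"))  # 15, served from the precomputed aggregate
```

The point of doing the rewrite in the platform, rather than in each dashboard, is that every user sharing those dimensions benefits without knowing the aggregate exists.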
Arcadia has this concept of analytical views, which are [00:38:30] dynamic rearrangements of the data, both in memory and physically in storage, to help accelerate that behavior. You can do ad hoc analysis, and all of a sudden, a bunch of other users start logging into the system and accessing the same piece of analytics that you developed, or the same dashboard. They don't have to redo a lot of that analysis against the raw data. They can leverage the intermediate data sets and the intermediate calculations that you, as the user, already did.
All of this happens automatically in the system. It's rerouted dynamically, [00:39:00] so you don't have that administrative IT overhead in there. This is also great because it can scale linearly with the system; as you put more resources in your data lake, you can easily support more and more users, because all of this is running within that same data lake environment. You get automatic modeling, and you can keep those logical data models simple without needing to change your data. It's the whole schema-on-read versus schema-on-write idea. You don't want to constantly rewrite your data to optimize it for usage; keep the data in its [00:39:30] simple state, so that the users can understand it, and then optimize on the fly based on their [inaudible 00:39:35] patterns. That's really and truly the only way you can scale out to lots and lots of users.
Lastly, let's talk a little bit about recommendations. The data lake is complicated. There's a lot of different data analysis on there, and everyone is kind of trying to figure out their own thing. However, there's one system that knows [00:40:00] what everybody else is doing ... And that's the analytics platform. Your analytics platform can see what everybody else is doing; it's a little spooky, it has that big brother effect. There's a wealth of knowledge that can be leveraged by the system to help recommend the types of analysis that are common throughout the system. Look for intelligent tools. Basically, tools that are leveraging what other people [00:40:30] are doing in the system, trying to understand the data properties, using the hardware and the physical availability of the data lake to predict, do extra computation, and try to understand what the user wants to do.
You want machine intelligence built into your applications. Even if you're not doing predictive analytic workloads, your software should have AI built into it, to make your life easier. That's really important when we're dealing with the scale of data lakes. For example, here on the screen, you can see Arcadia ... [00:41:00] You have the ability to work with specific types of data and select the fields you're interested in. There's a magic wand you can click on that explores across all of the visual artifacts and their features. It gives recommendations based on historical analysis of the data, as well as properties of the data itself in the system ... To help the user figure out what they're really looking for and what they need, without having to comprehend all the nuances of the data set, or understand all of the possibilities. It really helps you see which visuals best represent [00:41:30] your data in which situations.
Some more example recommendations: you can imagine user-guided sessions where we're recommending the different properties, the different attributes ... It becomes very rich, very interactive, and the user can focus on results and insights, as opposed to trying to understand various nuances of properties or math, or expressions. You know, which color patterns express the cardinality of my data set [00:42:00] in the best way.
Just to summarize: if you need to accelerate value from your data lakes ... Move beyond batch, try to do these things in real-time, and get those insights as soon as they're available. Scale out the system to all the users in your organization; don't make data analytics something for the upper-class one percent, but open it up to as many people as you can ... You'll be surprised at the next-generation ideas [00:42:30] that develop within your organization when more and more people have access to data insights. Let the data do the talking. Don't expect everyone that's working with the system to know the ins and outs of everything. Have the software be intelligent, guide the process, and help the end users.
Steve: With that said, we're gonna do a quick demo of some of the cool types of real-time analysis that are possible in Arcadia systems, based on things we've seen from our end users.
Shant: Awesome, thanks Steve. This is [00:43:00] a connected vehicles use case, an IoT situation where you have a bunch of sensors deployed in a lot of automobiles. We have more and more companies doing what's called Usage Based Insurance. You have a physical dongle or a device that you plug into your car; it connects to a GPS cellular network, and it monitors your automobile's condition, safety standards, and also your driving ability. [00:43:30] Usage Based Insurance means instead of assigning arbitrary tiers of insurance pricing, why not pay based on your driving habits and your driving behavior? You're also monitoring the automobile: which manufactured parts break down, where they break down, and why they break down. Are there certain patterns in driving habits, or is it that one pothole on Main Street that's always causing issues? That's the reason for a lot of the maintenance that we see.
It's a really great use case, [00:44:00] and as for the amount of data, there's a bunch of analyst predictions, but a single automobile can generate, across all the sensors being deployed in the system, up to a terabyte an hour of data. If you think about the number of automobiles in the US, that's a lot of data that we'll be sending and analyzing through the system.
I'm gonna click on this event stream over here, to pivot into a real-time view of the data. This is monitoring a bunch of automobiles in the San Francisco Bay Area. [00:44:30] The sensors are deployed real-time on the devices and the data gets fed in. You get a sense that in this case, we're just looking at specific types of events ... Hazardous driving conditions, more or less, as they happen. You can get a view of the number of events over time, and you get this nice map of where all the data is coming from, kind of a heat map of the automobiles and what they're doing.
Over here you have the event stream, where you can look at a quick summary in real-time of speed, the location of devices, [00:45:00] the VIN number, or the device identifier, and the type of event that we saw. In this case over here, which is collisions, I wanna click on the VIN number. What this will do is pivot from this real-time view of all the event streams that were happening on the system, to looking at the historical data for that one specific automobile. We can identify interesting metrics: what happened before the collision? Here we can see that for the collision item we clicked on, there was some kind of hazardous [00:45:30] road condition that happened to the automobile, and then it quickly transitioned into a collision incident right after the hazard.
In this case, you're seeing a lot of collisions on the screen; this isn't actually real data, so don't worry. While the Bay Area has a lot of traffic and very bad driving behaviors these days, there aren't this many collisions happening all the time.
But we're about to look at some interesting metrics and do some analysis here. You see that we have an acceleration [00:46:00] aggression score, a braking aggression score, and a handling aggression score. You can see the number of miles this automobile has gone, and the brake-applied count. Looking at the brake count models, in this case we can see that their acceleration aggression is very high. This is someone that really likes to slam the gas pedal. Is this potentially a reason for some of their driving behavior? You can see they take the Bay Bridge pretty often. They're mostly focused in the San Francisco area, and sometimes they go to the South Bay. [00:46:30] If I'm some kind of insurance adjuster really trying to understand this user's behavior, looking at these metrics for that single user isn't enough. I might wanna do deeper analysis and pull in richer analytics across the entire user base. This is why the data lake really has value. Being able to look at all of the data, historically run some more sophisticated analysis, and really help us understand what the user's behavior is like.
For that, [00:47:00] we'll go into this analysis mode, where we can look at some more sophisticated trending analysis of collisions versus aggression score. We talked about the aggression score of that user, I think, being near the three range. This is a scatter plot across all of the VINs, all of the automobiles that we're tracking in the system. You're looking at a clear linear correlation between the number of collisions and the aggression score, with a couple of interesting outliers here. [00:47:30] As you have a higher aggression score, you're more likely to get into collisions, looking at the historical data. We can definitely tell that this driver has a high acceleration aggression, and that's gonna cause them to get into way more collisions.
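The correlation check behind that scatter plot can be reproduced in a few lines (the numbers below are invented for illustration, not the demo's data): compute the Pearson correlation between per-VIN aggression scores and collision counts.

```python
# Toy recreation of the scatter-plot analysis (made-up numbers): Pearson
# correlation between per-VIN aggression scores and collision counts.

import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

aggression = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical aggression scores
collisions = [0, 1, 3, 4, 6]             # hypothetical collision counts

print(round(pearson(aggression, collisions), 2))  # 0.99: strongly correlated
```

A value near +1 matches the "clear linear correlation" the demo describes; the outliers mentioned would show up as points far from the fitted line even when the overall coefficient is high.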
Let's look at some more interesting data over here: the average aggression here is at about 3.45, which is a little bit higher than what that driver was doing, so there may have been something else, like the hazard we saw happen before the collision. Then you can also look at acceleration aggression versus oil replacement, which is inversely [00:48:00] proportional in this case: you have to replace your oil more often when acceleration is more aggressive.
There's a lot of correlation with distance: as drivers have been driving more and more, they tend to have a greater degree of acceleration. We're seeing a lot of hot spots in this heat map over here, indicating that over time ... Drivers tend to get more aggressive in this area, as opposed to less aggressive. Something to look out for; it's kind [00:48:30] of counterintuitive to a lot of the insurance adjusting we've seen, where more experienced drivers tend to have lower premiums. In a Usage Based Insurance model, when we have this sensor data, we can really get a much richer insight into that user behavior, and price accordingly.
That's it. Thank you so much everyone. Steve, I'm gonna hand it off back to you. I think we got some more stuff to talk about.
Steve: Yeah absolutely, thanks for that Shant. Very cool that you were able to pull [00:49:00] off a lot of demo, all virtually, on this laptop here. All right, we're gonna wrap things up and have the Q & A now. I saw a few different questions come in. I'll leave up the resources page if there's anything you'd like to learn more about on Arcadia Data. I'll also put in a plug for 451; as I mentioned, I've worked with Matt and 451 Research for over a decade, and I'd say they've really built up some very unique and in-depth research. So if you haven't checked [00:49:30] them out, definitely do. We've got some materials on our site as well, but Matt does a great job assessing the landscape and all the technologies that are out there. I think one of his classic infographics, if you will, that's been out there is this depiction of the London Underground ... It does a great job of categorizing the different technologies and things out there. Definitely check that out if you haven't seen it before.
I'm just trying to find the questions here. Sorry, I'm a little [00:50:00] new to this particular webinar platform. One of the questions that came in was around streaming platforms. Shant talked a bit about Kafka specifically, but what's your view, Matt, in terms of the different streaming platforms out there like Kafka? Is Kafka competitive with Hadoop? Do they compete, do they co-exist, and how are companies looking at utilizing those different approaches?
Matt Aslett: I definitely think we see that they are complementary. [00:50:30] For a while, depending on the organization, which direction you're coming from, and where you started, that often sort of clouds your perspective of whether the different technologies are competitive or complementary. I'd say, bottom line, we see that they're complementary. I think Shant illustrated that really nicely talking about the Lambda Architecture and the Kappa Architecture, and obviously there's also native streaming. As we see organizations get more mature, perhaps around [00:51:00] the streaming-first approach to these kinds of initiatives and projects ... Then perhaps the focus shifts less to the Hadoop back end and more to the stream processing capabilities, be that Kafka or, obviously, the myriad of different, especially Apache, stream processing technologies.
I think the [00:51:30] emphasis changes, the focus changes, but the bottom line is that they are complementary, and back to what we were talking about: there's [inaudible 00:51:37] data in the stream, and then you also want to combine that with historical data, and clearly Hadoop is a primary platform for storing that, to get the context based on the combination of the two.
Steve: Okay, cool. Anything, Shant, that you [00:52:00] want to add to that, that you've seen around customer deployments of things like Kafka and Hadoop together?
Shant: Yeah, I think some of the more interesting ones ... We have a lot of the traditional streaming systems, everything from TIBCO to the JMSes, to messaging. I think the cloud has made this a lot more interesting as well, where we're seeing all the different vendors and providers with their own offerings, everything from Kinesis to Google's Pub/Sub, to Microsoft's new eventing system, and even event sourcing ... As an idea in general. We're seeing a lot of technologies, [00:52:30] and they're all very complementary; they're all meant to work with each other. I think the hardest part is just avoiding lock-in and thinking about what the right solution is across the diversity of the data sets. This isn't the era of IBM JMSes, where everything is the same and comes from the same vendor and same manufacturer. You're working with software and enterprise systems across the board. Thinking about a tool that really works across the whole gamut that's out there is critical in my mind.
Steve: [00:53:00] Yeah, that's a great point about avoiding the lock-in, cool. Okay. Another question here, I'll pass this to Matt to begin with. I think you've touched on some of this, Matt, but the person is asking, "Is it true that the data lake is not a brand of technology, but a concept of a collection of technologies?"
I'll just editorialize a little bit. I'd say my perspective was ... The term big data was synonymous with Hadoop at one point, [00:53:30] and data lake as well, but it seems like people are using different types of technology. What's your take on that?
Matt Aslett: Definitely, you made the point, it's a concept. If you go back to the original sort of use of the term, [inaudible 00:53:50] James Dixon was quite clear about the fact this was something used for storing raw data. He didn't specifically talk about what you would store that in; obviously, that [00:54:00] lends itself to Hadoop. I think a lot of early projects clearly were based on Hadoop and HDFS. Increasingly, obviously, we see organizations building data lakes where their primary storage would actually be more like cloud storage, like S3 ... Being the most widely adopted of the cloud storage systems. They're obviously available. Particularly, you see the ability to directly query that data using something like Impala, one of [00:54:30] the approaches that are available.
From that perspective, the definition of what a data lake is, and of the components involved, is very fluid. Actually, we've got a report coming out next week hopefully, or in the near future anyway ... Around the evolution of the data lake project. We see a lot of organizations moving beyond thinking about the data lake and toward a more strategic, managed, self-service [00:55:00] sort of processing environment, and what you end up calling that is open for question. It's not just about somewhere to store your raw data anymore. As we talked about here, it's evolving; organizations are taking different approaches to how that data is produced, and ingested, and analyzed ... And it's become a much more fluid and agile data processing and analytics environment.
Steve: Got it. That actually [00:55:30] leads into another question that was in here: "What's the difference between a data lake and a data warehouse? Is it just a new name on an old concept, or how do you separate them?"
Matt Aslett: Yeah. As I said, the original concept of the data lake was very much about storing, and processing, and analyzing raw data. If you think about a data warehouse, one of the core things that defines a data warehouse is that you define up front [00:56:00] the data model and the schema, because you know what questions you want to ask. A data lake, in comparison, is something designed to be inherently more flexible, to the point that it will be used for different purposes. Therefore, you cannot necessarily predefine how you're gonna store and analyze that data. There may be elements within a larger data lake environment, you could see, where there are use cases that require some sort of predefined data model [00:56:30] and schema definition. You'll end up with perhaps a data warehouse within a data lake. Everyone sort of sticks their heads under the covers and screams for a bit. I don't know what you'd call it; I suppose it doesn't really matter. There are use cases that require different technologies for different purposes.
Steve: Yeah. I like the way you separated the usage of each, where one's more predefined in advance. Very good.
A couple of other questions; we'll go in a different [00:57:00] direction. I'll bucket a couple together here and pass them to Shant to start. One is around real-time data, and the question simply is, "How do you deal with data cleansing on real-time data?" Another question in a similar vein is, "Lambda Architecture, is that obsolete? Or is it still useful in some approaches?" Which gets into the batch and real-time, and that sort of thing. Shant, you want to give a perspective on those two?
Shant: Yeah. I can start with [00:57:30] the Lambda Architecture. I'm a big believer that the most important thing about software and the job is getting stuff done. Normally I use a different word that starts with S to describe that, but I won't put that in this webinar. A Lambda Architecture is kind of easy to deploy, though it might be a pain to maintain. At the end of the day, if you need real-time insights ... Nathan Marz, who coined the term Lambda Architecture, developed it while he was working on Storm. He did it because [00:58:00] he had urgency; he needed to get something done. Use a Lambda Architecture if that's your best option, or that's what you can roll out the easiest. What's more important are the insights in the application that you deliver. I think in the industry, we're too focused on the nuances of the how and the architectural imperfections of everything.
I don't think Lambda is dead. I build Lambda systems to this day. If you need to get a job done, and you can easily deploy a Lambda Architecture, it is easy to deploy, so just [00:58:30] do it. Right? The important thing is delivering those results and those insights. I think the how is less important, especially because of the way a lot of this gets abstracted. You can go from Lambda to Kappa, or from Lambda to true streaming, without really changing the applications. Do what gets you those results first. That's always my best advice to people. Start small and get something out the door.
As far as the data cleansing goes, [00:59:00] there's a lot of interesting analysis there. There's a whole new world coming out now that we're talking about: streaming ETL. A lot of this is powered by what we're seeing in streaming SQL. I would say that's an interesting space to keep an eye on. Typically what happens, because a lot of the modern streaming systems, especially Kafka, have schemas in their serialization formats, is that you can actually apply a little bit of a schema to a message as it comes in. That gives you the ability to do cleansing. The [00:59:30] first piece is filtering; filtering in streams works really well. Data transformation is also something that's coming along the way; it's a simple expression application, creating new topics from an input topic.
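The filter-and-transform pattern described here can be sketched as follows (hypothetical records and field names; in practice a streaming SQL engine would express this declaratively over an input topic):

```python
# Sketch of streaming ETL (hypothetical records): apply a schema-aware filter
# and a simple expression transformation to an input topic, producing records
# for a new output topic on the fly.

input_topic = [
    {"device": "d1", "temp_c": 21.5},
    {"device": "d2", "temp_c": None},      # dirty record: missing reading
    {"device": "d3", "temp_c": 48.0},
]

def clean_and_enrich(records):
    """Drop incomplete records and derive a Fahrenheit field per record."""
    for r in records:
        if r["temp_c"] is None:            # cleansing: filter dirty records
            continue
        yield {**r, "temp_f": r["temp_c"] * 9 / 5 + 32}   # simple expression

output_topic = list(clean_and_enrich(input_topic))
print(len(output_topic))  # 2 clean, enriched records survive the filter
```

Because the transformation is stateless per record, it runs at stream speed with no staging store, which is exactly why this style of cleansing fits real-time pipelines.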
Combining with different data sets, from a cleansing perspective, is a little bit trickier. I think that space and that approach will become very viable very soon. I can imagine a world where people stop doing ETL as an advance process, and all [01:00:00] data transformation and cleansing happens on the fly. If the system can support it, why wouldn't you do it, is my question.
Steve: Very cool. Anything you want to add to that perspective Matt?
Matt Aslett: Yeah. I guess the key point there being, based on the requirement, perhaps if you're doing that real-time [inaudible 01:00:26] analysis of data in the stream, [01:00:30] I hesitate to say it, but the data quality might not be as important. I think there are different use cases where, if you want to see what's happening right now with that data as it's ingested into the organization, perhaps data quality isn't top of the list of requirements. Obviously you would pair that with the other requirements, perhaps for the same data, where the quality of the data matters more than the [01:01:00] time-driven sort of requirements, if that makes sense.
Steve: Very good. Very good. Gentlemen, I know we're one minute past the top of the hour. There's two questions left; I want to hit one, because I think they're important. It's around the cloud, but also one of the polling questions was ... People were saying they were gonna use their traditional BI tools on some of these platforms. So the first of those questions is, "Can [01:01:30] we provide," we being Arcadia Data ... This person says they work with Amazon's data lake service, as well as cloud big data services. "Can Arcadia be a front end to them?" And the question is, "Why would I switch to Arcadia Data versus something that's in that cloud environment, or maybe a traditional BI tool, like Tableau or something like that?" Can you talk a little bit about the cloud capability, and then why people would look at something other than their existing BI tool?
Shant: [01:02:00] I was hoping that was a question for Matt. Obviously I'm biased, but Arcadia works great on the cloud. We have maybe about 40-50% of our customers on cloud deployments right now, some very large. We do have direct access to a lot of the Amazon functionality as well ... The big deal there is elastic scalability. Why would you wanna use a modern BI tool, as opposed to one of the more traditional BI tools? [01:02:30] I think there are a few facets there. One is architecture ... With modern architecture-designed tools, being able to support real-time alone is a huge architectural challenge. From a BI perspective, it's not your traditional client-server type of model that you're developing to. You need to instrument and construct things very differently to support real-time.
The other one is being focused on big data, data lake types of use cases. For example, we have support for complex data [01:03:00] types, being able to work with semi-structured and unstructured data natively, but also not expecting everything to be a SQL data source. A huge use case we see is people integrating free-text search engines, things like Elastic and Solr, with their traditional data sources, being able to pull in some of the key-value historic data sets, or join with a boring data set that's sitting in an Oracle server somewhere. The data variety, and being able to build and design a tool from day one, not just for the scale, elasticity, and [01:03:30] performance, but also for the diversity and the variety in the data ... I think that's a big reason to consider a modern BI tool, because at the end of the day, with your traditional BI tools ... Everything looks like rows and columns. Everything is expected to have some kind of ODBC/JDBC SQL interface. The world of the data lake is not so clean-cut.
Steve: Very good. Gentlemen, thanks for staying over. Matt, thank you so much for being our keynote on this webinar. Shant, [01:04:00] thank you as well for your perspective. This was recorded, and we will send out the recording to everyone who attended. We'll get some e-mails out with that as well. Thanks everyone, have a great rest of the week, and we'll talk to you soon.
Sam: Thank you, thanks everybody.
Shant: Thank you Matt.
Matt Aslett: Thank you, cheers.