How Hewlett Packard Enterprise Gets Real with IoT Analytics
Webinar | June 26, 11 AM PT, 2 PM ET, 7 PM BST
Before blockchain, there was Internet of Things (IoT). But beyond the hype, how does IoT apply to real-world use cases?
Hewlett Packard Enterprise (HPE) is a Fortune 500 enterprise information technology company that makes tier 1 storage arrays used by data centers worldwide. To ensure quality service, HPE needed a data analytics platform that could:
- Monitor millions of incoming diagnostic data points each day
- Visualize this data at scale for internal business users and offices
Join us on June 26th at 11 am PT to learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale. We will discuss:
- How to scale IoT Analytics for 24/7 operations
- The metrics and insights that benefit customers, support, and product line managers
- Before and after: key metrics and results of scaling analytics securely across diverse users and teams to achieve business goals
- How visualizing data at scale across diverse internal teams can help achieve business goals
- How data lakes can be used to manage data at scale
Siamak Nazari, Chief Architect, HPE Fellow
Dale Kim, Sr. Director of Products and Solutions
Siamak Nazari is the chief software architect for HP 3PAR. In this role he is responsible for setting technical direction for HP 3PAR and its portfolio of software enhancements. His current area of focus is solid state storage systems and the software systems for a new class of storage systems.
Nazari has over 25 years of experience working on distributed and highly available systems. He has been working on HP 3PAR technology since 2000, responsible for designing and implementing the distributed memory management and high availability features of the system. He also architected the Virtual Domains and federated storage features of the HP 3PAR array. Prior to joining 3PAR, Nazari was the technical lead for the distributed, highly available Proxy Filesystem (pxfs) of Sun Cluster 3.0.
Dale Kim is the senior director of products/solutions at Arcadia Data. His background includes a variety of technical and management roles at information technology companies. While Dale’s experience includes work with relational databases, much of his career pertains to nonrelational data in the areas of search, content management, NoSQL, and Hadoop/Spark, and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in computer science from Berkeley.
Dale Kim: Hello everyone. Let's go ahead and get started. And welcome to our webinar: How Hewlett Packard Enterprise Gets Real with IoT analytics. I'm Dale Kim of Arcadia Data, and I'll be your host and one of the speakers. Joining me is Siamak Nazari, chief software architect and HP Fellow of Hewlett Packard Enterprise. To quickly summarize his background, Siamak has many years of experience working on distributed and highly available systems. He is responsible for setting technical direction of the portfolio of software enhancements for the HPE 3PAR division. His current area of focus is solid-state storage systems and the software for a new class of storage systems. And I'm the Senior Director of Products and Solutions at Arcadia Data, where I guide strategy for a big data analytics platform. I’ve held a variety of technical and management roles at IT companies with significant experience in nonrelational data platforms, including search, content management, NoSQL and the Hadoop ecosystem.
Dale Kim: Now, before we get started let me call out a few items. First, if you have questions on the webinar content along the way, please type them into the chat window and we'll get to them after the formal speaking session of the webinar. If you have any audio problems along the way, please let us know by the chat window and we'll try to address them. If we can't resolve the problem, be assured that you will get a recording of this webinar in a few days so you can watch it again or even share that recording with your colleagues. And of course, please join in by live tweeting. The Arcadia Data handle is simply @arcadiadata.
Dale Kim: Now let's get started with the webinar. Siamak will begin by describing how his team was able to solve an IoT analytics problem by using big data technologies. Siamak, take it away!
Siamak Nazari: Thanks, Dale. Just to give you some background on what we do: we sell high-end storage arrays, tier one storage, for data centers. And over the years we've sold a significant number of these storage arrays. These storage arrays report home a lot of telemetry information, and what the business really does is support all these systems out there. These systems provide tier one services, such as database back ends, SQL Server back ends, for our customers. And the customers are anywhere from small to medium-sized businesses to very large enterprises with hundreds of arrays at their site. At 3PAR, we are responsible not only for the architecture of the system but also for figuring out what the future system needs are. And we rely on this data that is coming back home to devise our strategy and figure out what we actually need to do in the future. The system has been designed over the years, and there have been a lot of transitions in technology, and these transitions include the introduction of flash. All of our transitions in technology are guided by looking at the data and the customer use cases, to be able to devise brand new systems in the future.
Siamak Nazari: Next slide, please. So these storage arrays sit in the data center, but there's a lot of metadata that they produce in general. The metadata includes environmental sensor readings such as temperature and vibration, the kind of commands that the customers enter, data on performance, and diagnostic-type events, such as when a piece of hardware fails. If an over-temp is detected, for instance, the fans go high. They're continuously producing this data, and it is actually sent back to HPE for analysis later. But we also send actual use case information, such as the number of volumes that the systems have created, how often those volumes get deleted, how often those volumes get expanded, and so on. So the use case for us on the back end is to figure out not only how the system gets used, for us to devise new systems and new policies and new designs, but also to try and understand if the customer really needs to expand their system. For instance, this is what they call a sales and marketing opportunity: if the system is oversubscribed, if its capacity is completely exhausted, or we can actually even project when the capacity is about to get exhausted, so we can have the right person call the right customer at the right time to make sure not only that they get their system upgraded but that they don't actually run into issues later.
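To make the capacity projection described above concrete, here is a minimal sketch in Python. It is not HPE's or Arcadia Data's method; it simply assumes that used capacity grows roughly linearly, fits a least-squares line to periodic usage samples, and estimates how many days remain before the array is full. The function name `days_until_full` and the TiB units are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def days_until_full(days: Sequence[float], used_tib: Sequence[float],
                    capacity_tib: float) -> Optional[float]:
    """Project days remaining until usage reaches capacity.

    Fits a least-squares line to (day, used) samples and returns the number
    of days after the last sample at which the line crosses capacity_tib.
    Returns None when usage is flat or shrinking (no projected exhaustion).
    Assumption: growth is approximately linear over the sampled period.
    """
    n = len(days)
    if n < 2 or n != len(used_tib):
        raise ValueError("need at least two matching samples")
    mean_x = sum(days) / n
    mean_y = sum(used_tib) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_tib))
    slope = sxy / sxx  # growth in TiB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity_tib - intercept) / slope
    # An array that is already over capacity reports zero days remaining.
    return max(0.0, full_day - days[-1])
```

A report built on a projection like this could flag arrays whose remaining days fall below some threshold, which is the point at which the right person would call the customer.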
Siamak Nazari: We also use the systems' back end to deal proactively with potential issues with the system. So for instance, if we notice a high number of drive failures or over-temp conditions, often we actually recognize that the system is in an over-temp condition before the customer realizes their data center has actually gotten hot, right? Because with an A/C failure, it takes some time for it to actually start to show itself. But since our systems are able to see the temperature and kick up the fan and call home, we actually often call the customer saying, hey, did you realize your data center is hot? So that's the kind of stuff that we actually use our systems for today. But as you can imagine, as the number of systems has increased, we have a lot of data coming home, and unfortunately, from the get-go we designed these systems so that they produce lots of text files, right? So the condition of the system is one text file. The number of volumes that the system has is one text file. What commands have been entered is a text file. And these come in on a regular basis from these tens of thousands of systems back into the headquarters, right? And we really needed to do some sort of real-time versus historical analysis and trends of these things, and that's very difficult to do with text files. When we had a few thousand systems it wasn't so bad, but once you get into nearly a hundred thousand systems it actually becomes an impossible task. It takes days to run those commands if you just run them over the text files.
Siamak Nazari: Next, please. So we talked about the scale. The scale was essentially tens of thousands of arrays, approaching one hundred thousand, and each one of those arrays individually produces hundreds of files a day, right? And you can imagine the number of data entry points that actually come back to headquarters now runs into the millions per day. And the volume keeps growing for a couple of reasons. Number one, we obviously are deploying more and more arrays in the field, but also, over time, as we upgrade the software, we've realized that we actually need additional telemetry information to be able to give a better customer experience to our customers, right? It's a combination of software upgrades that demand more information to be analyzed back home and also the large volume of systems out there, which essentially means the scale keeps growing. And these text files obviously need to be analyzed on a continuous basis, but they do need to be converted to some sort of analyzable format, right? Otherwise, you're always looking for individual pieces of information inside the text files, which is not particularly efficient if you have millions of them distributed over a whole bunch of different servers. So we took an initial stab at this by having a process by which the information from these individual files was getting grabbed out of the text files and inserted into a database. In some cases the actual text files themselves, if they were small, were getting inserted into the database as a blob, right?
This quickly became overwhelming to the database. In fact, we went with the most expensive server that we could get from HPE, which had a terabyte of RAM, and the largest database instance that we could actually configure, but within six months it was essentially not really performing at the level we wanted, and the queries were taking a long time, and any sort of hiccup with the database was essentially preventing us from being able to analyze the data and, in some cases, from actually reacting in a timely way to an event in the customer environment.
Siamak Nazari: Next, please. So what we had to do was step back and say, okay, this mechanism of directly looking at text files using various homegrown scripts was not going to scale. Inserting all that data into a database wasn't really working for us, and we started looking at what the possible options for us were and what we could do here, right? And the obvious answer was Apache Hadoop. We're not exactly big data experts here, and we have day jobs, but we really had gotten to a place where we had no choice but to look at another platform. And we did some experiments in Hadoop and it looked like it could handle the job, so that's what we actually ended up deploying, right? And that became kind of our plan of record in terms of inserting all that data. It fit the model of having millions of text files on that platform and being able to consume thousands of text files per hour, and that's something that can actually be done on that platform. But obviously, the next piece is how you actually analyze this data once it has landed on Hadoop. And we actually looked at a bunch of technologies, and the one that seemed to fit our needs best was Arcadia. It really was architected for big data, and it really could run directly on Hadoop clusters, so it didn't require us to go and reinvent the wheel and reinsert our data or do a bunch of data transformation. As you can imagine, we now have hundreds of terabytes of data, and it really is not that easy to convert it from one format to another format. The other piece that was nice is that we had, from the get-go, several use cases in mind for our data, right? The use cases were either for engineering to understand how the systems are being used, for the support staff to provide back-end support and analyze quickly, and also for sales and marketing to find additional marketing or sales opportunities, right?
So we really wanted a platform that could satisfy the needs of these very distinct groups, which are not only distinct in their needs but also very distinct in their ability to deal with the data and in their level of competence. That meant delivering the kind of interface each group could use, whether it's a direct SQL interface or a GUI interface, and also being able to actually change the models that are presented to them.
Siamak Nazari: Next, please. So this actually worked quite well for us. This was something that took less time than anticipated, really because the Arcadia folks were quite helpful in understanding what the use cases are and what the details of the system are. And after six months we essentially had a system up and running, and we are loading about sixty gigabytes worth of data per hour. That's the amount of data that's coming in from the field, and we are actually sticking it into Hadoop with an Arcadia front end. One of the biggest things we wanted to do with this was to have a series of pre-built reports for the marketing and sales folks, right? Things like the utilization trends, failure notifications, failure trends, the various cycles, and so on, so with the Arcadia people we built that. But the other nice thing is that the data queries can be modified on the fly to drill down and understand what you are looking at, so they're not static. In our previous model we were kind of running these reports overnight, and the next morning we had these reports, but you had what you had. You really couldn't change the report and drill down and say, well, let me look at the features of this particular system or the details of this subclass of systems, right? With Arcadia it's quite easy to go and actually say, well, this is a report that's talking about the entire field, but I'm interested in this subclass of systems because they were just released and you want to know what the experience in the field is, right?
Siamak Nazari: The other piece we talked about is the fact that we have multiple users of the system, right? And we didn't want to mix up different users having access to different parts of the system. So for instance, we're sensitive about our customer names, obviously, and we don't necessarily want everybody to have access to them. So while you may need access to the failure analysis data, you may not need access to whose system it is that you're running the failure analysis against, right? So this kind of limited access, this authorized-users model that Arcadia provides, helped solve that particular problem for us as well. That's something that we initially knew we had to handle, but the only way we handled it was to really just limit who could access the data, which made life difficult. In this model you just let them access it, but you set up access privileges ahead of time.
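The idea of letting everyone query the data while hiding sensitive fields can be sketched as a simple field-level filter. This is only an illustration in Python of the general principle; it is not Arcadia Data's access-control API, and the role names, field names, and the `view_for_role` function are all hypothetical.

```python
# Hypothetical mapping from role to the record fields that role may see.
# Engineering can analyze failures without ever seeing whose system it is.
ROLE_VISIBLE_FIELDS = {
    "engineering": {"system_id", "model", "failure_type"},
    "support": {"system_id", "model", "failure_type", "customer_name"},
    "sales": {"system_id", "model", "customer_name", "used_pct"},
}


def view_for_role(record: dict, role: str) -> dict:
    """Return a copy of record containing only the fields the role may see.

    Unknown roles get an empty view, so access is denied by default.
    """
    allowed = ROLE_VISIBLE_FIELDS.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


# Example record with made-up values.
record = {
    "system_id": "A-1001",
    "model": "example-model",
    "failure_type": "drive",
    "customer_name": "Example Corp",
    "used_pct": 87,
}
engineering_view = view_for_role(record, "engineering")
```

Setting the privileges up ahead of time, as in the mapping above, is what lets the same underlying data serve several groups without restricting who may log in at all.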
Siamak Nazari: The other piece that was difficult for us to do: our model was always that we were looking at the last snapshot of data, from the job that ran last night, right? Which really didn't help with our historical data analysis. We did some historical data analysis, but a typical historical data analysis would take a few days to run because we had to crawl through lots of data. Now, once that data is actually inserted into Hadoop and you have a SQL interface in front of it, the historical data analysis actually becomes possible. This is the kind of thing that really helps us understand if there is a trend, in the sense that a particular feature of the system is no longer in use or a particular feature of the system is picking up use. That's the kind of thing that matters to us because it essentially helps our ability to figure out which features of the system to focus on in terms of development, right? And this was really the first time we were able to do unstructured data analysis using kind of a relational interface, as opposed to always inventing a new mechanism on the fly because each file was a little different and each need was a little different, which was a very ad hoc model. While it worked initially, over time it essentially became very burdensome for us.
Siamak Nazari: Next, please. So we talked about the various use cases, but one of the use cases we were looking at was to help the sales folks, right? And it turns out the sales folks actually had developed their own database, but each region and each salesperson kind of managed their own arrays or their own set of customers, and they had all sorts of different ways of figuring out what the needs were. By having kind of a consolidated system and a set of interfaces, we are able to give the sales folks the ability to project when their customers are going to run out of either physical space or performance on a particular array, so we can head that off and say, hey, Mr. Customer, you really need to add storage, or this storage array is running out of performance and you should be thinking about migrating some of the load. You could actually even tell them which arrays in their data centers are underutilized, so they could actually rebalance the load among their arrays. And if they are just truly out of performance or capacity because all their arrays are fully utilized, then there's an opportunity for them to add another array into the environment. And this is kind of interesting: customers really, really like it when you head off a disaster. They don't really mind spending money if it heads off a disaster, an embarrassment for them at their production site.
Siamak Nazari: And the other piece that really helped us was being able to systematically look at our component reliability. We buy a lot of components from different vendors, right, and not all of them have the exact same kind of reliability factor. In the past it was quite difficult to figure out if a particular vendor's drives are more reliable than another vendor's, right, or, within a vendor, if a particular set of drives has particularly low reliability. And it turns out that we actually noticed a trend with one of our vendors in terms of reliability, because we could do the historical analysis, and it quickly became very obvious that the drives were not having the typical early-life failure but were having a failure that was occurring three months into it. This was very unusual, and they had missed it and we had missed it because we were just swapping the drives. But after a while, doing this analysis, it became obvious that there's a specific thing happening a few months into the life of a drive that is causing a very high failure rate, and that helped us to go back to the vendor and work with the vendor to figure out what was going on. In this case, luckily, there was a firmware fix that we could apply very quickly for our customers to prevent additional failures, right? And of course, the last piece, which we've talked about a lot: as the engineer that designs the system, I very much like to understand whether a particular feature we added to the system is used or not, right? Because often we invent things and the customers may or may not use them. If they use a feature, we know that's where we need to focus for additional enhancements, and if a particular feature isn't really being used, do we deprecate it and save resources, and on the engineering side focus on what actually matters to the customer, right?
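The drive-reliability finding above amounts to bucketing failures by how old the drive was when it failed and looking for a bump outside the usual early-life period. A minimal sketch in Python follows; the record layout, the 30-day "month", and the function name `failure_age_histogram` are assumptions made for illustration, and the dates are invented sample values, not HPE data.

```python
from collections import Counter, defaultdict
from datetime import date


def failure_age_histogram(failures):
    """Bucket drive failures by age at failure, per drive family.

    failures: iterable of (drive_family, install_date, failure_date).
    Returns {drive_family: Counter({age_in_30_day_months: failure_count})}.
    """
    histogram = defaultdict(Counter)
    for family, installed, failed in failures:
        age_months = (failed - installed).days // 30
        histogram[family][age_months] += 1
    return dict(histogram)


# Invented sample records: family "X" shows a cluster around month 3.
sample = [
    ("X", date(2020, 1, 1), date(2020, 1, 10)),
    ("X", date(2020, 1, 1), date(2020, 4, 5)),
    ("X", date(2020, 2, 1), date(2020, 5, 10)),
    ("Y", date(2020, 1, 1), date(2020, 1, 20)),
]
by_family = failure_age_histogram(sample)
peak_month, peak_count = by_family["X"].most_common(1)[0]
```

Plotting each family's counter as a bar chart is one way a cluster of failures a few months into service would stand out against the normal early-life pattern.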
Siamak Nazari: And then, I talked about the processing always being kind of overnight in the past, but now it's done in minutes, and often this really leads to us being able to ask questions that we couldn't ask before, because every time you run a query or you look at something, additional questions come up and an additional drill-down possibility comes up, and these kinds of complicated queries that we can run in seconds have really helped us. For instance, I talked about the particular drive family having issues within three months. That's the kind of thing that would have taken a dedicated person spending weeks on it, versus this stuff just popping up on a graph within a few minutes of choosing that particular drive type and a particular lifespan. We looked at it and it really popped out and was quite obvious.
Siamak Nazari: Next, please. So here are some examples of graphs that we use. The graphs include things like the number of drive types deployed in the field and the state of those drives, so very quickly we can get a sense of what the deployment model is, but beyond that you can actually figure out over time what the mix is. And the mix is interesting because we look not only at the drive counts but also at drive capacity, and that's the kind of thing that sometimes gets lost in the shuffle. You could say you have eight hundred of this type of drive and nine hundred of the other type of drive, but that's just a count; the capacity of the eight hundred may dwarf the nine hundred because they're just larger-capacity drives, right? And that tells us the kind of optimizations that we actually need to do. The other piece is that each one of these drive types tends to have a different performance profile, so the actual performance available to a particular system is a function of the type of drives that are deployed in that system. So just looking at the drive count by itself is not as illuminating as looking at the drive count and the drive type. That gives you a much better picture of what the available capacity and performance on that system really are.
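The count-versus-capacity point can be shown with a small aggregation. The sketch below is in Python with invented drive types and sizes (the 8 TB and 2 TB figures are assumptions chosen only to mirror the eight hundred versus nine hundred example), and `summarize_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_by_type(drives):
    """Aggregate drives by type.

    drives: iterable of (drive_type, capacity_tb).
    Returns {drive_type: (drive_count, total_capacity_tb)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for drive_type, capacity_tb in drives:
        totals[drive_type][0] += 1
        totals[drive_type][1] += capacity_tb
    return {t: (count, capacity) for t, (count, capacity) in totals.items()}


# Invented fleet: fewer large drives versus more small drives.
fleet = [("large", 8.0)] * 800 + [("small", 2.0)] * 900
summary = summarize_by_type(fleet)
```

Here the "small" type wins on count (900 versus 800) while the "large" type holds far more capacity (6400 TB versus 1800 TB), which is why a count-only chart can mislead.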
Siamak Nazari: Next, please. So this is another one. This is where the support folks care a lot about whether the field is updating to the latest software revision, which tends to have a lot of fixes, and this kind of colored graph gives you a sense of the trends in the software install base. So for instance, if you look at the middle one, with the red, you can see an increase and then a decrease, and then the pink is taking over and the brown is taking over, so that's when a software release is actually introduced into the field. And as you can see, as a new software release is introduced it starts to take up more and more, and the older software releases get upgraded to the new software release. That's the kind of graph that helps us understand which releases can be deprecated, so patches don't have to be introduced for them, and beyond that, if you find a particular release of the software that is particularly vulnerable, this gives us a picture of how quickly we can actually go and deliver patches to those customers to put the system in a much healthier state. And the one on the right actually shows the subrevisions, not just the main release but the various patch updates that go into the customer environment, and often we can very quickly tell if the pool of systems is in a healthier state, because we do know which revisions of the software are the most reliable. So that's part of it, and then the other piece is that you can also look at the actual license uptake, right, and the trend of that license uptake. So we can actually figure out which sets of licenses are being used and not being used, and why they're not being used, and that's when we go back and figure out with marketing and product management why that particular software license is no longer being acquired. Is it a change in technology?
Is it a change in the marketplace? And do we need to do something in the market to actually get more of those licenses sold? And in fact, it's not depicted here, but one of the big changes that we made was changing our licensing model from a model where each license was being sold piecemeal to a model where the licenses came pre-built. We wanted to have two or three simple license models, and we used this data to figure out whether it would have a significant material impact on us if we changed our licensing model.
Siamak Nazari: Next, please. So this is the other piece we talked about, the failure analysis. It is a big part of the story, right? Storage that doesn't work does not make for a happy customer. So over the years we've done a lot of detailed analysis of our drive failures, and drive failures are not just mechanical. Sometimes there's actually a correlation between a particular drive and the firmware that is on it, right, and that's the lower right-hand picture. So we were actually able to control for the quality of the firmware on those drives, because what we noticed is that sometimes a physical drive fails not because it mechanically failed but because there are issues with the firmware of that drive. And so we're actually able to look at a particular drive family, and at a particular firmware for that drive family, to see if they have a higher or lower failure rate compared to their brethren. And we can actually do that for performance too, because you can look at the service time from the drive based on the firmware. Sometimes the particular firmware on a particular drive actually causes performance issues, and that's something that needs to be looked at carefully to keep the system healthy over time.
Siamak Nazari: Next, please. So we talked about a sales opportunity based on both the capacity and the performance, right? This is the kind of thing that we actually use, and we can actually go back to the customers saying, look, this system is about to run out of space and you really need to think about upgrading. This is a graphical depiction of that conversation, where we are able to look not just at a single customer but by customer type, by system type, and by region. That helps a particular consumer of this data to see only the data that matters to them, as opposed to the data across the entire installed base, which may not be as useful to a particular person. Or, for somebody that cares about the entire installed base, they can see it as a whole, right?
Siamak Nazari: Next, please. All right, so the other piece: we talked about all these events that are coming back. Well, it turns out that what we've noticed is that sometimes the systems emit a particular event right before a failure, and these events are predictors of a failure that will occur sometime later, and this is a particularly powerful tool for us, right? So for instance, you can imagine a system failing two days after a particular event is emitted. If you can make those correlations, that means that you have two days to react when that particular event is emitted, for us to go do something about that system before the failure occurs, right? So we are able to actually do some analysis on a failure and then go back and look at the actual events that came back before the failure and try to find these telltale signatures and discover systems that are vulnerable to the failure, right? And this is an example of the queries that we can do. So for instance, we can look for a particular event getting emitted across thirty-one billion events that come in in a week and, in eight minutes, figure out those events that are a concern to us. If you just want to find a sample of those events to see if they're actually even getting generated, it takes only a minute or so to run. But this is the kind of thing that essentially means that we can take it to the next level, and we can actually have different versions of this report pre-built. There are different people concerned with software or hardware, or with different pieces of hardware, so each of them can really focus only on the type of event that they're concerned with. Is it the drives? Is it the power supplies? Is it the software itself emitting some sort of fault, some sort of event that is an indicator of a failure for us, right?
And this has helped enormously in figuring out the behavior of the system, and sometimes the misbehavior, as the case may be.
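The precursor-event analysis described above can be sketched as a simple correlation: for each event code, what fraction of its occurrences were followed by a failure on the same system within some window? The Python below is only an illustration under assumed record layouts; the function name `precursor_rates`, the event codes, and the two-day window are hypothetical, and a real deployment would run this kind of query on the big data platform instead of in memory.

```python
from collections import Counter
from datetime import datetime, timedelta


def precursor_rates(events, failures, window=timedelta(days=2)):
    """Estimate how often each event code precedes a failure.

    events:   iterable of (system_id, timestamp, event_code).
    failures: iterable of (system_id, timestamp).
    Returns {event_code: fraction of occurrences followed by a failure on
    the same system within `window` after the event}.
    """
    failures_by_system = {}
    for system_id, failed_at in failures:
        failures_by_system.setdefault(system_id, []).append(failed_at)

    seen = Counter()
    followed = Counter()
    for system_id, emitted_at, code in events:
        seen[code] += 1
        deadline = emitted_at + window
        system_failures = failures_by_system.get(system_id, [])
        if any(emitted_at < f <= deadline for f in system_failures):
            followed[code] += 1
    return {code: followed[code] / seen[code] for code in seen}


# Invented sample: code "E42" precedes failures, code "E07" does not.
events = [
    ("sys1", datetime(2021, 3, 1, 8), "E42"),
    ("sys2", datetime(2021, 3, 2, 9), "E42"),
    ("sys1", datetime(2021, 3, 1, 7), "E07"),
    ("sys3", datetime(2021, 3, 1, 7), "E07"),
    ("sys3", datetime(2021, 3, 5, 7), "E07"),
    ("sys2", datetime(2021, 2, 1, 7), "E07"),
]
failures = [
    ("sys1", datetime(2021, 3, 2, 20)),
    ("sys2", datetime(2021, 3, 4, 8)),
]
rates = precursor_rates(events, failures)
```

Event codes with a high rate are candidate telltale signatures; a pre-built report could then watch for those codes continuously so there is time to react before the failure.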
Siamak Nazari: Next. So that's where the other piece becomes interesting, which is that we can actually combine looking at structured data and unstructured data, in a sense, right? We can look at the trends and figure out if a particular system is having issues, or a particular set of systems out there is having issues, and then drill down and look at the actual raw event logs. So a moment ago, I talked about the fact that we can actually look at the trend of drive failures across systems, but we also want to look for these telltale signs that may be going off, and the telltale signs are often hidden in the raw event files. This is where we're able to actually look at a particular event, a particular trend, and then look at the actual ensuing events on that particular system to say, ah, so these are probably contributing factors. And then we can actually use that information from the raw event file and go back and create a query based on that particular event, and try to capture it on a real-time, continuous basis to head off potential issues. So in the previous slide I talked about the fact that we can look for the signature events, but the question is, how do you find the signature events? Well, this is how you find signature events. You look at the particular set of trends that are happening, you go to the raw event file right there on the screen, and once you find the relevant event, you're able to actually do this backward analysis.
Siamak Nazari: Next, please. So if you really have lots of devices, like us, that produce a lot of data, you really need to think about a big data platform in your production environment, and you really should think about the scale ahead of time. That's a mistake we made. We said, it's a few thousand devices and we can do it in an ad hoc way, but over time you're going to need to build something that can accommodate a lot more than that. And you may be able to do things with the standard platforms, but you will spend a lot of time trying to fix it over time if you don't think about it carefully ahead of time, right? And the other piece is that you should think about not only the data that is coming in but the consumers that are going to use that data, right? That's the other piece that matters. You need to architect your system ahead of time to think about the security aspects of it and who the actual users are that are planning to use this, be it customer support, be it marketing, be it sales, be it engineering, and just architect it from the get-go to handle those various use cases. Thank you.
Dale Kim: Great. Thanks so much, Siamak, for a great presentation. One thing that I really appreciate is your comment about customers really liking the ability to avoid disasters, and I think that definitely validates a lot of what I'm hearing around this notion of predictive maintenance. So rather than scheduling maintenance or reacting to some problem, being able to detect and be much more efficient, much smarter about maintenance to avoid some of these disasters, putting some of the effort up front to put a plan in place to be able to do the predictive maintenance versus using the more reactive model, I think is very important, particularly when it comes to the collection and analysis of IoT data. So thanks again, Siamak, that was great, and to all our viewers, I hope you gained some great insights on what you should be considering when deploying your IoT analytics and big data environment. Now, if you'll indulge me, I'd like to ask you to spend just a few seconds responding to a quick poll. So, I'd just like to know, and share with the rest of the audience, where you are with your big data or data lake deployment. So are you still in the gathering knowledge stage? Are you working on the strategy? Are you actually piloting or maybe even deployed? Or would you say you've got a very well-oiled system that's fully operational? So I'll give you a few seconds to respond.
Dale Kim: Okay, so let's take a look at the results. It's interesting that no one's in the piloting phase, but it is a bit what I expected in terms of the early stages of gathering knowledge and developing strategy. It's fascinating that many of you have deployed some use cases and are probably getting some value, so hopefully this webinar was good in terms of giving you some additional ideas. Maybe you're expanding out from a traditional data lake to incorporate more real-time data feeds or even IoT analytics. So it's very interesting, and of course, some of you are fully operational, so good for you, and I hope you will continue expanding out your deployments and try to get more value from that. All right, so now let's continue with the webinar, where I'll talk a bit about the higher level concepts around big data and how it relates to IoT analytics. I also want to talk a bit about how Arcadia Data addresses some of the requirements that you might have in your analytics environment. When I talk about some of the features and how we address some of the requirements, think about how you're addressing those requirements, how that might be the same as or different from what Arcadia Data has to offer, and then also think about some of the other requirements that you might want to address. So one thing I like talking about is the phenomenon that we're seeing on the market today where enterprises are seeking two separate BI standards. What that means is that they typically have a traditional BI platform from one of the big popular BI technologies, and they use it for the data warehouse, and it works pretty well for all of their requirements and use cases there, so they're pretty set. But they also have a complementary analytics platform, which is their data lake, and they're realizing that they need a separate BI platform for that data lake.
The reasons why you have a separate data warehouse and data lake are probably very familiar to you, but just to quickly review, some things I hear about include the flexibility for handling more types of data and more sources, both from within your organization and from external third-party sources. Then with that additional data, you can do a lot of data correlation and enrichment, so you get a lot more value from your data. And of course, as you're bringing in more data sets both internally and externally, you will face a scale challenge. So how do you address scalability in a very economical and cost-effective way versus using the traditional route of simply upgrading your servers? And, of course, some of the advantages of a data lake that people talk about include things like self-service and data agility, and those terms basically mean: how do you minimize the overhead on IT as much as possible? Certainly you can't eliminate it completely, because there are some tasks that require IT, but if you hand off a data lake platform to non-technical business analysts and business users, how do you make sure that they can get the most out of the platform without having to constantly go back to the IT team?
Dale Kim: So let me talk a little bit about how data and platforms have changed along the way, and I won't go into too much detail because I think most of you get this. When data started getting really popular with relational databases many years ago, you'll remember that it was all about data that people typed in, so it was definitely very constrained. It didn't grow very fast, and you didn't have a lot of the challenges that you face today. But now that you have real-time data sources, where data is being created by systems, the volume is much more significant. And the platforms evolved along the way. As some of these newer data sources looked to be very interesting, the platforms grew along with them. Now, certainly some of the traditional platforms have evolved a bit, so you might see relational databases incorporating data types like JSON and handling various other complex types, but only to a certain level. And so some of these other modern data platforms like NoSQL, Apache Hadoop, and Apache Kafka are intended to handle some of these bigger data challenges that you might have within your environment. And interestingly, not only are platforms responding to the changes in data, but I think that the platforms, as they evolve, encourage more types of data, so that you have both growing and evolving over time as you try to do new, creative things within your organization.
Dale Kim: And that brings us to the BI tools. If you think about what has progressed in BI tools over the years, they largely still have the same model of the rigid, schema-up-front life cycle that you find in relational databases and data warehouses. If you're looking at some of the advantages that you want to get from your data lake, then naturally it would seem that you need an upgrade to your BI tools to handle the data lake environment. So if you talk about some of the challenges that we're seeing within the industry when you try to use a traditional BI tool that was built for a data warehouse in a data lake environment, one of the things that you'll face is inefficient scale. That kind of makes sense if you think about the model around traditional BI tools. In many cases, when people try to deploy them on data lakes, they're treating the data lake simply as another repository, not a unique repository, not a progressive repository, but something just like a data warehouse. That means that you're going to move some of your data from that data lake to a dedicated BI server, and when you do that you have to compromise on scale. You could have a lot of data volume but only a few users, so not a lot of concurrency, or you could have a lot of users on just a subset of your data. So again, you're not getting the full value of all the data that you have put into your data lake.
Dale Kim: And then there's this notion of not really being efficient at handling data variety. With all the different data types that are coming in, particularly the newer, complex and nested types that we're getting in IoT environments, you really can't take advantage of them. While a lot of these BI tools nominally say that they can handle newer data types, what's really happening is that the IT team is involved and performs a lot of transformations in the background to put the data into a format that these traditional BI tools can handle. So there's a lot of work to be done whenever a new data source comes in or whenever a data source changes a little bit, and that inhibits some of the self-service ability that you want your business users to have.
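To make the background transformation work described above concrete, here is a minimal sketch of flattening one nested diagnostic record into the flat rows a traditional BI tool expects. The record layout and every field name (`device_id`, `drives`, `temp_c`, and so on) are hypothetical and chosen only for illustration; they do not come from HPE's or Arcadia Data's systems.

```python
# Hypothetical nested diagnostic record; all field names are illustrative.
record = {
    "device_id": "array-001",
    "timestamp": "2018-06-26T11:00:00Z",
    "drives": [
        {"slot": 0, "temp_c": 41, "errors": {"read": 0, "write": 1}},
        {"slot": 1, "temp_c": 44, "errors": {"read": 2, "write": 0}},
    ],
}


def flatten(prefix, value, out):
    """Copy nested dict values into `out` using dotted column names."""
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(f"{prefix}.{key}" if prefix else key, child, out)
    else:
        out[prefix] = value
    return out


def to_rows(rec, list_field):
    """Emit one flat row per element of the nested list `list_field`."""
    parent = {k: v for k, v in rec.items() if k != list_field}
    rows = []
    for item in rec[list_field]:
        row = flatten("", parent, {})   # repeat the parent columns on each row
        flatten(list_field, item, row)  # add the per-element columns
        rows.append(row)
    return rows


rows = to_rows(record, "drives")
for row in rows:
    print(row)
```

Each nested drive becomes its own row with columns such as `drives.errors.read`. Any change to the nesting means this step has to be revisited, which is the recurring IT effort the speaker is pointing at.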
Dale Kim: Then finally, related to that previous comment about the variety of data types is this notion of agility and the lack thereof. Think about the model that you get from a lot of these traditional BI tools within your data lake environment. A lot of times what you're really doing is having the IT team do a lot of work in preparing the data, and a lot of times that's unavoidable. So you go through that, but then you want to hand it off to your non-technical folks and let them explore the data. They go through discovery, build dashboards and so on, and then deploy a production application. But if there are any changes required in the underlying tables and the underlying data, then they have to go back to IT. So essentially business analysts only get a very narrow window in terms of what they can do in a self-service manner. If you think about any self-service application you've seen on the web, you might be able to do a bunch of things by yourself, but only up to a certain limit, and when you try to ask questions or do anything that wasn't baked into the application at the very beginning, you just go back to the IT team. So what we're trying to say here is that for an analytics environment, especially when you're trying to promote self-service, try to give as much of the capability as possible to the non-technical folks so that they can run the whole life cycle, building the dashboards and deploying to production, so the business users get access to the data as quickly as possible. And let me talk from an architectural standpoint to compare some of the architectures that are available out there. The first is what I referred to earlier, the dedicated BI server within a big data analytics environment. So you have your BI front end and then the BI server where you move your data.
And it's that BI server that typically is the bottleneck for a lot of the things that you want to do with big data. It can't scale out, because it wasn't designed for that, so it's typically not a cluster but rather just a single server that does the analytics. You have to move data into it, and then you have to do a lot of the activities like the transformations and securing the data within the BI server as well. So it's not an ideal configuration when you have a data lake and you're looking for some of the new advances and the new flexibility that you can get versus a traditional data warehouse environment.
Dale Kim: And the next big data BI architecture that I'm referring to here largely has to do with middleware. If you think about it, middleware is often about connecting two disparate pieces that otherwise would be very clunky in talking to each other directly. So you have the BI tools on one hand, and the data repository or the query engine on the other hand, and then your middleware plugs in and oftentimes handles things like query acceleration, so that when you're running repeated queries in the context of a dashboard or a front-end application, you get fast responsiveness. And again, because these two tools, the traditional BI tools and the query engines designed for big data, typically don't talk to each other that well and there are some limitations there, this middleware is trying to resolve that communication. But oftentimes there's a translation at the lowest common denominator level, and so you lose a little bit of the granularity and fidelity of the information that's being passed back and forth, so you don't get the types of optimization that you could if you had a native connection.
Dale Kim: Which brings me to the third architecture, this notion of a native BI architecture. That essentially means that you have an end-to-end platform that includes information from the dashboard as part of the processing, so that you can be more intelligent about what optimizations to create. And not only that, you can create some seamlessness around the whole analytical life cycle so that business analysts could, for example, build a semantic layer and then build visualizations based on that, share the semantic layer, share the visualizations, share the dashboards, and ultimately push to production with minimal involvement from IT. That's what we like to call not only native BI, but this notion of lossless. And so if that's what you have as part of your platform, with the dashboards tightly integrated with the backend query acceleration engine, then you have a lot of information to use for some of these optimizations that you need in the big data environment.
Dale Kim: So let's take a look at it a different way. So we have your data warehouse. This notion of running something on your data warehouse is simply not possible, and the reason why I bring that up is that the whole ecosystem around Hadoop was all about being able to run services or applications on the same nodes as the Hadoop cluster. You get a lot of benefits around data locality, meaning that your processing is pushed close to the data, so you're not reading over the network. You also get some advantages around parallelism and certainly scale. But this model is not supported within a data warehouse, which is OK, so let's just say that we move the BI server to a separate cluster, and here's where the problems arise. As part of the analytical process, you probably want to optimize at a physical level, which simply means building a schema that's designed in an optimal way for the use cases that your end users need, so that they get the fastest responses possible. There's also the semantic layer that typically your IT team will put together, and that simply is about developing the business meaning on top of the underlying tables, which would include giving a more human-readable name to some of the columns. So if you have something like fl_date, an IT person might know that it actually stands for flight date and might spell it out as part of the semantic layer. So, again, it's just some metadata that helps define what the meaning is behind the underlying tables. Then you have to secure the data, of course. In any environment, I would highly recommend that security is baked in as part of that initial step, so it's not about worrying about security later, it's always about thinking about security from the start. And then loading the data, of course, is going to be an important step, oftentimes a very IT-heavy task, because some of these traditional technologies require building that schema up front.
So you're doing a lot of the schema building before you can actually load and get started. And what this means is that you're doing the analytic process on both the data warehouse and the BI server, so you're duplicating a lot of effort, and that redundancy adds a lot of time. Now think about some of the big data requirements, like having that native connection. I talked about how having that tight integration allows you to share more information from the frontend to the backend, thus leading to greater gains in efficiency and optimization. Support for single-structure and multi-structure data is going to be important, as are parallelism and real time, and you only get those to a certain extent in the best cases, but oftentimes you really don't get these big data requirements addressed. And that's where a native environment makes a lot of sense, especially when you're deploying a data lake. If the analytics are being run on the same nodes where the data resides, you get a lot of these efficiencies, and that means that you're not running this analytic process on two separate environments. You're just running it on your data lake, and you address a lot of these big data requirements.
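The semantic layer idea described above, giving human-readable names to physical columns such as fl_date, can be sketched in a few lines. This is only an illustration under stated assumptions: the mapping format, the `apply_semantic_layer` helper, and every column other than `fl_date` are hypothetical and do not represent how any particular BI product stores its semantic layer.

```python
# Hypothetical semantic layer: physical column name -> business-friendly label.
SEMANTIC_LAYER = {
    "fl_date": "Flight Date",              # the example mentioned in the talk
    "dep_delay": "Departure Delay (min)",  # assumed column, for illustration
    "carrier": "Airline",                  # assumed column, for illustration
}


def apply_semantic_layer(rows, layer):
    """Rename columns using the layer; unmapped columns keep their physical name."""
    return [{layer.get(col, col): val for col, val in row.items()} for row in rows]


raw_rows = [
    {"fl_date": "2018-06-26", "carrier": "AA", "dep_delay": 12, "tail_num": "N123"},
]
labeled = apply_semantic_layer(raw_rows, SEMANTIC_LAYER)
print(labeled[0])
```

Because the layer is just metadata, it can be changed without touching the underlying tables, which is what makes it a natural thing to hand to business analysts.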
Dale Kim: So one thing I want to talk about specifically with regard to Arcadia Enterprise, our flagship product, and something you should also think about in terms of how you're going to address it in your data lake environment, is accelerating your queries. If you look in the bottom left of this diagram, some portion of your workload is going to be ad hoc queries, meaning that nobody else has run that query before and possibly no one else will run it again. When you're doing data discovery, you'll run a bunch of queries, maybe even type them in, and because the system knows nothing about these queries, oftentimes you have the expectation that it's going to be a little bit slower. But what about queries that are going to be run across many different users? You want those to be fast, and that's where Arcadia Enterprise can help with the notion of analytical views, which are simply data constructs that are stored within your big data cluster and that help to accelerate a lot of these queries that generally have a lot of commonality across the board. So even if you have many dashboards, a lot of these queries have some similarity that could be addressed by analytical views that help return results very quickly. So that's one aspect, and you've probably heard of a lot of other technologies out there that are able to help with speeding up queries. But what is the process involved in creating those optimizations? Oftentimes that is a really time-consuming process and definitely requires a lot of IT help. What if the system could watch queries for you and determine the right data structures to build to help boost performance? And that's what smart acceleration is all about in Arcadia Enterprise.
So you have a recommendation engine that watches the queries, understands the relationships between queries and dashboards, and maybe even understands some of the characteristics of the dashboards, like filters and filtering options. Understanding that can help to build optimizations for queries that may not have been run often but nevertheless are recognized to be important because they are in the dashboards. So getting a lot of information in that native environment helps to build the most optimal analytical views to boost your environment.
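As a rough illustration of the idea described above, the following sketch watches a query log, weights dashboard queries more heavily than ad hoc ones, and pre-aggregates data for the highest-scoring grouping. It is not Arcadia Enterprise's actual algorithm or API: the log format, the `DASHBOARD_WEIGHT` value, and the `recommend_views` and `build_view` helpers are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical query log: (source, group-by columns, summed measure).
QUERY_LOG = [
    ("dashboard", ("region",), "errors"),
    ("dashboard", ("region",), "errors"),
    ("dashboard", ("region", "model"), "errors"),
    ("adhoc", ("serial",), "errors"),
    ("dashboard", ("region",), "errors"),
]

DASHBOARD_WEIGHT = 3  # assumed: dashboard queries count more than ad hoc ones


def recommend_views(log, top_n=1):
    """Score each (group-by, measure) shape and return the top candidates."""
    scores = Counter()
    for source, dims, measure in log:
        scores[(dims, measure)] += DASHBOARD_WEIGHT if source == "dashboard" else 1
    return [shape for shape, _ in scores.most_common(top_n)]


def build_view(rows, dims, measure):
    """Pre-aggregate the measure by the given dimensions (a tiny rollup)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


# Hypothetical fact rows standing in for data in the cluster.
FACT_ROWS = [
    {"region": "EMEA", "model": "A", "serial": "s1", "errors": 2},
    {"region": "EMEA", "model": "B", "serial": "s2", "errors": 1},
    {"region": "APAC", "model": "A", "serial": "s3", "errors": 4},
]

best_dims, best_measure = recommend_views(QUERY_LOG)[0]
view = build_view(FACT_ROWS, best_dims, best_measure)
print(best_dims, view)
```

Later queries that group by `region` can be answered from the small precomputed `view` instead of rescanning the fact rows, which is the general principle behind accelerating repeated dashboard queries.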
Dale Kim: And here's an example of what it means to boost a query in terms of performance. For this chart, we recently ran a benchmark that included a bunch of queries, about thirty-eight of them, and this is from a report that we'll be publishing soon, but this is just a sampling of the queries and their response times. If you look in the lower left, you'll see that the label says accelerated next to unaccelerated. If you look very carefully, there's a bar that's very thin, just a sliver compared to the brown bar. The thin sliver of a bar represents the response time for an accelerated query, and the brown bar is the exact same query run on a SQL-on-Hadoop engine that was not accelerated. So you'll see that there's a stark difference when you do leverage the query acceleration in Arcadia Enterprise, and you'll see that stark difference across the board for these different queries. And along the bottom, from a bigger picture standpoint with regard to concurrency, you'll see the numbers five, ten, twenty-five and ninety-five. Those represent the numbers of concurrent clients in the benchmark that we ran, to give you a sense of how these things perform when there are many users together. You'll see that for twenty-five and ninety-five we don't have corresponding numbers for the unaccelerated queries because they took so long to respond, but at least I wanted to show you how the accelerated queries compare to some of the smaller concurrencies, and it still looks pretty good, so you get a lot of concurrency despite having complex queries. And this does not even include the caching capabilities within Arcadia Enterprise, so you can get a further boost from the in-memory caching that's done on top of the data structure based acceleration.
Dale Kim: And let me go through this really quickly, but what I'm trying to outline here is just the long process it takes if you treat your data lake as if it were a data warehouse using traditional BI tools. You have a number of steps: you land and secure the data, you do some transformations, and you necessarily have two separate environments, again the traditional BI server alongside your data lake. So when you go through these first three steps, you're really delaying your time to access your data. There are a number of steps along the way where you're going to move some data, and then you finally get to the analytics discovery phase, where your business users and business analysts are able to explore the data and then define use cases. So you've lost a lot of time along the way before you can even start the performance modeling. Contrast that with a native environment, where you land and secure the data as well, you build a semantic layer, and you do the analytical discovery. What's interesting here is that steps two and three are next to each other and often done by the same team, the business analysts, so there's a very quick feedback loop between adjusting the semantic layer and the analytic discovery, and that greatly reduces the time to insight. So now we're talking about days versus weeks or months. And then, as I talked about with smart acceleration, you can push this out to some of your users, have them go through the dashboards, and let the system learn from those queries and optimize, again removing a lot of the overhead from the IT team and freeing them up for some of the other tasks they're responsible for, so they don't have to worry so much about performance because the system will take care of it.
Dale Kim: Just a quick summary there. And the summary: scale without compromise, enabling real-time analytics, unlocking complex data, acting directly on your data for data discovery, very easy optimization, and productionization in a single step.
Dale Kim: And so just quickly, an overview of Arcadia Data. I think we're running out of time, but I just want to talk about this notion of data blending. Not only do you have access to your data lake, but you can also incorporate other sources like Apache Kafka and other data sources, so that you get that full view of your data, including sources that are traditionally relational, which you can include as part of your dashboard.
Dale Kim: And to wrap up my portion, there's a lot of information that you can read. Check out our resources area on arcadiadata.com, and feel free to download Arcadia Instant as well and check it out. So thanks for listening. Let me check out some of the questions that have come through.
Dale Kim: So I think we have time for a few questions here, and if we don't get to your question we can follow up afterwards. Siamak, let me ask you this question. One question that came in is, did you get any pushback along the way? If so, who were the detractors?
Siamak Nazari: Well, the biggest pushback really came from folks who were initially comfortable with using the RDBMS, the old traditional database, because they were really comfortable with that model, and they just weren't listening to the fact that the performance wasn't going to scale. At some point in time we sort of let it be until the system just couldn't scale anymore, and then they came around. So that was our biggest issue: people had built the model and they felt comfortable with it until it was obvious that it wasn't going to scale anymore.
Dale Kim: Okay, great. And, here's another question. I think it's related and I think maybe this is directed for your opinion. At what point did you believe that Hadoop was the right direction?
Siamak Nazari: I think, you know, I talked about using RDBMSs. It turns out that we actually used a number of different types of databases, right? Some of them are supposed to scale better than others, right? And it was really after the third try that we decided that we really just needed to rethink it. At that point in time we did a quick POC to see how quickly we could load the data into Hadoop, and it became pretty obvious that Hadoop could do the job where the others simply couldn't.
Dale Kim: Okay, great. And which of your users/departments would you say are the most innovative?
Siamak Nazari: I think, you know, initially I would have said engineering, just because I have an engineering hat on, but over time I've realized that there's a lot that goes on in the product management team in trying to understand how the system is utilized and how to deliver the best mix of features and function. Ours is a complex product, with lots of software and lots of hardware, and several lines of business and several lines of products, right? To figure out exactly what lines of products, with how much memory and capacity and what sort of physical config, there was a lot that they had to think through, and they came back with some really good ideas in terms of how we actually build systems that are much more useful to the customers, right? Even on licensing, for instance, they rethought the licensing when they looked at the actual trends in the data center and how people were using the system. So there it is. Innovation is a funny word sometimes, because if you make a customer's life easier just by changing the selling motion or changing some feature on the box, that's also innovation in the sense that it helps the customer deal with the overall infrastructure in a much more seamless way.
Dale Kim: Okay, great. And let me ask you one last question. How much of a factor was the speed of your incoming data?
Siamak Nazari: It became a huge factor over time. We had gotten to the point where we couldn't process data for twenty hours: data would come in and we weren't even able to process it, and that time lag was just growing. At some point in time it became obvious that we needed to get something that goes much faster. You know, we can never go real time, but now we're within a few minutes of the data having arrived: it's processed, inserted into the system, and ready to go.
Dale Kim: Awesome. Great. So I think we're at the top of the hour, so thanks again, Siamak. I really enjoyed having you on this webinar. And thanks to everyone for joining us today. If we didn't get to your questions, we'll follow up with you. And again, we will send out the recording of this webinar in a few days so you can watch it again at your leisure or share it with your colleagues. So thanks again and see you next time.
Siamak Nazari: Thanks.