When everybody wants Big Data
Who gets it?
Original air date: Thursday, March 24th 2016
See how you can put Hadoop to work with Unified Visual Analytics and BI.
The accelerating supply of big data is converging with accelerating data demand from everyday business users. What does it take to get from Hadoop as a data reservoir to Hadoop as a day-to-day data source for your business and end users?
The answer to ‘what’ is ‘how’ and ‘who’. Reducing architectural reliance on ‘small data’ technologies and broadening access to Hadoop hold the key to big data payoff.
Join Nik Rouda, Big Data Analyst and blogger at the Enterprise Strategy Group (ESG), as he hosts this webcast featuring guest presentations from real-world practitioners Tanwir Danish, VP of Product Development at MarketShare (acquired by Neustar), and Rajiv Synghal, Chief Architect, Big Data Strategy at Kaiser Permanente.
- Latest research on Hadoop adoption patterns and anti-patterns
- Putting users at the center of big data utilization and avoiding the data scientist paradox
- Architectural misconceptions that can tank big data initiatives
- Security and multi-tenancy strategies to accelerate adoption
- Retooling skills and organizational thinking when big data is the rule, not the exception
Presenter: Nik Rouda
Senior Analyst | ESG
Presenter: Tanwir Danish
VP of Product Development | MarketShare
Presenter: Rajiv Synghal
Chief Architect, Big Data Strategy | Kaiser Permanente
[Begin transcription at 0:02:15]
Nik Rouda: – David. I'm really happy to be here today. Just a quick intro to ESG, if you're not familiar with the Enterprise Strategy Group, we are an independent analyst firm. That means we do our own research, our own commentary on the market. We love to participate in these kinds of discussions, and our ultimate goal is to help technology buyers and users understand trends in the market, what's happening, what products and technologies are out there, and how various technology vendors fit into that big picture. So today, we're going to be looking at big data, more specifically, Hadoop environments, and I'll share with you a broad-level market view of what we're seeing happen in the industry. Obviously, after that, we'll get to hear specifically from a couple of end-user companies, how they're using this technology and what some of their challenges and questions and goals were along the way.
But I wanted to level set first, and share with you what we see broadly happening out there, so first of all, it may be no surprise to anyone here, big data is happening, and I mean happening in a big way. We found in our study this year that 63 percent of firms surveyed said that they plan to increase their spending on big data and analytics this year, and think about that. That's on the back of a number of record-growth years in terms of focus here, so this is a hot space of the market. It's a really important priority for a lot of businesses, and if you look at the chart on the right, you can see when we went and surveyed these respondents, people responsible for their data strategies, 48 percent said it was one of their most important business and IT priorities. Thirty-two percent said it was within the top five, so it's definitely a hot area out there. Now, it's important with that focus to get the most value possible.
Now, when you think about big data, I know it's a loose term. A lot of marketing people like to use it very freely. I'm gonna use it in a broad sense here, but there are different qualities that I think help define what's important in a big-data solution here. And you can see the response from our market here talking about speed of analytics, velocity, diversity of data sources, variety, wanting to have more access to analytics across different lines of business, reducing costs over traditional database, data warehouse type solutions, looking at more scale, supporting new requirements, and as well, knowing that the quality of the data is good. These are all important categories for people to think about, and solutions should, of course, be able to match and support these characteristics.
Taking it a little bit deeper, part of the cause for changing the way we're doing these types of solutions is to address challenges people have had traditionally with _____ analytics and even as they get into big data, and some of the common challenges that came out through our interactions with user companies are datasets being too large, difficulty spanning disparate systems, maybe silos of information out there, lack of skills within the organization to manage or analyze data. This isn't to imply that people aren't very smart. It's to imply that technology is changing rapidly, and it can be harder to keep up. Sometimes, organizations say there's a lack of collaboration between the IT department, the business analysts, and other people in the line of business, and that communication can be really important, or difficulty working with different data types, some structured, some unstructured, so being able to solve these challenges is really important to the industry.
Now, in terms of outcomes, we'll hear again from Kaiser Permanente, MarketShare about some of their goals, but common types of things we got back from our study was wanting faster tactical response to customer needs, being able to understand what their customers are doing and be able to act on that information quickly. A similar number wanted to reduce risk, be able to make smart decisions, not go based on experience or instincts, but be data driven. You see a lot of people talking about sales and marketing performance. It's important to recognize that, done right, you can definitely improve the top line, not just the bottom line, but that improved operational efficiency is another big part of it, finding opportunities within the business and within the IT department itself to save money, run leaner, and yet still achieve the right outcomes.
Now, to support this, there's been a lot of interest in Hadoop. Hadoop has been available for about 10 years. It is interesting to see a new technology emerge as a platform and see the adoption that's happening across the market, and when we surveyed, we found that 20 percent of businesses said they're already using Hadoop in some shape or form. Another 37 percent are very interested in Hadoop, and further on down to very few that aren't actually interested, so Hadoop is seen as a game changer here, and a lot of it has to do with the economics, but it also has to do with the flexibility, the performance, and many other characteristics that it brings, but as a new technology, Hadoop is also complicated. It's not just one piece. It's a broad ecosystem of tools, some open-source projects, some vendor-specific, and being able to get comfortable with that technology, make it productive for the business is definitely an important goal as we try and leverage this to achieve our desired results.
Some of the desires people want out of Hadoop is reducing costs. Economics is a big factor, being able to run on open-source software, commodity hardware. It's a very efficient way of storing data, being able to complement or offload BI tools, being able to set up a data lake or hub, a central repository that may or may not fit alongside a data warehouse, or maybe be a replacement for that in some ways for discovery, distributing those workloads across the environment, working with diverse or unstructured data. Some people are looking at real time or streaming, again, complementing or replacing data warehouse. Not so many are looking at it as a direct replacement for BI tools, but it may be that Hadoop itself wasn't designed that way, and there may be better ways to do that function in collaboration with a Hadoop environment, and we'll talk more about that.
So obviously, with Hadoop, there are different pieces that people need to get familiar with, and MapReduce is often seen as the front end to be able to build and interact with the environment, but MapReduce isn't very familiar to a lot of people. When you talk to business analysts and others, they see SQL as a language that they have come to know and love and are very familiar and comfortable with, so there've been a lot of efforts in the market to introduce SQL-on-Hadoop-type solutions. Last time I counted, there were well over a dozen different offerings, some open source, some vendor backed, some fully proprietary.
Now, when you look at that SQL-on-Hadoop offering, if you wanna be able to query the Hadoop environment as a data lake or maybe even data warehouse type approach, you need to be able to think, "Is my solution gonna deliver against the expectations of the users?" And those, again, may be business analysts. They may be coming from different parts of the community. What's important to them is performance, being able to work concurrently, being reliable with full _____ compliance, having the right security controls, being able to work with a lot of different file types, being able to work with different data types, having full _____ SQL support, being able to work _____ or schema on read, and being able to manage the environments appropriately in Hadoop, and so lots of different factors to be aware of and consider as you look into, "How will I do SQL that will work with this Hadoop environment?" And these are the things you should be considering and evaluating as you're looking into your options.
So I've mentioned Hadoop and data warehouse a couple of times, and Hadoop wasn't intended originally as a replacement for data warehouse, but we're starting to see a shift in the market here. I would say the most popular thought on how Hadoop complements a data warehouse is that it could offload or optimize or complement an existing data warehouse environment. Data warehouses are extremely good at what they do, but they're also somewhat expensive, and in some ways, a bit rigid. People find that if they can move ETL, if they can move BI, if they can move more exploratory or discovery-type functions or more unstructured data into a Hadoop environment, it works well alongside a traditional data warehouse.
That said, just recently, I've been hearing a lot from users, and it's reflected in the data here, that some people are considering actually replacing data warehouses with Hadoop environments. Now, that shows a lot of confidence in the technology and the maturity, that it's now performing critical business functions _____ a business to stand up and say, "We're willing to shift to Hadoop as the core of our data environment, our insights environment." That really does reflect assurance that they're gonna be able to do it successfully. It can't just be cheaper. It still has to work, but seeing that shift does suggest that with the latest round of technologies out there, either part of the Hadoop stack itself or complementary offerings, that you can certainly see different ways to work with or even replace data warehouse in your environment and still make it work for the rest of the organization.
Now, one of the challenges here is who is gonna do this? Who is gonna manage this? And I think there's a little bit of a myth out there about data scientists as – I've heard terms – unicorns or others that are kind of the saviors, seen as being able to run the entire environment and being able to cross from technical considerations, architectural, data engineering-type things, all the way up to deep understanding of the business processes, business applications, and business communities within an organization. But when it comes down to it, somebody's gotta actually be able to deploy it and manage that environment, and we see oftentimes, there's a split here.
Traditional IT infrastructure and application teams, operations teams, they tend to take ownership. A business can't go off on its own. As much as you hear about shadow IT, at some point, they're gonna have issues, and who are they gonna turn to? They're gonna turn back to their own internal IT groups for help and support. Less often is the business analyst, data scientist seen as the key resource, so that group may have challenges to solve in terms of data modeling, data access, queries, servicing requests, those types of things, but they don't necessarily support the entire environment. They _____ find an easier way to collapse the stack so that it can be done, and then you see on down the list a number of third-party experts that can be called in to varying degrees to supplement those skills, to supplement the expertise that's available in _____. Systems integrators, consultants, vendors who sell the business applications, _____, and service providers all can help serve to answer the problem in these types of environments, of how do we manage it? How do we make it accessible and actually leverage and utilize this BI-on-Hadoop-type environment?
But again, that points to need for skills, and when we surveyed, we found 36 percent say they had a problematic skill shortage here. Again, this doesn't mean that the people doing it aren't smart. It means they may not have been fully trained. Technologies have changed a lot and continue to change rapidly, and there just simply aren't enough of them to meet the demand. Go right back to where I started, where it was the number-one business priority for a lot of organizations. That sudden interest and enthusiasm for data is to be applauded, but you really need to work with the organization to be able to support it.
And so again, you see on the chart on the right here, when you ask across all areas of IT disciplines, where are there gaps in skills, and where would more skills development be helpful? – another way of saying that is where could vendors make it easier? – cyber security is a huge concern. Cyber security can reduce risk, but when you look at big data analytics, infrastructure management, app development and deployment, these are areas to actually not just reduce risk, but increase efficiency and drive top-line and bottom-line-type new services, as we saw from the beginning. So if we can find ways to make these environments simpler, that's gonna be really, really powerful, and that's where I wanted to leave off here and hand off to my counterpart, Tanwir, to talk a little bit about MarketShare and what their experiences were and how they came to the solutions they did, so Tanwir, over to you.
Tanwir Danish: Great. Hi, everyone. This is Tanwir Danish, and I am very pleased to have the opportunity to share with you our experience in the big data analytics space. Now, to put things in context, I thought I would share a little bit about who MarketShare is, so you can understand our use case, so very quickly, MarketShare has made a name for itself by really connecting marketing to revenue. We offer our services to Fortune 1000 marketers, and we do so – we connect marketing to revenue by analyzing customer interactions, each customer interaction that a marketer is having, and the resulting business outcome, as an example, a sale conversion, using data science, so a lot of predictive analytics, a lot of machine learning behind the scenes going on on large volumes of data, really big data scale.
So a little bit under the hood of how we solve for this problem: Our technology stack has really evolved over the years. We have been a big data shop since 2010, and from about 2012, here's a snapshot of what the different stages in our back end looked like, all the way from raw data to applying predictive analytics on that workflow, where we're really trying to understand which interactions with the customer are changing their behavior toward the brand to finally make a purchase, and then reporting it out in terms of rich insights on the customers themselves, but also what media course correction and ‘Buy’ changes one should make, to decision analytics, where they're getting prescriptive recommendations every day on how to change their media buys, such that they can optimize.
So back in 2012, we were pretty much a self-managed Hadoop shop, storing our data in an AWS S3 environment. Our predictive analytics was using Hive extracts from our big data, as well as using sampling techniques in R as a way of modeling in, really, an offline environment, usually on a server that was different from where the data was stored. And in terms of reporting workflow, we were using, again, a Hive extract from our big data back end into Oracle and then putting Tableau on top of it for visualization. And on the last bucket, it was a mix of Oracle and Memcached DB, where our application, which is essentially a Java-stacked application, was tapping into it to support what-if analysis and optimization algorithms.
So that stack evolved over time, and really, my focus here is to sort of zoom into the reporting workflow, which is where we intersect with Arcadia. In 2013, we introduced Altiscale, which is essentially a managed service for Hadoop. Fast forward to 2014, we realized that the modeling environment needed to be upgraded to a big data scale as well, so we introduced H2O, which is a distributed R environment for doing predictive modeling. And as we made good headway into that, we realized that in order to provide rich reporting, we needed to have more advanced visualization tools than what we were using back in 2012.
And this is where we met Arcadia, and our journey since then has been fabulous. I'm gonna share a little bit about the more specific story that we were looking to solve with Arcadia, and then fast forwarding into 2016, we have introduced Spark for our last bucket, wherein we are doing a lot of big data scale vertical analysis and simulations within the big data environment itself, vs. bringing it down to more RDBMS environment.
So here was the key need that we were looking to solve with Arcadia. We needed to understand the customer journey and impact of each interaction on the business outcomes. This was increasingly a need for enterprise marketers, who are really trying to turn their marketing from the traditional way of planning using media to one where they're trying to understand each customer interaction and really then analyze that to say, "How can I improve that?" and therefore impact the top line of their business.
With respect to this core need, there were several challenges. In order to understand that, let's first take a look at what does a customer journey look like. Here is a representative customer journey, where the customer is likely getting influenced by many marketing drivers, such as competitive spend, promotions that one might be doing. But then they get exposed to more addressable media, like a display ad, or they went to your website and then got retargeted on Yahoo.com, so on, so forth, leading up to perhaps the final purchase at the very end on your website, so we are really looking to map all this customer journey and then provide rich insight into it.
Now, in itself, this representative customer journey is reasonably simple, from a complexity perspective. However, when we analyze this for the entire marketing dollars that are being spent by an enterprise marketer and the number of customers and prospects that are seeing those ads and trying to map out their journey, the problem becomes much bigger. We are dealing with tens of data sources, usually 50 to 100 data sources that we are collecting data from, and those data are in the order of tens of terabytes and often in petabytes. The number of customer journeys that we are looking at is in the order of tens of millions, and any given customer journey is anywhere from 10 to 100 to 200 touchpoints.
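To make the scale concrete, the ranges quoted above can be multiplied out as a rough, back-of-envelope sketch. The bounds chosen for "tens of millions" (10 million and 90 million) are illustrative assumptions, not figures from the talk, and the function name is hypothetical.

```python
# Back-of-envelope sizing from the ranges quoted in the talk.
# The journey-count bounds are assumptions standing in for "tens of millions".

def touchpoint_range(journeys_low, journeys_high, touch_low, touch_high):
    """Return the (minimum, maximum) total touchpoints across all journeys."""
    return journeys_low * touch_low, journeys_high * touch_high

low, high = touchpoint_range(
    journeys_low=10_000_000,   # assumed lower bound for "tens of millions"
    journeys_high=90_000_000,  # assumed upper bound for "tens of millions"
    touch_low=10,              # shortest journey mentioned
    touch_high=200,            # longest journey mentioned
)
print(f"{low:,} to {high:,} touchpoints")
```

Under these assumptions the total runs from about a hundred million to many billions of touchpoint records, which is consistent with the terabyte-to-petabyte volumes described.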
So when we looked at this level of complexity, and then looking for an easy way to analyze that and be able to provide rich insights, we ran into these challenges with respect to our SaaS offering. First and foremost, the performance itself was one where we were really waiting for the reports to really load, and as you could understand, the customer experience was really bad as a result of it. The second problem we ran into was when we were initially analyzing at a holistic level, let's say at a DMA level, so Los Angeles market vs. New York market, and how is the marketing spend doing in these markets, the problem was large, but it was still not big data scale.
However, as soon as we then analyzed deeper into every customer journey, we are dealing with the challenge of what's known as powers of ten, as illustrated in this example. Here's a view of someone lying down in a park in the order of 10 to the 2nd, which is 100 meters zoomed out. As you zoom in further to 10 meters, you can roughly see that there's someone over there, and then when you really zoom in is when you truly know your customer, that this customer is a family person with a child with him and an outgoing, outdoors park kind of person. That's the level of detail we really needed insights to be delivered through our software to our marketers for their needs to be met.
So as a result, we brought in Arcadia to make progress and really deliver on the promise we had out in the market, so back in Q1 of 2015, we first adopted Arcadia, where we integrated it into our managed Hadoop Altiscale offering, and what is there is for one of our applications on our cloud platform. We introduced more visibility and performance and solved for those two challenges. Over the different quarters in 2015, we kept integrating Arcadia into our other product lines, including providing more visibility to our internal teams that are working behind the scenes, either on development of our SaaS offering or on our support services to our clients to address sometimes their more custom needs, and as this has grown over time, we launched a new app in 2015 called TV App. This is where we are trying to analyze television spending at a very, very granular level of detail and providing rich insights to the customer, and we cut down our time to market as a result.
And lastly, our strategy application is set up where we have most number of our clients, and we have introduced that on that app and hence achieved a little more of scale. And leading up to 2016, we are now focused on heavily cutting down the deployment timeline, the onboarding timeline for a given client, where the configurations are such that they're completely config driven for each one of the reports that we _____ for our clients to provide them rich insights on the customer journey are done through Arcadia in this instance.
So in summary, our experience has been one where MarketShare clients understand their customers better now, with the rich insights that we are providing them. In addition, our internal report development process has become much more agile, where we are _____ with our leading clients in _____ programs to create multiple versions of the report for the same need, so that we truly understand what's the best way to visualize it for them, as well as we have cut down on the client configuration life cycle and made it more efficient, so that has been our journey in terms of a key need that we have been working and partnering with Arcadia. Any questions, happy to take on later. Back to you, Nik.
Nik Rouda: Great. Thank you, Tanwir, so I'm really excited to hear some of the ways you have evolved in your utilization over the years, and some of the ways that it seems like Arcadia has _____ into your audiences. I know we've had a couple of questions come in. If more people have them, please send 'em through your browser window here, but I think at this point, let's continue on and go to Rajiv and hear a little bit about Kaiser Permanente and what their experience has been and how they came to this solution.
Rajiv Synghal: Good morning, everyone, and good afternoon, and good evening. My name is Rajiv Synghal, and I'm the chief architect as well as the chief evangelist for the big data architecture and strategy at Kaiser Permanente. I think Kaiser Permanente needs no introduction, but if you do not know about Kaiser Permanente, you'll probably get to make some view of what we deliver, what kind of health-care services we provide, and how we provide as we walk through the presentation itself.
So I'm gonna actually set the stage by asking a couple of questions, and hope to answer them through this presentation itself, so why do we need big data in health care, and how do we formulate the big data analytics in health care itself? And I made the first mistake by clicking it twice, so we're gonna go back to the first slide, so understanding the member is at the heart of bringing quality, affordable health-care services. Now, that slide was the synthesis of some of the early research, piecing together information about the importance of data, about the type of the data we gather or we need, sound bites like, "Health happens in between doctors' visits."
So we wanted to take every piece of data that is a determinant of member health and bring it all together. Now, there was an article published in the Journal of the American Medical Association in 1993, which was discussing the determinants of health and their contribution to premature death. So we took the first part of it, that is, what are the critical elements which are the determinants of health for a particular member itself, so if you look at the top right-hand corner, it is the medical care, which according to the study constitutes only 10 percent of the member health itself. It includes things like your physicians' visits, the lab tests, every X-ray or scan that has happened, every prescription that you have taken, every outcome that has happened for every episode itself. Now, constituting 30 percent is the family history and genetics, which is right below the medical care information _____ factors, which constitutes 30 percent, and this includes things like genomics, which is your hereditary factors, _____, which is the body mass index, _____, which is the protein levels, PSA levels, key indicators of the prostate health in men, _____, which is the full range of _____ RNAs expressed within an organism, which is a cell or a tissue.
The remaining 60 percent, which is very interesting to note, of an individual's health is determined by personal behaviors, coming in at 40 percent, which includes personal lifestyle choices like your eating habits, your exercise regimens, your vices, like smoking and drinking; and by environmental and social factors at 20 percent, where the social factors are like the affiliations, friends and family, communications with them, how often. All these are determinants of members' health at any given point in time. Now, to bring all this data together, organize it, and make it available for analytics in a single place requires a new platform paradigm and new tools and new structures, so that's where it goes back to Nik's comments that there is a new set of technologies coming out, which is in Hadoop.
So again, to click in twice, but so why do we need analytics in health care, and what is this changing need of the analytics in the health care itself? So health-care reforms are changing the health-care market conditions itself. There is a renewed focus on managing data and analytics, and this includes all types of analytics. I think you have heard it before – there is the descriptive, predictive, and prescriptive, right? The descriptive, predictive, and prescriptive analytics guide the community at Kaiser Permanente in finding new ways to deliver on its mission of high-quality, affordable care, whereas the proscriptive analytics guide on what to avoid.
So let me give you an example, so drug safety, long-term effects of a drug are not known at the time of the FDA approval, so Dr. Peter Handler, who's an MD, rheumatology at Kaiser Permanente, told me last summer, as we were working together on a research project, about the dilemma of the physician – "We are treating patients that we know very little about, of the diseases that we know a little, with the drugs that have been tested in isolation, so it is a challenge." So drug interactions across comorbid diseases are not very well known. He was quoting an example from his field, rheumatoid arthritis and autoimmune diseases. Which one should be treated first? Which drugs should be given first, which drugs to avoid? In absence of clear guidelines, common wisdom has to prevail, but this is where the cohort analysis can give guidelines to the physician who is making decisions at the point of care.
Now, this renewed focus on analytics can further improve care delivery, clinical operations, and member engagement. Now, let's take an example, right? Let's look at the care delivery first: Telehealth, remote biometric monitoring of Kaiser Permanente members with chronic diseases, predicting events, triaging them, recommendations, right? It's not actually triaging only the recommendations, but it is actually the options also for treating patients – treat them at home, get an appointment with your primary care, wait till tomorrow to go to urgent care, or take the patient to the nearby hospital, and calling on expert advice wherever and whenever necessary. So these are all the elements that need – are needed in order to do _____ analytics on the data in order to improve care delivery itself.
It also aids in running more efficient clinical operations, like through environment monitoring, right? Knowing about the extreme weather conditions will help the hospital administrators to adjust resource levels. Now for triaging recommendations, once the member has decided to avail the option of going to the closest clinic, we can always provide the navigation, step-by-step navigation to the nearby facility, thereby increasing the member engagement and satisfaction, right? Of course, a happy customer is a loyal customer, right? Provide food recommendations to members with diabetes, tracking and monitoring brand sentiment through social media, so what I believe is that learn and adapt is the new norm, right, be it a technology, be it a platform, be it an employee, be it a member, be it a care – a health-care service that we are providing to our members itself.
So what is the future of the data and analytics then in health care? We do envision a future where the advancements in analytical platforms will enable universal access to data and analysis. Now, Kaiser Permanente has deep roots in use of technology. Our history dates back to the 1960s, with Dr. Sidney Garfield and Dr. Morris Collen setting the stage for the use of technology to be the backbone of Kaiser Permanente in the delivery of the health care to its members. It was a simple declaration. They said, and I quote, "We should begin to take advantage of electronic digital computers." Now, that was the '60s.
Since then, Kaiser Permanente has been at the forefront of bringing technology in the delivery of the patient care, keeping accurate records, right? Fast forward approximately 40 years from when that vision was set, it translated into adoption of electronic medical health records, first of its kind in the industry. We were at the leading edge of the technology, so our future is going to be dictated by the decisions in the enablement and availability of the universal availability of the data and analytics, right, to the – to all the users, right, so be it the scientists, be it the clinicians, be it the business user itself, and ultimately, making these things available in the hands of our members as well.
Now, providing this universal access to data requires us to build the next-generation data platform. Now, securing member data and organizing it is key to foster discovery and access. We need to be able to provision this data or any insights garnered from this data to the right person at the right time to aid in the right decision making. What you see on the right side of this page is a blueprint of the architectural capabilities focused on securing data discovery, data orchestration, data provisioning, and all this is built on the Hadoop base, and then we name it as Landing Zone, and it's essentially a home to secure and organize data available for universal access for business consumption.
What is at the heart of this platform? It is a 256-bit encryption at rest, so cyber security definitely was a challenge, so we are looking at those particular elements itself. How do we secure the platform? How do we secure the data in the platform itself? Other things: User authentications, validation of user credentials, access authorization, role-based access – these are all the capabilities enabled natively in the platform. That was one of the choices that we needed to make, that do we need to go put these capabilities in different silos, or do we need to bring in all these capabilities into – and enable them natively in a singular platform itself?
So data is further organized into different zones. I don't know if you can see it clearly. We have created what we call is the raw zone and the refined zone, user-defined zone, and these all have been inherited from what has been done in the past in order to source the data from various systems in order to bring them together into a data warehouse construct, right, in a traditional world itself. So we are making sure that we stick with those concepts itself in the new world as well, so raw zone hosts the exact replica of the data from the source, so we are changing the traditional world ETL, which is extract, transform, load, to extract from source and load into the platform without transformation, which aids in actually using of this particular platform by the business community, because we take this seriously, as we are keeping – as we are looking to keep the semantic and syntactic equivalence, so they don't have to relearn that – what the new tables are, what the new columns are, and what is the value which is represented in the dataset is. So if they understood what was in the source, they understand in the data sitting in the Hadoop platform as well.
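[Illustrative sketch, not part of the webcast: the extract-and-load-without-transformation idea described above could look roughly like the following in Python. The function names, the directory layout (`raw_zone/<domain>/`), and the checksum check are assumptions made for illustration only; they are not Kaiser Permanente's actual implementation, and a real Landing Zone would write to HDFS rather than a local file system.]

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def land_in_raw_zone(source_file: Path, raw_zone: Path, domain: str) -> Path:
    """Copy a source extract into raw_zone/<domain>/ byte for byte.

    This is extract and load only: no columns are renamed and no values
    are changed, so the copy keeps the source's tables, columns, and values.
    """
    target_dir = raw_zone / domain  # hypothetical layout: one folder per domain
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copyfile(source_file, target)

    # Verify that the landed file is an exact replica of the source.
    if sha256_of(source_file) != sha256_of(target):
        raise IOError(f"raw-zone copy of {source_file.name} differs from source")
    return target
```

[Because nothing is transformed on the way in, a user who knows the source system can read the landed file with the same table and column names.]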
So further, we organize the data by domains in like clinical domain or a HR domain or a financial domain or a membership domain in the raw zone itself, so that they are readily available and accessible by domains and subdomains within each subject area itself, and by use cases in the refined zone, so you can look at it like the clinical analytics, which is like very specific use case. Could be a ETG, which is a episode treatment grouping, or EDIP, early detection, impending physiological change, so those could be done very much on a use-case-centric basis in the refined zone itself.
Now, Nick talked about the _____ and account on the – to reduce the – so there is a lot of the variations that you see today as the reports and analytics are run from the different siloed systems itself. To reduce and account for these variances in the reporting analytics across various use cases, there is a need to define a _____-defined zone, hosting the data which has been refined using agreed-upon business logic across the entire community. So we will apply the principles of the _____ _____ to create the datasets in _____ to harmonize the data from multiple-source systems, so multiple-source systems and _____-source systems have been providing the same functionality across different regions itself, right?
So we want to make sure that this data – this particular – the shared-refined zone or the common-defined zone is not available – this is kind of like more of a learn-and-adapt kind of a model, as we were hearing from what the business users want, so we are creating those particular ones. So that was one view of the _____ view of the world, and then there is the other view of the world, which was the _____ view of the world, of organizing the data into the cube _____ schemas and the summary tables. That's where the adoption of the Arcadia Data of the world and the other technologies come into the picture.
So I think that if we start looking at it how we are managing these datasets and what we were doing in the traditional world itself, so the new world of the big data is going to disrupt the conventional wisdom in data life cycle management, because we need to provide access to all data all the time, whereas the conventional wisdom was to put in place because – the conventional was put in place because of the capacity constraints of the traditional systems – limited storage, limited memory, limited compute, right? Use these computer resources wisely in your applications, right? Otherwise the things won't work, so the new world is very different, right, when we look at the Hadoop, right, the scalability, right?
Totally the power of the distributed store and the distributed compute, right, of course, thanks to basically the advancements and the engineers that are made, right, at Google and the Yahoo and the Facebook and the Apache open-source community. I think we are getting close to the access to the unlimited compute resources, or close to thereof, right? So new data – new world is all about the all data, internal or external, and is treated – always treated as a first-class citizen, right? Data and compute characteristics, frequency, recency, right, and system fault tolerance thresholds will determine, right, the nature of their home, so let's try to redefine what we need to know about the hot, cold, and warm data itself, right?
In the new world, it is going to be all about it's a hyperactive data, because we need to define those things, right? In-memory constructs are becoming very important, right, in order to understand what is happening in real time, so we need to start looking at it truly hot, hot data, which I call it hyperactive data in the active datasets, right, which are requiring _____ access will find a premier seat at the compute-centric cluster vs. the inactive and the dormant dataset that we need to keep in the health-care industry for compliance reporting reasons or other analytics itself, will probably be not at the compute-centric clusters, but will be finding their access more at a – I'm sorry – I think it will be probably more at a economy-class-like seating in a storage-centric cluster itself. You want to draw the _____ to the premier seating, right, in the compute-centric cluster for the hyperactive and active datasets itself, so we want to keep all these datasets for all times to conduct retrospective studies, right, to spot and identify any of the changing KPIs over a period of time itself. Now, that is one _____. How does this world changes, right, with respect to making all data available at all times itself?
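[Illustrative sketch, not part of the webcast: one way to express the idea that frequency and recency of access decide whether a dataset gets "premier seating" on a compute-centric cluster or "economy seating" on a storage-centric cluster. The four tier names come from the talk; the class, the function names, and every threshold below are assumptions chosen only to make the example concrete.]

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    name: str
    reads_per_day: float       # access frequency
    days_since_last_read: int  # access recency


def classify(stats: DatasetStats) -> str:
    """Assign a tier from frequency and recency (thresholds are assumed)."""
    if stats.days_since_last_read <= 1 and stats.reads_per_day >= 1000:
        return "hyperactive"
    if stats.days_since_last_read <= 30:
        return "active"
    if stats.days_since_last_read <= 365:
        return "inactive"
    return "dormant"


def home_for(tier: str) -> str:
    """Hyperactive and active data go to compute; the rest go to storage."""
    if tier in ("hyperactive", "active"):
        return "compute-centric cluster"
    return "storage-centric cluster"
```

[Inactive and dormant datasets are still kept, as the speaker notes, for compliance reporting and retrospective studies; the tier only changes where they live.]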
Now, on the other hand, the analytical tools landscape is also changing very fast. What we have been used to in the traditional world is not the new norm itself, so tool choices of today nicely complement the architecture in the new world data life cycle, management principles itself. And we have to look at some of those guiding principles and then see as we make choices for these tools, do they conform to these guiding principles across all the tools that have been chosen? So we say, "Okay, do not entertain tools that require its own infrastructure," so Arcadia Data, I think earlier on, when David talked about it, that all the Arcadia Data analysis runs on the Hadoop cluster itself natively in the platform, right? So we are looking at it _____ to the cluster, with excessive movement of data in and out of the cluster, right? That's kind of like the new norm as well.
Tools will run natively in the Hadoop cluster. We already talked about it, right, affinity towards Cloudera's way of doing things, right? Now, that is probably a little bit more of a Kaiser-centric statement, because we have chosen the Cloudera distribution as the foundation for – at Kaiser Permanente's Landing Zone itself, so our tool must honor and leverage the capabilities in the base platform provided by Cloudera itself. For all the tools that we are selecting, we want a little bit of an open metadata management, which is easily accessible, so that we can create the _____ metadata repository to understand what is going on in which tool and how that kind of like _____, so that as you're building end-to-end workflows, you can understand you can stitch those things together.
And of course, the final guiding principle is something like honor the old, but _____ service aspects of the new tools itself, as some of the functionality of the functions which are today in IT needs to start moving towards into the business community itself, so that they can go answer the questions or they can respond quickly to the changing business needs itself, and without the dependence on the IT stuff. So for hyperactive datasets, we are looking at tools that can pin the data in memory, right, by choice at the start of the application or on demand or based upon characteristics, caching hyperactive, active datasets in memory on – as you understand those access patterns itself.
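[Illustrative sketch, not part of the webcast: the guiding principles above can be read as a checklist applied to each candidate tool. The field names and the two example tools below are hypothetical; they do not describe any real product's capabilities.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    name: str
    needs_own_infrastructure: bool
    runs_natively_in_cluster: bool
    honors_platform_security: bool
    open_metadata: bool
    self_service: bool


def violated_principles(tool: ToolProfile) -> list[str]:
    """Return the guiding principles a tool does not meet (empty if none)."""
    checks = [
        (not tool.needs_own_infrastructure, "does not require its own infrastructure"),
        (tool.runs_natively_in_cluster, "runs natively in the Hadoop cluster"),
        (tool.honors_platform_security, "honors the base platform's capabilities"),
        (tool.open_metadata, "exposes open, accessible metadata"),
        (tool.self_service, "supports self service for business users"),
    ]
    return [principle for ok, principle in checks if not ok]


# Hypothetical candidates, used only to show the checklist in action.
in_cluster_tool = ToolProfile("ExampleInClusterTool", False, True, True, True, True)
external_tool = ToolProfile("ExampleExternalTool", True, False, True, True, True)
```

[A non-empty result does not have to mean rejection; it flags a tool that would need an explicit exception to the principles.]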
So tool list includes _____ _____ we can look at it, right? I don't know if you – we can all read these particular ones itself, but we have kind of like looked at it that we needed to – since we needed to keep control on the exact replica of the datasets, so semantic and syntactic equivalence, so we are building out our own, homegrown _____ capabilities. We are looking at tools like Trifacta and Informatica for data profiling and integration, Informatica because that has been the norm in the industry for a long time, but the tools like from Trifacta are coming along and then making it more self service. Borderline data science _____ looking at data tagging and cataloging. We are looking at Arcadia Data to support the data cubes, so some of those things are _____.
I think what we – what the world we are looking at is equivalence as much as possible, so it is not a new way of – it is probably a little bit of a new way of doing things, but the wisdom that has been garnered over the last 30 years, those principles are not going away. We are just looking at it, can we do the things faster, better in order to respond to the business needs, but some of the principles can move forward along with – from what has been learned over the 30 years itself. So we're also looking at H2O and _____ for analytics and deep machine learning and visualization, and Tableau in certain instances for visualization alone. As you know, Tableau requires a _____ _____ cluster, so in some cases, we have made some exceptions to the guiding principles itself, but we are mostly sticking by what the guiding principles are. I think that is probably all that I needed to talk about, so with that, I'll probably pass the ball back to our host, Nick.
Nick: All right. Thank you, Rajiv. That was fantastic. Really appreciate you sharing so much. I see a question right off the bat: When you talk about the different use cases in the different environments, you also cite a number of other tools involved here, so Hadoop is obviously more of an ecosystem than a single set of binaries or tools you'd install. What was your experience in terms of integrating Arcadia with the rest of that ecosystem? How hard was it to set it up, get the connectivity tied into other parts of the platform?
Rajiv Synghal: So we had been actually working with Arcadia, and as I said, right, earlier, that since we'd chosen Cloudera, so looking at it that is it easy to, from a _____ installation perspective itself, right? That is the first thing. Does it adhere to Cloudera's way of installing the tool itself through the parcel distribution, and yes, they can. Are they integrating directly with the Active Directory? Yes. Do they honor the _____ integration? Yes, they are. Do they honor the access restrictions provided in Sentry? Yes, they are, so I think _____ looking at it, that what is the base platform capabilities, yes, so then it comes on top of it that as the datasets are available, now we are actually working – we just started coming out of the gate. Since we acquired these tools at the tail end of last year, we have gone through the installation process, and now we are – we have gone through a couple of POCs with our business community, so now we will be rolling it out. And as we learn more, we'll be able to share more that – where the gaps are, and we're looking forward to working very closely with Arcadia to fix those gaps.
Nick: Gotcha, and a follow-up question to that: As you talk about rolling it out to the community, who do you see as the different roles that will be interfacing here, and what kind of training do you think they'll require to get familiar and comfortable with it?
Rajiv Synghal: So when you're looking at Kaiser Permanente, right, it has been in existence for more than 60 years plus, right? So the roles have been clearly identified, right, that whether they are the IT groups which provide the data migration and the data integration and the profiling itself, so we want to look at the tools that kind of like goes towards that community itself. Then there is a set of groups which are providing the reporting and analytics on top of that particular tools itself, so there's Arcadia Data _____ tools will be going to that particular community itself.
Now, since Arcadia Data comes integrated with the visualization, so some of the groups which were doing aggressive reporting and analytics or maybe the more predictive analytics and creating dashboards, right, so they will be actually using these tools as well. So it's kind of like that some roles are being kind of like combined together, and some, we are kind of like keeping the organization structure in place, so we are not focusing more on changing the whole organization itself, making sure that the functions exist wherever they exist. Just empower them in place, right? And over a period of time, the new norm will set up that how these groups need to either collapse or they need to expand.
Nick: Great. That makes sense, and I wanna bring Tanwir back into the conversation about MarketShare, and ask the same question: What types of roles are then interacting with your data here, and how did you get them up and running and comfortable with the new approach?
Tanwir Danish: Yeah, sure, so for MarketShare, we really are a SaaS provider, right? So they are external as well as internal users. All of the _____ that we do are guided towards our enterprise marketers, and within that marketing unit, it's folks like marketing insight managers, display channel managers, media planners. So most people have some data or analytics bent to it that are using the tools, but also, the business analysts and marketing analysts are sort of another group that really benefit from it.
Internally, we have been using it in two ways. One is just because we do a lot of predictive modeling, so a lot of data science is applied in the insights that we provide to our clients, so that group in itself to do a lot of exploratory data analysis, a lot of visualization, to then help them understand how to better configure different models for their specific client, so they are another group of users. And then on the development side, really, the development need is one where once the setup is available, then our domain experts themselves, in working with the _____ manager, can iterate on different reports. It really is a wide group of people, ranging from sort of business needs to predictive analytics need, in terms of the user base that we have at this point in time.
Nick: Gotcha. You've both mentioned visualization as being an important part of it, and I see that a lot as well, when I talk to various businesses, being able to actually see and interact with your data. Was there a specific set of criteria you had there in terms of what types of visualizations were available or how it scaled with different datasets? What are the considerations around selecting and evaluating the visualization capabilities? Tanwir, I don't know if you wanna continue on that.
Tanwir Danish: Sure, so I can add a little, and then maybe we hear from _____ as well, so in terms of our need for visualization, it's really about, first and foremost, access to the deepest level of data so we can craft what type of visualization we wanna see. And specifically in Arcadia, we have some of the _____ visualizations that we would usually not get otherwise, so flow diagrams is a very important diagram in terms of how are customers moving down the funnels, so from being aware about the brand to one that is interacting with the brand to one where they're converting, so that's an important visualization for us. Heat maps on sort of how different time points in the day is performing relative to others, so _____ reports are very important.
But also, for us specifically, the mantra we had going into this reporting stream was one of UPS, U standing for UX, P for performance, and S for scalability. And so from a UX perspective, we are not just using what Arcadia provides from a visualization perspective, but we are also developing some of our own visualization on top of the extracts we are getting from Arcadia. We don't do it for all of our reports, but certainly, some of the reports where we want to sort of spend a little more time, it's good to have that flexibility.
Nick: Gotcha. That makes sense, and Rajiv, at Kaiser Permanente, you're looking at a customer journey in a different sense, but you also are looking at other types of life sciences, pharmaceutical outcomes types of data. How do you see visualization playing out in some of the different use cases in a more hospital, clinical setting?
Rajiv Synghal: _____ you ask in the hospital and clinical settings, I'll probably tell you that typically, I don't have much of a view into that particular one. Remember that we are in the – I'm in the enterprise architecture world itself, right? So there is a whole business community which is focused on how the information needs to be presented, right, and there is expert after expert. Just taking the example of the ETG, right, which is the episode treatment groupings, which defines that what is the cost characteristics of – across a specific episode across all our member base itself, and what is the outcomes for each one of them, right? And there is multiple ways of presenting, and there is kind of like the variances which exist, even from facility to facility, from region to region, from physician offices to physician offices. So there's a lot of customization that is required in order to make sure that the information that we are providing is well understood and is actionable by the person who is receiving it.
So just to give you kind of like a view, not only the physician community, right, or the hospital community itself, Kaiser Permanente is a city in itself, right? We are shy of approximately 200,000 people, right? So all other aspects, whether it is the financial, whether it is the HR, whether – we are kind of like – I don't know whether you know or not, but we are _____ own our own hospitals, right? And there are like 38 of them, so we are, you can say, a big real estate management company as well, so there is other aspects of the Kaiser business also that require different types of visualizations, so as we are looking at it, that what makes the most sense, right? And of course, it is going to go back towards understanding the data and looking at it more from a dimensional perspective, whether it is a temporal dimension, whether it is a spatial dimension, whether it is any other dimension, right, across which we want to cut the dataset, so those things are going to be evolving. We already have some things in place, and those will be evolving over a period of time.
Nick: Gotcha. That makes a lot of sense, and it's really interesting to hear about all the different aspects at Kaiser Permanente, not just think about the physician experience and client experience in that case. I'm just curious – we're coming up towards the top of the hour – is there any surprise, or one piece of advice you'd offer to our listeners today? And Rajiv, maybe we'll start with you and then come back around to Tanwir.
Rajiv Synghal: Sure. I will tell you basically that I think that _____ is great, right, but do not get, I would say, truly incentivized to start putting the Hadoop clusters in place. There is a lot of heavy lifting that needs to get done in order to put the solution in place and then carve out that solution, understanding what is the security around that, understanding how the data movement is going to happen, understanding what the tools are. And I think it is going to be very much crafted very specific to the actual fabric of the organization that you work in, so don't look at the architecture that is being put together or, really, the success stories which are shared at _____ as being that you are going to be able to go realize those on day one.
Nick: Always good advice. We can get caught up in all the excitement, and it all looks so easy when you see all the success stories, but being very deliberate about how you approach it, I think is definitely important. Tanwir, we'll let you close off. Is there anything else you would wanna offer as a piece of advice for somebody getting started in this area?
Tanwir Danish: Yeah. My advice would be to really sort of get in it and try it out for your use case. Again, we are very focused on what business problem we are solving and therefore how good a fit a solution is to it, and so trying this out is quick, and assessing the fit is where you really need to spend the time, and so that would be my advice, that quickly – don't question whether you should try. Question if this fits your need, and spend time on that, and the more we do that, the better chances you will assess it right, as well as if there were gaps, then you will be able to understand that as well.
Nick: Okay, great. I see a couple more questions have come in, and we still have a couple minutes left, so just to ask again, how has the SQL-on-Hadoop experience been working, and compared to how you were trying to query Hadoop previously or build it into applications previously? Tanwir, do you wanna take that first?
Tanwir Danish: Yeah, so I sort of laid out how we were sort of using it before, and again, depending on which profile is accessing information, which person is accessing the information, their _____ were different, all the way from sort of running Hive queries to one where we are trying to put Tableau environments with an intermediary in the middle, to one where we are directly on top of it from a visualization perspective. But certainly, for the insights community, it has been a big deal for them to really tap into the most granular data and directly be able to see in form of visuals that give them the best insights for their specific needs.
Nick: That's great. Thank you. Well, I can see we're just coming up on the end of the hour. I wanna thank my co-presenters today, Tanwir and Rajiv, and especially Arcadia Data for making this possible and hosting us all today. It's been a really interesting discussion. I thank everybody for their participation and sharing information. I know Arcadia would be happy to follow up if there are any other questions our listeners have had, but thank you for joining. I hope everybody has a great day.
[End transcription at 1:00:00]