Editor’s note: Shant Hovsepian, CTO and Co-founder of Arcadia Data, presented at #datadrivenNYC on how to get BI right in the big data space.
“You don’t use previous generation architectures to store your big data, so why should you use previous generation business intelligence (BI) tools to analyze it?”
The following was adapted from a transcript of the presentation.
I’m going to do my best to talk about what we’ve been seeing in the field and what our customers complain about. In particular, I’m going to try to answer a question around BI on Big Data. Specifically: you don’t use previous-generation architectures to store your Big Data, so why should you use previous-generation BI tools for your Big Data?
We’re very fortunate in the Big Data community, ’cause we have a celebrity, Big Data Borat. For those of you who are not familiar with Big Data Borat, Big Data Borat is the spokesperson of the Big Data community. You all just learned about Big Data Seacrest. And what I’m going to try to do today is I’m going to channel Big Data Moses. I’m going to talk about the “10 Commandments for BI on Big Data.”
To paraphrase Mike Olson, co-founder and chief strategy officer of Cloudera: when he was up here, he said something in regard to Oracle, and I’m going to paraphrase it: “Oracle gives you min, max, median on tabular data, and it’s a $100 billion industry. And we do advanced analytics on a thousand times that data, and I feel like it’s gotta be worth more than that.” I’m going to try really hard not to say things that are self-serving, but my 10 Commandments will be partially biased. Let me just give you that caveat now.
First Commandment: Thou shalt not move Big Data.
I think this one speaks for itself. It’s somewhat obvious. But moving Big Data is expensive: by nature, it is big. Simple physics is in play there. What we’ve been seeing in the field, and hearing from the customers we talk to, is that people want BI tools that push computation down as close to the data as possible. There are lots of approaches – we’re very fortunate now: Apache Hadoop, Big Data, Mongo, Cassandra – the industry has really developed a lot, and we have a lot of amazing native tools that you can use for analytics.
When you’re out there looking for a BI tool, make sure it’s one that can run analytics as close to the data as possible, and don’t just settle for ODBC/JDBC connectors.
That’s something that everyone will claim. But really, you try to go below that layer and see if you can get some real native analytics from your BI tool. Be careful having it extract data out into data marts and cubes. And “extract” is, by definition, moving. Moving Big Data: Again, expensive, big, complicated. It’s also a huge maintenance problem.
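To make the extract-versus-push-down contrast concrete, here is a minimal Python sketch. It uses `sqlite3` as a stand-in for a remote engine, and the table and column names are hypothetical: the point is only that the push-down query moves two result rows instead of every raw row.

```python
import sqlite3

# Stand-in for a remote engine; in practice this would be Impala, Drill, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# Extract-and-compute: pull every row over the wire, then aggregate locally.
rows = conn.execute("SELECT region, amount FROM events").fetchall()
local = {}
for region, amount in rows:
    local[region] = local.get(region, 0.0) + amount

# Push-down: ship the aggregation to the engine; only the grouped
# result rows ever move.
pushed = dict(conn.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region").fetchall())

assert local == pushed  # same answer, vastly different data movement
```

At three rows the difference is invisible; at hundreds of billions of rows, the first approach is the extract problem described above.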
A lot of the customers we talk to forget about the network cost or the CPU computation involved. It’s also management overhead: now there are two copies of something that are logically the same. If you do have to do extracts and cubes with your BI tool, it’d be great if it can do them in situ. If you’re using Mongo, a lot of the cool analytics tools for Mongo these days will create Mongo documents directly. If you’re doing anything on Hadoop, try to make sure that cube or extract actually resides on the Hadoop cluster instead of in a separate system.
And then, the thing that I’m most excited about is On-Cluster BI. On-Cluster BI is something that’s possible now thanks to things like YARN and Mesos. We’re seeing a whole new resurgence of operating systems for the data center. I can build an application that lives on the data center: I don’t have to write replication, I don’t have to implement a scheduler, I’ve got all of these services. It’s just made application development much easier.
Think about the possibilities of your Big Data system, of how much BI you can actually push down to the lower layers.
Second Commandment: Thou shalt not steal or violate corporate security policy.
I’m biased: I’m here in New York, so I talk to a lot of companies that are very, very serious about security, especially given the last few sets of data breaches. Every customer I talk to is really worried about Sentry – about security, excuse me. A lot of the serious Big Data vendors – we’re fortunate enough to have one of them speaking later today – have really heard this from their customers, and they’ve implemented some amazing infrastructure to make security a possibility. But again, the theme with Big Data is: it’s large and it’s complicated.
When you’re looking for BI tools, look for ones that can leverage the security model that’s already in place. If you have to re-implement your whole security model – once in your storage layer, once in your database layer, and once in your application BI layer – you’re not going to do it, or you’re going to lose information.
Look for unified security models. There have been a lot of great projects: Hortonworks has done a great job, IBM is working on Knox, and Mongo has an amazing security architecture now. Your applications can plug into these, propagate that user information all the way up to the application layer, and enforce it at the visualization layer – along with the data lineage associated with it – along the way.
And then auditing. If you can’t get security, and you can’t get encryption, at least make sure there’s an audit trail for your applications, because when the next Edward Snowden hits, you want to know where he hit. That’s something we’re seeing a lot more of.
Third Commandment: Thou shalt not pay for every user or gigabyte.
One of the fundamental beauties of Big Data, besides the new types of analytics and storage, is that it’s hard to deny, at the end of the day, there’s an economic advantage. Big Data is cost-effective if done properly. You wouldn’t stick five petabytes of data in your Oracle system, because it would just cost you so much money. But you can put it in a Big Data system.
When you’re looking for BI tools, make sure your BI tools don’t penalize you for your Big Data. Pricing models that penalize you for increased adoption are somewhat dangerous. Traditionally we’ve seen lots of applications priced this way: Oracle loves to charge you by core, lots of applications charge you by gigabyte, and some charge you by gigabytes indexed.
These are very frightening concepts when you’re dealing with Big Data, because it’s very common to have geometric or exponential – i.e., really fast – growth, both on the data side and the adoption side. And the beauty of a lot of these Big Data systems is in their incremental scalability.
We’ve had multiple customers who, within a couple of months, have had deployments go from tens of billions of entries to hundreds of billions. They went from having 12 active users on the system to 600.
You want to make sure that your BI vendor’s incentives and your business motives are aligned. Because it’s so easy to incrementally scale these systems, an application or a use case can get a lot of adoption very quickly. You want to make sure you’re not paying a penalty on the BI side – too much money for having too many gigabytes indexed, which is a very frightening concept.
Fourth Commandment: Thou shalt covet thy neighbor’s visualizations.
First-class support for collaboration: again, Big Data is complicated. No single person in your organization will know everything; you will have domain experts. Really, it’s all about letting people work together to come up with insights.
Let’s compare publishing and sharing. You can publish static PDFs, export to PNG, send things out over e-mail. But you also want a way to publish these visualizations that preserves their interactivity, so they’re not just static. Contrast that with sharing.
When I think of sharing, I think of the GitHub model. It’s not, “Here’s your final published product,” but, “Here is a clone, fork it, and this is how I arrived at those insights.” That way, other people can learn from those insights. And that’s really important, again, in the Big Data space, because one form of analytics can be applicable to different problem domains.
Fifth Commandment: Thou shalt analyze thine data in its natural form.
What does Big Data look like? Well, the Wikipedia article about Big Data is Big Data. Big Data is preformatted text paragraphs. You may want to do search here, faceting, some simple aggregation.
Then there’s fixed-format data. This is what finance and sensor data look like: a bunch of key-value pairs, and tons of it gets generated. Now, JSON: probably the trendiest data format of all. This is semi-structured, multi-structured data, where things like JSON, Avro, and Parquet have opened up new possibilities. Mongo has made a huge bet on making sure data stays in this format – not just for performance and scalability reasons, but because there’s an extra bit of expressiveness here that you just can’t get if you convert the data into the plain flat tables everyone knows and loves.
But this is Big Data, too. There’s lots of plain tabular data in the Big Data world. The difference is that there are hundreds of billions or trillions of rows, they have lots of columns, and you still have to do lots of relational joins.
Don’t let your BI solution tell you otherwise. You want to find the BI solution that won’t tell you, “Hey, sorry, please transform your data into a pretty table.” You want BI solutions that can really analyze the data in its native form, because there is value in having data in that form.
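As a small illustration of analyzing data in its natural form, consider a nested order document; the fields here are hypothetical. The aggregation runs over the nested structure directly, with no flattening into a separate line-items table and no join to put the order back together:

```python
# A hypothetical order document in its natural, nested form (JSON-style).
order = {
    "order_id": "A-100",
    "customer": {"name": "Acme", "region": "east"},
    "items": [
        {"sku": "widget", "qty": 2, "price": 9.99},
        {"sku": "gadget", "qty": 1, "price": 24.50},
    ],
}

# Aggregate over the nested items directly -- no flattening, no join.
total = sum(item["qty"] * item["price"] for item in order["items"])
print(round(total, 2))  # 2*9.99 + 24.50 = 44.48
```

Flattening this into two tables would work, but the order/items relationship – the expressiveness the passage above describes – would then only exist implicitly, through a join key.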
Sixth Commandment: Thou shalt not wait endlessly for thine results.
No surprise here: things should be fast. Data’s big; you shouldn’t have to wait too long. There are a bunch of tricks that BI tools have always played to achieve performance. The first one is to build an OLAP cube. This actually works really well; the problem is you have to build the cube before you get performance. And going back to one of the earlier commandments, try not to move the data. We’ll see lots of tools that encourage you to build BI cubes or OLAP cubes – essentially moving the data into a pre-computed cache – and you’ll get good performance.
Creating temporary tables. I’ll call this fancy caching: lots of BI tools materialize the intermediate results and expressions during a session so they don’t have to do those calculations over and over again. Again, this actually works pretty well at a certain scale. Just make sure that temp table isn’t gigantic, and your laptop isn’t going to crash because it’s trying to materialize it locally.
Finally, samples. Sampling data can be dangerous: you get instant gratification, but your results may not be correct. Look for tools that can sample intelligently. Certain operations are blocking – at certain points in an analysis, I need to stop and count everything. You need to be able to push that sampling down below that operation, not above it. Otherwise, the only things you’ll be able to sample are trivial visualizations.
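To see why the sampling has to go below the blocking operation, here is a deliberately contrived sketch: counting distinct users on a sample of raw events badly under-counts, because the distinct step is a blocking operation that must see all of the data. The numbers and the stride-based sample are made up to make the bias obvious:

```python
# 1,000 raw events spread evenly over 100 distinct users.
events = [f"user{i % 100}" for i in range(1000)]

# Sampling ABOVE the blocking operation: take every 10th raw event,
# then count distinct users on the sample. The stride lines up with
# the data and only ever sees users 0, 10, 20, ..., 90.
sample = events[::10]
naive = len(set(sample))       # 10 -- badly under-counts

# The blocking step itself must see all the data: compute the full
# distinct set first, then sample from (or below) that result.
distinct_users = set(events)
exact = len(distinct_users)    # 100
```

Real samplers are smarter than a fixed stride, but the structural point from the passage above holds: any sample taken above a count-distinct, join, or similar blocking step changes the answer, not just the cost.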
Seventh Commandment: Thou shalt not build reports but apps instead.
What comes to mind when I say reports? Traffic report, weather report, book report, report card. None of these things are pleasant.
If you’re like most people, you don’t want to deal with reports. Unless, perhaps, you’re Alex Dunphy from Modern Family – then you like reports, but that’s about it. What comes to mind when I say “apps”? Sunshine and rainbows, obviously. Apps are just so much cooler now. It’s better to build apps.
To explain what we mean by apps: in 1996, Ben Shneiderman and his grad students published an InfoVis paper on the “Visual Information-Seeking Mantra.” That was the team that went on to build Spotfire. It’s been 20 years, and it’s still very relevant. The general theme for data analysis was: overview first, zoom and filter, then details on demand. I think nothing has fundamentally changed there. All of that is very relevant and important for Big Data analytics; you just need a tool that can do that and express it in an attractive way.
I’m not going to make the tired analogy of iPhones and apps; rather, consider Web apps. Everyone likes to talk about these, but for BI, you want data-driven apps. Just as with Web apps, I want asynchronous data from multiple sources so I don’t have to refresh anything. I don’t have to wait for something to reload; I can get data really quickly.
And I think the third thing that’s really important, and what made Web apps popular, was frameworks like Rails that made them easy to develop. It’s not enough that a visualization or BI tool can give you interactive visualizations; it’s important to think about the developer side as well. Rails really made it easy for a bunch of people to build Web applications, and you’d want similar functionality from your BI tool: app templates, reusability, things like that.
Eighth Commandment: Thou shalt use intelligent tools.
Again, Big Data is big, and Big Data is complicated. Look for smart BI tools. BI tools have been doing a great job lately of recommending visualizations based on the data and based on usage. Look for tools that have search built in for everything, because I’ve seen customers who literally have thousands of visualizations they’ve built out. You need a way to quickly find things, and these days we’ve been trained to search rather than go through menus – search should be a built-in feature. And then look for any kind of automatic maintenance of models and caching, so the end user doesn’t have to worry about it.
Ninth Commandment: Thou shalt go beyond the basics.
We have giant Big Data systems with amazing horsepower for predictive analytics. Our next speaker, actually – his company has done an amazing job with this type of technology. We should be making advanced analytics accessible to business users. “You can run this R function” isn’t the right answer. The answer is what matters for business users: correlation, forecasting, these things – making them very easy to use.
Tenth Commandment: Thou shalt not just stand there on the shore of the data lake waiting for a data scientist to do big data.
Whether you approach big data as a data lake or an enterprise data hub, Hadoop has changed the speed and cost of data. We’re all helping to create more of it every day. But when it comes to actually using big data in everyday work for business users, big data becomes a write-only system: data created by the many is only used by the few. Nik Rouda of ESG calls it “the data scientist paradox:”
The rapid emergence of Hadoop has in effect created a new IT labor shortage: the bottleneck has shifted from battling the slowness of extreme rigidity to battling the slowness of extreme fluidity.
Business users have a ton of questions that can be answered with data in Apache Hadoop. Business Intelligence is about building applications that deliver that data visually, in the context of day-to-day decision making. Everyone in an organization wants to make data-driven decisions. It’d be a shame to reduce all the questions that big data can answer to those that need a data scientist to tackle.