April 19, 2017 - Susan Rojo | Big Data Ecosystem

4 Ways to Scale Analytics on a Data Lake

In our recent webinar, 4 Ways to Scale Interactive BI and Analytics on a Data Lake, MapR’s Sameer Nori, Director of Partner Marketing, and Saurabh Mahapatra, Senior Product Manager, joined Steve Wooledge and Priyank Patel of Arcadia Data to discuss the pros and cons of four different ways to scale analytics on a data lake, example use cases, and how Arcadia Enterprise integrates with MapR to support multi-tenant, high user concurrency applications.

Market Trends and Data Lakes

Sameer started off the webinar with a discussion of market trends and where organizations are in their big data deployment. According to a recent Gartner survey, 15% of organizations are in the knowledge-gathering stage, 14% are developing a big data strategy, 30% are in the pilot/experimentation phase, and 15% are in the process of gathering information and performing data discovery. Overall, 73% of organizations have already invested or plan to invest in a big data solution.

With the explosion of data in the past few years, organizations are faced with one of three scenarios: 1) their data warehouse has reached its capacity, but they don’t want to spend more money on newer technology; 2) they have far more data than they can put into their data warehouse, much less retrieve and analyze there; or 3) they have a lot of data sitting in their data warehouse that they want to leverage more effectively.

As a result, customers are looking to establish a next-generation applications/analytics platform in order to capture their large volumes of new data, put all their data in play, bridge their data silos by aggregating data sources across business units, and meet regulatory requirements. Because of these needs, organizations are turning to data lakes to help them solve some of these issues.

Some companies deploy a starter data lake so that they can do some simple batch analytics on it. As those applications/use cases take off, their complexity and scale increases. They may need to monitor IoT data, do some security analytics, or develop recommendation engines or anomaly detection abilities. They may also start to consider applications like machine learning and deep learning. As the complexity of these applications grows, so does the scale. Regardless of where organizations are on their big data journey, a data lake can be a great starting point. The notion of a data lake and what it serves will evolve as the use cases expand.

4 Ways to Provide Analytics on the Data Lake

There are several approaches to scaling BI and being able to perform analytics on the data lake, each with its own set of pros and cons.

1. Separate BI server: In this phase, organizations want to use existing tools like Tableau, Qlik, and MicroStrategy to pull data out of the data lake to do the analysis. The upside to this approach is that people already know how to use the tools. The downside is that they have to take massive amounts of data from a scale-out distributed system and try to force it into an end user desktop or onto a single scale-up BI server.

In this scenario, they can’t drill down to the granular data unless they know exactly what they want to analyze, extract that section, and put it on a server. They also need to define an ETL process, since the data has to be aggregated before it can be moved to the server. Separate security models need to be set up and maintained as well: one for the data platform itself, and another for the role-based access controls and authorization parameters within the BI tools.

In addition, because data is being moved over the network, the extract is by definition a batch job: users need to constantly refresh it themselves or ask IT to refresh the view they have on the server.
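To make that extract-and-refresh cycle concrete, here is a minimal sketch of the kind of scheduled job involved, assuming a Hive endpoint reachable through the PyHive client; the host name, table, and columns are hypothetical.

```python
# Hypothetical nightly extract job: pre-aggregate in the data lake,
# then pull the (much smaller) summary onto a single BI server or desktop.
import csv
from pyhive import hive  # assumes the PyHive client is installed

# Placeholder connection details for a Hive endpoint.
conn = hive.connect(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()

# The aggregation must be decided up front; granular rows never leave the lake.
cursor.execute("""
    SELECT region, product_line, SUM(revenue) AS total_revenue
    FROM sales_events
    WHERE event_date >= date_sub(current_date, 30)
    GROUP BY region, product_line
""")

# Write a flat extract that a desktop BI tool or BI server can import.
with open("sales_summary_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor.fetchall())

conn.close()
```

Because the extract only changes when this job reruns, users are always looking at data as of the last refresh, which is exactly the batch problem described above.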

2. Fast SQL + BI tools: In this scenario, organizations typically want to improve agility: with a separate BI server, users don’t have access to all the granular data, and they can’t run queries against semi-structured data like JSON. By connecting their existing BI tools to SQL-on-Hadoop engines, users gain some of those capabilities.

The upside of this approach is that organizations already have the tools in place and people know how to use them. However, many BI tools don’t natively support these SQL-on-Hadoop engines, so the approach requires a higher-skilled person who can write free-form SQL in the tool itself.

Another con is that once you get above five to ten concurrent users, performance degrades sharply. In addition, a lot of the advanced analytics that can be done within the Apache Hadoop or scale-out environment is not always accessible through these SQL-on-Hadoop tools, and security gets more complicated: the BI tool connects through its own security layers and models rather than interpreting or inheriting the security applied in the data lake itself.
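As a rough illustration of the free-form SQL requirement, the sketch below runs a hand-written query with a window function against Impala via the impyla client, the kind of statement an analyst pastes into a BI tool’s custom-SQL option because the tool won’t generate it on its own. The endpoint, table, and columns are hypothetical, and any SQL-on-Hadoop engine could stand in for Impala here.

```python
# Hypothetical hand-written ("free-form") SQL against a SQL-on-Hadoop engine.
from impala.dbapi import connect  # assumes the impyla client is installed

# Placeholder endpoint for an Impala daemon.
conn = connect(host="impala-daemon.example.com", port=21050)
cursor = conn.cursor()

# Window functions are a typical example of SQL that many BI tools will not
# generate natively against a SQL-on-Hadoop engine.
cursor.execute("""
    SELECT customer_id,
           order_ts,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_ts) AS running_total
    FROM orders
    WHERE order_ts >= '2017-01-01'
""")

for row in cursor.fetchall():
    print(row)

conn.close()
```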

3. Middleware application cubes: In this solution, users build application cubes that run in the data lake itself. They can deploy Apache Kylin or a commercial tool to do the aggregations in advance: a cube is defined before end users perform any analysis, and the aggregates are stored in the data lake. Queries against the cubes achieve higher performance and user concurrency. Users may get good performance on the summarized and aggregated data in that cube, but they also lose some ad hoc freedom; the IT group needs to build the cube in advance, and it has to understand all the questions that could potentially be asked. Obviously, with ad hoc data discovery, you don’t always know what questions you want to ask until you start to see patterns in the data.

Users typically need to have IT build the dimensions into the cube, and because the cube is usually refreshed by a nightly batch job with something like Apache Hive, the data they’re accessing can be up to a day old. Because structure and schema are imposed up front, they lose some of the fidelity of the semi-structured data, and they still have separate security administration to maintain.
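For illustration, here is a minimal sketch of the kind of nightly aggregation job that materializes such a cube in Hive. In practice a tool like Apache Kylin would manage the cube build rather than a hand-written script, and the table, dimensions, and measures shown here are hypothetical.

```python
# Hypothetical nightly cube-build job: every dimension combination that users
# might slice by has to be aggregated ahead of time.
from pyhive import hive  # assumes the PyHive client is installed

conn = hive.connect(host="hive-gateway.example.com", port=10000)
cursor = conn.cursor()

# Rebuild the aggregate table. Questions that need a dimension not listed
# here (or row-level detail) cannot be answered from the cube at all.
cursor.execute("DROP TABLE IF EXISTS sales_cube_daily")
cursor.execute("""
    CREATE TABLE sales_cube_daily AS
    SELECT event_date,
           region,
           product_line,
           COUNT(*)     AS order_count,
           SUM(revenue) AS total_revenue
    FROM sales_events
    GROUP BY event_date, region, product_line
""")

conn.close()
```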

4. Data-native visual analytics & apps: The fourth, and most effective, method is data-native visual analytics. This is the area where Arcadia Data focuses. We bring the BI server directly into the data lake in a truly distributed, massively parallel fashion, so that organizations can scale linearly with the data and have the processing run next to the data. In this way, access is much closer to real time and more dynamic, and the platform can support hundreds or thousands of users.

This is the best approach for organizations that plan to scale out to multiple applications and hundreds or thousands of users. Moving the processing to the data, rather than the data to the processing, makes sense from a physics perspective. The analytics and BI layer needs to be fully distributed as well, which provides user concurrency, scalability, and the ability to drill into detail or drill through to other dimensions.

By using Apache Drill and other methods, organizations can support native queries on complex data sources in real time. Because Arcadia Data provides built-in advanced analytics, users can do time-series analysis, path analysis, and network graph analysis. It is also dramatically simpler in terms of security and administration, because Arcadia Data inherits the underlying authorization and access controls of the data platform. The data is never moved; it stays in the platform, greatly reducing administrative overhead, security risks, and storage costs.
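As a simple illustration of what a native query on complex data can look like, the sketch below sends SQL over nested JSON files straight to Apache Drill’s REST API. The Drill host, file path, and field names are hypothetical, and Arcadia Enterprise’s integration is richer than a raw REST call, but the principle of querying data in place, with no extract or cube step, is the same.

```python
# Hypothetical query against nested JSON in the data lake, sent straight to
# Apache Drill's REST API -- no extract, no cube, no schema defined up front.
import requests

DRILL_URL = "http://drill-node.example.com:8047/query.json"  # placeholder host

payload = {
    "queryType": "SQL",
    # Drill can address raw JSON files and reach into nested fields directly.
    "query": """
        SELECT t.customer.id     AS customer_id,
               t.customer.region AS region,
               t.order_total     AS order_total
        FROM dfs.`/data/lake/orders/2017` t
        WHERE t.order_total > 1000
    """,
}

response = requests.post(DRILL_URL, json=payload)
response.raise_for_status()

for row in response.json().get("rows", []):
    print(row)
```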

Watch the full webinar here to learn what else Steve, Priyank, Saurabh, and Sameer have to say about data lakes, BI, and analytics. 

