Now that football season is upon us, I felt I should borrow from football lore for the title of this blog. I admit that BI performance isn’t everything, and it certainly isn’t the only thing, but it is still very important. Before I dive into that topic, let me first take a step back.
We hear time and again that organizations seek ways to make their modern data platforms (e.g., Apache Hadoop, Apache Kafka, the cloud) more valuable beyond technical teams. Though data scientists and data engineers regularly make great use of huge volumes of collected data, business users are typically left out. This is not for lack of trying. Often the first attempt at bringing business users into a modern data platform involves traditional BI tools. That is, you take the BI tools that were designed for relational technologies and transplant them into a big data context. This tends not to work well for a variety of reasons. For starters, traditional BI tools expect you to move data from your modern data platform to a dedicated BI server. This adds significant administrative overhead that does not scale. The dedicated BI server also has limits on how much data it can manage and how much user load it can sustain.
A recent benchmark shows Arcadia Enterprise accelerating performance over the baseline configuration of an external BI tool connecting directly to Hadoop. (Source: ESG infographic)
If you accept that limited data sets and extra administrative effort are bad, then you might instead take the route of connecting the aforementioned BI tools directly to your big data cluster. The question remains whether that model scales well enough for a production environment. I should note that there are actually a few ways to bolt your existing BI tool onto your modern data platform. You can use OLAP, middleware, in-memory aggregators, and data virtualization layers. These may work at a satisfactory level if you have a limited data set or a small number of concurrent users. But if that is the case, you’re missing out on the scale and flexibility advantages for which you deployed your big data cluster.
We at Arcadia Data take performance very seriously because we have seen organizations struggle with modern architectures like data lakes when using legacy BI tools to provide business analyst access. One clear challenge is the limited user concurrency of BI dashboards. If you have a team of users whose job entails using data to make better decisions, then making them wait for that data can be painful enough for them to consider the system unusable. So if you want a BI-style environment on your data lake, Arcadia Data can run in your cluster to accelerate your dashboards and drastically improve concurrency and responsiveness. Our performance is based on accelerators known as “analytical views,” which you can read more about in two of our blog posts, Beyond the Cube: Embrace Analytical Views, and A Closer Look at Query Acceleration with Analytical Views.
We recently ran some benchmark tests that were assessed by Enterprise Strategy Group (ESG) as part of their technical review practice. Their report is available here, and you can see the dramatic acceleration that we provide against a configuration that entails external, traditional BI tools. Three takeaways from the report I want to call out are below:
- Arcadia Enterprise shows dramatic acceleration, even at high concurrency. The numbers indicate that you can get the acceleration your users need. The data also shows that if you were to use external BI tools for your environment, you’d get neither the performance nor the concurrency you need for your production environment. Overwhelming your BI tool means that your end users get dashboards that appear to hang.
So how do the concurrency numbers map to real-world BI environments? For a given number of end users with access to a BI system, about 5 to 10 percent are actively issuing queries at any given time. This means that for any of the concurrency numbers in the report, you can support a user base of 10 to 20 times that figure. So if you can get good response times for 10 concurrent sessions, you can support a user base of 100 to 200 users.
- Arcadia Enterprise scales nearly linearly. The metrics show that Arcadia Enterprise scales gracefully as you increase the number of concurrent sessions. This shows that Arcadia Enterprise is suitable for large user communities on large datasets.
- External BI tools will not scale. In our tests, even user concurrencies as low as 10 were slow for the external BI configuration. And it got much worse from there. Higher concurrencies did not return results in a reasonable time frame (will anyone wait an hour or more for a dashboard update?).
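The rule of thumb above (about 5 to 10 percent of users actively querying at any given time) can be written as a quick calculation. This is only a sketch for sizing estimates; the function name and its parameters are hypothetical and are not taken from the ESG report or from any Arcadia Data product.

```python
def supported_user_base(concurrent_sessions, active_pct_low=5, active_pct_high=10):
    """Estimate the (min, max) user base for a given number of concurrent sessions.

    Assumes that between active_pct_low and active_pct_high percent of all
    users are issuing queries at any one time (the 5 to 10 percent rule of
    thumb from the text). Integer percentages keep the arithmetic exact.
    """
    if concurrent_sessions < 0:
        raise ValueError("concurrent_sessions must be non-negative")
    if not 0 < active_pct_low <= active_pct_high <= 100:
        raise ValueError("percentages must satisfy 0 < low <= high <= 100")
    # A higher active percentage means fewer total users per concurrent session.
    min_users = concurrent_sessions * 100 // active_pct_high
    max_users = concurrent_sessions * 100 // active_pct_low
    return min_users, max_users


if __name__ == "__main__":
    low, high = supported_user_base(10)
    print(f"10 concurrent sessions: roughly {low} to {high} users")
```

With the default percentages, 10 concurrent sessions map to a user base of 100 to 200, matching the example in the text.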
There are several graphs in the report that call out the performance and concurrency advantage you get with Arcadia Enterprise. The graph below is one example that compares mean visual load time (i.e., the average response time for all queries in a given dashboard) between Arcadia Enterprise and the baseline configuration. You’ll see that the difference is significant at concurrency level 1, and concurrency levels of 5 and 10 showed similar acceleration ranging from 21x to 88x. As mentioned above, we could not compare against the baseline for concurrencies above 10 since the response times were unreasonable and unusable for a real-world production system.
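To make the metric concrete, here is a minimal sketch of how mean visual load time and an acceleration factor like "21x" could be computed. The timing values are made up purely for illustration and are not figures from the ESG report; the function names are hypothetical as well.

```python
from statistics import mean


def mean_visual_load_time(query_times_s):
    """Average response time, in seconds, across all queries in one dashboard."""
    if not query_times_s:
        raise ValueError("need at least one query time")
    return mean(query_times_s)


def acceleration_factor(baseline_s, accelerated_s):
    """How many times faster the accelerated configuration is than the baseline."""
    if accelerated_s <= 0:
        raise ValueError("accelerated time must be positive")
    return baseline_s / accelerated_s


if __name__ == "__main__":
    # Illustrative numbers only, NOT taken from the benchmark report.
    baseline_times = [30.0, 50.0, 40.0]    # per-query seconds, baseline setup
    accelerated_times = [1.0, 3.0, 2.0]    # per-query seconds, accelerated setup

    baseline = mean_visual_load_time(baseline_times)
    accelerated = mean_visual_load_time(accelerated_times)
    print(f"acceleration: {acceleration_factor(baseline, accelerated):.0f}x")
```

The factor is simply the ratio of the two means, so a figure such as 21x means the baseline dashboard took 21 times as long to load on average.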
Download the report (and the accompanying infographic on the same page) to see how Arcadia Enterprise performance can help with your big data deployment. You can read the official announcement here. But keep in mind that we’re not only about performance. Arcadia Enterprise is also about supporting many data sources and formats, simplifying the analytic lifecycle, and aligning with existing security controls to help keep your data safe. Let us know if you’d like to learn more.