August 14, 2018 - John Thuma | Big Data Ecosystem

How Traditional BI Nearly Killed Hadoop

“Traditional Business Intelligence tools are killing Apache Hadoop!”  

There… I said it! (Keep reading… there is a solution.)

Traditional business intelligence (BI) tools such as Tableau, Qlik, and others evolved alongside traditional relational database management systems (RDBMSs). Both use ANSI SQL for querying (i.e., retrieving) data for analysis. Traditional SQL databases include products like Netezza, Oracle, and Microsoft SQL Server. So why on earth would anyone think that legacy BI tools would work well with Apache Hadoop? I can certainly understand why people would try to make them work, but simply put, traditional BI tools are not architected for big data and modern data platforms!

The combination of the BI tool and the data platform is what defined the BI and data warehouse stack. They required separate server and network infrastructure, meaning you could not run your BI solution directly within the data warehouse system. These data platforms were considered costly and inflexible given today's adoption of JSON, XML, and other complex data formats. While the data platform evolved into today's data lakes using technologies such as Apache Hadoop and cloud-based storage, the BI toolset did not evolve with it. Traditional BI toolsets have nearly killed Apache Hadoop through their slow performance, lack of scalability, and complete lack of flexibility with respect to real-time and complex data use cases. In this article we will review how traditional BI tools like Tableau, Qlik, and others have let their customers down and nearly crushed their investment in Hadoop as a modern data platform. We will finish up with the solution!

PERFORMANCE:  Traditional Business Intelligence is slow on Apache Hadoop

Yesterday’s BI solution was built and optimized for relational data platforms like Teradata, Oracle, Netezza, and others. Let’s not forget how much work went into the data models, indices, and partitioning schemes to make those data platforms speedy. The legacy business intelligence tool was built specifically for those legacy data platforms. Along came the promise of Apache Hadoop, and IT organizations cheered, thinking they had found an alternative to the high price of Netezza and other ANSI SQL platforms. So they strapped their traditional BI tool to Apache Hive and quickly found out that things were very slow. So slow, in fact, that they copied data into third platforms such as MicroStrategy Intelligence Server, Tableau Server, or Apache Spark (in memory), and resorted to other performance optimization trickery. And while query performance may improve, most of these solutions rely on single-server or in-memory execution, severely limiting the scale of data that can be analyzed while also creating latency issues due to stale data, not to mention the administrative overhead, data governance, and security issues this approach creates. In my experience, the performance problems of traditional BI tools are a key factor in the lack of adoption and success of Apache Hadoop.

Why? Traditional Business Intelligence tools are not ‘native,’ first-class citizens on Apache Hadoop and cloud platforms. We needed something built for the modern data platform!

ARCHITECTURE:  Traditional Business Intelligence is an incomplete solution with respect to Apache Hadoop

Traditional BI is incomplete with respect to Apache Hadoop. Apache Hadoop is a cluster of commodity servers that combines storage and compute to handle extremely large volumes of sometimes complex data types. These clusters house massive data volumes! Many of the SQL-on-Hadoop technologies do not provide the optimization capabilities that yesterday’s data platforms offered, such as indices and primary keys. Apache Hive and other SQL-based Hadoop solutions are batch oriented, and thus not optimal for complex SQL operations, and they also support only a limited SQL dialect. This contributed to slow performance and a lack of scalability when legacy BI solutions were run against Hadoop. Traditional BI has its own set of services that run outside the Hadoop cluster and are not optimized to take advantage of it. One of the key value propositions of Hadoop is that if you want more power, you add more nodes, because it scales linearly. Because traditional BI is not native to the Hadoop cluster, it cannot take advantage of that scalability and power.

What do we need? We need something that is lightweight, doesn’t force you to move data to another platform like Apache Spark, and runs on the Apache Hadoop stack. We need something architected for Hadoop!

REAL-TIME:  Traditional Business Intelligence is weak on real-time and complex types

One of the most valuable use cases in big data and analytics is real-time IoT event processing. Apache Kafka is gaining popularity on the Hadoop platform as a way to take advantage of streaming analytics. Building analytics against Apache Kafka is not something traditional BI tools do well, or at all. Most streaming data arrives as JSON, and this data structure is not easily queried or parsed by traditional BI tools. Very few BI vendors can query complex data structures without flattening them into a row-and-column tabular format. Involving the business analyst in streaming data applications is virtually nonexistent, as Apache Kafka requires heavy technical knowledge.

Big Money, Big Money, all sorts of WHAMMIES! Event and streaming analytics require Apache Kafka, and legacy BI tools aren’t up to the task. Complex data types like JSON and Parquet require ETL programmers to flatten them into a structure most legacy BI tools can handle. Get your checkbook out: E-T-L are the three most expensive letters! What we need is a BI tool that can automatically work with streaming data and its complex types.
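To make the cost concrete, here is a minimal sketch of the “flattening” step an ETL job performs so a row-and-column BI tool can consume nested JSON. The sample event and its field names are hypothetical, not taken from any specific Kafka topic:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dot-separated column names."""
    row = {}
    for key, value in obj.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=col + "."))
        else:
            row[col] = value
    return row

# A nested IoT-style event as it might arrive on a stream (hypothetical).
event = json.loads("""
{
  "device_id": "sensor-42",
  "reading": {"temperature": 21.5, "humidity": 0.48},
  "location": {"site": "plant-a", "rack": 7}
}
""")

print(flatten(event))
# {'device_id': 'sensor-42', 'reading.temperature': 21.5,
#  'reading.humidity': 0.48, 'location.site': 'plant-a', 'location.rack': 7}
```

This is trivial for one toy event; the expense comes from maintaining this logic across evolving schemas, arrays, and edge cases for every stream the business wants to analyze.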

SOLUTION: So what are we going to do?

The promise of Apache Hadoop was that data access and analytics would be faster, more flexible, and more affordable than with existing legacy data platforms. Customers were hungry for an alternative experience that would quench an organization’s thirst for information and data access. Traditional Business Intelligence tools have failed big time on Apache Hadoop, and I think they have contributed to its adoption issues. What was needed was something built for Hadoop and the modern data platform. What we wanted was Google for Apache Hadoop data lakes, and Arcadia Enterprise is that tool! Arcadia Data runs visual analytics natively in-cluster, accelerating insights from Apache Hadoop and other big data platforms. It does this without moving data, bridging the gap between self-service data visualization and advanced analytics. Arcadia Data also provides a simple drag-and-drop interface for streaming Apache Kafka-based applications. Arcadia Data is the ‘Modern BI Tool’ for the ‘Modern Data Platform!’

Click here to learn more about Arcadia Enterprise!

