Common Approaches to Big Data Analytics
Let’s look at four popular approaches to business intelligence architecture that incorporate big data, along with the strengths and weaknesses of each.
OLAP on Big Data
An alternate approach that embraces a modern data platform, OLAP (online analytical processing) started to creep its way into the world of big data analytics in the early 2010s because of the scale and performance constraints of traditional BI approaches. OLAP involves predefining materialized views or virtual data “cubes” (in practice, pre-summarized aggregations of the underlying data) and directing queries to the appropriate level in the cube, returning results much faster than querying the underlying granular data.
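The cube concept can be sketched in a few lines. This is a minimal illustration with made-up data, not any vendor’s implementation: granular fact rows are rolled up ahead of time into aggregates at several levels, so an aggregate query becomes a lookup instead of a scan.

```python
from collections import defaultdict

# Hypothetical granular fact rows: (region, month, sales).
rows = [
    ("east", "2024-01", 100),
    ("east", "2024-01", 250),
    ("west", "2024-01", 300),
    ("east", "2024-02", 150),
]

# "Cube" build step: pre-aggregate sales at several levels up front.
# A real OLAP engine materializes many such levels across dimensions.
cube = defaultdict(int)
for region, month, sales in rows:
    cube[(region, month)] += sales  # (region, month) level
    cube[(region,)] += sales        # region rollup
    cube[()] += sales               # grand total

# Query time: each answer is a lookup into the pre-summarized cube,
# with no scan of the raw rows.
print(cube[("east", "2024-01")])  # 350
print(cube[("east",)])            # 500
print(cube[()])                   # 800
```

The work of scanning and summing is paid once at build time, which is exactly why queries that hit a prebuilt level return quickly regardless of how large the granular data is.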
Many organizations experiencing frustration with the less-than-interactive response times of SQL-on-Hadoop engines (see previous section) are looking to close the performance gap by deploying such big data OLAP solutions in between their BI tools and their data in Hadoop. Modern big data OLAP solutions avoid the movement of data and achieve scale by deploying their cubes directly into the Hadoop environment, right next to the granular data from which they are summarized. Example big data OLAP technologies include Apache Kylin, AtScale, Kyvos Insights, and Zoomdata.
Pros
- No dedicated servers required. Analyzing data directly in the big data platform leads to significant advantages. First, with no data movement from the platform to dedicated BI servers, the overall architecture is simpler and easier to maintain. Second, no external BI servers or external ETL means lower hardware and administrative costs. Third, all raw data remains available for fine-grained drill-down, unlike the summaries and aggregations used in dedicated BI servers. Finally, governance and compliance frameworks are easier to support when no separate copies of data are created in external repositories.
- Maximize current investments. As with SQL-on-Hadoop offerings, companies can leverage existing BI tools which can visualize the output from cubes, providing a faster path to big data adoption with a reduced learning curve.
- Fast queries and high user concurrency. Because aggregates are predefined and stored ahead of time, sub-second response times are achieved for anticipated queries, even with many concurrent users.
- Cost-effective scalability. By deploying analytics that leverage a big data architecture, you can easily achieve cost-effective scale-out by incrementally adding more commodity nodes to a cluster.
Cons
- Requires up-front modeling. Cubes require a significant up-front investment of time from IT and other technical staff to design, deploy, and administer, making them costly to build and maintain before queries and exploration can even begin. This requirement adds significant latency, inhibiting an immediate, real-time analytical environment.
- Ongoing assembly required. Since big data OLAP products cannot automatically build new cubes when new data is added, time and effort are required to develop, test, and deploy these enhancements. Such updates can take weeks or months to complete, potentially losing revenue opportunities in the process.
- Not real-time. Cubes require batch data updates, which typically happen once per day and can take hours depending on cube size.
- Lacks ad-hoc freedom. The tradeoff for scale and performance is limited flexibility. Users are confined to OLAP views and data cubes that are prepared by IT in advance, so they have little latitude to experiment. If the data is not in the cube, the BI tool must bypass the cube, and all the benefits of pre-aggregated performance are lost.
- Increased administration. Because multiple microcubes must be developed, many of these providers require separate security for these structures on top of existing controls. While this may be manageable initially, as the number of cubes and users grows, this simple task can become unwieldy extremely quickly.
- New skills and unfamiliarity with Hadoop. The concept of accessing large amounts of structured and unstructured data for analysis is new, and it will require some ramp-up for users and IT organizations alike. Basic familiarity with Hadoop or cloud systems is needed to set up and maintain the data store.
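The cube-bypass tradeoff described above can be sketched as a routing decision. This is an illustrative model with hypothetical data and function names, not any product’s API: queries whose filter dimensions are covered by a prebuilt aggregation level are answered from the cube, while anything else falls back to scanning the granular data and forfeits the pre-aggregated speed.

```python
# Granular fact rows: (region, product, sales). Illustrative data only.
rows = [
    ("east", "widget", 100),
    ("east", "gadget", 250),
    ("west", "widget", 300),
]

# IT prebuilt only the (region,) level -- "product" is not in the cube.
cube = {("east",): 350, ("west",): 300}

def total_sales(**filters):
    # Fast path: the cube covers exactly a region-level filter.
    if set(filters) == {"region"}:
        return cube[(filters["region"],)]
    # Slow path: the cube is bypassed and every granular row is scanned.
    return sum(s for r, p, s in rows
               if filters.get("region", r) == r
               and filters.get("product", p) == p)

print(total_sales(region="east"))                    # cube lookup: 350
print(total_sales(region="east", product="widget"))  # full scan: 100
```

Any question IT did not anticipate at cube-design time takes the slow path, which is why the ad-hoc limitation compounds as users’ questions evolve faster than the cubes do.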
At the end of the day, big data OLAP cubes offer a stop-gap solution that exists only to enable legacy BI technology, which was never designed to work natively with big data, to achieve fast query performance for certain predictable questions. They do not remove the cost or the complexity of the legacy BI architecture. In fact, they increase it by adding yet another middle layer into the equation.