Chapter 5:
Common Approaches to Big Data Analytics

Let’s look at four popular approaches to business intelligence architecture incorporating big data and the strengths and weaknesses of each.

SQL-on-Hadoop Engines with BI Tools

SQL-on-Hadoop offerings, such as the ones mentioned in chapter 1, are often used to let existing BI tools query big data directly. These engines reduce (or in some cases eliminate) the need to extract the data into a data warehouse, mart, or cube before it can be analyzed. Running SQL queries directly on a data lake built on Hadoop or on cloud platforms such as Amazon S3 provides easier access to the most fine-grained data available.
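To make the contrast between pre-built summaries and fine-grained data concrete, here is a minimal sketch. It assumes a hypothetical event-level table named sales_events, and it uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; a real deployment would send similar SQL to a SQL-on-Hadoop engine through that engine's own driver.

```python
import sqlite3

# sqlite3 is only a stand-in so the sketch runs anywhere; a real deployment
# would send similar SQL to a SQL-on-Hadoop engine through its own driver.
conn = sqlite3.connect(":memory:")

# Hypothetical raw, event-level table as it might sit in the data lake.
conn.execute("CREATE TABLE sales_events (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_events VALUES (?, ?, ?)",
    [
        ("east", "widget", 10.0),
        ("east", "gadget", 25.0),
        ("west", "widget", 12.5),
        ("west", "widget", 7.5),
    ],
)

# What a pre-built mart or cube typically holds: one summary row per region.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales_events GROUP BY region ORDER BY region"
).fetchall()

# What querying the raw data adds: a drill-down nobody planned for in advance.
west_widgets = conn.execute(
    "SELECT amount FROM sales_events "
    "WHERE region = 'west' AND product = 'widget' ORDER BY amount"
).fetchall()

print(summary)
print(west_widgets)
```

The second query is the kind of question a summary table cannot answer once the individual rows have been aggregated away, which is the main argument for querying the raw data in place.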

[Figure: SQL-on-Hadoop Architecture and Process Flow]

Pros

  • No dedicated servers required. Analyzing data directly in the big data platform leads to significant advantages. First, with no data movement from the platform to dedicated BI servers, the overall architecture is simpler and easier to maintain. Second, no external BI servers or external ETL software means lower hardware and administrative costs. Third, all raw data is immediately available, so analysts can drill into fine-grained detail rather than only the summaries and aggregations held in dedicated BI servers. Finally, governance and compliance frameworks are easier to support because no separate copies of the data are created in external repositories.
  • Unified security. Related to the above, since there is no data movement, data can be secured by the platform’s own security controls, which simplifies data protection and lowers the cost of securing your data. In the case of Hadoop, integration with security technologies like Apache Sentry and Apache Ranger enables the unified security model.
  • Lower learning curve. Both “citizen data scientists” and RDBMS power users can take advantage of tools and skills they already have. This leads to easier adoption since users don’t have to learn unfamiliar new tools.
  • Maximize current investments. Companies can leverage existing BI tools, skills, and training, thus avoiding additional expenditures on new BI tools.
  • Self-service. With minimal IT intervention needed to run new types of queries, SQL-on-Hadoop can effectively provide self-service analytics that give business users much more flexibility.
  • Reduced dependence on ETL. Since many SQL-on-Hadoop tools offer ETL functionality natively, the need for more sophisticated (and costly) ETL tools can be reduced as well.
  • Cost-effective scalability. By deploying analytics that leverage a big data architecture, you can easily achieve cost-effective scale-out by incrementally adding more commodity nodes to a cluster.

Cons

  • Less mature SQL support. While these SQL-on-Hadoop engines are intended to be used with popular BI tools, they tend not to support SQL syntax as extensively as more established technologies. This limits the types of queries that can be run from your BI tool.
  • Limited track record. While some vendors claim massive performance gains, those results are often obtained under ideal conditions. It is important to validate the claims in your own environment first.
  • Concurrency limits. Consumers have grown to expect real-time or near-real-time results while performing their analyses or accessing data via their dashboards, but these products may not deliver that because they must process SQL commands across a widely distributed cluster. This is especially true when a typical BI deployment has hundreds, if not thousands, of concurrent users.
  • New skills and unfamiliarity with Hadoop. The concept of accessing large amounts of structured and unstructured data for analysis is new, and will require some ramp-up time for users and IT organizations alike. Basic familiarity with Hadoop or cloud systems is needed to set up and maintain the data store.
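One way to act on the advice about validating performance claims and concurrency in your own environment is a small load-test harness. The sketch below is an illustration only: the function names are hypothetical, and placeholder_query merely sleeps to simulate query time. To test a real engine, you would replace it with a call that runs a representative dashboard query through that engine's driver.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(run_query):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def measure_latencies(run_query, concurrent_users, total_queries):
    """Issue total_queries calls using concurrent_users worker threads."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [pool.submit(timed_call, run_query) for _ in range(total_queries)]
        return sorted(f.result() for f in futures)


def percentile(sorted_values, fraction):
    """Simple index-based percentile of an already sorted list (fraction in 0..1)."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def placeholder_query():
    # Stand-in for a real dashboard query; replace with a call through
    # your engine's driver. The sleep only simulates query time.
    time.sleep(0.01)


if __name__ == "__main__":
    for users in (1, 10, 50):
        latencies = measure_latencies(placeholder_query, users, total_queries=50)
        print(users, "users: median", percentile(latencies, 0.5),
              "p95", percentile(latencies, 0.95))
```

Because the placeholder only sleeps, its latencies will stay roughly flat as the number of users grows. Against a real engine, comparing the median and 95th-percentile latencies at each concurrency level shows how quickly response times degrade as more users share the cluster.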