Chapter 7:
Selecting a Next-Generation Business Intelligence Platform

Query engine support.

Many tools can run SQL queries against Hadoop data to some degree, but consistency and maturity vary widely. At a minimum, look for support for ANSI-standard SQL to lower the learning curve. There are many SQL-on-Hadoop engines besides the ones described previously, including IBM Big SQL, Teradata QueryGrid, Apache HAWQ, Microsoft PolyBase, Presto, and Splice Machine. Also check whether a candidate platform's query engine can handle both normalized and denormalized data sets, and whether it supports time-saving features like aggregate awareness (automatically routing a query to a pre-summarized table when one can answer it) and multipass SQL (breaking a complex request into several SQL passes whose intermediate results are combined).
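To make aggregate awareness concrete, the following is a minimal Python sketch of the routing decision, not any vendor's implementation. The table names, dimension names, and row counts in `CATALOG` are hypothetical, and the sketch assumes that every measure is additive, so any table containing the requested dimensions can answer the query.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical catalog: table name -> (dimensions it contains, approximate row count).
CATALOG: Dict[str, Tuple[FrozenSet[str], int]] = {
    "sales_detail": (frozenset({"day", "store", "product", "customer"}), 500_000_000),
    "sales_by_day_store": (frozenset({"day", "store"}), 2_000_000),
    "sales_by_store": (frozenset({"store"}), 1_500),
}


def pick_table(requested_dims: Iterable[str],
               catalog: Dict[str, Tuple[FrozenSet[str], int]] = CATALOG) -> str:
    """Return the smallest table that contains every requested dimension."""
    needed = frozenset(requested_dims)
    # A table qualifies only if the requested dimensions are a subset of its own.
    candidates = [
        (rows, name)
        for name, (dims, rows) in catalog.items()
        if needed <= dims
    ]
    if not candidates:
        raise LookupError(f"no table covers dimensions {sorted(needed)}")
    # Fewest rows wins, on the assumption that fewer rows means less work to scan.
    return min(candidates)[1]
```

In this sketch, a query grouped by `store` alone is routed to `sales_by_store`, one grouped by `day` and `store` goes to `sales_by_day_store`, and anything involving `product` falls back to the detail table. A real engine would also rewrite the SQL and account for non-additive measures, which this sketch ignores.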

Data discovery and exploration.

Users should have the ability to browse data sources, structure, and content with full granularity and transparency. Good data discovery capabilities enable access to data inside and outside of Hadoop and the cloud from within a web browser, without proprietary drivers or extracts. Discovery should also work across multiple relational databases, such as MySQL/MariaDB, PostgreSQL, Oracle, Microsoft SQL Server, and Amazon Redshift. Look for sampling support, which lets the system retrieve only a small percentage of the underlying data for discovery purposes and greatly reduces query times. Finally, look for specialized processing engines that are optimized for discovery. Users should be able to write queries however they want, without worrying about the underlying processing engine.
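The statistical idea behind sampling can be sketched in a few lines of Python. This is an illustration only: the function name `bernoulli_sample` and its parameters are hypothetical, and the sketch still reads every row, whereas a real engine typically gets its speedup by skipping whole blocks or splits of the underlying data.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], fraction: float, seed: int = 0) -> Iterator[T]:
    """Yield each row independently with probability `fraction`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    # A seeded generator keeps the sample reproducible between runs.
    rng = random.Random(seed)
    for row in rows:
        # random() returns a value in [0.0, 1.0), so fraction=1.0 keeps every row.
        if rng.random() < fraction:
            yield row


# Usage: keep roughly 1 percent of 100,000 synthetic rows.
preview = list(bernoulli_sample(range(100_000), fraction=0.01, seed=42))
```

The sample size varies from run to run around the requested fraction unless the seed is fixed, which is usually acceptable for exploration but not for reporting exact totals.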

Hadoop support.

New Hadoop versions are released frequently, so ensure that the platform integrates with whatever version(s) of Hadoop you use. If your use case involves commercial distributions, such as Cloudera, Hortonworks, or MapR, look for an analytics or BI platform that is certified as compatible by those vendors. Native HDFS API support is important because it removes the need for a separate extract engine or intermediate data structures. Also ensure that the platform integrates with the same Hadoop cluster manager that you use, such as Apache Ambari or Cloudera Manager. It should also integrate with Hadoop metadata frameworks, such as the Hive Metastore and HCatalog.

Native Hadoop security.

Many traditional BI and visualization tools rely on decentralized security models, which complicates the process of extracting and managing data from Hadoop. In these cases, administrators must define the same security roles and privileges twice: once at the Hadoop tier and again in the BI environment. Integrating with native Hadoop security enables administrators to control data access at a granular level, from the platform through to the UI. Look for centralized role-based access control (RBAC) that is integrated with Hadoop-native projects like Apache Sentry and Apache Ranger. Authentication and group membership administration should integrate with underlying directory sources based on Active Directory, Kerberos, LDAP, or SAML, and should also use the role membership and privilege information held in Apache Sentry. Data permissions should be defined in the cluster, including discrete access control down to a single row or column in the data. If users plan to publish their data applications externally, make sure that published data can be securely provisioned and controlled down to the exact dataset level.
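Row-level and column-level control can be pictured as a policy that filters rows and masks columns before results reach the user. The Python sketch below illustrates that idea under stated assumptions; it is not the policy format or API of Apache Sentry or Apache Ranger, and the role names, column names, and the `apply_policy` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    allowed_columns: FrozenSet[str]      # columns the role may see
    row_filter: Callable[[Row], bool]    # predicate deciding which rows the role may see


# Hypothetical roles, defined once in a central place.
POLICIES: Dict[str, Policy] = {
    "analyst_emea": Policy(frozenset({"region", "revenue"}),
                           lambda row: row["region"] == "EMEA"),
    "auditor": Policy(frozenset({"region", "revenue", "customer_id"}),
                      lambda row: True),
}


def apply_policy(role: str, rows: List[Row]) -> List[Row]:
    """Return only the rows and columns that `role` is permitted to see."""
    policy = POLICIES.get(role)
    if policy is None:
        # Deny by default when a role has no policy.
        raise PermissionError(f"no policy defined for role {role!r}")
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]
```

Because the policies live in one place, a BI layer that enforces them this way does not need its own copy of the role definitions, which is the redundancy that centralized security is meant to remove.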

Vertical integration.

Determine whether the tools that the product provides for data preparation, data modeling, semantic modeling, and reporting/analysis are seamlessly integrated with each other. Tight integration reduces or eliminates the need for extract databases. Some platforms integrate data preparation, modeling, and reporting/visualization but have only limited compatibility with third-party tools. Check the vendor's partnerships to see whether the platform integrates with other visualization, preparation, and management tools you may already use. Integration with popular open source data management and analytics tools like Apache Kudu, Apache Spark, Apache Impala, and Apache Solr is desirable.

Multiple deployment options.

Even if your organization hasn’t made the jump to the cloud yet, it’s highly likely you will move at least part of your infrastructure to a public, private, or hybrid cloud at some point. Your BI engine should support both on-premises and cloud-based data sources. Look for compatibility with popular cloud operating systems and storage platforms, such as Amazon S3. Be sure to ask potential vendors whether their analytical and visualization engines can seamlessly accommodate data from multiple sources without requiring extracts or intermediate steps. Can all or part of your data store be moved to the cloud without breaking existing queries or reports? Can data from HDFS and S3, for example, be combined in a single query? Each organization will have a different tolerance for the trade-off between flexibility and functionality.
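As a small-scale analogy for combining two stores in a single query, the Python sketch below attaches two separate in-memory SQLite databases to one connection and joins across them. SQLite stands in for both HDFS and S3 here purely for illustration; the schema names `main` and `cloud`, the tables, and the data are hypothetical, and a real platform would do this through its own connectors.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create one connection that can see two independent databases."""
    conn = sqlite3.connect(":memory:")  # stand-in for the on-premises store
    conn.execute("ATTACH DATABASE ':memory:' AS cloud")  # stand-in for the cloud store
    conn.execute("CREATE TABLE main.orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE cloud.customers (customer_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO main.orders VALUES (?, ?, ?)",
                     [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)])
    conn.executemany("INSERT INTO cloud.customers VALUES (?, ?)",
                     [(10, "Acme"), (11, "Globex")])
    return conn


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Join a table in one database with a table in the other, in one statement."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM main.orders AS o
        JOIN cloud.customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()
```

The point to test with a vendor is the same one the sketch makes: the query names both sources directly, with no extract step copying one store into the other first.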

Ease of administration.

To minimize complexity, look for platforms that integrate natively with existing Hadoop administration tools like Apache Ambari and Cloudera Manager. In most cases, you will want to avoid introducing yet another administrative tool into your environment.
