Chapter 6:
Making Big Data Actionable

Modern Data Platforms Continue to Evolve

The commercial software and open source communities are constantly innovating in areas such as Hadoop management. For example, YARN was added in Hadoop version 2 to handle resource management for Hadoop jobs. Apache Ambari is a management environment focused on ease of use for Hadoop installation, system management, and operations, and is most commonly used with vanilla Apache Hadoop and Hortonworks HDP. Cloudera Manager and MapR Control System are proprietary but highly functional commercial alternatives. A BI system that works directly with Hadoop data needs to take advantage of these and many other management tools for performance and capacity tuning, especially as the deployment grows.

Cloud deployments are becoming increasingly popular, especially for their rapid provisioning capabilities. While the cloud has historically been valued for its lower cost of ownership, it is the “elasticity,” or the ability to easily expand and contract a big data deployment, that is the key benefit today. Organizations no longer have to wait days or weeks for physical servers to be set up in the data center, as cloud instances from third-party vendors can typically be provisioned in minutes. This capability is vital for supporting environments where data sets and user loads continue to grow and require more hardware resources. It is also valuable for handling short-term load bursts, after which the extra capacity can be shut down to avoid paying for unneeded resources. Most major big data technology vendors support third-party cloud deployments, so the cloud is a viable option for advanced BI workloads. Popular cloud-native options like Amazon EMR, and emerging options like Snowflake, a cloud-only data warehouse technology, round out the cloud landscape.
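To make the idea of elasticity concrete, the following is a minimal sketch of the kind of sizing rule an autoscaling policy might apply. It is not the API of any cloud vendor; the function name, the 60 percent utilization target, and the node limits are all hypothetical values chosen for illustration.

```python
def target_node_count(current_nodes, cpu_percent, target_percent=60,
                      min_nodes=2, max_nodes=20):
    """Suggest a cluster size that brings average CPU use near the target.

    All parameters are illustrative assumptions, not vendor defaults.
    cpu_percent and target_percent are whole-number percentages.
    """
    if current_nodes < 1 or target_percent < 1:
        raise ValueError("current_nodes and target_percent must be positive")
    # Total work is proportional to nodes * utilization; divide by the target
    # utilization and round up (integer ceiling division) to get nodes needed.
    needed = -(-(current_nodes * cpu_percent) // target_percent)
    # Clamp to the allowed range so the cluster never shrinks or grows too far.
    return max(min_nodes, min(max_nodes, needed))


# A load burst: 4 nodes running at 90% suggests expanding to 6 nodes.
burst_size = target_node_count(4, 90)
# After the burst: 10 nodes idling at 15% suggests contracting to 3 nodes.
quiet_size = target_node_count(10, 15)
```

In a real deployment the suggested size would be passed to the cloud provider's provisioning service, which is where the "minutes rather than weeks" benefit comes from.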

Another emerging class of technologies that looks to change the way data is analyzed is real-time streaming data. This class includes technologies such as Apache Kafka, MapR-ES (formerly MapR Streams), RabbitMQ, and IBM WebSphere MQ. These products, particularly the first two, are commonly used in big data deployments, especially those based on Hadoop, to handle event streaming as a required complement to batch-oriented, historical data. In fact, Cloudera and Hortonworks both support Kafka as part of their offerings, and MapR provides MapR-ES, its Kafka-compatible event streaming engine. For added capabilities, specialized stream processing engines like Spark Streaming, Apache Flink, Apache Apex, Apache Storm, and StreamSets, to name just a few, help with analyzing the data as it is delivered. Visualizations of event data are becoming more mature as well, and new innovations that present high-speed streaming data graphically are going to pave the way for gaining faster insights from big data.
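As an illustration of what "analyzing the data as it is delivered" can mean, here is a minimal sketch of a tumbling-window count, one of the basic operations stream processing engines provide. It uses only the Python standard library rather than any of the engines named above, and it assumes events arrive in timestamp order; the function name and the sample events are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) each time a fixed-size window closes.

    events: iterable of (timestamp_seconds, event_type) pairs, assumed to
    arrive in non-decreasing timestamp order (a simplifying assumption;
    real engines also handle late and out-of-order events).
    """
    current_start = None
    counts = Counter()
    for timestamp, event_type in events:
        # Align the timestamp to the start of its fixed-size window.
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete, so emit its result now.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[event_type] += 1
    if current_start is not None:
        # Emit the final, still-open window when the stream ends.
        yield current_start, dict(counts)


# Hypothetical click-stream events: (seconds since start, event type).
sample_events = [(0, "click"), (3, "view"), (9, "click"),
                 (10, "click"), (19, "view")]
results = list(tumbling_window_counts(sample_events, window_seconds=10))
```

Because results are emitted as soon as each window closes, a dashboard fed by this kind of computation can update continuously instead of waiting for a batch job to finish.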
