This article was first published on Medium.
Apache Kudu is a columnar storage system developed for the Apache Hadoop ecosystem. Kudu runs on commodity hardware, is horizontally scalable, and supports highly available operation. It integrates tightly with Apache Impala, providing an alternative to using HDFS with Apache Parquet. Before Kudu, existing formats such as Avro, HBase, MapFiles, and Parquet were not sufficient for reporting applications where newly arrived data needs to be immediately available to end users. This document highlights some key advantages of Apache Kudu in support of business intelligence (BI) on Hadoop (i.e., big data systems).
People define big data differently. It could be Apache Hadoop, it could be cloud storage, or it could be anything bigger or more complex than what you can easily manage and analyze in a traditional relational database. However you define big data, let’s go back in time and look at the SQL database and how data models were formed to support BI. Many past data management practices still apply to modern data platforms, and they affect which data format you should select for your BI efforts on big data systems.
Apache Kudu Provides the Mutability Required for BI on Big Data
You may have heard about ‘third normal form’ (3NF) and ‘star schemas’ and how they played a vital role in the development of legacy BI solutions. These data models did more than increase performance and denormalize data. They also played a critical role in ensuring data was accurate and consistent. To that end, we developed what we called a ‘slowly changing dimension’ (SCD). This allowed us to maintain history inside dimensional reference data. For instance, if I have a CUSTOMER dimension that contains data about customers, I will want to keep their identifying details and other information current. I also may not want to lose the customers’ history. As an example, if a customer moves and has an address change, I will probably want to keep the old address so I can properly assign demographic data to that customer. I will also want to follow their journey in life as they move and mature. The person’s name might change too, and I will want to keep that on record. I will always want to know which record is current and what date range each customer record covers, so I can match it to facts and other dimensions accurately. To maintain such a dimension, I need a mutable data format that can perform UPDATEs on records. I also need a format that supports fast data, so I can rapidly find the records that require updating. This is why Apache Kudu is so critical to BI on Hadoop.
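To make the slowly changing dimension idea concrete, here is a minimal sketch of a history-keeping CUSTOMER dimension. It uses a plain Python list as a stand-in for a mutable table and does not use the Kudu client API. The names `CustomerVersion`, `apply_change`, and the sample customer are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    """One row of a hypothetical CUSTOMER dimension that keeps history."""
    customer_id: int
    name: str
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the row is still valid
    is_current: bool = True


def apply_change(dim: List[CustomerVersion], customer_id: int,
                 name: str, address: str, effective: date) -> None:
    """Close the current version (an UPDATE), then add the new one (an INSERT)."""
    for row in dim:
        if row.customer_id == customer_id and row.is_current:
            if row.name == name and row.address == address:
                return  # nothing changed, so keep the current version as-is
            # This in-place mutation is the step that needs a mutable format.
            row.valid_to = effective
            row.is_current = False
    dim.append(CustomerVersion(customer_id, name, address, effective))


# Assumed sample data: one customer who moves in June 2018.
dim: List[CustomerVersion] = []
apply_change(dim, 42, "Ana Diaz", "12 Oak St", date(2015, 1, 1))
apply_change(dim, 42, "Ana Diaz", "99 Pine Ave", date(2018, 6, 1))

for row in dim:
    print(row.address, row.valid_from, row.valid_to, row.is_current)
```

The old address survives with a closed date range while the new row is flagged as current, which is the combination of update and insert that an append-only format cannot express directly.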
Apache Kudu Is Best for Late-Arriving Data Due to Fast Inserts and Updates
BI on Apache Hadoop also requires a data format that works with fast-moving data. Apache Kudu ships with Cloudera Enterprise and enables real-time analytics without a lot of hassle. Kudu is a storage engine that provides fast analytics on changing data. It is purpose-built for use cases around time series data, machine data analytics, and online reporting. Kudu combines fast inserts and updates against large datasets with efficient columnar scans, which enables multiple real-time analytic workloads on one storage layer. As a result, Kudu lowers query latency for the Apache Impala and Apache Spark execution engines when compared to MapFiles and Apache HBase. This capability opens up IoT and other use cases where real-time analytics are critical and can be a massive competitive advantage. It also benefits real-time data integration where late-arriving data is involved.
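The handling of late-arriving data can be illustrated with a small upsert sketch: a row keyed on a primary key is inserted if it is new and overwritten if it already exists. The dictionary below is only a stand-in for a table with a compound primary key, not the Kudu client API, and the sensor names, timestamps, and readings are assumed sample values.

```python
from typing import Dict, Tuple

# Hypothetical primary key: (device_id, event_time). Value: the sensor reading.
Key = Tuple[str, str]


def upsert(table: Dict[Key, float], device_id: str,
           event_time: str, reading: float) -> str:
    """Insert the row if its key is new, otherwise overwrite the existing row."""
    key = (device_id, event_time)
    action = "update" if key in table else "insert"
    table[key] = reading
    return action


def total_for_device(table: Dict[Key, float], device_id: str) -> float:
    """A tiny 'analytic query' that always sees the latest state of the table."""
    return sum(value for (dev, _), value in table.items() if dev == device_id)


readings: Dict[Key, float] = {}
actions = [
    upsert(readings, "sensor-1", "2019-03-01T10:00", 20.0),
    upsert(readings, "sensor-1", "2019-03-01T10:05", 21.0),
    upsert(readings, "sensor-1", "2019-03-01T09:55", 19.0),  # arrives late
    upsert(readings, "sensor-1", "2019-03-01T10:00", 20.5),  # corrects a row
]

print(actions)
print(total_for_device(readings, "sensor-1"))
```

The late row and the correction are both visible to the next query without rewriting a file, which is the behavior the paragraph above attributes to a mutable store.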
Apache Kudu Data Format Enables Faster BI and Analytical Workloads on Hadoop/Big Data
Apache Kudu also provides excellent space utilization and best-in-class random lookup latency. Its space efficiency is similar to that of Apache Parquet, where compression with algorithms like Snappy or GZip reduces the consumed space significantly: storage space can shrink by a factor of 10 when compared with MapFiles. Apache Kudu, like Apache HBase, provides the fastest retrieval of a record’s non-key attributes given a record identifier or compound key. Additionally, Kudu’s lookup performance and mutability give it an advantage in integrating late-arriving data. Lookups are essential to analytic discovery and to managing slowly changing dimensions, and they give this format a key advantage over other Apache storage format projects. Taking up less space and providing high-speed data lookups are key to building BI on Hadoop solutions.
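As a rough illustration of why keyed lookups matter for slowly changing dimensions, the sketch below finds the dimension version that was valid on a given fact date. It uses an in-memory dictionary and a binary search from the Python standard library, not Kudu itself. The `history` structure, the `address_as_of` helper, and the sample rows are hypothetical.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

# customer_id -> versions as (valid_from, address), sorted by valid_from.
# Together, (customer_id, valid_from) acts as a compound key.
history: Dict[int, List[Tuple[date, str]]] = {
    42: [(date(2015, 1, 1), "12 Oak St"), (date(2018, 6, 1), "99 Pine Ave")],
}


def address_as_of(customer_id: int, on: date) -> Optional[str]:
    """Return the address that was valid for the customer on the given date."""
    versions = history.get(customer_id, [])
    starts = [valid_from for valid_from, _ in versions]
    # Count the versions that started on or before `on`; the last one applies.
    i = bisect_right(starts, on)
    return versions[i - 1][1] if i else None


# A fact dated March 2016 should match the old address.
print(address_as_of(42, date(2016, 3, 15)))
# A fact dated after the move should match the new address.
print(address_as_of(42, date(2019, 1, 1)))
```

Each fact row needs only one keyed lookup to find its matching dimension version, so lookup latency directly affects how quickly facts and dimensions can be joined.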
Seeing is believing. Take a look at how Arcadia Enterprise leverages Apache Kudu. In this video we will review the value of Apache Kudu and how it differs from other storage formats such as Apache Parquet, HBase, and Avro.
For more information about Apache Kudu, there is an excellent article on the Cloudera Engineering Blog: ‘Performance comparison of different formats and storage engines in the Apache Hadoop ecosystem.’ The article presents a performance test comparison between several data formats, including Apache Avro, Apache Parquet, Apache HBase, and Apache Kudu. Not all big data workloads or use cases are the same, but it is critical that the right format is used for the right job. For instance, if I were going to run mostly transaction processing (OLTP) workloads, I might select Apache HBase as my format. The table below matches each storage format to its appropriate workload.
Apache Kudu and Apache Parquet are the best-in-class data formats for Hadoop analytics where aggregation, reporting, and filtering are essential. This is why Arcadia Enterprise leverages the Kudu and Parquet formats both internally (Smart Acceleration Analytical Views are stored using Parquet) and as data sources. If you want to see it in action, give Arcadia Instant a try. It is a free version of our product and is simple to learn.
You can also see additional Hadoop analytics use cases in this other connected car demo, which juxtaposes Apache Kudu data with historical data from systems such as HDFS in one pane of glass.