Tech Talk on Accelerating Queries on Kudu
Hi, I'm Anjali Norwood. I am a founding engineer here at Arcadia Data and I lead the ArcEngine team. In this video I want to discuss analytical views on Kudu tables. Now, the question is: Kudu is a real-time system. People store data that's constantly coming in. Data gets updated, deleted, and inserted. It's a very dynamic system, and a lot of the queries on this Kudu data would be simple lookups, "look up this, look up that," quick queries like that. So why do you even need analytical views? Well, it turns out this data tends to accumulate over time, and people are interested in doing historical analysis on it. So there are two very distinct kinds of workloads: people either look up what happened to a particular customer's shipment, "why did it not arrive?", which is a quick lookup query, or they ask what the average delay was in getting packages delivered to customers over the last three months, six months, one year, whatever your timeframe. Analytical views help accelerate these historical-analysis queries.

In one of the first videos, we discussed what is under the hood for analytical views in ArcEngine. There, the analytical views were stored in Parquet format in HDFS, and the base data was also in HDFS. Those analytical views were refreshed using a partition-based refresh mechanism, where ArcEngine relied on file system metadata to figure out changes to the base table and trigger an incremental refresh of the analytical view. Kudu analytical views are a little different: they are refreshed using a strategy called sequence-based refresh, because with Kudu that kind of file metadata is not available.

Now, analytical views on Kudu are essentially Kudu tables under the covers, so they have to follow the rules laid down by Kudu. One rule is that each Kudu table must have a primary key. This means an analytical view must have a primary key as well.
Kudu allows two different types of partitioning: hash partitioning and range partitioning. Analytical views support both. Hash partitions can be nested inside each other, and range partitions can further be nested underneath those. The idea behind hash partitioning is this: say a company deals with hundreds of customers. You would not want to create a partition for each of those customers; you would probably want to hash them into maybe ten or twenty buckets. Range partitioning, on the other hand, comes in handy for tracking time. For example, a company tracks its sales data at a daily level: data gets pumped into the table maybe every hour, and analysis happens at a daily level, so it might make sense to partition the analytical view by day.
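To make the bucketing idea concrete, here is a minimal sketch in Python. It is purely illustrative: Kudu uses its own internal hash function and partitioning machinery, and the `hash_bucket` helper and bucket count below are made up for this example.

```python
# Hypothetical sketch of hash partitioning: map each of hundreds of
# customers onto a small, fixed number of buckets, instead of creating
# one partition per customer. (Not Kudu's actual hash function.)
import hashlib

NUM_BUCKETS = 16  # e.g. ten or twenty buckets rather than one per customer

def hash_bucket(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministically assign a customer to one of num_buckets buckets."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Hundreds of distinct customers collapse into at most NUM_BUCKETS partitions.
buckets = {hash_bucket(f"customer-{i}") for i in range(500)}
print(len(buckets))  # never more than 16
```

Because the assignment is deterministic, the same customer always lands in the same bucket, which is what lets a query for one customer be routed to a single partition.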
Partitioning an analytical view, just like a table, has a distinct benefit: any query with predicates on the partitioning columns benefits from an optimization called partition pruning. What that means is, if you are looking for data for a company, say ABC, on which we hash partition, and for a particular day, maybe Christmas Day 2018, then partition pruning reads exactly the partitions where the company name is ABC and the date is the 25th of December 2018. It does not touch or read the rest of the data, which is a huge gain in execution performance.

A new concept that ArcEngine adds is a sequence column. This is what makes incremental refreshes of analytical views possible. Without it, analytical views would need to be refreshed completely and fully every single time, which is an expensive process. Step size is a parameter that helps tune the size of the partitions in the analytical view, and we will discuss that as well.

Let's work with an example. Imagine a messaging provider, a company that provides a messaging platform. A row in a table might track which company, let's say company ABC, sent how many messages through this messaging platform on a given day. Creation time would likely be the time the message was created, expressed in, say, microseconds since the UNIX epoch, but for the purposes of discussion we will use simple dates here. There might also be a transaction ID and a whole bunch of other information associated with each row. A typical query against this table would be: how many messages were sent by a particular customer on a particular day? This type of query can be optimized by creating an analytical view that computes the sum of messages for that day and that customer. So the idea with sequence-based refresh is that, as the table gets more and more data appended to it, ArcEngine figures out that the last set of data it saw was for August 3rd.
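Partition pruning can be sketched in a few lines of Python. This is a toy model, not ArcEngine or Kudu internals: the `bucket` function, the in-memory `partitions` dictionary, and the sample rows are all invented for illustration.

```python
# Hypothetical sketch of partition pruning: rows live in partitions keyed by
# (hash bucket of company, day). A query with predicates on both columns
# reads exactly one partition and never touches the rest of the data.
from collections import defaultdict
from datetime import date

NUM_BUCKETS = 4

def bucket(company: str) -> int:
    # Simple deterministic stand-in for Kudu's hash partitioning.
    return sum(company.encode()) % NUM_BUCKETS

partitions = defaultdict(list)  # (hash_bucket, day) -> rows

def insert(company: str, day: date, messages: int) -> None:
    partitions[(bucket(company), day)].append((company, day, messages))

insert("ABC", date(2018, 12, 25), 100)
insert("ABC", date(2018, 12, 24), 70)
insert("XYZ", date(2018, 12, 25), 50)

def pruned_scan(company: str, day: date):
    """Read only the one partition the predicates point at."""
    rows = partitions.get((bucket(company), day), [])
    # Another company could share the bucket, so still filter within it.
    return [r for r in rows if r[0] == company]

print(pruned_scan("ABC", date(2018, 12, 25)))
```

The point is that the lookup cost depends on the size of one partition, not the size of the whole table, which is where the execution-performance gain comes from.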
The way ArcEngine figures this out is that the sequence column was defined to be, say, the creation time column in the table. Designating a column as a sequence column is essentially an indication to ArcEngine that this column will increase over time, and that it can be used to track changes to the table. That's how, when a new row gets added, maybe for August 4th for a company XYZ, and a refresh of the analytical view needs to be done, ArcEngine knows to start reading data from August 4th onwards.
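For the append-only case, the mechanism can be sketched as a high-water mark on the sequence column. Again, this is a simplified illustration under assumed names (`base_table`, `refresh`, `high_water_mark` are all hypothetical), not ArcEngine's actual refresh code.

```python
# Hypothetical sketch of sequence-based refresh for appended data: the view
# keeps per-(company, day) message sums plus a high-water mark of the
# sequence column (here, the creation date). A refresh folds in only rows
# past the mark, instead of recomputing the whole view.
from collections import defaultdict
from datetime import date

base_table = []          # rows: (company, creation_date, messages)
view = defaultdict(int)  # (company, day) -> sum of messages
high_water_mark = date.min

def refresh() -> None:
    """Incrementally fold new rows into the view."""
    global high_water_mark
    for company, day, messages in base_table:
        if day > high_water_mark:  # only rows newer than the last refresh
            view[(company, day)] += messages
    high_water_mark = max((d for _, d, _ in base_table),
                          default=high_water_mark)

base_table.append(("ABC", date(2019, 8, 3), 10))
refresh()                                        # mark moves to August 3rd
base_table.append(("XYZ", date(2019, 8, 4), 5))
refresh()  # only the August 4th row is folded in; August 3rd is not recounted
```

Running `refresh()` a third time changes nothing, because no row has a sequence value beyond the high-water mark, which is exactly the property that makes the refresh incremental.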
Now, Kudu allows updates and deletes of the data as well, so it's very possible that somebody goes back to August 1st and changes something, maybe the number of messages. You would want that change to be reflected in the analytical view, and the way to do that is to tell ArcEngine to reevaluate only the relevant portion of the data: refresh this analytical view where the sequence column is between August 1st and August 2nd. That way, only that portion of the data in the base table is read, and only the corresponding part of the analytical view is reevaluated. So the sequence column allows ArcEngine to perform a delta refresh of the analytical view rather than recomputing the full analytical view.

On the analytical view side, the sequence column results in an internally created range partition column, so that the computation for each range of the base table data is stored in a corresponding partition of the analytical view. Step size allows you to modify the size of this range. A good choice might be a range corresponding to one day's worth of data, for example. You would not want to create one huge range where all the data in the analytical view resides, because that would mean any change to the base table results in a total recomputation. On the other hand, you don't want too many small partitions, because that is not good for performance either. So step size allows you to tune this for your workload.

Thank you for watching this video, have fun!
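Putting the two ideas together, a ranged delta refresh with a step size might be sketched like this. Everything here is hypothetical scaffolding (`STEP`, `EPOCH`, `refresh_range`, the dictionaries) chosen to illustrate the concept of recomputing only the view partitions that overlap the refreshed sequence range.

```python
# Hypothetical sketch of a ranged delta refresh: the view is split into
# range partitions of STEP days on the sequence column, and a refresh over
# [lo, hi] recomputes only the partitions overlapping that range.
from collections import defaultdict
from datetime import date, timedelta

STEP = timedelta(days=1)  # step size: one day's worth of data per partition
EPOCH = date(2019, 1, 1)

base_table = {}  # (company, day) -> messages (latest value after updates)
view = {}        # partition index -> {(company, day): sum of messages}

def partition_of(day: date) -> int:
    return (day - EPOCH) // STEP

def refresh_range(lo: date, hi: date) -> None:
    """Recompute only the view partitions overlapping [lo, hi]."""
    for p in range(partition_of(lo), partition_of(hi) + 1):
        sums = defaultdict(int)
        for (company, day), messages in base_table.items():
            if partition_of(day) == p:
                sums[(company, day)] += messages
        view[p] = dict(sums)

base_table[("ABC", date(2019, 8, 1))] = 10
base_table[("ABC", date(2019, 8, 2))] = 20
refresh_range(date(2019, 8, 1), date(2019, 8, 2))
base_table[("ABC", date(2019, 8, 1))] = 7    # an update back on August 1st
refresh_range(date(2019, 8, 1), date(2019, 8, 1))  # redo just that partition
```

With a one-day step, an update to August 1st recomputes one small partition; with one giant range, the same update would force recomputing everything, which is the trade-off the step size parameter lets you tune.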