Tech Talk on Analytical Views

Anjali Norwood, founding engineer at Arcadia Data, explains how analytical views work and how they help with accelerating queries and dashboards in large-scale analytical environments. Read the transcript below.

Cloud-Native Introduction - AWS Athena

Tech Talk on Join Optimization

Cloud-Native BI


Hi, I’m Anjali Norwood, I’m a founding engineer here at Arcadia Data, and I lead the ArcEngine team. Today I want to talk about analytical views a little bit, under the hood as to how they work. But first the question is: why analytical views? Analytical views are absolutely essential to get good performance in your BI production environment.

In BI, people have complex reports and dashboards that do a whole bunch of complex computations. These are computationally intensive queries, and they run them over and over again so with this type of repeated access, analytical views really help performance. What are these analytical views? So these are acceleration structures as I mentioned. They store intermediate computation, partial computations, of the query but they're more than that. They can be refreshed incrementally, meaning they can be populated incrementally when your base data changes. They are transparent to your end-user. So let's imagine a BI tool, or let's say the Arcadia visualization tool, that connects to the ArcEngine, and these BI tools issue SQL queries which then ArcEngine executes. So the query is written against a table but ArcEngine internally will figure out if the analytical view has the right amount of data, the right data, to answer that query and it’s going to actually run the query against the analytical view. Now these are precomputed results. Analytical views are typically much smaller than tables which means they can be read faster and then these results flow back to the BI tool. But the end user doesn't need to see any of this.

Now you would say all of this is fine, but it's not worth it if I have to refresh the analytical view then populate it, and that process takes a long time. And you would be right. It's really important to refresh the analytical views in an incremental fashion. What I mean by that is when you change any data in the table, you want to change only relevant data in the analytical view. You don't want to have to throw out everything that's there in the analytical view and have to rebuild it. So yes, analytical views are incrementally refreshable. These are stored in Parquet format which is an open format. This is nothing proprietary. Also, once you create these analytical views, they can be shared by multiple users if you wish so, if you set up your authorization policies in that fashion.

Now I want to talk a little bit about when exactly a query written against a table can be routed to an analytical view. ArcEngine has a process called “matching and compensation.” The matching part essentially says, “does the analytical view have the right data to answer the query?” and compensation means, “does the engine need to do anything more on top of that?” So these Venn diagrams show that, let’s say the query requires data that is denoted by the black circle and what the analytical view contains is actually in the pink circle. So if the analytical view contains exact results of the query, that's great. ArcEngine can use that to answer the query. On the other hand, if an analytical view has more data than what the query needs, then ArcEngine needs to do what is called “compensation.” And we'll look at a concrete example to see what compensation is. It can take the form of applying some predicates, or regrouping the results that are in the analytical view, or computing certain expressions. Now on the other hand, if the query results are here in the black circle and the analytical view provides a small part of it, under certain conditions it can be used, but if we think of a very simple scenario with a query with a single table, the answer is no. In that case, the analytical view cannot be used. Same with the fourth Venn diagram here, where the analytical view does not have the correct result data to respond to this query.

Now let's look at a concrete example. Here I'm going to work with an events table. You can imagine events such as, maybe people gaming on their cell phones, generating certain events. So this table tracks the user ID, the device ID, which platform they used, when did the event happen, what type of event that was, which app that was, you get the idea. Another important thing is the date of the event, let’s say year, month, and day. In my case, year/month/day together also form the partition key of this events table. Now a pretty normal query that anybody would want to write to get some information out of this events table would be, “tell me the event count for every day, for each day, for a particular event.” So in SQL terms, we’re basically selecting year/month/day columns, the event name column, and simply counting up the events. And we’re grouping on those four columns and we want the results ordered by the number of events, in descending order, so that the highly frequent event will appear on top, and we cut off at 10.

So just by looking at this query, we would say: let's create an analytical view that would look like this. Now, the first part of the definition essentially has clauses that say, create this analytical view, have it be partitioned on year, month, and day columns, and store it in Parquet format. And here is the real definition of the analytical view that says, I want the event count for each day and for each event name. Now you will notice that there are no “where” clauses, “having,” “order by,” “limit,” none of that in the analytical view. It's like that for a reason; it's a good idea to create a generic analytical view that you can use to answer multiple queries. So you also want to prevent creating a whole bunch of analytical views each tailored to a specific query. For example, in this case, we could simply change the query. The user might say, now I want to know yearly event count for each event, so we are not going to be grouping on month and the query is going to look like this. The same analytical view can answer this query because the analytical view is at a finer granularity of data, so ArcEngine applies the compensation to roll up some of that data to answer this coarser granularity query. One can think of changing the order to see the results in a different order, maybe ordered on year. Sure, no problem, the order can be applied as compensation. One can add predicates to the query where maybe we are looking for a particular year, 2018, and maybe looking for certain events like app_opened, game_played, and so on. And again, that is okay, the same analytical view can be used and the filter can be applied as compensation. And I could give you more examples but they would fall in a similar category where you could change your ordering, you could add a “having” clause here to only look for positive results, you could add some expressions here, for example, we have year/month/day as separate columns, you could concatenate them to see a date string, and so on. But essentially the point here is that the analytical view has enough data and current data to answer this query. And ArcEngine can then exploit this analytical view.

Now going back to the incremental refresh of analytical views; in this case, the way incremental refresh works is, these analytical views in this example uses metadata about the files underneath the table. So ArcEngine is able to see what files constitute the events table, what their sizes are, when they were last updated, and so on and so forth. It can track that information to say, the data for a particular year, and a particular month, maybe January 2018, and the 9th day of January, has changed. Once ArcEngine detects that, when a “refresh analytical view” command is issued, ArcEngine can correctly say, hey you know what, I don't need the throw away the whole computation in the analytical view. I'm simply going to recompute one particular partition of data that matches January 9th, 2018, and that's how we achieve incremental refresh in these types of analytical views.

There are other types of analytical views that are typically used with Kudu, and they are stored in the Kudu format. They are refreshed using a different mechanism, and we'll look at those analytical views in another segment. Thank you for watching this video. Have fun!