Hi, I'm Anjali Norwood. I'm a founding engineer here at Arcadia Data and I lead the ArcEngine team. Today I want to talk about join queries in BI environments. Join queries are notoriously expensive to compute, and in BI, star schemas are the norm. Star schema queries are hard to optimize, and they tend to run for a long, long time. So the purpose of this segment is to look at some of the optimizations ArcEngine uses to make these queries go fast without consuming so many resources.
To begin, let's see what a star schema is. A star schema has a fact table at its center. Let's work with an example and say it's sales. This table contains data for each sale that happened, say in a store like Safeway. The fact table joins with multiple dimension tables, which describe more about what is in the fact table. In this example, the product dimension describes the products that were sold -- two percent milk, one percent milk, whole milk, and so on. Similarly, the promotion dimension might describe whether a promotion was in place when the sale happened, and the customer dimension which customer bought it -- their email address, phone number, and so forth. The same goes for store and time.
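As a concrete sketch, here is what a tiny version of such a schema might look like in SQLite via Python. The table layouts and column names (prod_key, prod_name, store_key, amount) are assumptions for illustration, not ArcEngine's or any real retailer's actual schema:

```python
import sqlite3

# Build a miniature star schema in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per product; prod_key is the primary key
# that determines the descriptive fields.
cur.execute("""
    CREATE TABLE product (
        prod_key  INTEGER PRIMARY KEY,
        prod_name TEXT,
        prod_desc TEXT
    )""")

# Fact table: one row per point-of-sale transaction.  prod_key here is
# a foreign key into product, so the relationship is N to 1.
cur.execute("""
    CREATE TABLE sales (
        prod_key  INTEGER REFERENCES product(prod_key),
        store_key INTEGER,
        amount    REAL
    )""")

cur.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, "two percent milk", "2% milkfat"),
    (2, "whole milk", "3.25% milkfat"),
])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 10, 3.49), (1, 11, 3.49), (2, 10, 3.99), (1, 10, 3.49),
])
conn.commit()
```

Note how several sales rows share prod_key 1 while the product table has exactly one row for it -- that is the N-to-1 shape discussed next.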
An important characteristic of star schemas is that there is an N-to-1 relationship between the data in the fact table and the dimension table. What I mean by that is, take the example of the product dimension. In the product dimension, the prod_key, or product key, is the key that determines the rest of the fields, for example product name and product description. Product key one might refer to two percent milk, and no other row will have a product key equal to one. This product key joins with a product key column in the sales table, where it's definitely not a primary key, because multiple people will buy the same product, multiple stores will sell those products, and so on. That establishes a foreign key/primary key relationship, also referred to as a referential integrity constraint. A typical query on the star schema would look like this: the fact table joins with one or more dimension tables, there is some aggregation over the measures of the fact table, and the query likely selects certain details from the dimensions. These joins are essentially lookup joins because of the N-to-1 constraint: whatever rows you select from the sales table, only new columns get picked up from the dimensions. There is no row reduction or cardinality reduction because of the join, since the join happens on the primary key/foreign key pair. That's what a typical star query looks like.
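A minimal example of such a star query, again on hypothetical tables: the measure (amount) comes from the fact table, the descriptive column (prod_name) from the dimension, and the join is on the primary/foreign key pair.

```python
import sqlite3

# Miniature star schema (hypothetical column names, for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (prod_key INTEGER PRIMARY KEY, prod_name TEXT)")
cur.execute("CREATE TABLE sales (prod_key INTEGER, store_key INTEGER, amount REAL)")
cur.executemany("INSERT INTO product VALUES (?, ?)",
                [(1, "two percent milk"), (2, "whole milk")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 10, 3.49), (1, 11, 3.49), (2, 10, 3.99)])

# Typical star query: join fact to dimension on the PK/FK pair,
# aggregate a fact-table measure, select a dimension attribute.
rows = cur.execute("""
    SELECT p.prod_name, SUM(s.amount)
    FROM sales s
    JOIN product p ON s.prod_key = p.prod_key
    GROUP BY p.prod_name
""").fetchall()
print(rows)  # one total per product name
```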
Now, without any optimizations, one way to execute this query would be to join the fact table with the dimension table first, then perform the aggregation to arrive at the query results. Remember that the sales table is a very wide table with a lot of columns, and it's a tall table, because it carries a row for every point-of-sale transaction that has happened. By joining the product dimension with it first, we are picking up more columns, making this wide table even wider, which means the result of the join is much bigger than either the sales table or the product table. And there is no cardinality reduction. This huge intermediate result then has to be aggregated, so no wonder a query executed in this fashion takes a long time, consumes a lot of resources, likely overflows memory, and so on.
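To make the "no cardinality reduction" point concrete, here is a quick check on hypothetical tables: the lookup join leaves the fact table's row count unchanged; it only makes each row wider.

```python
import sqlite3

# Miniature star schema (hypothetical column names, for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (prod_key INTEGER PRIMARY KEY, prod_name TEXT)")
cur.execute("CREATE TABLE sales (prod_key INTEGER, amount REAL)")
cur.executemany("INSERT INTO product VALUES (?, ?)",
                [(1, "two percent milk"), (2, "whole milk")])
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 3.49), (1, 3.49), (2, 3.99)])

n_fact = cur.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# The PK/FK lookup join neither adds nor drops rows: each fact row
# matches exactly one dimension row, so the cardinality is preserved.
n_join = cur.execute("""
    SELECT COUNT(*)
    FROM sales s JOIN product p ON s.prod_key = p.prod_key
""").fetchone()[0]

assert n_join == n_fact  # same row count, wider rows
```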
In ArcEngine, there are a few optimizations that help execute this query faster. The first makes use of the referential integrity constraint. Since we know the join is neither reducing nor expanding -- the row count from the sales table stays the same -- we can perform a rewrite called eager aggregation. It says: take the aggregation and push it down through the join, onto the fact table. The same diagram with the optimization applied might look like this. Now look what has happened: we are performing an aggregation on the fact table first. This is a great thing, because an aggregation typically reduces the amount of data by collapsing rows and computing partial sums. That in itself is a win, and note that this aggregated result is then joined with the product dimension and a final aggregation is applied. Beyond the win from eager aggregation itself, the rewrite allows another nice thing to happen: now we can use an analytical view to compute the aggregated sales. You don't need to go to the huge fact table at all. You could have created an analytical view that pre-computes some of these aggregations, and essentially this box can be replaced by the analytical view. What's even better, it's not the user doing this optimization -- ArcEngine transparently looks for suitable analytical views and replaces the aggregation over sales with one.
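Here is a sketch of the eager aggregation rewrite, expressed by hand in SQL on hypothetical tables (ArcEngine applies this transparently; this is only an illustration of the idea). The inner subquery is the "aggregation on the fact table" box -- the piece an analytical view could pre-compute -- and the rewritten query returns the same answer as the naive join-then-aggregate plan:

```python
import sqlite3

# Miniature star schema (hypothetical column names, for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (prod_key INTEGER PRIMARY KEY, prod_name TEXT)")
cur.execute("CREATE TABLE sales (prod_key INTEGER, amount REAL)")
cur.executemany("INSERT INTO product VALUES (?, ?)",
                [(1, "two percent milk"), (2, "whole milk")])
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 3.49), (1, 3.49), (2, 3.99)])

# Naive plan: join first, then aggregate the wider join result.
naive = cur.execute("""
    SELECT p.prod_name, SUM(s.amount)
    FROM sales s JOIN product p ON s.prod_key = p.prod_key
    GROUP BY p.prod_name
    ORDER BY p.prod_name
""").fetchall()

# Eager aggregation: push the aggregation below the join.  The fact
# table collapses to one row per prod_key before joining; a final
# aggregation on top preserves the query's semantics.
eager = cur.execute("""
    SELECT p.prod_name, SUM(agg.partial_sum)
    FROM (SELECT prod_key, SUM(amount) AS partial_sum
          FROM sales GROUP BY prod_key) agg
    JOIN product p ON agg.prod_key = p.prod_key
    GROUP BY p.prod_name
    ORDER BY p.prod_name
""").fetchall()

assert naive == eager  # the rewrite does not change the result
```

The join input shrinks from one row per sale to one row per product, which is exactly why the rewrite saves work on a tall fact table.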
Now, another strategy would be to create an analytical view on the join itself. The join of sales and product could live inside an analytical view, and that would be okay too, but one has to think about the trade-offs. Analytical views -- and materialized views or cubes in general -- are expensive to compute and expensive to maintain when they involve a join, so there is a cost associated with maintaining these join data structures. In our case, by contrast, the analytical view is on a single fact table, which can always be refreshed incrementally, so the refresh is a fast delta refresh. Plus we get the added benefit of using it together with eager aggregation, so we get that goodness as well. This is where I would like to conclude the segment on joins. Thank you for watching, have fun!