The Power and Flexibility of Derived Data, Part 1

Published on December 8, 2016

Derived Data, a powerful report-based feature in Arcadia Enterprise, enables users to conduct complex analyses within a single visual. In this post we will explore a use case for applying Derived Data that will allow a user to perform a multi-pass cohort analysis. Let’s start with data about flight delays: you’d like to categorize airports by average departure delays and then understand for those categories, which airlines fly the most delayed flights.

The initial chart, which shows average departure delay by airport, is difficult to read due to the high number of airports.

Initial Chart

We’d like to see fewer values on the X Axis, so we will use a CASE conditional statement to classify the average delays into three cohorts (Long, Medium, Short).

Classify X-Axis

If you remove the airport column from the dimension shelf and chart only the cohort expression and the average departure delay…

Remove Airport Column

…the SQL no longer groups by airport; the average delay is calculated instead for the entire result set.

Create delay category per airport

If we want to use the cohort expression as-is and calculate average delays for all airports, two SQL passes are required: the first pass assigns a Delay Category to an airport based on the average delay time; the second pass joins back to the original result set and computes the average departure delay by Delay Category. Writing this SQL manually could potentially become unwieldy, but thankfully the Derived Data feature makes the process easy.

 

In order to use the cohort as intended (categorizing average departure delays by airport) you must save the expression as Derived Data…

Save as derived data

… and give the column expression a meaningful name.

Give the column a name

Arcadia then adds this new expression as a discrete column in a special section.

New expression as a discrete column

For reference purposes you may view how this column is defined below.

View column definition

You can now drag this Derived Data column into the dimension shelf and use it as you would any other dimension column.

Add derived data into dimension shelf

Behind the scenes Arcadia generates a select statement that utilizes the Derived Data column expression as a sub-query. In this instance the Derived Data is aliased as T1.

Derived Data aliased as T1

The sub-query allows users to analyze data at a level of detail different from what exists in the primary report. We’ll now demonstrate that capability by analyzing which airlines contribute to the different airport delay categories using the same visual we started with above.

We add the Airline column to the dimension shelf and remove the Average Flight Delay measure, focusing only on number of flights per airline and Airport Delay Category.

Adding the airline column

This analysis holds some potential, but in its current tabular form isn’t very interesting. Instead we choose the Correlation Flow visual type, which Arcadia offers out of the box.

Choose correlation flow visual

We are now presented with a more insightful visual analysis which quickly tells us which airlines contribute the most delayed flights to each Delay Category.

Correlation flow view

The Correlation Flow not only gives an appealing representation of the data, it also provides useful metrics not calculated as part of the original query. For example, you can now see that Long Airport delays comprise 16% of total delays.

Drill-down to single airline stats

By clicking on one of the flows we obtain additional metrics about a particular airline in the result set. For example, we now know that 89% of Delta’s delayed flights have a Medium departure delay.

Drill-down view

Summary

In this post we covered how to create a multi-pass cohort analysis in a single visual. It’s just one example of how you can use Derived Data to enhance an analysis and extract deeper insights from your data. Other potential use cases include complex expressions (e.g., weighted averages) and set-based analysis (e.g., for top 10 TV channels opened the most times, show the number of minutes watched per channel in the last 10 days). In the not-too-distant future we’ll explore more facets of Derived Data, including parameterizing a Derived Data expression and referencing columns from a Derived Data set in other parts of a visual.