May 15, 2018 - Richard Tomlinson | Big Data Ecosystem

Complex Data Types: The World Is Not Flat

Unless you’re a fully signed-up member of the Flat Earth Society, you probably don’t think the world is flat. Take a walk, a boat, a few planes, and whatever other modes of transportation you prefer, and as long as you keep moving in a straight line, eventually you’ll end up where you started. Along the way, you will probably walk up and down a few hills, sidestep some potholes, climb a mountain or two, swim across a lake (assuming you can swim) and generally do whatever it takes to keep moving forward.

One thing you will certainly understand by the time you reach your destination (which is also your origin) is that the world is absolutely not flat – at all.

Most BI Tools Are Flat Earthers

So why are we talking about the Flat Earth Society in a blog post about data? Because the vast majority of business intelligence tools look at data as if the world were flat. To operate, they need data organized into monolithic row- and column-oriented table structures so they can run efficient and predictable SQL workloads.

Data Is Complex

In reality, of course, data is not flat or monolithic. It is complex and often irregular. Trying to describe things in a stable and uniform schema simply doesn’t work well in an unpredictable world. Take an electronics retailer’s product data, for example. If we look at just two products, a video game console and a TV, and just a few attributes for each product record, we see how quickly the data diverges:

Video Game Console – Category: Gaming, Model: PS4 Pro, Brand: Sony, Capacity: 1 TB, Memory: GDDR5 8GB, DVD Drive: 8 CAV

TV – Category: Televisions, Model: XBR-55X850D, Brand: Sony, Screen Size: 55”, Display Type: LCD, Resolution: 3840×2160

Video game consoles and TVs do share some attributes that are common or uniform (i.e., category, model, and brand) but quickly diverge into a unique set of attributes that describe only that specific product type (capacity, memory, and DVD drive for gaming consoles vs. screen size, display type, and resolution for TVs).

Traditional Modeling Approaches – Flatten or Break Apart

Given that electronic retailers want to perform critical business analytics using these sorts of attributes with their existing BI tools, they typically attempt to model data like this in flattened out schemas. The first approach is to collapse all the attributes into a single mega-flat table:

[Figure: mega-flat table]
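
As a rough sketch (the DDL below is illustrative, not taken from an actual catalog; Hive-style STRING columns are assumed throughout), the mega-flat approach looks something like this:

CREATE TABLE products_flat
(
prd_id STRING,
model STRING,
brand STRING,
category STRING,
-- gaming console attributes (null for every other product type)
capacity STRING,
memory STRING,
dvd_drive STRING,
-- television attributes (null for every other product type)
screen_size STRING,
display_type STRING,
resolution STRING
-- ...and hundreds more columns, one per attribute across all product types
)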

Obviously, this approach quickly becomes unmanageable: we end up with a table containing hundreds of columns of incredibly sparse data, since any specific attribute is populated only when it is relevant to that particular product type and is null otherwise. Although the schema and semantic layer are simpler for the BI tool, the ETL overhead of maintaining such a table over time, and the cost of querying it without scanning millions of irrelevant rows, make it unwieldy and impractical.

An alternative ‘less flat but still flat’ modeling technique is to break the data apart: create a master product table containing the attributes that are common across all product types, then create an individual sub-table for each product type that contains only the attributes unique to that type:

[Figure: master product table with individual sub-tables]
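
Sketched in the same illustrative DDL as before (table and column names are assumptions for illustration), the master/sub-table approach looks like:

CREATE TABLE products_master
(
prd_id STRING,
model STRING,
brand STRING,
category STRING
)

CREATE TABLE product_gaming_consoles
(
prd_id STRING, -- joins back to products_master
capacity STRING,
memory STRING,
dvd_drive STRING
)

CREATE TABLE product_televisions
(
prd_id STRING, -- joins back to products_master
screen_size STRING,
display_type STRING,
resolution STRING
)

-- ...plus one sub-table per additional product type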

Again, this quickly becomes an unmanageable schema, since the model will require hundreds of sub-tables just to represent a single product catalog. Not only is this a nightmare to manage from a data modeling perspective, querying the data is also tough for BI tools, because it requires knowing exactly which sub-table to switch to for a particular set of attributes. This leads to very complex semantic models and lowers overall usability for basic and non-power users. Performance also suffers because SQL queries require more joins.

A More Flexible Modeling Approach

The above example helps us understand why traditional relational schemas and basic SQL are poorly suited to complex data structures. The underlying problem is that the schema and the data are separated: in the relational world we define a table schema first and load our raw data into it later, so if the next record has a different structure (or the data changes structure over time), the initial schema no longer fits.

The alternative approach is to keep the schema and the data together for every single data element rather than separating them. Data formats such as JSON and XML do this well. In the example above, we can model our product data as flexible JSON documents as follows:

[
  {
    "prd_id": "123",
    "model": "PS4 Pro",
    "brand": "Sony",
    "category": "Gaming",
    "capacity": "1 TB",
    "memory": "GDDR5 8GB",
    "dvd drive": "8 CAV"
  },
  {
    "prd_id": "456",
    "model": "XBR-55X850D",
    "brand": "Sony",
    "category": "Televisions",
    "screen size": "55",
    "display type": "LCD",
    "resolution": "3840×2160"
  }
]

In the above JSON example, any particular record can be completely unique in structure. This is a very flexible approach: no rigid schema is needed in advance of the data, so when the data is not known up front or changes unpredictably over time, our model accepts the changes very naturally.

Complex Data Types

To address such modeling challenges in the world of relational databases, complex data types such as ARRAY, MAP, and STRUCT were introduced in the SQL:1999 and SQL:2003 standards, but until recently they hadn’t seen widespread use. This has started to change because, in the world of Apache Hadoop, file formats such as Parquet, ORC, and Avro have made it very efficient to store data in complex and nested structures. Coupled with the fact that more and more of the data generated today is in JSON format, which naturally leads to the use of complex types, the merits of SQL support for complex types are well worth revisiting.

Using the MAP complex data type, the above ‘flexible’ schema can be represented as follows:

CREATE TABLE products
(
prd_id STRING,
model STRING,
brand STRING,
category STRING,
-- dynamic key/value pairs for the attributes that vary by product type
attributes MAP<STRING,STRING>
)

In the above example, the attributes column is a dynamic set of key-value pairs. For gaming consoles, a typical row would look like:

("123", "PS4 Pro", "Sony", "Gaming", {"capacity": "1 TB", "memory": "GDDR5 8GB", "dvd drive": "8 CAV"})

and for TVs:

("456", "XBR-55X850D", "Sony", "Televisions", {"screen size": "55", "display type": "LCD", "resolution": "3840×2160"})
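
To show how such a table is queried, here is a sketch in Hive-style SQL, where individual attributes are pulled out of the map by key (map access syntax varies by engine; Impala, for example, exposes maps as joinable collections instead):

SELECT model,
attributes['screen size'] AS screen_size,
attributes['resolution'] AS resolution
FROM products
WHERE category = 'Televisions'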

The MAP complex type allows us to model unpredictably wide records, which is incredibly powerful, especially when dealing with JSON-style data. Similarly, nested data can also be represented by combining the ARRAY and STRUCT complex types, as seen in the following example:

[Figure: nested customers, orders, and order items schema]
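
A hedged sketch of what such a nested schema might look like in Hive-style DDL (field names beyond c_orders and o_lineitems are illustrative, loosely following TPC-H naming):

CREATE TABLE customers
(
c_custkey BIGINT,
c_name STRING,
-- each customer row carries its own array of orders...
c_orders ARRAY<STRUCT<
o_orderkey: BIGINT,
o_orderdate: STRING,
-- ...and each order carries its own array of line items
o_lineitems: ARRAY<STRUCT<
l_partkey: BIGINT,
l_quantity: DOUBLE,
l_extendedprice: DOUBLE
>>
>>
)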

 

In the above example, we have a single table that contains customer, order, and order item records. The c_orders and o_lineitems fields are defined as ARRAY<STRUCT<...>>, which allows complex one-to-many relationships to be modeled within the same physical table structure. This removes the need to break the data apart into separate tables and join them back together at query execution time, eliminating expensive join operations and greatly improving query latency.
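
As a sketch of how such nested data can be queried without explicit joins, assuming the illustrative schema above, Impala-style SQL lets nested collections be referenced directly in the FROM clause (other engines differ):

SELECT c.c_name, o.o_orderkey, l.l_quantity
FROM customers c, c.c_orders o, o.o_lineitems l

The one-to-many relationships are resolved within each row’s own nested data, so no join across separate physical tables is needed.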

Most BI Tools and Complex Types

It’s pretty easy to see how powerful complex data types are for all kinds of sophisticated and highly flexible data modeling scenarios, so why do BI tools generally struggle to unlock these powerful capabilities?

There are two main reasons for this:

  1. Most BI tools don’t natively recognize complex type fields within their source connectors. Most of the time, if a column is defined as a MAP, ARRAY, or STRUCT, they simply throw an error or prevent the end user from accessing the column.
  2. BI tools generally don’t know how to generate the right SQL syntax to access the elements within complex types. More often than not, they require a DBA to create some kind of database view on top of the table to ‘unpack’ the complex types ahead of time, as in the sketch below. Since the BI tool always queries the underlying table via this view, all of the efficiency and performance benefits of native complex type SQL are lost.
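
For example, a DBA might create a view like the following (a hypothetical sketch in Hive-style SQL) to pre-flatten the map for a BI tool; every query through the view then hits the full underlying table, forfeiting the benefits of querying the complex type natively:

CREATE VIEW products_unpacked AS
SELECT prd_id,
model,
brand,
category,
attributes['capacity'] AS capacity,
attributes['memory'] AS memory,
attributes['screen size'] AS screen_size,
attributes['display type'] AS display_type
FROM products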

Arcadia Enterprise 4.2 and Complex Data Types

It is for all of the reasons described above that we are excited that Arcadia Enterprise offers first-class support for the ARRAY, MAP, and STRUCT complex data types. Not only do our source connectors recognize complex types and allow the creation of Arcadia datasets on these types with no upfront data gymnastics required, but we also expose these complex and nested schemas within our UI via an intuitive and user-friendly semantic model. This enables business users to operate on complex data elements in exactly the same way they would any other basic data type. To the user, the experience is completely transparent; behind the scenes, Arcadia takes care of generating the right native SQL to ensure complex data is queried in the most efficient way possible.


Complex data types are incredibly powerful for operating on big data, and Arcadia Enterprise fully harnesses this power for our customers. In summary, complex types provide:

– More natural expressiveness, since the data model is preserved in the schema
– Better performance compared to self-joins or fact/dimension joins
– Simpler ETL cycles, without the need to normalize data

To explore the powerful visualization capabilities available in Arcadia Data, download Arcadia Instant today!

