Learn about the latest big data analytics and BI trends in Apache Hadoop, the cloud, data lakes, IoT analytics, data visualization, and more as you browse through these insightful posts.

July 26, 2018 - Paul Lashmet

Consolidated Audit Trail: Outside Looking In

The primary purpose of the Consolidated Audit Trail (CAT), a rule under the Securities Exchange Act, is to arm regulators with the data they need to effectively conduct market surveillance and investigations into suspicious trading activities across all national exchanges. The difference between this and current trade reporting regimes is that it covers more…

July 26, 2018 - John Thuma

FNU: For Non-Unicorns: What is Apache Spark?

This blog was first published on Medium. WARNING: This is not for the high-tech unicorns, you mythical beasts who sparkle SQL and Java and make code bloom wherever you go. This is for the regular person who wants to understand Apache Spark at a pedestrian level. There are many resources online that help you take a…

July 24, 2018 - John Thuma

Three Ways Apache Kudu Supports BI on Apache Hadoop

This article was first published on Medium. Apache Kudu is a columnar storage system developed for the Apache Hadoop ecosystem. Kudu runs on commodity hardware, is horizontally scalable, and supports highly available operation. Apache Kudu has a tight integration with Apache Impala, providing an alternative to using HDFS with Apache Parquet. Before Kudu, existing formats…

July 20, 2018 - Shant Hovsepian

Five Things Soccer Analytics Teaches Us About Data Lakes

This blog was first published on Forbes. With the World Cup upon us, it’s an apt time to draw inspiration from soccer. In 1950, Charles Reep, an accountant, attended every game of the Swindon Town soccer team’s season, tracking events and recording statistics. He analyzed his data and concluded that long passes were the most effective way…

July 19, 2018 - John Thuma

The Data Science Iron Triangle – Modern BI and Machine Learning

Originally posted here. The New Iron Triangle: It is cliché to discuss IT/business solutions as people, process, and technology. Some call it the “golden triangle,” but in this blog, we refer to it as the iron triangle. Since the 1960s, technology has disrupted business through the advent of computing and information management. These systems replaced…

July 18, 2018 - John Thuma

Data Has Time Value: Winners Exploit Data Streaming Now! Not Later!

Originally posted on Medium. Before I dig into Confluent KSQL, Apache Kafka, and Spark Streaming, let’s first take a look at what ‘streaming’ is and why it is so valuable. Data streaming is the continuous generation of lightweight messages, typically kilobytes in size, from potentially many different data sources. It can be from a variety of…

July 17, 2018 - Steve Wooledge

Three Surprises about Data Lakes, Hadoop, and the Cloud

If you’ve been paying attention to trends around Apache Hadoop, data lakes, big data analytics, and the cloud, you’ve probably noticed the see-saw hype around each of these. In 2012, there was no end in sight to what Hadoop could do, and organizations were beginning to build data lakes to augment or replace data warehouses…

July 10, 2018 - Dale Kim

What’s the Difference between Hadoop and a Data Lake

I recently participated in a webinar hosted by DBTA titled “Unlocking the Power of the Data Lake,” where one of the audience members asked, “Will data lakes be replacing Hadoop in the future?” I think the three speakers sufficiently answered the question on the webinar, but considering that many others might have similar questions, I…

July 3, 2018 - Paul Lashmet

Hypothetical to Actionable: From CCAR to CRE Market Factors

Introduction: DFAST and CCAR require bank holding companies to report, in detail, how they would respond to hypothetical market scenarios that represent macroeconomic shocks like a housing meltdown or a stock market crash. The data used by each company to predict losses and create a response plan must be actual data, not approximated. Using the…

June 19, 2018 - John Thuma

Five Classes of Use Cases for the Data Lake

A data lake was initially described as a storage system that held a very large amount of data in its original format until required. Originally, the term data lake was synonymous with Apache Hadoop. Apache Hadoop enabled organizations to both store and compute data on commodity hardware. Apache Hadoop was a place where you had…

June 14, 2018 - John Thuma

Superheroes? Or Just the Best Women in Tech!

Superheroes are supernatural characters, many of whom have superhuman powers like flight, x-ray vision, or indestructibility. Some are mortals with loads of resources that enable them to create armored suits, amazing vehicles, and powered gadgets that give them superhuman capabilities. They are usually the protagonists in the story, and their goal is to protect the…