July 18, 2018 - John Thuma | Industry Solutions

Data Has Time Value: Winners Exploit Data Streaming Now! Not Later!

Originally posted on Medium.

Before I dig into Confluent KSQL, Apache Kafka, and Spark Streaming, let’s first take a look at what ‘streaming’ is and why it is so valuable. Data streaming is the continuous generation of lightweight messages, typically kilobytes in size, from potentially many different data sources, such as ecommerce, telematics, trading floors, instrumentation, and much more. Streaming data has many uses, including event management and various types of analytics, and it should be processed sequentially and incrementally on a record-by-record basis. What’s the big deal? We can now exploit the value of data as it happens rather than have to wait and process batches of records over a period of time. Some data has time value just like money has time value. I once worked with a major European stock exchange that claimed that a single stock trade transaction loses 80% of its value 5 seconds after that trade occurs. What is the time value of data in your enterprise?
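
To make ‘record-by-record’ concrete, here is a minimal Python sketch of incremental processing. It assumes a hypothetical feed of trade prices (the `trades` list and the `running_average` function are illustrative names, not part of any product mentioned here) and produces an updated result after every record instead of waiting for the whole batch.

```python
from typing import Iterable, Iterator


def running_average(prices: Iterable[float]) -> Iterator[float]:
    """Yield the average of all prices seen so far, one result per record."""
    count = 0
    total = 0.0
    for price in prices:
        count += 1
        total += price
        # A result is available as soon as each record arrives.
        yield total / count


# Hypothetical trade prices arriving one at a time.
trades = [10.0, 12.0, 14.0]
averages = list(running_average(trades))
print(averages)
```

A batch job would report one number after the last trade; the streaming version has an answer after every trade, which is where the time value comes from.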

“We can now exploit the value of data as it happens rather than have to wait and process batches of records over a period of time.”

Until now, building a streaming business intelligence application took months and required a great deal of technical skill. Arcadia Data has built a solution which enables business people to build visual dashboards on streaming data using Confluent KSQL. We will get into KSQL later, but you can take a look for yourself by downloading the Arcadia Instant for KSQL package. Now, let’s dig into Apache Kafka, Confluent KSQL, and Spark Streaming.

First, let’s take a look at Apache Kafka. Apache Kafka is an open source stream processing platform developed in Scala and Java. It provides a low-latency, high-throughput, unified platform for handling real-time data feeds. At its core is a massively scalable publish/subscribe message queue which acts as a distributed transaction log. Apache Kafka also provides ‘Kafka Connect,’ an import/export system for linking to external systems, and Kafka Streams, a Java library for processing streaming data.
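
The ‘message queue as a log’ idea can be sketched in a few lines of Python. This is a toy, in-memory model and not the Kafka API: the `ToyLog` class, its `publish` and `poll` methods, and the subscriber names are all hypothetical. It only illustrates that messages are appended in order and that each subscriber tracks its own read position (offset).

```python
from collections import defaultdict
from typing import Dict, List


class ToyLog:
    """In-memory stand-in for one append-only log (not the Kafka API)."""

    def __init__(self) -> None:
        self._records: List[str] = []
        # Next offset to read, tracked separately for each subscriber.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, subscriber: str) -> List[str]:
        """Return the messages this subscriber has not read yet."""
        start = self._offsets[subscriber]
        batch = self._records[start:]
        self._offsets[subscriber] = len(self._records)
        return batch


log = ToyLog()
log.publish("order-1")
log.publish("order-2")
first_read = log.poll("dashboard")   # sees both orders
log.publish("order-3")
second_read = log.poll("dashboard")  # sees only the new order
late_reader = log.poll("audit")      # a new subscriber replays the full log
```

Because the log keeps the messages and each reader keeps only an offset, a subscriber that joins late can still replay everything from the beginning.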

How did Apache Kafka get its name? Apache Kafka is a system optimized for writing/capturing data, so the inventors from LinkedIn (Jun Rao, Jay Kreps, and Neha Narkhede) thought that naming it after a writer made sense. A better description of Kafka would be: a system which provides a unified, high-throughput, low-latency platform for handling real-time data feeds.

GREAT QUOTE: Franz Kafka: “By believing passionately in something that still does not exist, we create it. The nonexistent is whatever we have not sufficiently desired.”

This leads us to the next part of our discussion, Spark Streaming. Apache Kafka is a message broker with superb performance, and it can redistribute data to other applications such as Spark Streaming. Spark Streaming is a complementary application to Apache Kafka and is the topic of our next section.

Spark Streaming is an extension of the Apache Spark core API. It provides high-throughput, fault-tolerant processing of live streaming data. Data can be ingested from Apache Kafka, Flume, TCP sockets, Kinesis, and others. Data can be processed and exported to databases, filesystems (HDFS), and dashboards. You can even apply Spark’s graph and machine learning algorithms to live streams. You can write these programs using Scala, Python, or Java. Some developers object to its micro-batch processing model, in which records are collected into small batches before being processed, so it is not truly real-time and does not handle each record individually. However you define real-time, Spark Streaming might be good enough to meet your expectations and business needs. It does require very specific technical knowledge and is bound by the limitations of Apache Spark. Some Apache Spark limits: problems with small files, decompression, and partitioning; back pressure handling (I/O buffer cache requires manual cleanup); and no file management system.
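
Micro-batching is easier to see in code. The sketch below is plain Python, not the Spark Streaming API; the `micro_batches` function, the sample `events`, and the one-second interval are assumptions chosen for illustration. It groups timestamped events into fixed-length batches, the way a micro-batch engine collects records before processing them together.

```python
from typing import Dict, List, Tuple


def micro_batches(
    events: List[Tuple[float, str]], interval: float
) -> Dict[int, List[str]]:
    """Group (timestamp_seconds, payload) events into fixed-length batches."""
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        # Every event in the same interval lands in the same batch.
        batch_id = int(timestamp // interval)
        batches.setdefault(batch_id, []).append(payload)
    return batches


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = micro_batches(events, interval=1.0)
print(batches)
```

In this model event "a" arrives at 0.2 seconds but is not processed until its batch closes at 1.0 seconds. That waiting time is the latency cost of micro-batching, and whether it matters depends on your definition of real-time.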

Finally, let’s discuss Confluent KSQL. KSQL is an open source streaming data processing engine which makes it easy to read, write, and modify data from Apache Kafka streams using a SQL-like language. With KSQL you can easily join and aggregate streams of data. SQL is simple to learn and is arguably the most widely used programming language today. Like Apache Spark Streaming, KSQL consumes data feeds from Apache Kafka as an application. Use cases include streaming extract/transform/load (ETL), anomaly detection, and event monitoring.
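
To show what ‘join and aggregate’ means in practice, here is a small Python sketch of the computation such a query expresses. It is not KSQL syntax and does not call KSQL; the `revenue_by_region` function, the order stream, and the customer lookup table are hypothetical. It joins each order to a customer’s region and keeps a running total per region.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def revenue_by_region(
    orders: Iterable[Tuple[str, float]],
    customer_region: Dict[str, str],
) -> Dict[str, float]:
    """Join each (customer_id, amount) order to its region and sum amounts."""
    totals: Dict[str, float] = defaultdict(float)
    for customer_id, amount in orders:
        region = customer_region.get(customer_id)
        if region is None:
            continue  # inner-join behaviour: skip orders with no matching customer
        totals[region] += amount
    return dict(totals)


# Hypothetical lookup table and order stream.
customer_region = {"c1": "EU", "c2": "US"}
orders = [("c1", 20.0), ("c2", 5.0), ("c1", 10.0), ("c9", 99.0)]
totals = revenue_by_region(orders, customer_region)
print(totals)
```

The appeal of a SQL-like language is that this loop collapses into a single declarative statement, and the engine keeps the totals up to date as new orders arrive.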

What is Confluent: Confluent is a company founded by the team that built Apache Kafka. They offer a variety of tools that can help your organization build highly robust and scalable streaming applications.

In conclusion, I will discuss how Arcadia Data can kickstart your journey into streaming data apps. Arcadia Data is the first native business intelligence and analytics solution for Apache Kafka and KSQL. What does that mean? It means having a visualization platform that’s tightly integrated with KSQL and requires no ETL. E-T-L are the most expensive three LETTERS in BI! With Arcadia Instant for Apache Kafka/KSQL, you eliminate a backend ETL process that you used only because you didn’t previously have a Kafka-ready analytics platform. Arcadia Data is that platform. Want to see it in action? Take a look at this day-in-the-life video on streaming analytics.

For more details on how Arcadia Data can jumpstart you into real-time Apache Kafka analytics take a look at the following:

