Chapter 6: Making Big Data Actionable

When Machine Learning Meets Data Preparation — Introducing Smart Acceleration

Machine learning can now substitute for much of the data preparation work that once required human intervention. By analyzing data usage patterns over time, the analytics platform can “understand” the underlying data and accelerate queries for improved performance and higher concurrency. Simply put, the engine identifies and prepares data based upon demonstrated user preferences.

Arcadia Data Smart Acceleration™ is an example of this. Smart Acceleration is a framework that includes a recommendation engine for derived views (called “Analytical Views”) of raw data based upon dynamic data usage analysis within the data lake, whether it is a Hadoop cluster or a cloud platform. Arcadia Enterprise then transparently reroutes data queries to the Analytical Views, providing automated acceleration when needed for production and high concurrency uses. Queries are routed to these views in a cost-based manner. Views are stored in HDFS, Amazon S3, or other distributed data platforms and cached when in use for optimal performance.
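The recommend-then-route idea described above can be illustrated with a small sketch. This is not Arcadia's implementation: the `Source` type, the `recommend_views` and `route` functions, and the use of row count as a stand-in for query cost are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A queryable dataset: the raw table or a derived view (hypothetical type)."""
    name: str
    columns: frozenset
    row_count: int  # assumption: row count is used as a crude cost estimate


def recommend_views(query_log, min_hits=2):
    """Return column sets that were queried at least `min_hits` times.

    `query_log` is an iterable of column collections, one per past query.
    Frequently requested column sets are candidates for derived views.
    """
    counts = Counter(frozenset(cols) for cols in query_log)
    return [cols for cols, hits in counts.most_common() if hits >= min_hits]


def route(query_columns, base, views):
    """Pick the cheapest source that contains every column the query needs.

    Assumes `base` (the raw data) can always answer the query, so it is
    always included as a fallback candidate.
    """
    needed = frozenset(query_columns)
    candidates = [view for view in views if needed <= view.columns]
    candidates.append(base)
    return min(candidates, key=lambda source: source.row_count)


# Example data (all names and sizes are made up).
base = Source(
    "raw_events",
    frozenset({"region", "product", "day", "user_id", "amount"}),
    1_000_000_000,
)
views = [
    Source("av_region_day", frozenset({"region", "day", "amount"}), 50_000),
    Source(
        "av_region_product_day",
        frozenset({"region", "product", "day", "amount"}),
        2_000_000,
    ),
]

log = [("region", "day"), ("day", "region"), ("product",)]
recommended = recommend_views(log)
chosen_small = route({"region", "amount"}, base, views)
chosen_raw = route({"user_id", "amount"}, base, views)
```

In this sketch a query on `region` and `amount` is served by the smallest covering view, while a query that needs `user_id` falls back to the raw data because no view contains that column. A production engine would weigh far more than row counts, such as freshness, partitioning, and caching.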

In-cluster or in-cloud processing enables the analytics engine to scale linearly with the data for greater speed and easier management. Data is automatically modeled and maintained within the Hadoop cluster or cloud environment using simple logical data models that aren’t tied to specific data cube structures. Users work with a dashboard or application that presents consolidated views of data, which they then point and click to drill through or across to the raw data source on the data platform. Intuitive visualization enables instant micro-segmentation, network graph analysis, event and time-series analytics, and dimension/measure correlations.

Multiple Data Sources

Arcadia Enterprise is designed to work directly with a wide variety of relational, real-time, and NoSQL data sources including HDFS, Amazon S3, Apache Spark, Apache Kudu, Solr, MapR-FS, and more. These can be used in any combination, enabling structured and unstructured sources to be combined in a single view. For example, a single visual can combine data from relational, NoSQL, and Hadoop sources.

Users can also create views that combine real-time/streaming and historical data. This addresses an important shortcoming of legacy BI systems: they work only on historical data. Folding in real-time data streams opens up whole new applications of BI. For example, equipment managers can overlay streaming sensor data on historical lifecycle data to look for signs of imminent equipment failure, and marketers can monitor real-time click data on a new advertising campaign to see how its performance compares to previous efforts and make adjustments on the fly.
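As a rough illustration of the sensor example, the sketch below compares incoming readings against a baseline computed from historical data. The function names, the sample readings, and the three-standard-deviation threshold are assumptions chosen for illustration, not part of any product API.

```python
from statistics import mean, stdev


def baseline(historical):
    """Summarize historical readings as (mean, sample standard deviation)."""
    return mean(historical), stdev(historical)


def flag_anomalies(stream, historical, k=3.0):
    """Yield (timestamp, value) pairs that deviate from the historical mean
    by more than `k` standard deviations.

    `stream` is any iterable of (timestamp, value) pairs, so it can be a
    list in a test or a generator fed by a live source.
    """
    mu, sigma = baseline(historical)
    for timestamp, value in stream:
        if abs(value - mu) > k * sigma:
            yield timestamp, value


# Made-up temperature readings for one piece of equipment.
historical_temps = [70, 71, 69, 70, 72, 68, 70]
live_readings = [(1, 70.5), (2, 75.0), (3, 69.0)]

alerts = list(flag_anomalies(live_readings, historical_temps))
```

Here only the second reading falls outside the band derived from the historical data. A real deployment would likely use a rolling baseline and domain-specific failure signatures rather than a single fixed threshold.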

By deploying directly on the Hadoop cluster or cloud environment, Arcadia Enterprise takes advantage of distributed scale-out architectures to accommodate hundreds or even thousands of users with virtually no degradation in performance. Instead of needing to purchase expensive new BI servers to handle increased demand, IT organizations can simply add low-cost servers to the cluster.

Advanced Visualization Goes Mainstream

Even though nearly two-thirds of people are visual learners, most IT reports are still composed of rows and columns. It’s difficult for the average end user to read rivers of numbers, much less derive patterns from them, which is why visualization is a must-have feature for any new BI platform.

Fast processors, increasingly high-resolution displays, and powerful analytics engines are enabling a wide variety of new visualization options. For example:

  • Network graphs (“network maps”) are used to identify relationships between related items and clusters, for example when visualizing a social network or displaying a market basket analysis.
  • Correlation heat maps provide a graphical representation of data in which the individual values contained in a matrix are represented by different colors.
  • Path visualizations are collections of funnel visualizations that display information across a sequence of timestamped events, such as conversion data for website visitors or airline delays by time of day.
Advanced Visualizations Reveal Insights That Raw Data and Basic Visualizations Cannot
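To make the correlation heat map item concrete, the following sketch computes the matrix of pairwise Pearson correlations that such a visualization would color. The column names, sample values, and the color thresholds in `color_for` are arbitrary assumptions; a charting library would normally map values onto a continuous color scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a zero variance would divide by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return (names, matrix), where matrix[i][j] correlates column i with column j."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix


def color_for(r):
    """Bucket a correlation into a color name (thresholds are illustrative)."""
    if r >= 0.5:
        return "red"
    if r <= -0.5:
        return "blue"
    return "white"


# Made-up measures for four observations.
data = {
    "ad_spend": [1, 2, 3, 4],
    "clicks": [2, 4, 6, 8],
    "bounce_rate": [4, 3, 2, 1],
}
names, matrix = correlation_matrix(data)
colors = [[color_for(r) for r in row] for row in matrix]
```

Each cell of `colors` corresponds to one cell of the heat map: strongly positive pairs come out red, strongly negative pairs blue, and weak relationships white.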

Visualizations can also be overlaid on maps, calendars, workflow diagrams, and still images like screen captures. A series of visualizations can be displayed over time like a movie to illustrate time-series analyses. The arrival of affordable virtual and augmented reality hardware is likely to expand these options as well.

Comparing these powerful new ways to visualize data to the traditional bar and pie charts provided by spreadsheets and legacy BI tools is like comparing a garden hose to a watering can. The tools that work most closely with the underlying data give users the latitude to quickly explore new visualizations and fluidly combine a variety of data sets and types. That capability belongs on any organization’s requirements list when selecting an analytics platform.
