Chapter 3:
Rise of the Citizen Data Scientist

Hadoop as the Platform Game-Changer

Hadoop has radically altered these platform economics by leveraging inexpensive commodity hardware. That is, storage no longer needs to be centralized—it can be allocated to the low-cost nodes that also handle processing.

A key element of Hadoop is its distributed processing manager. By allowing each node in a Hadoop cluster to process its own stored data, Hadoop minimizes data movement and the latency that comes with it. This setup is known as “data locality”: the processing work is done where the data resides, rather than moving the data to designated processing nodes. More importantly, distributing work across many nodes for parallel processing leads to significant throughput gains. The end result is performance that approaches that of traditional data warehouses, but at a fraction of the cost.
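The idea of data locality can be illustrated with a minimal Python sketch. This is a toy simulation, not Hadoop code: the `cluster` dictionary, the node names, and the `process_locally` and `run_job` functions are all hypothetical, and threads stand in for separate machines. Each "node" summarizes only the records it stores, and only the small summaries are combined centrally.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node name maps to the records stored on that node.
cluster = {
    "node-1": ["error", "ok", "ok"],
    "node-2": ["ok", "error", "error", "ok"],
    "node-3": ["ok"],
}


def process_locally(records):
    """Simulates work done on the node that holds the records.

    Returns a small summary (counts per value) instead of the raw records.
    """
    return Counter(records)


def run_job(cluster):
    """Summarize each node's data in parallel, then merge the summaries."""
    # Threads stand in for nodes working at the same time.
    with ThreadPoolExecutor(max_workers=len(cluster)) as pool:
        partials = list(pool.map(process_locally, cluster.values()))

    # Only the per-node summaries are brought together, not the raw data.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total, partials


total, partials = run_job(cluster)
records_stored = sum(len(records) for records in cluster.values())
values_moved = sum(len(partial) for partial in partials)

print("combined counts:", dict(total))
print("raw records left in place:", records_stored)
print("summary values moved:", values_moved)
```

In this sketch the raw records never leave their node; the combining step sees only one count per distinct value per node. With realistic data volumes the summaries would be far smaller than the data they describe, which is where the savings in data movement come from.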

Finally, Hadoop addresses the relative unreliability of distributed clusters of commodity hardware through replication: by default it stores three copies of all data, with the copies distributed across different nodes. Should one node fail, its data is still available on two other nodes. Despite this replication, the use of commodity hardware still allows lower overall costs than traditional configurations that rely on high-end servers.
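The effect of keeping three copies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hadoop's actual placement logic: the round-robin scheme, the node and block names, and the `place_replicas` and `available_blocks` functions are hypothetical, and real clusters use more sophisticated placement (for example, taking racks into account).

```python
REPLICATION_FACTOR = 3  # the three copies described above


def place_replicas(block_ids, nodes, replicas=REPLICATION_FACTOR):
    """Assign each block to `replicas` distinct nodes (simple round-robin)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive positions modulo the node count are always distinct
        # as long as replicas <= len(nodes).
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replicas)]
    return placement


def available_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


# Hypothetical five-node cluster holding four blocks of data.
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
blocks = ["blk-a", "blk-b", "blk-c", "blk-d"]
placement = place_replicas(blocks, nodes)

# Losing any single node leaves every block readable from its other copies.
for failed in nodes:
    assert available_blocks(placement, {failed}) == set(blocks)

# Only losing all three holders of a block makes that block unavailable.
lost_three = {"node-1", "node-2", "node-3"}
print(sorted(available_blocks(placement, lost_three)))
```

In this toy layout a block becomes unavailable only when all three nodes holding it fail at once, which is why replication lets a cluster tolerate the failure of individual commodity machines.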

Thus the platform piece of the big data analytics equation rests on a solid foundation of Hadoop clusters. But the primary analysis tools on Hadoop remain largely in the hands of highly technical specialists, such as data scientists who are comfortable with procedural languages and with tools such as R, SAS, and Spark. Declarative approaches with SQL and SQL-like processing engines are possible, but are not yet mature enough to handle the complex, machine-generated SQL that BI tools produce. This means that data analysts must be willing to get their hands dirty writing SQL themselves. The more casual end user, who prefers an intuitive graphical user interface (GUI) and data visualization, cannot rely on traditional tools to get true self-service access and analysis directly against big data platforms.

Data Visualization Comes to the Fore

The link between clear data visualization and effective communication of analyses and insights about that data is obvious. Most people simply find it easier to grasp complex relationships intuitively when they are presented visually than when they are presented as rows and columns of tables filled with text and numbers.

How important is data visualization when it comes to expanding big data analytics use beyond statisticians and data scientists to business analysts? In 2016 Gartner made significant changes to its vaunted Magic Quadrant for Business Intelligence and Analytics Platforms. Gartner predicated its changes on the belief that enterprise analytics has evolved today to the point of being both more business-centric and more user-friendly. Most organizations have incorporated a bimodal IT approach—simultaneously emphasizing safety and accuracy (via “traditional and sequential” approaches) as well as agility and speed (through more “exploratory and nonlinear” models).

However, as time progresses, companies will replace legacy BI products with more sophisticated yet more user-friendly tools. These “modern BI platforms,” such as those providing advanced visualization capabilities, will not only support sophisticated big data analytics but also will not require intervention or oversight from IT. This level of self-service gives organizations the agility to discover new insights from data through a faster, more iterative approach that allows hypothesis testing. Compare this to traditional BI data flows, in which data is carefully prepared and organized to answer specific, known business questions, based on the requirements of the business, in a much more centralized and governed approach.

In essence, data visualization tools allow business analysts to “see” the reasoning behind big data analyses and discover new insights more quickly. As a result, it should be no surprise that Gartner and other industry analyst firms maintain that data visualization is becoming a “must-have” for quickly communicating insights gained from big data analytics and converting those insights into actionable business decisions.