Chapter 3: Rise of the Citizen Data Scientist

The concept of “citizen journalism” refers to ordinary citizens “playing an active role in the process of collecting, reporting, analyzing, and disseminating news and information.” While citizen journalism in the United States has been around nearly as long as the country itself, it has become substantially more commonplace since the late 1980s. Nor is citizen journalism unique to the United States; individuals and groups around the world have embraced the concept as well, as during the 2010 Haiti earthquake, the Arab Spring, the 2013 protests in Turkey, the Euromaidan events in Ukraine, and the Syrian Civil War.

According to Terry Flew, Professor of Media and Communication at the Queensland University of Technology in Brisbane, Australia, three key elements led to the rise of this viral information-sharing approach:

  • Open publishing
  • Collaborative editing
  • Distributed (online) content

All three elements are possible because technological advances opened the business of journalism to everyone, even those untrained in journalistic practices. The use of “citizen” as an adjective describing individuals empowered by technology applies equally to the “citizen data scientist.”

In 2015, Gartner coined the aforementioned term, characterizing such an individual as “a person who creates or generates models that leverage predictive or prescriptive analytics but whose primary job function is outside of the field of statistics and analytics.” While the challenges facing these armchair data analysts differ from those of their commentator counterparts, their primary objective is virtually the same. Some discourage the use of the term “citizen data scientist,” since in many cases it simply describes the work of a business analyst or “power user” of a BI tool. However, we’ll use it in this book as a convenient way to describe analytics beyond what a casual business user might attempt: leveraging advanced analytical processing through a simple visual interface.

The broader story is about making data more valuable to more users. In the emerging world of big data analytics, “if the right BI-user applications can be built, this will empower a new generation of business data consumers, much broader than just the technical specialist pool of data scientists, DBAs, and analysts,” contends Nik Rouda, senior analyst at ESG. “Opening up access to insights in Hadoop would trigger a virtuous cycle of data utilization. As more users draw more value, that additional value would draw more innovation in the ways that Hadoop is leveraged across the business.”

The Imperative of User-Friendly Analytics

The bottom line is that big data analytics solutions can create significant business value by delivering relevant insights to the people best equipped to act on them for the good of the enterprise, and those people are business people. The success of digital natives such as Uber and Lyft transforming transportation, Amazon and eBay redefining retail, and YouTube and Hulu enabling “cord cutters” is closely linked to their ability to analyze data better than their competitors. Even companies in traditional industries see how big data is accelerating business transformation.

The challenge, of course, is putting in place the right platform and the right tools to allow ever more business analysts to undertake big data analytics projects. Gartner recommends starting by “facilitating ingestion, preparation, and analysis of complex data currently beyond the reach of business information analysts.” Next, organizations need to “increase the range of analytics capabilities available to users by deploying tools” for data discovery, self-service data preparation, and behavioral analytics.

While data warehouses have been around since the 1980s, mass adoption has been largely limited to large enterprises because of their cost. Traditionally, data warehouses were centrally managed “on premises,” with storage supplied by a storage area network (SAN) or network-attached storage (NAS) devices. As the number of data warehouse consumers grew, the required system resources, including storage, memory, and network bandwidth, grew proportionally. Because data warehouses were originally intended to be repositories for structured data, maintaining and scaling such an environment proved costly from capex, opex, and labor perspectives alike.
