Early in the history of software engineering (the 1990s), there was vigorous debate as to whether the practice of writing software could properly be considered an engineering discipline at all. In comparison with mechanical engineering and chemical engineering, the argument went, software did not meet the test of applying scientific principles to solve practical problems.
Big Data owes its growth in no small part to Marc Andreessen’s dictum that software is eating the world. The vigorous debate around whether software is engineering, or whether software engineering is, as E.W. Dijkstra put it, “how to program if you cannot,” is beside the point. Software development is flooding our world with data in the rush to digital everything.
So what’s an engineer to do? What kind of engineering or science is required for Big Data initiatives to pay off? In the latest podcast episode of Hadooponomics, Edd Dumbill (who also goes by Edd Wilder-James), VP of Strategy at Silicon Valley Data Science, brings a unique perspective to these questions. As a futurist and a veteran of Big Data initiatives, Dumbill has had a front-row seat to the tension between the vision of Big Data and its practice:
“[There was] a thought that really likened analytics to a big flat box of magic; that a CEO would say, here, analytics is for me! And then wonderful results would appear… [but] you have to understand a little bit about how to actually interface the exploration and exploitation of your data into solving a business problem…if the right people aren’t interested in it, then it’s not gonna get joined into business, it’s not going to be created with an eye to the problems it solves.”
Understanding the problems that Big Data solves — that practical streak is at the heart of engineering. That doesn’t mean that engineering is the singular answer to making Big Data pay off. Investment of time and effort needs to be tied to a business problem, i.e., the profitable creation of value and what holds it back.
So much engineering has gone into Big Data and the software that feeds our data lakes that it’s easy to imagine that there is some clear and common understanding of a practical problem behind it. There isn’t.
True, processes are optimized, costs are driven out, throughput is accelerated, failure rates are reduced, variance is quashed. But those gains are usually local, not global. Here’s the pattern that Dumbill and Silicon Valley Data Science have seen in their engagements:
“Working with data is…like an archaeological dig where you’re not entirely sure what you will find and what will be the prize thing that you’ll exploit…You certainly can’t have large investments hanging off the end of each notion. We all know, in company settings, there’s this two year pilot project which has just never died, it’s a zombie system that’s never quite delivered, because somebody’s attached a professional, political status to keeping this thing going or they don’t want to lose the budget. The nature of failure [is] to fail fast and fail often, and it’s finding those good results [that] is a positive contribution because it tells you where not to go, and as long as you’re moving quickly, and haven’t got too much invested in those dead ends, then it’s a good thing.”
Many of the powerful, costly, and highly optimized systems from the 1990s that formed the heart of pre-Hadoop data infrastructure were exactly those kinds of two-year projects (yes, zombies are among us). But taking that 20th-century approach to kicking off Big Data is the opposite of science. The scientific method holds that the most practical way to solve a problem is to experiment, fail, and learn from those failures. Each experiment refines the hypothesis; the more failures, the more learning.
Because Hadoop radically lowers the cost of experimentation, it unlocks the scientific potential in data (not what we usually think of when we talk about data scientists), so long as failure is always an option.
To hear more about what Edd has to say, check out The Hadooponomics Podcast, Episode 7 – Big Data’s Evolution and the Future of Work. The Hadooponomics Podcast series is produced by Blue Hill Research in partnership with Arcadia Data. Listen to prior episodes here.