March 18, 2016 - David M Fishman | Big Data Ecosystem

Beneath the Surface of the Data Lake

Modern application development, whether you date it to Kernighan and Ritchie or Visual Basic, has been largely focused on ridding people of drudgery. Get rid of the grunt work, enforce consistency, spit out some data as a side effect, and software has done its job. It’s worked so well that the data spit out now fills lakes.

As originally conceived, Business Intelligence was designed to extract insight from applications that were not built to deliver insight. Before big data formally took shape as distributed computing infrastructure for data processing, the data warehouse was the metaphor we used for huge piles of side-effect data that we would store somewhere for later use.

It’s time to admit that we too often treat data as a defensive exercise. Whether it sits in a warehouse or a lake, it’s about staying out of trouble (compliance), figuring out what already happened (reports and dashboards), and protecting the data with fierce hardware engineering (used EMC storage arrays, anyone?).

The uncomfortable truth is that application development is moving a lot faster than big data. While Spark and R will help close that gap, the vast majority of application developers aren’t working on consuming data; they’re working on today’s version of drudgery-elimination, whether or not they do it on smartphones. Witness the tremendous productivity created by moving application development and deployment techniques to the cloud.

Where does this go? If you want to start a food fight in Silicon Valley, question the inevitability of The Singularity. The idea presumes we are headed for that HAL moment when computers have the power to do all the automation and all the analysis for us, better than we ever could: machine learning that can out-science all the data scientists.

There’s a middle ground between data science and inching towards the singularity, and that’s the development of data applications: software that treats analysis of big data as a task worth automating better. I’m not talking about whizzier dashboards, or some new form of desktop productivity software that springs from cross-breeding spreadsheets and slide presentations.

Data applications, like other applications, recognize that there’s a process of working with big data directly, chewing through different angles and aggregates, and seeking insight. Storing and securing the data is necessary, but not sufficient. Think back to what it was like to give someone directions on how to travel from point A to point B before there was GPS and Google Maps. That pre-GPS world is the big data world we’re living in now, and we need to put that experience at the core of how we do big data. We need to rethink how that data is secured, exposed, explored, visualized, and shared.

As software eats the world, it measures the nutritional value in data. And if you really want to see the value of a data lake, try standing downstream and poking a hole in the dam holding all that data back.

