Reading the headlines, it’s easy to think the only thing bigger than ‘big data’ is big talk about big data. In fact, it’s been about three years since Gartner’s Svetlana Sicular called the start of Hadoop’s “trough of disillusionment.” So it should come as no surprise that big data skeptics with entrenched interests in the status quo — IT specialists and their suppliers alike — are working overtime waving flags with question marks.
However, no less an authority than Gartner recognizes that the trough of disillusionment is not the end of the road; rather, it traverses the inflection point leading to the “slope of enlightenment.” And with the big data and analytics experience accumulated over the last several years, the shift shows no signs of slowing down. In 2016, big data will quite literally be the elephant in the room. Here’s what I predict you’ll hear from those who think they can continue to ignore it.
Until business users get well-structured access to ever-increasing sources of information, self-service tools are really an exercise in rearranging the slices on the pie chart. Without direct, well-managed, granular access to big data, visualization puts the cart before the horse.
With the massive repositories Hadoop makes possible and the broad range of data sources it can cache, the classic data warehouse is beginning to look like modest lake-front property. BI tools need to start at billions of records and scale up from there.
Have you exchanged one IT labor shortage for another? You probably shouldn’t count on a PhD statistician in a white lab coat to show up and save the day.
The pressure of direct access, consumption, and discovery will do far more to expose and address the risks in big data than taking a wait-and-see attitude. Putting big data in front of more people more often is likely the fastest way to expose its flaws and drive improvement. No data becomes reliable until you rely on it.
Open source pours gasoline on the Hadoop arms race, with transparent roadmaps and code bases — fundamentally different from the central-planning approach behind the roadmaps of proprietary databases that predate the BlackBerry.
ETL for that Much Data is Slow and Complex
Good news: business process automation; bad news: data silos. Achieving coherence across independently developed formats and schemas has always required coding. Now? Even Ralph Kimball, a father of data warehousing, argues that Hadoop will rapidly become a leading player in enterprise ETL.
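To make the ETL point concrete, here is a minimal sketch of the kind of per-record schema harmonization the paragraph describes, in plain Python rather than any specific Hadoop tool. The source systems, field names, and conformed shape are all hypothetical; the point is that the transform is a simple per-record function, exactly the shape of work a Hadoop cluster parallelizes naturally.

```python
# Two hypothetical source systems describe the same entity
# with independently developed schemas.
crm_records = [{"cust_name": "Acme Corp", "cust_country": "US"}]
erp_records = [{"name": "Zenith Ltd", "country_code": "GB"}]

def normalize(record):
    """Map either source schema onto one conformed shape.

    A pure per-record transform like this is trivially
    parallelizable -- the classic map step of large-scale ETL.
    """
    return {
        "name": record.get("cust_name") or record.get("name"),
        "country": record.get("cust_country") or record.get("country_code"),
    }

# Conform both feeds into a single schema.
conformed = [normalize(r) for r in crm_records + erp_records]
```

In a real deployment the list comprehension would be replaced by a distributed job over files in HDFS, but the transform logic itself stays this simple.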
Looking for lost car keys under the streetlamp because it’s dark everywhere else is no different from using legacy BI tools that can’t span large-scale datasets with billions of data points. Collaboration and social media have made it easier for people to find each other and ask each other questions. Why can’t we ask questions of the data just as easily?
It’s ironic, at the very least, to treat Hadoop as nothing more than the place you keep big data until you need it somewhere else. Maintaining a parallel data processing infrastructure just to keep up with Hadoop is swamping traditional architectures. Using Hadoop’s native capabilities to run analytics at scale, on the data in place, can be a critical success factor.
Hadoop’s fundamental shift to schema-on-read, built on HDFS, provides a faster, more flexible mechanism for exposing the structure of the underlying data without tying it down — in effect, decoupling how the data is stored from how it is interpreted.
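The schema-on-read idea can be sketched in a few lines of plain Python, independent of any particular Hadoop tool (the field names and records here are invented for illustration): raw records land untyped, and each consumer applies its own schema at read time rather than at write time.

```python
import json

# Raw events are stored exactly as they arrive -- no schema
# is enforced at write time (schema-on-read, not schema-on-write).
raw_lines = [
    '{"user": "alice", "amount": "19.99"}',
    '{"user": "bob", "amount": "5", "region": "EMEA"}',  # extra field: fine
]

def read_with_schema(lines, schema):
    """Apply a {field: type-cast} schema at read time.

    Fields outside the schema are ignored; each consumer can
    read the same raw data with a different schema.
    """
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

# One consumer's view of the data; another could define its own.
billing_schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_lines, billing_schema))
```

The same raw lines could be re-read tomorrow with a richer schema — no rewrite of the stored data required, which is the decoupling the paragraph describes.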
The Hadoop platform provides extensibility comparable to an operating system: metadata via HDFS; native security with authentication, authorization, and encryption services; coherent execution management via YARN — and there are many more examples. A cost-centered big data strategy is only a stepping stone to real big data advantage.
Cubes have long been the cure for slow joins, and one way they deliver is by reducing granularity — a reasonable approach for small data, but with big data you lose valuable insight. Hadoop delivers distributed compute horsepower to the data in place, rather than pre-diluting it to fit dated data management methods.
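A toy illustration of the granularity point, using made-up sales rows: once a cube pre-aggregates away a dimension to speed up queries, any new question that needs the discarded detail cannot be answered from the cube — but it can still be answered from the raw grain.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (region, customer, amount).
facts = [
    ("west", "acme", 100),
    ("west", "zenith", 40),
    ("east", "acme", 75),
]

# Schema-on-write cube: pre-aggregate to region level for fast queries,
# discarding the customer dimension in the process.
cube = defaultdict(int)
for region, _customer, amount in facts:
    cube[region] += amount

# The cube answers the question it was built for...
west_total = cube["west"]

# ...but a new question -- top customer overall -- needs the grain
# the cube threw away, so we must go back to the raw facts.
by_customer = defaultdict(int)
for _region, customer, amount in facts:
    by_customer[customer] += amount
top_customer = max(by_customer, key=by_customer.get)
```

Keeping the raw grain and bringing compute to it, as Hadoop does, means the second question costs another pass over the data instead of a redesign of the cube.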
This is a myth. Hadoop actually supports best-in-class security primitives: Kerberos, LDAP/AD integration, and file-level access control. Holding Hadoop to the same security threshold as your data warehouse could mean lowering your standards.
No More Excuses
2016 marks a decade since Google’s papers on MapReduce and BigTable inspired Doug Cutting and what became the Hadoop community to rethink what was once called “data processing.” You need only look back to where Oracle stood in 1988, its own tenth year — and to what the relational model did to the data landscape in the decade that followed — for every reason to rethink the objections to big data you’ll hear this year.