Sampling is at the heart of statistical methods. Sample bias, a.k.a. “the wrong denominator”, is also one of their biggest risks. While it’s easy to point fingers at sampling errors, we’d all be wise to remember the joke about the late-night drunk looking for a lost key in the glow of the streetlamp because that’s where the light is.
In this week’s Hadooponomics podcast, Cornelia Davis, CTO of the Transformation Practice at Pivotal, speaks with Blue Hill Research host James Haight about her work consulting with companies and helping them figure out how to make the most of their investments in Big Data. She talks about the transition to the “Third Platform,” i.e. cloud platforms premised on highly distributed, highly malleable, and constantly changing architectures.
Understanding that change is the only constant is easier said than done. While it’s not uncommon to hear complaints in the IT industry about silos, the fact is that silos helped create fierce focus. The downside? A historical divide between applications and the data they consume, which we now need to reconsider. Says Davis:
Applications are really a data problem. So you have to build these applications, but there’s no such thing as an application that doesn’t have data that’s serving it, that there’s some kind of data that’s supporting the application at the back end. … So I think it’s ironic that, coming from the Big Data side, you see it as an application problem, and coming from the application side, we see it as a data problem. The crux of it is that it’s a data and application problem… The divide is artificial.
One of Davis’s big insights is that converged teams, often referred to as DevOps, are a critical enabler of Big Data success. But that’s not a natural transition, because architectural and functional biases are deeply ingrained:
[T]he old model [is] ‘plan, build, run’ … a group that plans out what the application is that is needed by customers, a group that builds them, and a completely different group that runs them… How do you remap the roles to become more efficient?
A key side effect of silos is that the complexity within each one has driven ever-deeper expertise in separate domains. Those domains now need to find new ways to recognize their differences, collaborate across them, and make the most of them.
In other words, you can get really efficient with a very narrow focus, and suddenly you’ve found yourself with the wrong denominator. Another word for the wrong denominator: monoculture.
The most obvious example of monoculture is often right in front of us, in the gender gap that has the technology industry fishing for talent in half the pool. Here’s an anecdote Davis cites from the documentary Code: Debugging the Gender Gap:
[V]oice recognition systems, initially, were engineered by men, and they were designed and they were tested by men. So when the first voice recognition systems came out they couldn’t recognize female voices at all. They simply couldn’t hear them. Female voices could not be heard because none of the data that they used for development or testing was data of women speaking.
The most convenient way for engineers to make voice recognition work was to test it on each other. Simple enough, but the team left out half of their customers.
Working with a bigger denominator is one of Big Data’s most salient attributes, and analysis that can harness that scale is a key goal. But Big Data faces a perfect storm of monocultures. At the same time as IT organizations are tackling ever-bigger pools of data, business users are finding ways to pursue Big Data analytics without relying on IT.
What end users end up turning to are self-service analytics tools that predate Big Data, principally because users don’t want to wait for IT. This should come as no surprise: end users know that the supply of data is increasing, and if IT can’t deliver it to them in a timely fashion, end users turn to self-service analytics. These tools extract data into smaller chunks for visual analysis and PowerPoint-ready charting. Because the tools require smaller sets of data to be extracted, end users are left looking under the streetlamp: convenient, but short-sighted. Why doesn’t this work with Big Data? That’s IT’s problem. Buying tools because self-service is more important than collaborating with IT is convenient, but ultimately self-serving.
The composition of teams tackling Big Data has to change fundamentally, both outside and inside IT. Davis describes this realignment within IT in depth. The Royal Bank of Canada describes exactly this approach to broadening collaboration. It’s short-sightedly easy – and self-serving – to do otherwise.
A lot of what drives gender inequity is what we call implicit bias. And every one of us has biases that are just ingrained in us because we, as human beings, we get so much data coming in that we need to categorize. And when we categorize, we risk including biases in those categorizations. And so every one of us has implicit biases. I had colleagues of mine come up afterward and say, “You know what, I signed my son up for coding camp this summer. I never thought to sign up my daughter.” To which I always respond, “You’re signing her up tomorrow, right?” … “Yep, signing her up tomorrow!”
Big Data turns out to be a team sport. Make sure your team is equipped with the right people and the right approach.
To hear more of Cornelia’s commentary, check out The Hadooponomics Podcast, Episode 11 – Finding Big Data Success & Debugging the Gender Gap. The Hadooponomics Podcast series is produced by Blue Hill Research in partnership with Arcadia Data. You can listen to prior episodes here.