If you’re a fan, enthusiast, consumer, perpetrator, or even a victim of big data, then you go to the Strata Data Conference. That’s what you do. I fall into at least two of the preceding categories, and I was very happy to spend the last week of September in New York City to attend the 2017 New York incarnation of the conference. I’ve gone to this conference numerous times over the years, and I especially enjoy the keynote presentations. Not only do I get to learn more about the latest technological innovations from leading big data vendors, but I also learn about how data is used for important initiatives throughout the world.
Targeting Wealthy Criminals
One great talk was by Sam Lavigne of The New Inquiry, who is also an NYU professor. Sam’s talk on White Collar Crime Risk Zones described the namesake application he and colleagues wrote for The New Inquiry, which applies a predictive policing model to financial data from FINRA instead of the typical police data on street crimes. The application predicts where and when white-collar crimes will occur based on data on where and when they’ve already occurred.
I’m continually fascinated (and disturbed) to hear about machine learning algorithms reproducing the biases in their training data. Most of us who follow machine learning trends know this, but it’s always interesting to hear specific examples. In the case of predictive policing, the biased data can be due to systemic biases that exist in the police department, particularly around race. What this means is that predictive policing creates feedback loops that lead to over-policing of communities composed predominantly of people of color.
So Sam’s model risked having the same types of biases, but it was potentially different because of the different data sources. His system parsed textual data to search for corporate fines that are evidence of white-collar crimes. He also used a machine learning approach common in predictive policing known as “risk terrain modeling,” which correlates features of a landscape, such as the density of non-profits, bars, and clubs, with incidents of financial malfeasance. He showed a heatmap of New York City to reveal the risk areas for white-collar crime. He also showed a feature that creates composite facial images, built from photos scraped from LinkedIn, of individuals who might commit white-collar crimes. That was met with some skeptical snickers in the crowd, but it was still amusing to watch.
He wrapped up his talk by saying that typical policing methodologies tend to criminalize poverty, which means typical predictive policing apps will target the poor. On the other hand, White Collar Crime Risk Zones will target the wealthy criminals. As he pitched his application to mayors across the country, he received some interest, but also received rejections, apparently because white-collar crime was deemed a lower priority compared to other types of crime.
Internet of Things in the Wild
Jer Thorp, also of NYU, gave an interesting talk titled Wild, Wild Data: Adventures with Big Data and the IoT in the Angolan Highlands. He noted that many of us have a strange tendency to stay away from data sources and look at them from afar to remain objective. As a researcher, he realized this was preventing him from doing his job as well as he could, so he made a push to be closer to the data. Many of us as corporate citizens might take “close to the data” to mean sitting in the data center, but Jer’s work is far more interesting than that. He shared experiences with Alvin, a deep-ocean submersible that he took 658 meters down into the Gulf of Mexico. One incredible observation was a brine pool, which is essentially a lake on the seafloor that remains a separate body of water due to its ultra-dense salinity. He showed pictures of mussels that metabolize methane that rises from the earth. One might wonder, why not send robots to those depths? He actually put that question to Cindy Lee Van Dover, the first female pilot of Alvin, who responded, “the only way that humans care about systems is if they actually go there themselves.”
If that weren’t enough, he also shared his experiences in the Okavango Delta in Botswana, one of the world’s richest and wildest wildlife areas. His anecdote about being chased in the water by a hippopotamus was both scary and funny. After all, I grew up thinking hippos were fun, gentle creatures, a misconception I undoubtedly picked up from watching reruns of Peter Potamus as a kid. Jer’s knowledge that hippos can’t swim was of little comfort, as they can run underwater at speeds up to 30 kilometers per hour despite weighing up to 4000 pounds.
He also talked about one of the most remote places on earth in the Angolan Highlands (following an armored mine-clearing truck to get there), as well as the boiling river in Peru. But it wasn’t all fun and games. Collecting data was the objective, and one outcome of his travels was the site intotheokavango.org that lets you explore millions of data points. He also talked about “fieldkit,” a platform for open science that enables field researchers to collect environmental evidence at very low cost. With fieldkit, which he plans to release in early 2018, Jer is hoping to recruit as many people as possible to identify evidence of the effects of a changing climate. His parting advice for getting better at working with data: get out of your chair.
More Lies, Damned Lies, and Statistics
The final talk I want to share was by Cathy O’Neil, former professor at Barnard College and author of Weapons of Math Destruction (also the title of her talk). In her talk, she gave examples of data analysis going awry. One example described the attempts to find “bad” school teachers using simple math. Never mind that there has always been an achievement gap correlated with economic status, where students in poorer communities tend to do worse than students in richer communities. This means that punishing teachers based on student scores tends to punish teachers of economically disadvantaged students. So how do we rectify this? One idea was to rate teachers based on expected scores versus actual scores, where the expected scores are based on last year’s actual scores. The problem is, if teachers game the system to artificially raise actual scores, the next year’s teachers will suffer. The teachers who inherit students with inflated scores are measured against an unfairly high expectation, so even honest gains in actual scores can make them look unfit. (I’m not sure I got this explanation right, but I’m sure I can correct myself after I read her book.)
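To make the arithmetic of that last point concrete, here is a toy sketch of the “expected versus actual” idea as I described it above. It is not the model from Cathy’s talk or from any school district; the function name `teacher_rating` and all of the numbers are made up for illustration, and it assumes the expected score is simply last year’s score.

```python
def teacher_rating(last_year_scores, this_year_scores):
    """Toy rating: average of (actual - expected) per student.

    Assumption for illustration only: a student's expected score
    is just that student's score from last year.
    """
    diffs = [actual - expected
             for expected, actual in zip(last_year_scores, this_year_scores)]
    return sum(diffs) / len(diffs)


# Hypothetical class of three students.
honest_last_year = [60, 70, 80]

# Same students, but last year's teacher inflated every score by 10 points.
inflated_last_year = [score + 10 for score in honest_last_year]

# This year's teacher produces a real 2-point gain over the honest baseline.
this_year = [score + 2 for score in honest_last_year]

# Rated against honest scores, the teacher shows an improvement.
fair_rating = teacher_rating(honest_last_year, this_year)

# Rated against inflated scores, the same teacher appears to lose ground.
unfair_rating = teacher_rating(inflated_last_year, this_year)

print(fair_rating, unfair_rating)
```

With these made-up numbers, the same teaching gets a positive rating against the honest baseline and a negative one against the inflated baseline, which is the unfairness described above.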
Cathy also gave an example of a young man who was illegally profiled during a job interview. He suffered from a personality disorder for which he was being treated, and during that job interview he was given a personality test with questions very similar to ones he’d seen during his treatment. Because the man was told he was not hired because of his test results, and because his father was a lawyer, they proceeded with a lawsuit over the company’s hiring practices. After all, there is no business requirement for a personality test, and these scores did not actually prove anything regarding the man’s ability to work.
Cathy’s final recommendation was that we all open our data analysis algorithms to scrutiny and understanding so that we don’t continue to make these types of analytical mistakes.
Even More Good Information
There were a bunch of other great talks as well. A few more notable highlights:
- Joanna Bryson (University of Bath, Princeton Center for Information Technology Policy), when talking about artificial intelligence, defined intelligence as “doing the right thing at the right time.”
- Robin Thottungal (U.S. Environmental Protection Agency) talked about the significant improvements over the years that resulted in reduced air and water pollution. He encouraged us all to look at their data and tell them what is happening in the environment to help prevent the environmental problems we are facing.
- Manuel Garcia-Herranz (UNICEF Office of Innovation) talked about how we can create new data ecosystems to help the most vulnerable. For example, understanding the patterns of human movement can help with planning where to set up medical centers to limit the spread of epidemics.
- Chad W. Jennings (Google) talked about Robert Plutchik’s wheel of emotions, which defines love as the combination of trust and joy (i.e., Love = Trust + Joy).
Although I didn’t attend many breakout sessions at the show, I heard great things about a couple of them related to Arcadia Data. Michelle Tower of Procter & Gamble (an Arcadia Data customer) talked about their successes with data. Our CTO and co-founder, Shant Hovsepian, described the various options for real-time streaming analytics today. And there were a bunch of great presentations at the Arcadia Data booth from some of our partners, including Cloudera, Hortonworks, Leidos, MapR, and Pitney Bowes.
It was another exciting Strata Data Conference for us, and allow me to wrap up by saying that in terms of the tchotchke roundup, Arcadia Data had a strong showing. Our Arcadia Data baseball T-shirt was a big hit, along with our recently published book, Modern Business Intelligence (soon to be available as an eBook), co-written by our CEO, Sushil Thomas, and our VP of Marketing, Steve Wooledge. The Pez dispensers with Batman, Superman, and Wonder Woman were also very popular, not to mention our selection of stickers. You’ll want to collect them all!
And finally, I hope you were able to learn more about how Arcadia Data can help you with your big data analytics initiatives. Contact us to continue the conversation or start exploring your data now with Arcadia Instant.