Human biases in data mining

Opposing views on what to do about the data we create

Amy Webb

Last Updated: Mar 19, 2017 | 11:10 PM IST

DATA FOR THE PEOPLE

How to Make Our Post-Privacy Economy Work for You

Andreas Weigend

Basic Books; 299 pages; $27.99

THE ART OF INVISIBILITY

The World’s Most Famous Hacker Teaches You How to Be Safe in the Age of Big Brother and Big Data

Kevin Mitnick with Robert Vamosi

Little, Brown & Company; 309 pages; $28

Data is the new oil, and we humans are the wells. Our digital crude is a rich brew of mundane, everyday activities — our searches, texts and tweets — along with the GPS coordinates from our phones, the biometric information we share with fitness devices, even the IP addresses of our connected refrigerators. To the average person, this raw material is undetectable noise. But for organisations that know how to identify signals, there’s immense value in refining what has become an unlimited supply.

The popular old data-as-oil idea opens Andreas Weigend’s new book, Data for the People, an exhaustive and insightful look at how data is collected and used, often without our knowledge and almost always without our input. Weigend, the former chief scientist at Amazon, details the “social data” that emanates from billions of cameras, sensors and other devices, as well as social networks, online retailers and dating apps. Data refineries — those companies and people who turn our digital crude into profitable information — hunt for patterns, then sort us into buckets based on our behaviour. As Weigend points out, this exchange benefits everyone: If we let ourselves be mined, we receive personalised recommendations, connections and deals. Yet there’s an imbalance of power. Companies make a lot of money from our data, and we have very little say in how it’s used.

Weigend argues persuasively that in this “post-privacy” world, we should give our data freely, but that we should expect certain protections in return. He advocates a set of rights to increase data refineries’ transparency and to increase our own agency in how information is used. Companies like OkCupid, WeChat and Spotify should perform data safety audits, submit to privacy ratings and calculate a score based on the benefits they provide — a sort of credit score for the companies that mine our data.

Not everyone believes that our information should be freely available as long as we agree to the terms of use. In The Art of Invisibility, the internet security expert Kevin Mitnick advocates the opposite. Mitnick notes various reasons we may want to hide our data: We’re wary of the government; we don’t want businesses intruding into our lives; we have a mistress; we are the mistress; we’re a criminal. Mitnick, who served five years in prison for hacking into corporate networks and stealing software, offers a sobering reminder of how our raw data — from email, cars, home Wi-Fi networks and so on — makes us vulnerable.

Both books are meant to scare us, and the central theme is privacy: Without intervention, they suggest, we’ll come to regret today’s inaction. I agree, but the authors miss the real horror show on the horizon. The future’s fundamental infrastructure is being built by computer scientists, data scientists, network engineers and security experts who do not recognise their own biases. That blind spot encodes a flaw in the foundation itself, one that urgently needs fixing. The next layer will be just a little off, along with the next one and the one after that, as the problems compound.

Human bias creeps into computerised algorithms in disconcerting ways. In 2015, Google’s photo app mistook a black software developer for a gorilla in photos he uploaded. In 2016, the Microsoft chatbot Tay went on a homophobic, anti-Semitic rampage after just one day of interactions on Twitter. Months later, reporters at ProPublica uncovered how algorithms in police software discriminate against black people while mislabelling white criminals as “low risk.” Recently when I searched “C.E.O.” on Google Images, the first woman listed was C.E.O. Barbie.

Data scientists aren’t inherently racist, sexist, anti-Semitic or homophobic. But they are human, and they harbour unconscious biases just as we all do. This comes through in both books. In Mitnick’s, women appear primarily in anecdotes and always as unwitting, jealous or angry. Weigend’s book is meticulously researched, yet nearly all the experts he quotes are men.

Early on he tells the story of Latanya Sweeney, who in the 1990s produced a now famous study of anonymised public health data in Massachusetts. But Sweeney is far better known for something Weigend never mentions: She’s the Harvard professor who discovered that — because of her black-sounding name — she was appearing in Google ads for criminal records and background checks. Weigend could have cited her to address bias in the second of his six rights, involving the integrity of a refinery’s social data ecosystem. But he neglects to discuss the well-documented sexism, racism, xenophobia and homophobia in the machine-learning infrastructure.

The omission of women and people of colour from something as benign as book research illustrates the real challenge of unconscious bias in data and algorithms. Weigend and Mitnick rely only on what’s immediate and familiar — an unfortunately common practice in the data community.

As a futurist, I try to figure out how your data will someday power things like artificially intelligent cars, computer-assisted doctors and robot security agents. That’s why I found both books concerning. You may look like Weigend and Mitnick and therefore may not have experienced algorithmic discrimination yet. You, too, should be afraid. We’ve only recently struck oil.