Here's a pretty interesting article about racial bias, with an in-depth explanation of how face recognition works in the vast majority of face recognition products!
Hey Scotty! Great article. Race is another feature that can distinguish an image of one person from an image of a different person, so it makes sense that computer vision engineers would want to use race as a differentiating feature when classifying a face.
The article also discusses that in computer vision algorithms your training data set must match real-world conditions in order to be unbiased. I have a question for you! Let's say I have a robot that operates in my home, and it's running a computer vision classifier that I want to train 'on-the-fly' with images captured in real time by the camera on the robot. How do I ensure that I'm training my classifier with data that is unbiased (i.e. the data accurately represents the true, real-world distribution)?

It seems to me the data collected in the home would be biased toward the individuals who live in that home (i.e. "local" data), because the robot will naturally spend more time around those people. For example, assume that I wrote a computer vision algorithm running on the robot that classifies race from images of people's faces. However, all the people who live in the home with the robot are Caucasian. If I train only on that data, wouldn't my robot's algorithm always predict race as Caucasian, because the training data would be biased toward having way more images of Caucasians than images of other races?

What are your thoughts on 'on-the-fly' training and keeping training data sets unbiased in situations where "local" data is more accessible?
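To make the worry concrete, here's a minimal sketch (all names and numbers are made up for illustration) of why a classifier trained only on one home's data can look perfect locally and still fail badly on a balanced real-world test set:

```python
from collections import Counter

# Hypothetical "local" labels collected by the robot at home:
# everyone in this household happens to be Caucasian.
local_labels = ["caucasian"] * 500

# A degenerate classifier that always predicts the majority class is a
# useful worst case for reasoning about class imbalance.
majority_class = Counter(local_labels).most_common(1)[0][0]

def predict(_image):
    return majority_class

# Evaluated on the biased local data, it looks perfect...
local_accuracy = sum(
    predict(x) == y for x, y in zip(local_labels, local_labels)
) / len(local_labels)

# ...but on a balanced test set it collapses to one-in-N-classes accuracy.
balanced_test = ["caucasian", "black", "asian", "hispanic"] * 25
balanced_accuracy = sum(
    predict(x) == y for x, y in zip(balanced_test, balanced_test)
) / len(balanced_test)

print(local_accuracy)     # 1.0 on the biased data
print(balanced_accuracy)  # 0.25 on the balanced data
```

The gap between those two numbers is exactly the bias the question is asking about: the local evaluation hides the failure entirely.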
I’m absolutely no expert, but I do worry about AI bias a lot, because I’m always one of those people who is outside the mean distribution of things.
Re: your question above, where local training should indeed take precedence, is it still possible for the classifier to come preloaded with a more absolute, global set of data? That way the local data could exist within a reference frame of a much larger distribution of more complete data?
First, most machine learning inference is tied strongly to the quality of the data used for training. Machine learning performs poorly when there are class biases and when the training data doesn't accurately reflect the data seen at inference time.
For face recognition specifically, I would say Misty's "on-the-fly" training isn't really training in the machine learning sense; thousands of faces have already been used to train a neural network to extract the important features of a face (e.g. the size of cheekbones, or the distance between eyes). Those features are then used to classify whether a face the robot sees is "Bob" or "Sue" or _____. In this sense, the robot has already learned how to discriminate between faces.
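A sketch of that architecture (with a stand-in `embed` function, since the real system would run a pretrained network): "meeting" a new person is just storing a labeled embedding, and recognition is nearest-neighbor matching in the embedding space. The network itself is never retrained on the robot:

```python
import math

# embed() stands in for a pretrained network that was trained once, offline,
# on thousands of faces. Here we pretend images already ARE embeddings;
# a real system would run a CNN and return e.g. a 128-dimensional vector.
def embed(image):
    return image

known_faces = {}  # name -> embedding, filled in as the robot "meets" people

def enroll(name, image):
    """'On-the-fly training': just store one labeled embedding."""
    known_faces[name] = embed(image)

def identify(image):
    """Recognition: nearest stored embedding by Euclidean distance."""
    e = embed(image)
    return min(known_faces, key=lambda name: math.dist(known_faces[name], e))

enroll("Bob", (0.9, 0.1))
enroll("Sue", (0.1, 0.8))
print(identify((0.85, 0.15)))  # closest to Bob's stored embedding
```

Because the feature extractor is frozen, any bias from its original training set (e.g. underrepresentation of some groups) is baked in before the robot ever sees its first face, no matter how the local enrollment data looks.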
There are learning techniques that do learn more actively, and those methods would be more at risk from small, unrepresentative data samples.
Beyond data balance as a means to avoid bias in algorithm output, Scotty’s article also discusses the consequences of bias. There’s a great quote from the article that captures this well:
“We’ve all heard about racial bias in artificial intelligence via the media, whether it’s found in recidivism software or object detection that mislabels African American people as Gorillas. Due to the increase in the media attention, people have grown more aware that implicit bias occurring in people can affect the AI systems we build.”
The author suggests that misclassification (or algorithm output errors) can promote racism. This issue is called data discrimination, and it exists in technologies that we use every day. There's a book coming out in February 2018 that takes an in-depth look at racial biases in algorithms.
Quote: “In Algorithms of Oppression, Safiya Umoja Noble challenges the idea that search engines like Google offer an equal playing field for all forms of ideas, identities, and activities. Data discrimination is a real social problem; Noble argues that the combination of private interests in promoting certain sites, along with the monopoly status of a relatively small number of Internet search engines, leads to a biased set of search algorithms that privilege whiteness and discriminate against people of color, specifically women of color.”
Noble shows how Google searches such as "white girls" and "black girls" produce drastically different results. And if you can't wait for the book to be released, here's Safiya Umoja Noble on a recent podcast:
So, given that the data we use to train our algorithms has far-reaching social implications, what are some techniques we can use in algorithm design and data selection to mitigate racism and other unintended consequences of our algorithms' outputs? Or alternatively, does that mean more effort and energy should be dedicated to developing algorithms that are not data-dependent?
We’ll have to get that book!