When it Comes to Facial Recognition, There is No Such Thing as a Magic Number

Companies and legislators are using misleading test scores to justify the expansion of facial recognition into our communities. That flawed approach understates the threat this dangerous technology poses to civil rights.

Marissa Gerchick, she/her/hers, Data Scientist and Algorithmic Justice Specialist, ACLU

Matt Cagle, Technology and Civil Liberties Attorney, ACLU of Northern California

We often hear about government misuse of face recognition technology (FRT) and how it can derail a person’s life through wrongful arrests and other harms. Despite mounting evidence, government agencies continue to push face recognition systems on communities across the United States. Key to this effort are the corporate makers and sellers who market this technology as reliable, accurate, and safe – often by pointing to their products’ scores on government-run performance tests.

All of this might tempt policymakers to believe that the safety and civil rights problems of facial recognition can be solved by mandating a certain performance score or grade. However, relying solely on test scores risks obscuring deeper problems with face recognition while overstating its effectiveness and real-life safety.

How are facial recognition systems tested?

Many facial recognition systems are tested by the federal National Institute of Standards and Technology (NIST). In one of its tests, NIST uses companies’ algorithms to try to search for a face within a large “matching database” of faces. In broad strokes, this test appears to resemble how police use face recognition today, feeding an image of a single unknown person’s face into an algorithm that compares it against a large database of mugshot or driver’s license photos and generates suggested images, paired with numbers that represent estimates of how similar the images are.

These and other tests can reveal disturbing racial disparities. In their own groundbreaking research, computer scientists Dr. Joy Buolamwini and Dr. Timnit Gebru tested several prominent gender classification algorithms, and found that those systems were less likely to accurately classify the faces of women with darker complexions. Following that, the ACLU of Northern California performed its own test of Amazon’s facial recognition software, which falsely matched the faces of 28 members of Congress with faces in a mugshot database, with Congressmembers of color being misidentified at higher rates. Since then, additional testing by NIST and academic researchers indicates that these problems persist.

While testing of facial recognition for accuracy and fairness across race, sex, and other characteristics is critical, the tests do not take full account of practical realities. There is no laboratory test that represents the conditions and reality of how police use face recognition in real-world scenarios. For one, testing labs are not going to have access to the exact “matching database,” the particular digital library of faces on mugshots, licenses, and surveillance photos, that police in a specific community search through when they operate face recognition. And tests cannot account for the full range of low-quality images from surveillance cameras (and truly dubious sources) that police feed into these systems, or the trouble police have when visually reviewing and choosing from a set of possible matches produced by the technology.

Despite these real concerns, vendors routinely hold up their performance on tests in their marketing to government agencies as evidence of facial recognition’s reliability and accuracy. Lawmakers have also sought to legislate performance scores, setting across-the-board accuracy or error-rate requirements for facial recognition algorithms and allowing police to use any FRT system that clears them. This approach would be misguided.

How can performance scores be misleading?

It is easy to be misled by performance scores. Imagine a requirement that police can only use systems that produce an overall true positive rate above 98 percent in testing. (The true positive rate is a measure of how often the results returned by an FRT system include a match for the person depicted in the probe image when there is a matching image in the database.) At first glance, that might sound like a pretty strong requirement, but a closer look reveals a very different story.

For one, police typically configure and customize facial recognition systems to return a list of multiple results, sometimes as many as hundreds of results. Think of this as a ‘digital lineup.’ In NIST testing, if at least one of the results returned is a match for the probe image, the search is considered successful and counted as part of the true positive rate metric. But even when this happens in practice — which certainly isn’t always the case — there is no guarantee that police will select the true match rather than one of the other results. True matches in testing might be crowded out by false matches in practice because of these police-created ‘digital lineups.’ This alone makes it difficult to choose one universal performance score that can be applied to many different FRT systems.
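The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four searches, the candidate lists, and the function name `true_positive_rate` are all hypothetical and are not drawn from NIST's test code or data. The sketch shows how the same set of searches can yield a very different score depending on how long a candidate list is allowed to count.

```python
def true_positive_rate(searches, list_length):
    """Share of searches whose first `list_length` candidates include the true match.

    Every search here has a matching image in the database, so a "hit" means
    the true identity appears anywhere in the returned list, at any rank.
    """
    hits = sum(
        1 for s in searches
        if s["true_id"] in s["candidates"][:list_length]
    )
    return hits / len(searches)


# Hypothetical toy data: each search returns a ranked list of candidate IDs.
searches = [
    {"true_id": "A", "candidates": ["A", "X", "Y"]},  # true match ranked first
    {"true_id": "B", "candidates": ["X", "B", "Y"]},  # true match ranked second
    {"true_id": "C", "candidates": ["X", "Y", "C"]},  # true match ranked third
    {"true_id": "D", "candidates": ["X", "Y", "Z"]},  # true match never returned
]

tpr_top1 = true_positive_rate(searches, list_length=1)
tpr_top3 = true_positive_rate(searches, list_length=3)

print(tpr_top1)  # 0.25
print(tpr_top3)  # 0.75
```

In this toy example the score triples when three candidates count instead of one, yet in two of the three "successful" searches a false match sits above the true one. The score says nothing about which candidate a human reviewer will actually pick.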

Let’s look at another metric called the false positive rate, which assesses how often an FRT search will return results when there is no matching image in the database. Breaking results down by race, the same algorithm that produces the 98 percent true positive rate overall can also produce a false positive rate for Black men several times the false positive rate for white men, and an even higher false positive rate for Black women. This example is not merely a hypothetical: in NIST testing, many algorithms have exhibited this pattern. (1) Other recent NIST testing also shows algorithms produced false positive rates tens or hundreds of times higher for females older than 65 born in West African countries than for males ages 20-35 born in Eastern European countries. (2)
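To see how a single overall number can hide a disparity of this kind, consider a minimal Python sketch. The counts below are invented for illustration and do not come from NIST or any real system; the group labels `group_a` and `group_b` and the function name are likewise hypothetical.

```python
def false_positive_rate(false_positives, nonmated_searches):
    """Share of searches with no matching image in the database that still return a result."""
    return false_positives / nonmated_searches


# Hypothetical counts for two demographic groups (illustrative only).
groups = {
    "group_a": {"nonmated_searches": 10_000, "false_positives": 10},
    "group_b": {"nonmated_searches": 10_000, "false_positives": 40},
}

# Overall rate: pool every search together, ignoring group membership.
total_false_positives = sum(g["false_positives"] for g in groups.values())
total_searches = sum(g["nonmated_searches"] for g in groups.values())
overall_fpr = false_positive_rate(total_false_positives, total_searches)

# Disaggregated rates: compute the same metric separately for each group.
fpr_by_group = {
    name: false_positive_rate(g["false_positives"], g["nonmated_searches"])
    for name, g in groups.items()
}

# Ratio taken from the raw counts; valid here because both groups
# have the same number of searches.
disparity = groups["group_b"]["false_positives"] / groups["group_a"]["false_positives"]

print(overall_fpr)    # 0.0025
print(fpr_by_group)   # {'group_a': 0.001, 'group_b': 0.004}
print(disparity)      # 4.0
```

The pooled figure of 0.25 percent looks small, but it is an average over two groups, one of which is falsely matched four times as often as the other. A requirement written only in terms of the overall rate would not detect that gap.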

By only considering the true positive rate, we miss these extreme disparities, which can lead to devastating consequences. Across the United States, police are arresting people based on false matches and harming people like Nijeer Parks, a Black man who police in New Jersey falsely arrested and held in jail for ten days because police trusted the results of face recognition, overlooking obvious exonerating evidence. Human mis-reliance on face recognition is already a problem; focusing on performance scores might make things worse.

What’s the takeaway for policymakers?

Lawmakers should know that a facial recognition algorithm’s performance on a test cannot be easily or quickly generalized to make broad claims about whether that algorithm is safe. Performance scores are not an easy fix for the harms resulting from the use of face recognition systems, and they of course don’t account for the humans who will inevitably be in the loop.

As the ACLU explained in its recent public comment to the Biden Administration, the problems of facial recognition run deep and beyond the software itself. Facial recognition is dangerous if it’s inaccurate (a problem that testing aims to address), but it is also dangerous even if it could hypothetically be perfectly accurate. In such a world, governments could use face surveillance to precisely track us as we leave home, attend a protest, or take public transit to the doctor’s office. This is why policymakers in an expanding list of U.S. cities and counties have decided to prohibit government use of face recognition. And it’s why the ACLU supports a federal moratorium on its use by law and immigration enforcement agencies.

Conversations about the shortcomings of performance scores are important, but instead of trying to find some magic number, policymakers should focus on how any use of facial recognition can expand discriminatory policing, massively expand the power of government, and create the conditions for authoritarian control of our private lives.

Endnotes:

(1) For one demonstrative example, an FRT algorithm developed by the vendor NEC and submitted to NIST’s vendor testing program produced an overall true positive rate above 98% in some of the testing. See National Institute of Standards and Technology, Face Recognition Vendor Test Report Card for NEC-2 1, https://pages.nist.gov/frvt/reportcards/1N/nec_2.pdf (finding a false negative identification rate (FNIR) of less than .02, or 2%, for testing using multiple datasets. The true positive identification rate (TPIR) is one minus the FNIR). However, in other NIST testing, the same algorithm also produced false positive rates for Black men more than three times the false match rate for white men at various thresholds. See Patrick Grother et al., U.S. Dep’t of Com., Nat’l Inst. for Standards & Tech., Face Recognition Vendor Test Part 3: Demographic Effects Annex 16 at 34 fig.32 (Dec. 2019), https://pages.nist.gov/frvt/reports/demographics/annexes/annex_16.pdf.

(2) See National Institute of Standards and Technology, Face Recognition Technology Evaluation: Demographic Effects in Face Recognition, FRTE 1:1 Demographic Differentials Summary, False Positive Differentials, https://pages.nist.gov/frvt/html/frvt_demographics.html (last visited February 6, 2024). The table summarizes demographic differentials in false match rates for various 1:1 algorithms and highlights that many algorithms exhibit false match rate differentials for images of people of different ages, sexes, and regions of birth. For example, the algorithm labelled as “recognito_001” produced a false match rate for images of females over 65 born in West African countries 3000 times the false match rate for images of males ages 20-35 born in Eastern European countries. NIST notes that “While this table lists results for 1:1 algorithms, it will have relevance to that subset of 1:N algorithms that implement 1:N search as N 1:1 comparisons followed by a sort operation. The demographic effects noted here will be material in 1:N operations and will be magnified if the gallery and the search stream include the affected demographic.”