I'm a big ACLU supporter, but I thought this was poorly done. They never released either the database or the code for this test, and had configured the recognition level against Amazon's recommendations.
In the ACLU's posts, they say that they used the "default match settings". In their response [0] to Amazon's response, the ACLU links to a guide published by Amazon [1] intended to "Identify Persons of Interest for Law Enforcement" that does use `searchFaceRequest?.faceMatchThreshold = 0.85;` (this is still the case today).
Fast Company [2] writes about this as well: "The ACLU in both tests used an 80% match confidence threshold, which is Amazon’s default setting, but Amazon says it encourages law enforcement to use a 99% threshold for spotting a match." That bit of the article links to the CompareFaces API documentation [3] which (still) states "By default, only faces with a similarity score of greater than or equal to 80% are returned in the response".
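To make the parameter concrete, here's a minimal boto3 sketch of where that threshold sits in the two Rekognition calls being discussed. The bucket, file, and collection names are placeholders, and this is not the ACLU's actual code:

```python
# Minimal boto3 sketch: where the discussed threshold parameters appear in the
# two Rekognition operations. Bucket, file, and collection names are placeholders.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# CompareFaces: per the docs quoted above, only matches with similarity >= 80%
# are returned when SimilarityThreshold is left at its default.
compare = rekognition.compare_faces(
    SourceImage={"S3Object": {"Bucket": "example-bucket", "Name": "probe.jpg"}},
    TargetImage={"S3Object": {"Bucket": "example-bucket", "Name": "candidate.jpg"}},
    # SimilarityThreshold=99.0,  # the value Amazon says law enforcement should use
)
for match in compare["FaceMatches"]:
    print("CompareFaces similarity:", match["Similarity"])

# SearchFacesByImage against an indexed collection: the law-enforcement guide
# cited above sets the equivalent threshold to 0.85 (boto3 expresses it as a
# percentage, so 85.0 here); the documented default is 80%.
search = rekognition.search_faces_by_image(
    CollectionId="example-collection",
    Image={"S3Object": {"Bucket": "example-bucket", "Name": "probe.jpg"}},
    FaceMatchThreshold=85.0,
    MaxFaces=5,
)
for match in search["FaceMatches"]:
    print(match["Face"]["ExternalImageId"], match["Similarity"])
```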
My guess is: Amazon writes that 99% is the recommended threshold because there is little chance of false positive. If an agency is deploying this system and complains that it isn’t working well, a solutions architect will say in a meeting (but not in writing) to lower the threshold. If it comes out that false positives occur, AWS isn’t responsible.
True, but one can't blame Amazon in this case. The agency can reduce the threshold to 0% (ad absurdum) and claim that Amazon's technology is not working.
I guess one must define "adequate" first. The default value of 80% could work fine if you are developing a "find your celebrity doppelgänger" game, while law enforcement should probably use 99%.
I think there's a point at which, as the threshold increases, the probability of the true positive being in your results starts to drop severely. 99% might be useless if you only have one or two hits and they are unlikely to be correct. You can't assume that the one you're looking for will be a 100% match; if it were, you'd just set the threshold to that, presto.
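To illustrate that trade-off, here's a toy simulation. The score distributions below are entirely made up (not Rekognition's real behavior); the point is only that raising the threshold cuts false positives sharply but past some point also starts discarding genuine matches:

```python
# Toy simulation, not real Rekognition data: similarity scores for genuine and
# impostor pairs are drawn from invented distributions to show how the
# false-positive / true-positive trade-off changes nonlinearly with the threshold.
import numpy as np

rng = np.random.default_rng(0)
genuine = np.clip(rng.normal(0.93, 0.05, 1_000), 0, 1)      # true matches
impostor = np.clip(rng.normal(0.70, 0.08, 100_000), 0, 1)   # everyone else

for threshold in (0.80, 0.90, 0.95, 0.99):
    kept = (genuine >= threshold).mean()          # fraction of true matches kept
    false_hits = int((impostor >= threshold).sum())
    print(f"threshold {threshold:.2f}: true matches kept {kept:.1%}, "
          f"false positives {false_hits}")
```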
>Fast Company [2] writes about this as well: "The ACLU in both tests used an 80% match confidence threshold, which is Amazon’s default setting, but Amazon says it encourages law enforcement to use a 99% threshold for spotting a match."
Then this whole thing is potentially misleading because there's a huge difference between 80% and 99%. It's probably nonlinear and they could possibly see their false matches drop to 0. This is not a fair test - or rather, the conclusions are not quite supported by the parameters.
Not that I'm defending police use of facial recognition tech, I think it's abhorrent, though possibly inevitable.
If they make a facial recognition tool available to law enforcement, and the marketing says it "requires no machine learning expertise to use", then I think it's fair to look at any value of the threshold parameter they make available. Especially a parameter that, by changing it, will give you the answer you want more often.
I'm deeply troubled by the text I've seen here implying this threshold is some accuracy percentage or positive predictive value percentage. Unless God is working behind the scenes at AWS they can't make any claim about the accuracy of the model on an as yet unseen population of images.
That's even before getting to the more esoteric map vs territory concerns like identical twins, altered images, adversarial makeup and masks, etc.
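A back-of-the-envelope Bayes' rule calculation (all numbers assumed for illustration) shows why a "99% confidence" match is not the same as a 99% chance that the flagged person is the right one: the positive predictive value also depends on how rare true matches are in the searched population.

```python
# Back-of-the-envelope Bayes' rule calculation with assumed numbers, to show
# that a match "confidence" threshold is not the probability that a flagged
# person is actually the person being searched for.
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """P(truly a match | system reports a match)."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assume the system catches 99% of true matches and wrongly flags only 0.1%
# of non-matches, but only 1 in 10,000 people searched is actually the target.
ppv = positive_predictive_value(0.99, 0.001, 1 / 10_000)
print(f"{ppv:.1%}")  # roughly 9%: most flagged people are still false positives
```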
Just to make sure I understand, which "whole thing" is misleading? The ACLU's test? Amazon's response?
As for the test, you say it's not a fair test. The point / conversation right now seems to be about the choice of parameters used by the ACLU. As far as I see / understand, the ACLU used the default parameters (and/or those recommended in the documentation / articles that are still up today with those same non-99% values).
My cynical guess would be "whatever the lowest number they can get away with using".
I would bet good money that cops' KPI goals benefit from false positives, since they'll reward a higher "number of identified/interviewed suspects" and "number of arrests" as a positive thing even if "number of convictions" doesn't line up.
Even more cynically, I'd bet this is a powerful technique for ambitious cop promotion, and that there's little blowback for fraudulently manipulating parameters that adversely affect POC much more significantly than white people.
Thinking about it, I'm now recalling the multiple reports of police departments claiming to not be using clearview.ai, only to have to backtrack when clearview's customer data got popped and it became public knowledge that individual cops were signing up for free trials, which their department/management either chose to hide or didn't know about. That's reasonably compelling circumstantial evidence to me that ambitious cops are quick to jump on unproven and unauthorised technology with insufficient oversight, or with management actively avoiding oversight for them...
In regards to the KPIs, this is a known reality. Most states get money from the federal gov highway safety program. Then the states disburse it to local police depts, and they expect high numbers of citations (or even warnings) to be reported back up the chain. It is only for DUI that verdicts are considered, and that's only amongst the smarter states. Related to crime, there are NO KPIs based on the final outcome - all are based on the elements the police are able to carry out and be accountable for on their own. This makes sense in some ways beyond self-promotion. I will say also that the general inflation of KPIs in order to justify promotions, grant renewals, etc. is RAMPANT in state and local govs, but especially in policing when it comes to new tech investments and promotions.
Wouldn't it be more likely that they say "ok, we can interview/investigate/whatever X number of people" and then they adjust the threshold to produce that number? If 80% gives them 10,000 hits and 99% gives them one or none, then nobody is going to just go with either setting.
I'd guess that with the potato quality of facial pictures from incidents (security or phone cameras), you might want lower-confidence matches to get results out of lousy pictures.
> had configured the recognition level against Amazon's recommendations.
Citations?
My understanding was that the ACLU used the default settings.
July 26, 2018 — Amazon states that it guides law enforcement customers to set a threshold of 95% for face recognition. Amazon also notes that, if its face recognition product is used with the default settings, it won’t “identify[] individuals with a reasonable level of certainty.”
July 27, 2018 — Amazon writes that even 95% is an unacceptably low threshold, and states that 99% is the appropriate threshold for law enforcement.
Either way, the defaults are the problem if the application is law enforcement.
"Defaults have such powerful and pervasive effects on consumer behavior that they could be considered “hidden persuaders” in some settings. Ignoring defaults is not a sound option for marketers or consumer policy makers. The authors identify three theoretical causes of default effects—implied endorsement, cognitive biases, and effort..."
I do not understand; a fair test is one that replicates reality. I am also wondering whether each city has a different software package with its own config, or an IT guy who tweaks the config.
I posted this in another comment, but I tried to recreate this using a default 70% match threshold. The dataset was 440 images of congressmen and 1,756 mugshots. There were ten mismatches, at between 70% and 77% certainty.
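For anyone who wants to try the same thing, here's roughly how such a test could be set up with boto3. The directory names and collection name are assumptions, the 70% threshold mirrors the comment above, and this is not the commenter's actual code:

```python
# Rough sketch of reproducing that kind of test with boto3. Directory names,
# the collection name, and the 70% threshold are assumptions, not the
# commenter's actual setup.
import os
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")
COLLECTION = "mugshot-test"

rekognition.create_collection(CollectionId=COLLECTION)

# Index the mugshots (local JPEGs in ./mugshots/) into the collection.
for filename in os.listdir("mugshots"):
    with open(os.path.join("mugshots", filename), "rb") as f:
        rekognition.index_faces(
            CollectionId=COLLECTION,
            Image={"Bytes": f.read()},
            ExternalImageId=os.path.splitext(filename)[0],
            MaxFaces=1,
        )

# Search each congressional portrait (./congress/) against the collection.
for filename in os.listdir("congress"):
    with open(os.path.join("congress", filename), "rb") as f:
        result = rekognition.search_faces_by_image(
            CollectionId=COLLECTION,
            Image={"Bytes": f.read()},
            FaceMatchThreshold=70.0,
            MaxFaces=1,
        )
    for match in result["FaceMatches"]:
        print(f'{filename} -> {match["Face"]["ExternalImageId"]} '
              f'({match["Similarity"]:.1f}%)')
```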