NCMEC's database does not only contain CSAM. It has never been audited. It's ful...

simondotau · on Aug 14, 2021

You don't know what's in it, but you do know it's full of false positives? I wonder, do you know how many of those false positives are flagged as A1?

0xy · on Aug 14, 2021

I know for a fact that it is full of false positives, there's also public sources making the same claim. [1]

[1] https://www.hackerfactor.com/blog/index.php?/archives/929-On...

simondotau · on Aug 14, 2021

I clicked on your link hoping for serious analysis. I would have settled for interesting analysis. I was disappointed. It's just a guy who found one MD5 hash collision. The paragraph was written in a way that makes it unclear whether the source of this specific hash was in fact NCMEC or if it was "other law enforcement sources". So which was it? Did this person follow up with the source to confirm whether his hash match was a false positive or a hash collision?

So in short, the source for your claims has evidence of between zero and one false positives.

Unimpressive would be an understatement.

0xy · on Aug 14, 2021

It was not a hash collision. It was a false positive. The way the hashing solution works doesn't really allow for a collision except in extraordinarily rare circumstances.

It matched an image of a fully clothed man holding a monkey as CSAM, direct from NCMEC's database.

He encountered a 20% false positive rate while running a moderately popular image service, with an admittedly low sample size. It's still evidence, and given NCMEC is immune from oversight, FOIA and accountability, it's concerning.

Also, the fact I know there are false positives does not stem from that post. I know it independently, but since you asked for a source stronger than "just trust a random internet guy", I gave you one.

He's not the only person making the claim though, others throughout these threads with industry knowledge have confirmed what I already knew. If you're asking me to reveal how I know, I'm afraid you'll be disappointed. I'd rather be accused of lying than elaborate.

simondotau · on Aug 14, 2021

> It was not a hash collision. It was a false positive.

Again, this was an MD5. It's literally impossible to assert that it was a false positive with absolute certainty. A hash collision is not outside the realm of possibility, especially with MD5. Apparently no attempt was made to chase this up. And we still don't know whether it was the NCMEC database or "other law enforcement sources".

You continue to claim that it was "direct from NCMEC's database" but again, that isn't asserted by your source.

> He encountered a 20% false positive rate

He encountered one potential false positive. Converting one data point into a percentage is exactly why earlier I described this nonsense as being disingenuous. The fact that you would cite this source and then defend their statistical clown show is, in my opinion, strong positive evidence that your other citation-free assertions are all entirely made up.

0xy · on Aug 14, 2021

What you're missing is the statistical probability of two MD5 hashes colliding, which is astronomically unlikely.

Your argument is essentially that a collision is more likely than a false positive, which would imply a false positive rate of 0.00% in NCMEC's database based on its size.

It's clear that nothing will convince you if you believe that humans managing a database will never, ever make a mistake after over 300,000,000 entries. Because if they make one single mistake, then it supports my argument -- a false positive becomes substantially more likely statistically than a hash collision.
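The size of the accidental-collision probability can be estimated with a rough sketch. It assumes MD5 behaves like a uniformly random 128-bit function and takes the 300,000,000 entry count from this comment at face value (it is not verified). It models only accidental collisions, not deliberately constructed ones, and the function name is hypothetical.

```python
from fractions import Fraction

MD5_BITS = 128
DB_ENTRIES = 300_000_000  # figure quoted in the comment; unverified assumption


def accidental_match_probability(db_entries: int, bits: int = MD5_BITS) -> Fraction:
    """Union-bound estimate of the chance that one unrelated file
    happens to share a hash with any of `db_entries` stored hashes,
    assuming hash outputs are uniformly random over 2**bits values."""
    return Fraction(db_entries, 2 ** bits)


p = accidental_match_probability(DB_ENTRIES)
print(f"accidental match probability per file: about {float(p):.1e}")
```

Under these assumptions the per-file probability is on the order of 1e-30, so any nonzero rate of mislabelled entries would dominate it. The estimate says nothing about which database the hash came from.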

You're also providing a pretty large red herring with your suggestion that he could've simply asked NCMEC if it was a false positive or a hash collision. NCMEC would never provide that information, because the database is highly secret.

Given those statistics, I think that source is more than valid.

>in my opinion, strong positive evidence that your other citation-free assertions are all entirely made up

I am happy to let you believe that I'm lying.

One industry insider whose employer works with NCMEC cited a false positive rate of 1 in 1,000. [1] The product he works in is used in conjunction with NCMEC's database. Elsewhere, in press releases, the company cites a failure rate of 1% (presumably both false positives and false negatives) [2]

[1] https://news.ycombinator.com/item?id=21446562

[2] https://www.prnewswire.com/news-releases/thorns-automated-to...

simondotau · on Aug 14, 2021

Nowhere have I claimed that the NCMEC databases are entirely devoid of miscategorised data. I am merely pushing back at your evidence-free claim that "it's full of false positives."

Once again, you continue to assume that the MD5 collision cited was from a NCMEC corpus and not "other law enforcement sources". You're reading far more into your sources than they are saying.

And now you are conflating claims of false positives in the origin database with rates of false positives in a perceptual hashing algorithm. You are clearly very confused.
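To make the distinction concrete, here is a minimal sketch contrasting an exact-match hash (MD5) with a toy perceptual hash. The `average_hash` function and the 4x4 grayscale grids are illustrative assumptions; this is not PhotoDNA, NeuralHash, or any algorithm NCMEC or Thorn is known to use.

```python
import hashlib


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when the pixel
    is brighter than the image's mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def md5_of(pixels):
    """Exact-match hash of the raw pixel bytes."""
    flat = [p for row in pixels for p in row]
    return hashlib.md5(bytes(flat)).hexdigest()


# Hypothetical 4x4 grayscale image and a slightly brightened copy.
original = [[10, 200, 10, 200],
            [200, 10, 200, 10],
            [10, 200, 10, 200],
            [200, 10, 200, 10]]
brightened = [[p + 5 for p in row] for row in original]

print(md5_of(original) == md5_of(brightened))   # exact hashes differ
print(hamming(average_hash(original), average_hash(brightened)))
```

An exact hash only matches byte-identical files, so a benign hit points to either a collision or a mislabelled database entry. A perceptual hash matches anything within some distance threshold, so its error rate describes the matching algorithm rather than the contents of the database.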

0xy · on Aug 14, 2021

>Nowhere have I claimed that the NCMEC databases are entirely devoid of miscategorised data

If NCMEC's databases have a false positive rate of 1 in 1,000, do you realise that means a false positive is substantially more likely than a hash collision?

>I am merely pushing back at your evidence-free claim that "it's full of false positives."

1 in 1,000 is "full of false positives" by my standards, and I've posted evidence for that rate from an employee whose company works directly with NCMEC.

>you continue to assume that the MD5 collision cited was from a NCMEC corpus and not "other law enforcement sources".

NCMEC and law enforcement are one and the same. FBI employees work directly at NCMEC. [1] Law enforcement have direct access to the database, including for categorisation purposes. To suggest law enforcement's dataset is tainted yet NCMEC's is not doesn't make any sense to me.

>You are clearly very confused

Address the 1 in 1,000 claim. Thorn is an NCMEC partner and a Thorn employee has claimed a false positive rate of 1 in 1,000. In press releases, Thorn is even less sure at 1% failure rates. Thorn uses perceptual hashing.

I can't see how you can simultaneously claim the database isn't "full of false positives" while acknowledging a failure rate as abysmal as 1 in 1,000.

I also didn't conflate anything. Both a hash collision and a perceptual hash collision are less likely than a false positive, by an extraordinary margin. Apple claims their algorithm has a collision rate of 1 in 1,000,000,000,000. Compare that to 1 in 1,000.

The database is full of false positives, and now presumably you'll deny both industry claims and NCMEC partner claims.

[1] https://www.fbi.gov/investigate/violent-crime/cac

simondotau · on Aug 14, 2021

Thorn isn't claiming that "the database is full of false positives." Once again you are conflating claims of false positives in the origin database with rates of false positives in a perceptual hashing algorithm. You are so catastrophically confused that ongoing dialogue serves no purpose. Goodbye.