Codehash.db – A public database for software and firmware hashes (github.com/rootkovska)
111 points by andrewdavidwong on Nov 12, 2016 | 29 comments



This is like a version of Certificate Transparency for software. Attempts have already been made to port CT to software ("Binary Transparency"), and I like them better than this approach.

Specifically, you can demand a CT receipt for your downloads, proving that the download's existence has been made public.

Without that receipt, and in this scheme, it's still possible to simply target someone with tailored malware and assume they won't bother to check hashes against one of these databases.

With CT, the download client itself can automatically refuse a download that has not been publicly announced.
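
For anyone wondering what "checking a receipt" actually involves: the core is a Merkle inclusion proof. Here's a minimal sketch of the RFC 6962 audit-path math in Python; the receipt parsing, log signatures, and gossip that real CT needs are all omitted.

    import hashlib

    def leaf_hash(data):
        # RFC 6962 leaf hash: SHA-256(0x00 || leaf)
        return hashlib.sha256(b"\x00" + data).digest()

    def node_hash(left, right):
        # RFC 6962 interior node hash: SHA-256(0x01 || left || right)
        return hashlib.sha256(b"\x01" + left + right).digest()

    def verify_inclusion(leaf, index, tree_size, proof, root):
        # Walk the audit path from the leaf up to the claimed root.
        if index >= tree_size:
            return False
        fn, sn = index, tree_size - 1
        h = leaf_hash(leaf)
        for p in proof:
            if sn == 0:
                return False  # proof is longer than the tree is deep
            if fn % 2 == 1 or fn == sn:
                h = node_hash(p, h)
                while fn % 2 == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
            else:
                h = node_hash(h, p)
            fn >>= 1
            sn >>= 1
        return sn == 0 and h == root

    # Toy demo: a two-entry log of download hashes.
    a, b = b"example-1.0.tar.gz sha256...", b"example-1.1.tar.gz sha256..."
    root = node_hash(leaf_hash(a), leaf_hash(b))
    assert verify_inclusion(a, 0, 2, [leaf_hash(b)], root)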


Fair enough, but this system isn't intended for end users who, as you point out, are unlikely to bother checking the hashes of their downloads. Quoting Joanna Rutkowska:

"Also, in case it wasn't clear: the primary audience for such a DB should be developers or admins (e.g. IT department in a large organization), I think. Not users. Users are always somehow fated to trust the 'last mile' vendor, and there is little feasibility in implementing any form of trust distribution for them."

https://secure-os.org/pipermail/desktops/2016-November/00014...


The claim in that quote is incorrect, in my opinion. Under CT, users are able to mistrust the issuing CA -- we start by assuming they want to give us malware (in the CT case, MITM certs) and trust them only to the extent that they are distributing publicly announced and tracked artifacts to us. This happens on the end user's computer, when their browser refuses to accept a cert with no CT announcement attached. All of this happens in running Chrome browsers today.

If other software (e.g. your Linux distro) similarly checked for publicly announced artifacts (e.g. an offered package upgrade) then you would be protected against targeted malware from your last mile vendor. The malware has to be either offered to everyone (ensuring detection) or no-one.
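
As a sketch of what that gate could look like in a package manager, assuming a hypothetical transparency log with a lookup-by-digest endpoint (the URL and API below are invented):

    import hashlib
    import urllib.error
    import urllib.request

    # Hypothetical transparency log; URL and API invented for illustration.
    LOG_URL = "https://log.example.org/lookup/"

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def allow_install(package_path):
        # Refuse any package whose hash was never publicly announced:
        # malware must then be served to everyone (and detected) or no one.
        try:
            with urllib.request.urlopen(LOG_URL + sha256_of(package_path)) as r:
                return r.status == 200
        except urllib.error.HTTPError:
            return False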

I think the CT mechanism is simply better than a system that "isn't intended for end users", because a CT mechanism protects both administrators and end users.


I don't speak for Joanna, but I interpret that quotation as saying something like:

"Users are always fated to trust the 'last mile' vendor because the last mile vendor (e.g., Google Chrome), has control over what the user sees and does (i.e., sends and receives). If your Chrome browser is compromised or malicious, it can silently ignore the fact that no CT announcement is attached to a cert. In this sense, the user is fated to trust Chrome.

"Moreover, there's little feasibility in implementing any form of trust distribution for them, but this is not to say that there's little feasibility in implementing a system that keeps them relatively secure. Users running a non-malicious, non-compromised instance of Chrome do not have any form of trust distribution, since they place all their trust in Chrome (though they probably don't realize it). Nonetheless, Chrome may be keeping them relatively secure as long as it's working properly."


Thanks, "last mile" confused me because it implies a transfer. With CT applied to software updates, I think you really could be suspicious-by-default of your software vendor.

Do you have any thoughts on why codehash.db should exist, versus pushing the same hashes to a CT log and having clients check for CT announcements? CT seems like a clear improvement.


I honestly don't know enough about CT to have an opinion.


Is there any accommodation for the per-user downloads implemented by Chrome, Dropbox, etc. (each with their own ways of sneaking around Authenticode verification)?

https://textslashplain.com/2016/05/13/cheating-authenticode-...


Why would they even do that? After you get a signing certificate, you can sign as many times as you want, so it's not like they're saving any money.


Even if the installer is signed, Windows and/or your AV will warn you if it's a package that hasn't been downloaded by many other users.


Interesting idea. The Software Heritage project (https://www.softwareheritage.org/) has the goal of doing this for all software source code; perhaps they might be interested in extending that to binaries as well? That seems compatible with their goal of preservation.


Software Heritage looks excellent, but the two projects may have different goals. Software Heritage is focused on collecting, preserving, and sharing code (and, as you say, potentially compiled binaries), whereas codehash.db is focused on letting people securely authenticate software after they've obtained it through some other means.


How is this different from the NSRL, and why wouldn't you use that instead?


The main difference seems to be that the NSRL does not include PGP signatures (or any substitute), so there's no way to verify that the hashes are authentic, in the sense that the hashed software is bitwise identical to the software that the developer intended to distribute. This is precisely the problem that codehash.db is designed to solve. Without any way to verify the authenticity of the hash values, we have to rely on the authority of the NSRL itself. (In addition, the fact that the NSRL appears to have close ties to the U.S. government might make it even harder for some people to trust it.)
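
To make the difference concrete, verification against a signed hash list might look like the sketch below. The file names are hypothetical and the signer's key is assumed to be in the local keyring already; only the gpg invocation is standard:

    import hashlib
    import subprocess

    def verify_hash_list(hashlist="hashes.txt", sig="hashes.txt.asc"):
        # Verify the publisher's detached PGP signature over the whole list.
        # Raises CalledProcessError if the signature is bad or missing.
        subprocess.run(["gpg", "--verify", sig, hashlist], check=True)

    def file_in_list(path, hashlist="hashes.txt"):
        # Only meaningful after verify_hash_list() has succeeded.
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        with open(hashlist) as f:
            return any(line.split()[0] == digest for line in f if line.strip())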


The NSRL dataset has signatures that are typically used to verify both integrity and veracity.

http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt

Alleging that the NSRL is untrustworthy is inconsistent with the track record of the NSRL and NIST scientists.

Please be aware that there are thousands of forensic experts who have relied on the NSRL over the last decade or more as a basis for testimony in court. Those experts verify hashes for everything they do, and for every case, and as a result there has been a significant amount of independent peer review of the contents.

While Codehash.db provides a hash for a package, the NSRL provides hashes for individual installed files.

This in no way diminishes the value of the Codehash.db design. They target different use cases.


> The NSRL dataset has signatures that are typically used to verify both integrity and veracity.

> http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt

Can you explain this signature scheme? I'm not familiar with it. The link you provided just appears to show hashes and sizes for a file that has been split into four pieces.

> Alleging that the NSRL is untrustworthy is inconsistent with the track record of the NSRL and NIST scientists.

I'd just like to point out that neither I nor anyone else here has alleged that.

> Please be aware that there are thousands of forensic experts who have relied on the NSRL over the last decade or more as a basis for testimony in court. Those experts verify hashes for everything they do, and for every case, and as a result there has been a significant amount of independent peer review of the contents.

I'm genuinely glad to hear that! That's good to know.

> While Codehash.db provides a hash for a package, the NSRL provides hashes for individual installed files.

I don't think that's necessarily true. Codehash.db is open to hashes for anything (source code, ISO, package, binary installer).

> This in no way diminishes the value of the Codehash.db design. They target different use cases.

Likewise, my remarks aren't meant to be in any way derogatory toward the NSRL. As far as I'm concerned, it's OK if they do, in the final analysis, target the same use case. If that's the case, the best solution should be adopted, whichever one that turns out to be. :)


I am not sure I follow you on signatures. Doug White at NIST (https://twitter.com/dwhitenist) could change some hashes trivially and then sign them, and you'd never know the difference unless you thought you had the same file with a different hash. Even then you'd probably chalk that up to having a new version of the file that wasn't in the NSRL. Are you thinking of some other scheme?

At the end of the day, I think it comes down to trusting Doug, which a lot of people do.


That's precisely the point. Doug could trivially change some of the hashes before signing them. If he were to do that, he wouldn't be trustworthy, and you, as a security-conscious individual, would want additional witnesses to corroborate the hashes before you're willing to accept that the software you downloaded is authentic. This is what codehash.db is designed to provide. (If you would be willing to chalk up the hash difference to a version difference, then this is probably aiming at a higher level of security than what you seek.)

In reality, Doug would never change hash values like that because he's trustworthy. At least, he wouldn't willingly or knowingly do it. But if Doug's signature is the only thing that guarantees the authenticity of a list of millions of hashes, that paints an awfully large target on his back. How do you know that Doug hasn't been coerced into changing some hash values before signing them? How do you know that Doug's signing key hasn't been compromised? We can't know these things for certain, but we'd have much greater assurance if we could check the signatures of multiple independent parties in addition to Doug's, and that's exactly what codehash.db aims to allow. It's a way of distributing trust across a larger group of people instead of centralizing it in a single point of failure.
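
To make the "multiple witnesses" idea concrete, here's a minimal sketch: accept a hash list only if at least k independent parties signed the exact same bytes. The keyring layout and threshold are made up; only the gpg invocation is standard.

    import subprocess

    def witness_signed(hashlist, sig, keyring):
        # True if `sig` is a valid detached signature over `hashlist`
        # made by a key in this particular witness's keyring.
        r = subprocess.run(
            ["gpg", "--no-default-keyring", "--keyring", keyring,
             "--verify", sig, hashlist],
            capture_output=True)
        return r.returncode == 0

    def corroborated(hashlist, witnesses, threshold=3):
        # Trust the list only if enough independent witnesses signed the
        # same bytes. One coerced or compromised signer (even Doug) is
        # then no longer a single point of failure.
        good = sum(witness_signed(hashlist, sig, ring)
                   for sig, ring in witnesses)
        return good >= threshold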

By the way, does Doug actually sign the hashes? I haven't been able to find any signatures, so please point me to them if there are any.


How do you determine identity with hash values? Alice could say that svchost.exe's hash is deadbeefdeadbeef and Bob could say it's baadcodebaadcode, but, of course, they both could be right because there are umpteen versions of svchost.exe. So, how do you solve the identity problem in order to detect evil?


It depends on the entity being hashed, but in the case of software, it's usually a version number. In the case of source code, maybe a git commit hash.
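
In other words, the lookup key is (name, version) rather than the bare file name. A toy sketch (the dict stands in for what would really be signed files in a repo):

    # Toy model: the identity key is (name, version), not the file name.
    db = {}

    def record(name, version, digest, witness):
        key = (name, version)
        if key in db and db[key] != digest:
            # Two sources disagree about the very same name+version:
            # a packaging mistake, or evidence of a tampered build.
            raise ValueError("conflict for %s %s: %s vs %s (from %s)"
                             % (name, version, db[key], digest, witness))
        db[key] = digest

    record("svchost.exe", "10.0.14393.0", "deadbeefdeadbeef...", "Alice")
    record("svchost.exe", "10.0.10586.0", "baadcodebaadcode...", "Bob")  # fine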


Personally, I had never heard of the NSRL: http://www.nsrl.nist.gov

Looks nifty, though.


Super nifty.


This would be great for intrusion detection if there were tools that users could use to automatically query the database and that repository maintainers could use to upload hashes.
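
A first pass at the query side could be as dumb as the sketch below: clone the database, harvest everything that looks like a SHA-256 digest, and compare local files against it. The layout of the clone is a guess on my part; a real tool would parse the actual file format and check the PGP signatures too.

    import hashlib
    import os
    import sys

    DB_DIR = "codehash.db"  # local git clone; this layout is a guess

    def known_digests():
        # Collect every SHA-256-looking token from the cloned database.
        digests = set()
        for root, dirs, files in os.walk(DB_DIR):
            dirs[:] = [d for d in dirs if d != ".git"]
            for name in files:
                with open(os.path.join(root, name), errors="ignore") as f:
                    for tok in f.read().split():
                        tok = tok.lower()
                        if len(tok) == 64 and all(c in "0123456789abcdef"
                                                  for c in tok):
                            digests.add(tok)
        return digests

    if __name__ == "__main__":
        db = known_digests()
        for path in sys.argv[1:]:
            d = hashlib.sha256(open(path, "rb").read()).hexdigest()
            print(("OK  " if d in db else "MISS"), d, path)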


There are a fair number of tools that do just this, either with NIST's NSRL or with commercial hash sets, the notable one being Bit9. Bit9's set is an order of magnitude or two larger than the NSRL (which itself is several orders of magnitude larger than this database).


Don't package managers like Homebrew in principle do something similar? It would be interesting to join forces, I guess.


Automated package building usually looks like this:

    wget -O - http(s)://... | tar xf -
    make
    package
    sign_with_gpg
    upload_to_ftp
Only if a package maintainer gets involved is there a chance that release signatures are actually verified. But even then, a whole lot of upstream projects just don't sign their releases. Some distros don't sign their packages, either. Or even their ISOs (iirc Linux Mint only started doing this fairly recently).
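
For upstreams that do sign their releases, the fetch step could at least look like this sketch. The .asc naming convention and a pre-pinned signing key are assumptions here, and they're exactly the assumptions that often fail in practice:

    import subprocess
    import urllib.request

    def fetch_verified(url):
        # Download the tarball and its detached signature, verify, then
        # unpack. Assumes upstream publishes foo.tar.gz.asc next to the
        # tarball and the signer's key is already pinned in the keyring.
        tarball = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, tarball)
        urllib.request.urlretrieve(url + ".asc", tarball + ".asc")
        subprocess.run(["gpg", "--verify", tarball + ".asc", tarball],
                       check=True)  # fails loudly on a bad signature
        subprocess.run(["tar", "xf", tarball], check=True)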

Also, "web of trust" only works for a tiny subset of people. If I'm a "lone wolf" FOSS developer, my key won't be signed by anyone, there won't be any WoT to verify. Downstream packagers just have to swallow that or TOFU.


> Also, "web of trust" only works for a tiny subset of people. If I'm a "lone wolf" FOSS developer, my key won't be signed by anyone, there won't be any WoT to verify. Downstream packagers just have to swallow that or TOFU.

One way to mitigate this nowadays is through services like keybase.io, which allow you to aggregate evidence for the authenticity of your key from social media accounts and websites. You can also do this yourself by posting your PGP fingerprint in many different places. These methods make it much more difficult for someone to create a new key in order to impersonate you. Accordingly, it's easy to trust that a key really belongs to a certain person -- even if there are no signatures on it -- when there's a long history of evidence from many different sources that would collectively be very difficult to spoof.
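
A crude version of that cross-checking: fetch each place the developer claims to have posted their fingerprint and require them all to agree. All URLs and the fingerprint below are placeholders.

    import re
    import urllib.request

    # Placeholder URLs where the developer claims the fingerprint appears.
    SOURCES = [
        "https://example.org/about",
        "https://example.com/u/lonewolf",
    ]
    CLAIMED_FPR = "0123 4567 89AB CDEF 0123 4567 89AB CDEF 0123 4567"

    def fingerprint_corroborated():
        # Require every independent source to show the same fingerprint;
        # spoofing one site is easy, spoofing all of them is much harder.
        want = re.sub(r"\s+", "", CLAIMED_FPR).upper()
        for url in SOURCES:
            page = urllib.request.urlopen(url).read().decode(errors="ignore")
            if want not in re.sub(r"\s+", "", page).upper():
                return False
        return True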


Sorry, I am new to this, but my understanding is that homebrew will verify the hash before installing, right?


It probably does, but what good does that do if the source code wasn't securely transported and verified on its way from the upstream to the packager?

Disclaimer: I don't know anything about brew packaging practices. Maybe they always require verification. Maybe they don't.


Cool idea.



