Collisions aren't a major risk with MD5 when you also give someone the file size (even approximate).
Finding a collision in MD5 is costly; finding a collision in MD5 which is within ±10% of the actual size is extremely costly (technically possible, but maybe not in your lifetime).
As to the other reply "because it is zip something something" I disagree. Zip is an extremely good format for crafting fake files which match a checksum. Really any format which can take arbitrary metadata (which is MOST) is pretty easy.
I suspect the reason they use MD5 is that everything supports it and it is "good enough," particularly alongside the file size. Plus, the person downloading them knows the files are malware, so what could the security services do: inject even nastier malware that they then expect the user to run?! Seems dumb. You're likely more at risk from day-to-day application installers which aren't digitally signed.
>Finding a collision in MD5 is costly; finding a collision in MD5 which is within ±10% of the actual size is extremely costly (technically possible, but maybe not in your lifetime).
MD5 collisions within 10% of the size of the file can be found in seconds on an old laptop. I've done it; we assign it as homework in class.
Notice that the two colliding exe are exactly the same file size. These attacks have only gotten better.
>Zip is an extremely good format for crafting fake files which match a checksum. Really any format which can take arbitrary metadata (which is MOST) is pretty easy.
The example I gave uses Windows and Linux executables. No zip files in sight. These attacks are from 2009.
> Notice that the two colliding exe are exactly the same file size. These attacks have only gotten better.
They're also 6 KB, not 200+ KB. They have been specially crafted to be as small as possible to make the problem set as easy as possible.
> The example I gave uses Windows and Linux executables. No zip files in sight. These attacks are from 2009.
That's a really strange reply. What is it you think I said? I said, and to quote you quoting me: "'Really any format which can take arbitrary metadata (which is MOST) is pretty easy.'"
So why you felt the need to point out that it is an executable and not a zip file is, uhh, strange to say the least...
>They're also 6 KB, not 200+ KB. They have been specially crafted to be as small as possible to make the problem set as easy as possible.
That is not how it works. MD5 is vulnerable to length extension attacks[0]. Once you collide part of an MD5 hash, if everything that follows that collision is the same, it can be as long as you want. Colliding large files is just as easy as colliding small files. You could perform the same exercise with 1GB executables.
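To make the "everything that follows is the same" claim concrete, here is a minimal sketch of the idea on a toy hash, not on MD5 itself. Everything in it is an assumption made for illustration: `toy_hash`, `compress`, and `find_collision` are hypothetical names, the chaining state is only 16 bits so that a brute-force collision takes a fraction of a second, and the compression function is just truncated SHA-256. Real MD5 collisions come from differential cryptanalysis (tools like fastcoll), not brute force. What carries over is only the iterated block-by-block structure with the length in the padding.

```python
import hashlib
import itertools
import struct

BLOCK = 4                # toy block size in bytes
STATE = 2                # toy chaining state in bytes (16 bits, deliberately weak)
IV = b"\x00" * STATE     # fixed starting state

def compress(state: bytes, block: bytes) -> bytes:
    # Toy compression function: mix state and block, keep only STATE bytes.
    return hashlib.sha256(state + block).digest()[:STATE]

def toy_hash(message: bytes) -> bytes:
    # Pad with 0x80, zeros up to a block boundary, then a block holding the length.
    padded = message + b"\x80"
    padded += b"\x00" * (-len(padded) % BLOCK)
    padded += struct.pack("<I", len(message))
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state

def find_collision():
    # Brute-force two different one-block prefixes that reach the same state.
    seen = {}
    for n in itertools.count():
        block = struct.pack("<I", n)
        out = compress(IV, block)
        if out in seen:
            return seen[out], block
        seen[out] = block

a, b = find_collision()
suffix = b"identical trailing data " * 10_000   # roughly 240 KB, same in both
print(a != b, len(a + suffix) == len(b + suffix))
print(toy_hash(a + suffix) == toy_hash(b + suffix))
```

Both messages have the same length and the same toy hash, however long the shared suffix is. Note that this is a collision between two messages the attacker picked, which is a different problem from matching a hash somebody else published.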
> Once you collide part of an MD5 hash, if everything that follows that collision is the same, it can be as long as you want. Colliding large files is just as easy as colliding small files.
I've read that three times, still don't follow what you're getting at. That isn't how length extension attacks work/can be utilised.
Please go ahead and generate a file that collides with any of the linked files and is the same file size. The content doesn't have to be valid or readable, junk/binary is fine. If you can do this in a reasonable period of time (e.g. 24 hrs) then your point would have been proven.
The smallest is 224K with a hash of 180caf23dd71383921e368128fb6db52.
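For anyone taking up the challenge, checking a candidate against both the published hash and the file size could look roughly like the sketch below. The helper names `md5_and_size` and `matches` are made up for this example, and the expected size has to be the exact byte count of the original file, which is not given in this thread beyond "224K".

```python
import hashlib
import os

def md5_and_size(path, chunk_size=1 << 20):
    """Return (hex MD5 digest, size in bytes) for the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large samples do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)

def matches(path, expected_md5, expected_size):
    """True only if both the MD5 digest and the exact byte size match."""
    actual_md5, actual_size = md5_and_size(path)
    return actual_md5 == expected_md5.lower() and actual_size == expected_size
```

A candidate file would then be checked with something like `matches("candidate.bin", "180caf23dd71383921e368128fb6db52", exact_size_in_bytes)`, where the file name and size are placeholders.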
I didn't use the expression "collision attack" ever in this thread. I quoted someone else who used that term however (and the context of the whole discussion is clearly related to preimage attacks, not collision attacks).
MD5 is vulnerable to both collision attacks and targeted collision attacks. We can imagine both in the WikiLeaks case. You are correct that targeted collision attacks are more difficult, but they have been demonstrated in research for many years now [0] (2006), and they are showing up in the wild as well [1] (2012).
Not at all; look up "md5coll" and "fastcoll", which were released nearly 10 years ago and could generate a pair of colliding blocks in under an hour. Testing them now on my machine (which is already a few years old), it generated them in under a second(!)
This has been used to create executables that behave differently but that's because they can inspect themselves; on the other hand I think generating two .zip files with the same hash but different (valid) contents would be rather more difficult, but it's probably still quite feasible today.
You're ignoring half of my post (on purpose?): now generate the collisions with matching file sizes. Even that "under a second" concept relies on tiny files.
As files get larger matching both the MD5 and file size becomes more costly.
That's a point I've always wondered about.
Given that most (all?) md5 collisions consist of appending or prepending data, how much more difficult would it be if you encode the size as well?
Surely the difficulty is much more. And then add the fact that it has to be semantically/syntactically similar enough to fool whatever ingests it...
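One relevant detail on encoding the size: MD5 already mixes the message length into its own padding. The message is followed by a 0x80 byte, zero bytes up to 56 mod 64, and then the bit length as a 64-bit little-endian integer. A small sketch of that padding rule follows; `md5_padding` is a name invented here, not a library function.

```python
import struct

def md5_padding(message_len: int) -> bytes:
    """Padding MD5 appends to a message of message_len bytes."""
    # 0x80 marker, then zeros so that 8 bytes remain in the final 64-byte block.
    zeros = (55 - message_len) % 64
    # Bit length of the message, little-endian, truncated to 64 bits.
    bit_length = (message_len * 8) & 0xFFFFFFFFFFFFFFFF
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", bit_length)

for n in (0, 55, 56, 224 * 1024):
    print(n, len(md5_padding(n)), (n + len(md5_padding(n))) % 64)
```

Two messages of the same length get byte-for-byte identical padding, so the built-in length field does not separate a colliding pair of equal size. Whether publishing the size separately adds meaningful difficulty depends on which kind of attack is being considered.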
They should definitely not be using MD5 for anything. Even if the people in this thread saying that finding MD5 collisions is hard were correct, and they aren't, why take the risk? The performance benefits of MD5 over competing hash functions that don't have known systemic weaknesses aren't large, and MD5 attacks will only get better.
Use SHA256, SHA-3 or MD6 (I like MD6, others may disagree. Disclaimer: I worked on proving the differential resistance of MD6).
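As a rough sketch of what publishing stronger digests next to the legacy one might look like, Python's standard `hashlib` covers SHA-256 and SHA3-256. MD6 is not in the standard library, so it is left out here. The helper name `digests` and the default algorithm list are assumptions for illustration only.

```python
import hashlib

def digests(data: bytes, algorithms=("md5", "sha256", "sha3_256")) -> dict:
    """Return {algorithm name: hex digest} for the given bytes."""
    result = {}
    for name in algorithms:
        h = hashlib.new(name)   # raises ValueError for an unknown algorithm name
        h.update(data)
        result[name] = h.hexdigest()
    return result

for name, value in digests(b"abc").items():
    print(f"{name:9s} {value}")
```

Keeping MD5 in the list lets old indicator feeds keep matching, while consumers that care about collision resistance can compare the SHA-256 or SHA3-256 value instead.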
Collision resistance is more interesting when hashes are used in cryptographic protocols and large amounts of data can be captured, seen and analyzed.
I can't think of a purpose where a collision of a non-malicious sample with a malicious file can be used by an attacker (let alone the same attacker). In addition, there is a lot of historical threat data (tactical intelligence) that is based on md5sums. Newer tools support newer checksums, but will more than likely just increase the types of checksums supported rather than deprecate MD5.
Checksums are less and less useful when the malware can be configured, recompiled and re-assembled for a particular target. There are some good discussions on HN about fuzzier detection techniques that can't be evaded by changing inert parts of the payload, but that is orthogonal to using stronger checksums. Indicator of Compromise data including md5sums can be useful for general security, but because a determined attacker will mutate the files, it is better suited to more commodity malware.
Because it would be very hard to find a collision of a file that behaves exactly like a ZipFile.
To make a collision work, you would need to inject the payload into the program, and find a specific blob to put into the zip file, that once compressed and hashed would cause a collision. This isn't computationally efficient.
I am not familiar with zip file format, but if zip files allow comments or other meta-data (in uncompressed form) then it is an easier path. I suspect even file-names could be an opportunity.
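For what it's worth, the zip format does have an archive-level comment, and Python's standard `zipfile` module exposes it as `ZipFile.comment`. A small sketch, with a made-up helper `build_zip` and made-up file names, showing that the comment sits uncompressed at the end of the archive and changes the bytes without changing what gets extracted:

```python
import io
import zipfile

def build_zip(comment: bytes) -> bytes:
    """Build an in-memory zip with one fixed member and the given archive comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp so the two archives differ only in the comment.
        info = zipfile.ZipInfo("sample.txt", date_time=(2020, 1, 1, 0, 0, 0))
        info.compress_type = zipfile.ZIP_DEFLATED
        zf.writestr(info, b"same payload in both archives\n")
        zf.comment = comment   # stored raw in the end-of-central-directory record
    return buf.getvalue()

def read_members(raw: bytes) -> dict:
    """Return {member name: extracted bytes} for a zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}

first = build_zip(b"AAAA")
second = build_zip(b"BBBB")
print(first != second, len(first) == len(second))
print(read_members(first) == read_members(second))
print(first.endswith(b"AAAA"))
```

This only shows that there is a slot for arbitrary bytes that the extractor ignores. It says nothing about how hard it is to choose those bytes so that the whole file hits a particular hash.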
It's a shame that they're spending so much money, which is basically the Mongolian people's tax money, on a surveillance tool while Mongolians are living there like it's hell. Shame on them.