If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
Sure, but theoretically you could have a system where a distributed log of user generated content is built via this CAS//MD5 primitive. A malicious actor could craft the data such that entries are dropped.
My understanding of the feature, and correct me if I'm wrong, is that you are not granted write access based on a hash. You already have write access. You can use the hash to avoid overwriting someone else's data that was appended to the file in between you checking the file and writing to it. If you already have write access, the hash is irrelevant. As a bad actor, you can corrupt the data without it.
MD5 should not be used for anything security related. Granting write access based on an MD5 hash would be a huge no-no.
Right, the issue comes when a trusted writer is logging data that is sourced from an untrusted party.
Imagine a transaction log being a blob per-customer with many lines corresponding to price, sku, etc, that additionally have some “memo” field provided by the customer. A trusted distributed worker process is responsible for taking incoming requests by the user, pulling their blob down, appending the line based on the request, and CAS’ing it back in (retrying on failure). With enough effort, a particularly devious user could issue many requests with ‘memo’s engineered to not alter the MD5 of their log. This would cause some lines to be lost. An audit of their account transaction log would be unable to accurately reflect the requests they made to the service, and the failure would be invisible.
This is obviously a bit contrived – I’ll be the first to admit. But if the incentives were to exist for this to be worth someone’s time for some system, I think it would be likely to see it come up eventually.
With Google Cloud Storage, you can solve this by conditionally writing based on the "generation number" of the object, which always increases with each new write, so you can know whether the object has been overwritten regardless of its contents. I think Azure also has an equivalent.
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.