That's a point I've always wondered about.
Given that most (all?) MD5 collisions consist of appending or prepending data, how much more difficult would it be if you encoded the size as well?
Surely it would be much harder. Add to that the fact that the colliding data has to be semantically/syntactically similar enough to fool whatever ingests it...
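
For concreteness, here is a rough sketch of what I mean by encoding the size as well, assuming the expected length is stored next to the digest. `verify_blob` is a made-up name, and this only shows the check itself; it says nothing about how much harder it actually makes finding a collision.

```python
import hashlib


def verify_blob(data: bytes, expected_md5: str, expected_size: int) -> bool:
    """Accept data only if both its length and its MD5 digest match.

    Hypothetical helper for illustration, not taken from any real tool.
    """
    # Cheap check first: reject anything whose length differs.
    if len(data) != expected_size:
        return False
    # Then compare the hex digest of the full contents.
    return hashlib.md5(data).hexdigest() == expected_md5


# Record the digest and size of the original data.
original = b"hello world"
digest = hashlib.md5(original).hexdigest()
size = len(original)

print(verify_blob(original, digest, size))         # unmodified data
print(verify_blob(original + b"!", digest, size))  # data with a byte appended
```

Anything that appends or prepends bytes fails the length check before the digest is even computed, so a forgery would need to match both values at once.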
