Change your loop to only loop while $v < 42, and then when the md5 matches, set $v to be the number of copies of "gonna" in the text. ( * ) Print the final string at the end.
If you had a good random generator then with very high probability this program would be correct, although it is unable to run in any reasonable time.
* There are exactly 42 copies of "gonna" in that text. Did God speaking through Douglas Adams subtly rickroll all of us?
Could this not also theoretically collide with a string that has the same hash and more than 42 occurrences of "gonna"? Better set the flag to `$v==42` just to be safe.
Is there a way to tell approximately how many 1943-character strings are expected to have the same md5 hash? A lot, or just a few? Or is it not possible to tell because of the properties of md5?
Sorry for all the silly questions, I'm curious, but my crypto knowledge is weak.
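For what it's worth, here is a minimal Python sketch of that loop with the stricter exact-count check. The function name `random_search`, its parameters, and the toy 4-letter alphabet are all made up for illustration. The real 1943-character search would never finish, so the demo only looks for the 5-character string "gonna".

```python
import hashlib
import random


def random_search(target_md5, length, alphabet, needle, wanted, rng, max_tries):
    """Guess random strings until one matches the MD5 and contains
    exactly `wanted` copies of `needle`. Returns None if we give up."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if hashlib.md5(candidate.encode("utf-8")).hexdigest() != target_md5:
            continue  # hash does not match, keep guessing
        # Equivalent of the `$v==42` flag: require the exact count, so a
        # colliding string with a different count is rejected.
        if candidate.count(needle) == wanted:
            return candidate
    return None


# Toy demo (assumption: tiny search space of 4^5 = 1024 strings).
toy_text = "gonna"
toy_md5 = hashlib.md5(toy_text.encode("utf-8")).hexdigest()
found = random_search(
    target_md5=toy_md5,
    length=len(toy_text),
    alphabet="agno",
    needle="gonna",
    wanted=1,
    rng=random.Random(0),
    max_tries=200_000,
)
print(found)
```

For the real text you would pass the 1943-character length, the full alphabet, and `wanted=42`, and then wait considerably longer than the age of the universe.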
It's probably not possible to calculate exactly, but we can figure out a rough estimate.
Let's set up what is valid in the string and we can go from there.
a-z | 26 characters
A-Z | 26 characters again
\x20| 1 character, space
.,' | 3 characters for punctuation
() | 2 characters
\n | 1 more character
Now these describe just a string that is very similar to the original pastebin (with a few extra characters). If we were to consider all possible 1943 byte strings it'd be fairly different.
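As a quick sanity check on the character count, the alphabet listed above can be built in a few lines of Python. The variable names are just for illustration.

```python
import string

# The character classes listed above, concatenated into one alphabet.
alphabet = (
    string.ascii_lowercase    # a-z, 26 characters
    + string.ascii_uppercase  # A-Z, 26 characters
    + " "                     # \x20, the space
    + ".,'"                   # 3 punctuation characters
    + "()"                    # 2 parentheses
    + "\n"                    # newline
)

alphabet_size = len(set(alphabet))
print(alphabet_size)
```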
So we're looking at strings drawn from 59 distinct characters. That means a 1943-character string encodes log2(59^1943) bits of information, which comes out to 11429.975..., so let's just call that 11430 bits. An md5sum contains 128 bits, and we're after the strings whose md5sum matches the one we want. We've already figured out that there are about 2^11430 possible 1943-character strings. So let's divide the number of possible strings by the number of possible md5sums to get a rough idea.
(2^11430)/(2^128) => 2^(11430-128) => 2^11302
Throwing this in a calculator since I don't want to flood HN with a gigantic number, that comes out to about 1.74 * 10^3402 possible strings that will match the MD5.
This is just an approximation and I bet I made a mistake on the math somewhere.
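For anyone who wants to check the arithmetic, here is a small Python sketch that redoes it with logarithms. It assumes, as the estimate does, that md5 spreads strings evenly over all 2^128 outputs. The variable names are made up.

```python
import math

ALPHABET_SIZE = 59  # distinct characters allowed in the string
LENGTH = 1943       # characters in the string
MD5_BITS = 128      # bits in an md5sum

# Information content of one string: log2(59^1943) = 1943 * log2(59).
bits = LENGTH * math.log2(ALPHABET_SIZE)
rounded_bits = round(bits)

# Expected number of matching strings is about 2^(bits - 128).
exponent2 = rounded_bits - MD5_BITS

# Convert 2^exponent2 into mantissa * 10^exponent10 without ever
# building the huge number itself.
log10_count = exponent2 * math.log10(2)
exponent10 = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent10)

print(f"{bits:.3f} bits, 2^{exponent2} ~ {mantissa:.2f} * 10^{exponent10}")
```

The figures agree with the ones above: roughly 11430 bits, 2^11302, and about 1.74 * 10^3402.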
The thing I find interesting is that this simple task highlights a constant problem in programming: clarification. Looking at the UTF-8 example, who wins, the fewest bytes or the fewest characters? It's always those little details that tend to cause the biggest issues.