Wrote up something in python that achieves similar, but for increasing numbers of characters. Found a few hits fairly quickly. pypy3 is pushing 420k evaluations a second on my laptop.
e.g.
"The SHA256 for this sentence begins with: five, two, and d"
"The SHA256 for this sentence begins with: seven, e, and nine"
"The SHA256 for this sentence begins with: one, five, e, and three"
My brain fart this morning was forgetting to account for the newline, so I started out spitting out bad options.
You can wrap tqdm around the "permutations(TOKENS, k)" if you want to measure progress. I haven't spent time trying to make this particularly optimised, for example that dict lookup is likely avoidable with a little bit of work via index lookup, and maybe cheaper. I've also not attempted to parallelise it, which would be fairly easy to do.
#!/usr/bin/env python
import string
import hashlib
from itertools import permutations
WORD_DIGIT = {
"one":1,
"two":2,
"three":3,
"four":4,
"five":5,
"six":6,
"seven":7,
"eight":8,
"nine":9}
TOKENS = [
"one",
"two",
"three",
"four",
"five",
"six",
"seven",
"eight",
"nine",
] + list(string.ascii_lowercase)
SEPARATOR = ", "
STARTING_TEXT = "The SHA256 for this sentence begins with: "
for k in range(2, 6):
for perm in permutations(TOKENS, k):
sha_start = ""
for char in perm:
if char in WORD_DIGIT:
sha_start += str(WORD_DIGIT[char])
else:
sha_start += char
test_string = STARTING_TEXT + SEPARATOR.join(perm[:-1]) + ", and " + ''.join(perm[-1:]) + "\n"
checksum = hashlib.new("sha256")
checksum.update(test_string.encode())
if checksum.hexdigest().startswith(sha_start):
print(test_string)
This is an interesting starting point for a python implementation, but it doesn't work correctly as far as I can tell. The list of tokens leaves out zero and includes a bunch of ascii characters that aren't valid hex. That second problem won't produce wrong results, but it'll spend a lot of time testing combinations that can't possibly appear in a SHA256 hex output. I also think the original tweet calculated without a newline.
More significantly, permutation in python is probably not the right tool for this. I believe it will not repeat token values at different positions, so for example you'll never end up testing "The SHA256 for this sentence begins with: f, f, and f", which is certainly a potentially valid result.
another pretty easy optimization is swapping to https://github.com/minio/sha256-simd, particularly if you're on a process with the sha extensions.
versus just spawning a goroutine per attempt, it'll likely be much faster as well to just split the search space into number of cores and have one chugging away on each chunk of the search space. (I have a thing to do vanity git commit hashes and that's what I do for that; for 7 characters in the hash, it takes well under a second on my CPU on average)
While we're there, since the message has a fixed prefix, you can initiate the sha256 computation with it, and then repeatedly branch off of that for the test suffix.
Oh I didn’t mean per attempt. If you have <= 16 cores, just wrap the first level loop body in a goroutine; if you have >16 cores, wrap the second level to spawn 256 goroutines. Very little code change, big speedup.
There are N possible sequences, and you try N times with a success probability of 1/N each (because it is a good hash function). This means the expected number of hits is 1.
I like little puzzles so I gave it a shot in Ruby.
#!/usr/bin/ruby
require 'digest'
charmap = %w{zero one two three four five six seven eight nine a b c d e f}
(0..0xf).map do |prefix|
Process.fork do
(0..0xffffff).each do |i|
hex = "%01x%06x" % [prefix, i] # "0123456"
digit_words = hex.chars.map { |c| charmap[c.to_i(16)] } # ["zero", "one", ...]
sentence = "The SHA256 for this sentence begins with: %s, %s, %s, %s, %s, %s and %s." % digit_words
sha = Digest::SHA256.hexdigest(sentence)
if sha[0,7] == hex
puts(sentence)
end
end
end
end
Process.waitall
8 minutes on an old quad-core. I don't expect there's much to optimize since all the heavy lifting is in the libraries. I didn't bother with synchronization so you might get a scrambled result if it somehow finds a flurry of hits. :)
> Was just verifying your tweet's hash, and then...omg!!! I couldn't believe what I realised. The SHA256 of THIS tweet starts with exactly the same 7 characters as your tweet's hash. What are the chances of that?
This means there's quite a lot of solutions. Running the entire 2^56 would take over 770 core-days if one core can achieve an impossible 1 giga-iterations per second.
> Was just verifying your tweet's hash, and then...omg!!! I couldn't believe what I realised. The SHA256 of THIS tweet starts with exactly the same 7 characters as your tweet's hash. What are the chances of that?
It's not too difficult. All you need is to generate many variations of potential words and check whether the sha256 hash matches the wanted leading characters. For example, check this text.
$ echo -n $'It\'s not too difficult. All you need is to generate many variations of potential words and check whether the sha256 hash matches the wanted leading characters. For example, check this text.' | sha256sum
182a7c9d2e99162688aaaf3f97638edd7d06f8d295e456c7bb1f16abf3a8f70c -
$ echo -n $'Was just verifying your tweet\'s hash, and then...omg!!! I couldn\'t believe what I realised. The SHA256 of THIS tweet starts with exactly the same 7 characters as your tweet\'s hash. What are the chances of that?' | sha256sum
182a7c9c08b2f0f9333bf23828c5fbf47addf74e815b6a22ca10825450bc2ee1 -
I'm confused as to how they've done this. The original message, sure, you can brute force the digits and hope you get a collision, and try a new, plausible preamble if not. But I don't see how they've found this collision without it looking like anything is brute forced.
Is it using substitute Unicode characters or something?
Indeed, this is how I did it. 2^28 is around 270 million. With little more than a handful options, it should be possible. Although I have to say, it turned out to be more difficult than I'd initially thought. Maybe it's better to think of it as 28 different boolean choices.
echo -n "Indeed. This is how I managed to do it. 2^28 is around 300 million. With only a handful options, it's possible. Although, I must say that it turned out to be more difficult than I'd initially thought. Perhaps it's better to think of it as 28 different alternatives." | sha256sum
Throwing a loop of numbers on the end seems far more feasible. I put together a hasty Python implementation which can immediately find three hits at 5 characters. TQDM is reporting ~450k tries per second, so depending on how lucky you are, would probably want to redo in a faster language to solve for 7+
import hashlib
import itertools
import tqdm
BLOCKS_SIZE = 5
sentence_prefix = "The SHA256 for this sentence begins with:"
itos = {0x00:"zero", 0x01:"one", 0x02:"two", 0x03:"three", 0x04:"four", 0x05:"five", 0x06:"six", 0x07:"seven", 0x08:"eight", 0x09:"nine",
0x0a:"a", 0x0b:"b", 0x0c:"c", 0x0d:"d", 0x0e:"e", 0x0f:"f"}
for nums in tqdm.tqdm(itertools.product(itos.keys(), repeat=BLOCKS_SIZE)):
sentence = f"{sentence_prefix} {', '.join(itos[num] for num in nums[:-1])}, and {itos[nums[-1]]}."
hash_true = hashlib.sha256(bytes(sentence, "utf8")).hexdigest()
guessed_prefix = "".join(f"{n:x}" for n in nums)
true_prefix = hash_true[:BLOCKS_SIZE]
if guessed_prefix == true_prefix:
print("collision")
print(sentence)
print(hash_true)
All we are demonstrating here is why Sha256 is 256 bits and not 32 bits. We have trivially identified collisions for the first 28 bits of the output, which is only 11% of the entire hash size.
Difficulty of collisions roughly doubles for each additional bit. Imagine we had a SHA32, that would be 16 times harder to achieve a collision. SHA256 is 43 with 67 zeroes behind it more difficult than the examples here.
7 digits of the hash makes for 16*7 possible hashes. I spot 4 potential "filler" lines in that tweet, so if you find log4(16*7)=14 candidates for each of those filler lines, then one combination would be expected to yield that hash.
Basically I had several substitutions around words, case, punctuation, etc. and just ran it until it found some hits. Quite easy with just four characters though but was only a proof of concept.
You can generate sentences of the form "This sentence begins with: " followed by seven comma-separated english words denoting hex digits. Then search that space of digits, until you get a hit.
For each digit combination, you can try it with multiple variations of the sentence like "The SHA256 of this sentence begins with", "The SHA256 hash of this text starts with" and many more. That increases the search space without increasing the number of digits that have to match, making it more likely that a hit is found.
If we don't vary the sentence ("begins with, starts with, is prefixed with" etc...) and only consider the final part, then the number of english sentences is (unsurprisingly) equal to the number of possible hashes.
So for each distinct sentence, we pick randomly from the n different hashes, and we attempt to brute force this n times. So the probability of not getting this is:
(1-1/n)^n
As n increases, this will yield e^-1, so we have about a 36.7% chance of this not happening for any given length. So this has a probability of happening of 63.3%.
So there is a decent chance that there exists a sentence "The sha256 of this sentence is ..." for even the full sha256. If you are allowed to modify the sentence to be something like "'Begins with', 'starts with', 'OMG guys, check this out':" you can get this up to almost 1. Finding it would be mildly hard though barring some novel discovery about sha256.
$ echo -n 'The SHA256 for this sentence begins with seven, seven, f, zero, a, b, b and five.' | sha256sum
77f0abb54cd09ad7b654bd5e762d7be58e7daffd1a0da6a56f5135bd667856a3 -
I saw this on HN years ago from the original that linked here.
2-3 days ago posted on 4chan's /g/ board that I remembered this sentence but didn't remember its source and the hash it contained.
Someone on /g/ found the post on HN and a large thread ensued that expired and auto-deleted 12 hours ago or so. And then this person posted it on Twitter and it got shared here.
Reposts do add value though. Think of it this way: there’s a pool of content you haven’t seen before and a subset of it which you’d enjoy seeing. If no one reposts anything ever, you won’t be able to see any of it.
In fact, some of the things you have seen before you would probably not mind seeing again. Maybe you didn’t think to save it at the time or maybe your circumstances have changed.
Anyways, I wouldn’t confuse low effort with low value.
So, if my logic is right (is it? That seems awfully cheap to me), this is about 2^22 times as easy as mining a bitcoin. Bitcoin is at about $25k, so there are people who can find hits like this one for way less than a cent.
Bitcoin difficulty is not measured in bits but its own unit (some hashrate was defined as difficulty 1, and difficulty 10 means the hash rate is 10x that). The number you see on that website is also in trillions, so the current difficulty is about 50 trillion. Also, you get more than one Bitcoin per hash (currently 6.25, halving every few years).
To mine six BTC, you currently need almost 80 bits of zeroes, so 28 bits is basically trivial.
By the pigeonhole principle, there is a sentence that writes out its entire SHA256 representation this way. Alternatively, the map from these kinds of sentences with 256 terms to 2^256 given by SHA256 admits a fixed point.
The pigeonhole principle does not say that. It can be used to show that there are two different sentences with the same hash as each other (among any collection of 2^256 + 1 sentences), but it tells you nothing about hashes that agree with the content of the sentence. The probability that a random hash function on a collection of 2^256 sentences has a fixed point is about 1 - 1/e, and it approaches 1 as you add more variations to grow the collection infinitely. But SHA-256 isn’t actually random, so the only way to know this for sure would be to find an example.
I don't believe this is necessarily true. Unless I'm misunderstand you, each of the possible variants of spelling out 32 hexadecimal characters could theoretically SHA-256 into the spelled-out hash + 1 (looping around at ff…ff).
I don't see how pigeonhole principle applies to that situation. It could well be that "zero" hashes to 1, "one" hashes to 2... and "f" hashes to 0, extended out to the hash's length.
Well, I got bored too, so here's some results with CUDA (on an RTX 3090):
The SHA256 hash of this message begins with 534d765
The SHA256 hash of this message begins with c18b2de
The SHA256 hash of this message begins with 7fe17da2
The SHA256 hash of this message begins with a7fdc855d
The SHA256 hash of this message begins with 46eae34f1
My unoptimized implementation takes about 40s for 9 digits, so it will probably take about 10 minutes for 10 digits, about 3 hours for 11 digits, etc. It shouldn't be hard to use words instead of hex digits, if that's desired.
Generating sequential Git short commits! It's so pleasant looking I'm tempted to try using it one of my projects, but deep down I know it's not worth the hassle.
The code searches permutations of increasing length. Benchmark is 8 seconds on a MacBook Pro M1 to discover the match of 182a7c9. The code is not yet optimized.
use std::time::SystemTime;
use itertools::Itertools;
use sha256::digest;
fn main() {
let digits: [usize; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
let chars = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"];
let words = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "a", "b", "c", "d", "e", "f"];
let mut length = 2;
let start = SystemTime::now();
loop {
for permutation in digits.iter().permutations(length) {
let parts = permutation.iter().map(|x| words[**x]).collect_vec();
let sentence = format!(
"The SHA256 for this sentence begins with: {} and {}.",
&parts[0..(parts.len() - 1)].join(", "),
&parts[parts.len() - 1]
);
let checksum: String = digest(&sentence);
let starts: String = permutation.iter().map(|x| chars[**x]).collect();
if checksum.starts_with(&starts) {
println!("milliseconds: {:?}, {} ", start.elapsed().unwrap().as_millis(), &sentence);
}
};
length += 1;
}
}
Output:
milliseconds: 3, The SHA256 for this sentence begins with: zero, b, six and two.
milliseconds: 54, The SHA256 for this sentence begins with: zero, e, d, eight and f.
milliseconds: 8279, The SHA256 for this sentence begins with: one, eight, two, a, seven, c and nine.
Something that I'm not seeing mentioned in these comments (I may just have missed it) is that you can precompute the hash of the static part of the string and then extend it with the numbers in a loop, saving some cycles. This is because the full hex representation of a SHA hash gives you the entire internal state of the algorithm. This can lead to security vulnerabilities:
I did something similar once (with a bit of a twist) for my bio when I was a TA in college:
"Hi! I’m a senior studying CS. My hobbies include making semantic paradoxes and my bio includes eight a’s, seventeen e’s, fourteen i’s, eight o’s, six u’s, and one wrong number"
$ echo -n "The SHA256 for this sentence begins with: one, eight, two, a, seven, c and nine." | sha256sum
182a7c930b0e5227ff8d24b5f4500ff2fa3ee1a57bd35e52d98c6e24c2749ae0 -
With some clever information hiding, you can sort of mask the brute force:
On 2015/7/6 at 00:00, I wrote this message's hash down. It started with 1337.
Thought #335552803: I love beef so much I made sure the sha256 of this message started with 0xbeef!
A longer example:
$ cat predestination.txt
(1991/4/7 at 00:16) <siraben> I know the hash of this message before you do!
(1991/4/7 at 00:17) <larsivi> Huh. What is it?
(1991/4/7 at 00:18) <siraben> It starts with 1337.
$ sha256sum predestination.txt
1337a4654163ab48761bf8acc75a407179e700f67f97d754b08c7afeac7da70a predestination.txt
Heh at first i thought 'wait how did they get all 512bits in a collision!?'
Then i realized the program just calculates it's own hash and prints it. Simple as that. It'd be super interesting if the program had the hash as a string internally and printed that out but that's (within the realm of breaking cryptography) impossible to pull off.
After picking some scheme to generate messages that predict n bits of their hash in some form, the chance of finding at least one message that correctly predicts its hash is 1 - (1 - 1/x)^x where x is 2^n. This approaches 1 - 1/e = 63.2 % for large n. So by trying a few different schemes, for example slightly varying the prefix »The SHA256 for this sentence begins with:«, it becomes quickly very likely to succeed. In the limit of large n, for example only 5 different schemes will yield at least one correct message with 99 %.
With 5 patterns, 24 bits and SHA-1.
The SHA-1 hash of this text starts with 051E35.
The hash of this text starts with A943BD.
The SHA-1 hash of this text starts with B6640C.
The SHA-1 hash of this message starts with C3B03D.
The SHA-1 hash of this message starts with D93717.
Or the same as before but as patterns just padding the 6 hex digits with zero zero to eight dots.
Well, this nerd sniped me... I made a quick Go tool for this. You simply write a string {like|this}, and then it finds a hash with the same prefix automatically. (Hash this comment, it's also 182a7c9)
$ echo -n "Well, this nerd sniped me... I made a quick Go tool for this. You simply write a string {like|this}, and then it finds a hash with the same prefix automatically." |sha256sum
9e9f6a7fba355220b4c999b29bd12a3f65ec40e7b03c7bfaf3f34a1524a232b6
$ echo -n "Well, this nerd sniped me... I made a quick Go tool for this. You simply write a string {like|this}, and then it finds a hash with the same prefix automatically. (Hash this comment, it's also 182a7c9)"|sha256sum
182a7c9ba0f815f77542774dc5ad9ea34975bf3747635ed0768748bcd772ef43 -
I bet you could hack hashcat (or just use CUDA) to find these quickly for larger prefixes. Might make a fun (silly) challenge. Hashcat can do 21B SHA256 hashes per second on an RTX 4090; this translates into bruteforcing a 9-digit prefix in 4 seconds, a 10-digit prefix in 52 seconds, or a 11-digit prefix in about 14 minutes.
28 bits of entropy? That's definitely within brute-force regions with CPU. Possibly even single-threaded CPU honestly.
-------
GPU-level brute force is pushing ~40-bits. Assuming ~1 week of solid compute, a 6 TFlop computer has ~3.7 sextillion compute cycles. Or alternatively, that's ~3 billion clock cycles (single thread) per check to reach 2^40 bits of entropy. More than easily doable for most simple brute-force computational problems IMO.
If you go multi-GPU on top of that, you'll have even more computational work available.
Beneath the veil of binary territories, in a subtle symphony of stillness, sevenfold zeros rise and conjure a calm masterpiece. As if celestial beacons in an ethereal night sky, they materialize in the form of an arcanum with heptadic serenity, with divine grace.
I usually look at the beginning and end characters. Is this any safer? Does the difficulty scale the same way with added bits even if they aren’t adjacent?
It's an inevitable consequence of a good hash function. Every string you try is going to give you a different hash, so you just need to try enough strings.
But each time you add another digit to the prefix you're looking for, the difficulty of the match goes up by about 16 times.
If you're only looking for a few digits in the hash, then any brute force that is capable of changing the string every time you attempt the match will always find it. If you're looking for seven digits in the hash only, then it'll take on average 134217727 attempts. Which isn't really all that many on modern hardware.
In some places my tooling only shows 7 digits of the git SHA. I wonder how hard it would be to write something to tweak my commits until those are all the same. And I wonder how long it would take until someone noticed...
It just makes it seem like sha256 can be reversed (which would be a major security issue) but it's an illusion as the sentence only predicts the first 7 digits of its own hash which isn't difficult to guess through brute force.
If you keep randomly changing letters or words in a sentence, you can eventually make sha256 spit out a hex string whose start matches any 7 digit hex sequence you want.
If the password was in hex format yes. Hex has 16 alphanumeric characters. The English alphabet has 36 alphanumeric characters so that's a lot harder to break at as 36 ^ 7 is much larger than 16 ^ 7.
Would you believe in God if sha256 of "I am God. My name is BOB " would be 4920616D20476F642E204D79206E616D6520697320424F4220202020202020 (the text in hex)?
That would be finding a invariant point (fixed point) in sha256 what would also be a text message. There is no reason to expect that kind of thing existing in any algorithm.
SHA256 is a one-way hashing function. Which means that feeding that sentence into SHA-256 and it starting with the characters in that sentence is very unlikely chance.
They probably wrote a script that tacked on a bunch of random letters and numbers to a sentence, then hashed it via brute force until it returned a hash that started with the exact same thing that the sentence said.
This is similar to how Bitcoin's proof of work algorithm.
It's self-referential and that makes it non-obvious how it's accomplished. Changing the sentence changes the hash, which also changes the sentence. So in any case it's an interesting construction.
This is cool. Isn't finding these the same as minting a new block of bitcoin? So definitially hard using our best understanding of algorithms, math, etc.
bitcoin is relatively easy. you are just hashing/permuting the block header until you get the lowest hash of all miners. it isn't hard, just expensive.
It's way above my head mathematically as to if this is even possible, but it is hilarious how screwed so many things would be if sha256 was discovered to have a means to more quickly reverse at least a partial hash. Just off the top of my head:
- SSL
- Bitcoin (bonus: unlimited money hack if you can keep the discovery under wraps)
- Signed updates for devices
Goodness only knows what I am missing, but that first one along is enough to cause an unmitigated disaster.
I assume these tweets are effectively brute forced given the fairly short prefix though and we're all safe
Is brute forcing a hash not "more quickly reversing at least a partial hash"?
What do you have in mind for "more quickly", then?
Also even if you figured out a way to make sha256 a few orders of magnitude faster, that would not affect SSL or signing and bitcoin would adjust as soon as several people know the secret.
No. The pigeonhole principle and the birthday attack both apply to situations where you’re looking for two inputs with the same hash as each other, not where you’re looking for one input that describes its own hash.
(Easiest optimization is wrapping the loop body in a goroutine. GOEXPERIMENT=loopvar really makes this nicer btw.)
The reply is obviously more interesting, need to come up with a lot of variations.