> We assume that the attacker knows the content displayed on the attacked monitor, except
for the textual letters. This assumption holds in many cases, for example, when the victim is
filling a form on a known website. We also assume that the font is monospace and sufficiently
large. The requisite size of the text depends on the granularity of the leak, which changes
among different monitors. Another assumption is, again, that the screen is in portrait layout.
They go on to state that in their proof of concept they displayed 3-6 letter words in black against a white background, recorded 5 seconds of audio per word, and rendered the letters 175 px wide in a monospace font.
Note that they do mention they expect that any background could work, as long as it is fixed and you can train your model on that background.
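To make the "train your model on that background" step concrete, here is a very loose sketch of what such a classifier pipeline could look like. Everything here is hypothetical: the synthetic tones stand in for the monitor's actual acoustic emissions, the letter-to-frequency mapping is invented for illustration, and the nearest-centroid classifier is just the simplest possible stand-in for whatever model the authors actually trained.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz; hypothetical recording rate
DURATION = 1.0      # seconds per clip (the paper recorded 5 s per word)

def features(signal):
    """Magnitude spectrum of the clip, used as the feature vector."""
    return np.abs(np.fft.rfft(signal))

def synth_clip(freq, rng):
    """Stand-in for a real recording: a tone at `freq` plus noise.

    In the actual attack the signal would be the monitor's coil-whine
    modulation against the fixed background, not a clean tone."""
    t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq * t) + 0.5 * rng.standard_normal(t.size)

# Invented mapping: each displayed letter perturbs the emission at a
# distinct frequency. Real leakage would be far messier than this.
LETTER_FREQS = {"a": 440.0, "b": 880.0, "c": 1320.0}

rng = np.random.default_rng(0)

# "Training": average the feature vectors per letter, all recorded
# against the same fixed background.
centroids = {
    letter: np.mean([features(synth_clip(f, rng)) for _ in range(10)], axis=0)
    for letter, f in LETTER_FREQS.items()
}

def classify(clip):
    """Nearest-centroid guess at which letter produced the clip."""
    feats = features(clip)
    return min(centroids, key=lambda k: np.linalg.norm(feats - centroids[k]))

print(classify(synth_clip(880.0, rng)))
```

The key point the authors make survives even in this toy version: the model is only as good as the background it was trained on. Change the background and the learned centroids no longer match anything, which is exactly why the fixed-background assumption matters.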
My point being, this is far from a practical way to read what is on screen.
I think it's better to treat it as a proof-of-concept attack. It's more a way of saying: hey, this kind of leak is possible, and maybe with better equipment and more research it becomes practical.