I think you are misunderstanding. I don't think the network matches the audio to a ground truth image and then generates an image. It just takes in audio and predicts an image; the ground truth images are only used to train the model and to evaluate it.
The generated images are only vaguely similar in detail to the originals, but the fact that they can estimate the macro structure from audio alone is surprising. I wonder if there's some kind of leakage between the training and test data, e.g. sampling frames from the same videos, because the idea you could get time of day right (dusk in a city) just from audio seems improbable.
EDIT: also minor correction, it's not an LLM it's a diffusion model.
EDIT2: my mistake, there is an LLM too!
I’ve heard clips of hot water being poured vs cold water, and if you heard the examples, you would probably guess right too.
Time of day seems almost easy. Are there animal noises? Those won't sound the same all day. And traffic too. Even things like the sound of wind may generally be different in the morning vs at night.
This is not to suggest the researchers are not leaking data, or that the examples were not cherry picked; it seems probable they are doing one or the other. But it is to say that if a model were trained on a particular intersection and then heard a sample from it, it could probably predict time of day reasonably well.
Weather patterns have a daily/seasonal rhythm… the strength and direction of the wind will have some distribution that is different at different times of the day. Temperature and humidity as well, like the other poster said.
It certainly looks like some amount of image matching is going on. Can the model really hear the white/green sign to the left in the first example in figure 3? Can it hear the green sign to the right and red things to the left in the last example?
Yeah, I also saw that sign and thought - 'yeah, this is bullshit.'
It's got exactly the same placement in the frame - which would require some next-level beamforming capability - and it also has the same color, which is impossible. There's some serious data leakage going on here.
[edit] The bottom right image is even more suspect. There's a vertical green sign in the same place on the right side of the image, but also some curious red striping in the distance in both images. One could argue 'street signs are green' but the red striping seems pretty unique, and not something where one would just guess the right color.
That would be explained by data leakage too, e.g. sampling frames in the train and test data from the same video sequences. There's nothing in the writeup that suggests the model is explicitly matching audio to ground truth images.
The researchers' suggestion that certain architectural features might be encoded in the sound [which is at least superficially plausible] is rather undermined by the data leakage also leading the model to generate the right colour signage in the right part of multiple images. The fidelity of the sound clearly isn't enough for the model to register key aspects of the sign's geometry, like it being only a few feet from the observer, but it has somehow managed to pick up that it's green and x pixels from the left of the image...
I don't know if data leakage is the right word - maybe overfitting, if they took a 1-hour clip from the same place and used 90 percent for training and 10 percent for eval/test?
It is still a decent way to start, I think, but after that it needs more varied data, with different geographical locations held out for eval and test.
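For what it's worth, the concrete fix for that kind of leakage is to split by recording (or by location) rather than by frame. A minimal sketch using scikit-learn's GroupShuffleSplit, assuming a hypothetical list of frames keyed by a video ID (none of these names come from the paper):

    # Minimal sketch of a group-aware train/test split (names hypothetical).
    # All frames sharing a video_id land on the same side of the split, so
    # the model can't memorize a scene during training and then be evaluated
    # on a near-duplicate frame of that same scene.
    from sklearn.model_selection import GroupShuffleSplit

    frames = [
        {"video_id": "intersection_A", "audio": "a0.wav", "image": "i0.png"},
        {"video_id": "intersection_A", "audio": "a1.wav", "image": "i1.png"},
        {"video_id": "park_B",         "audio": "a2.wav", "image": "i2.png"},
        # ... one entry per sampled frame
    ]
    groups = [f["video_id"] for f in frames]

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
    train_idx, test_idx = next(splitter.split(frames, groups=groups))
    train = [frames[i] for i in train_idx]
    test = [frames[i] for i in test_idx]

Grouping by geographical region instead of video_id would be the stricter version of the same idea.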
In response to the correction: the paper says that "we propose a Soundscape-to-Image Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs)", so presumably there's an LLM involved somewhere?