I think you are misunderstanding. I don't think the network matches the audio to a ground truth image and then generates an image. It just takes in audio and predicts an image; the ground truth images are only used to train the model and to evaluate it.
The generated images are only vaguely similar in detail to the originals, but the fact that they can estimate the macro structure from audio alone is surprising. I wonder if there's some kind of leakage between the training and test data, e.g. sampling frames from the same videos, because the idea you could get time of day right (dusk in a city) just from audio seems improbable.
EDIT: also minor correction, it's not an LLM it's a diffusion model.
EDIT2: my mistake, there is an LLM too!
I’ve heard clips of hot water being poured vs cold water, and if you heard the examples, you would probably guess right too.
Time of day seems almost easy. Are there animal noises? Those won't sound the same all day. And traffic too. Even things like the sound of wind may generally be different in the morning vs at night.
This is not to suggest the researchers are not leaking data, or that the examples were not cherry picked; it seems probable they are doing one or the other. But it is to say that if a model were trained on a particular intersection and then heard a sample from it, it could probably predict time of day reasonably well.
Weather patterns have a daily/seasonal rhythm… the strength and direction of the wind will have some distribution that is different at different times of the day. Temperature and humidity as well, like the other poster said.
It certainly looks like some amount of image matching is going on. Can the model really hear the white/green sign to the left in the first example in figure 3? Can it hear the green sign to the right and red things to the left in the last example?
Yeah, I also saw that sign and thought - 'yeah, this is bullshit.'
It's got exactly the same placement in the frame - which would require some next-level beamforming capability - and it also has the same color, which is impossible. There's some serious data leakage going on here.
[edit] The bottom right image is even more suspect. There's a vertical green sign in the same place on the right side of the image, but also some curious red striping in the distance in both images. One could argue 'street signs are green' but the red striping seems pretty unique, and not something where one would just guess the right color.
That would be explained by data leakage too, e.g. sampling frames in the train and test data from the same video sequences. There's nothing in the writeup that suggests the model is explicitly matching audio to ground truth images.
The researchers' suggestion that certain architectural features might be encoded in the sound [which is at least superficially plausible] is rather undermined by the data leakage also leading the model to generate the right colour signage in the right part of multiple images. The fidelity of the sound clearly isn't enough for the model to register key aspects of the sign's geometry, like it being only a few feet from the observer, but it has somehow managed to pick up that it's green and x pixels from the left of the image...
I don't know if data leakage is the right word - maybe overfitting, if they took a 1-hour clip from the same place and used 90 percent for training and 10 percent for eval/test?
It is still a decent way to start, I think, but after that it needs more varied data, with different geographical locations held out for eval and test.
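For what it's worth, the concrete fix for that kind of leakage is to split by recording (or by location) rather than by frame. A minimal sketch using scikit-learn's GroupShuffleSplit, assuming a hypothetical list of frames keyed by a video ID (none of these names come from the paper):

    # Minimal sketch of a group-aware train/test split (names hypothetical).
    # All frames sharing a video_id land on the same side of the split, so
    # the model can't memorize a scene during training and then be evaluated
    # on a near-duplicate frame of that same scene.
    from sklearn.model_selection import GroupShuffleSplit

    frames = [
        {"video_id": "intersection_A", "audio": "a0.wav", "image": "i0.png"},
        {"video_id": "intersection_A", "audio": "a1.wav", "image": "i1.png"},
        {"video_id": "park_B",         "audio": "a2.wav", "image": "i2.png"},
        # ... one entry per sampled frame
    ]
    groups = [f["video_id"] for f in frames]

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
    train_idx, test_idx = next(splitter.split(frames, groups=groups))
    train = [frames[i] for i in train_idx]
    test = [frames[i] for i in test_idx]

Grouping by geographical region instead of video_id would be the stricter version of the same idea.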
In response to the correction: the paper says that "we propose a Soundscape-to-Image Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs)", so presumably there's an LLM involved somewhere?