Yes, I meant running a 2D CNN over the generated spectrograms. We do something similar to classify certain RF emissions, with good success. As for scoring/classification, you can start by having the CNN output a single label: whether the song is catchy or not.
why run a CNN over a spectrogram? besides what I said (convolutions in frequency domain are multiplications in time domain) an FT is linear. if classifying using those features were effective then your CNN would've learned the DFT matrix weights from the original signal.
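To make the "FT is linear" point concrete, here is a minimal pure-Python sketch. The helper names `dft_matrix` and `matvec` are made up for illustration and are not from any library. It builds the DFT matrix explicitly and checks that transforming a weighted sum of two signals gives the same weighted sum of their transforms, which is why a single dense linear layer could in principle represent it.

```python
import cmath

def dft_matrix(n):
    """Return the n x n DFT matrix W with W[k][m] = exp(-2*pi*j*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def matvec(W, x):
    """Plain matrix-vector product; the DFT is nothing more than this."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = dft_matrix(8)
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
y = [0.0, 1.0, 1.0, 0.0, -2.0, 0.5, 0.0, 1.0]

# Linearity check: DFT(2x + 3y) should equal 2*DFT(x) + 3*DFT(y).
lhs = matvec(W, [2 * a + 3 * b for a, b in zip(x, y)])
rhs = [2 * a + 3 * b for a, b in zip(matvec(W, x), matvec(W, y))]
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))
print(max_err < 1e-9)
```

Note the matrix has one weight per (input sample, output bin) pair, so it grows quickly with signal length.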
A spectrogram has time on one axis and frequency on the other, so the ultimate result is a multiplication in one dimension and a convolution in the other. It can be used to show things like when a note starts and stops in a piece of music, which is difficult in either purely-time or purely-frequency space.
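A rough sketch of the note start/stop idea, in pure Python and with made-up numbers: a toy signal that is silent, then plays a single tone, then goes silent again. The `spectrogram` function here is a hypothetical helper that takes a naive DFT of non-overlapping frames with no window function, which is a simplification of what a real spectrogram routine does.

```python
import cmath
import math

def spectrogram(signal, frame_len, hop):
    """Short-time DFT magnitudes: one row per time frame, one column per frequency bin."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / frame_len)
                      for m in range(frame_len))
            row.append(abs(acc))
        frames.append(row)
    return frames

# Toy signal (assumed values): the "note" sits exactly on bin 4 of a 32-sample frame
# and is switched on only for samples 96..191.
frame_len, hop = 32, 32
n_samples = 256
note_on, note_off = 96, 192
signal = [math.sin(2 * math.pi * 4 * t / frame_len) if note_on <= t < note_off else 0.0
          for t in range(n_samples)]

spec = spectrogram(signal, frame_len, hop)

# Frames in which bin 4 carries energy, i.e. where the note is sounding.
active = [i for i, row in enumerate(spec) if row[4] > 1.0]
print(active)  # [3, 4, 5]
```

The full-signal spectrum would show the energy at bin 4 but not when it occurs; the raw samples show when but not which frequency. The frame-by-bin grid shows both.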
Also, it’s computationally intractable to individually train 2^N weights. What a CNN does instead is train a convolution kernel which is passed over the whole domain to produce the input for the next layer; by operating in frequency space, it’s considering the basis functions e^{j omega +- epsilon} instead of delta(x +- epsilon)
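As an illustration of the weight-sharing part only (it says nothing about the exact weight count quoted above), here is a small pure-Python sketch. `conv2d_valid` is a hypothetical helper, and the input and kernel values are invented. One tiny kernel is slid over the whole time-frequency grid, so the number of trainable values stays fixed while a fully connected layer with the same input and output sizes would need one weight per input/output pair.

```python
def conv2d_valid(image, kernel):
    """Slide one small kernel over the whole 2D input (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Made-up 4x5 spectrogram patch: rows are time frames, columns are frequency bins.
image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

# 2x1 kernel that fires where energy in a bin vanishes between consecutive frames.
kernel = [[1], [-1]]
out = conv2d_valid(image, kernel)

conv_params = len(kernel) * len(kernel[0])    # size of the shared kernel
n_in = len(image) * len(image[0])
n_out = len(out) * len(out[0])
dense_params = n_in * n_out                   # fully connected layer of the same shape
print(conv_params, dense_params)  # 2 300
```

The only nonzero response is at `out[1][2]`, the frame boundary where the tone in bin 2 stops.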
my mistake i didn't realize spectrogram and spectrum were distinct objects.
>Also, it’s computationally intractable to individually train 2^N weights.
that's a good point - i'd forgotten for a moment (because i'm so used to cooley-tukey fft) that in principle getting the spectrum involves a matmul against the entire vector. which brings up a potentially interesting question: can you get a DNN to simulate the cooley-tukey fft (stride permutations and all)?
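For reference, a sketch of what such a network would have to represent, written in pure Python with hypothetical helper names (`bit_reverse_permute`, `fft_layers`, `naive_dft`). It is the textbook iterative radix-2 decimation-in-time FFT: one fixed permutation followed by log2(n) butterfly stages, where every output of a stage depends on only two inputs. Each stage is therefore a sparse linear layer with no nonlinearity. Whether training would actually find these weights is a separate question this sketch does not address.

```python
import cmath

def bit_reverse_permute(x):
    """Move element i to the bit-reversed index of i (len(x) must be a power of 2)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = x[i]
    return out

def fft_layers(x):
    """Radix-2 FFT: a fixed input permutation, then log2(n) sparse butterfly stages."""
    n = len(x)
    a = [complex(v) for v in bit_reverse_permute(x)]
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)   # twiddle factor increment for this stage
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t              # each output mixes exactly two inputs
                a[start + k + half] = u - t
                w *= step
        size *= 2
    return a

def naive_dft(x):
    """Dense matmul against the full DFT matrix, used as the reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
err = max(abs(a - b) for a, b in zip(fft_layers(x), naive_dft(x)))
print(err < 1e-9)
```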