You wouldn't use a CNN for this because there's no spatial correlation to take advantage of. Maybe an RNN somehow, but the basic problem is scoring (i.e. how melodic a song is), not classification, so honestly I don't know how you'd do this using neural networks.
On the contrary: with a more generous reading, the previous comment holds some merit.
1. CNNs are fairly commonly used for sequence tasks nowadays. Convolutions can be 1D, after all.
2. It's also possible the previous comment was referring to using 2D convolutions on the spectrogram of the audio, which is a common approach.
3. Neural networks are capable of more than classification. Scoring is a regression task, which is a common application of neural networks.
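To make points 1 and 3 concrete, here's a minimal sketch (numpy only, random weights, toy shapes — not a trained model) of a 1D-convolutional regressor that maps a sequence to a single score:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1D convolution (cross-correlation, as CNN layers compute it)."""
    n, k = len(x), len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(n - k + 1)])

def score_sequence(seq, kernel, w):
    """Tiny 1D-CNN regressor: conv -> ReLU -> global average pool -> linear score."""
    h = np.maximum(conv1d(seq, kernel), 0.0)  # ReLU activation
    pooled = h.mean()                          # global average pooling
    return w * pooled                          # scalar regression output

rng = np.random.default_rng(0)
seq = rng.standard_normal(64)      # stand-in for an encoded note sequence
kernel = rng.standard_normal(5)    # a learned filter (random here)
print(score_sequence(seq, kernel, w=0.5))
```

The point is just that a convolution slides over one axis (time) and the head outputs a real number rather than a class label.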
I have some questions (mainly to improve my own understanding):
1. Since the data is MIDI-encoded, would a convolution hold any merit here? I suppose you could render it to an mp3 and analyze the audio itself, but that seems very computationally expensive and prone to overfitting.
2. If we're training a scoring model, we would need labeled data, but getting those labels seems very challenging, not least because of how subjective our impressions of melodies can be (for instance, the opinions of a fan of atonality would differ drastically from those of a fan of pop). Do you have any ideas on how to mitigate this?
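For what it's worth, on the MIDI question: one way to feed MIDI to a 2D convolution without rendering audio is a piano-roll grid (pitch on one axis, time steps on the other). A sketch, with hypothetical `(pitch, start, duration)` note tuples:

```python
import numpy as np

def piano_roll(notes, n_pitches=128, n_steps=32):
    """Convert (pitch, start_step, duration) note tuples into a binary
    pitch-by-time grid that a 2D convolution can slide over."""
    roll = np.zeros((n_pitches, n_steps))
    for pitch, start, dur in notes:
        roll[pitch, start:start + dur] = 1.0
    return roll

# hypothetical four-note motif: (MIDI pitch, start step, duration in steps)
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 4), (72, 12, 8)]
roll = piano_roll(notes)
print(roll.shape)  # (128, 32)
```

On that grid, local pitch/time structure (intervals, rhythmic patterns) is spatially adjacent, so a 2D kernel has something meaningful to correlate.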
do people really run 2d convolutions on spectrograms? that seems rather backwards to me - why convolve in frequency space when you can just multiply in the time domain?
re regression tasks: sure, I guess that's basically just an embedding, right?
Yes, I meant to run a 2D CNN over generated spectrograms. We do something similar to classify some specific emissions in RF, with good success. As for scoring vs. classification, you can start by having an output of the CNN say whether the song is catchy or not.
why run a CNN over a spectrogram? besides what I said (convolutions in the frequency domain are multiplications in the time domain), an FT is linear. if classifying using those features were effective, then your CNN would've learned the DFT matrix weights from the original signal.
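the duality being invoked here is easy to sanity-check numerically: circular convolution in the time domain equals pointwise multiplication of the DFTs (a toy numpy check, N = 16):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
h = rng.standard_normal(16)

# circular convolution computed directly in the time domain
direct = np.array([sum(x[m] * h[(n - m) % 16] for m in range(16))
                   for n in range(16)])

# the same thing via the convolution theorem: multiply the DFTs, then invert
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(direct, via_fft))  # True
```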
A spectrogram has time on one axis and frequency on the other, so the ultimate result is a multiplication in one dimension and a convolution in the other. It can be used to show things like when a note starts and stops in a piece of music, which is difficult in either purely-time or purely-frequency space.
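A rough numpy sketch of that time/frequency grid — windowed FFTs stacked along time (window and hop sizes are arbitrary choices here):

```python
import numpy as np

def spectrogram(x, win=64, hop=32):
    """Magnitude STFT: one FFT per hop-spaced Hann window.
    Rows = frequency bins, columns = time frames."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape (win//2 + 1, n_frames)

# a "note" that starts halfway through: silence, then a sinusoid
t = np.arange(1024)
x = np.concatenate([np.zeros(512), np.sin(2 * np.pi * 0.1 * t[:512])])
S = spectrogram(x)
print(S.shape)
```

The onset shows up as energy appearing partway along the time axis, which is exactly the kind of event that's hard to see in a pure spectrum.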
Also, it’s computationally intractable to individually train N^2 weights. What a CNN does instead is train a convolution kernel which is passed over the whole domain to produce the input for the next layer; by operating in frequency space, it’s considering the basis functions e^{j(ω ± ε)t} instead of δ(x ± ε).
my mistake, i didn't realize spectrogram and spectrum were distinct objects.
>Also, it’s computationally intractable to individually train N^2 weights.
that's a good point - i'd forgotten for a moment (because i'm so used to the cooley-tukey fft) that in principle getting the spectrum involves a matmul against the entire vector. which brings up a potentially interesting question: can you get a DNN to simulate the cooley-tukey fft (stride permutations and all)?
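for concreteness, here's a recursive radix-2 sketch (numpy, toy size): each level is the even/odd stride permutation followed by butterflies, i.e. a fixed sparse linear map, which is why a log(N)-depth linear network could in principle represent the same computation:

```python
import numpy as np

def fft_ct(x):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_ct(x[0::2])  # stride permutation: even-indexed samples
    odd = fft_ct(x[1::2])   # stride permutation: odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd  # twiddle factors
    return np.concatenate([even + tw, even - tw])           # butterflies

x = np.random.default_rng(2).standard_normal(8)
print(np.allclose(fft_ct(x), np.fft.fft(x)))  # True
```

whether gradient descent would actually *find* those sparse weights from data is a separate (and harder) question than whether the architecture can express them.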