"Good idea" is a tough subjective question to answer. Could you do better than F...

"Good idea" is a tough subjective question to answer. Could you do better than FFT/DCT/frequency-based transform? Maybe/possibly given that music tends to be quite "harmonic." Taking a step back, when trying to classify, the best features are ones that separate your classes out better. Certain representations / changes thereof can lead to better separations depending on the the properties of your input source.

The cepstrum is good at providing an energy-compact/sparse representation of harmonic features. This is why it's used (or was used) a lot in speech recognition: speech sounds tend to have harmonic properties (see: formants). Frequency-based transforms tell you how much of a given frequency (i.e. a repeating/periodic signal) is present. If you have harmonics, those can sort of be thought of as repeating patterns in the frequency domain. So taking a frequency transform of a frequency transform (which is, super loosely, what a cepstrum is; more precisely, it's the inverse transform of the log-magnitude spectrum) gets you a nice compact (separable) representation of inputs that tend to have harmonic features.
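To make the "transform of a transform" idea concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the synthetic test tone, the sample rate, the frame length, the 80-400 Hz search range, and the helper name `real_cepstrum` are all made up rather than taken from any particular library or pipeline. It builds a harmonic signal, computes the real cepstrum, and reads the pitch period off the biggest cepstral peak.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_mag).real

# Illustrative (assumed) parameters.
fs = 8000    # sample rate in Hz
n = 2048     # frame length in samples
f0 = 200.0   # fundamental of the synthetic tone in Hz

# Harmonic test signal: a fundamental plus decaying harmonics below Nyquist.
t = np.arange(n) / fs
signal = sum((0.8 ** k) * np.sin(2 * np.pi * f0 * k * t) for k in range(1, 20))

# Window the frame, then take the cepstrum.
cep = real_cepstrum(signal * np.hanning(n))

# Evenly spaced harmonics show up as one peak at the pitch period
# (in samples). Search only the quefrencies for 80-400 Hz pitches.
lo, hi = fs // 400, fs // 80
peak = lo + int(np.argmax(cep[lo:hi]))
estimated_f0 = fs / peak

print(f"cepstral peak at {peak} samples, estimated f0 of about {estimated_f0:.1f} Hz")
```

The whole stack of harmonics collapses into a single peak whose position should correspond to a pitch close to the 200 Hz fundamental. That's the compactness being described above: many spectral peaks become one cepstral coefficient.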

Most music tends to be pretty damn harmonic... so maybe?

There's also an argument to be made that if you have a big enough network of the right kind (almost surely involving recurrence) that just takes in windowed time-domain data, it wouldn't be surprising to see some cepstral-like features pop out naturally.