Backpropping filter coefficients is clever, but it hasn't really caught on much. Google also tried this with LEAF (https://github.com/google-research/leaf-audio), a learnable audio filterbank.
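To make that concrete, here is a rough sketch of the general idea (hypothetical, and not LEAF's actual design): treat the analysis filters themselves as trainable parameters, e.g. a 1D convolution whose kernels are updated by backprop like any other weight.

    import torch
    import torch.nn as nn

    class LearnableFilterbank(nn.Module):
        def __init__(self, n_filters=40, kernel_size=401, stride=160):
            super().__init__()
            # Each output channel is one learnable FIR filter.
            self.bank = nn.Conv1d(1, n_filters, kernel_size, stride=stride,
                                  padding=kernel_size // 2, bias=False)

        def forward(self, wave):
            # wave: (batch, 1, samples) -> (batch, n_filters, frames)
            return self.bank(wave).abs()  # crude magnitude "pooling"

    fb = LearnableFilterbank()
    out = fb(torch.randn(2, 1, 16000))  # gradients flow into the filter taps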

Anyway, in audio ML what is very common is:

a) Futzing with the way you do feature extraction on the input (oh, maybe I want a CQT for this task, or a different Mel scale, etc.)

b) Doing feature extraction on generated audio output, and constructing loss functions from those generated audio features. (A rough sketch of both follows below.)
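Here is a minimal sketch of both points, assuming torchaudio; the FFT sizes and settings are illustrative, not canonical. It extracts log mel features (you could swap in a CQT here) and builds an L1 loss in feature space between generated and reference audio at several scales.

    import torch
    import torchaudio

    def multiscale_log_mel_loss(generated, reference, sample_rate=16000,
                                fft_sizes=(512, 1024, 2048), n_mels=64, eps=1e-5):
        loss = 0.0
        for n_fft in fft_sizes:
            mel = torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate, n_fft=n_fft,
                hop_length=n_fft // 4, n_mels=n_mels)
            # Compare log-compressed features rather than raw waveforms.
            loss = loss + (torch.log(mel(generated) + eps)
                           - torch.log(mel(reference) + eps)).abs().mean()
        return loss / len(fft_sizes)

    gen, ref = torch.randn(4, 16000), torch.randn(4, 16000)
    print(multiscale_log_mel_loss(gen, ref))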

So, as I said, I don't exactly see the utility of this library for deep learning.

With that said, it is definitely nice to have really high-speed, low-latency audio algorithms in C++. I just wouldn't market it as "useful for deep learning" because:

a) during training, you need more flexibility than non-GPU methods without backprop can offer, and

b) if you are doing "deep learning" then your trained model will presumably be quite large, and there will be a million other things you'll need to optimize to get real-time inference, or inference on CPUs, to work well.

That's just my gut reaction. It seems like a solid project; I just question the one selling point of "useful for deep learning", that's all.



Are there resources you would recommend reading regarding ML and audio?


This is a really broad topic. I began studying it about 5 years ago.

Can you start by suggesting what task you want to do? I'll throw out some suggestions, but you can say something different. Also, you are welcome to email me (email in HN profile):

* Voice conversion / singing voice conversion

* Transcription of audio to MIDI

* Classification / tagging of audio scenes

* Applying some effect / cleanup to audio

* Separating audio into different instruments

etc

The really quick summary of audio ML as a topic is:

* Often people treat audio ML like vision ML, by using spectrogram representations of audio (see the sketch after this list). That said, 1D models are sometimes just as good if not better, but they require very specific familiarity with the audio domain.

* Audio distance measures (loss functions) are pretty crappy and not well-correlated with human perception. You can say the same thing about vision distance measures, but a lot more research has gone into vision models, so we have better heuristics around vision stuff. With that said, a multi-scale log mel spectrogram loss isn't that terrible.

* Audio has a handful of little gotchas around padding, windowing, etc.

* DSP is a black art, and DSP knowledge has high ROI versus just being dumb and black-boxy about everything.
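To illustrate the first bullet, here's a minimal, hypothetical sketch of the "treat audio like vision" pattern: a log mel spectrogram fed to a tiny 2D CNN classifier. Shapes and hyperparameters are illustrative, not tuned.

    import torch
    import torch.nn as nn
    import torchaudio

    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, hop_length=256, n_mels=64,
        center=True)  # center padding is one of those gotchas: it changes the frame count

    classifier = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 10))  # e.g., 10 audio scene classes

    wave = torch.randn(8, 16000)              # batch of one-second clips
    spec = to_mel(wave).log1p().unsqueeze(1)  # (8, 1, 64, frames) "image"
    logits = classifier(spec)                 # (8, 10)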


I'm considering doing some ML stuff for a mobile DJ app, like beat/BPM detection, instrument/vocal separation, etc. Have you seen anything recent that might be efficient enough to run on a mobile device and process a track in a reasonable amount of time (less than song length)?


I may not email, as it isn't a serious pursuit, more a curiosity. Thank you for the invitation! My current fascination is with separation and classification. And modular synthesis, where I guess the DSP stuff comes in when translating into the digital domain.



