Application of Locality Sensitive Hashing to audio fingerprinting (santhoshhari.github.io)
31 points by pncnmnp 10 months ago | 11 comments



Does anyone know how this approach differs from the approach taken by the popular(ish) dejavu project?

https://github.com/worldveil/dejavu


Last I checked, Dejavu doesn't use LSH or any approximations. It simply queries for exact matches, "aligns" them, and determines whether there is enough signal above the noise to call it a match.


Ahh, I see.

The approach here is just getting vector nearest neighbors.

I guess I don’t see how this could work like “Shazam” when you’re comparing the vector of some sample to the vector for the whole song.


Oh, no, it's not for the whole song. For each song, you take N samples, each sample gets converted into a fingerprint, and all of the fingerprints are stored in a database. Then when you query, you perform the same sample-to-fingerprint computation, and query the database for matching fingerprints. The database will return many false positives, but only 1 will "align", meaning the relative timing of the fingerprints returned is in sync with an actual song in the database. Once the number of aligned fingerprints reaches a set threshold, you can stop and return it as a result. That's how it performs subset matching.
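
Here's a rough Python sketch of that store/query/align loop (hypothetical names, with an in-memory dict standing in for the database — not dejavu's actual code):

    from collections import Counter, defaultdict

    # hash -> list of (song_id, offset_in_song); stands in for the real database
    db = defaultdict(list)

    def index_song(song_id, fingerprints):
        # fingerprints: iterable of (hash, offset_in_song)
        for h, offset in fingerprints:
            db[h].append((song_id, offset))

    def match(query_fingerprints, threshold=20):
        # Hits from the true song all share the same offset difference,
        # so "alignment" is just voting on (song_id, offset delta).
        votes = Counter()
        for h, q_offset in query_fingerprints:
            for song_id, s_offset in db.get(h, ()):
                votes[(song_id, s_offset - q_offset)] += 1
        if not votes:
            return None
        (song_id, _), count = votes.most_common(1)[0]
        return song_id if count >= threshold else None

The false positives show up as scattered (song, delta) pairs with low counts; only the real match accumulates votes at a single delta.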


Gotcha.

So you take n overlapping samples of the song, fingerprint and then vectorize each. Then do the same for the captured sample and get its nearest neighbors.

I’ve used dejavu for a personal project and it is quite fast. Is the LSH approach somehow more efficient/attractive now because we’ve made strides in vector nearest neighbors search?


Almost, but dejavu doesn't search nearest-neighbors. It searches for exact matches, so its robustness is essentially based on oversampling. Nearest neighbor search might be a way to achieve similar robustness with fewer fingerprints - assuming it results in fewer false negatives. LSH is one way to approach nearest neighbor search. So in theory you could modify dejavu with an LSH-based NN search and potentially increase robustness with less data, modulo whatever tradeoffs are necessary for computing NN search.
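
To make that concrete, here's one way an LSH layer could look — a random-hyperplane LSH for cosine similarity. Everything here is an assumed sketch, not dejavu code: the idea is that near-identical fingerprint vectors land in the same bucket with high probability, so the lookup tolerates small perturbations that would break an exact hash match.

    import numpy as np
    from collections import defaultdict

    class HyperplaneLSH:
        def __init__(self, dim, n_bits=16, n_tables=8, seed=0):
            rng = np.random.default_rng(seed)
            # one set of random hyperplanes per hash table
            self.planes = [rng.standard_normal((n_bits, dim))
                           for _ in range(n_tables)]
            self.tables = [defaultdict(list) for _ in range(n_tables)]

        def _keys(self, v):
            # the sign pattern of the projections is the bucket key
            for planes in self.planes:
                yield ((planes @ v) > 0).tobytes()

        def add(self, v, item):
            v = np.asarray(v, dtype=float)
            for table, key in zip(self.tables, self._keys(v)):
                table[key].append((v, item))

        def query(self, v, k=5):
            v = np.asarray(v, dtype=float)
            candidates = []
            for table, key in zip(self.tables, self._keys(v)):
                candidates.extend(table[key])
            # re-rank the small candidate set by true cosine similarity
            candidates.sort(key=lambda c: -np.dot(c[0], v) /
                            (np.linalg.norm(c[0]) * np.linalg.norm(v) + 1e-12))
            return [item for _, item in candidates[:k]]

You'd then run the same vote-on-offset alignment over the neighbors returned by query() instead of over exact hash matches.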


Does anyone have experience with how well this works?


Check out Steve and Iran's website https://musicinformationretrieval.com/

They have a bunch of Colab notebooks where they explore various music IR ideas, including LSH.

GitHub: https://github.com/iranroman/musicinformationretrieval.com/


How well it works for identifying music is going to depend on how well you can construct the feature vector, which the post doesn't cover. If you map every song into the same small subset of the space, LSH will struggle. Map them to well-separated places and it will shine.
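
A toy experiment makes the point (an assumed setup, nothing from the post): count how often two vectors fall into the same random-hyperplane bucket.

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits, trials = 64, 12, 1000

    def collide(a, b):
        # fresh random hyperplanes each trial; same sign pattern = same bucket
        planes = rng.standard_normal((n_bits, dim))
        return np.array_equal((planes @ a) > 0, (planes @ b) > 0)

    base = rng.standard_normal(dim)
    near = base + 0.05 * rng.standard_normal(dim)  # slightly perturbed copy
    far = rng.standard_normal(dim)                 # unrelated vector

    near_rate = sum(collide(base, near) for _ in range(trials)) / trials
    far_rate = sum(collide(base, far) for _ in range(trials)) / trials
    print(near_rate)  # high: near-duplicates usually share a bucket
    print(far_rate)   # near zero: well-separated vectors almost never do

If your feature vectors crowd together like the "near" pair even for different songs, every query drags in huge candidate buckets and LSH buys you nothing.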


This is wrong, or misleading at best.

It's a known, difficult open problem to create a good audio representation (feature vector). "Good," in the case of LSH, means that distance in the representation correlates well with human perception.

For something like Shazam, you want this representation to be invariant to minor transformations of the audio. That's the interesting problem.


How is that not aligned with what I’m saying? You add some detail, but that doesn’t make me wrong.



