Last I checked, Dejavu doesn't use LSH or any approximations. It simply queries for exact matches, "aligns" them, and determines whether enough signal stands out from the noise to call it a match.
Oh, no, it's not for the whole song. For each song, you take N samples, each sample gets converted into a fingerprint, and all of the fingerprints are stored in a database. Then when you query, you perform the same sample-to-fingerprint computation and query the database for matching fingerprints. The database will return many false positives, but only one song will "align", meaning the relative timing of the returned fingerprints is in sync with an actual song in the database. Once the number of aligned fingerprints reaches a set threshold, you can stop and return that song as the result. That's how it performs subset matching.
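A minimal sketch of that alignment step in Python (the db layout, fingerprint shape, and threshold here are hypothetical illustrations, not dejavu's actual schema):

    import collections

    # Assume db maps fingerprint hash -> list of (song_id, song_offset),
    # and the query yields (hash, sample_offset) pairs.
    def match(db, query_fingerprints, threshold=20):
        votes = collections.Counter()
        for fp_hash, sample_offset in query_fingerprints:
            for song_id, song_offset in db.get(fp_hash, ()):
                # Fingerprints from the true song all agree on one time
                # shift; false positives scatter across random shifts.
                votes[(song_id, song_offset - sample_offset)] += 1
        if not votes:
            return None
        (song_id, _shift), count = votes.most_common(1)[0]
        return song_id if count >= threshold else None

Voting on (song, time shift) pairs is what lets a short capture match the middle of a full track.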
So you take n overlapping samples of the song, fingerprint and then vectorize each, then do the same thing for the captured sample and find its nearest neighbors?
I’ve used dejavu for a personal project and it is quite fast. Is the LSH approach somehow more efficient/attractive now because we’ve made strides in vector nearest-neighbor search?
Almost, but dejavu doesn't search nearest neighbors. It searches for exact matches, so its robustness is essentially based on oversampling. Nearest-neighbor search might be a way to achieve similar robustness with fewer fingerprints, assuming it results in fewer false negatives. LSH is one way to approach nearest-neighbor search. So in theory you could modify dejavu with an LSH-based NN search and potentially increase robustness with less data, modulo whatever tradeoffs computing the NN search introduces.
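For illustration only, a minimal random-hyperplane LSH sketch (synthetic data and hypothetical dimensions; it assumes fingerprints have already been reduced to fixed-length vectors, which is itself the hard part):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hasher(dim, n_bits):
        # Each row is a random hyperplane; a vector's pattern of signs
        # is its hash, so vectors at a small angle usually share a bucket.
        planes = rng.standard_normal((n_bits, dim))
        return lambda v: tuple((planes @ v > 0).tolist())

    h = make_hasher(dim=64, n_bits=10)

    # Index: bucket each stored fingerprint vector by its hash.
    stored = {i: rng.standard_normal(64) for i in range(1000)}  # id -> vector
    buckets = {}
    for song_id, vec in stored.items():
        buckets.setdefault(h(vec), []).append(song_id)

    # Query with a noisy copy of vector 42: an exact-match lookup would
    # fail, but the perturbed vector usually lands in the same bucket.
    query = stored[42] + 0.05 * rng.standard_normal(64)
    print(buckets.get(h(query), []))  # candidate ids to verify properly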
How well it works for identifying music comes down to how well you can construct the feature vector, which isn't covered here. If you map every song into the same small subset of the space, LSH will struggle. Map them to well-separated places and it will shine.
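To make "well separated" concrete: with random-hyperplane LSH (one common scheme), two vectors at angle θ agree on any given hash bit with probability 1 − θ/π. If every song maps to nearly the same direction, θ is close to 0 for all pairs, everything collides into the same buckets, and the lookup degenerates toward a linear scan.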
Creating a good audio representation (feature vector) is a known, difficult open problem. "Good", in the case of LSH, means that distance in the representation space correlates well with human perception.
For something like Shazam, you want this representation to be invariant to minor transformations of the audio. That's the interesting problem.
https://github.com/worldveil/dejavu