Last I checked, Dejavu doesn't use LSH or any approximations. It simply queries for exact matches, "aligns" them, and determines whether enough signal stands out from the noise to call it a match.
Oh, no, it's not for the whole song. For each song, you take N samples, each sample gets converted into a fingerprint, and all of the fingerprints are stored in a database. Then when you query, you perform the same sample-to-fingerprint computation and query the database for matching fingerprints. The database will return many false positives, but only one song will "align", meaning the relative timing of the returned fingerprints is in sync with an actual song in the database. Once the number of aligned fingerprints reaches a set threshold, you can stop and return that song as the result. That's how it performs subset matching.
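A minimal sketch of that alignment step in Python (the db layout, fingerprint shape, and threshold here are hypothetical illustrations, not dejavu's actual schema):

    import collections

    # Assume db maps fingerprint hash -> list of (song_id, song_offset),
    # and the query yields (hash, sample_offset) pairs.
    def match(db, query_fingerprints, threshold=20):
        votes = collections.Counter()
        for fp_hash, sample_offset in query_fingerprints:
            for song_id, song_offset in db.get(fp_hash, ()):
                # Fingerprints from the true song all agree on one time
                # shift; false positives scatter across random shifts.
                votes[(song_id, song_offset - sample_offset)] += 1
        if not votes:
            return None
        (song_id, _shift), count = votes.most_common(1)[0]
        return song_id if count >= threshold else None

Voting on (song, time shift) pairs is what lets a short capture match the middle of a full track.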
So you take n overlapping samples of the song, fingerprint and then vectorize each, then do the same thing for the captured sample and find its nearest neighbors?
I’ve used dejavu for a personal project and it is quite fast. Is the LSH approach somehow more efficient/attractive now because we’ve made strides in vector nearest-neighbor search?
Almost, but dejavu doesn't search nearest neighbors. It searches for exact matches, so its robustness is essentially based on oversampling. Nearest-neighbor search might be a way to achieve similar robustness with fewer fingerprints, assuming it results in fewer false negatives. LSH is one way to approach nearest-neighbor search. So in theory you could modify dejavu with an LSH-based NN search and potentially increase robustness with less data, modulo whatever tradeoffs computing the NN search introduces.
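For illustration only, a minimal random-hyperplane LSH sketch (synthetic data and hypothetical dimensions; it assumes fingerprints have already been reduced to fixed-length vectors, which is itself the hard part):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hasher(dim, n_bits):
        # Each row is a random hyperplane; a vector's pattern of signs
        # is its hash, so vectors at a small angle usually share a bucket.
        planes = rng.standard_normal((n_bits, dim))
        return lambda v: tuple((planes @ v > 0).tolist())

    h = make_hasher(dim=64, n_bits=10)

    # Index: bucket each stored fingerprint vector by its hash.
    stored = {i: rng.standard_normal(64) for i in range(1000)}  # id -> vector
    buckets = {}
    for song_id, vec in stored.items():
        buckets.setdefault(h(vec), []).append(song_id)

    # Query with a noisy copy of vector 42: an exact-match lookup would
    # fail, but the perturbed vector usually lands in the same bucket.
    query = stored[42] + 0.05 * rng.standard_normal(64)
    print(buckets.get(h(query), []))  # candidate ids to verify properly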
How well it works for identifying music comes down to how well you can construct the feature vector, which isn't covered here. If you map every song into the same small subset of the space, LSH will struggle. Map them to well-separated places and it will shine.
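To make "well separated" concrete: with random-hyperplane LSH (one common scheme), two vectors at angle θ agree on any given hash bit with probability 1 − θ/π. If every song maps to nearly the same direction, θ is close to 0 for all pairs, everything collides into the same buckets, and the lookup degenerates toward a linear scan.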
Creating a good audio representation (feature vector) is a known, difficult open problem. "Good", in the case of LSH, means that distance in the representation space correlates well with human perception.
For something like Shazam, you want this representation to be invariant to minor transformations of the audio. That's the interesting problem.
https://github.com/worldveil/dejavu