Dejavu: Audio fingerprinting and recognition in Python (github.com/worldveil)
170 points by of on Sept 11, 2014 | 25 comments



Hey! Creator here. Awesome to see this get posted and people excited about the project.

I made a cool writeup about it here: http://willdrevo.com/fingerprinting-and-audio-recognition-wi...

It's a great little library for doing audio recognition, stream radio advertisement verification, and all sorts of interesting use cases people email me about all the time that I never would have thought of.

It's certainly not as speedy as Echoprint, which is written in C++ and doesn't use an FFT for the locality sensitive hashing, but Dejavu is quite user friendly. The benefit of constellation/time-delta-based LSH methods like Dejavu's is that you can actually recover the time at which you matched.
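For the curious, here's a minimal sketch of that idea (illustrative only, not Dejavu's exact code): pair each spectrogram peak with a few later peaks, hash the (freq1, freq2, time_delta) triple, and keep the anchor's absolute offset so a match can be aligned in time.

    import hashlib

    def peak_hashes(peaks, fan_out=15):
        # peaks: list of (time_bin, freq_bin) pairs, sorted by time
        for i, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
                dt = t2 - t1
                h = hashlib.sha1("{}|{}|{}".format(f1, f2, dt).encode())
                yield h.hexdigest()[:20], t1  # truncated hash + offset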

If you love it, feel free to dig in and contribute!


Great work! Thanks for open sourcing this - it's very educational.

At the moment I'm using it to process a few hundred gigs of song files that I've collected as a big furry hairball of a mess: multiple iPods and MP3 players over the years, not much house-keeping in the moves from one to the other (and avoiding things like iTunes where possible), so I have a lot of files that may contain duplicate songs, but the filenames and organization don't necessarily reflect that fact.

So I'm using dejavu right now to clean this up .. I'm assuming you'd be happy to have a "find_duplicates.py" script added, something like the sketch below - if so, I'll let you know as soon as I have one working .. ;)
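Roughly what I have in mind, using the API from the README (names may have drifted, so treat this as a sketch rather than working code):

    import os
    from dejavu import Dejavu
    from dejavu.recognize import FileRecognizer

    # Database config follows the README's MySQL example.
    config = {"database": {"host": "127.0.0.1", "user": "root",
                           "passwd": "", "db": "dejavu"}}
    djv = Dejavu(config)
    djv.fingerprint_directory("music/", [".mp3", ".wav"])

    # Re-recognize every file; a hit on a *different* song name is a
    # duplicate candidate. Crude, but a starting point.
    for root, _, files in os.walk("music/"):
        for name in files:
            path = os.path.join(root, name)
            match = djv.recognize(FileRecognizer, path)
            if match and match["song_name"] != os.path.splitext(name)[0]:
                print("possible duplicate:", path, "->", match["song_name"])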

Thanks again!


Glad to see it's working well for you!

I'd be curious as well to see how the performance holds up getting into the terabytes, as I haven't tested that. Remember too that the matching algorithm has a lot of parameters (https://github.com/worldveil/dejavu/blob/master/dejavu/finge...) which let you trade off accuracy, speed, and storage in different ways. I've tried to document them thoroughly.
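To give a flavor of the knobs (names as in fingerprint.py at the time of writing; see the file itself for the authoritative defaults):

    DEFAULT_FAN_VALUE = 15       # peaks paired per anchor: more hashes,
                                 # better recall, more storage
    DEFAULT_AMP_MIN = 10         # minimum peak amplitude: raise it for
                                 # fewer, stronger fingerprints
    PEAK_NEIGHBORHOOD_SIZE = 20  # local-maximum window: larger = sparser peaks
    FINGERPRINT_REDUCTION = 20   # hex chars of each SHA-1 kept: fewer = a
                                 # smaller DB but more collisions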

Finding duplicates is a great one! Actually, generating a checksum for each audio file (minus the header and ID3 tags) and adding it as a column in the songs table, for all the filetypes Dejavu supports (mp3, wav, etc.), would probably be the best way to do this.
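Something like this, using pydub (which Dejavu already depends on for decoding), would catch byte-identical audio regardless of tags; note it won't catch lossy re-encodes of the same song:

    import hashlib
    from pydub import AudioSegment

    def audio_checksum(path):
        # Hash the decoded PCM rather than the raw file bytes, so ID3
        # tags and container headers don't change the result.
        audio = AudioSegment.from_file(path)
        return hashlib.sha1(audio.raw_data).hexdigest()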

I say this because so many songs today are built on sampling. Mashups and EDM often sample from other work, so fingerprints and their alignment can be shared across different songs. Something more clever, like computing the percentage of hashes shared per song and comparing against a threshold, might do the trick, though.
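That check could be as simple as the following (the 0.7 threshold is an arbitrary example, not a tested value):

    def likely_duplicates(hashes_a, hashes_b, threshold=0.7):
        # Overlap ratio of the two songs' fingerprint sets.
        a, b = set(hashes_a), set(hashes_b)
        return len(a & b) / float(min(len(a), len(b))) >= threshold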

Happy hacking, and feel free to send in a PR! :)


I worked on this too a few years ago. There is a massive patent minefield out there, including this algorithm.

We were doing recognition of TV shows where you are holding a cell phone in your hand some distance from the TV. A prototype with this algorithm was okay, but it got confused very easily, especially as the number of items in the database increased. You also end up with a lot less amplitude at that distance from the TV, which makes the source messier. In theory phase shouldn't have had an effect, but in practice it did, so things had to be run multiple times at different offsets to improve matching.

Our final algorithm was way better. It was based on what audio codecs do. It even worked reliably with a 60 dB signal while there was a 70 dB interferer signal!


Surprised nobody has mentioned MusicBrainz: it's the free and open source music fingerprinting database that powers taggers like Picard, Jaikoz, and Beets. They have been doing audio fingerprinting for years, and you can download the DB or access it via a web API. The author's solution may work quite well with a small number of entries to match against, but I suspect the match rate goes down significantly when the lookup is against hundreds of thousands or millions of other fingerprints.

https://wiki.musicbrainz.org/Fingerprinting


Audio fingerprinting as used by MusicBrainz is a slightly different concept. Because it doesn't need to match short phone-recorded samples, we can use more efficient algorithms for both the fingerprinting and the matching. It's usually not the match rate that goes down when dealing with a large database, but the false match rate that goes up. And of course performance. Those were my two main things to worry about when I was working on AcoustID (the current fingerprinting technology used by MusicBrainz).


Fingerprinting is fine; but the actual value would come from a large database of all sorts of fingerprints, so it could be used to identify songs, snippets, movies, etc.


That's certainly useful, and what Echoprint and MusicBrainz have tried to do.

Unfortunately, many fingerprinting use cases require hashing at different granularities (i.e., FFT windows) or different collision guarantees to trade off space vs. accuracy, and so on.

A perfect example is throwing away part of the SHA-1 hash of a fingerprint. You lose some entropy, but you become more space-efficient.
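Concretely (a toy example of the trade-off):

    import hashlib

    full = hashlib.sha1(b"f1|f2|dt").hexdigest()  # 40 hex chars = 160 bits
    stored = full[:20]                            # keep 80 bits: half the
                                                  # storage, more collisions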

Thus in many cases, while the core algorithm might be the same, the parameters and constraints of the individual use case often mean that the fingerprints themselves aren't universal in size or format.


This is amazing! Truly great explanation in the related blog post: http://willdrevo.com/fingerprinting-and-audio-recognition-wi... Does anyone know of a good explanation of locality sensitive hashing? I know there are other applications.


Shameless self-promotion: in my Stack Overflow answer [1], I reference good introductory LSH papers [2-5].

In short, LSH is an algorithm that hashes points that are nearby in a feature space into the same bin with high probability. Contrast that with cryptographically secure hashes where the tiniest change in the input is designed to yield a completely different hash. The point is that, in domains like multimedia, you want to tolerate some distortions to your signal, e.g. microphone noise, blur, etc. These minor distortions shouldn't affect your characterization of the data, e.g. "is this a guitar", "is this a cat", etc.

The advantages are that it's simple to implement, and it has mathematically provable probability bounds and query complexity.
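For a feel of how simple one construction can be, here is a toy random-projection LSH in Python (a standard construction, not tied to any particular library):

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(16, 128))  # 16 random hyperplanes in 128-d

    def lsh_key(vec):
        # Bucket key = which side of each hyperplane the vector falls on.
        # Nearby vectors agree on most signs, so they usually share a key.
        return tuple((planes @ vec) > 0)

    v = rng.normal(size=128)
    noisy = v + 0.01 * rng.normal(size=128)  # small "microphone noise"
    print(lsh_key(v) == lsh_key(noisy))      # usually True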

[1] http://stackoverflow.com/questions/5751114/nearest-neighbors...

[2] http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bi...

[3] http://www.vldb.org/conf/1998/p194.pdf

[4] http://www.vldb.org/conf/1999/P49.pdf

[5] http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf


Like "hashing", I don't think "locality sensitive hashing" is a specific technique, but more like a type of algorithm.


I worked for a company doing audio fingerprinting before (as well as image/video). I have to say it's great that we had a song (audio) recognition system that worked pretty well, but the business never took off. Maybe it's just my previous company, as I'm not sure how Shazam is doing, but I haven't heard of them in a while.


Actually Google is now competing with them: https://play.google.com/store/apps/details?id=com.google.and...


I was surprised that a lot of my friends still have Shazam on their phones - but it's the kind of app they only need once in a blue moon.


I've putzed around on something very similar to this for performing analytics on terrestrial radio stations and their commercials. Great work, and I love that you've open sourced it... I didn't use Python for mine (C++), but Python offers a much lower barrier to entry versus my spaghetti code :)


I've written similar code, but using image processing algorithms. You can find it here:

https://github.com/jminardi/audio_fingerprinting


How well would this work for spoken word instead of music?


Probably not particularly well. It's based on variations in pitch over time, and unfortunately the human voice fills a very narrow frequency band. You could limit it to focus on that region of the frequency spectrum to get better results, but I suspect good results would require more dramatic changes.
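If you wanted to try the band-limiting idea, a sketch (the band edges are illustrative guesses, not tuned values):

    from scipy.signal import spectrogram

    def speech_band(samples, fs=44100, lo=85.0, hi=3000.0):
        # Keep only spectrogram rows in a rough speech band before
        # doing peak-finding on the result.
        freqs, times, sxx = spectrogram(samples, fs=fs, nperseg=4096)
        mask = (freqs >= lo) & (freqs <= hi)
        return freqs[mask], times, sxx[mask, :]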


This is great! I was just starting out on a project in which one component was to recognize gunshot sounds from varying distances.


Not sure if you've seen this, but relevant:

http://www.shotspotter.com/


Rather than just fingerprinting recorded audio, can this thing fingerprint words and passphrases that the user just says out loud?


Not even close. This technique only works when the relative energy of different frequency buckets remains the same, with peaks the same time intervals apart (in milliseconds). You are very unlikely to produce the same fingerprints when repeating the same words/phrases.

Try using an app that shows the FFT and see if you can get it to show the same thing twice while speaking. For example, on Android this works: https://play.google.com/store/apps/details?id=org.hermit.aud...
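Or, the desktop equivalent of that experiment (assumes two mono WAV recordings of the same phrase):

    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    for i, path in enumerate(["take1.wav", "take2.wav"]):
        fs, samples = wavfile.read(path)
        f, t, sxx = spectrogram(samples, fs=fs)
        plt.subplot(1, 2, i + 1)
        plt.pcolormesh(t, f, sxx)  # the peak patterns will rarely line up
        plt.title(path)
    plt.show()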


You can use CMUSphinx (http://cmusphinx.sourceforge.net) for keyphrase verification. For example, there's an Android demo for keyphrase spotting at http://cmusphinx.sourceforge.net/wiki/tutorialandroid


Nope! Dejavu is meant for recovery of perfectly maintained signals with additive noise. That is to say, detecting a signal or song played slightly slower than the original hashed version is completely outside what Dejavu can do. Detecting a signal played back at the same speed with a lot of background noise is just fine - that's what the algorithm is meant for.


I think this approach is probably a little too rigid for something like that, where there would be significant pitch, amplitude, and timing differences each time you say your passphrase.



