Dejavu: Audio fingerprinting and recognition in Python (github.com/worldveil)
170 points by of on Sept 11, 2014 | 25 comments



Hey! Creator here. Awesome to see this get posted and people excited about the project.

I made a cool writeup about it here: http://willdrevo.com/fingerprinting-and-audio-recognition-wi...

It's a great little library for doing audio recognition, stream radio advertisement verification, and all sorts of interesting use cases people email me about all the time that I never would have thought of.

It's certainly not as speedy as Echoprint, which is written in C++ and doesn't use an FFT for the locality sensitive hashing, but Dejavu is quite user friendly. The benefit of constellation/time-delta-based LSH methods like Dejavu's is that you can actually recover the time at which you matched.
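For the curious, here's a minimal sketch of that idea (illustrative only, not Dejavu's exact code): pair each spectrogram peak with a few later peaks, hash the (freq1, freq2, time_delta) triple, and keep the anchor's absolute offset so a match can be aligned in time.

    import hashlib

    def peak_hashes(peaks, fan_out=15):
        # peaks: list of (time_bin, freq_bin) pairs, sorted by time
        for i, (t1, f1) in enumerate(peaks):
            for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
                dt = t2 - t1
                h = hashlib.sha1("{}|{}|{}".format(f1, f2, dt).encode())
                yield h.hexdigest()[:20], t1  # truncated hash + offset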

If you love it, feel free to dig in and contribute!


Great work! Thanks for open sourcing this - it's very educational.

At the moment I'm using it to process a few hundred gigs of song files that I've collected as a big furry hairball of a mess: multiple iPods and MP3 players over the years, not much house-keeping in the moves from one to the other (and avoiding things like iTunes where possible), so I have a lot of files that may contain duplicate songs, but the filenames and organization don't necessarily reflect that fact.

So I'm using dejavu right now to clean this up .. I'm assuming you'd be happy to have a "find_duplicates.py" script added, something like the sketch below - if so, I'll let you know as soon as I have one working .. ;)
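Roughly what I have in mind, using the API from the README (names may have drifted, so treat this as a sketch rather than working code):

    import os
    from dejavu import Dejavu
    from dejavu.recognize import FileRecognizer

    # Database config follows the README's MySQL example.
    config = {"database": {"host": "127.0.0.1", "user": "root",
                           "passwd": "", "db": "dejavu"}}
    djv = Dejavu(config)
    djv.fingerprint_directory("music/", [".mp3", ".wav"])

    # Re-recognize every file; a hit on a *different* song name is a
    # duplicate candidate. Crude, but a starting point.
    for root, _, files in os.walk("music/"):
        for name in files:
            path = os.path.join(root, name)
            match = djv.recognize(FileRecognizer, path)
            if match and match["song_name"] != os.path.splitext(name)[0]:
                print("possible duplicate:", path, "->", match["song_name"])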

Thanks again!


Glad to see it's working well for you!

I'd be curious as well to see how the performance holds up getting into the terabytes, as I haven't tested that. Remember too that the matching algorithm has a lot of parameters (https://github.com/worldveil/dejavu/blob/master/dejavu/finge...) which let you trade off accuracy, speed, and storage in different ways. I've tried to document them thoroughly.
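To give a flavor of the knobs (names as in fingerprint.py at the time of writing; see the file itself for the authoritative defaults):

    DEFAULT_FAN_VALUE = 15       # peaks paired per anchor: more hashes,
                                 # better recall, more storage
    DEFAULT_AMP_MIN = 10         # minimum peak amplitude: raise it for
                                 # fewer, stronger fingerprints
    PEAK_NEIGHBORHOOD_SIZE = 20  # local-maximum window: larger = sparser peaks
    FINGERPRINT_REDUCTION = 20   # hex chars of each SHA-1 kept: fewer = a
                                 # smaller DB but more collisions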

Finding duplicates is a great one! Actually, generating a checksum for each audio file (minus the header and ID3 tags) and adding it as a column in the songs table, for all the filetypes Dejavu supports (mp3, wav, etc.), would probably be the best way to do this.
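Something like this, using pydub (which Dejavu already depends on for decoding), would catch byte-identical audio regardless of tags; note it won't catch lossy re-encodes of the same song:

    import hashlib
    from pydub import AudioSegment

    def audio_checksum(path):
        # Hash the decoded PCM rather than the raw file bytes, so ID3
        # tags and container headers don't change the result.
        audio = AudioSegment.from_file(path)
        return hashlib.sha1(audio.raw_data).hexdigest()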

I say this because so many songs today are built on sampling. Mashups and EDM often sample from other work, so fingerprints and their alignment can be shared across different songs. Something more clever, like computing the percentage of hashes shared per song and comparing against a threshold, might do the trick, though.
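That check could be as simple as the following (the 0.7 threshold is an arbitrary example, not a tested value):

    def likely_duplicates(hashes_a, hashes_b, threshold=0.7):
        # Overlap ratio of the two songs' fingerprint sets.
        a, b = set(hashes_a), set(hashes_b)
        return len(a & b) / float(min(len(a), len(b))) >= threshold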

Happy hacking, and feel free to send in a PR! :)


I worked on this too a few years ago. There is a massive patent minefield out there, including this algorithm.

We were doing recognition of TV shows where you are holding a cell phone in your hand some distance from the TV. A prototype with this algorithm was okay, but it got confused very easily, especially as the number of items in the database increased. You also end up with a lot less amplitude at that distance from the TV, which makes the source messier. In theory phase shouldn't have had an effect, but in practice it did, so things had to be run multiple times at different offsets to improve matching.

Our final algorithm was way better. It was based on what audio codecs do. It even worked reliably with a 60 dB signal while there was a 70 dB interferer signal!


Surprised nobody has mentioned MusicBrainz: it's the free and open source music fingerprinting database that powers taggers like Picard, Jaikoz, and Beets. They have been doing audio fingerprinting for years, and you can download the DB or access it via a web API. The author's solution may work quite well with a small number of entries to match against, but I suspect the match rate goes down significantly when the lookup is against hundreds of thousands or millions of other fingerprints.

https://wiki.musicbrainz.org/Fingerprinting


Audio fingerprinting as used by MusicBrainz is a slightly different concept. Because it doesn't need to match short phone-recorded samples, we can use more efficient algorithms for both the fingerprinting and the matching. It's usually not the match rate that goes down when dealing with a large database, but the false match rate that goes up. And of course performance. Those were my two main things to worry about when I was working on AcoustID (the current fingerprinting technology used by MusicBrainz).


Fingerprinting is fine; but the actual value would come from a large database of all sorts of fingerprints, so it could be used to identify songs, snippets, movies, etc.


That's certainly useful, and what Echoprint and MusicBrainz have tried to do.

Unfortunately, many fingerprinting use cases require hashing at different granularities (i.e., FFT windows) or different collision guarantees to trade off space vs. accuracy, and so on.

A perfect example is throwing away part of the SHA-1 hash of a fingerprint. You lose some entropy, but you become more space-efficient.
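Concretely (a toy example of the trade-off):

    import hashlib

    full = hashlib.sha1(b"f1|f2|dt").hexdigest()  # 40 hex chars = 160 bits
    stored = full[:20]                            # keep 80 bits: half the
                                                  # storage, more collisions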

Thus in many cases, while the core algorithm might be the same, the parameters and constraints of the individual use case often mean that the fingerprints themselves aren't universal in size or format.


This is amazing! Truly great explanation in the related blog post: http://willdrevo.com/fingerprinting-and-audio-recognition-wi... Does anyone know of a good explanation of locality sensitive hashing? I know there are other applications.


Shameless self-promotion: in my Stack Overflow answer [1], I reference good introductory LSH papers [2-5].

In short, LSH is an algorithm that hashes points that are nearby in a feature space into the same bin with high probability. Contrast that with cryptographically secure hashes where the tiniest change in the input is designed to yield a completely different hash. The point is that, in domains like multimedia, you want to tolerate some distortions to your signal, e.g. microphone noise, blur, etc. These minor distortions shouldn't affect your characterization of the data, e.g. "is this a guitar", "is this a cat", etc.

The advantages are that it's simple to implement, and it has mathematically provable probability bounds and query complexity.
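For a feel of how simple one construction can be, here is a toy random-projection LSH in Python (a standard construction, not tied to any particular library):

    import numpy as np

    rng = np.random.default_rng(0)
    planes = rng.normal(size=(16, 128))  # 16 random hyperplanes in 128-d

    def lsh_key(vec):
        # Bucket key = which side of each hyperplane the vector falls on.
        # Nearby vectors agree on most signs, so they usually share a key.
        return tuple((planes @ vec) > 0)

    v = rng.normal(size=128)
    noisy = v + 0.01 * rng.normal(size=128)  # small "microphone noise"
    print(lsh_key(v) == lsh_key(noisy))      # usually True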

[1] http://stackoverflow.com/questions/5751114/nearest-neighbors...

[2] http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bi...

[3] http://www.vldb.org/conf/1998/p194.pdf

[4] http://www.vldb.org/conf/1999/P49.pdf

[5] http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf


Like "hashing", I don't think "locality sensitive hashing" is a specific technique, but more like a type of algorithm.


I worked for a company doing audio fingerprinting before (as well as image/video). I have to say it's great that we had a song (audio) recognition system that worked pretty well, but the business never took off. Maybe it's just my previous company, as I'm not sure how Shazam is doing, but I haven't heard of them in a while.


Actually Google is now competing with them: https://play.google.com/store/apps/details?id=com.google.and...


I was surprised that a lot of my friends still have Shazam on their phones - but it's the kind of app they only need once in a blue moon.


I've putzed around on something very similar to this for performing analytics on terrestrial radio stations and their commercials. Great work, and I love that you've open sourced it... I didn't use Python for mine (C++), but Python offers a much lower barrier to entry versus my spaghetti code :)


I've written similar code, but using image processing algorithms. You can find it here:

https://github.com/jminardi/audio_fingerprinting


How well would this work for spoken word instead of music?


Probably not particularly well. It's based on variations in pitch over time, and unfortunately the human voice fills a very narrow frequency band. You could limit it to focus on that region of the frequency spectrum to get better results, but I suspect good results would require more dramatic changes.
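If you wanted to try the band-limiting idea, a sketch (the band edges are illustrative guesses, not tuned values):

    from scipy.signal import spectrogram

    def speech_band(samples, fs=44100, lo=85.0, hi=3000.0):
        # Keep only spectrogram rows in a rough speech band before
        # doing peak-finding on the result.
        freqs, times, sxx = spectrogram(samples, fs=fs, nperseg=4096)
        mask = (freqs >= lo) & (freqs <= hi)
        return freqs[mask], times, sxx[mask, :]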


This is great! I was just starting out on a project in which one component was to recognize gunshot sounds from varying distances.


Not sure if you've seen this, but relevant:

http://www.shotspotter.com/


Rather than just fingerprinting recorded audio, can this thing fingerprint words and passphrases that the user just says out loud?


Not even close. This technique only works when the relative energy of different frequency buckets remains the same, with peaks the same time intervals apart (in milliseconds). You are very unlikely to produce the same fingerprints when repeating the same words/phrases.

Try using an app that shows the FFT and see if you can get it to show the same thing twice while speaking. For example, on Android this works: https://play.google.com/store/apps/details?id=org.hermit.aud...
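Or, the desktop equivalent of that experiment (assumes two mono WAV recordings of the same phrase):

    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    for i, path in enumerate(["take1.wav", "take2.wav"]):
        fs, samples = wavfile.read(path)
        f, t, sxx = spectrogram(samples, fs=fs)
        plt.subplot(1, 2, i + 1)
        plt.pcolormesh(t, f, sxx)  # the peak patterns will rarely line up
        plt.title(path)
    plt.show()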


You can use CMUSphinx (http://cmusphinx.sourceforge.net) for keyphrase verification. For example, there's an Android demo for keyphrase spotting at http://cmusphinx.sourceforge.net/wiki/tutorialandroid


Nope! Dejavu is meant for recovery of perfectly maintained signals with additive noise. That is to say, detecting a signal or song played slightly slower than the original hashed version is completely outside what Dejavu can do. Detecting a signal played back at the same speed with a lot of background noise is just fine - that's what the algorithm is meant for.


I think this approach is probably a little too rigid for something like that, where there would be significant pitch, amplitude, and timing differences each time you say your passphrase.



