Abusing hash kernels for wildly unprincipled machine learning (jeremydhoon.github.com)
95 points by jhoon on April 3, 2013 | 28 comments



Essentially random projection, which has good theoretical justification and comes in handy quite often, for instance, in SVMs [1]. I'd be concerned about using something as naive as CRC32, though; ideally they'd be using a cryptographic-strength hash fn.

There's also an entire area of research around "semantic hashing" and local embedding, that starts with such a random projection, and tries to improve the mapping to be better at a certain task while still remaining low-dimensional.

[1] http://arxiv.org/pdf/1211.6085v2.pdf


I think this is actually quite different from random projection, which creates linear combinations of several continuous variables. RP is much more similar to PCA in that respect. OP's method works on a single categorical variable and essentially shuffles the possible values, then 'folds' them down to an arbitrary size (the number of buckets you hash into).
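
A minimal sketch of that 'fold' (not the OP's actual code; the bucket count is an arbitrary choice):

  import zlib

  N_BUCKETS = 1024  # arbitrary; the "size" is your choice, not the hash's

  def fold(value, n_buckets=N_BUCKETS):
      """Map one categorical value to a bucket index."""
      return zlib.crc32(str(value).encode("utf-8")) % n_buckets

  fold("NY")           # some index in [0, 1024)
  fold("New Mexico")   # occasionally collides with another value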

>ideally they'd be using a cryptographic-strength hash fn

A cryptographic hash adds nothing; there is no danger from someone reversing the hash (and if there is, you may have bigger problems). Any hash function that is suitably random in its redistribution of the values should suffice.


It's a sparse random matrix, but definitely in the same family of techniques.

I was concerned more about distributional properties of different hash functions than reversibility. Checksums-as-hashes often work, but it's not that much work to swap in something a little more robust. If you want a good output distribution, pick something where that's a goal of the algorithm.
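
If you want to sanity-check a hash's spread over your own keys, a quick chi-square test on the bucket counts will do; a rough sketch (the keys and bucket count here are made up):

  import zlib
  from collections import Counter
  from scipy.stats import chisquare

  keys = ["feature_%d" % i for i in range(100000)]  # stand-ins for real feature names
  n_buckets = 256
  counts = Counter(zlib.crc32(k.encode()) % n_buckets for k in keys)
  stat, p = chisquare([counts.get(b, 0) for b in range(n_buckets)])
  print(p)  # a tiny p-value means the buckets are far from uniform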


The cited Weinberger article says:

  Different from random projections, the hashing-trick
  preserves sparsity and introduces no additional overhead 
  to store projection matrices.


The storage of random projection matrices is puzzling, because there's a well-known trick of always rebuilding the projection matrix from a known seed. It gives you the right distribution properties and none of the storage woes. With a sufficiently fast RNG, this could actually be faster than storing the matrix in main memory, because it avoids polluting the cache.
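
A sketch of that seed trick (illustrative only; the function and variable names are mine):

  import numpy as np

  def project(x, k, seed=42):
      """Project dense vector x down to k dims, regenerating each row of
      the Gaussian projection matrix from the seed instead of storing it."""
      out = np.zeros(k)
      for i, xi in enumerate(x):
          if xi == 0.0:
              continue  # sparse inputs stay cheap
          rng = np.random.RandomState([seed, i])  # row i is a pure function of (seed, i)
          out += xi * rng.standard_normal(k)
      return out / np.sqrt(k)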


While cryptographic hashes would fit the bill, there are plenty of non-cryptographic hashes with equally good key distribution properties and avalanche behaviour that are an order of magnitude faster to compute (some, like the FNV hash, are even super easy to implement yourself). These are the hashes used in hash tables etc., not cryptographic hashes.
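
For reference, FNV-1a really is a few lines; a sketch of the 32-bit variant (the constants are the published FNV offset basis and prime):

  def fnv1a_32(data):
      """32-bit FNV-1a hash of a bytes object."""
      h = 0x811c9dc5                      # FNV offset basis
      for byte in data:
          h ^= byte
          h = (h * 0x01000193) % 2 ** 32  # FNV prime, kept to 32 bits
      return h

  fnv1a_32(b"state=CA") % 1024  # bucket index for a hashed feature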


This is pretty similar to the approach that many predictive modelers already use: compress data down into 'dummy variables', where each variable represents some attribute. For example, you could convert the variable 'state' into 51 dummy variables, one for California, one for DC, etc. Hashing makes the programming a little easier and helps avoid throwing out data in the long tail of the distribution when the number of possible values is very high. It's common in NLP for exactly this reason.
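
A quick sketch of that contrast for the 'state' example (toy code, not from any particular library):

  import zlib

  states = ["CA", "NY", "CA", "TX", "DC"]

  # Explicit dummy variables: one column per value in the vocabulary.
  vocab = sorted(set(states))
  dummies = [[1 if s == v else 0 for v in vocab] for s in states]

  # Hashed features: a fixed number of columns, no vocabulary to maintain.
  n_buckets = 8
  hashed = [[0] * n_buckets for _ in states]
  for row, s in zip(hashed, states):
      row[zlib.crc32(("state=" + s).encode()) % n_buckets] += 1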

Hashing comes with costs though. It's much harder to interpret a model with hashed variables because the variables lose meaning. Also, some information may get thrown out; as OP mentions, integers may have continuous meaning, and categorizing them can really damage the model. If you have a very large dataset it probably won't matter, because most modeling methods asymptotically converge (i.e. with more and more data the model can learn that 333 and 334 are similar just by seeing enough examples where this is the case), but if you don't you could be throwing out valuable information. In a case like this I suppose the modeler could manually go in and convert his integers to floats to 'tell' the algorithm the data is continuous.


Your point regarding lost meaning is quite salient. It can be useful and enlightening when a learner reports a measure of feature importance (such as in random forest models, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home....).

When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal was just to make it as easy as possible to learn on arbitrary structured data.


>my goal was just to make it as easy as possible to learn on arbitrary structured data

I'd be very careful about throwing arbitrary data at your learner, at least if you don't understand your data well. Oftentimes the predictors and response are not properly separated in the same way they will be during real-world usage (for example, in time); this leads to target leaks, where your model is effectively cheating by using data it won't have in production.

Target leaks are obvious when the classifier performs suspiciously well on in-sample test data, but sometimes the repercussions are more subtle yet still very damaging in a production environment.


Hm, couldn't a hybrid approach deal with this? Eg hash all the data except a few dimensions you think are vital, and add those to the resulting hashed array?
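
Something like that would be easy to bolt on; a sketch, with made-up field names for the dimensions you keep raw:

  import zlib

  def featurize(record, keep=("age", "income"), n_buckets=256):
      """Hash every field except the 'vital' ones, then append those raw."""
      vec = [0.0] * n_buckets
      for key, value in record.items():
          if key in keep:
              continue
          vec[zlib.crc32(("%s=%s" % (key, value)).encode()) % n_buckets] += 1.0
      return vec + [float(record[k]) for k in keep]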


Or locality-sensitive hashing?


The hashing is not for security though, so why not keep a store of your hash + key? It's added overhead, but you wouldn't have to hash twice, and you could use the mapping table just for debugging rather than in operational code, at the expense of some resources.
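
A sketch of that side table (just a dict from bucket index back to the original keys):

  import zlib
  from collections import defaultdict

  n_buckets = 256
  bucket_to_keys = defaultdict(set)  # the debugging side table

  def hash_feature(key, value):
      bucket = zlib.crc32(("%s=%s" % (key, value)).encode()) % n_buckets
      bucket_to_keys[bucket].add(key)  # remember what landed here
      return bucket

  # Later, when the weight on bucket 17 looks odd:
  # bucket_to_keys[17] lists the original field names that collided there.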


This is very similar to the hashing trick used in vowpal wabbit (http://hunch.net/~vw/). I haven't used this python module, but vw is absolutely incredible to use.


Great point! VW was actually an inspiration for this blog post. I don't believe it handles structured data quite the same way as hashkernel, but it does a great job of accepting both categorical and continuous features. We use VW models to classify all sorts of badness on my team at Facebook (http://www.newscientist.com/article/dn21095-inside-facebooks...).


I love this entry. Two months ago, I wrote a perhaps more-pedantically-technical entry about the hash trick and its use in building a simple classifier, even making it play nice with Scipy's sparse linear algebra (eventually writing my own CSR times CSC multiplication method in Cython): http://blog.newsle.com/2013/02/01/text-classification-and-fe...


Nice writeup, jhoon!

In case anyone else was straining his/her eyes trying to figure out where the "likes" field is actually sanitized out of the training data set, it's the item.pop("likes") inside the function "is_liked": https://github.com/jeremydhoon/hashkernel/blob/master/fblear...

... Maybe I like long names too much, but my vote would be something longer like "check_if_liked_and_remove_likes" since "is_liked" sounds like a predicate that won't be modifying its input.


Meh. Diving a little deeper into how well the hash kernels algorithm did on the UCI Adult data set is mildly disappointing. Take a look at the results (http://archive.ics.uci.edu/ml/machine-learning-databases/adu...) and you'll see that hash kernels rank 14 of 17.

To be fair, I would definitely like to see how this algorithm does with other data sets as well.

All I see this algorithm doing is basically projecting a high-dimensional feature set onto a random n-dimensional subspace via a hashing function. In other words, it's not clear to me how an optimal classification boundary can be constructed using this random projection. I feel comfortable with the performance characteristics of techniques such as SVM or vector quantization, since they both focus on algorithms that optimally reduce the dimensionality of the feature space. However, random projections are, in my humble opinion, overrated.

Also, there are two additional sources of parameterization that might make it difficult to use this tool: the selection of a hash function (and accounting for the information loss via a random projection) along with the arbitrary hash kernel array size. That makes it somewhat difficult to train and validate a model.

However! I really do like the ideas presented here and look forward to diving into this and other variants. I really appreciate the OP taking the time to piece together this cool tool and compare its performance with other algorithms.


Anyone interested in the article should give VW a try. You can get good baseline performance in a fraction of the time compared to other traditional approaches like liblinear et al.

https://github.com/JohnLangford/vowpal_wabbit/wiki


You are one of my favorite online authors merely for this post. Do you write often?

I'm going to explore using this to fill parts of a vector in an otherwise normal regression algorithm and see how it does against only linear regression and only hash kernel on my dataset.


For someone (me) not particularly knowledgeable on the subject, I'm curious to understand the contrast between this and vowpal wabbit. Could someone offer insight into when it would be useful to use this project as opposed to vowpal?


First, please don't use hashkernel in production. It's a toy designed to showcase how hash kernels can help us (easily) learn on structured data.

VW is faster, smarter, and more featureful. However, it doesn't know how to learn on structured data.


I don't know anything about this space, but I have some real-world datasets and love throwing algorithms I don't understand at them. Right now I'm writing classifiers by hand; it'd be awesome to outsource my job to some AI. Does anyone have a good starting point where I can figure out which modern classification algorithms are relevant or even possible given my dataset and the computing power available?


This is a pretty decent flowchart to finding appropriate machine learning algorithms for your problem: http://i.imgur.com/iVo4AAw.png


I love chatting about ML. I am generally skeptical of blackbox machine learning, but I am happy to talk about what approaches might be viable for your specific datasets. gtalk or email works for me. rrenaud@gmail.com


I don't understand why this is "wildly unprincipled." Doesn't it have all the guarantees that Weinberger and Shi et al papers describe?


This is a great question. The "unprincipled" part comes in the application of hashkernel. For example, any integer-valued fields encountered in structured data are treated as categorical features (whereas floating-point numbers are treated as continuous features). Of course, it's possible that some of these integer-valued fields should be treated as continuous features. When using techniques in ways they were never intended to be used, like treating continuous features as categorical, many of the assumptions that make something "principled" no longer apply.
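
Roughly, the distinction looks like this (a simplified sketch of the idea, not hashkernel's actual code):

  import zlib

  def add_feature(vec, key, value, n_buckets=1024):
      """Hash ints and strings as categories; treat floats as continuous."""
      if isinstance(value, float):
          # Continuous: the hashed slot accumulates the value itself.
          vec[zlib.crc32(key.encode()) % n_buckets] += value
      else:
          # Categorical (ints included): the value becomes part of the hashed key.
          vec[zlib.crc32(("%s=%s" % (key, value)).encode()) % n_buckets] += 1.0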


eh, it's just binning :)


Writes one blog post - front page of HN



