Essentially random projection, which has good theoretical justification and comes in handy quite often, for instance, in SVMs [1]. I'd be concerned about using something as naive as CRC32 though, ideally they'd be using a cryptographic-strength hash fn.
There's also an entire area of research around "semantic hashing" and local embedding, that starts with such a random projection, and tries to improve the mapping to be better at a certain task while still remaining low-dimensional.
I think this is actually quite different from random projection, which creates linear combinations of several continuous variables. RP is much more similar to PCA in that respect. OP's method works on a single categorical variable and essentially shuffles the possible values, then 'folds' them down to an arbitrary size (the size of your hashing function).
>ideally they'd be using a cryptographic-strength hash fn
Cryptographic adds nothing; there is no danger from someone reversing the hash (and if there is you may have bigger problems). Any hash function that is suitably random in its redistribution of the variables should suffice.
It's a sparse random matrix, but definitely in the same family of techniques.
I was concerned more about distributional properties of different hash functions than reversibility. Checksums-as-hashes often work, but it's not that much work to swap in something a little more robust. If you want a good output distribution, pick something where that's a goal of the algorithm.
The storage of random projection matrices is puzzling, because there's a well-known trick in always building the projection matrices from a known seed. It gives you the right distribution properties, and none of the storage woes. With a sufficiently fast RNG, this could actually be faster than storing it in main memory because of cache pollution.
while cryptographic hashes would fit the bill, there's plenty non-cryptographic hashes with equally good key distribution properties and avalanche behaviour, that are an order of magnitude faster to compute (some, like the FNV hash are even super-easy to implement yourself). These are the hashes used in hash-tables etc, not cryptographic hashes.
This is pretty similar to the approach that many predictive modelers already use: compress data down into 'dummy variables', where each variable represents some attribute. For example, you could convert the variable state into 51 dummy variables, one for california, one for DC, etc. Hashing makes the programming a little easier and helps avoid throwing out data in the long-tail of the distribution when the number of possible values is very high. It's common in NLP exactly because of this.
Hashing comes with costs though. It's much harder to interpret a model with hashed variables because the variables lose meaning. Also, some information may get thrown out; as OP mentions integers may have continuous meaning and categorizing them can really damage the model. If you have a very large dataset it probably won't matter because most modeling methods asymptotically converge (i.e. as you have more and more data the model can learn that 333 and 334 are similar just by seeing enough examples that this is the case), but if you don't you could be throwing out valuable information. In a case like this I suppose the modeler could manually go in and convert his integers to floats to 'tell' the algorithm the data is continuous.
When you say "hashing makes the programming a little easier," I think you hit the nail on the head. I'm not trying to improve classification accuracy -- my goal was just to make it as easy as possible to learn on arbitrary structured data.
>my goal was just to make it as easy as possible to learn on arbitrary structured data
I'd be very careful about throwing arbitrary data at your learner, at least if you don't understand your data well. Oftentimes the predictors and response are not properly separated in the same way they will be during real-world usage (for example, in time); this leads to target leaks, where your model is effectively cheating by using data it won't have in production.
Target leaks are obvious when the classifier performs suspiciously well on in-sample test data, but sometimes the repercussions are more subtle but still very damaging in a production environment.
Hm, couldn't a hybrid approach deal with this? Eg hash all the data except a few dimensions you think are vital, and add those to the resulting hashed array?
The hashing is not for security though, so why not keep a store of your hash + key. Again, added overhead but you wouldn't have to hash twice and you could just use the mapping table for debugging, rather than in operational code at the expense of resources.
This is very similar to the hashing trick used in vowpal wabbit (http://hunch.net/~vw/). I haven't used this python module, but vw is absolutely incredible to use.
Great point! VW was actually an inspiration for this blog post. I don't believe it handles structured data quite the same way as hashkernel, but it does a great job of accepting both categorical and continuous features. We use VW models to classify all sorts of badness on my team at Facebook (http://www.newscientist.com/article/dn21095-inside-facebooks...).
I love this entry. Two months ago, I wrote a perhaps more-pedantically-technical entry about the hash trick and its use in building a simple classifier, even making it play nice with Scipy's sparse linear algebra (eventually writing my own CSR times CSC multiplication method in Cython): http://blog.newsle.com/2013/02/01/text-classification-and-fe...
In case anyone else was searching his/her eyes trying to figure out where the "likes" field is actually sanitized out from the the training data set, it's the item.pop("likes") inside the function "is_liked": https://github.com/jeremydhoon/hashkernel/blob/master/fblear...
... Maybe I like long names too much, but my vote would be something longer like "check_if_liked_and_remove_likes" since "is_liked" sounds like a predicate that won't be modifying its input.
Meh. Diving a little bit more into the results on how well the hash kernels algorithm did with UCI Adult Names data set is mildly disappointing. Take a look at the results (http://archive.ics.uci.edu/ml/machine-learning-databases/adu...) and you'll see that hash kernels rank 14 of 17.
To be fair, I would definitely like to see how this algorithm does with other data sets as well.
All I see this algorithm doing is basically a projection of a highly dimension feature set onto a random n-d projection via a hashing function. Another words, it's not clear to me how an optimal classification boundary can be constructed using this random projection. I feel comfortable with understanding the performance characteristics of techniques such as SVM or vector quantization since they both focus on implementing algorithms that optimally reduce the dimensionality of the feature space. However, random projects are, in my humble opinion, overrated.
Also, there are two additional sources of parameterization that might make it difficult to use this tool: the selection of a hash function (and accounting for the information loss via a random projection) along with the arbitrary hash kernel array size. That makes it somewhat difficult to train and validate a model.
However! I really do like the ideas presented here and look forward into diving into this and other variants. I really appreciate the OP for taking the time to piece together this cool tool and compare it's performance with other algorithms.
Anyone interested in the article should give VW a try. You can get good baseline performance in a fraction of time compared to other traditional approaches like liblinear et. al.
You are one of my favorite online authors merely for this post. Do you write often?
I'm going to explore using this to fill parts of a vector in an otherwise normal regression algorithm and see how it does against only linear regression and only hash kernel on my dataset.
For someone (me) not particularly knowledgeable on the subject. I'm curious to understand the contrast between this and vowpal wabbit.
Could someone offer an insight into when it would be useful to use this project as opposed to vowpal?
I don't know anything about this space, but I have some real-world datasets and love throwing algorithms I don't understand at them. Right now I'm writing classifiers by hand, it'd be awesome to outsource my job to some AI. Does anyone have a good starting point where I can figure out which modern classification algorithms are relevant or even possible given my dataset and computer power available?
I love chatting about ML. I am generally skeptical of blackbox machine learning, but I am happy to talk about what approaches might be viable for your specific datasets. gtalk or email works for me. rrenaud@gmail.com
This is a great question. The "unprincipled" part comes in the application of hashkernel. For example, any integer-valued fields encountered in structured data are treated as categorical features (whereas floating-point numbers are treated as continuous features). Of course, it's possible that some of these integer-valued fields should be treated as continuous features. When using techniques in ways they were never intended to be used, like treating continuous features as categorical, many of the assumptions that make something "principled" no longer apply.
There's also an entire area of research around "semantic hashing" and local embedding, that starts with such a random projection, and tries to improve the mapping to be better at a certain task while still remaining low-dimensional.
[1] http://arxiv.org/pdf/1211.6085v2.pdf