
But with articles, what you might have found uninteresting one day may become interesting the next. And what you thought was fascinating one day may become boring the next.

True. This is a real problem. Much better results can be obtained by partitioning the space of articles and predicting on more focused subsets. The partition can either be manual ("these documents are from my RSS feeds about Python programming") or automatic, via some clustering algorithm.
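
A minimal sketch of the automatic route, assuming scikit-learn; the TF-IDF features, the toy corpus, and k=2 are illustrative choices, not anything from the original setup:

    # Partition articles into focused subsets via clustering, then
    # train a separate relevance model per subset.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    articles = [
        "python generators and iterators explained",
        "profiling asyncio event loops",
        "sourdough starter hydration tips",
        "baking rye bread at home",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(articles)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Group article indices by cluster; each partition gets its own
    # classifier, so drift in one topic doesn't pollute another.
    partitions = {}
    for idx, cluster in enumerate(labels):
        partitions.setdefault(cluster, []).append(idx)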

Ideally you identify subsets where your interests are less transient.

But the flip side of using a general-purpose classifier is that you can train it to predict whatever you want. You're in the driver's seat. It doesn't have to be "interesting"; it can be "most relevant to my primary area of research" or "most relevant to this book I'm writing" or "most relevant to my small business."

Communicating this capability to users is a whole 'nother deal, of course.

I did something similar using dbacl.

I remember looking at dbacl. There's some great information in the writeups, but I was disappointed that it coupled feature extraction to classification. For instance, what if you wanted to use stemmed terms as features? How good is the word tokenization for other languages (compare to Xapian's tokenizer)? Can it do named-entity identification and use the entities as features? Can you toss in other features derived from external metadata? Etc.
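
Decoupling is easy to sketch: anything that can emit (item, feature) pairs can feed the classifier. A hypothetical extractor, assuming NLTK's Porter stemmer; the feature-name prefixes and helper names are made up for illustration:

    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stemmed_terms(text):
        # Stemmed word tokens as one feature source among many.
        for tok in re.findall(r"[a-z0-9]+", text.lower()):
            yield "term:" + stemmer.stem(tok)

    def metadata_features(meta):
        # External metadata (feed name, author, ...) becomes features too.
        for key, value in meta.items():
            yield f"{key}:{value}"

    def emit(item_id, text, meta):
        # One (item, feature) TSV record per line, ready to pipe onward.
        for feat in stemmed_terms(text):
            print(f"{item_id}\t{feat}")
        for feat in metadata_features(meta):
            print(f"{item_id}\t{feat}")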

We ended up doing all of these things, but our classifier itself remained a simple unix tool. For training, you piped in (item, feature, category) TSV records. For classification, you piped in (item, feature) records and it output (item, category, probability) records.
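
The actual tool isn't shown here, but the protocol is simple enough to sketch. A stand-in using naive Bayes with Laplace smoothing; the model file name and the smoothing scheme are my assumptions, not the original implementation:

    #!/usr/bin/env python3
    # "train" mode reads (item, feature, category) TSV on stdin;
    # the default mode reads (item, feature) TSV and writes
    # (item, category, probability) TSV on stdout.
    import json, math, sys
    from collections import Counter, defaultdict

    MODEL = "model.json"  # illustrative; persist however you like

    def train():
        feat_counts = defaultdict(Counter)  # category -> feature counts
        item_cat = {}                       # item -> category, for priors
        for line in sys.stdin:
            if not line.strip():
                continue
            item, feature, category = line.rstrip("\n").split("\t")
            feat_counts[category][feature] += 1
            item_cat[item] = category
        with open(MODEL, "w") as f:
            json.dump({"features": {c: dict(fc) for c, fc in feat_counts.items()},
                       "priors": dict(Counter(item_cat.values()))}, f)

    def classify():
        with open(MODEL) as f:
            model = json.load(f)
        vocab = {feat for fc in model["features"].values() for feat in fc}
        total = sum(model["priors"].values())
        items = defaultdict(list)           # item -> its features
        for line in sys.stdin:
            if not line.strip():
                continue
            item, feature = line.rstrip("\n").split("\t")
            items[item].append(feature)
        for item, feats in items.items():
            scores = {}
            for cat, fc in model["features"].items():
                denom = sum(fc.values()) + len(vocab)  # Laplace smoothing
                score = math.log(model["priors"][cat] / total)
                for feat in feats:
                    score += math.log((fc.get(feat, 0) + 1) / denom)
                scores[cat] = score
            best = max(scores, key=scores.get)
            # Normalize log scores into a probability for the winner.
            z = sum(math.exp(s - scores[best]) for s in scores.values())
            print(f"{item}\t{best}\t{1.0 / z:.4f}")

    if __name__ == "__main__":
        train() if sys.argv[1:] == ["train"] else classify()

Something like ./classify.py train < labeled.tsv to train, then ./classify.py < unlabeled.tsv to score.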
