Clustering related stories (getprismatic.com)
84 points by mattdeboard on April 17, 2012 | 28 comments



The underlying dilemma is that so many of these stories are not really "related". They're just the same story, rewritten off of a press release. The ideal system would pick out clusters, but also have sub-clusters within each cluster containing articles on the same subject but with diverse info.


It is helpful that the same story is often written off of PA or AP articles and thus includes pretty much the same info - but it doesn't change the fact that stories on the same subject almost always include the same set of keywords unique to that subject, whether or not they were rewritten from a press release. That's the beauty of TF-IDF weighting: it clusters stories based on the words that are uniquely important to each article.
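Concretely, the textbook version of that weighting looks something like this (a Python sketch of plain TF-IDF plus cosine similarity, nothing specific to Prismatic's system):

    import math
    from collections import Counter

    def tf_idf_vectors(docs):
        """docs: a list of token lists. Returns one {term: weight} dict per doc."""
        n = len(docs)
        # document frequency: how many docs each term appears in
        df = Counter(term for doc in docs for term in set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            # terms frequent in this doc but rare in the corpus get the
            # highest weight -- e.g. a name unique to one story
            vectors.append({t: (1 + math.log(c)) * math.log(n / df[t])
                            for t, c in tf.items()})
        return vectors

    def cosine(a, b):
        """Similarity of two sparse {term: weight} vectors, 0.0 to 1.0."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
        na, nb = norm(a), norm(b)
        return dot / (na * nb) if na and nb else 0.0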


Sometimes they're just rewrites of a press release - and those ones are easy to get right - but a lot of the time they really are totally different articles about the same event. Go back and look at the three sets of example clusters and you'll see what I mean.


Doesn't Google do something similar to this for their news aggregator?


They also have access to the link graph, so they probably use that for clustering instead of looking at the text. Pages with lots of inbound linkers in common are likely to be similar.
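A toy version of that idea (pure speculation about what Google actually does; the function here is just illustrative):

    def cocitation_similarity(inbound_a, inbound_b):
        """Jaccard overlap of the sets of pages linking to two pages.
        Pages with many inbound linkers in common score near 1.0."""
        a, b = set(inbound_a), set(inbound_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)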


At first glance I interpreted the title as "Clustering-related stories", as in "Stories about clustering" which I thought was a very cleverly self-referential title. Oh well...


What has worked best for us is to narrow the category as much as possible before attempting to do unsupervised clustering.

We focus solely on sports, and our (supervised) classifiers reduce the scope first to the sport and then to the specific team before we apply any sort of clustering (k-means, LDA, etc.). That allows us to reduce the vocabulary to what is mostly a list of named entities for the sport/team plus keywords such as 'injury', 'quarterback', etc. With a significantly reduced vocabulary, even algorithms such as hierarchical LDA work surprisingly well.
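Schematically, the pipeline is something like this (the scikit-learn usage and all the names here are illustrative, not our production code):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def cluster_team_stories(docs, team_vocab, k=10):
        """docs: article texts already routed to a single sport/team by the
        supervised classifiers (not shown). team_vocab: the reduced
        vocabulary -- mostly named entities plus keywords like 'injury'."""
        # Restricting the vectorizer to the reduced vocabulary is the key
        # step: everything outside it is ignored, so clusters form on the
        # terms that actually distinguish stories about this team.
        vec = TfidfVectorizer(vocabulary=team_vocab)
        X = vec.fit_transform(docs)
        return KMeans(n_clusters=k).fit_predict(X)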


Nice!

Years ago I got my feet wet with some rudimentary text classification and grouping techniques. It was when Google News came out, and it was fluffier than I like my news to be (it has since been tuned a lot better), so I wrote my own.

I wrote up my (admittedly naive and simplistic, but effective) methods here:

http://www.informit.com/articles/printerfriendly.aspx?p=3988...


Did you consider weighting of the different feature types (on top of tf/idf)?


We actually do weight them differently, but the post was already pretty long so I had to gloss over some of the details.
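Schematically it's just a per-type multiplier on top of the tf-idf weight (the numbers below are made up for illustration, not our real values):

    # Illustrative multipliers: named entities count more than raw unigrams.
    FEATURE_TYPE_WEIGHTS = {'entity': 3.0, 'bigram': 1.5, 'unigram': 1.0}

    def weighted_vector(tfidf_vector, feature_type_of):
        """Scale each feature's tf-idf weight by its type's multiplier.
        tfidf_vector: {feature: tfidf weight}
        feature_type_of: {feature: 'entity' | 'bigram' | 'unigram'}"""
        return {f: w * FEATURE_TYPE_WEIGHTS[feature_type_of[f]]
                for f, w in tfidf_vector.items()}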


You might find this library helpful in this area: https://github.com/rwynn/rugroupy


You have to manually mark entities, though. How is this advantageous over, say, indexing your data with Solr and using its built-in clustering tools? (Carrot by default.)


I didn't realize Solr had clustering. One difference might be the ability to pass in scoring, include, and dynamic-tagging functions at clustering time. Does Carrot cluster on arbitrary document fields?


I'm not sure if Carrot does, but with Solr you have to specify which fields to cluster on, then restart the instance. The clustering functionality is a contrib library, I believe.


I worked on something similar, in Python, for my final-year project at university: https://github.com/basicallydan/Onda - clustering related articles in online media.

In retrospect it's pretty ugly, but it worked pretty well. I really wanted to implement named entities and n-grams but never got around to it. I'm glad you guys did :)


I worked on something similar for clustering BBC news articles. The (Ruby) code I used is here: https://github.com/bbcrd/similarity

I didn't account for named entities or n-grams in the feature vector, though. That's a very interesting idea.

@mattdeboard - what algorithm did you use to count the occurrence and size of clusters?


Since you have feature vectors, have you looked at using LSH to reduce the number of comparisons and the memory consumption?


In general, LSH is great for detecting near-duplicates that are very close, but we have two issues: (1) the stories we're looking for are separated by greater distances than LSH detects efficiently, and (2) because our feature vector has structure (named entities, n-grams, etc.), we can use a problem-specific 'coarse' feature vector, which is good at cheaply detecting possibly related candidates.
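The coarse pass amounts to an inverted index over the structured features, so we only score pairs that share at least one cheap feature (a sketch of the idea, not our actual code):

    from collections import defaultdict

    def candidate_pairs(doc_entities):
        """doc_entities: {doc_id: set of named entities}. Returns the
        doc pairs worth scoring with the full feature vector, instead
        of all O(n^2) pairs."""
        by_entity = defaultdict(set)
        for doc_id, ents in doc_entities.items():
            for e in ents:
                by_entity[e].add(doc_id)
        pairs = set()
        for ids in by_entity.values():
            ids = sorted(ids)
            for i, a in enumerate(ids):
                for b in ids[i + 1:]:
                    pairs.add((a, b))
        return pairs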


Nice article. A few questions:

1. Scalability: does your system ingest multiple documents in parallel? If so, how often do you observe over-segmentation, if any?

2. Thresholds: how did you set the thresholds at various parts of the system?


1. We do ingest multiple documents in parallel, but we also have a system in place for merging clusters, so it ends up not being a problem.

2. Pure hackery. Tweak, look at data, repeat.


Can you elaborate on how you merge clusters? Many times clusters will just be single-doc clusters; how do you check whether you need to merge two single-doc clusters? (It seems like the same process that created the clusters in the first place: doc-by-doc similarity.)


Merging is pretty simple, and could probably use a little more TLC. The way we do it is that when we get a new document into the system, if its similarity score is above some threshold against documents in two different clusters, we will consider merging those clusters. We then make the yes/no decision by comparing random documents from both clusters and averaging the scores, but the threshold we use here is a bit lower than for the non-merging decisions (since we have the additional information that this new document does a good job of linking the clusters).
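In rough pseudocode (the thresholds and sample size are placeholders, not our real values):

    import random

    SIM_THRESHOLD = 0.5    # placeholder: adding a doc to a cluster
    MERGE_THRESHOLD = 0.4  # placeholder: a bit lower, as described above

    def maybe_merge(cluster_a, cluster_b, similarity, n_samples=5):
        """Called when a new doc scored above SIM_THRESHOLD against docs
        in both clusters, i.e. it 'links' them. Compare random docs across
        the two clusters, average the scores, and merge if the average
        clears the (lower) merge threshold."""
        sample_a = random.sample(cluster_a, min(n_samples, len(cluster_a)))
        sample_b = random.sample(cluster_b, min(n_samples, len(cluster_b)))
        scores = [similarity(a, b) for a in sample_a for b in sample_b]
        return sum(scores) / len(scores) >= MERGE_THRESHOLD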


This doesn't work in the case where two (and only two) similar documents get ingested into the system as new singleton clusters at the same time. That case is very rare, though, so I guess it's not a big issue for you.


What corpus did you get your term frequencies from?


If I understand the blog correctly, their corpus consists of all the articles they have processed so far. Maybe they have some additional source, e.g. a more general collection of web pages.
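If that's right, the document frequencies only need to be maintained incrementally as articles arrive. A guess at the mechanics, in Python:

    import math
    from collections import Counter

    class IncrementalIdf:
        """Keep document frequencies over every article seen so far,
        so idf reflects the running corpus rather than a fixed one."""
        def __init__(self):
            self.n_docs = 0
            self.df = Counter()

        def add(self, doc_terms):
            self.n_docs += 1
            self.df.update(set(doc_terms))

        def idf(self, term):
            # +1 smoothing so unseen terms don't divide by zero
            return math.log((self.n_docs + 1) / (self.df[term] + 1))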


Interesting approach. Did you consider k-means for this? If so, what influenced your choice?


How are you doing entity extraction?


It's an in-house system similar to the one I built in grad school (http://nlp.stanford.edu/ner/index.shtml).



