The underlying dilemma is that so many of these stories are not really "related". They're just the same story, rewritten from a press release. The ideal system would pick out clusters, but also have sub-clusters within each cluster containing articles on the same subject but with diverse info.
It is helpful that the same story is often written from PA or AP articles and thus includes pretty much the same info. But it doesn't change the fact that stories on the same subject almost always share a set of key words unique to that subject, whether or not they were rewritten from a press release. That's the beauty of TF-IDF weighting: it clusters stuff based on words that are uniquely important in one article.
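The TF-IDF idea above can be sketched in a few lines. This is a minimal, assumed implementation (not the system being discussed): terms that appear in few documents get a high IDF, so subject-specific key words dominate the similarity score.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents.
    Words appearing in few documents get a higher IDF weight, so terms
    unique to one story's subject dominate its vector."""
    n = len(docs)
    # Document frequency: how many docs contain each word at least once.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors (dicts)."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two rewrites of the same press release share rare terms and score high; an unrelated story scores near zero, which is exactly the clustering signal described above.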
Sometimes they're just rewrites of a press release - those are easy to get right - but a lot of the time they really are totally different articles about the same event. Go back and look at the three sets of example clusters and you'll see what I mean.
They also have access to the link graph, so they probably use that for clustering instead of looking at the text. Pages with lots of inbound linkers in common are likely to be similar.
At first glance I interpreted the title as "Clustering-related stories", as in "Stories about clustering" which I thought was a very cleverly self-referential title. Oh well...
What has worked best for us is to narrow the category as much as possible before attempting to do unsupervised clustering.
We focus solely on sports, and our classifiers (supervised) reduce the scope first to the sport and then to the specific team before we apply any sort of clustering (k-means, LDA, etc.). That allows us to reduce the vocabulary to what is mostly a list of named entities for the sport/team plus key words such as 'injury', 'quarterback', etc. With a significantly reduced vocabulary, even algorithms such as hierarchical LDA work surprisingly well.
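A toy sketch of that narrow-then-cluster idea, with everything assumed: the team vocabularies and the routing rule are made-up stand-ins for the real supervised classifiers, just to show how the vocabulary shrinks before clustering ever runs.

```python
# Hypothetical per-team vocabularies; in practice these would come from
# the supervised classification step, not be hand-written.
TEAM_VOCAB = {
    "packers": {"rodgers", "lambeau", "packers", "quarterback"},
    "bears": {"bears", "soldier", "chicago", "quarterback"},
}

def route_to_team(tokens, vocab_by_team):
    """Crude stand-in for a supervised classifier: pick the team whose
    vocabulary overlaps the article's tokens the most."""
    return max(vocab_by_team, key=lambda t: len(vocab_by_team[t] & set(tokens)))

def reduce_vocab(tokens, team, vocab_by_team):
    """Keep only team-specific terms; everything else is noise for the
    downstream clusterer (k-means, LDA, ...)."""
    return [w for w in tokens if w in vocab_by_team[team]]
```

The downstream clusterer then only ever sees a handful of named entities and key words, which is why even fussier algorithms behave well on the reduced input.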
years ago i got my hands wet with some rudimentary text classification and grouping techniques. it was when google news came out, and it was fluffier than i like my news to be (it has since been tuned a lot better). so i wrote my own.
i wrote up my (admittedly naive and simplistic - but effective) methods here:
You have to manually mark entities though. How is this advantageous over, say, indexing your data using Solr or whatever and using even the built-in clustering tools? (Carrot2 by default)
I didn't realize solr had clustering. One difference might be the ability to pass in scoring, include, and dynamic tagging functions at clustering time. Does carrot cluster on arbitrary document fields?
I'm not sure if Carrot does, but with Solr you have to specify which fields to cluster on then restart the instance. The clustering functionality is a contrib library I believe.
I worked on something similar for my final-year project at University in Python, at https://github.com/basicallydan/Onda - clustering related articles in the online media.
In retrospect it's pretty ugly, but it worked pretty well. I really wanted to implement named entities and n-grams but never got around to it. I'm glad you guys did :)
In general, LSH is great for detecting near-duplicates which are very close, but we have two issues: (1) the stories we're looking for are at a greater distance than LSH is good at detecting efficiently, and (2) because our feature vector has structure (named entities, n-grams, etc.), we have a problem-specific 'coarse' feature vector, which is good at cheaply detecting candidate related stories.
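For context on issue (1), here is a minimal MinHash sketch of the kind of near-duplicate LSH being contrasted above (my own illustration, not the commenter's system). The fraction of signature positions that agree estimates Jaccard similarity well when documents are near-identical, but the signal degrades at the larger distances mentioned.

```python
import random
import zlib

def minhash_signature(tokens, num_hashes=64, seed=0):
    """MinHash signature of a token set: for each salted hash function,
    record the minimum hash over the set. The probability two signatures
    agree at a position equals the sets' Jaccard similarity, so this is
    cheap and accurate for near-duplicates -- less so for looser matches."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    toks = set(tokens)
    return [min(zlib.crc32(f"{salt}:{w}".encode()) for w in toks)
            for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two rewrites sharing nearly all their tokens produce almost identical signatures; two different articles about the same event, sharing only a few named entities, do not — which is why a structured 'coarse' feature vector helps there instead.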
1. We do ingest multiple documents in parallel, but we also have a system in place for merging clusters, so it ends up not being a problem.
2. Pure hackery. Tweak, look at data, repeat.
Can you elaborate on how you merge clusters? Many times clusters will be just single-doc clusters; how do you check whether two single-doc clusters need to be merged? (It seems like the same process that created the clusters in the first place: doc-by-doc similarity.)
Merging is pretty simple, and could probably use a little more TLC. The way we do it is that when we get a new document in the system, if the similarity score is above some threshold for documents in two different clusters we will consider merging those clusters. We then make the yes/no decision by comparing random documents from both clusters and averaging the scores, but the threshold we use here is a bit lower than the non-merging decisions (since we have the additional information of this new document doing a good job of linking the clusters).
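The merge decision described above could look something like this sketch. Everything here is assumed for illustration: `similarity` stands for whatever doc-vs-doc score the system uses, and the threshold and sample count are made-up placeholders.

```python
import random

# Placeholder values; the real thresholds would be tuned on data.
MERGE_THRESHOLD = 0.35   # deliberately lower than the usual clustering bar
SAMPLES = 5

def should_merge(cluster_a, cluster_b, similarity, rng=None):
    """Decide a merge by sampling random document pairs from the two
    clusters and averaging their similarity, as the comment describes.
    The threshold is lower than the normal one because the triggering
    document already links the clusters."""
    rng = rng or random.Random(0)
    pairs = [(rng.choice(cluster_a), rng.choice(cluster_b))
             for _ in range(SAMPLES)]
    avg = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return avg >= MERGE_THRESHOLD
```

This would be triggered only when a newly ingested document scores above the normal threshold against members of both clusters, so the random-pair average acts as a confirmation step rather than the primary signal.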
this doesn't work in the case where two (and only two) similar documents get ingested into the system as new singleton clusters at the same time; this case is very rare so it is not a big issue for you, i guess.
If I understand the blog correctly, their corpus consists of all the articles they have processed so far. Maybe they have some additional source, e.g. a more general collection of web pages.