The underlying dilemma is that so many of these stories are not really "related". They're just the same story, rewritten from a press release. The ideal system would pick out clusters, but also have sub-clusters within each cluster containing articles on the same subject but with diverse info.
It is helpful that the same story is often written from PA or AP articles and thus includes pretty much the same info. But it doesn't change the fact that stories on the same subject almost always share a set of key words unique to that subject, whether or not they were rewritten from a press release. That's the beauty of TF-IDF weighting: it clusters stuff based on words that are uniquely important in one article.
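The TF-IDF idea above can be sketched in a few lines. This is a minimal, assumed implementation (not the system being discussed): terms that appear in few documents get a high IDF, so subject-specific key words dominate the similarity score.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents.
    Words appearing in few documents get a higher IDF weight, so terms
    unique to one story's subject dominate its vector."""
    n = len(docs)
    # Document frequency: how many docs contain each word at least once.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors (dicts)."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Two rewrites of the same press release share rare terms and score high; an unrelated story scores near zero, which is exactly the clustering signal described above.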
Sometimes they're just rewrites of a press release - those are easy to get right - but a lot of the time they really are totally different articles about the same event. Go back and look at the three sets of example clusters and you'll see what I mean.
They also have access to the link graph, so they probably use that for clustering instead of looking at the text. Pages with lots of inbound linkers in common are likely to be similar.
At first glance I interpreted the title as "Clustering-related stories", as in "Stories about clustering" which I thought was a very cleverly self-referential title. Oh well...
What has worked best for us is to narrow the category as much as possible before attempting to do unsupervised clustering.
We focus solely on sports, and our classifiers (supervised) reduce the scope first to the sport and then to the specific team before we apply any sort of clustering (k-means, LDA, etc.). That allows us to reduce the vocabulary to what is mostly a list of named entities for the sport/team plus key words such as 'injury', 'quarterback', etc. With a significantly reduced vocabulary, even algorithms such as hierarchical LDA work surprisingly well.
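A toy sketch of that narrow-then-cluster idea, with everything assumed: the team vocabularies and the routing rule are made-up stand-ins for the real supervised classifiers, just to show how the vocabulary shrinks before clustering ever runs.

```python
# Hypothetical per-team vocabularies; in practice these would come from
# the supervised classification step, not be hand-written.
TEAM_VOCAB = {
    "packers": {"rodgers", "lambeau", "packers", "quarterback"},
    "bears": {"bears", "soldier", "chicago", "quarterback"},
}

def route_to_team(tokens, vocab_by_team):
    """Crude stand-in for a supervised classifier: pick the team whose
    vocabulary overlaps the article's tokens the most."""
    return max(vocab_by_team, key=lambda t: len(vocab_by_team[t] & set(tokens)))

def reduce_vocab(tokens, team, vocab_by_team):
    """Keep only team-specific terms; everything else is noise for the
    downstream clusterer (k-means, LDA, ...)."""
    return [w for w in tokens if w in vocab_by_team[team]]
```

The downstream clusterer then only ever sees a handful of named entities and key words, which is why even fussier algorithms behave well on the reduced input.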
years ago i got my hands wet with some rudimentary text classification and grouping techniques. it was when google news came out, and it was fluffier than i like my news to be (it has since been tuned a lot better). so i wrote my own.
i wrote up my (admittedly naive and simplistic - but effective) methods here:
You have to manually mark entities though. How is this advantageous over, say, indexing your data using Solr or whatever and using even the built-in clustering tools? (Carrot2 by default)
I didn't realize solr had clustering. One difference might be the ability to pass in scoring, include, and dynamic tagging functions at clustering time. Does carrot cluster on arbitrary document fields?
I'm not sure if Carrot does, but with Solr you have to specify which fields to cluster on then restart the instance. The clustering functionality is a contrib library I believe.
I worked on something similar for my final-year project at University in Python, at https://github.com/basicallydan/Onda - clustering related articles in the online media.
In retrospect it's pretty ugly, but it worked pretty well. I really wanted to implement named entities and n-grams but never got around to it. I'm glad you guys did :)
In general, LSH is great for detecting near-duplicates which are very close, but we have two issues: (1) the stories we're looking for are at a greater distance than LSH is good at detecting efficiently, and (2) because our feature vector has structure (named entities, n-grams, etc.), we have a problem-specific 'coarse' feature vector, which is good at cheaply detecting candidate related stories.
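For context on issue (1), here is a minimal MinHash sketch of the kind of near-duplicate LSH being contrasted above (my own illustration, not the commenter's system). The fraction of signature positions that agree estimates Jaccard similarity well when documents are near-identical, but the signal degrades at the larger distances mentioned.

```python
import random
import zlib

def minhash_signature(tokens, num_hashes=64, seed=0):
    """MinHash signature of a token set: for each salted hash function,
    record the minimum hash over the set. The probability two signatures
    agree at a position equals the sets' Jaccard similarity, so this is
    cheap and accurate for near-duplicates -- less so for looser matches."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    toks = set(tokens)
    return [min(zlib.crc32(f"{salt}:{w}".encode()) for w in toks)
            for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two rewrites sharing nearly all their tokens produce almost identical signatures; two different articles about the same event, sharing only a few named entities, do not — which is why a structured 'coarse' feature vector helps there instead.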
1. We do ingest multiple documents in parallel, but we also have a system in place for merging clusters, so it ends up not being a problem.
2. Pure hackery. Tweak, look at data, repeat.
Can you elaborate on how you merge clusters? Many times clusters will be just single-doc clusters; how do you check whether two single-doc clusters need to be merged? (It seems like the same process that created the clusters in the first place: doc-by-doc similarity.)
Merging is pretty simple, and could probably use a little more TLC. The way we do it is that when we get a new document in the system, if the similarity score is above some threshold for documents in two different clusters we will consider merging those clusters. We then make the yes/no decision by comparing random documents from both clusters and averaging the scores, but the threshold we use here is a bit lower than the non-merging decisions (since we have the additional information of this new document doing a good job of linking the clusters).
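The merge decision described above could look something like this sketch. Everything here is assumed for illustration: `similarity` stands for whatever doc-vs-doc score the system uses, and the threshold and sample count are made-up placeholders.

```python
import random

# Placeholder values; the real thresholds would be tuned on data.
MERGE_THRESHOLD = 0.35   # deliberately lower than the usual clustering bar
SAMPLES = 5

def should_merge(cluster_a, cluster_b, similarity, rng=None):
    """Decide a merge by sampling random document pairs from the two
    clusters and averaging their similarity, as the comment describes.
    The threshold is lower than the normal one because the triggering
    document already links the clusters."""
    rng = rng or random.Random(0)
    pairs = [(rng.choice(cluster_a), rng.choice(cluster_b))
             for _ in range(SAMPLES)]
    avg = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return avg >= MERGE_THRESHOLD
```

This would be triggered only when a newly ingested document scores above the normal threshold against members of both clusters, so the random-pair average acts as a confirmation step rather than the primary signal.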
this doesn't work in the case where two (and only two) similar documents get ingested into the system as new singleton clusters at the same time; this case is very rare so it is not a big issue for you, i guess.
If I understand the blog correctly, their corpus consists of all the articles they have processed so far. Maybe they have some additional source, e.g. a more general collection of web pages.