Suppose you use Bayesian filtering on the text surrounding the links to determine whether the connection is good or bad. With a reasonable amount of data, it should be possible.
Note: I'm not an algorithms guy; I do business and strategy and a wee bit of programming, so maybe the example isn't good, but I think the point stands.
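The idea above could be sketched as a tiny naive Bayes classifier over the words surrounding a link. This is just an illustrative toy, not a production system; the training snippets are made-up stand-ins for real labeled data.

```python
from collections import Counter
import math

# Made-up training data: text surrounding links, labeled by sentiment.
train = [
    ("great service highly recommend this shop", "pos"),
    ("fantastic quality and fast shipping", "pos"),
    ("terrible scam avoid this site", "neg"),
    ("awful experience worst customer service", "neg"),
]

# Count words per class and count class frequencies.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    """Return the most likely class under naive Bayes with add-one smoothing."""
    scores = {}
    for label in word_counts:
        # Log prior for the class.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Add-one (Laplace) smoothed log likelihood of each word.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("highly recommend fantastic shop"))  # likely "pos"
print(classify("avoid this awful scam"))            # likely "neg"
```

With real data you'd want far more training text, tokenization beyond `str.split`, and an off-the-shelf implementation rather than this hand-rolled one, but the mechanics are the same.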
If you're interested, there are a whole host of fun and useful machine learning techniques that are actually not as hard to understand and apply as they sound. The best introductory book that I know of is Programming Collective Intelligence, which is surprisingly clear, if a little vague on the theory.
Naive Bayesian classifiers are just one of the more popular types; others include Support Vector Machines (SVMs), decision trees (and their relatives, random forests), and a bunch more. If you'd like to play around with some, Weka is good open source software for this.
That's how I'd approach this particular problem, though. As I said in the parent, I only have cursory experience in programming, and almost none in algorithms.
Google already analyzes backlinks in their context to determine how relevant the anchor text is to the topic of the page.
Determining sentiment (the topic of the NYT piece) is considerably harder though, because it would allow spammers to write negative articles about a site, link to it, and negatively affect its rankings. Also, determining the tone/emotions of a piece of text is probably one of the hardest things to do in textual analysis.
Determining sentiment (the topic of the NYT piece) is considerably harder though, because it would allow for spammers to write negative articles about a site and link to it and negatively affect its rankings.
This could be solved by making sentiment act as a weight (i.e. a multiplier in [0, 1]). Positive sentiment would give a particular reference more weight; negative sentiment would give it little to no weight. Then it would be impossible to negatively affect a site's rankings, only to positively affect them. Just like now.
You wouldn't have to apply the sentiment factor for all sites. You could have a rule that would only lower rankings for sites which have an overwhelming majority of negative links (say >90%). I don't think there would be many cases where a spammer could achieve that level of impact and those few cases could be dealt with through manual intervention.
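The two rules above could be combined into a short sketch: sentiment maps to a multiplier in [0, 1] on each link's contribution, and a penalty kicks in only past the negative-majority threshold. All scores, the 90% threshold, and the penalty factor are illustrative assumptions, not anything a real search engine discloses.

```python
# Assumed threshold from the comment above: penalize only when > 90%
# of a site's inbound links carry negative sentiment.
NEGATIVE_MAJORITY = 0.90

def link_weight(sentiment):
    """Map a sentiment score in [-1, 1] to a multiplier in [0, 1]."""
    return max(0.0, min(1.0, (sentiment + 1) / 2))

def site_score(link_sentiments):
    """Sum of sentiment-weighted link contributions, with the majority rule."""
    weighted = sum(link_weight(s) for s in link_sentiments)
    negative_fraction = sum(1 for s in link_sentiments if s < 0) / len(link_sentiments)
    if negative_fraction > NEGATIVE_MAJORITY:
        # Overwhelmingly negative: apply an (arbitrary, illustrative) penalty.
        weighted *= 0.5
    return weighted

# A negative link can only shrink its own contribution toward zero, so a
# spammer's negative articles can't drag an otherwise healthy site down.
print(site_score([0.8, 0.9, -0.5]))   # mostly positive: full credit
print(site_score([-0.9, -0.8, -0.7])) # all negative: penalty applies
```

The key property is that each link's weight is bounded below by zero, so the worst a hostile link can do is contribute nothing, unless the site crosses the negative-majority threshold.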
As for determining sentiment, it's not something I've ever tried to do, but is it really that hard? Intuitively, I would think positive and negative articles would have significantly different distributions of certain words.