Algorithmic search is sinking (skrenta.com)
53 points by McKittrick on Nov 28, 2010 | 45 comments



"The only way to combat this and return trust and quality to search is by taking an editorial stand and having humans identify the best sites for every category."

There are billions of webpages. Who is going to do this review?

Is someone honestly going to review http://stackoverflow.com/questions/4300234/how-might-union-f... and put it in the category of "How Union/Find data structures can be applied to Kruskal's algorithm?"?

No.

The closest thing to an editorialized web is www.dmoz.org, and that hasn't been properly updated in years (and never will be) because it failed.

Search has to be done with algorithms - there are just too many search queries to do it any other way. Udi Manber, Google's VP of Engineering, stated that 20-25% of all queries made each day have never been seen before: http://www.readwriteweb.com/archives/udi_manber_search_is_a_....


> The closest thing to an editorialized web is www.dmoz.org, and that hasn't been properly updated in years (and never will be) because it failed.

And, noting the circular irony one often sees: Rich Skrenta created a Yahoo knock-off in the bubble days called NewHoo and then sold it to Netscape, where it became the seed of dmoz...


Well, then he has been at this for a lot of years, and perhaps now knows how to do it. :-)


This is edited, but not to that kind of granularity: http://www.google.com/dirhp


... says the guy with the (crowdsourced) curated search engine.


Pretty much anyone who has switched to some other search engine as their primary over the past couple of years did so because Google's algorithmic search is sinking.


Given http://www.comscore.com/Press_Events/Press_Releases/2010/4/c..., I don't think there are many people who use a "non-algorithmic" search engine - or do you count Bing?


Use sentiment analysis to discover the intent of a link, and whether the destination should get more link juice. Positive sentiment: positive link juice. Negative sentiment: zero link juice.

Alternatively, for negative reviews, etc., use rel="nofollow".
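
Very roughly, something like this (a hypothetical sketch; classify_sentiment stands in for whatever sentiment model you'd actually plug in):

    # Hypothetical sketch: weight a link's "juice" by the sentiment of the
    # text surrounding it. classify_sentiment is assumed to return a score
    # in [-1, 1]; it is not any particular library.
    def link_juice(source_page_value, surrounding_text, classify_sentiment):
        sentiment = classify_sentiment(surrounding_text)
        if sentiment <= 0:
            return 0.0                         # negative/neutral mention: no juice
        return source_page_value * sentiment   # positive mention: scale by confidence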

To claim that algorithmic search is dead completely ignores the volume that Google is doing, and the fact that they are making billions in algorithmic search and ad placement. How much do curated places make?

Also, not to rain on anyone's parade or anything (just kidding, I'm going to rain it down): it would take 10k people decades of churning through pages to cover even 1% of the new content that Google discovers daily.

You all saw the "24 hours of unique video uploaded to YouTube every minute of every day" figure from a year or two ago, right? Imagine that, only text, and produced by 10x-1000x as many people at 10-1000x the volume, posting to forums, newsgroups, social networking sites, blogs, etc., every minute of every day. Because of this, you can't just review a site; you have to review the content on each page of the site. That's going to kill any curated engine in the long term.


Ironic that he's proposing to have people solve a problem that arose because people were being manipulated; of course, your people cannot be manipulated.

No, the solution to this problem is that GetSatisfaction et al use rel=nofollow. It's as simple as that. And arguably Google could improve its algorithm by taking negativity into account.


yes, this is all getsatisfaction's fault. surprised nytimes missed that angle.




I have heard about search engine spam before and sort of discounted it - but you know, if you search on something that's not a technical topic or something equally specific like a band name - that is, a general topic of interest to the mundanes - then there really is a whole lot of spam on Google.

My case from this week was that I wanted plans for a bookcase. I searched, therefore, on "build a bookcase". There was exactly one useful link on Google's front page (a Popular Mechanics link), and the rest were regurgitated spam that I could improve on with a Markov chain algorithm.

I've read that as long as people click on ads, Google has no motivation to clean up spam, but surely this can't be the best even for Google?


That's strange. My Google results page for "build a bookcase" shows 8 high-quality tutorials on how to build bookcases, plus two pretty decent video tutorials.


I concur. I got to the second page before I found anything other than a first-class result (incidentally, it was how to build a bookcase in 5 minutes, which seems to be a dubious proposition at best). Now if you're looking for information on something you actually want to buy, then yes, Google's results are indeed a sea of spam.


Another interesting example is trying to look for a hotel. Try to search for one you know - there's ~0 chance that its actual page will appear on the front page. Only resellers, spam, and car rentals...


Google is all about search quality, so they have a huge motivation to fight spam. It's getting worse now because email spam is mostly a solved problem[1] and the web has become the new battlefront.

[1] I haven't had a single spam message in my Gmail inbox for ages, and other spam filters are pretty good as well.


The core problem with having humans identify the best sites is that it doesn't scale. It's probably ok for big topics like travel or healthcare, but it shafts those users who are searching for long tail topics.


Or maybe our algorithms just aren't good enough.

Suppose you use Bayesian filtering on the text surrounding the links to determine whether the connection is good or bad. With a reasonable amount of data it should be possible.

Note: I'm not an algorithms guy, I do business and strategy and a wee bit of programming, so maybe the example isn't good, but I think the point is.
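
For what it's worth, a toy sketch of the kind of thing I mean (a tiny naive Bayes over the words around a link; all names here are made up for illustration, not any real library):

    # Toy naive Bayes classifier over the words surrounding a link:
    # is the linking context an endorsement ("good") or a complaint ("bad")?
    from collections import Counter
    import math

    class LinkContextClassifier:
        def __init__(self):
            self.words = {"good": Counter(), "bad": Counter()}
            self.docs = Counter()
            self.vocab = set()

        def train(self, label, text):
            tokens = text.lower().split()
            self.docs[label] += 1
            self.words[label].update(tokens)
            self.vocab.update(tokens)

        def classify(self, text):
            def score(label):
                counts = self.words[label]
                denom = sum(counts.values()) + len(self.vocab) + 1  # add-one smoothing
                s = math.log(self.docs[label] + 1)                  # rough log-prior
                for w in text.lower().split():
                    s += math.log((counts[w] + 1) / denom)
                return s
            return max(("good", "bad"), key=score)

    clf = LinkContextClassifier()
    clf.train("bad", "avoid this store, terrible service, never again")
    clf.train("good", "great guide, really helpful, highly recommend")
    print(clf.classify("helpful and highly recommended"))  # -> "good"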


In this case, what you specifically want is Sentiment Analysis. It's getting pretty accurate and efficient, and should be usable in just this scenario.


Interesting - do people frequently use Bayes' theorem in web programming? I've only seen it in other programming contexts.


If you're interested, there are a whole host of fun and useful machine learning techniques that are actually not as hard to understand and apply as they sound. The best introductory book that I know of is Programming Collective Intelligence, which is surprisingly clear, if a little vague on the theory:

http://oreilly.com/catalog/9780596529321

Naive Bayesian classifiers are just one of the more popular types; others include Support Vector Machines (SVMs), decision trees (and their relatives, random forests), and a bunch more. If you'd like to play around with some, Weka is good open source software for this:

http://www.cs.waikato.ac.nz/ml/weka/


I have no idea...

That's how I'd solve this particular problem, though. As I said in the parent, I only have cursory experience in programming, and almost none in algorithms.


Google already analyzes backlinks in their context to determine how relevant the anchor text is to the topic of the page.

Determining sentiment (the topic of the NYT piece) is considerably harder, though, because it would allow spammers to write negative articles about a site, link to it, and negatively affect its rankings. Also, determining the tone/emotions of a piece of text is probably one of the hardest things to do in textual analysis.


> Determining sentiment (the topic of the NYT piece) is considerably harder, though, because it would allow spammers to write negative articles about a site, link to it, and negatively affect its rankings.

This could be solved by making sentiments act as a weight (i.e. a multiplier in [0, 1]). Positive sentiments would give a particular reference more weight, negative sentiments would give little to no weight. Then it would be impossible to negatively affect a site's rankings - only positively affect them. Just like now.
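
Roughly (made-up names, just to show the clamping):

    # Sketch: sentiment only scales a link's contribution, never subtracts.
    # sentiment_score is assumed to be in [-1, 1] from some classifier.
    def link_contribution(base_value, sentiment_score):
        weight = max(0.0, min(1.0, (sentiment_score + 1) / 2))  # map to [0, 1]
        return base_value * weight  # worst case: the link simply counts for nothing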


You wouldn't have to apply the sentiment factor for all sites. You could have a rule that would only lower rankings for sites which have an overwhelming majority of negative links (say >90%). I don't think there would be many cases where a spammer could achieve that level of impact and those few cases could be dealt with through manual intervention.
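
Hypothetically, something along these lines (the threshold and penalty values are just placeholders):

    # Hypothetical rule: penalize only when the overwhelming majority of a
    # site's inbound links carry negative sentiment.
    def ranking_multiplier(negative_links, total_links, threshold=0.9):
        if total_links == 0:
            return 1.0
        if negative_links / total_links > threshold:
            return 0.5   # e.g. halve the site's score; the exact penalty is arbitrary
        return 1.0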

As for determining sentiment, it's not something I've ever tried to do, but is it really that hard? Intuitively, I would think positive and negative articles would have significantly different distributions of certain words.


A negative link on a page with a low reputation could be accounted for accordingly.


The mystery of the PageRank algorithm is not only a defense against gaming; it's a defense against competition. Other than stylistic differences (a la Bing), it seems difficult to differentiate a new service when nobody understands the details of the standard one.

As a net addict, I regularly find myself frustrated because I can't figure out how to get meaningful information out of Google instead of sites trying to sell me. And if I can't think off the top of my head of a website that will act as a relevant portal for that kind of info, then there isn't really any alternative to Google.

At least, not that I know of yet: can anyone suggest one?

Google has done amazing things for our ability to get what we want, and fast, but it is also slowly eroding our independence from it and our ability to educate ourselves by other means.

Here's hoping they prove worthy stewards once they own all the information on the planet.


An awful lot seems to be getting made out of this one story, and there's really precious little else cited in the post other than generic claims of gloom and doom about search. Google has been fighting spam sites for a long time before this, and the battle certainly waxes and wanes, but I'm sceptical that it's actually being lost; it's just a constant struggle.

Now if you tell me that there is value in social search, we could have a totally different discussion, but that's more about the persuasive power of personal recommendation than about algorithms not working any more.


It is sinking, but for a different reason: the web is getting away from Google, getting locked up in apps or walled gardens like Facebook or iTunes.


This. The open content web is beginning to disintegrate, being displaced by siloed apps which only incidentally happen to involve HTML and HTTP.


I don't know if Skrenta's approach is perfect (can spammers make slashtags? I'll bet they can!) but Google's is clearly failing.

Giant swathes of Google searches are now overrun with datafog spammers. Ehow, squidoo, hubpages, wikihow, buzzle, how-wiki, ezinearticles, bukisa, wisegeek, articlesnatch, healthblurbs, associatedcontent - all these and thousands more domains are filled with spam semi-automatically generated by legions of Indians for a few cents per page.

There's not one word of useful information on any of those domains. But apparently they serve a lot of ads for Google, so they don't get delisted.


I invoke SandGorgon's law of outsourcing analogies:

As an online discussion about PROGRAMMING grows longer, the probability of a comparison involving outsourcing or Indians approaches 1, if Godwin's law has not already been satisfied.


I really don't see that many comments about Indians and outsourcing in programming discussions here. Or maybe I am not noticing them?


Also, the law is pretty much a tautology since it doesn't define a time scale... eventually, pretty much everything gets said.

"As an online discussion about PROGRAMMING grows longer, the probability of a discussion of traditional medicine and spiritual beliefs surrounding childbirth in ancient sub Saharan African tribes approaches 1."


As Godwin's law itself clarifies:

Godwin put forth the sarcastic observation that, given enough time, all discussions—regardless of topic or scope—inevitably end up being about Hitler and the Nazis.


There's one feature I wish Google would add - the ability to manually blacklist sites from the search results.
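
In the meantime it's easy enough to approximate client-side - a made-up sketch, assuming you already have the result URLs in hand:

    # Filter result URLs against a personal blacklist of domains.
    from urllib.parse import urlparse

    BLACKLIST = {"example-spam-site.com", "another-content-farm.net"}  # your picks

    def domain(url):
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    def filter_results(urls):
        return [u for u in urls if domain(u) not in BLACKLIST]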


Why can't Google hire a crap load of people to manually sift through the most common queries and identify the spam and penalize those sites?

Google is worth billions. They could easily afford to hire 10,000 people to do this full time and it would completely transform their results.


Even if Google is imperfect, a curated list of links doesn't need to be any better. Duck Duck Go is basically Google (well, Bing) plus a blacklist. SEO takes time, so this should be pretty effective.


To everyone talking about "sentiment analysis": that's not easy. Sentences like "John has stupidly said foo[1], and even went so far as to say bar[2] (which was demolished by Jane[3] and Jan[4]); he's now capitulated[5]" would be quite difficult to parse. The following articles may also be instructive: all by the same author, all quite critical, but containing links with quite different intentions.

http://scienceblogs.com/goodmath/2009/05/dembski_responds.ph... http://scienceblogs.com/goodmath/2009/12/id_garbage_csi_as_n... http://scienceblogs.com/goodmath/2009/08/quick_critique_demb...


>>Algorithmic search is sinking

I think we just need better algorithms.


There is little rigor behind most of the claims of the NYTimes story: the targeted site already negates any PageRank benefit of those links (they do implement nofollow), and the definitive example seems to be nothing more than good SEO by the site in question (most of the other front- and second-page results are pretty mediocre as well, clearly with little web competition in that keyword space).

In any case, go to a shopping-specific (sub)site if you're shopping. A Google search is a terrible way to find either products or retailers.


Note: I do SEO as part of my job, so I know a few tools that can look through this. Google keyword checker gives 590 searches a month for that phrase, so it's not too competitive. I'm sure he ranks for a lot of these tail phrases though.

A lot of his juice comes from every page on his site (over 10,000 of them, according to Yahoo Site Explorer) linking with good anchor text to every other page. The fact that he ranks so low (on my Google he's number 6 or so) even with this on such an easy term shows something, doesn't it?


So there's some shitty retailer. What does this have to do with algorithmic search?


Well, Google uses algorithms to rank websites; the retailer's unprofessional practices game Google's algorithms by increasing the number of inbound links pointing to its site, which raises its ranking in Google's search results and increases traffic.



