One thing that I hope people don't miss is that the problem "Google Alerts" solv...

wodow · on Sept 25, 2013

Thanks for an interesting post.

The IBM journal publication - is it this one? http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=617771...

fintler · on Sept 25, 2013

Although I'm not sure if that's the mentioned publication, you might want to take a look at https://uima.apache.org/.

More info on how Watson is using UIMA: https://blogs.apache.org/foundation/entry/apache_innovation_...

PaulHoule · on Sept 25, 2013

Yep, thanks for the link!

AznHisoka · on Sept 25, 2013

Isn't Google Alerts simply based on keyword/phrase matches? So if I want to get an alert for the keyword "recipe", it'll give me web pages that are about recipes, as well as articles that simply have the word 'recipe' (ie "Customer development is the recipe to startup success"). I don't think it ever marketed itself as a topic search alert system.

PaulHoule · on Sept 25, 2013

Yeah, but here's a Fermi problem the crowd.

How many web pages get created every day that contain the word "recipe"?

I'm certain you'd be buried in notifications if Google sent you an alert for every recipe, so it has to have some selection mechanism to send you particular recipes...

beambot · on Sept 25, 2013

I thought Google Alerts were to tell you when specific phrases were encountered, like "MyMostlyUniqueFullName" or "MyCompanyName". "Recipe" doesn't seem that useful -- or at least doesn't match my M.O. for Google Alerts.

gavinpc · on Sept 26, 2013

I'm guessing that most Google Alerts are "vanity alerts," as your examples suggest. The putative decline of the service could thus fit with the idea that such functions are meant to be subsumed by Google+.

est · on Sept 26, 2013

I once tried to search for Avatar the movie, but SERP returns many Avatar the video game. I added -game, but realized that if a blog titles "Why I like Avatar (the movie, not the game)", it might be filtered out with the -game keyword.

zeckalpha · on Sept 26, 2013

But not the TV show? That's disappointing.

AznHisoka · on Sept 25, 2013

Well, if there's a ton of results, I'm sure they use some sort of way to filter out the unpopular ones, such as pagerank.

nl · on Sept 26, 2013

Nice points. This is pretty much why the old Google blog search failed, too.

The problem with your similarity scoring idea is that it fails badly in adversarial conditions (as I'm sure you are aware). It's easy to work around that failure, but then you end up using something like dot-product. I'm not at all convinced that normalizing the scoring is throwing anything away at all.

On another point:

Would you mind explaining this a bit more: Because these algorithms almost always get the wrong idea about the prior distribution, you often make a "failing" machine learning algo very useful if you do logistic regression on the output and use that to convert the output into a probability score.

PaulHoule · on Sept 26, 2013

In the case I'm talking about we were using RDF data that was curated, so we had no enemy.

Adversarial IR is a problem that came with Google and will go away with Google.

Bing has the problem because too they are trying to be Google.

If you accept Sturgeon's law,

http://en.wikipedia.org/wiki/Sturgeon%27s_Law

and realize that it's more like "99.9% of the web is crap", you can look at it as a whitelisting problem rather than a blacklisting problem. If you search for "WOW Gold" you're going to get some article from Wired about how people are working 18 hours playing video games under horrific conditions in a Satanic mill somewhere in Shenzhen. And that's it.

Google can't whitelist because of business and political reasons. Smaller companies, particularly vertical focused, can.

As for the prior, I was working w/ Thorsten Joachims and an undergrad years with classifying papers from the physics arXiv. If you want to separate out astrophysics (which was the biggest category) from anything else, the number of negatives in your training set will greatly outnumber the positives, and under this situation the SVM gets the idea that it's safer to bet against astrophysics than it really is. If on the other hand you have a balance number of pos and neg examples, it's also getting a wrong idea.

We tried using the SVM out of the box and had lousy examples and then Joachims told us to try

http://en.wikipedia.org/wiki/Receiver_operating_characterist...

and we found we could tune the cutoff to get performance that was much more satisfying.

Most machine learning books go on for hundreds of pages about Kernel theory and whatever and spend two or three pages on ROC analysis (and it's friends, like logistic regression -> probability score.)

A big problem with things like TREC and Kaggle is that need to pick one definition of "accuracy" so that a whole crowd of intelligent but unwise people can fight for the last 0.2% percent, but it doesn't lead to applications because in the real world the cost of some mistakes is worse than other mistakes, and you could use simple methods and ROC/logistic analysis to make something that maximizes business value with 1/10 the effort.