Machine Learning Fairy Dust (stdout.be)
86 points by stdbrouw on July 18, 2011 | 34 comments



What makes me really nervous is that we're nearing the point when Google's Prediction API and its knock-offs will increasingly pervade web sites much in the way that AJAX and other technologies have. While overuse of AJAX and the Facebook "Like" button is extremely annoying, it's still pretty harmless.

Machine learning, on the other hand, isn't innocuous. In order to use the Prediction API, you need a large corpus of data, which will just further incentivize web sites to ignore the privacy implications of their actions. Machine learning is far too abstract and too much of an "umbrella" term for it to be anything but careless to refer to it as some sort of panacea.

If you thought that Facebook's "Beacon" was a slap in the face to online privacy, just wait until you see what the future holds. Once machine learning libraries with extremely robust, completely unsupervised classifiers become more abundant, we're going to see an exponential increase in the market for data. Banner advertisements will be replaced with much more terrifying 'targeted' ads, and we will enter into an age where we are judged not by the empirical evidence of our actions, but by the inferences drawn from people who behave like us.


> Banner advertisements will be replaced with much more terrifying 'targeted' ads

Can someone please explain to me why ads for stuff I might actually be willing to buy (as opposed to hyper-annoying junk thrown at me every day) terrify so many people?

Not that I am indifferent to privacy issues; just playing devil's advocate here.


First, everyone assumed that no one knew about anything they did on the net. They thought that when they looked at an oven mitt in an online store, only they knew about it. The truth was that the information was simply unused by the store, so the user never saw it.

Gradually, this notion was dispelled as stores started leveraging this information to provide, for example, suggestions. That part is all well and good.

Finally came targeted ads. The part that people find terrifying is when the suggestions are "following them around!". This is creepy in multiple aspects:

The first is that people don't quite understand how this happens (of course, we do). How is this information showing up on different websites? Did the store just let these other sites handle my information? In their eyes, the boundaries of who is allowed access to their personal information seem to blur. This also breaks the mental model of location that the user has: "If I don't go to site X, I won't see anything about site X." Instead, it looks like everyone knows everything.

The second is that it's creepy-through-analogy. The fact that it's going wherever you're going and nagging you constantly is weird. When I walk into a retail store, I'm usually asked if I want any help, and I decline politely. If, however, the salesperson keeps approaching me and trying to sell me things that I don't want, I get the fuck out. With targeted ads, the average user can't do that! The creepy sales guy is following you out to the street and into your home. This usually ends in extreme frustration.

Finally, there are also some nuances that targeted ads miss. Targeted ads are actually not that targeted; they're just there to grab the "low hanging fruit" customers who are on the verge of making a decision. Just because I looked at a dildo once because I thought it was funny doesn't mean I want to be bombarded with the world's finest penis emulators for the next 3 days.


So use ad block and/or incognito mode. Problem solved. And don't say that most people don't know how to use these features. While true, anyone who is concerned about this is totally free to ask other people, or even pay them, to explain and solve the problem for them. The fact that they don't suggests to me that they don't care. Most of the whining about privacy is really the poster being upset that not enough other people are concerned in the same way the poster is.


I'd like to point out that I came by the above insights after a conversation with my aunt (who is in her 40s), accompanied by a few other older relatives.

I do actually care, but those statements were a reflection of their concerns, not mine. While this is not equivalent to a comprehensive study of average computer users across the world, they sure as heck cared but didn't even begin to know where to start, in contrast to what you are hoping will happen.


Thanks for taking time to respond. I agree on some points with you, disagree on others, but it's nice to get a long thought-out response!


The "low hanging fruit" argument is fascinating. Is there any more information about this somewhere?


I think it's the feeling that someone is watching them. We know it's just computers, data, and statistics, but if you didn't understand that, I imagine it would be a bit scarier.


If you act on the internet like you would in real life, it doesn't seem like such a big deal. How is getting relevant ads a bad thing?


Couple years back, I spent a few weeks looking through different jewelry sites for a necklace for my girlfriend.

During this time, I let my girlfriend borrow my laptop one night. Each site that she visited was riddled with ads for engagement rings.

Next thing I know, she's giving me the twenty questions about why I'm looking at engagement rings.

Some things need to be kept private even if they are benign.


Yea, that's going to continue to be the case until someone (cough us, cough) figures out that what you look at or buy on one occasion might not actually be something you're interested in. Speaking specifically of items, this issue is so huge that it has almost become laughable: nearly all the recommendations via "machine learning" (Collaborative Filtering, Trust Systems, etc.) nowadays end up so tainted that they are worthless, both to you and to the companies spitting them out. I call it "Recommendation Blindness" (copyright 2008...20% on usage...etc...etc..sue you..etc.....I'm just kidding of course).


That's why the FSM gave us private browsing modes


And lo, the FSM didst claim it wasn't just for teh pronz. And the people did laugh. And wink. And nudge each other.

Until it was too late. Until your spam filter thought it wasn't spam. And on that day, the privacy geeks didth retort: "told ya so."


So, did you get engaged :)?


oh shit. You make a good point sir.


Because in real life, there are plenty of laws that prevent invasions into privacy. Also, the people capable of doing the same in real life don't have the tools to create truly personal advertisements.

For example, a supermarket chain might use an aggregate of purchase histories (i.e., products most often purchased together) to influence product placement in the store. They won't re-position products for each customer as they walk in the door, though... Online, this can, and does, happen.

It might not seem like a big deal when it's useful. However, there will be many instances of people trying, and failing, at creating a useful product. I think that's a valid point of distinction: when people fail at adopting new technologies right now, the results are mostly harmless. If they fail with machine learning, there could be some major privacy concerns.
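The supermarket example above (an aggregate of products most often purchased together) can be sketched in a few lines. This is only an illustration with made-up baskets; the product names and the `co_purchase_counts` helper are hypothetical, and a real chain would presumably work with far more data and some normalization for item popularity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of products per checkout basket.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "jam"},
    {"beer", "chips", "salsa"},
]

def co_purchase_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same (a, b) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

pair_counts = co_purchase_counts(baskets)

# List the pairs, most frequently co-purchased first.
for pair, n in pair_counts.most_common():
    print(pair, n)
```

The aggregate says nothing about any single shopper, which is the distinction being drawn: the same counting done per visitor, and acted on per visitor, is what becomes possible online.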


Do you really think such a turnkey ML service is possible, though?


Absolutely, especially if it integrates with already-existing applications. For example, I think there's room for a turn-key "plug-in" to phpBB that offers thread-specific advertisements. Same thing with wordpress.


There are NO shortcuts. This is a fantastic article, and he lists a lot of good examples: machine learning, "social", crowdsourcing, AJAX, real-time.

I would add to that list "create a forum." Maybe that's part of "social." In marketing I hear it all the freaking time: you get a half-ass mediocre idea and it always includes some type of "forum" that your customers will somehow recruit themselves into and start to form a community. Most of these people have never been on a forum, so I can't blame them for not knowing how it works, but it is a challenge.


I was nodding my head until I got to the end - it seems like the Google Prediction API _is_ the magic fairy dust we've been waiting for?


I think your comment is an example of the point of view that the author was talking about. The Google Prediction API won't automate the process of grouping comments or stories by content. Someone has to do the work of collecting the data and preparing the corpus, determining the best way to analyze it and prepping the inputs and outputs. There are levels of understanding and effort between having an idea involving machine learning and getting accurate predictions.

The Google Prediction API takes care of the code for algorithmic computation. While that's handy, it's only one step of a much larger process. The scale of that process is something that many people don't fully understand about machine learning (yet).


Machine learning toolkits have been readily available for decades; if you want to use an SVM in your project, you just need to grab an implementation and get going. The trick, as others have described, is getting your data into a usable format, choosing features, and experimenting to find the method that gives the best results. As far as I can tell, the Prediction API doesn't do much to make any of that easier.


I think his point is that it's just a tool that makes it a little easier for startups to incorporate machine learning into their products - like he said, it may be appropriate for some types of problems, but not all. But I'm sure we'll start to see more tools like that become more widely used.

When AJAX first came out, not everyone knew how to do it - but now, everyone can drop in jQuery and do all sorts of complex things relatively easily.


I guess that's a good point. I wonder what kinds of bad implementations of the API we'll see. What a great revenue stream for Google: what startup won't use the API in some way? I know I'm setting it up tonight and using it on at least one project.


Judging from their forum activity, they hardly have any usage.


I think 'machine learning' is so complex that people just don't feel like trying to explain it. That, or their business secrets are tied up in it, and they don't want to give away the golden goose.


That's an explanation for some of the examples, but I think a lot of the time it's actually really simple, along the lines of, "we sift through some data and correlate it". The odd thing is that this often works, especially for user-facing perceptual stuff where there's a strong placebo effect, and even more so if you salt liberally with some hand-tuned biasing. Sort of how The Sims is able to use some super-simple algorithms to give the impression of interesting characters.

However, if you do need some real magic to be done, and your product really won't work without it, then things get trickier; bad statistics, or at least statistics not really used correctly, is really common in the innards of these kinds of products.


The problem with explaining machine learning is that it's not only complex, but it goes against the way that people normally think. Humans generally aren't built for making statistical calculations and going with the most probable outcomes. People are wired to understand narratives and compelling stories. To explain machine learning, you have to bridge a conceptual gap while also discussing a very technical idea.

Personally, I think that the benefits of a product should be so evident that people don't care if machine learning was used or not. The pitch shouldn't be "This aggregator is awesome because of machine learning", but "This aggregator is awesome (oh and we used machine learning)".


> I think 'machine learning' is so complex that people just don't feel like trying to explain it.

I think it is very simple (outside of the secret sauce part) to people who know it, so they don't feel the need to explain it. People who haven't sat down and thought about it see it as magical.

As an example (lifted from Programming Collective Intelligence by Segaran), say you want to recommend movies to people. You have them rate movies. Then you take people in pairs and compare movies they have both rated to produce a distance between those two people. When you want to recommend a movie to Joe, you take the people who are closest to Joe and then find a movie that they rate highly that Joe has not rated, and suggest that to Joe. The secret sauce is in coming up with the distance function.
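For the curious, the recipe above fits in a short sketch. The ratings, the names other than Joe, and the helper functions are all made up for illustration, and plain Euclidean distance over shared movies is just one assumed choice for the distance function (the "secret sauce" part is exactly what this leaves out).

```python
from math import sqrt

# Hypothetical ratings: person -> {movie: score on a 1-5 scale}.
ratings = {
    "Joe": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "Ann": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 5, "Eraserhead": 2},
    "Bob": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1, "Eraserhead": 5},
}

def distance(a, b):
    """Euclidean distance over movies both people rated (inf if none shared)."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return float("inf")
    return sqrt(sum((ratings[a][m] - ratings[b][m]) ** 2 for m in shared))

def recommend(person, k=1):
    """Suggest the unrated movie scored highest by the k closest people."""
    others = [p for p in ratings if p != person]
    nearest = sorted(others, key=lambda p: distance(person, p))[:k]
    candidates = {}
    for p in nearest:
        for movie, score in ratings[p].items():
            if movie not in ratings[person]:
                # Keep the best score any close neighbour gave this movie.
                candidates[movie] = max(score, candidates.get(movie, 0))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(recommend("Joe"))
```

Here Ann's tastes sit much closer to Joe's than Bob's do, so Joe gets the movie Ann liked best among those he hasn't rated. Swapping in a different `distance` changes who counts as "close", which is where most of the real work would go.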


I think this is usually a case of marketing having a bit too much say in product discussions. In the publishing industry, it seems like "My widget does X" doesn't get as strong a reaction from publishers as "My widget does X and it adapts to your readers".

The problem being (of course) that people forget how hard a problem machine learning can be.


It's fine if people want to say that ML will take care of the "details" ...let them try to use ML right and they will see you need to spend a long time understanding how to do things right. Most of the time, you can't use linear regressions right out of the box, let alone SVMs.


Agreed. The use of ML is highly dependent on the data. Having something like the Prediction API is fine, but it seems like the use cases would be rigid.


Yup, and if you commit to it and suddenly realize you need a little more flexibility than the API provides, you're probably in a worse position than if you rolled it yourself.

Real-world ML is so full of black magic and hackery that it's the LAST thing I'd try to sell as a web service.


The cloud will solve it



