Don't tell StackOverflow I'm a hacker (they think I'm a bot) (medusis.com)
100 points by bambax on April 13, 2011 | 46 comments



Hi!

First of all - team@stackoverflow email goes straight to Jeff, and he replies to it personally. There's no secret team of people pretending to be Jeff to deal with customer service requests :-)

I personally believe that any company--or country--that thinks it's a good idea to filter web traffic deserves just about as much collateral damage as they get. If a company or country wants to cut its workers off from the knowledge they need to do their job, honestly, I would like to see that company or country get beaten to a pulp by the forces of evolution.

But... I feel for you. There are lots of websites that use our API to provide alternate views on Stack Overflow at other addresses, for example, http://sa.column80.com/?api=0 has a nice lynx-compatible browser.


I get "Fatal error: The user ID must be numeric in /home/markness/column80-sofu/phpstack.class.php on line 419" when trying to read anything


I think you need to enable cookies. That fixed the error for me.


I'm surprised that people are scraping their content via bots, given that it's licensed under CC and freely available to download. (http://blog.stackoverflow.com/category/cc-wiki-dump/)


I'm guessing that these are people who either want something more up to date than the data dump (which is monthly, if I recall correctly) or who have scraping infrastructure set up and working, and find it easier to use that than to work with the schema.


It's bi-monthly nowadays. Even so, you can use the API instead; it's there for a reason.

It's too bad for this guy that he simply can't access SO the normal way, but that's not SO's fault.


I agree. SO is great but we all need to be realistic about situations where for one reason or another we're edge cases.


I'm a bit surprised they just block the addresses rather than rate-limiting them.


Whilst it's not ideal, experiencing spammer abuse has led me down similar paths on one of my web apps. Sadly, solutions like this free up time for core functionality, which I'd much rather spend my time developing, considering that abuse prevention is not exactly a feature you can charge for.


Yes, I understand that, and scrapers certainly are a big pain; but shouldn't SO expend a little more effort to avoid false positives...?


It seems possible for them to set up the block as a response to request rate/patterns, but as I am unfamiliar with EC2 I have to ask: how hard is it to get a new IP address? If it is a matter of stopping an instance and starting another (or easier), blocking all of EC2 may be their only move.

Once the scraper isn't tied to a single IP, I don't see how to filter out abusive requests from the legitimate ones. It will cause the scraper to emulate real-user behavior more and more to defeat the barriers. At the end of the cat and mouse game SO's only option is to block the IP range, or allow scraping from EC2.


It probably saves them a whole load of pain. And you are rather a special case: how many non-bots try to access Stack Overflow via EC2, do you think?

There are plenty of other cheap VPS providers anyway...


> There are plenty of other cheap VPS providers anyway...

Certainly, but what am I supposed to do, hop from provider to provider according to who (blanket-)blocks what...?

> you are rather a special case

I set up this VPN following instructions from a blog post from 2009; I'm guessing there are many people who did the same?

But if I'm going to die as collateral damage, I don't intend to succumb quietly! ;-)


While I don't agree with an employer or job site filtering web access, following a two-year-old blog post on circumventing a client's security measures borders on irresponsible and stupid. You may think of it as only bypassing some web filtering, but by establishing a VPN connection across their border, you've potentially opened their network up to any insecurities in your machine or your EC2 instance, with a direct link past their border that your client is completely unaware of.

If you need to perform work on a client site that isn't permitted over their network, you should bring your own connectivity. I carry a cheap CDMA modem for that very purpose. Just don't expose your client to additional risks so that you can read SO.


Certainly. If someone's looking for a cheap VPS to tunnel through, http://www.lowendbox.com/ provides a listing of all the cheapest deals out there. While they won't provide much in the way of disk space or processing power, they're perfect for tunnels.


If you just need proxy support to get around restrictions, EC2 is an expensive way of doing it (assuming you're using it 8 hours a day).

You can get a VPS (http://www.lowendbox.com/ tracks various offers) for as little as 3 USD/month, and have it 24/7/365. Even if you're using an EC2 micro instance, you'd have to use it less than 150 hours a month (out of 730) to get ahead.


> If you just need proxy support to get around restrictions, EC2 is an expensive way of doing it (assuming you're using it 8 hours a day).

Not for the first year on the free tier.


Isn't the EC2 cost 1.68 USD/month for your assumed usage pattern? (0.007 USD/hour * 8 hour/day * 30 day/month)


I took 0.02 USD/hour for on-demand micro instance; 0.007 USD/hour also requires a $52/yr or $82/3yr fixed cost on top.

I have no idea how to predict the spot price, and I haven't tried to collect statistics on it, so I (personally) wouldn't choose it for this kind of application, where you expect it to be there all the time, not just when it happens to be cheap.

I mean, 3 USD/month, or 30 USD/year, is cheap enough to pay once and forget about, rather than worrying about turning it on or off.
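The break-even point discussed in this sub-thread can be checked with a few lines of Python. This is only a sketch using the 2011 prices quoted in the comments (0.02 USD/hour on-demand, 3 USD/month for a cheap VPS); the function name `break_even_hours` is made up for illustration.

```python
def break_even_hours(flat_monthly_usd: float, hourly_usd: float) -> float:
    """Hours per month below which paying by the hour beats a flat monthly fee."""
    return flat_monthly_usd / hourly_usd


# Prices as quoted in the thread (2011); treat them as illustrative only.
ON_DEMAND_MICRO = 0.02  # USD/hour, on-demand micro instance
VPS_FLAT = 3.00         # USD/month, cheap VPS

hours = break_even_hours(VPS_FLAT, ON_DEMAND_MICRO)
print(f"On-demand EC2 is cheaper below {hours:.0f} hours/month")

# Cost of 8 hours/day for 30 days at the on-demand rate.
print(f"8 h/day usage costs {ON_DEMAND_MICRO * 8 * 30:.2f} USD/month")
```

Swapping in the 0.007 USD/hour reserved rate would also require adding the amortized upfront fee to the hourly side of the comparison.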


> ...so I (personally) wouldn't choose it for this kind of application, where you expect it to be there all the time, not just when it happens to be cheap.

It's not very difficult to set up a script to spin up an on-demand instance if your spot instance gets nuked. Or to just buy another spot instance at a higher price.

https://github.com/boto/
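A rough sketch of such a fallback script follows. It assumes boto3 (the current successor to the boto library linked above) rather than the 2011-era boto API, and it takes the EC2 client as a parameter so the logic can be exercised without AWS credentials. The function name and the IDs in the usage comment are placeholders, not real resources.

```python
def ensure_tunnel_instance(ec2, spot_request_id, image_id, instance_type):
    """Return an instance ID for the tunnel, falling back to on-demand.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    resp = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=[spot_request_id]
    )
    request = resp["SpotInstanceRequests"][0]

    # An "active" spot request has an instance attached to it.
    if request["State"] == "active" and request.get("InstanceId"):
        return request["InstanceId"]

    # The spot instance is gone (or was never fulfilled): launch on-demand.
    launched = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return launched["Instances"][0]["InstanceId"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("ec2")
#   ensure_tunnel_instance(client, "sir-EXAMPLE", "ami-EXAMPLE", "t3.micro")
```

Run from cron or a loop, this keeps one endpoint available; it does not clean up the old spot request or move an IP address over, which a real setup would also need to handle.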


Thanks for the clarification -- that makes sense. Given Amazon's 1-year-free offer and its excellent documentation, I'd still use an EC2 instance for this purpose.


The cost of a micro EC2 instance per month is:

- 82 / 3 / 12 = 2.28 (fixed monthly cost for a reserved instance)

- .007 * 24 * 30 = 5.04 (variable monthly cost)

for a total of $7.32, assuming it's running all the time. If it's running only during business hours, the variable monthly cost becomes:

- .007 * 8 * 20 = 1.12

But the thing is, even the micro instance does almost NOTHING when used as a VPN endpoint (CPU and memory are flat) so you can use the machine for anything else (hosting, dealing with other VPN clients, etc.)
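The reserved-instance arithmetic in this comment can be reproduced as follows. It is a minimal sketch using the 2011 figures quoted here (82 USD upfront for 3 years, 0.007 USD/hour); the function name is invented, and totals can differ by a cent from hand-rounded values depending on where the rounding happens.

```python
def reserved_monthly_cost(upfront_usd, term_years, hourly_usd, hours_per_month):
    """Amortized upfront fee plus usage charge, in USD per month."""
    fixed = upfront_usd / (term_years * 12)
    variable = hourly_usd * hours_per_month
    return fixed + variable


# Figures as quoted in the comment above; illustrative only.
always_on = reserved_monthly_cost(82, 3, 0.007, 24 * 30)
business_hours = reserved_monthly_cost(82, 3, 0.007, 8 * 20)

print(f"always on:      {always_on:.2f} USD/month")
print(f"business hours: {business_hours:.2f} USD/month")
```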


The same thing - doing almost nothing when a VPN endpoint - applies to other VPSes at half or less the cost, e.g. http://www.lowendbox.com/blog/forever-hosting-19-95year-256m... - 20 USD/year. Now these are smaller instances, likely more contended, but that's precisely the point - when you don't need much, there's not much value in over-provisioning.

Anyhow, I've said my piece; I think EC2 is over-priced for this specific task; there are other options; and (at least in my case) they aren't blocked by SO, owing to not being as mainstream.


thanks for posting lowendbox.com, it's been added to my list of sites I will use one day.


How are the entire EC2 IP blocks known?

I doubt they rDNS every connection that comes in, too expensive.

Ah, here's a list but it's from 2007, might be out of date

https://forums.aws.amazon.com/message.jspa?messageID=106925#...


You only have to do the lookup once for each /24, and presumably you'd cache the response for a while. The lookup doesn't need to be done online either.
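One way to picture the per-/24 caching idea is the sketch below. It is an assumption about how such a cache might look, not a description of what Stack Overflow runs. The class name is made up, the resolver is injectable so it can be swapped for an offline lookup, it handles IPv4 only, and cache expiry is left out for brevity.

```python
import ipaddress
import socket


def reverse_dns(ip: str) -> str:
    """Reverse-resolve an address; returns '' when there is no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""


class Slash24Cache:
    """Remember one reverse-DNS answer per IPv4 /24 network."""

    def __init__(self, resolver=reverse_dns):
        self._resolver = resolver
        self._by_network = {}

    def hostname_for(self, ip: str) -> str:
        # Collapse the address to its enclosing /24, e.g. 192.0.2.10 -> 192.0.2.0/24.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        if network not in self._by_network:
            # Only the first address seen in each /24 triggers a lookup.
            self._by_network[network] = self._resolver(ip)
        return self._by_network[network]
```

This treats the first answer as representative of the whole /24, which is the simplification the comment proposes; it will misclassify blocks that are split between owners.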


Here's the official list from Amazon: https://forums.aws.amazon.com/ann.jspa?annID=986
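Given a published list of ranges, checking whether a client address falls inside one of them is straightforward. The sketch below uses Python's standard `ipaddress` module; the ranges shown are placeholder documentation blocks from RFC 5737, not actual EC2 ranges, and `is_blocked` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real EC2 ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str, ranges=BLOCKED_RANGES) -> bool:
    """Return True if `ip` falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in ranges)


print(is_blocked("192.0.2.55"))
print(is_blocked("203.0.113.9"))
```

A linear scan is fine for a few dozen ranges; a much longer list would call for a sorted structure or a prefix tree.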


StackOverflow once blocked the entire country I'm in (would just return a 403). I emailed them and was told they can re-enable it:

"It's not a problem, I just need reassurance that there won't be RssNotifiers hitting us 1,000 times a day and pulling down uncompressed data."

To be fair, I'm sure they have more important things to do than work on complex rate limiting and abuse detection code, especially for edge cases like small countries or EC2.


You probably have ssh access to more machines than just that EC2 instance? Tunnel your traffic through an ssh account.

ssh -D 10001 you@someplace.com

So long as someplace.com is not blocked by SO, you can point your browser's traffic at your now-running local SOCKS proxy in the browser's internet connectivity settings. Set all traffic to go through 127.0.0.1:10001.
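Before reconfiguring the browser, it can help to confirm that the local proxy is actually listening. The sketch below sends a minimal SOCKS5 greeting (as defined in RFC 1928) to the port used in the `ssh -D` command above and checks the reply. The function name is made up, and it assumes the proxy accepts unauthenticated SOCKS5, which is how OpenSSH dynamic forwarding behaves.

```python
import socket


def socks5_proxy_is_up(host="127.0.0.1", port=10001, timeout=3.0) -> bool:
    """Return True if something at host:port answers a SOCKS5 greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Greeting: version 5, one method offered, 0x00 = "no authentication".
            sock.sendall(b"\x05\x01\x00")
            reply = sock.recv(2)
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False
    # A SOCKS5 server answers with version 5 and the method it selected.
    return reply == b"\x05\x00"
```

Calling `socks5_proxy_is_up()` with the defaults checks 127.0.0.1:10001; it only tests the handshake, not whether the remote end can reach any particular site.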


I've hit this issue as well, since we use a lot of VMs on EC2.


> “We’ll just block all of EC2” seems not only excessively broad but, well, lazy.

This conclusion bugged me a bit, in part because I'm old enough to remember when programmers considered laziness to be a virtue. But also because it's not evidence-based. If it were a problem faced by a significant portion of SO readers, I suspect they'd find another way to address it. But if it's a problem only experienced by a couple of readers a year, then spending much time on it would be an "industrious" misuse of resources.


EC2 is a big issue for a lot of things. For example, while I was maintaining a credit card donation form for a non-profit, a large chunk of fraudulent submissions came from EC2 addresses - actually more than came from African ISPs.

So I too blocked the whole of EC2; the logic is pretty straightforward, in that a legitimate customer originating requests from there is highly unlikely compared to the other options.


Same exact situation here--need a VPN to access some sites from China. No worries though--just make an exclusion rule for SO!


Adding the "api" parameter is also useful if you want to share the URL with someone else. For example, http://sa.column80.com/?q=1860&api=77 is a URL that still works for someone who is not in the same session.


I found out how to do without cookies: add the "api" parameter to the query string when retrieving a message. You might have to do it manually every time, unless you can write a program to do it for you.
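A small helper along those lines might look like this. It only rewrites the URL with the standard library; what the numeric value of "api" means on that site is not explained in this thread, so the value is passed in by the caller, and `with_api_param` is a hypothetical name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_param(url: str, api_value: str) -> str:
    """Return `url` with its `api` query parameter set to `api_value`."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous `api`, then append ours.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "api"
    ]
    query.append(("api", api_value))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_api_param("http://sa.column80.com/?q=1860", "77"))
```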


EC2 is always marked as spam if you try to send email from an instance too. It would be nice if Amazon could create some tier of white listed, verified non-spam instances or something.


they have spam as a service for that http://aws.amazon.com/ses/


Couldn't agree more. Blanket banning of something is rarely good.


You'd think they could rate-limit requests coming from EC2 IP addresses rather than blanket banning, but I guess value-wise it's cheaper to block all of them, given the low likelihood of it being a real person.
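For comparison with a blanket ban, here is a minimal token-bucket sketch of the rate-limiting alternative. It is an illustration under assumptions, not anything Stack Overflow is known to run: the class name, the rate, and the choice of key (a single address, a /24, or one shared key for a whole suspect range) are all hypothetical.

```python
import time


class TokenBucketLimiter:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._buckets = {}  # key -> (tokens remaining, time of last update)

    def allow(self, key) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(key, (self.capacity, now))
        # Refill in proportion to the time elapsed, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self._buckets[key] = (tokens, now)
        return allowed
```

Keying the limiter on a whole range rather than a single address addresses the point raised earlier in the thread that a scraper can cheaply switch IPs, at the cost of legitimate users in that range sharing the same budget.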


I often tunnel my traffic through an EC2 instance when I'm on public WiFi and I've run into this problem with a number of sites.

Most notably (and annoying when out in the city), Yelp.


> Most notably (and annoying when out in the city), Yelp.

Confirmed! (but I don't use Yelp so I wouldn't notice)


Hay! I try to use it on gopher?


I'm with StackOverflow on this one. If I were them, I wouldn't waste my time fine-tuning the detection, and would just block all of Amazon EC2.


It's a business decision.

Easy to do ... instead of having someone spend time trying to make it work for 0.005% of Stack Overflow use cases (yes, I pulled that number entirely out of my rectum. Chill out.) ... sucks, but I get it.


> “We’ll just block all of EC2” seems not only excessively broad but, well, lazy.

Well, yeah. It's a Windows application, and lazy is the name of the game on Windows.


Why not just use TOR with your EC2?



