First of all - team@stackoverflow email goes straight to Jeff, and he replies to it personally. There's no secret team of people pretending to be Jeff to deal with customer service requests :-)
I personally believe that any company--or country--that thinks it's a good idea to filter web traffic deserves just about as much collateral damage as they get. If a company or country wants to cut its workers off from the knowledge they need to do their job, honestly, I would like to see that company or country get beaten to a pulp by the forces of evolution.
But... I feel for you. There are lots of websites that use our API to provide alternate views on Stack Overflow at other addresses, for example, http://sa.column80.com/?api=0 has a nice lynx-compatible browser.
I'm guessing that these are people who either want something more up to date than the data dump (which is monthly, if I recall) or who have scraping infrastructure set up and working and it's easier to use that than to work with the schema.
Whilst it's not ideal, currently experiencing spammer abuse has led me down similar paths on one of my web apps. Sadly, solutions like this free up time for more core functionality - which I'd much rather spend my time developing, considering that abuse prevention is not exactly a feature you can charge for.
It seems possible for them to set up the block as a response to request rate/patterns, but as I am unfamiliar with EC2 I have to ask: how hard is it to get a new IP address? If it is a matter of stopping an instance and starting another (or easier), blocking all of EC2 may be their only move.
Once the scraper isn't tied to a single IP, I don't see how to filter out abusive requests from the legitimate ones. It will cause the scraper to emulate real-user behavior more and more to defeat the barriers. At the end of the cat and mouse game SO's only option is to block the IP range, or allow scraping from EC2.
While I don't agree with an employer or job site filtering web access, following a two year old blog post on circumventing a client's security measures borders on irresponsible and stupid. You may think of it as only bypassing some web filtering, but by establishing a VPN connection across their border, you've potentially opened their network up to any insecurities in your machine or your EC2 instance with a direct link past their border that your client is completely unaware of.
If you need to perform work on a client site that isn't permitted over their network, you should bring your own connectivity. I carry a cheap cdma modem for that very purpose. Just don't expose your client to additional risks so that you can read SO.
Certainly. If someone's looking for a cheap VPS to tunnel through, http://www.lowendbox.com/ provides a listing of all the cheapest deals out there. While they won't provide much in the way of disk space or processing power, they're perfect for tunnels.
If you just need proxy support to get around restrictions, EC2 is an expensive way of doing it (assuming you're using it 8 hours a day).
You can get a VPS (http://www.lowendbox.com/ tracks various offers) for as little as 3 USD/month, and have it 24/7/365. Even if you're using an EC2 micro instance, you'd have to use it less than 150 hours a month (out of 730) to get ahead.
I took 0.02 USD/hour for on-demand micro instance; 0.007 USD/hour also requires a $52/yr or $82/3yr fixed cost on top.
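For anyone who wants to redo that break-even with their own numbers, here is a minimal sketch. The prices are just the figures quoted above (a snapshot, not current pricing), and the function names are mine, made up for illustration.

```python
# Figures quoted in this thread; treat them as a snapshot, not current pricing.
VPS_USD_PER_MONTH = 3.00
EC2_MICRO_USD_PER_HOUR = 0.02
HOURS_PER_MONTH = 730  # 8760 hours in a year / 12


def break_even_hours(flat_monthly, hourly_rate):
    """Hours per month at which pay-per-hour costs the same as the flat fee."""
    return flat_monthly / hourly_rate


def on_demand_monthly_cost(hours, hourly_rate=EC2_MICRO_USD_PER_HOUR):
    """Monthly bill for an on-demand instance that runs `hours` hours."""
    return hours * hourly_rate


if __name__ == "__main__":
    hours = break_even_hours(VPS_USD_PER_MONTH, EC2_MICRO_USD_PER_HOUR)
    print(f"break-even: about {hours:.0f} hours/month")
    print(f"8 h x 20 days: {on_demand_monthly_cost(8 * 20):.2f} USD")
    print(f"always on:     {on_demand_monthly_cost(HOURS_PER_MONTH):.2f} USD")
```

Below the break-even the on-demand instance is cheaper; above it the flat-rate VPS wins.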
I have no idea how to predict the spot price, and I haven't tried to collect statistics on it, so I (personally) wouldn't choose it for this kind of application, where you expect it to be there all the time, not just when it happens to be cheap.
I mean, 3 USD/month, or 30 USD/year, is cheap enough to pay once and forget about, rather than worrying about turning it on or off.
...so I (personally) wouldn't choose it for this kind of application, where you expect it to be there all the time, not just when it happens to be cheap.
It's not very difficult to set up a script to spin up an on-demand instance if your spot instance gets nuked. Or to just buy another spot instance at a higher price.
Thanks for the clarification -- that makes sense. Given Amazon's 1-year-free offer and its excellent documentation, I'd still use an EC2 instance for this purpose.
- 82 / 3 / 12 = 2.28 (fixed monthly cost for a reserved instance)
- .007 * 24 * 30 = 5.04 (variable monthly cost)
for a total of $7.32, assuming it's running all the time. If it's running only during business hours, the variable monthly cost becomes:
- .007 * 8 * 20 = 1.12
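The same arithmetic as a small sketch, in case the rates change. The inputs are the reserved-instance figures from this thread ($82 for 3 years, 0.007 USD/hour); the function name and its parameters are hypothetical.

```python
def reserved_monthly_cost(upfront_usd, term_years, hourly_rate, hours_per_month):
    """Return (fixed, variable, total) monthly cost for a reserved instance."""
    fixed = upfront_usd / (term_years * 12)   # upfront fee spread over the term
    variable = hourly_rate * hours_per_month  # usage billed by the hour
    return fixed, variable, fixed + variable


if __name__ == "__main__":
    scenarios = [("always on", 24 * 30), ("business hours", 8 * 20)]
    for label, hours in scenarios:
        fixed, variable, total = reserved_monthly_cost(82, 3, 0.007, hours)
        print(f"{label}: fixed {fixed:.2f} + variable {variable:.2f} = {total:.2f} USD")
```

Note that the fixed part is owed whether or not the instance is running, so it dominates once usage drops to business hours.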
But the thing is, even the micro instance does almost NOTHING when used as a VPN endpoint (CPU and memory are flat) so you can use the machine for anything else (hosting, dealing with other VPN clients, etc.)
The same thing - doing almost nothing when a VPN endpoint - applies to other VPSes at half or less the cost, e.g. http://www.lowendbox.com/blog/forever-hosting-19-95year-256m... - 20 USD/year. Now these are smaller instances, likely more contended, but that's precisely the point - when you don't need much, there's not much value in over-provisioning.
Anyhow, I've said my piece; I think EC2 is over-priced for this specific task; there are other options; and (at least in my case) they aren't blocked by SO, owing to not being as mainstream.
StackOverflow once blocked the entire country I'm in (would just return a 403). I emailed them and was told they can re-enable it:
"It's not a problem, I just need reassurance that there won't be RssNotifiers hitting us 1,000 times a day and pulling down uncompressed data."
To be fair, I'm sure they have more important things to do than work on complex rate limiting and abuse detection code, especially for edge cases like small countries or EC2.
You probably have ssh access to more machines than just that EC2 instance? Tunnel your traffic through an ssh account.
ssh -D 10001 you@someplace.com
So long as someplace.com is not blocked by SO, you can point your browser's traffic at your now-running local SOCKS proxy in the browser's internet connectivity settings. Set all traffic to go through 127.0.0.1:10001
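If the browser still can't get out, it helps to check that the tunnel is actually listening before blaming the proxy settings. A rough sketch, assuming the ssh -D 10001 command above is running: it sends a SOCKS5 greeting to the local port and looks for the standard "no authentication" reply. The function name is made up for illustration.

```python
import socket


def socks5_proxy_is_up(host="127.0.0.1", port=10001, timeout=3.0):
    """Return True if something at host:port answers a SOCKS5 greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Greeting: version 5, one auth method offered, method 0 = no auth.
            conn.sendall(b"\x05\x01\x00")
            reply = conn.recv(2)
    except OSError:
        # Connection refused, timed out, or reset: treat as "not up".
        return False
    # A SOCKS5 server that accepts "no auth" replies with version 5, method 0.
    return reply == b"\x05\x00"


if __name__ == "__main__":
    print("proxy up" if socks5_proxy_is_up() else "proxy not reachable")
```

This only confirms the local end of the tunnel; whether someplace.com itself can reach SO is a separate question.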
“We’ll just block all of EC2” seems not only excessively broad but, well, lazy.
This conclusion bugged me a bit, in part because I'm old enough to remember when programmers considered laziness to be a virtue. But also because it's not evidence-based. If it were a problem faced by a significant portion of SO readers, I suspect they'd find another way to address it. But if it's a problem only experienced by a couple of readers a year, then spending much time on it would be an "industrious" misuse of resources.
EC2 is a big issue for a lot of things. For example, while I was maintaining a credit card donation form for a non-profit, a large chunk of fraudulent submissions came from EC2 addresses - actually more than came from African ISPs.
So I too blocked the whole of EC2; the logic is pretty straightforward, in that a legitimate customer originating requests from there is highly unlikely compared to the other options.
Adding the "api" parameter is also useful if you want to copy the URL to someone else. So you can write such things as: http://sa.column80.com/?q=1860&api=77 Now it is a URL that can be copied to someone else that is not on the same session.
I found out how to do without cookies: add the "api" parameter to the query string when retrieving a message. You might have to do it manually every time, unless you can write a program to do it for you.
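A program to do it could be as small as this sketch. It only rewrites the query string; what a given "api" value means to the site is not something I know, so the 77 below is just the value from the example URL, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_param(url, api_value):
    """Return url with its "api" query parameter set to api_value."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous "api" value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "api"]
    query.append(("api", str(api_value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_api_param("http://sa.column80.com/?q=1860", 77))
```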
EC2 is always marked as spam if you try to send email from an instance too. It would be nice if Amazon could create some tier of white listed, verified non-spam instances or something.
You'd think they could rate limit requests coming from EC2 IP addresses rather than blanket banning, but I guess value-wise it's cheaper to block all of them given the low likelihood of it being a real person.
Easy to do ... instead of having someone spend time trying to make it work for 0.005% of Stack Overflow use cases (yes. I pulled that number entirely out of my rectum. chill out.) ... sucks, but I get it.
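For what it's worth, the rate-limiting alternative might look roughly like this: a token bucket per client address, applied only to addresses inside a list of cloud ranges. This is a sketch and not how SO does anything. The network below is a documentation placeholder (TEST-NET-3), not a real EC2 range, and the class names, rate, and burst are all made up.

```python
import ipaddress
import time

# Placeholder range (TEST-NET-3, RFC 5737), NOT a real EC2 block.
CLOUD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


class TokenBucket:
    """Allows `burst` requests at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, now=None):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CloudRateLimiter:
    """Throttles only clients whose address falls inside the listed networks."""

    def __init__(self, networks=CLOUD_NETWORKS, rate_per_sec=1.0, burst=5):
        self.networks = list(networks)
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # one bucket per client address

    def is_cloud(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.networks)

    def allow(self, ip, now=None):
        if not self.is_cloud(ip):
            return True  # ordinary addresses are not throttled here
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)
```

Keying the buckets by address does nothing against a scraper that keeps switching addresses; keying by network instead would cap the whole range at once, at the price of throttling any real people in it too.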