Hacker News | davidjairala's comments

I've actually been working on something like this for a while now, and found your comment about proprietary data interesting. Would this mean that hosting this data in a third party server is out of the question for you? OK with NDA?


What would a reasonable price for this product be for you?


Anything that bills by data volume instead of by attendant would be a good start.


Would you say the customer experience was that much better before the regulations in 2008? Cause that hasn't been my experience personally.


I personally find myself going through this with most aspects of my life. I have lulls at work where I'm less than 100% excited about the work at hand, and then after a while I'm back in top shape, but I now find it normal when the lulls occur.

It happens to me with most other hobbies as well. I enjoy working out a lot, but there are times when I just don't feel super excited about it for, let's say, a month or so, and I have to put in a lot of willpower to get to the gym. So I do basically what you do here as well: try different kinds of workouts, etc. The results vary, but it helps sometimes.

I think at the end of the day it's really hard to maintain that level of passion for something that you do day in day out. You just have to accept the downtime and know that it will pass in a bit. At least that's my take on it.

Edit: right after I posted I was reminded of a quote lifters use often: "motivation gets you in the gym, as it's fleeting, but discipline keeps you coming back". I think the same applies here, to some extent.


From personal experience, it's quite the headache. Even if you stay within legal parameters, you will run into site owners who are less than thrilled about what you're doing (possibly understandably so).

I ran into several people who wrote cease-and-desist letters, which I honored, and into several others who started banning our IP addresses, disallowing us specifically via robots.txt, etc. There are obviously ways to get around these issues, but the main question is, morally, would you want to go around them? Are you willing to go against website owners who flat out don't want you scraping their data? Would you be willing to fight them legally for your right to do so?

Ultimately, that's what it came down to for me: I just felt really crappy about it and stopped.


Agreed that it can be a headache, but wanted to offer an alternative perspective.

Personally, I feel that inclusion in Google constitutes public access to the data. As long as I'm not logged into an account on their system, I feel ethically justified about scraping their data.

In other words, I do not feel compelled to respect robots.txt if that file does not also block googlebot.

Legally it may be another issue, but ethically I consider inclusion in Google as an announcement that this information is public.


Ignoring/bypassing robots.txt is probably a bad idea unless you're going to never even look for it and are going to try to plead incompetence if someone comes after you.

In the early stages you probably won't be robots.txt'd because you're insignificant.

In later stages, you're hoping to not be robots.txt'd because you're providing a worthwhile service not just for users but for the site.

At neither stage should you force companies that want you not indexing their content to go beyond basic means (robots.txt) because the more serious measures are all going to cost them more money (tracking / blocking your IPs, C&D, DMCA requests to your provider requesting that the entire site be taken down because there are thousands of infringing items, lawsuits seeking (damages | court costs | costs for dealing with your circumvention of technical measures to keep you out of the site), finding of friendly prosecutors, etc.).

You don't want to go down that more expensive road.


Also worth mentioning: as long as you're scraping facts and combining them in a novel way, copyright law is much less relevant.

This operates in what I consider a legal grey area. Not making it obvious that you're scraping, scraping only public information, transforming the results, and proxying your requests all contribute to lowering the legal profile (which is my only concern, as I feel I am acting within my own ethical limits).


Eek. This is only kinda true. You ought to talk to a copyright lawyer and get a handle on derivative works and data compilations. You can get started by reading this Supreme Court case:

Feist Publications, Inc. v. Rural Tel. Service Co., 499 U.S. 340 (1991) https://casetext.com/case/feist-publications-inc-v-rural-tel...

Disclaimer: IAAL but IANYL.


In your opinion, how much of that situation's complexity is eliminated simply by scraping the Google cache of a site?

I also wonder how possible it is to hide behind proxies, especially if they are owned by entities in other countries. If a site I'm scraping is unable to identify who does the scraping, it seems difficult for them to prove "this guy uses our data and must be scraping us".


The more you have to jump through hoops to get the data (or hide that you're getting it or that you're the one getting it), the more it sounds like doing this for the wrong reasons.

Also, since this is presumably something you're going to be doing as a hobby (money creates trails), the unfortunate reality is that "right" and "wrong" in copyright law matter much less than "Oh crap, I'm being sued for $500k in $further_away(New York|California), how do I defend this?" That's why you don't ignore the polite way of saying "go away" which is robots.txt or the rude way which is a C&D - if a lawsuit (the mean way) is the first communication you have from a company, odds are pretty good that an attorney can help because judges are busy and don't want lawsuits to be the first thing unhappy companies try.


I understand what you're saying, we just come from very different perspectives. Most of my results are after significant transformation and combination, resulting in models to test against. I'm not very concerned with copyright violation, as I rarely (never?) re-publish copyrighted information.

Have there been any court cases where a person scraping public information has been found in the wrong? I know of the LinkedIn case from Jan 2014, but in that case the offenders were creating LI accounts to scrape private information. I believe that Craigslist lost its case against e.g. padmapper, didn't they?

While I respect what you're saying in your first sentence, I view it differently. Setting aside the legal issues, I see it as someone trying to control use in a public space. I don't consider that a valid reason -- if it's public, I can consume it. Avoiding detection is a reaction to sites trying to create rules that I interpret as invalid.

If a company tried to block off a public road without legal backing, I would consider it not only my right but also my duty to traverse that road. [mediocre analogy, but it does represent my opinion fairly accurately.]


The things that jump out at me there are "that I interpret as invalid" and "Have there been any court cases where a person scraping public information has been found in the wrong?"

Tackling the second one first, I'd like to rephrase that: "Have all the court cases where a person was scraping public information been found in their favor and they were awarded all attorney fees and expenses?"

As far as "that I interpret as invalid" the courts exist to decide between varying interpretations of rights and laws. I've never heard that "inexpensively" was expected to be part of that description. I'm not saying that you're wrong - I'm just saying that there's a significant difference between "I'm taking on a coding and data analysis project" and "I'm taking on a coding and data analysis project with a big helping of legal distractions."

I'm not fully up on the Craigslist vs padmapper/3taps case - was it ever actually fully decided? And how much did fighting that case cost 3taps? Looking at the statement on their website, it doesn't sound all that victorious, and I can't help but suspect that, even ignoring whatever financial impact there was, the distraction and demands of the case must have had a serious effect on any projects 3taps was working on (or considering and back-burnering) during that time.

As a counterexample since you said you were going to be keeping and displaying thumbnails, I'll toss out the artwork from "Kind of Bloop" (see http://waxy.org/2011/06/kind_of_screwed/) which was a highly-pixelated (and maybe only 8-color?) transformation of a photo of Miles Davis. TL;DR, Andy Baio ended up paying ~$32k to settle the case not because he thought he was wrong but because it was the least expensive option.

I'm not saying don't do it - I'm just saying that you should go into it with your eyes open and don't do things that will exacerbate any non-technical problems you may run into. That may be a chilling effect, but at least you can bring a coat.


Your public road analogy is very wrong. A better analogy would be a private road with a sign saying "Google streetview welcome. runbycomment stay out." Would you feel entitled to drive down the private road? Would the owner allowing Google to drive down the road make you feel entitled to do it?

We aren't discussing a public space. We're talking about a private server. They pay for hosting and bandwidth. Why do you feel entitled to use it?


Why is ignoring robots.txt a bad idea? The information's being made publicly accessible...


For the crass and practical reason, because A) Anyone can sue for anything (caveat: as long as it's not so egregiously stupid as to get them slapped down by a judge) B) techies' definitions of "egregiously stupid" and judges' definitions of "egregiously stupid" may not have very much overlap

As a simple example imagine that the owner of a local shop REALLY didn't like you to the extent that he had his door painted with six-inch letters at eye level "PNathan KEEP OUT!" It's a publicly accessible shop, but if you walk in and he calls the police, will your having ignored that sign make a difference in their interactions with you? How about if you've both ignored that sign and come in wearing a disguise?


I like the analogy.

> ...will your having ignored that sign make a difference in their interactions with you?

For sure.

> How about if you've both ignored that sign and come in wearing a disguise?

Not if he finds out. But the disguise will make it even worse if he does find out.

So it boils down to: Can you hide yourself well enough to not be detected? (That includes being detected because you show information that was presumably crawled, not just having the crawling itself detected.) It is a risk that you may take by weighing the assumed loss (a court case) against the gain (money from using the crawled data).

I may add: A clever "data provider" will inject some hidden beacons into their data that make it easy for them to later detect that data on other websites. So actually you can always be detected, because you must have crawled that data from them.
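
To make the beacon idea concrete, here is a minimal sketch of one way a provider might do it. Everything in it is an assumption for illustration: the secret key, the `listing-...` record format, and the function names are hypothetical, and real providers may use entirely different techniques. Each client gets one fake record derived from its identity, so if that record later shows up elsewhere, the provider knows which client the data came from.

```python
import hashlib
import hmac

# Hypothetical provider-side secret; not from any real system.
SECRET = b"provider-secret"


def canary_for(client_id: str) -> str:
    """Derive a fake record that is unique to one client."""
    digest = hmac.new(SECRET, client_id.encode(), hashlib.sha256).hexdigest()
    return f"listing-{digest[:12]}"


def serve(records: list[str], client_id: str) -> list[str]:
    """Return the real records plus this client's canary record."""
    return records + [canary_for(client_id)]


def find_leak(published: list[str], client_ids: list[str]) -> list[str]:
    """Return the clients whose canary appears in data published elsewhere."""
    seen = set(published)
    return [c for c in client_ids if canary_for(c) in seen]


# Usage: data served to one client later turns up on another site.
served = serve(["real-listing-1", "real-listing-2"], "10.0.0.7")
print(find_leak(served, ["10.0.0.1", "10.0.0.7"]))
```

Transforming or aggregating the data before publishing would defeat a beacon this simple, which is part of why the "transform the results" advice upthread matters.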


Just to make sure I understand your reply correctly, are you saying that if a robots.txt file disallows your specific crawler but allows googlebot you'd see no problem with crawling it?
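
For anyone unsure what that situation looks like in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `examplebot` name, and the example.com URL are all made up for illustration; the point is only that a site can welcome Googlebot while turning away one named crawler, and that a crawler can check this before fetching anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is welcome, "examplebot" is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: examplebot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/listings"
print(is_allowed(ROBOTS_TXT, "Googlebot", url))
print(is_allowed(ROBOTS_TXT, "examplebot", url))
```

Note that robots.txt is purely advisory: nothing in the check above stops a crawler from fetching the page anyway, which is exactly what the question is about.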


I've been having similar difficulties lately. I've been thinking about shared workspaces. Have you given them a shot?


I actually have an office in town that I can go to (I work for a larger company that's in a lot of cities) so I try to go once or twice a week. I just have an open cube in a quiet corner. It does help, but I usually don't talk with anyone because it's both a very small office and the team there works on different web properties. I'm moving in a few months to a really small town, so I imagine I'll check out a shared space there, if I can find one.


What are your credentials? Not trying to be snobby, just for comparison's sake.


Generally speaking, installing and running WordPress is pretty simple. However, my advice would be to keep things as simple as possible, especially since it seems you guys are starting out.

For the frontend, I'd recommend either just having very thin pages in your Rails stack that are fully or at least heavily cached that then ping your API, or relying on something like Jekyll to generate static pages that then do the same.

Eventually, when the project becomes larger and if you start feeling pain points, you can abstract this frontend layer away from the stack into its own little app. You could also make it its own little Rails or Sinatra app that basically just serves HTML (again, heavily cached, since the dynamic content will come from the backend). I keep recommending Rails or Sinatra because you're already hosting Rails, and so you can use goodies like layouts, easy caching infrastructure, and the asset pipeline.

As for your marketing site, I'd go for something that's hosted elsewhere, like Tumblr if you need a blog, etc. Just try to minimize the things you need to host and support yourselves.


Really liking the app. Also enjoyed the "free for a limited time"; it adds a bit of a time component where you download it quickly and then maybe get hooked on it by the time you have to pay later on.


Thanks! Totally agreed, I think that psychologically, 'free for a limited time' makes people have to make a decision on whether they want it now or not, which is an interesting behavior.



One thing that many people don't realize about the Hierarchy of Disagreement is that going up the hierarchy takes more and more effort. If someone makes claims that are so hypocritical that they instantly fail the laugh test, there's no need to move up the hierarchy and develop more sophisticated arguments against their main points.

Now, the obvious rebuttal to this would be to claim that I'm only saying this because I can't rebut Assange's points on their own merits, so I have to attack his character. This is an appealing but dangerous line of reasoning, because it forces you to engage with trolls. The appropriate response to a troll is not to examine their argument and point out its weaknesses, but to classify them as a troll and ignore them. Is this ad hominem reasoning? Sure, but ad hominem reasoning isn't necessarily bad.

I can rebut Assange's points on their own merits, but I choose not to because I think doing so gives Assange attention and credibility that he does not deserve. Don't feed the trolls.


> I can rebut Assange's points on their own merits, but I choose not to

Yes. Instead, you chose an ad hominem attack, which is pretty much the definition of feeding the trolls.

> Is this ad hominem reasoning? Sure, but ad hominem reasoning isn't necessarily bad.

No, ad hominem reasoning is simply meritless. The opposite of engaging in a discussion is remaining silent, not scoffing and walking off stage.


>Yes. Instead, you chose an ad hominem attack, which is pretty much the definition of feeding the trolls.

Are you really claiming that saying "don't feed the trolls" is feeding the trolls?


And where did I say that?

