
> If I change certain things, they counter it immediately and make it work.

If the content is markup based, do your countermeasures involve changing the IDs, classes, or overall tag structure of the markup you serve? I was wondering whether you could maintain several variations of the above and serve your content via a random one each time, all visually indistinguishable to a human viewer. The person maintaining the scraper would have to have seen and adapted to every variant to get all your new content reliably. Not an impossible hurdle, but they might move on to easier targets if too many barriers are in the way.
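For what it's worth, here's a minimal sketch of the idea in Python. The templates, class-name scheme, and function names are all made up for illustration; in practice you'd also have to generate matching per-response CSS or use inline styles so the variants stay visually identical.

    # Sketch: serve one of several visually equivalent markup variants per request,
    # with random per-response class names. All names here are hypothetical.
    import random
    import secrets

    # Logically equivalent templates with different tag structure.
    TEMPLATES = [
        '<div class="{wrap}"><span class="{title}">{heading}</span><p class="{body}">{text}</p></div>',
        '<section class="{wrap}"><h2 class="{title}">{heading}</h2><div class="{body}">{text}</div></section>',
        '<article class="{wrap}"><p class="{title}">{heading}</p><span class="{body}">{text}</span></article>',
    ]

    def random_class() -> str:
        # Fresh class name each response, so scrapers can't hard-code selectors.
        return "c" + secrets.token_hex(4)

    def render_article(heading: str, text: str) -> str:
        template = random.choice(TEMPLATES)
        return template.format(
            wrap=random_class(),
            title=random_class(),
            body=random_class(),
            heading=heading,
            text=text,
        )

    if __name__ == "__main__":
        # Two renders of the same content produce different markup.
        print(render_article("Quarterly results", "Revenue was flat."))
        print(render_article("Quarterly results", "Revenue was flat."))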




Since the seemingly coordinated disappearance of freely available finance data APIs (IEX excepted), both Yahoo and Google Finance employ this method of randomizing class names on pages with finance data (quotes, fiscal data, etc.). Inspect element on those pages for a good example of the tactic. I feel like this could make it much harder for your content to be stolen.


Easily bypassed: just retry on failure until the scraper gets syntax it likes.


The other end retrying until it gets what it wants will dramatically change its usage pattern in ways that may be easy to detect, unless they have an enormous pool of IPs to connect from.

There are enough suggestions in here to provide a bunch of useful options, and while the site itself may not be making money, the experience dealing with this may be very useful on a resume or for building a client base with similar issues.

Possible approach: look for abnormal usage patterns to ID opponent systems. Randomize format and possibly other steps to assist that. Build that randomization in marginally effective ways that are easy to improve later. Build a way to feed bad/poison/"test" data to specific source IPs. At a time chosen to maximize impact, start feeding poison data to the suspect IPs using the marginally effective randomization, while feeding regular data to most visitors but with much improved randomization. Basically make your opponent's site visibly unreliable.
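A rough sketch of what the detection-plus-poisoning half of that could look like, assuming a Flask-style app. The thresholds, endpoint, payload values, and in-memory suspect list are all placeholders you'd tune and persist properly in a real setup.

    # Sketch: flag IPs with abnormal request rates, then serve them poisoned data.
    # Thresholds, payloads, and storage here are illustrative placeholders.
    import time
    from collections import defaultdict, deque

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120          # tune from real traffic
    request_log = defaultdict(deque)       # ip -> timestamps of recent requests
    suspect_ips = set()

    def note_request(ip: str) -> None:
        # Keep a sliding window of timestamps per IP; flag IPs that exceed the cap.
        now = time.time()
        log = request_log[ip]
        log.append(now)
        while log and log[0] < now - WINDOW_SECONDS:
            log.popleft()
        if len(log) > MAX_REQUESTS_PER_WINDOW:
            suspect_ips.add(ip)

    def real_price(symbol: str) -> float:
        return 101.25  # stand-in for a real lookup

    def real_volume(symbol: str) -> int:
        return 48213   # stand-in for a real lookup

    @app.route("/api/quotes/<symbol>")
    def quotes(symbol: str):
        ip = request.remote_addr
        note_request(ip)
        if ip in suspect_ips:
            # Poisoned response: plausible-looking but wrong numbers.
            return jsonify({"symbol": symbol, "price": 13.37, "volume": 0})
        return jsonify({"symbol": symbol,
                        "price": real_price(symbol),
                        "volume": real_volume(symbol)})

    if __name__ == "__main__":
        app.run()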

If you feel particularly vicious and know something about the opponent's infrastructure, make the poison vicious, e.g. feed SQL injections. Be aware that this may have costs: you'd likely be fine legally ("I'm not responsible for their crappy sanitizing of inputs they shouldn't have had anyway"), but you might still end up paying for a lawyer if sued.

Edit: also, anyone going to serious lengths to continue scraping after you act against it may be inclined to DDoS your site if you fully block them.


The content is served via an API that is open for everyone to use, and that makes it difficult for me to protect it. I've tried changing the structure of the API response several times, but they counter it within a few hours. I tried adding unique headers to requests, etc. (that worked for quite some time), but they figured it out eventually. I come up with a solution, they figure it out in a day or two. That went on for a few weeks.
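One way to make the response structure change per session instead of per manual edit is to randomize the JSON key names and have your own frontend fetch the key map first. This is just a sketch under that assumption; the routes, key list, and names are hypothetical, and a determined scraper could of course fetch the key map too, so it only raises the bar.

    # Sketch: per-session randomized JSON key names; the legitimate frontend fetches
    # the mapping once, while a scraper with hard-coded keys breaks every session.
    # All names here are hypothetical.
    import secrets

    from flask import Flask, jsonify, session

    app = Flask(__name__)
    app.secret_key = secrets.token_hex(16)

    LOGICAL_KEYS = ["title", "price", "updated_at"]

    def key_map() -> dict:
        # Create (and cache) a random alias for each logical key, per session.
        if "key_map" not in session:
            session["key_map"] = {k: "k" + secrets.token_hex(3) for k in LOGICAL_KEYS}
        return session["key_map"]

    @app.route("/api/keymap")
    def keymap():
        # The legitimate frontend calls this once per session.
        return jsonify(key_map())

    @app.route("/api/item/<int:item_id>")
    def item(item_id: int):
        m = key_map()
        data = {"title": f"Item {item_id}", "price": 9.99, "updated_at": "2024-01-01"}
        return jsonify({m[k]: v for k, v in data.items()})

    if __name__ == "__main__":
        app.run()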



