
The hard thing is not scraping, but getting around the increasingly sophisticated blockers.

You'll need to constantly rotate (highly rated) residential proxies and make sure not to exhibit data-scraping patterns. Some supermarkets don't show the network requests in the network tab, so you can't just grab the API response.
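To illustrate the rotation part, here's a minimal Python sketch; the proxy endpoints, user-agent pool, and URL are placeholders I made up, not real services:

```python
# Minimal sketch of rotating proxies and user agents per request.
# The proxy endpoints below are hypothetical, not real providers.
import random

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",  # hypothetical residential proxies
    "http://user:pass@proxy2.example.com:8000",
]

UA_POOL = [  # vary the browser fingerprint too, not just the IP
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def request_kwargs() -> dict:
    """Pick a random proxy and user agent for a single request."""
    proxy = random.choice(PROXIES)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(UA_POOL)},
        "timeout": 10,
    }

# usage with the requests library (placeholder URL):
# resp = requests.get("https://supermarket.example.com/api/prices", **request_kwargs())
```

In practice you'd also randomise request timing and ordering, since uniform intervals are themselves a scraping pattern.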

Even then, MITM'ing the mobile app (to see its network requests and data) will also get blocked without decent cover-ups.

I tried, but realised it isn't worth it due to the costs and the constant dev work required. In fact, some supermarket price comparison services just have (cheap-labour) people scrape it by hand.




I wonder if we could get some legislation in place to require that they publish pricing data via an API so we don't have to tangle with the blockers at all.


Perhaps in Europe. Anywhere else, forget about it.


I'd prefer that governments enact legislation that prevents discriminating against IP addresses, perhaps under net neutrality laws.

For anyone with some clout/money who would like to stop corporations like Akamai and Cloudflare from unilaterally blocking IP addresses: the way that works is you file a lawsuit against the corporation and seek an injunction to halt the practice (like IP blacklisting) while the legal proceedings run. IANAL, so please forgive me if my terminology isn't quite right here:

https://pro.bloomberglaw.com/insights/litigation/how-to-file...

https://www.law.cornell.edu/wex/injunctive_relief

Injunctions have been used with great success for a century or more to stop corporations from polluting or destroying ecosystems. The idea is that since anyone can file for an injunction, corporations have an incentive to follow the law or risk having their work halted for months or years as the case proceeds.

I'd argue that unilaterally blocking IP addresses at wide scale pollutes the ecosystem of the internet, and so shouldn't be allowed to continue.

Of course, corporations have thought of all this, so they've gone to great lengths to lobby governments and use regulatory capture to install politicians and judges who rule in their favor, paying back the campaign contributions received from those same corporations:

https://www.crowell.com/en/insights/client-alerts/supreme-co...

https://www.mcneeslaw.com/nlrb-injunction/

So now the pressures that corporations have applied on the legal system to protect their own interests at the cost of employees, taxpayers and the environment have started to affect other industries like ours in tech.

You'll tend to hear from the mainstream media and corporate PR departments that disruptive ideas like the ones I've discussed are bad for business, since they're protecting their own interests. That's why I feel the heart of hacker culture is in disrupting the status quo.


Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often I'll just stop it and open both the code and the data.

BTW, how could the network request not appear in the network tab?

For me, the hardest part is correlating and comparing products across supermarkets.


If they don't populate the page via Ajax or client-side network requests, i.e. it's rendered server side, then no requests for supermarket data will appear in the network tab.

See Next.js server-side rendering; I believe their docs mention this as a security benefit.

In terms of comparison, most names tend to be the same, so a similarity search, restricted to the same category, matches well enough.


And couldn't you use OCR and simply take an image of the product list? Not ideal, but difficult or impossible to detect, depending on your method.


You'll get blocked before even seeing the page most times.


Crowdsource it with a browser extension.



