
I’m biased, since I own a web scraping agency (https://webscrapingsolutions.co.uk/). I was asking myself the same question in 2019. You can use any programming language, but we have settled on this tech stack: Python, Scrapy (https://github.com/scrapy/scrapy), Redis, and PostgreSQL, for the following reasons:

[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.

[2] Lots of features are already implemented and open source, so you won’t have to waste time and money on them.

[3] There is a strong community that can help with most questions (I don't think any alternative has that).

[4] Scrapy developers are cheap. You will only need junior+ to mid-level software engineers to pull off most projects. It’s not rocket science.

[5] Recruiting is easier:

- there are hundreds of freelancers with relevant expertise

- a LinkedIn search turns up hundreds of software developers who have worked with Scrapy in the past, and you don’t need that many

- you can grow expertise in your own team quickly

- developers are easily replaceable, even on larger projects

- you can use the same developers on backend tasks

[6] You don’t need DevOps expertise in your web scraping team, because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good and cheap enough for 99% of projects.

[7] If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.

[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.

[9] It’s easy to integrate your own AI/ML models into the scraping workflow.

[10] With some work, you can use Scrapy for distributed projects that scrape thousands (or even millions) of domains. We use https://github.com/rmax/scrapy-redis.

[11] Commercial support is available. There are several companies that can develop an entire project for you or take over an existing one, if you don’t have the time or don’t want to do it on your own.

We have built dozens of projects in multiple industries:

- news monitoring

- job aggregators

- real estate aggregators

- ecommerce (anything from 1 website, to monitoring prices on 100k+ domains)

- lead generation

- search engines in a specific niche (SEO, pdf files, ecommerce, chemical retail)

- macroeconomic research & indicators

- social media, NFT marketplaces, etc

So most projects can be completed with these tools.
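To make point [10] a bit more concrete: a minimal sketch of the Scrapy settings that hand scheduling and duplicate filtering over to Redis via scrapy-redis, so several spider processes can share one request queue. The setting names come from the scrapy-redis README; the Redis URL is a placeholder assumption, and this is not necessarily the exact configuration used in the projects above.

```python
# settings.py fragment (sketch, not a complete Scrapy settings file)

# Store the request queue in Redis instead of in process memory,
# so multiple spider processes pull from the same queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the "already seen" request fingerprints through Redis,
# so two workers do not fetch the same URL.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints between runs, so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumption: a local Redis instance; replace with your own host.
REDIS_URL = "redis://localhost:6379/0"
```

Each worker then runs the same spider with these settings, and Redis acts as the coordination point.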




Is that even legal? I've built a few fast scrapers in C, but I balk at the thought of selling them. For some reason it feels like a bit of a grey area to me.


It depends. We employ a lawyer to assess risks associated with each project.

In general, as long as you don't have to log in, don't infringe on intellectual property rights, and don't harm the targeted servers, you should be OK.
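On the "don't harm the targeted servers" part, a rough sketch of the Scrapy settings that limit request rate. The setting names are standard Scrapy settings; the values are illustrative assumptions only, not recommendations for any particular site and not legal advice.

```python
# settings.py fragment (sketch; values are assumptions, tune per target site)

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Cap the number of parallel requests sent to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```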



