I’m biased, since I’m the owner of a web scraping agency (https://webscrapingsolutions.co.uk/). I was asking myself the same question in 2019.
You can use any programming language, but we have settled on this tech stack: Python, Scrapy (https://github.com/scrapy/scrapy), Redis, and PostgreSQL, for the following reasons:
[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.
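To make that concrete, here is roughly what a minimal spider looks like, modelled on the official Scrapy tutorial (quotes.toscrape.com is Zyte’s public practice site, and the CSS selectors are specific to it):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each div.quote holds one quote on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until it runs out.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as, say, quotes_spider.py and run with scrapy runspider quotes_spider.py -o quotes.json, it produces structured JSON with no extra plumbing.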
[2] Lots of features are already implemented and open-source, so you won’t have to waste time & money building them yourself (see the settings sketch below).
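A few examples, all stock Scrapy features that you switch on in settings.py rather than build:

```python
# settings.py -- batteries included, enabled with plain settings
ROBOTSTXT_OBEY = True               # respect robots.txt out of the box

AUTOTHROTTLE_ENABLED = True         # adaptive request rate per site
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

RETRY_ENABLED = True                # automatic retries on transient failures
RETRY_TIMES = 3

HTTPCACHE_ENABLED = True            # cache responses locally during development

CONCURRENT_REQUESTS_PER_DOMAIN = 4  # politeness / concurrency cap per domain
```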
[3] There is a strong community that can help with most questions (I don’t think any alternative has that).
[4] Scrapy developers are cheap. You will only need junior to mid-level software engineers to pull off most projects. It’s not rocket science.
[5] Recruiting is easier:
- there are hundreds of freelancers with relevant expertise
- if you search on LinkedIn, there are hundreds of software developers who have worked with Scrapy in the past, and you don’t need that many
- you can grow expertise in your own team quickly
- developers are easily replaceable, even on larger projects
- you can use the same developers on backend tasks
[6] You don’t need DevOps expertise in your web scraping team, because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good enough and cheap enough for 99% of projects.
[7] If you decide to run your own infrastructure, you can use Scrapyd (https://github.com/scrapy/scrapyd) instead; it is driven through a small JSON HTTP API, as sketched below.
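Scrapyd listens on port 6800 by default, so scheduling and monitoring jobs from your own code is a couple of requests; "myproject" and "quotes" below are placeholder names:

```python
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a crawl; Scrapyd returns a job id.
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "myproject", "spider": "quotes"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending/running/finished jobs for the project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "myproject"})
print(jobs.json())
```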
[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.
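As a taste of how cheap these integrations are: plugging in a proxy takes one meta key, because Scrapy’s built-in HttpProxyMiddleware reads it off each request (the proxy URL here is a placeholder for whichever provider you use):

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    name = "proxied"

    def start_requests(self):
        # The built-in HttpProxyMiddleware routes the request
        # through whatever request.meta["proxy"] points at.
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://user:pass@proxy.example.com:8000"},
        )

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)
```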
[9] It’s easy to integrate your own AI/ML models into the scraping workflow.
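The usual hook for this is an item pipeline that runs every scraped item through the model. A minimal sketch: load_model and the "relevance" field are hypothetical stand-ins for your own model and schema.

```python
# pipelines.py
from my_ml_package import load_model  # hypothetical: your own model loader


class RelevancePipeline:
    def open_spider(self, spider):
        # Load the model once per crawl, not once per item.
        self.model = load_model("relevance-classifier")

    def process_item(self, item, spider):
        # Attach a prediction to every scraped item.
        item["relevance"] = self.model.predict(item["text"])
        return item

# settings.py would register it with:
# ITEM_PIPELINES = {"myproject.pipelines.RelevancePipeline": 300}
```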
[10] With some work, you can use Scrapy for distributed projects that scrape thousands (even millions) of domains. We are using https://github.com/rmax/scrapy-redis, as sketched below.
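The core of a scrapy-redis setup is a few settings plus a spider that pulls its start URLs from a Redis queue, so you can run as many identical workers as you need (project and key names below are placeholders; the settings keys come from the scrapy-redis README):

```python
# settings.py -- share scheduling state through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue between runs
REDIS_URL = "redis://localhost:6379"
```

```python
# spiders/distributed.py
from scrapy_redis.spiders import RedisSpider


class DistributedSpider(RedisSpider):
    name = "distributed"
    # Workers block on this Redis list; feed it with e.g.
    #   redis-cli lpush distributed:start_urls https://example.com/
    redis_key = "distributed:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```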
[11] Commercial support is available. There are several companies that can build an entire project for you, or take over an existing one, if you don’t have the time or don’t want to do it on your own.
We have built dozens of projects in multiple industries:
- news monitoring
- job aggregators
- real estate aggregators
- ecommerce (anything from a single website to monitoring prices on 100k+ domains)
- lead generation
- niche search engines (SEO, PDF files, ecommerce, chemical retail)
- macroeconomic research & indicators
- social media, NFT marketplaces, etc.
So, most projects can be delivered with these tools.
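And since PostgreSQL is the storage end of the stack above, for completeness, here is a minimal sketch of an item pipeline writing into it with psycopg2; the connection details, table, and column names are placeholders for your own schema:

```python
# pipelines.py
import psycopg2


class PostgresPipeline:
    def open_spider(self, spider):
        # Placeholder credentials; in practice these come from settings/env.
        self.conn = psycopg2.connect(
            host="localhost", dbname="scraping", user="scraper", password="secret"
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # One parameterized insert per scraped item.
        self.cur.execute(
            "INSERT INTO quotes (text, author) VALUES (%s, %s)",
            (item["text"], item["author"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```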