Many sites don't consider user enumeration a bug or a threat, but in theory, given enough sites, one could build a profile around a specific email address.
I haven't yet. I'm a bit torn on whether or not to release it given the potential privacy implications. With more sites and a classification algorithm, one could infer an email owner's gender, race, age range, job industry, interests, etc.
What makes this tool work is that most sites (as a UX feature) will tell you whether an account/email already exists. Whether that's an API response or a notice saying "Your password is incorrect", you'll be able to get the data you need.
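To illustrate the idea: the status codes and error phrases below are made-up examples of the kinds of responses sites return, not anything from the actual tool; each real target needs its own rule. A minimal Go sketch of turning a signup/login response into an "account exists" signal:

```go
package main

import (
	"fmt"
	"strings"
)

// emailExists interprets a site's signup/login response as an enumeration
// signal. Hypothetical rules: many signup APIs return 409 Conflict for a
// taken address, and login forms often leak the same fact through their
// error copy ("your password is incorrect" implies the account exists).
func emailExists(status int, body string) bool {
	if status == 409 {
		return true
	}
	lower := strings.ToLower(body)
	return strings.Contains(lower, "password is incorrect") ||
		strings.Contains(lower, "already registered")
}

func main() {
	fmt.Println(emailExists(409, ""))                           // true
	fmt.Println(emailExists(200, "Your password is incorrect")) // true
	fmt.Println(emailExists(200, "No account found"))           // false
}
```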
It was a learning experience for me to use Go to wrap each site check in its own goroutine to leverage concurrency. Quite nice.
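The fan-out pattern described above can be sketched like this; the checker is a pure function here (a stand-in for the real per-site HTTP probe) so the example runs without network access:

```go
package main

import (
	"fmt"
	"sync"
)

// checkFn stands in for a per-site HTTP probe.
type checkFn func(email string) bool

// checkAll runs one check per site in its own goroutine and gathers the
// results on a buffered channel, roughly the concurrency shape described.
func checkAll(email string, sites map[string]checkFn) map[string]bool {
	type result struct {
		site  string
		found bool
	}
	results := make(chan result, len(sites))
	var wg sync.WaitGroup
	for name, check := range sites {
		wg.Add(1)
		go func(name string, check checkFn) {
			defer wg.Done()
			results <- result{name, check(email)}
		}(name, check) // pass loop vars explicitly to avoid capture bugs
	}
	wg.Wait()
	close(results)
	found := make(map[string]bool, len(sites))
	for r := range results {
		found[r.site] = r.found
	}
	return found
}

func main() {
	sites := map[string]checkFn{
		"site-a": func(e string) bool { return e == "alice@example.com" },
		"site-b": func(e string) bool { return true },
	}
	fmt.Println(checkAll("alice@example.com", sites))
}
```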
It really is a great package and works extremely well in the context of scraping. It's what I opted to use for an API server [1] I wrote that handles the RFC Instant Answer [2] running on DuckDuckGo.
I was thinking today about making a more generic "when this thing on that website changes, notify me" sort of thing. It looks like you're running on AWS. Would you be willing to share how much the bill for that comes to?
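The core of such a watcher is tiny: hash the scraped content on each poll and fire a notification when the hash changes. A sketch in Go (the content and polling loop are left out; hashing a specific element rather than the whole page avoids false alarms from ads and timestamps):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// snapshot hashes whatever was scraped out of the target page.
func snapshot(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// changed compares the hash stored from the last poll with a fresh scrape,
// returning whether to notify and the new hash to store.
func changed(lastHash, content string) (bool, string) {
	h := snapshot(content)
	return h != lastHash, h
}

func main() {
	last := snapshot("£4.99")
	diff, _ := changed(last, "£4.99")
	fmt.Println(diff) // false: nothing to notify
	diff, _ = changed(last, "£3.49")
	fmt.Println(diff) // true: fire the notification
}
```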
Is it not http://www.bbc.co.uk/food/ that's being merged into bbc good food and therefore gaining ads? Surely that should be the target for this scraper...
Yup, I thought that when I was implementing the nutrition stuff. My guess is that the original implementer went "salt content, well that's really a sodium property"
IMHO the real challenge for scraping (other than at-scale issues like spawning many processes, crawling, proxies, etc.) is a scraping framework that allows you to change your mind about what needs scraping without having to redo the entire scrape. Also, re-scraping for updates.
Between "remember every bit of HTML" and "only remember parsed data" is perhaps "remember every bit of HTML, but notice base-HTML patterns so it can be massively compressed". "Dynamic" content like JavaScript/AJAX content and rendered dates complicates this...
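Even without pattern-aware deduplication, plain gzip gets a long way on template-heavy pages, and keeping the compressed raw HTML next to the parsed fields means changing your mind later costs a re-parse, not a re-crawl. A minimal sketch:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// archive gzip-compresses raw HTML for storage alongside the parsed data.
func archive(html string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(html)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// restore decompresses a stored page so it can be re-parsed with new rules.
func restore(blob []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	return string(raw), err
}

func main() {
	page := "<html><body><h1>Recipe</h1><p>Salt: 1g</p></body></html>"
	blob, _ := archive(page)
	back, _ := restore(blob)
	fmt.Println(back == page) // round-trips exactly
}
```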
As awesome as the Internet Archive's Wayback Machine is, it is still under central control. Worse, since they (reasonably) abide by robots.txt rules, the BBC could easily block archive access to the pages they've removed at their end as well. If you care about something, you need to fully "own" it.
BBC Good Food is not going anywhere, so you can stop the scraping... It's BBC Food (different name, different website) that's going to close (and even then, the recipes will remain online, btw).
"article implied that the bbc good food website would be taken down. This turned out to be false but by the time I realised that I'd already written this, so here you are."
I'm left-liberal British and am therefore expected to support the BBC I guess, but why in heaven this publicly-funded organisation has one recipe site, let alone two, baffles me.
BBC Good Food is funded by BBC Worldwide, the commercial arm of the BBC. It is not funded by licence fee money. The website is a companion (or extension) of their BBC Good Food magazine.
The BBC Food website, on the other hand, is funded by the licence fee. Recipes from many of their food programmes are published here, but much more too. It's grown to be much more than just a companion to their food programmes. But now, with little support from the public, they've decided to close the site.
Before the internet was a thing, cookery programmes used to end with the information that the recipes featured in the programme could be found on Ceefax, or you could send a self-addressed envelope to BBC TV Centre and they would mail the recipes out to you. It seems like an obvious and natural use of the web for the BBC to provide the same information on a web page that it used to provide.