Many sites don't consider user enumeration a bug or a threat, but in theory, given enough sites, one could build a profile around a specific email address.
I haven't yet. I'm a bit torn on whether or not to release it given the potential privacy implications. With more sites and a classification algorithm, one could infer an email owner's gender, race, age range, job industry, interests, etc.
What makes this tool work is that most sites (as a UX feature) will tell you whether an account/email already exists. Whether that's an API response or a notice saying "Your password is incorrect", you'll be able to get the data you need.
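To illustrate the idea: the status codes and error phrases below are made-up examples of the kinds of responses sites return, not anything from the actual tool; each real target needs its own rule. A minimal Go sketch of turning a signup/login response into an "account exists" signal:

```go
package main

import (
	"fmt"
	"strings"
)

// emailExists interprets a site's signup/login response as an enumeration
// signal. Hypothetical rules: many signup APIs return 409 Conflict for a
// taken address, and login forms often leak the same fact through their
// error copy ("your password is incorrect" implies the account exists).
func emailExists(status int, body string) bool {
	if status == 409 {
		return true
	}
	lower := strings.ToLower(body)
	return strings.Contains(lower, "password is incorrect") ||
		strings.Contains(lower, "already registered")
}

func main() {
	fmt.Println(emailExists(409, ""))                           // true
	fmt.Println(emailExists(200, "Your password is incorrect")) // true
	fmt.Println(emailExists(200, "No account found"))           // false
}
```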
It was a learning experience for me to use Go to wrap each site check in its own goroutine to leverage concurrency. Quite nice.
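The fan-out pattern described above can be sketched like this; the checker is a pure function here (a stand-in for the real per-site HTTP probe) so the example runs without network access:

```go
package main

import (
	"fmt"
	"sync"
)

// checkFn stands in for a per-site HTTP probe.
type checkFn func(email string) bool

// checkAll runs one check per site in its own goroutine and gathers the
// results on a buffered channel, roughly the concurrency shape described.
func checkAll(email string, sites map[string]checkFn) map[string]bool {
	type result struct {
		site  string
		found bool
	}
	results := make(chan result, len(sites))
	var wg sync.WaitGroup
	for name, check := range sites {
		wg.Add(1)
		go func(name string, check checkFn) {
			defer wg.Done()
			results <- result{name, check(email)}
		}(name, check) // pass loop vars explicitly to avoid capture bugs
	}
	wg.Wait()
	close(results)
	found := make(map[string]bool, len(sites))
	for r := range results {
		found[r.site] = r.found
	}
	return found
}

func main() {
	sites := map[string]checkFn{
		"site-a": func(e string) bool { return e == "alice@example.com" },
		"site-b": func(e string) bool { return true },
	}
	fmt.Println(checkAll("alice@example.com", sites))
}
```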
It really is a great package and works extremely well in the context of scraping. It's what I opted to use for an API server [1] I wrote that handles the RFC Instant Answer [2] running on DuckDuckGo.
I was thinking today about making a more generic "when this thing on that website changes, notify me" sort of thing. It looks like you're running on AWS. Would you be willing to share how much the bill for that comes to?
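The core of such a watcher is tiny: hash the scraped content on each poll and fire a notification when the hash changes. A sketch in Go (the content and polling loop are left out; hashing a specific element rather than the whole page avoids false alarms from ads and timestamps):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// snapshot hashes whatever was scraped out of the target page.
func snapshot(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// changed compares the hash stored from the last poll with a fresh scrape,
// returning whether to notify and the new hash to store.
func changed(lastHash, content string) (bool, string) {
	h := snapshot(content)
	return h != lastHash, h
}

func main() {
	last := snapshot("£4.99")
	diff, _ := changed(last, "£4.99")
	fmt.Println(diff) // false: nothing to notify
	diff, _ = changed(last, "£3.49")
	fmt.Println(diff) // true: fire the notification
}
```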
Is it not http://www.bbc.co.uk/food/ that's being merged into bbc good food and therefore gaining ads? Surely that should be the target for this scraper...
Yup, I thought that when I was implementing the nutrition stuff. My guess is that the original implementer went "salt content, well that's really a sodium property"
IMHO the real challenge for scraping (other than at-scale issues like spawning many processes, crawling, proxies, etc.) is a scraping framework that allows you to change your mind about what needs scraping without having to redo the entire scrape. Also, re-scraping for updates.
Between "remember every bit of HTML" and "only remember parsed data" is perhaps "remember every bit of HTML, but notice base-HTML patterns so it can be massively compressed". "Dynamic" content like JavaScript/AJAX content and rendered dates complicates this...
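Even without pattern-aware deduplication, plain gzip gets a long way on template-heavy pages, and keeping the compressed raw HTML next to the parsed fields means changing your mind later costs a re-parse, not a re-crawl. A minimal sketch:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// archive gzip-compresses raw HTML for storage alongside the parsed data.
func archive(html string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(html)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// restore decompresses a stored page so it can be re-parsed with new rules.
func restore(blob []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	return string(raw), err
}

func main() {
	page := "<html><body><h1>Recipe</h1><p>Salt: 1g</p></body></html>"
	blob, _ := archive(page)
	back, _ := restore(blob)
	fmt.Println(back == page) // round-trips exactly
}
```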
As awesome as the Internet Archive's Wayback Machine is, it is still under central control. Worse, since they (reasonably) abide by robots.txt rules, the BBC could easily block archive access to the pages they've removed at their end as well. If you care about something, you need to fully "own" it.
BBC Good Food is not going anywhere, so you can stop the scraping... It's BBC Food (different name, different website) that's going to close (and even then, the recipes will remain online, btw).
"article implied that the bbc good food website would be taken down. This turned out to be false but by the time I realised that I'd already written this, so here you are."
I'm left-liberal British and am therefore expected to support the BBC I guess, but why in heaven this publicly-funded organisation has one recipe site, let alone two, baffles me.
BBC Good Food is funded by BBC Worldwide, the commercial arm of the BBC. It is not funded by licence fee money. The website is a companion (or extension) of their BBC Good Food magazine.
The BBC Food website, on the other hand, is funded by the licence fee. Recipes from many of their food programmes are published here, but much more too. It's grown to be much more than just a companion to their food programmes. But now, with little support from the public, they've decided to close the site.
Before the internet was a thing, cookery programmes used to end with the information that the recipes featured in the programme could be found on Ceefax, or you could send a self-addressed envelope to BBC TV Centre and they would mail the recipes out to you. It seems like an obvious and natural use of the web for the BBC to provide the same information on a web page that it used to provide.