Hacker News
Show HN: BBC Good Food Scraper in Go (github.com/sleepypikachu)
92 points by sleepychu on June 23, 2016 | hide | past | favorite | 23 comments



Goquery is pretty great - used it last semester to write the crawler for a side-project of mine. https://github.com/prakhar1989/bekanjoos


Agree, used it to build this as a proof of concept: https://emailprofile.herokuapp.com/

Many sites don't consider user enumeration a bug/threat, but theoretically given enough sites, one could build a profile around a specific email address.


Have you open-sourced this package somewhere?

As an amateur web security researcher this project looks really interesting.


I haven't yet. I'm a bit torn on whether or not to release it given the potential privacy implications. With more sites and a classification algorithm, one could say an email is a "Gender, Race, Age Range, Job Industry, Interests, etc."

What makes this tool work is that most sites (as a UX feature) will tell you whether an account/email already exists. Whether that's an API response or a notice saying "Your password is incorrect", you can get the data you need. It was a learning experience for me to use Go and wrap each site check in its own goroutine to leverage concurrency. Quite nice.


It really is a great package and works extremely well in the context of scraping. It's what I opted to use for an API server [1] I wrote that handles the RFC Instant Answer [2] running on DuckDuckGo.

[1] https://github.com/imwally/rfcsearch [2] https://duck.co/ia/view/request_for_comments


bekanjoos looks nice!

I was thinking today about making a more generic "when this thing on that website changes" notify me sort of thing. It looks like you're running on AWS. Would you be willing to share how much the bill for that runs to?


I also wrote a scraper for BBC Food, and integrated https://github.com/NYTimes/ingredient-phrase-tagger into it quite successfully. I guess I should push it up to GitHub.


Is it not http://www.bbc.co.uk/food/ that's being merged into BBC Good Food and therefore gaining ads? Surely that should be the target for this scraper...


It is, but haven't 1,000+ people already scraped it? (Myself included!)


Me too!


Be careful with your conversion from sodium -> salt (salt is only 40% sodium).

e.g. http://healthyeating.sfgate.com/difference-between-salt-sodi...

edit: oh actually the BBC website has it labeled as "Salt", but the HTML ID is "sodiumContent". Weird. Worth a comment then :P


Yup, I thought that when I was implementing the nutrition stuff. My guess is that the original implementer went "salt content, well that's really a sodium property."


Really cool. I have some scraping projects I'd like to do in Go, so this is a great starting point.


goquery is really neat, does a lot of the lifting for you!


IMHO the real challenge for scraping (other than at-scale issues like spawning many processes, crawling, proxies, etc.) is a scraping framework that lets you change your mind about what needs scraping without having to redo the entire scrape. Also, re-scraping for updates.

Between "remember every bit of HTML" and "only remember parsed data" is perhaps "remember every bit of HTML, but notice base-HTML patterns so it can be massively compressed". "Dynamic" content like JavaScript/AJAX content and rendered dates complicates this...


It seems with all these website scrapers, we have forgotten about the Internet Archive. http://archive.org/web/


As awesome as the Internet Archive's Wayback Machine is, it is still under central control. Worse, since they (reasonably) abide by robots.txt rules, the BBC could easily block access in the archive to pages they've removed at their end as well. If you care about something, you need to fully "own" it.



BBC Good Food is not going anywhere, so you can stop the scraping... It's BBC Food (different name, different website) that's going to close (and even its recipes will remain online, btw).


"article implied that the bbc good food website would be taken down. This turned out to be false but by the time I realised that I'd already written this, so here you are."


I'm left-liberal British and am therefore expected to support the BBC I guess, but why in heaven this publicly-funded organisation has one recipe site, let alone two, baffles me.


BBC Good Food is funded by BBC Worldwide, the commercial arm of the BBC. It is not funded by licence fee money. The website is a companion (or extension) of their BBC Good Food magazine.

The BBC Food website, on the other hand, is funded by the licence fee. Recipes from many of their food programmes are published here, but much more too. It's grown to be much more than just a companion to their food programmes. But now, with little support from the public, they've decided to close the site.


Before the internet was a thing, cookery programmes used to end with the information that the recipes featured in the programme could be found on Ceefax, or you could send a self-addressed envelope to BBC TV Centre and they would mail the recipes out to you. It seems like an obvious and natural use of the web for the BBC to provide the same information on a web page that it used to provide.



