We scraped the World Bank's website (cgdev.org)
167 points by danso on May 10, 2014 | 30 comments



An interesting, though only tangentially related, article I read a couple of days ago found that nearly a third of World Bank reports are never read, not even by a single person: http://www.washingtonpost.com/blogs/wonkblog/wp/2014/05/08/t...

It was submitted to HN (https://news.ycombinator.com/item?id=7715881) by another user, but probably never got traction because the title of the article is very vague.


It sounds like a marketing problem. Although I suspect the PDFs aren't the only dissemination method (as stated in the article). After someone spends so much time writing a report, it probably sits in their consciousness and spreads via that person's other interactions (e.g. shaping their perspective, thinking, and pursuits). With so many people producing so much content on a daily basis, it's hard to imagine people actually reading it all. So hopefully the good ideas stay in the author's thoughts and come out repeatedly until they're heard. Otherwise, I'm not sure there really is a good filtering mechanism at a system scale.


Or maybe they are spreading bad ideas; we won't know unless somebody reviews them.


Luckily there is TabulaPDF: https://github.com/jazzido/tabula (open source, made with a grant from the Knight Foundation)

A lot of praise from journalists in their Twitter feed: https://twitter.com/TabulaPDF


Or...you could simply use the World Bank's website and free API that already offers much public data to download in various formats: http://data.worldbank.org/

A similar data repository is available from the European Central Bank: http://sdw.ecb.europa.eu/
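
For anyone who hasn't tried it, pulling a single indicator from the World Bank API mentioned above is pretty painless. A minimal sketch (the endpoint shape and two-element JSON response layout are as I remember them; check the API docs to be sure):

  # Fetch one indicator (total population for Brazil) from the World Bank API.
  # Endpoint and response layout as I recall them; verify against the docs.
  import requests

  url = "http://api.worldbank.org/v2/country/BR/indicator/SP.POP.TOTL"
  resp = requests.get(url, params={"format": "json", "per_page": 100})
  resp.raise_for_status()

  meta, rows = resp.json()  # first element is paging metadata, second is the data
  for row in rows[:5]:
      print(row["date"], row["value"])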


"public domain... but can [only] be accessed in small pieces"

Sounds like the wonderful world of "APIs" on the www.

This sort of data should be on an FTP server.

I can build my own "apps". Give me the option of raw data.

Just my opinion, nothing more.


That's not quite what is going on. I had to write a paper for a class recently that looked at economic indicators. If I needed data on "Agriculture & Rural Development":

http://data.worldbank.org/topic/agriculture-and-rural-develo...

Down at the bottom is a link that will give you a CSV file:

http://api.worldbank.org/v2/en/topic/1?downloadformat=csv

I thought it was really easy. I ended up having to visit multiple download links, but stitching the data together was simple using Python.
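
Roughly, the stitching can look like this (a minimal sketch rather than exact code, assuming the bulk-download links return ZIP archives of CSVs; the second topic URL and the skiprows offset are just illustrative):

  # Rough sketch of the stitching: grab each topic's bulk download (a ZIP of
  # CSVs), pull out the data files, and concatenate them. The second URL is
  # just an example, and skiprows depends on the preamble in each CSV.
  import io
  import zipfile
  import requests
  import pandas as pd

  TOPIC_URLS = [
      "http://api.worldbank.org/v2/en/topic/1?downloadformat=csv",
      "http://api.worldbank.org/v2/en/topic/2?downloadformat=csv",  # example
  ]

  frames = []
  for url in TOPIC_URLS:
      resp = requests.get(url)
      resp.raise_for_status()
      with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
          for name in archive.namelist():
              if not name.endswith(".csv") or name.startswith("Metadata"):
                  continue  # skip the metadata files, keep the data itself
              with archive.open(name) as f:
                  frames.append(pd.read_csv(f, skiprows=4))

  combined = pd.concat(frames, ignore_index=True)
  combined.to_csv("worldbank_combined.csv", index=False)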

Without looking at the paper, I'm not sure what exactly they have. Have they just done the stitching already and are providing the complete data set? Or is there additional data not covered in the csv downloads?

Edit: Having looked at the paper, it looks like the data they scraped was not in the CSVs. But I really cannot tell. If that is the case, I don't know why only some data is available as a bulk download and other data is not. So...back to your original point.


You're misquoting the article.

"...is not in the public domain, but can be accessed in small pieces..." (emphasis added)

Besides, this data should absolutely be provided via an API. An exporter tool that queries the API is superior to a static FTP site.


Sorry for the misquote. I guess that my misquote completely misrepresents the issue?

I never said the data should not be provided via an API.

If you read closely (for more than just "errors"), you would notice the word "option".

That word is there for a reason.

The role of the FTP site (or whatever protocol you prefer) is to transfer the raw data in bulk to my local media.

Then I move it into my own database of choice and write my own code to access it.

Anyway, fear not. Your preferred www "API" world is not in jeopardy.

Cogent arguments for why raw, bulk _public data_ _should never be provided_ in addition to rate-limited, by-the-slice snippets via "APIs" and third-party "app developers" are welcome.


Looking at the source right now. Noticed comments in the code along the lines of "selenium is really slow traversing the dom", etc. Also noticed the script uses the non-headless Firefox WebDriver. Wouldn't it have been much faster to use GhostDriver or some similar headless solution?


Hi, one of the authors here. First, though I worked on the paper, I am not employed by CGD, so these comments are my own.

The main reason for using the non-headless Firefox WebDriver was that we wanted the script to access the site just like a human user. This made it easy to explain to non-technical people exactly how we had gotten the data. We didn't want to do anything that could be seen as circumventing the interface that the World Bank had created for that purpose.

Up to a point, performance was not a concern. In fact, as slow as Selenium is, we still artificially limited the speed of the script by waiting three seconds between each set of queries. However, when it came to selecting options, it could take Selenium tens of seconds, so that was done with JavaScript.
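
To give a flavour of the pattern (an illustrative sketch, not the actual harvester.py; the URL, element ID, and parameter values below are made up):

  # Illustrative sketch of the pattern, not the actual harvester.py:
  # drive a visible Firefox window, set <select> values via JavaScript
  # (much faster than Selenium's per-option clicks), and pause between
  # query batches. URL, element ID, and values below are made up.
  import time
  from selenium import webdriver

  driver = webdriver.Firefox()  # non-headless: behaves like a human user's browser
  driver.get("http://example.worldbank.org/query-form")  # hypothetical form page

  def set_select(select_id, value):
      # Assign the value and fire a change event directly in the page,
      # avoiding Selenium's slow DOM traversal of each <option>.
      driver.execute_script(
          "var s = document.getElementById(arguments[0]);"
          "s.value = arguments[1];"
          "s.dispatchEvent(new Event('change'));",
          select_id, value,
      )

  for line in ["1.25", "2.00"]:       # hypothetical poverty-line options
      set_select("povertyLine", line)
      # ... submit the form and read back the results table here ...
      time.sleep(3)                   # deliberate three-second pause between queries

  driver.quit()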


Unless you need to evaluate JavaScript or take screenshots of the rendered page, is there any point at all in using a WebDriver like that instead of building a plain old scraper?


I agree; using Selenium seems like an unnecessary waste of time and CPU resources when replicating the GET/POST requests and parsing the HTML response with a simple Perl or Python script would have sufficed.
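
Something along these lines would probably have done it (purely illustrative; the endpoint, form fields, and table selector are made up, and the real request would have to be copied from the browser's network tab):

  # Purely illustrative: replay the form POST directly and parse the HTML,
  # no browser needed. URL, form fields, and table selector are hypothetical.
  import requests
  from bs4 import BeautifulSoup

  payload = {"country": "BRA", "povertyLine": "1.25", "year": "2010"}
  resp = requests.post("http://example.worldbank.org/query", data=payload)
  resp.raise_for_status()

  soup = BeautifulSoup(resp.text, "html.parser")
  for row in soup.select("table#results tr"):
      cells = [td.get_text(strip=True) for td in row.find_all("td")]
      if cells:
          print(cells)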


Great point.


I'm not sure what to make of this. On the one hand, the data could be useful.

On the other hand, the data was probably already usable through the World Bank's tool. Furthermore, it's reasonable to assume that the data was only released because the people who had to make the final decision were convinced that the effort required to reassemble it would be prohibitive. The fact that it's now all been released might have a chilling effect on future releases.


Not sure I follow. Do you think the World Bank will now hesitate to release their data because someone scraped it?

They have an open data policy, using CC-BY [1], so unless this scraping effort took data that wasn't covered by that it should be ok I think.

[1] http://web.worldbank.org/WBSITE/EXTERNAL/NEWS/0,,contentMDK:...


The code is very well commented, and I'd highly recommend that anyone interested in scraping read through it. As a bonus, the appendix details all the steps to run the script, making it very easy for beginners.


Why would such important public data be only available to internal researchers? Cost? Politics? Fear?


If 70% of their reports are downloaded only a couple of times, ever, I would guess that the answer is that nobody is interested in them. Have you ever read one?


There is an appendix to the paper describing how to install and run the author's script. I'd like to take a look at the source, but can't find an actual link to the code. Am I overlooking something?


1. go here http://www.cgdev.org/section/publications?f[0]=field_documen...

2. click on the link to "We Just Ran Twenty-Three Million Queries of the World Bank's Web Site"

3. click on the "Data & Analysis" tab

4. scroll to the bottom and there are download links to harvester_parameters.py, harvester.py and unloader.py

But I can't seem to actually download them, as there is some redirect that fails :( I tried creating an account on that site as well. Anyone else have any luck?


Perfect, thanks.

I got the redirect error too when I tried to download _all_ files. However, when I deselected all and just checked the readme, CSV data, and Python scripts, it worked just fine.


Welcome to most software with academic papers published about it.

It's pretty typical to find papers with impressive claims written about a piece of software or a software technique, while the software itself is totally unavailable to the larger academic community (neither as binaries nor as source).


I thought the point of this was: gee, the World Bank makes it hard for you to get data out, so we scraped them, and here is the data for you guys to play around with. I looked for a bit, but couldn't find the data -- am I missing something?


The last sentence on that page reads: "The full data can be downloaded at www.cgdev.org/povcalnet."


And the linked page has a large number of files that appear to be relevant to other papers, but not this one.

However, this link

http://www.cgdev.org/section/publications?f[0]=field_documen...

leads to a page that has two papers listed by title. Click the second of the papers, then the "Data & Analysis" tab, and you can download the CSV files they obtained. You don't need a login, but you do need to agree to sensible-looking conditions.

I can't seem to link directly to the agree terms/download page.


I guess they could have just gone to http://povertydata.worldbank.org/poverty/home/ and downloaded the CSVs


I wonder if they would have released this if Weev had not been freed.


Looking at some random datasets, these look pretty small (less than a MB?) -- so it'd be nice to just have them zipped up and available as a torrent (say, a plain-text description in one file and a CSV in a folder per set)?

Rather than re-creating the problem of data being hard to find by splitting the download links over two pages and apparently requiring a click-through for every dataset to get at the data?


I'm not even sure what kind of data this would be but it sounds fascinating.



