It sounds like a marketing problem, although I suspect the PDFs aren't the only dissemination method (as the article notes). After someone spends so much time writing a report, it probably sits in their consciousness and spreads through their other interactions (e.g. shaping their perspective, thinking, and pursuits). With so many people producing so much content every day, it's hard to imagine anyone actually reading it all. So hopefully the good ideas stay in the author's head and keep resurfacing until they're heard. Otherwise, I'm not sure there really is a good filtering mechanism at a system scale.
Or... you could simply use the World Bank's website and free API, which already offer much of this public data for download in various formats: http://data.worldbank.org/
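For example, pulling a single indicator series takes only a few lines. A minimal sketch, assuming the requests package and using SP.POP.TOTL (total population) purely as an illustrative indicator code:

    # Minimal sketch: fetch one indicator series from the World Bank API.
    # The country and indicator codes here are just examples.
    import requests

    url = "http://api.worldbank.org/v2/country/BR/indicator/SP.POP.TOTL"
    resp = requests.get(url, params={"format": "json", "per_page": 100})
    resp.raise_for_status()

    meta, rows = resp.json()  # the JSON response is a [metadata, data] pair
    for row in rows:
        print(row["date"], row["value"])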
That's not quite what's going on. I recently had to do a paper for class that looked at economic indicators, and when I needed data on "Agriculture & Rural Development", I thought it was really easy. I ended up having to visit multiple download links, but stitching the data together was simple using Python.
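For reference, the stitching amounted to something like this rough sketch (the file names and column layout are made up, not the World Bank's):

    # Rough sketch: combine several downloaded indicator CSVs into one table.
    # Assumes the files share a compatible column layout.
    import glob
    import pandas as pd

    frames = [pd.read_csv(path) for path in glob.glob("downloads/*.csv")]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("combined.csv", index=False)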
Without looking at the paper, I'm not sure what exactly they have. Have they just done the stitching already and are providing the complete data set? Or is there additional data not covered in the csv downloads?
Edit: Having looked at the paper, it looks like the data they scraped was not in the CSVs, but I really can't tell. If that is the case, I don't know why only some data is available as a bulk download and other data is not. So... back to your original point.
Sorry for the misquote. I guess that my misquote completely misrepresents the issue?
I never said the data should not be provided via an API.
If you read closely (for more than just "errors"), you would notice the word "option".
That word is there for a reason.
The role of the FTP site (or whatever protocol you prefer) is to transfer the raw data in bulk to my local media.
Then I move it into my own database of choice and write my own code to access it.
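For concreteness, the workflow I mean is just this kind of thing (a throwaway sketch; the file, table, and column names are placeholders):

    # Sketch of the bulk workflow: load a downloaded CSV into a local
    # SQLite database and query it with your own code from then on.
    import csv
    import sqlite3

    conn = sqlite3.connect("worldbank.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicators (country TEXT, year INTEGER, value REAL)"
    )

    with open("bulk_download.csv", newline="") as f:
        rows = [
            (r["country"], int(r["year"]), float(r["value"]) if r["value"] else None)
            for r in csv.DictReader(f)
        ]

    conn.executemany("INSERT INTO indicators VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()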
Anyway, fear not. Your preferred www "API" world is not in jeopardy.
Cogent arguments for why raw, bulk _public data_ _should never be provided_ in addition to rate-limited, by-the-slice snippets via "APIs" and third-party "app developers" are welcome.
Looking at the source right now. I noticed comments in the code along the lines of "selenium is really slow traversing the dom", etc. I also noticed the script uses the non-headless Firefox WebDriver. Wouldn't it have been much faster to use GhostDriver or a similar headless solution?
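For what it's worth, the swap would have been roughly a one-liner; a sketch only, with a placeholder URL, and noting that PhantomJS/GhostDriver was current at the time but has since been deprecated in newer Selenium releases:

    # Sketch: drop in the headless PhantomJS/GhostDriver in place of Firefox.
    # The rest of the script's WebDriver calls would stay the same.
    from selenium import webdriver

    driver = webdriver.PhantomJS()      # instead of webdriver.Firefox()
    driver.get("http://example.org/")   # placeholder URL
    print(driver.title)
    driver.quit()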
Hi, one of the authors here. First, though I worked on the paper, I am not employed by CGD, so these comments are my own.
The main reason for using the non-headless Firefox WebDriver was that we wanted the script to access the site just like a human user. This made it easy to explain to non-technical people exactly how we had gotten the data. We didn't want to do anything that could be seen as circumventing the interface that the World Bank had created for that purpose.
Up to a point, performance was not a concern. In fact, as slow as Selenium is, we still artificially limited the speed of the script by waiting three seconds between each set of queries. However, when it came to selecting options, Selenium could take tens of seconds, so that part was done with JS.
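Roughly speaking, the difference looks like this; a hedged sketch rather than our actual code, with the URL and element id as placeholders:

    # Sketch of the trade-off: selecting many <option> elements through
    # Selenium's Select is one round trip per option, while a single
    # injected script does the whole selection at once.
    from selenium import webdriver
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Firefox()
    driver.get("http://example.org/query-form")   # placeholder URL

    # Slow path: one WebDriver round trip per option.
    countries = Select(driver.find_element_by_id("countries"))  # hypothetical id
    for option in countries.options:
        countries.select_by_visible_text(option.text)

    # Fast path: push the whole multi-select into one JavaScript call.
    driver.execute_script(
        "var s = document.getElementById('countries');"
        "for (var i = 0; i < s.options.length; i++) { s.options[i].selected = true; }"
    )
    driver.quit()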
Unless you need to evaluate JavaScript or take screenshots of the rendered page, is there any point at all in using a WebDriver like that instead of building a plain old scraper?
I agree; using Selenium seems like an unnecessary waste of time and CPU resources when replicating the GET/POST requests and parsing the HTML response with a simple Perl or Python script would have sufficed.
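Something along these lines would likely have done the job; a sketch only, where the URL, form fields, and selector are placeholders rather than the World Bank's actual endpoints:

    # Sketch of the plain-scraper approach: replicate the form's POST
    # request and parse the returned HTML directly.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.post(
        "http://example.org/query",                            # placeholder URL
        data={"indicator": "SP.POP.TOTL", "country": "all"},   # placeholder form fields
    )
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    for cell in soup.select("table.results td"):               # placeholder selector
        print(cell.get_text(strip=True))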
I'm not sure what to make of this. On the one hand, the data could be useful.
On the other hand, the data was probably already useful through the World Bank's tool. Furthermore, it's reasonable to assume that the data was only released because the people who had to make the final decision were convinced that the effort required to reassemble it would be prohibitive. The fact that it has now all been released might have a chilling effect on future releases.
The code is very well commented, and I'd highly recommend that anyone interested in scraping read through it. As a bonus, the appendix details all the steps needed to run the script, making it very easy for beginners.
If 70% of their reports are downloaded only a couple of times, ever, I would guess that the answer is that nobody is interested in them. Have you ever read one?
There is an appendix to the paper describing how to install and run the author's script. I'd like to take a look at the source, but can't find an actual link to the code. Am I overlooking something?
2. click on the link to "We Just Ran Twenty-Three Million Queries of the World Bank's Web Site"
3. click on the "Data & Analysis" tab
4. scroll to the bottom and there are download links to harvester_parameters.py, harvester.py and unloader.py
BUT I can't seem to actually download them, as there is some redirect that fails :( I tried creating an account on that site as well. Has anyone else had any luck?
I got the redirect error too when I tried to download _all_ files. However when I deselected all and just checked the readme, csv data, and python scripts, it worked just fine.
Welcome to most software with academic papers published about it.
It's pretty typical to find papers written about a piece of software or a software technique with impressive claims, while the software itself is totally unavailable to the larger academic community (neither as binaries nor source).
I thought the point of this was: gee, the World Bank makes it hard for you to get data out, so we scraped them, and here is the data for you guys to play around with. I looked for a bit, but couldn't find the data -- am I missing something?
leads to a page that has two papers listed by title. Click the second of the papers, then click the "Data & Analysis" tab, and you can download the CSV files they obtained. You don't need a log-in, but you do need to agree to sensible-looking conditions.
I can't seem to link directly to the agree terms/download page.
Looking at some random datasets, these look pretty small (less than a MB?) -- so it would be nice to just have them zipped up and available as a torrent (say, a plain-text description in one file and a CSV in a folder per set)?
Rather than re-creating the problem of data being hard to find by splitting the download links over two pages and apparently requiring a click-through for every dataset to get at the data?
It was submitted to HN (https://news.ycombinator.com/item?id=7715881) by another user, but probably never got traction because the title of the article is very vague.