Hacker News
The US Census now has an API (census.gov)
173 points by ams1 on June 6, 2012 | 38 comments



> The response for all queries is formatted as a two dimensional JSON array where the first row provides column names and subsequent rows provide data values.

Hmmm... looking at the example of this I can't help but think there has got to be a better way. This is really just a standard CSV (first row is the header, all other rows are data). Using JSON for data formatted like this is kind of a waste of JSON. If you pass that example into a JSON decoder you get an object that is much more difficult to use. Am I missing something? Or is this just typical "the government doesn't do tech properly" stuff?

EDIT: sorry, I try to be less negative but sometimes it is hard. I do applaud them for at least making the info available. I don't have a use for it but if someone else does then dealing with a stupid format is better than not having data at all.

EDIT2: (to clarify) I was not saying the data should just be CSV... I'm saying it is, and that defeats the purpose of using JSON. They should still use JSON but with their data formatted differently. Using their example, properly closed but truncated to just the first 2 records, the PHP function json_decode() turns it into this array:

  array (
    0 => array (
      0 => 'P0010001',
      1 => 'NAME',
      2 => 'state',
    ),
    1 => array (
      0 => '710231',
      1 => 'Alaska',
      2 => '02',
    ),
    2 => array (
      0 => '4779736',
      1 => 'Alabama',
      2 => '01',
    ),
  )
I don't find that format to be very pleasant.


This doesn't seem unreasonable.

1) Whether it's user-friendly or not, it's still JSON, which makes JSON-P possible (they support JSON-P).

2) It probably mirrors the way they store their data (in tables, whether SQL or Access or Excel, doesn't really matter).

3) It's more compact than traditional JSON, which means less bandwidth, which means less cost. Keep in mind that this is essentially a not-for-profit API from a not-for-profit organization with the worst possible budgeting scenario. Yes, Gzip would basically eliminate this benefit, but they don't have gzip enabled and enabling it may be difficult or impossible under whatever constraints they operate under.

Also, it's entirely possible that this API has existed privately for a long, long time in a CSV format, and they simply made a minor enhancement to make it JSON and JSON-P compatible and open up the endpoints to the public.
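
To put rough numbers on point 3, here is a minimal Python sketch that serializes the same rows both ways and compares byte counts, with and without gzip. The three-column sample is the one quoted earlier in the thread, repeated to stand in for a larger response; the repetition is an assumption made purely for illustration, so the gzip figures in particular will not match real Census responses.

```python
import gzip
import json

# Header and two records from the example quoted in the thread.
header = ["P0010001", "NAME", "state"]
sample = [["710231", "Alaska", "02"], ["4779736", "Alabama", "01"]]

# Assumption: repeat the sample to imitate a larger response body.
rows = sample * 500

# Layout used by the API: first row is the header, the rest are data rows.
as_arrays = json.dumps([header] + rows, separators=(",", ":")).encode("utf-8")

# "Traditional" layout: one object per record, keys repeated every time.
as_objects = json.dumps(
    [dict(zip(header, row)) for row in rows], separators=(",", ":")
).encode("utf-8")

sizes = {
    "arrays": len(as_arrays),
    "objects": len(as_objects),
    "arrays_gzip": len(gzip.compress(as_arrays)),
    "objects_gzip": len(gzip.compress(as_objects)),
}

for name, size in sizes.items():
    print(f"{name:>12}: {size} bytes")
```

The uncompressed object layout is larger because every record repeats the column names. How much of that gap survives compression depends on the real data.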


#3 may matter for a number of reasons:

- I daresay Census is sitting on some _large_ data sets. Fancy data structures are one thing when you want the population of California, another when you're trying to get California, by ethnicity and age group, by zip code.

- Data structure compactness will matter less to HN readers than to someone sitting on the other end of a phone dial-up in Oklahoma.

- I am but an egg but don't fancier data structures require particular decisions for implementation on particular tables? Census has a lot of tables -> a lot of decisions.

- God only knows how many different systems are involved in holding all their data. The simpler the data structure, the less they have to get into those -- and / or the simpler a layer they need between some mainframe and the API output.

Also, it may not seem very friendly to HN readers. But it is stupid simple, you can _see_ how it is organized, making it accessible to a wider audience. And it's so simple that it can be adopted by any agency publishing tables. Note that there are a lot of agencies with a lot of tables -- something like this has a chance of becoming standard among all of them.

A public agency has special accessibility concerns, and it's just a reality that government agencies have particular legacy technology issues. A lowest common denominator format helps get the data over those obstacles.

Maybe they can do better in places, and they're free to, there is no reason they can't support other formats too. But this will be available for whatever subset isn't served by those extensions.


#3 seems a likely reason. And like I said, having a harder-to-work-with data structure is still better than nothing. And it is free.


All good explanations, but I can't help but wish they had chosen a format like:

  {
    P0010001 : [
      '710231',
      '4779736',
      // ... etc.
    ],
    NAME : [
      'Alaska',
      'Alabama',
      // ... etc.
    ],
    state: [
      '02',
      '01',
      // ... etc.
    ],
    // ... etc.
  }


I'm having trouble thinking of any situation where that would be more convenient or useful than the format they are using now. Where would your proposed format be an advantage?


Why would that be better? You'd probably still have to convert it. And also that structure would be harder to convert to from a CSV format or a SQL query.
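
For what it's worth, going from the row-based response to the column-keyed layout proposed above takes only a few lines. A minimal Python sketch, assuming the response has already been decoded into a list of lists; the helper name `rows_to_columns` is made up for this example.

```python
# Decoded response: first row is the header, the rest are data rows.
table = [
    ["P0010001", "NAME", "state"],
    ["710231", "Alaska", "02"],
    ["4779736", "Alabama", "01"],
]


def rows_to_columns(table):
    """Return {column_name: [value, value, ...]} from [header, row, row, ...]."""
    header, *rows = table
    # Index by position so that a header with no data rows still yields
    # every column name, each mapped to an empty list.
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}


columns = rows_to_columns(table)
print(columns["NAME"])
```

The reverse direction is just as mechanical, so the choice of wire format mostly comes down to size and to which shape the consumer wants first.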


Factual.com serves their data like this too. I thought it was odd at first, but it does save a lot of bytes on the wire. I just parsed it with JSON.parse() and used underscore to map to my object format. You don't really even need underscore for that. It's just a matter of mapping an array of rows to an array of objects. It's nice there is an API now.
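
The row-to-object mapping described here looks roughly like this in Python (the comment above used JavaScript and underscore; this is a sketch of the same idea, not that code). The JSON string is the truncated example from earlier in the thread, and `rows_to_records` is a hypothetical helper name.

```python
import json

# Truncated example response, as quoted earlier in the thread.
raw = (
    '[["P0010001","NAME","state"],'
    '["710231","Alaska","02"],'
    '["4779736","Alabama","01"]]'
)


def rows_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by column name."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


records = rows_to_records(json.loads(raw))
for record in records:
    print(record["NAME"], record["P0010001"])
```

Because the keys come from the header row, the same function works for any dataset without knowing the field names in advance.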


I also found this odd, but I can think of two reasons,

1. CSV is not a single well-defined format. Some people use rfc4180, but not everybody does.

2. JSON is easier to parse than CSV for JavaScript clients, i.e., browsers. On the other hand, JSON is not a subset of JavaScript.


It's very simple to transform that kind of data into $your_preferred_structure.

It's also a very agnostic format.


One of the talks given at fluent conf (on Google's feedback canvas screenshot system) last week talked about how moving data from object-based storage to array-based storage saved serialization time and memory use. I would guess that is why they did it this way.


With their format, it is an easy matter to mechanically turn the first item into:

   INSERT INTO table ('P0010001','NAME','state') VALUES (?,?,?)
and then loop through the remaining items using that to insert them into a database.
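
A minimal sketch of that idea using Python's built-in sqlite3 module. The table name `census`, the in-memory database, and the helper names are assumptions for the example. Column names are double-quoted as identifiers here, since single quotes denote string literals in standard SQL.

```python
import sqlite3

# Decoded response: first row is the header, the rest are data rows.
table = [
    ["P0010001", "NAME", "state"],
    ["710231", "Alaska", "02"],
    ["4779736", "Alabama", "01"],
]


def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def load_table(conn, table_name, table):
    """Create a table from the header row and insert the remaining rows."""
    header, *rows = table
    cols = ", ".join(quote_ident(col) for col in header)
    marks = ", ".join("?" for _ in header)
    target = quote_ident(table_name)
    conn.execute(f"CREATE TABLE {target} ({cols})")
    # Values are bound as parameters; only the identifiers are interpolated.
    conn.executemany(f"INSERT INTO {target} ({cols}) VALUES ({marks})", rows)


conn = sqlite3.connect(":memory:")
load_table(conn, "census", table)
names = conn.execute('SELECT "NAME" FROM census ORDER BY "state"').fetchall()
print(names)
```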


I can't edit my original post anymore so I'll reply to myself. Having had some time to think about this I've changed my mind slightly. It seems that if you are getting various datasets with various unknown field names, you really don't have any other way to programmatically know what keys you would pull. Having the first row give the fields might make processing known datasets wonky, but it is the unknown datasets that need the most help. I take back my original criticism. Thanks for all the insight.


Wow, this is good government. I can nitpick the API details, but this is one giant leap in the right direction.

Under US law, the federal government can't claim copyright on works produced via tax dollars (makes sense). Since the feds can't require us to provide attribution for all this helpful data, how do we as a community advocate for more open data like this?


If anyone is interested, the FCC has a fairly comprehensive list of Developers pages for other federal agencies: http://www.fcc.gov/developers (on the right column).

Also, for further reading on the topic of .gov APIs, http://ben.balter.com/2012/06/02/publishing-government-data-... is a great start.


Cool, especially since the census is one of the main reasons why we have computers at all.

I went to the Computer History Museum a few months back, and when we were looking through the origins of the modern computer, a lot of it traces back to census needs: we needed to automate the counting of large quantities of uniformly formatted data, so we used punchcards to tally up the data.

Fast forward to 2012, and the census is now getting an API. Weird how that works.


When you sign up for an API key you get the message "Happy querying!" Quite the model of transparency.


Wouldn't this be better served by a downloadable SQLite file rather than a web service?


The Census Bureau has made their data available for download for a looong time, both over the Web (including a bunch of different interfaces for slicing out just the data you want -- see http://www.census.gov/main/www/access.html) and FTP (http://www2.census.gov/). All the stuff that's available via the Web service is already available via download, along with a lot more. They don't provide it in SQLite format AFAIK, but they do offer a range of other formats that should be easy to get into SQLite if you wish.

If there's any problem with their downloadable data offerings, it's that they have so many of them that it can be difficult to figure out exactly which one has the data you're looking for.


The downloadable files for this dataset are much larger than anyone would want to download at one go.


It's database agnostic this way and every language supports JSON.


Partly to play devil's advocate and partly out of curiosity, do you know of a language with a JSON library but no SQLite library?


Browser-based JavaScript. Their CORS and JSONP support suggests they have this use case in mind.


for you, and your repliers:

this data is already available as a downloadable file. Check out ftp://ftp.census.gov/ It's CSV, and it's kinda ugly-looking, but it's available.

The 'full' data set is ...rather huge. The full data set for any particular report is rather huge.

The advantage of this API, I expect, is it allows people to make quick, machine readable queries for useful subsets of the data. If you want a JavaScript mashup that tells you the number of people who live on your block, use this. If you want to know the population of every block everywhere for data-crunching purposes, just download the CSV.


Excellent. Much easier than requesting API keys and parsing JSON. But that's just my opinion. CSV is beautiful. Simple and easy to manipulate into some other format if desired.


CSV is beautiful - for numbers. When you invoke strings, and quoting, and whatnot, it goes downhill fast. :)
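
A small Python illustration of where it goes downhill. The place name below is invented for the example, not taken from Census data. A value containing a comma and quotes breaks a naive split on commas, while a real CSV parser round-trips it.

```python
import csv
import io

# Hypothetical row: the name contains both a comma and double quotes.
rows = [
    ["NAME", "state"],
    ['Example "Quoted" County, XX', "02"],
]

# Write with the csv module, which quotes and escapes as needed.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Naive approach: split the data line on commas.
data_line = text.splitlines()[1]
naive = data_line.split(",")

# Proper approach: let a CSV parser handle the quoting rules.
parsed = list(csv.reader(io.StringIO(text)))

print(data_line)
print(len(naive), "fields from naive split")
print(len(parsed[1]), "fields from csv.reader")
```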


As others have already commented, this is already available as a download. You can even fire up an EC2 instance with the data pre-downloaded: http://aws.amazon.com/datasets/Economics?_encoding=UTF8&...


How big do you imagine this file would be? And I am sure someone could eventually create one.


It's pretty cool that they are putting up the machine time to run queries on the data. I wonder how long it will last if someone makes a popular web app using the census API.


Slightly OT, but UK census data can also be accessed by a web API. Some details are at http://www.programmableweb.com/api/office-for-national-stati....

The bad news is that it appears to be SOAP-based.


Has anyone considered possible use-cases for this dataset?

It's great to get excited about the data format, but actual use-cases currently seem a little limited to me. Of course this will hopefully change as more and more datasets become available.


Here is a quick example I put together this weekend (using data from their Excel files, not the API - didn't know about it!) http://babelverse.com/linguistics/funfacts/

Looking forward to many more visualizations and analysis thanks to the API :)


You need to request a "key" (essentially a user ID) before you can make queries.

I can imagine only 3 reasons for this: (1) they want to make you agree to the Terms & Conditions, (2) it gives a way to choke off a DDOS attack that makes repeated complex queries, or (3) the census bureau wants to track how you are using its service.

The key is indeed easy to get, but I observe that Google would have exactly the same concerns as 1,2,&3 above, yet they somehow manage to stay in business without making their users sign up to do a web search.

The federal government always makes things a little more complicated.


Neither Google nor the Census requires individual users to sign up to access their data.

However, both require that their users sign up for an API key to do automated querying. (In fact, Google charges for its Web Search API now.) It's pretty standard for any API to want to be able to identify and potentially meter usage.


(3B) Census data on some topics can move markets. If someone hacks to get a five minute lead time, the key policy gives them a start on figuring out who and maybe how.


Right. Because if I hack a website I'm totally going to log in with a totally real set of credentials traceable back to me.

Besides, this is the 2010 American Community Survey. Pretty sure it's moved all the markets it's going to move by now.


This opens my mind up to a whole new slew of ideas, now that their information is so easy to use.


Better if it was the URL of a tarball of structured, human-friendly data.



