Discovering Millions of Datasets on the Web (blog.google)
443 points by Anon84 on Jan 23, 2020 | 32 comments



This is great. I've been collecting a list of open data sets for a while now, with an eye to turning it into a blog post at some point. Now maybe I don't have to... saved me some work!

Some other indices of open data sets I've found:

https://registry.opendata.aws

https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...

https://meta.m.wikimedia.org/wiki/Datasets



Also, zenodo.org seems to be gaining traction as a de facto data repo for scientific journals. I've had to deposit a copy of my raw data there twice in the last 6 months for different pubs (1 neurobio, 1 genomics).


For those interested, I recently wrote a blog post on how to download & parse USPTO patents for a large free corpus for NLP problems:

https://austingwalters.com/parsing-uspto-patents-to-create-a...

I actually have found FOIA requests[1] and downloads from government websites to be the easiest & most effective way to get robust datasets.

[1] https://austingwalters.com/foia-requesting-100-universities/


A couple good resources for finding datasets:

+ For individual requests, come over to https://opendata.stackexchange.com/ and ask!

+ Wikidata has loads of structured data, but using SPARQL is often a barrier. But you can request help: https://www.wikidata.org/wiki/Wikidata:Request_a_query
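To make the SPARQL barrier a bit less abstract, here is a minimal sketch of querying the public Wikidata endpoint from Python using only the standard library. The query is the well-known tutorial example (items that are an instance of house cat); the helper names `build_url`, `run_query`, and `labels`, and the User-Agent string, are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Tutorial-style query: five items that are an instance of (P31) house cat (Q146),
# with English labels filled in by the label service.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""


def build_url(query):
    """Return the GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return ENDPOINT + "?" + params


def run_query(query):
    """Send the query and return the list of result bindings (needs network)."""
    request = urllib.request.Request(
        build_url(query),
        # Hypothetical User-Agent; replace with something identifying your script.
        headers={"User-Agent": "dataset-thread-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


def labels(bindings):
    """Extract the human-readable label from each result row."""
    return [row["itemLabel"]["value"] for row in bindings]


# Usage (requires network access):
#   print(labels(run_query(QUERY)))
```

Swapping `Q146` and `P31` for other item and property IDs is usually the first step toward a real query; the request page linked above is the place to go when that stops being enough.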


I really feel like the data side of things is underrated. Mostly, it seems like when people talk about IP, they talk about the software and forget the data. Uber, Snapchat, etc. are companies mostly in the business of shuffling data around. Whether that's good or bad is subjective. This data-search product is a welcome tool for people who are trying to get something off the ground, do research, or just understand the world and human behaviour better.


I don't think it's underrated, but I do think that there is a gap between massive data and an idea of what to do with it.

And data is simple, it's parameters plus timestamps plus a lot of storage.

Realtime access is harder but it's a well-specified problem.

The issue is inference. No one does inference extremely well except in limited circumstances. It's one of our greatest bottlenecks as humans and our software is going to be limited by it as well insofar as our understanding of what to build is controlled by what kind of inference we want.


I think data is underrated because it is actually not "simple". Especially the collection and curation of it.


I agree that data is underrated. The commercial real estate industry is a really interesting example.

There are a very limited number of companies that have complete datasets of buildings in all cities throughout the country. The leader is CoStar; they have the most complete and accurate data. What I have noticed with other CRE data companies is that they focus on leasing activity or one other part of the CRE process. CoStar became the leader by valuing their data above all else; if other companies want to compete, they need to do the same.


How is data underrated? Everything is data. Data is the primary asset of most technology companies.


I'm trying to teach my wife pandas, and the thing she most wants to do is compute the 10-year projected return on Fortune 500 companies (buffettology) based on the last ten years of financial reports. It's really hard to find a good data source, though: it's either in PDF, or Google results are dominated by rent-seeking data repackagers where it's hard to see whether they have the data without jumping through hoops. Would love a source for that.
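For what it's worth, the calculation itself is small once the numbers are in hand. A rough sketch in plain Python, assuming one common reading of the approach: extrapolate earnings per share (EPS) at their historical compound growth rate, apply an assumed P/E multiple to get a future price, and annualize the gain from today's price. The function names and all figures are hypothetical, and dividends are ignored.

```python
def cagr(first, last, years):
    """Compound annual growth rate between two positive values."""
    return (last / first) ** (1 / years) - 1


def projected_annual_return(eps_history, current_price, assumed_pe, horizon=10):
    """Project an annualized return from a list of yearly EPS figures.

    eps_history   -- EPS per year, oldest first (must start and end positive)
    current_price -- today's share price
    assumed_pe    -- P/E multiple assumed to hold at the end of the horizon
    horizon       -- number of years to project forward
    """
    # Historical growth rate over the span covered by the history.
    growth = cagr(eps_history[0], eps_history[-1], len(eps_history) - 1)
    # Extrapolate the latest EPS forward at that same rate.
    future_eps = eps_history[-1] * (1 + growth) ** horizon
    # Turn projected earnings into a projected price.
    future_price = future_eps * assumed_pe
    # Annualized return implied by buying at today's price.
    return cagr(current_price, future_price, horizon)


# Made-up example: EPS doubles from 1.00 to 2.00 over ten years.
eps = [1.00, 1.07, 1.15, 1.23, 1.32, 1.41, 1.52, 1.62, 1.74, 1.87, 2.00]
rate = projected_annual_return(eps, current_price=30.0, assumed_pe=15.0)
print(f"{rate:.2%}")
```

Only the first and last EPS values drive the growth rate here, so a single unusual year at either end skews the result; the same arithmetic carries over to a pandas column once the data is loaded.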


I watch the videos of Aswath Damodaran, Professor of Valuation at NYU. He recently had a video on the data he uses [0]. It might be worth emailing him to see if the data sources he buys match your needs.

[0] https://youtu.be/M9pFTApeo_8

Additional link to his data site: http://people.stern.nyu.edu/adamodar/New_Home_Page/data.html


Look at the SEC/EDGAR page, where you can find the data in XML and JSON formats:

https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.h...


If you use Google's Dataset Search for SEC filings [1], you get outdated information. FTP access was removed years ago, but SEC filings are still a great example of large datasets. I built a side business at https://Last10K.com using buffettology that provides 10 years of company annual reports (10-Ks). There's also an API at https://dev.Last10K.com that returns financial data from these filings in JSON or XML.

[1] https://datasetsearch.research.google.com/search?query=EDGAR...


Interesting. I was considering having her, as a side hustle, type these sheets into a form I could then sell. Sounds like that is what you did. How did that work out?


Didn't see any contact details on your HN profile so feel free to contact me directly and I can provide details.


Grab the numbers from a few of the pdf files, then do Google searches for those exact numbers and see if you can find one of those "auto-generated news sites" that shows the same numbers and scrape that?


Capital IQ


I was picturing a link to Shodan, nice to see this is about legitimate sources instead.


If anyone is looking for cleaned and linked finance datasets and works at a university, you should double-check whether you get access to Wharton Research Data Services (https://wrds-www.wharton.upenn.edu/). It could save you a lot of time.


All this (and the comments) has taught me is that the data set I'm looking for doesn't exist in the public domain. Time to make it myself.


Found a Vocabulary of Philosophy using it, very skookum! https://www.loterre.fr/skosmos/73G/en/

Unfortunately, most every result for the word `philosophy' is borderline garbage imho. Keyword indexing of datasets may need improving?


I gave this a try for a few queries, but the results are very varied. Often you get the landing page for a study with its PDF behind a research/journal paywall (even with the "Free" filter applied, so not sure what "free" means to Google). Sometimes the "dataset" is some kind of visualization without any obvious way to get the raw data. Only a couple of results had a JSON or CSV to download.

Overall a bit underwhelming.


Still can't find the damned 'International Corpus of Learned English', though.



You seem to be able to order it here: https://www.i6doc.com/en/collections/cdicle/


I have no practical way of reading a CD-ROM. Why don't they make a digital version available? Thanks anyway.


If you pay $350 for a CD-ROM, you should be able to find a way to digitize it, for example at a copy shop or by just getting a USB drive.


A USB CDROM drive is ~$20.


Actually learner English. Sounds like an interesting corpus to work on.


Please try searching for datasets on this site on mobile. It needs some work.


Are there any resources where models are available?



