Discovering Millions of Datasets on the Web

sixstringtheory · on Jan 23, 2020

This is great. I've been collecting a list of open data sets for a while now with an eye to turning it into a blog post at some point. Now maybe I don't have to... saved me some work!

Some other indices of open data sets I've found:

https://registry.opendata.aws

https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...

https://meta.m.wikimedia.org/wiki/Datasets

breck · on Jan 24, 2020

This is a good list: https://github.com/awesomedata/awesome-public-datasets

subroutine · on Jan 24, 2020

Also, zenodo.org seems to be gaining traction as a de facto data repo for scientific journals. I've had to deposit a copy of my raw data here 2x in the last 6 months for different pubs (1 neurobio, 1 genomics).

lettergram · on Jan 24, 2020

For those interested, I recently wrote a blog post on how to download & parse USPTO patents for a large free corpus for NLP problems:

https://austingwalters.com/parsing-uspto-patents-to-create-a...

I actually have found FOIA requests[1] and downloads from government websites to be the easiest & most effective way to get robust datasets.

[1] https://austingwalters.com/foia-requesting-100-universities/

philshem · on Jan 24, 2020

A couple good resources for finding datasets:

+ For individual requests, come over to https://opendata.stackexchange.com/ and ask!

+ Wikidata has loads of structured data, but using SPARQL is often a barrier. But you can request help: https://www.wikidata.org/wiki/Wikidata:Request_a_query
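For anyone put off by SPARQL, here is a minimal sketch of what a query against the public Wikidata Query Service looks like. It uses the well-known "instances of house cat" example (`wdt:P31` is "instance of", `wd:Q146` is "house cat"). The helper names `build_query_url` and `labels_from_response` are made up for illustration, and the code only builds the request URL and parses a response. It does not hit the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Ten items that are an instance of (P31) house cat (Q146), with English labels.
CAT_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": sparql.strip(), "format": "json"})


def labels_from_response(payload):
    """Pull the ?itemLabel values out of a SPARQL JSON results document."""
    return [
        row["itemLabel"]["value"]
        for row in payload["results"]["bindings"]
        if "itemLabel" in row
    ]


url = build_query_url(CAT_QUERY)
print(url)
```

To actually run it, fetch `url` with any HTTP client (sending a descriptive User-Agent is good etiquette), decode the body as JSON, and pass it to `labels_from_response`.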

dzonga · on Jan 23, 2020

I really feel like the data side of things is underrated. Mostly, it seems like when people talk of IP, they talk about the software and forget the data. Uber, Snapchat etc are companies mostly in the business of shuffling data around. Good or bad, that's subjective. And this data-search product is a welcome tool for people who are trying to get something off the ground, do research, or just understand the world and human behaviour better.

bordercases · on Jan 24, 2020

I don't think it's underrated, but I do think that there is a gap between massive data and an idea of what to do with it.

And data is simple, it's parameters plus timestamps plus a lot of storage.

Realtime access is harder but it's a well-specified problem.

The issue is inference. No one does inference extremely well except in limited circumstances. It's one of our greatest bottlenecks as humans and our software is going to be limited by it as well insofar as our understanding of what to build is controlled by what kind of inference we want.

danso · on Jan 24, 2020

I think data is underrated because it is actually not "simple". Especially the collection and curation of it.

jimkri · on Jan 24, 2020

I agree that data is underrated. The commercial real estate industry is a really interesting example.

There are a very limited number of companies that have complete datasets of buildings in all cities throughout the country. The leader is CoStar; they have the most complete and accurate data. What I have noticed with other CRE data companies is that they focus on leasing activity or one other part of the CRE process. CoStar became the leader by valuing their data over all things; if other companies want to compete, they need to do the same.

anthonypasq · on Jan 24, 2020

How is data underrated? Everything is data. Data is the primary asset of most technology companies.

grogenaut · on Jan 24, 2020

Trying to teach my wife pandas and the thing she most wants to do is compute the 10 year projected return on Fortune 500 companies (Buffettology) based on the last ten years of financial reports. It's really hard to find a good data source though, as it's either in PDF or Google has been optimized for rent-seeking data repackagers, where it's hard to see if they have the data without jumping through hoops. Would love a source for that.
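Once the numbers are in hand, the calculation itself is small. Here is a rough sketch in plain Python (it ports to a pandas column easily) of one simple reading of that kind of projection: extrapolate the historical EPS growth rate, apply an assumed future P/E, and annualize against today's price. The function names, the input shape and all the numbers are made up for illustration, and this is not claimed to be the exact Buffettology method.

```python
def cagr(first, last, years):
    """Compound annual growth rate between two positive values."""
    if first <= 0 or last <= 0 or years <= 0:
        raise ValueError("cagr needs positive values and a positive number of years")
    return (last / first) ** (1 / years) - 1


def projected_annual_return(eps_history, price_today, assumed_pe, horizon=10):
    """Annualized return implied by extrapolating past EPS growth.

    eps_history: yearly EPS, oldest first (hypothetical input shape)
    assumed_pe:  the P/E multiple you guess the stock trades at in `horizon` years
    """
    # N yearly values span N - 1 years of growth.
    growth = cagr(eps_history[0], eps_history[-1], len(eps_history) - 1)
    future_eps = eps_history[-1] * (1 + growth) ** horizon
    future_price = future_eps * assumed_pe
    return cagr(price_today, future_price, horizon)


# Made-up numbers, purely for illustration.
eps = [2.00, 2.20, 2.42, 2.66, 2.93, 3.22, 3.54, 3.90, 4.29, 4.72, 5.19]
print(f"{projected_annual_return(eps, price_today=80.0, assumed_pe=15):.1%}")
```

The hard part remains the one described above: getting ten years of clean per-company figures to feed into `eps_history`.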

withdavidli · on Jan 24, 2020

I watch videos by Aswath Damodaran, Professor of Valuation at NYU. He recently had a video on the data he uses [0]. Might be worth emailing him to see if the data sources he buys match your needs.

[0] https://youtu.be/M9pFTApeo_8

Additional link to his data site: http://people.stern.nyu.edu/adamodar/New_Home_Page/data.html

marc__1 · on Jan 24, 2020

Look at the SEC/EDGAR page where you can find the data in xml and json formats

https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.h...

hbcondo714 · on Jan 24, 2020

If you use Google's Dataset Search for SEC Filings[1], you get outdated information. FTP access was removed years ago, but SEC Filings are still a great example of large datasets. I built a side business at https://Last10K.com using buffettology and provide 10 years of company annual reports (10Ks). There's also an API at https://dev.Last10K.com that returns financial data from these filings in JSON or XML.

[1] https://datasetsearch.research.google.com/search?query=EDGAR...

grogenaut · on Jan 24, 2020

Interesting. I was considering having her as a side hustle type these sheets into a place I could then sell. Sounds like that was what you did. How did that work out?

hbcondo714 · on Feb 3, 2020

Didn't see any contact details on your HN profile so feel free to contact me directly and I can provide details.

londons_explore · on Jan 24, 2020

Grab the numbers from a few of the pdf files, then do Google searches for those exact numbers and see if you can find one of those "auto-generated news sites" that shows the same numbers and scrape that?

texasbigdata · on Jan 24, 2020

Capital IQ

blacksmith_tb · on Jan 23, 2020

I was picturing a link to Shodan, nice to see this is about legitimate sources instead.

unitykid9008 · on Jan 24, 2020

If anyone is looking for cleaned and linked finance datasets and works at a university, you should double check if you get access to Wharton Research Data Services https://wrds-www.wharton.upenn.edu/, it could save you a lot of time.

fudged71 · on Jan 24, 2020

All this (and the comments) has taught me is that the data set I'm looking for doesn't exist in the public domain. Time to make it myself.

igravious · on Jan 24, 2020

Found a Vocabulary of Philosophy using it, very skookum! https://www.loterre.fr/skosmos/73G/en/

Unfortunately, most every result for the word "philosophy" is borderline garbage imho. Keyword indexing of datasets may need improving?

davedx · on Jan 24, 2020

I gave this a try for a few queries, but the results are very varied. Often you get the landing page for a study with its PDF behind a research/journal paywall (even with the "Free" filter applied, so not sure what "free" means to Google). Sometimes the "dataset" is some kind of visualization without any obvious way to get the raw data. Only a couple of results had a JSON or CSV to download.

Overall a bit underwhelming.