This is great. I've been collecting a list of open data sets for a while now with an eye to turning it into a blog post at some point. Now maybe I don't have to... saved me some work!
Also, zenodo.org seems to be gaining traction as a de facto data repo for scientific journals. I've had to deposit a copy of my raw data there 2x in the last 6 months for different pubs (1 neurobio, 1 genomics).
I really feel like the data side of things is underrated. Mostly, it seems like when people talk about IP, they talk about the software and forget the data. Uber, Snapchat, etc. are companies mostly in the business of shuffling data around. Whether that's good or bad is subjective. And this data-search product is a nice welcome to people who are trying to get something off the ground, doing research, or just trying to understand the world and human behaviour better.
I don't think it's underrated, but I do think that there is a gap between massive data and an idea of what to do with it.
And data is simple: it's parameters plus timestamps plus a lot of storage.
Realtime access is harder but it's a well-specified problem.
The issue is inference. No one does inference extremely well except in limited circumstances. It's one of our greatest bottlenecks as humans and our software is going to be limited by it as well insofar as our understanding of what to build is controlled by what kind of inference we want.
I agree that data is underrated. The commercial real estate industry is a really interesting example.
There are very few companies that have complete datasets of buildings in all cities throughout the country. The leader is CoStar, which has the most complete and accurate data. What I have noticed with other CRE data companies is that they focus on leasing activity or one other part of the CRE process. CoStar became the leader by valuing its data over all things; if other companies want to compete, they need to do the same.
I'm trying to teach my wife pandas, and the thing she most wants to do is compute the 10-year projected return on Fortune 500 companies (Buffettology) based on the last ten years of financial reports. It's really hard to find a good data source, though: it's either in PDF, or Google has been optimized for rent-seeking data repackagers, where it's hard to see if they have the data without jumping through hoops. Would love a source for that.
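For what it's worth, the calculation itself is the easy part once the numbers are in hand. Here is a minimal sketch of one common Buffettology-style estimate, assuming you already have a list of annual EPS figures: grow EPS at its historical compound rate, price the result at an assumed P/E, and annualize the gain from today's price. The function names (`cagr`, `projected_annual_return`), the `assumed_pe` parameter, and the sample numbers are all made up for illustration. It is plain Python so it runs without dependencies; the same arithmetic carries over to a pandas Series.

```python
def cagr(first, last, years):
    """Compound annual growth rate between two positive values."""
    return (last / first) ** (1 / years) - 1


def projected_annual_return(eps_history, price_now, assumed_pe, horizon=10):
    """Estimate the annualized return over `horizon` years.

    eps_history: annual EPS values, oldest first (N values span N-1 years).
    assumed_pe:  the P/E you assume the stock trades at in `horizon` years.
    """
    # Historical EPS growth rate over the reported period.
    growth = cagr(eps_history[0], eps_history[-1], len(eps_history) - 1)
    # Extrapolate EPS forward at that same rate (a strong assumption).
    future_eps = eps_history[-1] * (1 + growth) ** horizon
    # Turn projected earnings into a projected price.
    future_price = future_eps * assumed_pe
    # Annualized return from today's price to the projected price.
    return cagr(price_now, future_price, horizon)


# Made-up numbers, not real company data: EPS growing 10% a year for 10 years.
eps = [1.00 * 1.10 ** i for i in range(11)]
estimate = projected_annual_return(eps, price_now=eps[-1] * 15, assumed_pe=15)
print(round(estimate, 4))
```

With these inputs the stock trades at the same P/E today as the assumed future P/E, so the estimate comes out equal to the 10% EPS growth rate. Real inputs will be messier (negative or zero EPS breaks the growth formula), so treat this as a starting point.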
I watch videos by Aswath Damodaran, Professor of Valuation at NYU. He recently had a video on the data he uses [0]. Might be worth emailing him to see if the data sources he buys match your needs.
If you use Google's Dataset Search for SEC Filings[1], you get outdated information. FTP access was removed years ago, but SEC filings are still a great example of large datasets. I built a side business at https://Last10K.com using Buffettology; it provides 10 years of company annual reports (10-Ks). There's also an API at https://dev.Last10K.com that returns financial data from these filings in JSON or XML.
Interesting. I was considering having her, as a side hustle, type these sheets into a place where I could then sell them. Sounds like that was what you did. How did that work out?
Grab the numbers from a few of the PDF files, then do Google searches for those exact numbers, see if you can find one of those "auto-generated news sites" that shows the same numbers, and scrape that?
If anyone is looking for cleaned and linked finance datasets and works at a university, you should double-check whether you get access to Wharton Research Data Services (https://wrds-www.wharton.upenn.edu/). It could save you a lot of time.
I gave this a try for a few queries, but the results vary a lot. Often you get the landing page for a study with its PDF behind a research/journal paywall (even with the "Free" filter applied, so I'm not sure what "free" means to Google). Sometimes the "dataset" is some kind of visualization without any obvious way to get the raw data. Only a couple of results had a JSON or CSV to download.
Some other indices of open data sets I've found:
https://registry.opendata.aws
https://en.m.wikipedia.org/wiki/List_of_datasets_for_machine...
https://meta.m.wikimedia.org/wiki/Datasets