Love Datasette. I have successfully deployed it internally to host many studies that were too big for sharing or consumption by normal users (100MB-20GB). Historical options have been to: distribute very high-level summary information (with a few data call-outs), build up a minimal Django app, or use a much heavier-weight solution (e.g. Metabase).
Once you get into this medium-data space, just distributing the data becomes a challenge, as you can no longer email results around. Maybe there are limits on network shares, SharePoint, whatever your customer is accustomed to using. Then you run into the problem that your typical user can only use Excel, which will similarly barf on too much data.
Previously, I would make a few data call-outs for the most interesting results: whatever could fit in an email or a PowerPoint. Maybe an attached Excel file with the top N interesting data points. Tell the customers to reach out if they have questions or need to know anything else. Questions rarely came.
Now, you can give them ~everything (usually I would only include data after some amount of processing; raw signals are not useful without deep domain knowledge or the software to process them), build up a few views to show different data highlights, and give a <5 minute tutorial (“this is Super Excel, this is how you filter data”), and away they go.
With my first deployment, I thought it was a cute trick that just satisfied my nerd curiosity. However, when I checked in on the logs, I saw they were hammering the system. For a routine study, they were looking up all sorts of things. Which made me wonder: how many times in the past had they wanted more detail, but did not want to bother me? They were now empowered by the data and could do their own sleuthing. When I did receive questions post-Datasette, they were more sophisticated, because they were able to answer the routine ones on their own.
Datasette is an open source project from Simon Willison (a fairly well-known HNer), and this looks like his monetisation project - good luck to you, hope SoftBank buys you out soon :-)
(It's a sort of wrapper around SQLite files that makes it fairly easy to publish a file - think maybe Tableau for SQLite?)
I realized that Datasette is the first project of my entire career where, if I was still working on it in 15 years' time, I wouldn't feel bored yet. There's just SO MUCH scope for interesting applications of the core idea.
As such, I want to work on it for decades. But it's lonely working on it alone (the community around it has been growing and is delightful, but it's not the same as having a full-time team).
So the question I'm trying to answer is how to make the project financially sustainable in the long run - not just for myself, but so I can pay for a team to work on it with me.
There are plenty of other examples of open source projects that have turned SaaS hosting into a sustainable business model - WordPress and GitLab are just two of the best examples. It feels like it's a reasonably well-trodden path.
Plus... I want people to be able to use my software. Currently to use Datasette as an individual you either have to "pip" or "brew" install it, or you can try the macOS Electron app - https://datasette.io/desktop - but I want newsrooms to be able to use it to collaborate on data. And most newsrooms aren't well equipped to configure a Linux server.
So I realized that a hosted SaaS version can solve two issues at once: it can help the audience I care about actually benefit from the software, and it provides a reasonably realistic path to financial sustainability for the project as a whole.
And yeah, I'd also like to make a ton of money out of it myself too!
I am generally a naive and simple person, so I think I would appreciate some investment in making Datasette "more approachable" and user-friendly for laypeople.
Datasette's UX and setup seem to be geared more towards data hackers with a hobby in reporting. Personally, I don't see it as a standard toolkit for data reporting or data journalism. Even though you might argue, "What more do you want? It's as simple as it gets", to be honest, Simon has mentioned that the intended users are journalists, who may not possess the data-hacking skills required to get started with Datasette.
Datasette is not a BI tool or an OSINT tool. As it is, Datasette is positioned between data enthusiasts and investigative reporters which is a very narrow niche. This severely limits its potential.
Simon should consider monetization and, more specifically, hiring individuals who can make Datasette Cloud more accessible. I think he recognizes this, as he has created a GUI application, which is a step in the right direction.
> As it is, Datasette is positioned between data enthusiasts and investigative reporters which is a very narrow niche. This severely limits its potential.
I bet in 2005, if you asked Simon what Django was initially intended to be, he'd say something similarly niche... and then it became the backend for Instagram. You've got to change hats from evaluating present state to future potential when presented with something new, particularly when the author has a track record.
Kind of funny that it's nearly 20 years later and I'm working on something else that I initially thought would be for journalists but is clearly useful for way more than that.
Journalists are good at words. An interface to their data that plays to their strengths there feels like it could be transformational - provided it doesn't hallucinate at them!
> Datasette is not a BI tool or an OSINT tool. As it is, Datasette is positioned between data enthusiasts and investigative reporters which is a very narrow niche. This severely limits its potential.
FYI, you can already perform some "BI as code" (as I like to call it) using the Datasette Dashboards plugin[1]: specify charts using SQL queries plus a visual spec (Vega, Vega-Lite, Maps, Tables, etc.), and assemble a dashboard layout. It is not yet as feature-rich as Metabase, for instance, but several people have been using it successfully for various use cases.
(disclaimer: I'm the author of the Datasette Dashboards plugin)
I think of it more like MS Access, but with a sane backend of SQLite and Python. There are thousands and thousands of critical business processes kludged together in Excel and Access - Datasette could be a much better choice for those use cases. Something both devs and business people can use.
Totally agree - so many of the workflows people get strong feelings about customizing - note taking, todo lists, personal document management, inventory of goods, etc. - are really just a SQLite database with some nice custom views and interfaces. I could definitely see a future where Datasette or similar tools replace some of that stuff.
Access is probably caught in a weird spot internally at MS. If they put effort into it, then it just removes some of the need to sell proper SQL Server or Azure cloud database tech. Better to just limp it along than start internal wars with bigger organizations/products.
And the great thing about those tools is that Datasette doesn't need to replace them - SQLite becomes the integration layer, so you can use any tool you like that provides a neat UI for storing data in SQLite, then use Datasette itself directly against that same database when you need to run your own SQL, integrate with other apps via JSON, or run custom plugins.
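To make that integration layer concrete, here's a minimal sketch (the file, table, and column names are all made up): any tool that writes a SQLite file - plain stdlib sqlite3 standing in here - produces something Datasette can serve directly.

    # A minimal sketch of SQLite as the integration layer.
    # Any tool that writes this file works; stdlib sqlite3 stands in here.
    import sqlite3

    conn = sqlite3.connect("inventory.db")  # hypothetical shared database file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)"
    )
    conn.execute("INSERT INTO items (name, qty) VALUES (?, ?)", ("widgets", 42))
    conn.commit()
    conn.close()

    # Then point Datasette at the same file for SQL, JSON APIs and plugins:
    #   datasette inventory.db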
I was constantly thinking about MS Access while watching the introductory video. I loved MS Access in the 90s, and this being based on SQLite and Python makes it really great.
The bigger pro is that you can export the data as JSON, which basically means you have a server for your SQLite file that other applications can query, without needing a full-blown database server like MariaDB or Postgres - while you still have the option to explore the data manually.
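For example, querying that JSON API from another application is a plain HTTP GET. A quick sketch, assuming a local instance started with "datasette mydb.db" on the default port, with placeholder database and table names:

    # Sketch: fetch rows from a local Datasette instance's JSON API.
    # Assumes "datasette mydb.db" is running on the default port 8001;
    # database and table names here are placeholders.
    import json
    import urllib.request

    url = "http://localhost:8001/mydb/locations.json?_shape=array"
    with urllib.request.urlopen(url) as response:
        rows = json.load(response)  # _shape=array returns a plain list of row objects

    for row in rows[:5]:
        print(row)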
So for small projects this seems to be a really good tool.
Datasette was originally created to take on this problem. I realized that SQLite is the perfect platform for this: it's fast, robust and crucially can be deployed anywhere that can host a dynamic web application (if you're publishing read-only data you don't need to worry about backups and replication and suchlike).
If you're comfortable with the command-line, I challenge you to find a quicker way to publish data online than this:
    # Load the CSV into a SQLite table, auto-detecting column types (-d)
    sqlite-utils insert manatees.db locations \
      Manatee_Carcass_Recovery_Locations_in_Florida.csv --csv -d
    # Tidy the schema: rename the coordinate columns, drop unwanted ones, set a primary key
    sqlite-utils transform manatees.db locations \
      --rename LAT latitude \
      --rename LONG_ longitude \
      --drop created_user \
      --drop last_edited_user \
      --drop X \
      --drop Y \
      --drop STATE \
      --drop OBJECTID \
      --pk FIELDID
    # Publish to Vercel, installing a plugin that maps the latitude/longitude points
    datasette publish vercel manatees.db \
      --project datasette-manatees --install datasette-cluster-map
That's using the datasette-publish-vercel plugin, but Datasette can also publish to Fly, Google Cloud Run, Heroku and more using additional plugins: https://docs.datasette.io/en/stable/publish.html
So that's publishing. But Datasette has grown far beyond that in the five years I've been working on it.
I find myself turning to it any time I have any data I want to poke at and start exploring. That's the data journalism angle - "find stories in data".
In terms of commercial applications, I have a strong hunch that if I can help journalists find stories in their data, I can help everyone else find stories in their data as well.
Another key detail here is the plugins.
WordPress is a good CMS... with 10,000+ plugins that mean you can point it at any content publishing problem you can think of. As a result, it runs a double-digit percentage of the web now.
The most ambitious version of Datasette looks like that.
I want to build an open source EDA (Exploratory Data Analysis) and publishing tool that has thousands of plugins that mean you can use it to solve any data exploration, analysis, visualization or publishing problem.
It's at 127 plugins so far, so there's still a long way to go - but it's a great start! https://datasette.io/plugins
Attempting to turn the above into single sentences is hard, because there are a lot of different angles to it - but here are a few attempts:
Datasette is the fastest way to publish data online as an interactive, searchable database.
Datasette is WordPress for data: an extensible open source platform with plugins for exploring, analyzing, visualizing, and publishing data.
This is a great explanation. And to me, what sets Datasette apart from a generic SQL UI, is that Datasette excels at publishing _specific_ and curated datasets and allowing interactive exploration in a way that plain CSVs just don't offer.
I have been able to use Datasette for so much cool stuff in the last few years, I can't recommend it enough and will definitely try out datasette.cloud!
Datasette is fantastic. At my last job I jury-rigged a way to publish Datasette internally to our Azure cloud so I could quickly share the results of complicated SQL queries we were running. Glad to see Simon has got the dot cloud up himself.
I caught Simon on the Latent Space podcast and have spent the last few weeks going through his blogs and various YT videos. As a former journalist, I’ve been wanting to try to learn some data journalism. Maybe I’ll try it when this is available.
Interesting to see how data journalism has evolved into uploading CSVs into the cloud.
I remember when the hottest thing at the IRE (Investigative Reporters and Editors) conventions was learning how to extract and decipher the data from 9-track tapes.
In the early days of FOIA, governments would try to stymie your reporting by "complying" with data requests: dumping massive amounts of information on you in giant 9-track data reels.
Almost no newsroom had the equipment or technical ability to read them, so we had to figure things out by ourselves, or find friendly businesses and institutions that would help us out.
I love reminding people that NICAR (National Institute for Computer-Assisted Reporting, part of IRE) was founded in the 1980s and involved working with mainframes. Data journalism is not a new thing!
I wrote my first "database" program on a C64 with a Datasette (when I was about 7 years old I think, it didn't really do much!), the name is absolutely an homage to that.
Has there been any interest in using Datasette for bioinformatics? I didn’t see any plugins for that space, but I could see a lot of potential for scientists to publish their datasets in an interactive form.
Better-equipped or tech-savvy groups do this using custom websites today, and some people upload raw data to central "repositories." A suitably-priced offering of Datasette Cloud could open this up to many more scientists.
Python already has a fantastic ecosystem of biology-related libraries (arguably R’s is better but Python is definitely a contender).
One potential risk is that “omics” datasets are often much bigger than is typical for SQLite.
I've heard from a couple of people who are using it for bioinformatics. It's not an area I know anything about myself but I'm excited to hear it's being applied there.
How big are we talking here?
My rule of thumb for SQLite and Datasette is that anything up to 1GB will Just Work. Up to 10GB works OK too but you need to start thinking a little bit about your indexes.
Beyond 10GB works in theory, but you need to start throwing more hardware at the problem (mainly RAM) if you're going to get decent response times.
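As a rough sketch of what "thinking about your indexes" means in practice (the table and column names are invented): add an index on the columns people filter and facet by, then check that SQLite actually uses it:

    # Sketch: index the columns used for filtering/faceting in a multi-GB database.
    # Table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect("big.db")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_country ON events(country)")
    conn.commit()

    # Confirm a typical filter query is a SEARCH using the index, not a full SCAN
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE country = ?", ("FR",)
    ).fetchall()
    print(plan)
    conn.close()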
The theoretical maximum for a single SQLite database file is 280TB - it used to be 140TB but someone out there in the world ran up against that limit and the SQLite developers doubled it for them!
Lots of science is «big data, small but important metadata». Also «big raw data, small result data» use cases are out there. (I used to do hyperspectral stuff for a while, which lets you record tons of sensor data to get a small and neat result, think TB -> kB). So GB might not be the best or only metric, as such.
My story for Datasette and Big Data at the moment is that you can use Big Data tooling - BigQuery, Parquet, etc. - but then run aggregate queries against that data which produce an interesting ~10MB/~100MB/~1GB summary that you then pipe into Datasette for people to explore.
I've used that trick myself a few times. Most people don't need to be able to interactively query TBs of data, they need to be able to quickly filter against a useful summary of it.
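A sketch of what that trick can look like in Python with pandas (the paths, columns, and aggregation are placeholders, not a real pipeline):

    # Sketch: reduce a big Parquet dataset to a small aggregate table,
    # then load it into SQLite for Datasette. All names are placeholders.
    import sqlite3
    import pandas as pd

    # Reading s3:// paths needs the s3fs package installed
    df = pd.read_parquet("s3://my-bucket/events/")

    summary = (
        df.groupby(["country", "month"])
          .agg(events=("event_id", "count"), users=("user_id", "nunique"))
          .reset_index()
    )

    conn = sqlite3.connect("summary.db")
    summary.to_sql("events_by_country_month", conn, if_exists="replace", index=False)
    conn.close()
    # Then: datasette publish vercel summary.db --project my-summary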
I have a friend who's super smart, currently works in biotech, and studied computer science - would you be interested in chatting with him about possible applications? Happy to make an introduction if you like!
I know I’m not Simon but I’ve been in this space for a while and would love to chat with your friend about what applications they’re thinking about and how they’ve been solving this problem at their current company.
Are launch congratulations in order? If so, congratulations! I'm super excited to see where you take this, and I hope you're able to find a solid business model to support you and your work.
Perhaps I missed it while skimming, but one thing that’s not really explained is Datasette’s take on versioning. If you edit some rows in a table and change your mind later, is there an undo for just that change?
For small amounts of data, sharing files on GitHub is a default choice and I wonder what I’d be giving up. (There is also DoltHub but it didn’t quite do what I wanted when I kicked the tires a bit.)
DoltHub is explicitly trying to be a “GitHub for data” and it seems like Datasette could become that, though maybe with a different take on versions.
I've been thinking about this quite a bit recently. I want to start adding features where LLMs can help with data cleanup, but for that to be useful it will need a VERY robust "undo" in case they make mistakes.
I've also had a lot of success using GitHub itself for versioned data. If your data is less than a GB (and each file is under 50MB) you can dump it out to a GitHub repo and use that to track changes over time.
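One hypothetical way to produce that dump (table layout assumed, not the exact workflow described above) is newline-delimited JSON per table, which keeps git diffs readable:

    # Sketch: export every table to newline-delimited JSON for git-friendly diffs.
    # Assumes ordinary rowid tables; file layout is up to you.
    import json
    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.row_factory = sqlite3.Row

    tables = [r["name"] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    for table in tables:
        with open(f"{table}.ndjson", "w") as f:
            for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid'):
                f.write(json.dumps(dict(row), default=str) + "\n")
    conn.close()
    # Then commit on a schedule: git add *.ndjson && git commit -m "Snapshot"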
Congrats!! How does it compare to the ELT space and the modern data stack, where the ingestion/storage/visualization layers are decoupled?
Asking as the founder of CloudQuery (https://github.com/cloudquery/cloudquery). I've seen Datasette come up quite a few times around data exploration, but I'm curious to hear about its most popular use cases!
This is a great question, and touches on one of the challenges I've been having positioning Datasette.
Is Datasette an ELT tool? That's part of the ground it covers, but it's not a primary focus.
Is it a visualization tool? Same problem.
I worry that picking a specific vertical for it instantly limits me, in terms of how people think about the product, how pricing can work and suchlike.
But the alternative is trying to define a new category entirely, which is absurdly difficult.
I work with large text datasets, and I typically have to go through hundreds of samples to evaluate a dataset's quality and determine if any cleaning or processing needs to be done.
A tool that lets me sample and explore a dataset living in cloud storage, and then share it with others, would be incredibly valuable, but I haven't seen any tools that support long-form non-tabular text data well.
This is also an area that I'm starting to explore with LLMs. I love the idea that you could take a bunch of messy data, tell Datasette Cloud "I want this imported into a table with this schema"... and it does that.
Amazing. Simon, do you know of any museums using this (I know of your niche museum site!) - I'm thinking more of museum collections? Would love a conversation.
There are a bunch of people using it in the cultural / heritage space now, but I've not seen an official museum collection published online yet. Really looking forward to the first time that happens!
Always interested in talking - swillison @ Google's mail service.
My interest is now piqued... How does this look on the backend? Does it store Parquet files, and if so, where? What's the compute model over those files (pyarrow, Spark, Trino)?
Mostly trying to understand how far this will scale.
It's using SQLite files (all of Datasette is built around SQLite at the moment) which are stored on Fly Volumes (Datasette Cloud provides a dedicated Fly Machines Firecracker container for each team account) and backed up to S3 using Litestream.
The initial goal was to provide a private collaboration space, where scaling isn't as much of a challenge - at least until you get companies with thousands of employees all using it at once, though even then I would expect SQLite to be able to keep up.
I've since realized that the "publishing" aspect of Datasette is crucially important to support. For that I have a few approaches I'm exploring:
1. Published data sits behind a Varnish cache, which should then handle huge spikes of traffic as long as it's to the same set of URLs.
2. Datasette has a great scalability story already for read-only data: you publish to something like Cloud Run or Vercel which can spin up new copies of the data on-demand to handle increased traffic. So I could let Datasette Cloud users say "publish this subset of data once every X minutes" and use that.
3. Fly are working on https://fly.io/docs/litefs/ which is a perfect match for Datasette Cloud - it would allow me to run read-replicas of SQLite databases in multiple regions around the world.
Part of Datasette/Datasette Cloud development is sponsored by Fly at the moment, in return for which we'll be publishing detailed notes on what we learn about building and scaling on their platform.
In terms of scaling volume storage itself... the technical size limit for SQLite is 280TB, but I'm not planning on getting anywhere near that! I expect the sweet spot for Datasette Cloud will be more around the 100MB to 100GB range, probably mostly <10GB.