Welcome to Datasette Cloud (datasette.cloud)
317 points by swyx on Aug 20, 2023 | 68 comments



Love Datasette. I have successfully deployed it internally to host many studies that were too big (100MB-20GB) for normal users to share or consume. Historical options have been to: distribute very high-level summary information (with a few data call-outs), build a minimal Django app, or use a much heavier-weight solution (e.g., Metabase).

Once you get into this medium data space, just distributing the data becomes a challenge as you can no longer email results around. Maybe there are limits on network shares, SharePoint, whatever your customer is accustomed to using. Then, you run into the problem that your typical user can only use Excel, which will similarly barf on too much data.

Previously, I would make a few data call-outs for the most interesting results: whatever could fit in an email or a PowerPoint. Maybe an attached Excel file with the top N interesting data points. Tell the customers to reach out if they have questions or need to know anything else. Questions rarely came.

Now, you can give them ~everything (usually I would only include data after some amount of processing; raw signals are not useful without extreme domain knowledge or the software to process them), build up a few views to show different data highlights, add a <5 minute tutorial (“this is Super Excel, this is how you filter data”), and away they go.

With my first deployment, I thought it was a cute trick that just satisfied my nerd curiosity. However, when I checked in on the logs, I saw they were hammering the system. For a routine study, they were looking up all sorts of things. Which made me wonder: how many times in the past had they wanted more detail, but did not want to bother me? They were now empowered by data, and they could do their own sleuthing. When I did receive questions post-Datasette, they were more sophisticated, because they were able to answer the routine ones on their own.


Wow, this is a fantastic success story. I really need to start gathering case studies like this!

Absolutely love "this is Super Excel, this is how you filter data".


I've heard the name Datasette a couple of times but never spent time informing myself about what it can be used for.

The video on the landing page does a really great job at explaining it, which doesn't happen all too often.

Simon is to be congratulated for it, and for the entire project. I really hope he has the best of success with his cloud offering.


Where is the video? I don't see it in the blog post or the home page when I click through.


The https://datasette.io/ homepage has that video.

I just edited this blog post to link to that.


Thank you, sir!


Datasette is an open project from Simon Willison (a fairly well-known HNer), and this looks like his monetisation project - good luck to you, hope Softbank buys you out soon :-)

(It's a sort of wrapper around sqlite files so it's fairly easy to publish a file, think maybe Tableau for sqlite?)

Anyway all the best


Hah, Softbank isn't the goal here!

I realized that Datasette is the first project of my entire career where if I was still working on it in 15 years time I wouldn't feel bored yet. There's just SO MUCH scope for interesting applications of the core idea.

As such, I want to work on it for decades. But it's lonely working on it alone (the community around it has been growing and is delightful, but it's not the same as having a full-time team.)

So the question I'm trying to answer is how to make the project financially sustainable in the long-run - not just for myself, but so I can pay for a team to work on it with me.

There are plenty of other examples of open source projects that have turned SaaS hosting into a sustainable business model - WordPress and GitLab are just two of the best examples. It feels like it's a reasonably well-trodden path.

Plus... I want people to be able to use my software. Currently to use Datasette as an individual you either have to "pip" or "brew" install it, or you can try the macOS Electron app - https://datasette.io/desktop - but I want newsrooms to be able to use it to collaborate on data. And most newsrooms aren't well equipped to configure a Linux server.

So I realized that a hosted SaaS version can solve two issues at once: it can help the audience I care about actually benefit from the value of the software so far, and it provides a reasonably realistic path to financial sustainability for the project as a whole.

And yeah, I'd also like to make a ton of money out of it myself too!


I am generally a naive and simple person. I think I would appreciate some investment to make Datasette "more approachable" and user-friendly for laypeople.

Datasette's UX and setup seem to be geared more towards data hackers with a hobby in reporting. Personally, I don't see it as a standard toolkit for data reporting or data journalism. Even though you might argue, "What more do you want? It's as simple as it gets", to be honest, Simon has mentioned that the intended users are journalists who may not possess the data-hacking skills required to get started with Datasette.

Datasette is not a BI tool or an OSINT tool. As it is, Datasette is positioned between data enthusiasts and investigative reporters, which is a very narrow niche. This severely limits its potential.

Simon should consider monetization and, more specifically, hiring individuals who can make Datasette Cloud more accessible. I think he recognizes this as he has created a GUI application which is a step in the right direction.


> As it is, Datasette is positioned between data enthusiasts and investigative reporters, which is a very narrow niche. This severely limits its potential.

I bet in 2005 if you asked Simon what Django was initially intended to be, he'd say something similarly niche. Dot dot dot, it became the backend for Instagram. Gotta change hats from evaluating present state to future potential when presented with something new, particularly when the author has a track record.


In 2005 Adrian and I thought Django was a CMS for newspaper websites! https://simonwillison.net/2010/Aug/24/what-is-the-history/

Kind of funny that it's nearly 20 years later and I'm working on something else that I initially thought would be for journalists but is clearly useful for way more than that.


That's absolutely part of the plan here: I want to grow Datasette to a size where I can have full-time UX and design people working with me on it.

I'm also cautiously optimistic about the role LLMs can play here - hence https://llm.datasette.io/

Journalists are good at words. An interface to their data that plays to their strengths there feels like it could be transformational - provided it doesn't hallucinate at them!


> Datasette is not a BI tool or an OSINT tool. As it is, Datasette is positioned between data enthusiasts and investigative reporters, which is a very narrow niche. This severely limits its potential.

FYI, you can already perform some "BI as code" (as I like to call it) using the Datasette Dashboards plugin[1]: specify charts using SQL queries + a visual spec (Vega, Vega-Lite, maps, tables, etc.), and assemble a dashboard layout. It is not yet as feature-rich as Metabase, for instance, but several people have been using it successfully for various use cases.

(disclaimer: I'm the author of the Datasette Dashboards plugin)

[1]: https://datasette.io/plugins/datasette-dashboards


Bit cynical, no?


The cynical thing is to interpret something that talks about monetization as negative - it's a good thing Datasette is finding a way to make money.

Great tool that fills out the SQLite ecosystem well - I always have a server running when I'm working with applications with SQLite databases.


> hope Softbank buys you out soon :-)

The cynical thing is to say this is about monetisation and an exit, rather than about making an excellent tool more easily available to journalists.


The Softbank line was definitely a joke.


I interpreted the tone as friendly, lighthearted humour, rather than cynical.


[flagged]


No offense, but was this comment written by ChatGPT or something? Please blink if not.


Yes, my original comment was supportive (an HNer launching a SaaS based on their FOSS work - great!)

Honestly, this is HN - if we don't sound a little bit like AI-generated text, something went wrong :-)


You're not blinking.


Yes they are. In... Is that Morse code?


I don't mean to be a party-pooper, but I watched the full Datasette pitch video and there's something I probably just don't get.

Isn't this really just a SQL GUI? Like basically any other SQL admin panel out there (minus the writes)?

What's the distinctive feature here? The extensions?


I think of it more like MS Access, but with a sane backend of SQLite and Python. There are thousands and thousands of critical business processes kludged together in Excel and Access--Datasette could be a much better choice for those use cases. Something both devs and business people can use.


Yeah Access is a really interesting comparison (Datasette has quite a way to go on that front).

I find it baffling that Microsoft haven't invested more in Access. The world needs a truly great desktop/mobile database solution! Excel isn't enough.

Regular human beings should be able to point a full database at their own problems.


Check out Grist in the ‘Access with sane backend’ space. SQLite, open source and fantastic UX https://www.getgrist.com/ and https://github.com/gristlabs/grist-core

I use and love both Datasette and Grist - they’re complementary.


Totally agree: so many workflows people get strong feelings about customizing--note-taking, todo lists, personal document management, inventories of goods, etc.--are really just a SQLite database with some nice custom views and interfaces. I could definitely see a future where Datasette or similar tools replace some of that stuff.

Access is probably caught in a weird spot internally at MS. If they put effort into it, it just removes some of the need to sell proper SQL Server or Azure cloud database tech. Better to just limp it along than start internal wars with bigger organizations/products.


And the great thing about those tools is that Datasette doesn't need to replace them - SQLite becomes the integration layer, so you can use any tool you like that provides a neat UI to storing data in SQLite, then use Datasette itself directly against that same database when you need to run your own SQL or integrate with other JSON apps or run custom plugins.


I was constantly thinking about MS Access while watching the introductory video. I loved MS Access in the 90s, and this being based on SQLite and Python makes it really great.

The bigger pro is the fact that you can export the data as JSON, which basically means you have a server for your SQLite file that other applications can query, without needing a full-blown database server like MariaDB or Postgres - while you still have the possibility to explore the data manually.

So for small projects this seems to be a really good tool.
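
To illustrate: a minimal sketch of consuming that JSON output from Python (the host and table names here are hypothetical; any Datasette table is also available as JSON by appending ".json" to its URL, though the exact response shape varies by Datasette version):

    import json
    from urllib.request import urlopen

    # Hypothetical Datasette instance and table
    url = "https://example-datasette.fly.dev/mydb/locations.json"

    with urlopen(url) as response:
        data = json.load(response)

    # The default pre-1.0 shape includes column names plus rows as arrays
    print(data["columns"])
    for row in data["rows"][:5]:
        print(row)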


This is a perfectly reasonable question. I've been thinking about this for five years and I still don't have a great, snappy answer to this.

I can answer it in several paragraphs, but I really would like to be able to answer it in a single sentence some day!

So here goes with the several paragraph version...

The problem that Datasette solves better than anything else (a big claim, which I'm happy to be challenged on) is publishing structured data online.

Let's say you have data you want to share online - every global power plant, or the full history of the US congress. How do you do it?

Some options:

- Publish some CSV files somewhere - on a website, on GitHub, to S3

- Build a custom application for it, often using Django or Rails or similar.

- Put it in a Google Sheet - that's how we solved this problem at the Guardian years ago, see https://simonwillison.net/2018/Aug/19/instantly-publish-data...

Datasette was originally created to take on this problem. I realized that SQLite is the perfect platform for this: it's fast, robust and crucially can be deployed anywhere that can host a dynamic web application (if you're publishing read-only data you don't need to worry about backups and replication and suchlike).

Here's every global power plant: https://global-power-plants.datasettes.com/global-power-plan...

And US congressional legislators: https://congress-legislators.datasettes.com/legislators - I use that one in the Datasette tutorial: https://datasette.io/tutorials/explore

If you're comfortable with the command-line, I challenge you to find a quicker way to publish data online than this:

    # Import the CSV into a SQLite table (-d auto-detects column types)
    sqlite-utils insert manatees.db locations \
      Manatee_Carcass_Recovery_Locations_in_Florida.csv --csv -d
    # Clean up the schema: rename columns, drop unneeded ones, set a primary key
    sqlite-utils transform manatees.db locations \
      --rename LAT latitude \
      --rename LONG_ longitude \
      --drop created_user \
      --drop last_edited_user \
      --drop X \
      --drop Y \
      --drop STATE \
      --drop OBJECTID \
      --pk FIELDID

    # Deploy the database to Vercel with the cluster-map plugin installed
    datasette publish vercel manatees.db \
      --project datasette-manatees --install datasette-cluster-map
That's using the datasette-publish-vercel plugin, but Datasette can also publish to Fly, Google Cloud Run, Heroku and more using additional plugins: https://docs.datasette.io/en/stable/publish.html

For more on the sqlite-utils bits see https://datasette.io/tutorials/clean-data

So that's publishing. But Datasette has grown far beyond that in the five years I've been working on it.

I find myself turning to it any time I have any data I want to poke at and start exploring. That's the data journalism angle - "find stories in data".

In terms of commercial applications, I have a strong hunch that if I can help journalists find stories in their data, I can help everyone else find stories in their data as well.

Another key detail here is the plugins.

WordPress is a good CMS... with 10,000+ plugins that mean you can point it at any content publishing problem you can think of. As a result, it runs a double-digit percentage of the web now.

The most ambitious version of Datasette looks like that.

I want to build an open source EDA (Exploratory Data Analysis) and publishing tool that has thousands of plugins that mean you can use it to solve any data exploration, analysis, visualization or publishing problem.

It's at 127 plugins so far, so there's still a long way to go - but it's a great start! https://datasette.io/plugins

Attempting to turn the above into single sentences is hard, because there are a lot of different angles to it - but here are a few attempts:

Datasette is the fastest way to publish data online as an interactive, searchable database.

Datasette is WordPress for data: an extensible open source platform with plugins for exploring, analyzing, visualizing, and publishing data.


Heh...

"I can answer it in several paragraphs, but I really would like to be able to answer it in a single sentence some day!" ...

"Datasette is WordPress for data: an extensible open source platform with plugins for exploring, analyzing, visualizing, and publishing data."


Confession: I posted my multi-paragraph comment into Claude and asked it for some ideas. It didn't come up with exactly that, but what it said helped me get there: https://gist.github.com/simonw/a160a49d39446aa53870ed1ca43d7...


This is a great explanation. And to me, what sets Datasette apart from a generic SQL UI is that Datasette excels at publishing _specific_ and curated datasets and allowing interactive exploration in a way that plain CSVs just don't offer.


This looks really great Simon, best of luck!

I have been able to use Datasette for so much cool stuff in the last few years, I can't recommend it enough and will definitely try out datasette.cloud!


Datasette is fantastic. While working my last job I jury-rigged a way to publish Datasette internally to our Azure cloud so I could quickly share the results of some complicated SQL queries we were running. Glad to see Simon has got the dot cloud up himself.


I caught Simon on the Latent Space podcast and have spent the last few weeks going through his blogs and various YT videos. As a former journalist, I’ve been wanting to try to learn some data journalism. Maybe I’ll try it when this is available.


Whoa, not often that I get to introduce Simon to people; usually it's the other way around. Thanks for listening!

He's been on our pod 3x:

- https://latent.space/p/llama2

- https://latent.space/p/code-interpreter

- https://latent.space/p/no-moat

which is great because we never had to schedule him at our studio - he just dials in from home or from the park or whatever, haha


BTW, you can try it right now - that's the joy of open source. Simon recently did a 2hr tutorial here: https://www.youtube.com/watch?v=5TdIxxBPUSI


Interesting to see how data journalism has evolved into uploading CSVs into the cloud.

I remember when the hottest thing at the I.R.E.† conventions was learning how to extract and decipher the data from 9-track tapes.

In the early days of FOIA, governments would try to stymie your reporting by "complying" with data requests: dumping massive amounts of information on you in giant 9-track data reels.

Almost no newsroom had the equipment or technical ability to read them, so we had to figure things out by ourselves, or find friendly businesses and institutions that would help us out.

https://www.ire.org/


I love reminding people that NICAR (National Institute for Computer-Assisted Reporting, part of IRE) was founded in the 1980s and involved working with mainframes. Data journalism is not a new thing!


Simon is a great technologist. I have learnt quite a bit from his videos and articles. I hope this works out for him!


I thought this was a reference to the C64 C2N Datasette tape drive. Unfortunately not.

https://en.m.wikipedia.org/wiki/Commodore_Datasette


I wrote my first "database" program on a C64 with a Datasette (when I was about 7 years old, I think; it didn't really do much!). The name is absolutely an homage to that.


I had been expecting something in the spirit of the "Floppy RAID" [1], but something even more ridiculous on even older technology.

1: https://youtu.be/1hc52_PWeU8


That’s funny. His comedy reminds me of the man who did “Roadworthy Rescues”.

https://youtu.be/-A8cvrTgqGk


I was hoping for the same. Still have the 1531 lying in the attic (next to the C16).


Has there been any interest in using Datasette for bioinformatics? I didn’t see any plugins for that space, but I could see a lot of potential for scientists to publish their datasets in an interactive form.

Better-equipped or tech-savvy groups do this using custom websites today, and some people upload raw data to central "depositories." A suitably-priced offering of Datasette Cloud could open this up to many more scientists.

Python already has a fantastic ecosystem of biology-related libraries (arguably R’s is better but Python is definitely a contender).

One potential risk is that “omics” datasets are often much bigger than is typical for SQLite.


I've heard from a couple of people who are using it for bioinformatics. It's not an area I know anything about myself but I'm excited to hear it's being applied there.

How big are we talking here?

My rule of thumb for SQLite and Datasette is that anything up to 1GB will Just Work. Up to 10GB works OK too but you need to start thinking a little bit about your indexes.

Beyond 10GB works in theory, but you need to start throwing more hardware at the problem (mainly RAM) if you're going to get decent response times.

The theoretical maximum for a single SQLite database file is 280TB - it used to be 140TB but someone out there in the world ran up against that limit and the SQLite developers doubled it for them!
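
For what it's worth, a minimal sketch of that index tuning using the sqlite-utils Python API (the database, table and column names here are hypothetical):

    import sqlite_utils

    # Hypothetical multi-GB database
    db = sqlite_utils.Database("studies.db")

    # Index the columns you filter and sort on most often - for databases
    # in the 1GB-10GB range this is what keeps query times reasonable
    db["measurements"].create_index(
        ["study_id", "recorded_at"], if_not_exists=True
    )

    # ANALYZE gathers statistics that help SQLite pick the right index
    db.execute("ANALYZE")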


Lots of science is «big data, small but important metadata». Also «big raw data, small result data» use cases are out there. (I used to do hyperspectral stuff for a while, which lets you record tons of sensor data to get a small and neat result, think TB -> kB). So GB might not be the best or only metric, as such.


My story for Datasette and Big Data at the moment is that you can use Big Data tooling - BigQuery, Parquet, etc, but then run aggregate queries against that which produce an interesting ~10MB/~100MB/~1GB summary that you then pipe into Datasette for people to explore.

I've used that trick myself a few times. Most people don't need to be able to interactively query TBs of data, they need to be able to quickly filter against a useful summary of it.
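
A rough sketch of that workflow (file, column and table names are made up; assumes pandas with a Parquet engine, plus sqlite-utils):

    import pandas as pd
    import sqlite_utils

    # Aggregate a large Parquet dataset down to a small summary table
    df = pd.read_parquet("events-full.parquet")
    summary = (
        df.groupby(["country", "month"])
          .agg(total=("amount", "sum"), events=("amount", "count"))
          .reset_index()
    )

    # Write the ~MB-scale summary to SQLite, ready for Datasette to serve
    db = sqlite_utils.Database("summary.db")
    db["event_summary"].insert_all(summary.to_dict(orient="records"))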


I have a friend who's super smart, currently works in biotech, and studied computer science. Would you be interested in chatting with him about possible applications? Happy to make an introduction if you like!


I know I’m not Simon but I’ve been in this space for a while and would love to chat with your friend about what applications they’re thinking about and how they’ve been solving this problem at their current company.

Nicholas at sphinxbio dot com


Yes, absolutely! I'm swillison @ Google's email provider.


Cool, I'll reach out, and let you know if he's interested!


I've been a big fan of Datasette - as Simon recently featured in https://simonwillison.net/2023/Aug/11/dependency-management-... - and think this looks really cool, both as an approach and a service.

Deffo considering using it!


Are launch congratulations in order? If so, congratulations! I'm super excited to see where you take this, and I hope you're able to find a solid business model to support you and your work.


Perhaps I missed it while skimming, but one thing that’s not really explained is Datasette’s take on versioning. If you edit some rows in a table and change your mind later, is there an undo for just that change?

For small amounts of data, sharing files on GitHub is a default choice and I wonder what I’d be giving up. (There is also DoltHub but it didn’t quite do what I wanted when I kicked the tires a bit.)

DoltHub is explicitly trying to be a “GitHub for data” and it seems like Datasette could become that, though maybe with a different take on versions.


I've been thinking about this quite a bit recently. I want to start adding features where LLMs can help with data cleanup, but for that to be useful it will need VERY robust "undo" for when they make mistakes.

I wrote up one of my explorations here: sqlite-history https://simonwillison.net/2023/Apr/15/sqlite-history/
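
The general shape of the trigger-based idea explored in that post looks something like this (a toy sketch with hypothetical table names, not the actual sqlite-history schema):

    import sqlite3

    db = sqlite3.connect("notes.db")  # hypothetical database
    db.executescript("""
    CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE IF NOT EXISTS notes_history (
        note_id INTEGER,
        old_body TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Capture the previous value on every update, enabling undo later
    CREATE TRIGGER IF NOT EXISTS notes_track_updates
    AFTER UPDATE ON notes BEGIN
        INSERT INTO notes_history (note_id, old_body)
        VALUES (old.id, old.body);
    END;
    """)

    db.execute("INSERT INTO notes (body) VALUES ('first draft')")
    db.execute("UPDATE notes SET body = 'second draft' WHERE id = 1")
    print(db.execute("SELECT * FROM notes_history").fetchall())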

I've also had a lot of success using GitHub itself for versioned data. If your data is less than a GB (and each file is under 50MB) you can dump it out to a GitHub repo and use that to track changes over time.

One example of that is my personal blog, here: https://github.com/simonw/simonwillisonblog-backup/tree/main...

That's using this tool: https://datasette.io/tools/sqlite-diffable

I imagine Datasette Cloud will end up with some sort of hybrid of those approaches.


Congrats!! How does it compare to the ELT space and the modern data stack where you have ingestion/storage/visualization layers decoupled?

Asking as the founder of CloudQuery (https://github.com/cloudquery/cloudquery). I've seen Datasette quite a few times around data exploration, but curious to hear about the most popular use-cases of Datasette!


This is a great question, and touches on one of the challenges I've been having positioning Datasette.

Is Datasette an ELT tool? That's part of the ground it covers, but it's not a primary focus.

Is it a visualization tool? Same problem.

I worry that picking a specific vertical for it instantly limits me, in terms of how people think about the product, how pricing can work and suchlike.

But the alternative is trying to define a new category entirely, which is absurdly difficult.


How does Datasette work with unstructured data?

I work with large text datasets, and I typically have to go through hundreds of samples to evaluate a dataset's quality and determine if any cleaning or processing needs to be done.

A tool that lets me sample and explore a dataset living in cloud storage, and then share it with others, would be incredibly valuable, but I haven't seen any tools that support long-form non-tabular text data well.


There are a few things you can do here.

SQLite is great at JSON - so I often dump JSON structures in a TEXT column and query them using https://www.sqlite.org/json1.html
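
For example, a minimal sketch with Python's built-in sqlite3 (the table and fields are hypothetical):

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, data TEXT)")
    db.execute(
        "INSERT INTO docs (data) VALUES (?)",
        (json.dumps({"title": "Example", "tags": ["news", "data"]}),),
    )

    # json1 functions query inside the TEXT column directly
    rows = db.execute(
        "SELECT json_extract(data, '$.title') FROM docs "
        "WHERE json_extract(data, '$.tags[0]') = 'news'"
    ).fetchall()
    print(rows)  # [('Example',)]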

I also have plugins for running jq() functions directly in SQL queries - https://datasette.io/plugins/datasette-jq and https://github.com/simonw/sqlite-utils-jq

SQLite's FTS search is surprisingly decent, and I have tools for quickly turning that on both from a CLI: https://sqlite-utils.datasette.io/en/stable/cli.html#configu... and as a Datasette Plugin (available in Datasette Cloud): https://datasette.io/plugins/datasette-configure-fts
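
And a minimal sketch of enabling FTS via the sqlite-utils Python API (hypothetical table and column names; assumes a fresh database, since enable_fts errors if FTS is already configured):

    import sqlite_utils

    db = sqlite_utils.Database("docs.db")  # hypothetical database
    db["documents"].insert_all(
        [
            {"id": 1, "body": "long-form text sample one"},
            {"id": 2, "body": "another long-form sample"},
        ],
        pk="id",
        replace=True,
    )

    # Build an FTS index on the body column, then search it
    db["documents"].enable_fts(["body"])
    for row in db["documents"].search("sample"):
        print(row)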

I've been trying to drive the cost of turning semi-structured data into structured SQL queries down as much as possible with https://sqlite-utils.datasette.io - see this tutorial for more: https://datasette.io/tutorials/clean-data

This is also an area that I'm starting to explore with LLMs. I love the idea that you could take a bunch of messy data, tell Datasette Cloud "I want this imported into a table with this schema"... and it does that.

I have a prototype of this working now, I hope to turn it into an open source plugin (and Datasette Cloud feature) pretty soon. It's using this trick: https://til.simonwillison.net/gpt3/openai-python-functions-d...


Amazing. Simon, do you know of any museums using this (I know of your niche museum site!) - but thinking more of museum collections? Would love a conversation.


There are a bunch of people using it in the cultural / heritage space now, but I've not seen an official museum collection published online yet. Really looking forward to the first time that happens!

Always interested in talking - swillison @ Google's mail service.


Ace, thanks, will see if there's stuff we can do with our client base - will be in touch :-)


My interest is now piqued. How does this look on the backend? Does this store Parquet files, and if so, where? What's the compute model over those files (pyarrow, Spark, Trino)?

Mostly trying to understand how far this will scale.


It's using SQLite files (all of Datasette is built around SQLite at the moment) which are stored on Fly Volumes (Datasette Cloud provides a dedicated Fly Machines Firecracker container for each team account) and backed up to S3 using Litestream.

The initial goal was to provide a private collaboration space, where scaling isn't as much of a challenge - at least until you get companies with thousands of employees all using it at once, though even then I would expect SQLite to be able to keep up.

I've since realized that the "publishing" aspect of Datasette is crucially important to support. For that I have a few approaches I'm exploring:

1. Published data sits behind a Varnish cache, which should then handle huge spikes of traffic as long as it's to the same set of URLs.

2. Datasette has a great scalability story already for read-only data: you publish to something like Cloud Run or Vercel which can spin up new copies of the data on-demand to handle increased traffic. So I could let Datasette Cloud users say "publish this subset of data once every X minutes" and use that.

3. Fly are working on https://fly.io/docs/litefs/ which is a perfect match for Datasette Cloud - it would allow me to run read-replicas of SQLite databases in multiple regions around the world.

Part of Datasette/Datasette Cloud development is sponsored by Fly at the moment, in return for which we'll be publishing detailed notes on what we learn about building and scaling on their platform.

In terms of scaling volume storage itself... the technical size limit for SQLite is 280TB, but I'm not planning on getting anywhere near that! I expect the sweet spot for Datasette Cloud will be more around the 100MB to 100GB range, probably mostly <10GB.


@simonw, Datasette Cloud is a great idea! I wonder if it can benefit from additional ETL capabilities. How can I reach you? Thanks



