Hacker News
The Future Database (planetscale.com)
117 points by davidgomes on May 23, 2022 | 38 comments



I can't make sense of lots of this.

- Without self-replicating grey goo, infinite scalability is surely more a property of some kind of networked computer rental business (like AWS) rather than a database.

- What does 'serverless' mean exactly? My understanding is that it denotes a stateless application which is executed to serve a request but doesn't run continually as a daemon. Essentially the aforementioned computer rental business provides the event loop and the program provides the event handler. I fail to see how this is compatible with a database, which is definitionally very stateful. (And that encompasses much, much more than just the data.)

- 'Intelligence': Databases already are intelligent and self-optimising, and have been at least since MySQL/Postgres: https://dev.mysql.com/doc/refman/8.0/en/cost-model.html#cost...

- 'Fundamentally reliable': This idea could reasonably be described as, uh, 'not novel'.

- 'Distributed globally, locally available': As far as I can tell, this collection of words is entirely devoid of any meaning. It sounds like it came out of a random passphrase generator.

- 'Scale should not come at the cost of performance'. While technically semantically meaningful, this is not novel or interesting, and I'm pretty sure this has been a pleasant daydream for database designers since databases were stored in punchcards. As far as I can see, this is comparable to saying 'houses should not come at the cost of money'.

This feels more like a laundry list of daydreams rather than a meaningful narrowing-down of how future databases will be architected. "It should be infinitely scalable, usable by a toaster, and it shouldn't need a computer to live on. It should be serverless and stateless but also self-optimising and with connection pooling. It should be consistent, available, and, uh, partitions, it should be cool with those too. It should be usable by anyone and perfectly tailored to their needs as well as to the opposite needs. Also..."


I suspect the intention is something more like IPFS (https://ipfs.io), built on some distributed data structure like a DHT. With that in mind:

- "Infinitely scalable" in the sense of the internet, I suppose? If we had a way of paying for storage and indexing in a decentralized way, then you could switch from one 'provider' to another, or host your own data. Data would not be siloed in the way it is by AWS.

- I assume "serverless" means you don't have some specific upstream server you must hit with requests. You could submit queries to a local job, which could pass them to any node in the network.

- "Fundamentally reliable" because data is replicated, and jobs can be executed by any member of the network.

- "Distributed globally" because you could store data from any internet-connected device, "available locally" because, again, you don't have some single-point-of-failure server you need to connect to, or VPN you need to join, or whatever: your computer would be a node on the network, as capable of accessing data and running jobs as any other node.

And "intelligence" and "no performance cost" are just aspirations.
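The DHT reading above can be made concrete with a toy consistent-hash ring, the structure many DHTs build on: keys and node IDs share one hash space, and each key lives on the first node clockwise from its hash. A minimal sketch (node names and the key are invented):

```python
import bisect
import hashlib

def h(value: str) -> int:
    """Map a string into a 160-bit hash space (SHA-1, as in many DHTs)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each key is owned by the first node
    whose hash is >= the key's hash, wrapping around the ring."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        hashes = [node_hash for node_hash, _ in self.ring]
        i = bisect.bisect_right(hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-key")
# Adding a node only claims the keys that now hash to it; every other
# key keeps its owner -- which is what makes growth cheap.
```

The "no silo" property falls out of the structure: any party can run a node, and adding or removing one reshuffles only a small slice of the key space.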


> I suspect the intention is something more like IPFS (https://ipfs.io), built on some distributed data structure like a DHT.

The company's products are all built on top of sharded MySQL... unless they plan to throw away everything they have done so far, I do not think these assumptions are correct!


Ah, fair enough. I didn't look into the company, I've just come across suggestions along this line.


> What does 'serverless' mean exactly?

1) I think 'serverless' means ops-less: nobody has to manage servers, even at scale. Maybe you still need to optimise queries, but not to deploy databases, run migrations, and all that.

2) 'Scale should not come at the cost of performance' is easier said than done; it's the same idea as automatic caching and things like that.

3) I think the same: I will still use a monolith like PostgreSQL. I doubt I have more intensive workloads than Medium does.


> maybe you still need to optimise queries, but not to deploy databases, run migrations, and all that

Well, how do you indicate to it that you want to create a database? Does the database read your mind too? How do you indicate that you want to provision some more because there's a big event coming up?

> Scale should not come at the cost of performance is easier said than done

That's essentially my exact point, yeah. It's not a novel aspiration, and they don't contribute any kind of solution. It's like saying "computers should cost less and also be gooder".

> I think the same: I will still use a monolith like PostgreSQL

Yeah, I think that's probably sensible for many people, though I'd also welcome more innovation. Kleppmann's idea of 'unbundling the database' – i.e. the modern database becoming fragmented into several distinct components – is I think very promising and very probable. (Of course, your business and your production environment may not be somewhere you wish to be a hotbed of experimentation.)


It is possible to design scale-out database engines with very fast and elastic load following though it isn't common. In these kinds of systems, you don't provision for load, the system automatically adjusts to the load as it happens. In good designs you can often shed load in milliseconds once the additional server capacity is online, so the latency is often a function of how quickly you can bootstrap more server images.

These kinds of fast-twitch load shedding mechanics were not designed for elasticity -- it would not be worth the engineering investment in most cases. They were typically developed to support scale-out of data models for which uniform sharding is intrinsically impossible, requiring real-time adaptive resharding instead. If you have super-fast load shedding for extremely and unpredictably biased data distributions, you are 90% of the way to a really nice implementation of elastic capacity, just add hardware provisioning.

These setups are nice as a user, because the sharded nature of a table is (necessarily) completely transparent. You can create an empty new table and insert trillions of records without ever having to manage sharding or cluster capacity as the table grows. In this sense, the fact that it is running on a cluster of discrete servers does not leak through the database abstraction presented to the user.
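The real-time adaptive resharding described above can be sketched as a range-split policy (the threshold and key space are invented; a real engine would also migrate the new shard to fresh hardware, which is the "just add hardware provisioning" step):

```python
class RangeShard:
    """Toy shard covering the half-open key range [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.load = 0  # requests observed over the last interval

def rebalance(shards, hot_threshold=1000):
    """Split any shard whose observed load crosses the threshold.

    Splitting halves the key range, so an extremely biased
    distribution is spread across more shards each pass."""
    out = []
    for s in shards:
        if s.load > hot_threshold and s.hi - s.lo > 1:
            mid = (s.lo + s.hi) // 2
            out += [RangeShard(s.lo, mid), RangeShard(mid, s.hi)]
        else:
            out.append(s)
    return out

shards = [RangeShard(0, 100), RangeShard(100, 200)]
shards[0].load = 5000  # one pathologically hot range
shards = rebalance(shards)
# The hot [0, 100) range is now [0, 50) and [50, 100); repeat as needed.
```

The point of the sketch is that the split decision itself is cheap; the engineering cost is in moving data and routing traffic while the split happens.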


> How do you indicate that you want to provision some more because there's a big event coming up?

You literally do not. As load starts to increase, you scale up automatically. The word 'elastic' has been used for this pattern over the last few generations of cloud computing infra.

What's the alternative? Manually ssh into some box and crank up a mysql instance by hand, like it's 1996?


Yeah, I appreciate what 'auto' and 'scaling' signify. I've implemented an autoscaler on a huge Kubernetes cluster in the past. That's precisely where my doubt comes from.

I was about to write out a huge example, but I figure I'll just express the core logic simply. First, it takes a chunk of time to determine that increased traffic is not just random variance. Then it takes time to allocate and provision machines. And often the traffic spikes for you and for your co-tenants are not statistically independent, so the provider struggles to allocate machines in time when it most matters.

And how much do you scale up? 10x right away? Can't do that: vastly expensive, and anyway it could be a retry storm exacerbating things. 2x and then go from there? Well, if your business is serving ads during the Superbowl break, that's not gonna work. Etc. This all starts to look increasingly absurd against the backdrop of being able to just push a button and do it yourself.
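The timing problem above can be expressed as a toy reactive autoscaler (the window lengths, thresholds, and scale factor are all invented for illustration): confirming that a spike is real takes ticks, booting capacity takes more ticks, and during both you shed load.

```python
class ReactiveAutoscaler:
    """Toy autoscaler: scale up only after load exceeds capacity for
    `confirm_ticks` consecutive ticks (to rule out random variance),
    then wait `boot_ticks` for the new machines to come online.
    All numbers are illustrative, not from any real system."""
    def __init__(self, capacity, confirm_ticks=3, boot_ticks=5, factor=2):
        self.capacity = capacity
        self.confirm_ticks = confirm_ticks
        self.boot_ticks = boot_ticks
        self.factor = factor
        self.over = 0     # consecutive over-capacity ticks seen so far
        self.booting = 0  # ticks until the pending capacity arrives

    def tick(self, load):
        """Process one tick of load; return how many requests were shed."""
        if self.booting:
            self.booting -= 1
            if self.booting == 0:
                self.capacity *= self.factor  # new machines online
        elif load > self.capacity:
            self.over += 1
            if self.over >= self.confirm_ticks:  # spike confirmed
                self.booting = self.boot_ticks
                self.over = 0
        else:
            self.over = 0
        return max(0, load - self.capacity)

scaler = ReactiveAutoscaler(capacity=100)
dropped = sum(scaler.tick(load) for load in [100] * 5 + [500] * 10)
# A Superbowl-style 5x step spike sheds traffic for confirm + boot
# ticks before capacity doubles -- and doubling still isn't 5x.
```

Pushing the button yourself ahead of a known event is, in this model, just setting `capacity` before the spike arrives, which is why the manual option keeps looking attractive.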

I'm not suggesting never doing autoscaling. I'm just saying that wise men don't speak in absolutes, or make architectural decisions based on toy examples. Nor am I particularly bothered about whether I'm doing things "like it's 1996" - I'm not in the fashion business, so I'm purely interested in finding the optimal solution to my problem, and I couldn't care less whether it's flaming hot or whether it's from the Iron Age.


I think that only applies if the incoming load is gradual. Any sane elastic configuration has some timeouts and a measurement period meant to prevent unwanted scale-up just because of transient load, and during that time you can get hit hard enough to degrade/take down your service before your additional capacity has come online.

It makes sense to get out in front of known massive load events before they hit your service. If I'm launching a new service that I expect to hit the front page of HN, I'm spinning up capacity first and asking questions later. A couple hours of running large instances or extra containers costs much less than potential lost sales from users getting timeouts.


> Well, how do you indicate to it that you want to create a database? Does the database read your mind too?

I'm not sure what you mean. You still instantiate resources with serverless products. With an AWS Lambda function, you go to the Lambda web console, click "Create a function," and type in the code for that function (of course this can also be done via the AWS API). There's no mind-reading going on. For AWS Aurora, you still go to the RDS web console, click "Create a database," choose Aurora as the database engine, etc.


I was joking, with that particular sentence. My point was that there is someone managing the server irrespective, and that there's not a particularly clear metaphysical distinction between 'create my database in this way' and 'instruct someone else to create my database in this way'. Not in the computing world, where everything's already under 17 layers of abstraction.


Maybe the article would be better titled "Aspirations for a Future Database".


Yeah, I'd agree with that. But even then, it would be more valuable if it committed to at least some meaningful, non-platitudinous positions. Something with which at least one person on the planet might disagree. Not "it would be nice if databases were fast, scalable, reliable, personally tailored to everyone on the planet, ...".

For example, my predictions for future databases would be:

- They will take over more functions of the average backend codebase: instead of database users, they will have a concept of application users, along with their privileges, and simple CRUD logic will be executed by the database.

- Horizontal scaling will be less important than we currently think. Consensus will be handled at a lower level, by networked filesystems or storage engines. (Zookeeper, in the Java world, is a proto-example of what I mean.) This is one instance of the trend that...

- Databases will be 'unbundled' (Kleppmann's term). Many of the dull uniform bits will be shared rather than reimplemented. This will happen either through libraries or - more likely and preferably - through separate pieces of software, implementing an interface, which the user will compose. (Rebundling will occur for users who just want a click-and-tick experience.)

- Self-optimising will - I agree with the article here - widen in scope. Users won't have to perform housekeeping tasks like creating indices on commonly-queried fields.

- Databases won't target a filesystem but block storage. This will accompany a convergence of disk (NVMe) and RAM (NVRAM) towards persistent random-access storage of state. Databases, along with applications, won't think in terms of a rigid distinction between "what's in my process's memory" and "what do I have to expressly commit to the disk with a syscall". Tuple spaces are a precursor.
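The last prediction, erasing the line between "in my process's memory" and "committed to disk", can already be approximated with memory-mapped files, which is roughly how persistent-memory designs are programmed today. A minimal sketch (the file name is invented; a real NVRAM region would skip the file entirely):

```python
import mmap
import os

PATH = "state.bin"  # stand-in for a persistent memory region
SIZE = 4096

# Create a fixed-size backing file once.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(b"\x00" * SIZE)

with open(PATH, "r+b") as f:
    mem = mmap.mmap(f.fileno(), SIZE)
    # An ordinary in-memory write: no explicit write() in the logic...
    mem[0:5] = b"hello"
    # ...yet the state survives the process, because the pages are
    # file-backed; flush() makes the write durable immediately.
    mem.flush()
    mem.close()

# A "restarted" process sees the state without any load/deserialize step.
with open(PATH, "rb") as f:
    restored = f.read(5)
os.remove(PATH)  # clean up the demo file
```

Tuple spaces and NVRAM programming models generalise this: the program mutates durable state directly, and "commit" becomes a memory-ordering concern rather than a syscall.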


Agreed. And I like your list a lot better, thanks!


What I struggle with is that half the databases that have been built over the last decade would have claimed these same principles. It isn't so much that these principles are not novel or unique that gives me pause, but the lack of acknowledgement of how many expertly designed databases failed to deliver on them. These principles contain no insights into why anyone should expect this particular database to succeed where similarly qualified people have failed. At a minimum, these principles smuggle in the assumption that several Hard Problems, both theoretical and practical, have been addressed (which is possible but not in evidence).

As another way of framing it, these principles seem to be making the unstated assumption that the Future Database only supports a narrow set of data models and workloads. Which would be considerably less interesting than actually reimagining core database architecture and solving hard problems.


If you're truly forward-thinking you'd start planning for multiplanetary databases. Get a leg up on the competition by the time humans reach Mars.

The core issues of planetary database systems have mostly been solved already. What about database systems which have to deal with time dilation due to running on spacecraft that travel through space at different velocities?

But in all seriousness, one of the core features missing in modern database systems is a way to smoothly handle tracking and deploying changes. Why are online schema changes so difficult?


Humor aside, many years ago I had a 1:1 discussion with Vint Cerf (2003?) about interplanetary networks. He was thinking deeply on this topic, and I had some interesting and relevant input.

His focus was primarily on store-and-forward routing, and the implications that come with long-latency hops. Military & battlefield networks, where nodes are subject to active interference or simply cease to exist, turned out to be a closely overlapping topic.

A quick Google search shows he's still working on this.


Going further, I'd love to have a declarative schema for my database with some kind of hook layer to allow for data transformations between state. Or figure out how to eliminate the need to carry forward hundreds of schema migrations, such as an easier way to squash them periodically.
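The declarative approach above amounts to diffing the desired schema against the live one and generating the migration, rather than hand-writing it (table and column names here are invented; a real tool would also handle types, constraints, and the transformation hooks mentioned above):

```python
def diff_schema(current: dict, desired: dict) -> list:
    """Emit DDL moving `current` towards `desired`.

    Schemas are modelled as {table: {column: sql_type}} -- a toy
    model that ignores constraints, indexes, and type changes."""
    ddl = []
    for table, cols in desired.items():
        if table not in current:
            body = ", ".join(f"{c} {t}" for c, t in cols.items())
            ddl.append(f"CREATE TABLE {table} ({body})")
            continue
        for col, sql_type in cols.items():
            if col not in current[table]:
                ddl.append(f"ALTER TABLE {table} ADD COLUMN {col} {sql_type}")
    for table in current:
        if table not in desired:
            ddl.append(f"DROP TABLE {table}")
    return ddl

current = {"users": {"id": "BIGINT", "name": "TEXT"}}
desired = {"users": {"id": "BIGINT", "name": "TEXT", "email": "TEXT"},
           "orders": {"id": "BIGINT", "user_id": "BIGINT"}}
statements = diff_schema(current, desired)
# -> one ADD COLUMN for users, one CREATE TABLE for orders.
```

Squashing hundreds of migrations then becomes trivial by construction: regenerate the diff from an empty schema and you have the single squashed migration.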


> What about database systems which have to deal with time dilation due to running on spacecraft that travel through space at different velocities?

I think they already handle this case by accident, because distributed clocks are unreliable and inconsistent anyways.


The Big Players put atomic clocks in the datacenter, and provide multiple mechanisms of time synchronization. Google Spanner was the first to do this, back in... 2007?

https://www.theverge.com/2012/11/26/3692392/google-spanner-a... https://cloud.google.com/spanner/docs/true-time-external-con...

Now, I don't think they worry overmuch about general or special relativity at this point, but if we start putting datacenters in orbit that will become an actual issue.
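Spanner's TrueTime trick fits in a few lines: the clock API returns an uncertainty interval rather than a point, and a transaction "commit waits" until its chosen timestamp is guaranteed to be in the past on every node, so timestamp order matches real-time order. A toy model (the epsilon is invented; Spanner's is on the order of milliseconds):

```python
import time

EPSILON = 0.005  # assumed clock-uncertainty bound, in seconds

def now_interval():
    """TrueTime-style API: true time lies somewhere in [earliest, latest]."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit():
    """Pick a commit timestamp, then wait out the uncertainty.

    After the wait, every node's clock -- however skewed within
    EPSILON -- reads later than `ts`, so any subsequent commit is
    guaranteed a strictly larger timestamp."""
    _, latest = now_interval()
    ts = latest  # the latest the true time could currently be
    while now_interval()[0] <= ts:  # commit wait (~2 * EPSILON)
        time.sleep(EPSILON / 5)
    return ts

t1 = commit()
t2 = commit()
# t2 > t1 holds even if the two commits ran on different machines.
```

The cost is that every commit pays the uncertainty window as latency, which is exactly why Google invests in atomic clocks: shrinking epsilon shrinks the wait.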

I'll wager we'll start seeing Edge Compute and CDN features on the very large satellite constellations (Starlink, etc) soon. Initially this will preserve bandwidth between satellites and with the ground stations. This will be a big upsell feature. The Quant folks would pay more for those sweet low-latency links...


The biggest missing piece for future databases is the ability to guarantee you only store your data in Europe. All of these databases come from the US, with no self-hosting option and no EU region for their SaaS service. Under Schrems II, the EU region should be operated by a European company detached from the US company.


CockroachDB with an enterprise license supports pinning data locality even at the row level - https://www.cockroachlabs.com/blog/regional-by-row/

I think their serverless product also has that option.


Doesn't Upstash allow this?


Great list. If I can add anything here, I will say:

- Bitemporal support OOTB (storage would be more expensive, as temporal data needs more disk space)

- CoW capabilities OOTB, so it would be super easy (fast and cheap) to create ephemeral databases for development purposes.

- Charge per request (ms of reads, ms of writes) - for the sake of being more specific about serverless.

- AI capabilities that detect how the database is used and suggest indexes or other tweaks to make the database as fast (and cheap) as possible, even as the schema changes, the database grows, or query patterns change

- PostgreSQL support (and all its extensions... I know that's a hard one, as PlanetScale is based on MySQL)

- OOTB capabilities for Masking and/or anonymizing of data (PCI, PII, etc)
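The last item, OOTB masking, can be sketched as a read-path transform keyed on column tags (the column names, tags, and masking rules are invented; a real implementation would live inside the engine so raw values never leave it):

```python
def mask(value: str, kind: str) -> str:
    """Redact a value according to its sensitivity tag."""
    if kind == "pii_email":
        local, _, domain = value.partition("@")
        return local[0] + "***@" + domain
    if kind == "pci_card":
        return "**** **** **** " + value[-4:]
    return value  # untagged columns pass through unchanged

# In a real system the tags would come from the schema, not the app.
TAGS = {"email": "pii_email", "card": "pci_card"}

def masked_row(row: dict) -> dict:
    """Apply per-column masking to one result row."""
    return {col: mask(val, TAGS.get(col, "")) for col, val in row.items()}

row = masked_row({"email": "alice@example.com",
                  "card": "4111111111111111",
                  "city": "Lisbon"})
# -> {'email': 'a***@example.com', 'card': '**** **** **** 1111',
#     'city': 'Lisbon'}
```

Tagging in the schema rather than the application is the point: every client, including ad-hoc queries, gets the same PCI/PII policy for free.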

Thanks


My dream was having a database like AWS DynamoDB, but with analytical capabilities built into the API.


I think you are talking about HTAP (https://en.wikipedia.org/wiki/Hybrid_transactional/analytica...). There are already decent products on the market. But don't expect 100% analytical capabilities like BigQuery / Snowflake. There are tradeoffs.

If your workload fits the model well (mostly OLTP, but occasionally some with heavier aggregations, etc.), it would be awesome.

Example: TiDB

https://docs.pingcap.com/tidb/dev/explore-htap

https://www.vldb.org/pvldb/vol13/p3072-huang.pdf


When it comes to HTAP workloads, SingleStoreDB tackled that years ago and has been the leader in that category of workloads. In recent years, the product has expanded to handle broader workloads and use cases. https://www.singlestore.com/resources/research-paper-cloud-n...


Doing application programming in Dynamodb is a nightmare.


I personally found DynamoDB enjoyable, but the trick is understanding that DynamoDB gives you exactly no way to do analytics, so you have to have a separate and robust analytics story, e.g. we periodically dump our DynamoDB tables into Redshift and get the best of both worlds. I find it enjoyable to program against, and had a better experience than when everything was Postgres.


Same experience.

From an OLTP standpoint, the limitations to query the data are actually amazing, because it forces you to think about the right questions from the beginning.

On more flexible DBMS like Postgres, it's very very easy to shoot yourself in the foot.

I find it easier to understand and take into consideration the DynamoDB constraints (explicit) than SQL shortcomings (implicit).


> because it forces you to think about the right questions from the beginning.

For many apps thinking that you can have the right questions from the beginning is delusional.


> For many apps thinking that you can have the right questions from the beginning is delusional.

Strong disagree, I think you might be approaching this from a traditional angle where analytics questions are also questions of the system. In DynamoDB you only focus on access patterns you need for system to work real-time, without bothering with analytics (as stated above). Any question you want to ask on data outside of the operational code is out of the scope.


Here is the database of your dreams: SingleStoreDB. https://www.singlestore.com/customers/fathom/


been pairing DynamoDB with rockset to solve this


Try SingleStoreDB for the combined DynamoDB workloads and analytical workloads in a single database tech, see https://usefathom.com/blog/ditched-dynamodb


Yeah, I do pairing with analytical DBMS.

The issue is the pairing.


A bunch of the principles here are specific to using a managed 3rd-party database, rather than self-hosting / on-prem.

Meanwhile... lots of apocalyptic posts lately about how VC funding will dry up since public tech stocks have crashed.

So how safe is it to have your business completely rely on a managed-database startup, instead of self-hosting or using a major cloud provider's direct offering? May be unwise these days, methinks.



