This is great news. Those weird little cases where you need to either defer the computation until read time or precompute and store it yourself have always felt warty and ripe for errors. Really glad to see this addition.
EDIT: an example from this week: we have a JSON blob full of stuff and we want to pluck out a specific field to search on. You need to jump through casting hoops to hoist it out as an integer when you query. It's doable, but the onus is on everyone who queries to understand this.
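This is exactly the sort of thing generated columns tidy up. A minimal sketch, assuming a hypothetical events table with a jsonb payload column that carries a user_id field:

    ALTER TABLE events
      ADD COLUMN user_id int
      GENERATED ALWAYS AS ((payload ->> 'user_id')::int) STORED;

    -- the casting hoops now live in exactly one place, and the column is indexable:
    CREATE INDEX events_user_id_idx ON events (user_id);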
At least in PG, a regular view must calculate everything at read time, which might mean a lot of duplicated calculations, and a materialized view has to be refreshed "manually", it doesn't auto-update piecemeal when the underlying data changes.
I've had pretty good luck with using table triggers to update materialized views and make everything "automatic". A little more work up front, but pretty easy to forget about it once it's in place.
A REFRESH of a materialized view still requires full re-computation of all values, no? It's better than regular views if you have more reads than writes, but still quite wasteful.
You are right. I've been dealing with this personally with PostGIS and storing large geometries. Ideally you split the geometries up into many smaller geometries (using a built-in function), and it's much faster to calculate intersections, contained lengths, etc. using those instead.
Many places online recommend storing this collection of divided geometries as a materialized view, but I recently had to move it to a separate real table because inserting a single new record would take 15 minutes to update the view (on an RDS 4xlarge database). It could at least update concurrently, so other reads didn't block, but now that the divided geometries are stored in a separate table I can usually add new records in under 5 seconds.
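(For reference, the non-blocking refresh mentioned above is the CONCURRENTLY variant, which requires a unique index on the materialized view; split_geometries here is a stand-in name:)

    REFRESH MATERIALIZED VIEW CONCURRENTLY split_geometries;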
Do you mean an application-layer function, or a DB one? And if a DB one, you also need to create an index to go with it. And then there are dragons that you don't need in life. If an application one, then people need to know to call it, so, yeah.
I know there are things you can do, but having better tools available is always a win (especially when they remove later maintenance from the equation).
2) Therefore, if you're instead interested in returning the generated columns' values, this feature would be useful in proportion to how expensive the expression you're using is, because you're saving time by precomputing the column rather than computing at query time.
Edit: I can also see the benefit of removing the burden on the person performing the query to have to remember the details of the expression, or in the server case, not having to duplicate the expression across code bases.
I think your edit is why I am excited about this feature. There are so many times when I just need some relatively simple text formatting, such as titlecase, but want to store the original text too. Titlecase isn't hard to do or expensive, but I only have to do it once with a simple SQL expression. Then client code can decide whether they SELECT the formatted or original text. This is especially helpful if I don't already have an ETL pipeline that includes a cleaning/formatting step, and I just need 1-2 columns to be formatted.
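For instance, something like this (a sketch with hypothetical names, using Postgres's built-in initcap() for titlecasing):

    ALTER TABLE articles
      ADD COLUMN title_formatted text
      GENERATED ALWAYS AS (initcap(title_raw)) STORED;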
> If you're only using the generated column for filtering/sorting rows, you'd be better off using indexes on expressions?
Good point!
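For anyone unfamiliar, an expression index looks like this (a sketch, reusing the hypothetical jsonb example from upthread):

    CREATE INDEX events_user_id_expr_idx
      ON events (((payload ->> 'user_id')::int));

    -- a filter written with the same expression can use the index:
    SELECT * FROM events WHERE (payload ->> 'user_id')::int = 42;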
> I can also see the benefit of removing the burden on the person performing the query to have to remember the details of the expression, or in the server case, not having to duplicate the expression across code bases.
You can do that also with a database view or function, which is my preference.
I prefer my tables to be fully normalized. Any computation or processing, I try to keep in views. This just helps my mind. Tables = hard data. Views = processed data. But maybe I'm just set in my ways. Calculated columns are in the SQL standard, after all, and have been implemented in other databases for some time. In special cases (heavy calculation + heavy reads) of course this feature makes a bit more sense.
Computed columns essentially make the base table into a transparent materialized view over the (noncomputed) "real" base table. But it's closer to an ideal materialized view than actual Postgres materialized views, because it refreshes itself as needed. Conceptually, though, it fills the same role as a matview.
I think your approach makes more sense when thinking about databases directly. If you're using an ORM, you can use a view but it's kind of awkward. I see this as being more useful in that ORM universe--it just ends up being a read-only field on your model.
> If you're using an ORM, you can use a view but it's kind of awkward
If it's awkward to use a view in an ORM, it's a bad ORM. Your ORM shouldn't care if a relvar is a table or a view. (It obviously might care if it's updatable or not, but updatable views, both automatically updatable and updatable via specific trigger programming, are a common thing, as are read-only base tables.)
I dunno, I find using relations I'm not supposed to modify pretty odd in an ORM. And YMMV, but I've never seen an updatable view in the wild. I know it's doable, particularly in Postgres, but it seems like something capital-S Surprising to...probably most folks I've ever worked with?
Can you tell me why? Not only can you select from a view, but at least in Postgres you can also insert, update, and delete against one too.
I don't use an ORM. But however you specify a table name in the application code, I imagine just specifying the name of a view instead. The ORM selects, inserts, updates, or deletes with the view as the target, instead of a table. It would not know it was not a table.
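For the curious, simple single-table views have been auto-updatable in Postgres since 9.3, no triggers or rules required. A sketch:

    CREATE TABLE person (id serial PRIMARY KEY, name text NOT NULL);
    CREATE VIEW person_v AS SELECT id, name FROM person;

    -- both of these write through to the underlying table:
    INSERT INTO person_v (name) VALUES ('Ada');
    UPDATE person_v SET name = 'Grace' WHERE id = 1;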
I tend to follow the Postgres releases and I am always impressed by the cool things SQL databases can do. But for programmers like me who are used to coding in C++/C#, the transition from these languages to SQL is harsh. Especially if you don't have to do SQL daily, it's really hard to remember the syntax and read complex SQL code. The transition from SQL results to typed languages is also tedious. ORMs help a little, but they have their own set of problems.
Not sure what I am trying to say other than that I wish it was easier for regular programmers to use advanced SQL features...
I think you need a visual model that works for you to approach it since they're entirely different beasts. You'll hit a wall very soon if you approach learning SQL in terms of C++/C#.
CSS is a similar phenomenon, you can't approach it as code. The box model is a nice way to get the basics and manipulate the DOM.
Similarly, I think in terms of tables (rows and columns) to visualize SQL operations that I need to do. Every operation is like a matrix operation, so you need to stop thinking in terms of for/loops iterators and think of matrix/table wide operators. Once you've got the basics, then you can look at stuff like window functions to do more advanced operations.
Takes a bit of working with to get used to. That's just how I think about it - you'll need to find a metaphor that works for you!
I sometimes mention declarative vs imperative when talking about software development, especially to people who are coming from a HTML/CSS background. I'll give some details below in case it is useful.
Declarative languages encompass things like HTML, CSS, and SQL. Here we say what should happen, but not specifically how it should happen.
Imperative languages are C, Python, Java, JS etc. Here we give the precise steps which describe how something should work, and there is an absence of describing what the problem actually is.
Declarative languages give us a lot of power to solve a very narrow range of tasks.
Imperative languages give us comparatively less power, but allow us to tackle a very wide range of tasks.
For example: We could certainly write a Python program to sort, filter, group, and join two lists of data. However, the equivalent SQL code would be much shorter and – all things being equal – much more readable.
Conversely, SQL would be a non-starter for writing a web service or an image processing library. SQL does one thing really well, and that thing is data wrangling.
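To make that concrete, here's the sort of thing SQL does in a handful of declarative lines (hypothetical orders/customers tables):

    SELECT c.name, sum(o.total) AS revenue
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  o.placed_at >= date '2019-01-01'
    GROUP  BY c.name
    ORDER  BY revenue DESC;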
I found this book to be very helpful, I'm not done yet but so far I'm really enjoying it.
"Perfectly intelligent programmers often struggle when forced to work with SQL. Why? Joe Celko believes the problem lies with their procedural programming mindset, which keeps them from taking full advantage of the power of declarative languages. The result is overly complex and inefficient code, not to mention lost productivity."
I think it’s not only the procedural vs declarative difference but also just the plain syntax. To me SQL is just hard on the eyes. It feels a little like FORTRAN in the good old days.
I find that it helps differentiate the different components of the query. Same way we use all-caps for constants, etc. If it's lower-case, or camel-case, then it just sort of "melts" into the rest of the query.
The tooling around sql is pretty awful. I was looking for a vscode plugin to get some syntax highlighting and linting for postgres yesterday, and just gave up. There are plenty of administrative tools that help with connecting to dbs and showing the results in a nice table, but nothing that really helps with writing it (or at least nothing of high quality that I could find).
https://www.jetbrains.com/datagrip/ and Postico are pretty great for PG I think. I'm using them every day and having the option to attach DataGrip to a directory of queries in your code and directly executing them from there with set variables is pretty sweet.
It does help you write queries, though, with auto-formatting and good auto-completion.
I actually like both conventions a lot, as a developer who has spent hours fixing/optimizing large queries. As someone mentioned in another comment, good syntax highlighting is often not available for SQL (if at all), so having the keywords stand out is very helpful. When fixing and improving things I mostly care about 10% of the query, the JOINs and the WHERE clause, and all-caps keywords make those much easier to grasp. And the comma at the start of the line is very helpful (even though it looks odd to me) when you are adding/removing columns and sections on the fly: if the comma is at the start of the line you can simply comment the line out, whereas if it's at the end you need to edit two lines to remove a particular line from the query.
I've read one of Celko's books (not that one), and I confess I didn't find it very helpful. And hearing "think in sets" is the worst.
In all the algebra classes I've taken, one nice thing about sets is you can arrange them in any order you like, and you'll get the same answer any way you slice it. There's no inherently correct orientation of mathematics. That's kind of the point. It's fully generic, by default. "x" can mean anything (at least, in the algebraic structure we've assumed).
With SQL, the experts drone "think in sets!" as a way to mean "you happened to pick a different order than me, so your query will run 1000 times slower". Well, mine is still sets. I'm not writing a for-loop here (and I'm pretty sure my SQL dialect has those by now).
I've actually had the most luck with SQL by thinking about it in terms of looping first. Figure out the most efficient loop over your biggest table, and then write a query in a way that makes it easy for the optimizer to loop over that.
Then again, nobody ever accused me of being "perfectly intelligent".
For me, it's because I've never studied my database the way I do my programming language. I'll read a thousand pages to master a new language, and I'll take it all to heart. Meanwhile, I begrudgingly read a few pages about PostgreSQL (let's say) and move on to the next task ASAP.
PostgreSQL has great documentation, and I will give it a serious read one day.
Spending a few professional years using and seriously learning PostgreSQL has paid off like crazy for me. It's a great server, doing my own devops for it is not much of a burden, and it doesn't hurt to have it splashed all over my resume.
You’re basically experiencing impedance mismatch first hand ;-)
The term is usually applied to mapping from object-oriented to relational systems and vice versa but it’s perhaps more of or also a mismatch between imperative and declarative programming styles.
I think we just have to accept that these two are different and that each is great at what it does instead of trying to shoehorn a relational approach into object-oriented and imperative ones.
SQL is a bit verbose at times, but it's not much different from going to other programming languages. In fact I'd say it's easier given that it's so verbose.
I have no idea what all those modifier symbols do in Rust, say, but I find "case when InvoiceNo is null then" or "select Name, list(distinct Title) as Titles from" pretty transparent in comparison.
The biggest mental difference I think is that you're dealing with a different data model. You don't have a bunch of individual entities floating around. You got rows in tables, which you mostly process in bulk. As such it's more like an Excel spreadsheet.
What specifically do you find verbose? The only thing I find redundant (based on the queries I handle) is that theoretically join arguments could be deduced from foreign keys (maybe extending this behavior, having joins automatically deduced).
I also wonder if the perceived verbosity is caused by some programmers cramming logic into the queries, rather than keeping them simple and processing the results in the client language.
It's verbose compared to a lot of other languages simply by using words rather than symbols or abbreviations for most things. Though I tried to convey that I did not think this was a bad thing.
The only case where things get really verbose for me is when I need to do subqueries for columns. Often I might need to run basically the same subquery but return aggregates over different fields, i.e. sum of net weight, sum of gross weight, sum of value. For performance reasons I can't write that as a join, at least on our SQL Server.
I agree there should be a shortcut syntax for joining using foreign keys ("join X on foreign key"?) though.
Having done both SQL and C# for a very long time I prefer C# with Linq. Linq gives you nearly all the declarative power of SQL with all the procedural goodness of C# to get things done.
What's needed is a really good object database for C# to eliminate the impedance mismatch.
Don't get me wrong, SQL is great, but to be useful it needs to be mixed with a procedural language (PL/SQL, T-SQL, etc.); I would just prefer using a more capable typed procedural language instead.
There's a huge list of caveats in Postgres's inheritance documentation [1], and even the obvious issues haven't been touched in decades. I've never heard of anyone using this feature.
At this point, I would say Postgres itself is a tool that doesn't support Postgres inheritance.
Yeah, but it doesn't support .NET types directly; everything must be translated between the two type systems. It also doesn't support executing .NET code in the PG process. These are the main mismatches that ORMs try to hide, with significant overhead.
The impedance mismatch is not about having to translate, but about being unable to translate. And yes, that problem exists with a relational-only database, but not with PG, where you can create composite types, have arrays of composite types as a column, and all the other goodies which come with that.
I personally do not find it useful to execute .NET code in PG, but technically you could do that.
As long as you can translate LINQ expressions to PG (and you can translate most of them with Revenj), you can avoid the pitfalls of mainstream ORMs.
Not sure I have seen a type or construct that could not be translated one way or another. The mismatch is the need for translation: table per hierarchy vs. table per subclass, references vs. foreign keys, instance identity vs. primary key.
The ability to run procedural code on the db server allows for much better performance than sending data across the network to the app server for processing. Being able to share procedural code between the app server and the db lets you choose where to run it based on the best place for it, rather than on whether it would have to be rewritten to move from one to the other. I suppose a plugin could be made for .NET in PG, similar to plV8; SQL Server has had .NET stored procedures for some time, with many issues and caveats, so the feature doesn't get much use.
The problem with LINQ (and also what makes it so nice to use) is the ability to mix procedural code into the LINQ statement, e.g. list.Where(x => x.SomeMethod()). It's very easy to write something that can't be translated and then starts streaming the entire intermediate result to the app server just to process SomeMethod, which would be much more performant in the db tier, local to the data.
Right, or Entity Framework, but those are just ORMs, which just hide the impedance mismatch.
I am thinking more along the lines of an embedded db like SQLite, but one that does native serialization, indexing, and query optimization against .NET objects and types.
Then a "database server" is really just an app server running your C#. If you really wanted to make it like PG, you would have an app server that accepts code snippets, then compiles, runs, and returns results. Of course you would need proper sandboxing, like .NET Fiddle. You could also do something in between, like serializing LINQ expression trees to send them to the server for processing.
In an ideal world where your code runs (client, app server, db server) is a choice based on locality needs not forced by runtimes (javascript in the browser, .Net on the app server, pg/Sql on the db server).
You can pretty much get close to this now with javascript/node/plV8, but again I prefer C#/.Net.
This is basically what most RDBMSes have been offering for the last 20 years. Oracle's RDBMS is in fact an application server, and best practice is to have an app layer built from PL/SQL packages through which you access all the data.
More than 20 years ago I actually used it as a true application server, where the database application layer would even generate all the HTML code required to display your app's UI.
And with database schemas you have proper sandboxing.
I'm with you on this one. I was thinking it would be cool to "bake in" the DB layer of an application and have it run in-process. Then you'd just provide it with block storage and it would handle the rest without having to have a separate server. You could even do compile-time migration and index generation based on your queries. With a distributed actor framework like Akka.NET or Orleans, it could probably be made to scale too.
The only downside is that the only interface to your data is through the application, but there's certainly a use-case for something like that.
Well, you don't need to worry about creation and migrations. You only need to tell the store which objects it needs to know about. You declare everything in C#.
It stores everything as jsonb, so you don't need to worry about weird relationships; you only worry about your root aggregate. You can index it, apply full-text search, and add persisted columns in the map for convenience or speed.
So yeah, it maps the LINQ to the equivalent SQL for you, but it lets you work with your object and worry less about persistence, because you can add and remove properties. It's the closest I've come to being able to forget about the database and focus on code.
I guess unless you write your own database from the ground up, so that it supports the language as a feature without dropping back to something else, it won't happen.
Again, that's great: it's an ORM, and that's what they do. They try to abstract the DB away from your language and do their best to hide the impedance mismatch.
However, the mismatch is always there: the overhead of mapping types, inefficient serialization (key names repeated over and over in JSON vs. a real schema), statements that cannot be transpiled to SQL so you hit pathological cases at unexpected times, the inability to run any statement you want in the db tier to take advantage of locality, and so on.
Most everyone wants an object database, since most people are working with objects, which is why ORMs are so popular.
It's jsonb, so you can't repeat key names. Serialization is whatever you want to use for JSON serialization. Other than some report-type queries I wanted to write, I haven't found anything that I couldn't do in LINQ.
jsonb doesn't do key interning as far as I know. Every field name will be repeated in every record, vs. a normal column, where the column name appears only once in the db.
This really adds up on a large database. Mongo used to recommend short key names, and I think some clients would even translate them based on some definition to try to save disk/memory/IO.
Also, it's not that LINQ can't do what you want; it's that if you do certain things that can't be translated to SQL, the whole intermediate result will be retrieved from the db and deserialized, and then the LINQ statement will be run in memory on the client. It's like putting ToList in the middle of your LINQ statement where maybe a few million rows might need to be processed to give a very small result.
If the LINQ statement were run in-process on the db server, at least it would happen local to the data and would be more akin to a table scan vs. an index: still not great, but much better than sending all the results to the client for processing.
There's definitely a language/paradigm switching cost if you jump back and forth too much.
What I found helpful was to prototype all the migrations and queries (of whatever feature or user flow you're building) in a .sql file first with mocked values. You'll be surprised by how much you can do if you're simply forced to write pure sql. Once you've reached the limit of what you can do you can move the queries back into your app to wire the rest up. This is usually not much work if you use raw queries instead of ORMs. You usually end up writing a lot more efficient apps this way!
ORMs tend to lock one into a particular architecture. SQL is more standardized across platforms. It's my opinion that ORMs should assist with SQL generation or usage rather than outright hide it. An ORM's job should be to reduce the grunt-work of preparing typical SQL, not to wrap it fully so you can ignore it.
You can take your SQL knowledge to another platform, but less so with, say, Entity Framework or LINQ. Don't get locked in. Plus, if the ORM doesn't do your query correctly or as intended, someone will need to understand SQL anyhow to debug it.
I think it aged insanely well. DBs are often a bottleneck and need to be heavily optimized, so you need to spend some time explaining to your database what it is that you really need, but in general SQL is all about telling it what you want and not worrying about how it gets done. That's how programming languages should look.
We have so many ORMs, and most often they look uglier to me than pure SQL, especially when it comes to more complex queries.
You could write a translator (something like coffeescript -> JS but X -> SQL), but can you really come up with a syntax that's enough of an improvement that it would be worth using it?
I would agree. SQL is absolutely beautiful in its simplicity, imo. I learned SQL alongside my first imperative programming language, so I guess I'm significantly biased, but I think it's the easiest language I've ever learned. The best part is, if you spend enough time writing SQL, you end up learning quite a bit about how databases, data structures, and the time complexity of query execution all work. Which is all incredibly valuable for just about any programmer.
I also love ORMs. But perhaps my knowledge of SQL has enhanced my ability to use them well. Without them you end up with numerous other problems which I'd consider worse than the occasional compromises ORMs require. Ideally you want all your business logic in one place, or at least in discrete units, so the fragmentation of stored procedures is a major downside. Embedding SQL directly into other business logic is like an uglier version of an FFI, and serializing SQL statements is even more of a security concern than marshalling FFIs (something that can already be a bit of a footgun).
I’d recommend any engineer to learn SQL, but for people who use ORMs a lot already, I’d suggest to start with understanding explain plans first.
Or is it the other way around? C++/C# concepts are insanely powerful, but they feel to me like we're trying really hard to drag the Simula data model into the 21st century.
Today if you asked me to design an object model for software to model objects, I wouldn't end up with anything remotely like C++/C# has.
> I wish it was easier for regular programmers to use advanced SQL features...
It's a fundamentally different way of thinking.
With imperative programming, the mindset is: "I have this certain outcome in mind. What is the sequence of steps I need to take to make it happen?"
In a relational database, the mindset you need to be in is: "How can I define the relational structure of my world (i.e. my data) so that I can get the outcomes I want - and end up saving a great deal of imperative programming effort in the longer run?"
Quite different. And yes, it takes a while to get used to.
The first database I used professionally was called UniVerse[0], currently owned by Rocket Software. It's a Pick-style, non-relational database.
In the data dictionary for a file you could create I-descriptors, which were computed columns much like this feature allows. The difference is that I-descriptors were always calculated on the fly and they could do a LOT more than PostgreSQL's generated columns.
These were commonly used to accomplish things that SQL would use a JOIN to do, mainly because the query language didn't have joins. An INVOICES file, for example, would have fields like CUSTOMER.NAME, CUSTOMER.ADDRESS, etc, (note that the '.' is just another character in the field name, it doesn't actually mean anything to the database) which would pull the relevant information from the customer file or call a subroutine to find the relevant information (e.g., in a history file).
The results of the I-descriptor don't have to be stable - they can be calculated based on the current date/time, random numbers, data in other files, etc. This leads to some interesting possibilities that I don't think PostgreSQL's implementation can touch. It also leads to some interesting gotchas.
I don't have a good reference handy for UniVerse's I-descriptors, but the System Description document[1] has a section on it.
It had a certain elegance that I miss in modern SQL databases. On the flip side, modern SQL databases are so much more powerful.
> The difference is that I-descriptors were always calculated on the fly and they could do a LOT more than PostgreSQL's generated columns.
So, just like columns in a view.
> These were commonly used to accomplish things that SQL would use a JOIN to do, mainly because the query language didn't have joins.
Yeah, in SQL you could do that with a correlated subquery in a column definition in a view, instead of a join, but it's generally not optimal (I've seen people do it, though not recently). There are other uses for correlated subqueries besides being the inefficient way to do joins, and they, plus normal functions, let you calculate just about anything you might want in a column in a view.
> The results of the I-descriptor don't have to be stable - they can be calculated based on the current date/time, random numbers, data in other files, etc. This leads to some interesting possibilities that I don't think PostgreSQL's implementation can touch.
The calculated-columns feature can't, since it supports only immutable functions (the values are materialized, which wouldn't make sense otherwise). On-the-fly calculation makes reads expensive, though. If you need that, Postgres has supported it in views since (approximately) forever, and views are the classic SQL approach to the problem.
However, I-Descriptors have the power of the underlying UniVerse BASIC runtime at their disposal, so you can do a LOT of things. I mention one of these in another comment[0].
However, as I mentioned, doing it this way isn't necessarily a good idea.
Postgres views have the power of the installed procedural-language runtimes behind them; it's pretty common for this to include an unrestricted Python interpreter, among others.
I've also used this product but it's more of an operating environment than just a database. Super convenient on the front end. On the down side, getting data out of it is extremely expensive if you aren't targeting a single record ID. It's also not ACID compliant.
I encountered this PICK stuff a couple of decades ago. It was really enlightening. I began to understand how the features of relational databases (transactions, isolation levels, constraints, normal form) fit together to avoid anomalies and orphans. They had jobs that would search for orphan references etc at night.
Some of these things are making their way into the Pick world.
UniVerse, for example, gained proper transaction support at some point in its lifetime. When I showed the consultant developer, an old-school Pick guy, how that worked he was very impressed.
Granted, it was fairly limited since they somewhat faked it using the regular locking mechanisms and it was only usable within UniVerse BASIC, but it did what it said on the tin. I ended up writing my own simple command interpreter which allowed me to use the UniVerse command prompt and BEGIN/COMMIT/ROLLBACK transactions.
Transactions are definitely foreign to a LOT of Pick code though. I also worked on an ERP system that ran on a Pick-style system - if it crashed during GL posting it was completely possible, and this did happen frequently, that you would get a half-posted entry and an out-of-balance ledger. Had they just used the transactions that the underlying database supported they would not have had that problem.
The equivalent in the SQL world is to use your database connection in autocommit mode, which is pretty rare. Most developers know to use transactions or work with a framework that intelligently uses transactions.
For most things no, but there are some things you can do that are difficult or impossible to do in an SQL view.
I-Descriptors can call subroutines, which opens them up to doing anything that the underlying UniVerse BASIC runtime can do.
I once made an I-Descriptor which called a subroutine to feed the postal code into a web service that returned the geographic location of that postal code. It also implemented caching so it would only call the service for a postal code once. You could simply list out the file and find customers in a particular geographic region just using the query language.
I'm sure this is do-able using something like PL/Python, but I suspect it would be much more difficult to do it all in-database than how I did it in UniVerse.
Whether doing it this way is a good idea is, of course, up for debate. In a modern application I wouldn't dream of doing this on the fly in a view. Just because I could build it the way I did doesn't mean it was a good idea.
I've not studied the documentation in detail, but MySQL looks pretty good here. It allows partitioning on a generated column, and Postgres currently does not.
An alternate way of doing this is to pass the entire current table row to a function which can be done easily: if you have a table "purchase" PG also creates a "purchase" type, so if you have this function:
CREATE FUNCTION vat(a purchase) RETURNS numeric AS 'SELECT a.value * .25' LANGUAGE 'sql'
Then you can run:
SELECT value, vat(purchase.*) FROM purchase;
And so be able to use every purchase column within the SQL function to do your calculation.
There are two interesting shortcuts: You can call just vat(purchase) because the type is an alias for the current table row. That alias is very confusing and this is not recommended (try select table from table!)
There's also a method-like shortcut, which the documentation calls "functional notation":
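Assuming the vat function above, it lets you write the call as if it were a column, equivalent to vat(purchase):

    SELECT value, purchase.vat FROM purchase;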
This isn't a good example, because you wouldn't pre-calculate VAT and store it for a product listing: VAT doesn't apply to all countries, is subject to change, and differs between countries. (Japan just changed its GST yesterday.)
You also wouldn’t want to run a function on something you need to filter against. To give you an example we have the concept of a “deadline” date which is based on the time the record is stored + a period of time which it must be completed by.
Calculating that column in sql before doing a where filter is crazy slow when you’re looking at millions and millions of records. But pre-calculating it and storing it, and then adding an index on top of it, is insanely fast.
This is currently done in code. But if I moved it to a computed column, I could ensure the result is always up to date if the period changes, and avoid code accidentally forgetting to update this value.
There are use cases for computed columns, as there are for functions, and for doing it in code.
This feature in pg12 mainly gives us the ability to index the value which we couldn’t do before.
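A sketch of what that might look like as a PG 12 generated column (hypothetical table; note the expression must be immutable, so this uses timestamp without time zone, since timestamptz + interval is only "stable" and would be rejected):

    CREATE TABLE task (
      id         bigserial PRIMARY KEY,
      created_at timestamp NOT NULL,
      period     interval  NOT NULL,
      -- computed once at write time, so the index below stays useful
      deadline   timestamp GENERATED ALWAYS AS (created_at + period) STORED
    );
    CREATE INDEX task_deadline_idx ON task (deadline);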
I'm not sure if I understand you correctly, but logically the WHERE clause happens before the SELECT clause[1], so it only calculates this value for the rows you're interested in. It is also possible to index functions without generated columns.
This is a workaround for not having computed columns, and you can even use a VIEW to present an interface that looks a lot like a table with computed columns, so it's pretty good, but you don't get persistence (the column will be computed every time it's required) and you don't get indexing.
Another workaround is to create a proper column and then ON INSERT OR UPDATE triggers that force NEW.that_column to have a computed value. This approach gets you persistent computed columns that you can index on.
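A minimal sketch of that trigger approach, with hypothetical names:

    CREATE FUNCTION purchase_set_vat() RETURNS trigger AS $$
    BEGIN
      NEW.vat := NEW.value * 0.25;  -- force the computed value, ignoring client input
      RETURN NEW;
    END $$ LANGUAGE plpgsql;

    CREATE TRIGGER purchase_vat_tg
      BEFORE INSERT OR UPDATE ON purchase
      FOR EACH ROW EXECUTE FUNCTION purchase_set_vat();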
What's surprising is that the new functionality in PG 12 doesn't support non-persistent computed columns, and that the other limitations (which are good) don't quite jibe with the second workaround I mentioned.
But after contemplating it, is this really necessary? I fear putting business logic and meaning into the wrong layer; there is use and there is abuse, and my concern is the latter.
Why do you think placing business logic into the database system is a layering violation?
Just because most developers use their DB as a dumb store doesn't mean it needs to be one. There are also plenty of successful software systems that place the majority of their business logic in the database and use a generic programming language and runtime only for the presentation layer.
If you're comfortable fully exploiting the capabilities of your DB, then the intelligent combination of a relational model with custom data types, constraints, triggers, views and stored procedures can make the DB the perfect place to implement business logic.
I love what you're saying, but running a normal modern development and deployment cycle on code stored in databases is hell.
Until db vendors start taking developer happiness seriously, stored procedures and triggers are a total non starter for any serious work.
I mean, in most databases you can't even rename a column without causing an enormous blocking migration. What? Why can't this happen in the background? Why can't I have column aliases? The very basics of developer happiness aren't covered, let alone the harder bits, like versioning stored procedures, switching schema and code when switching to a new git branch, and so on.
(EDIT: of course there are open source tools that help with all of the above, but they're all swimming upstream, fighting a database that simply can't imagine change, and are usually terribly leaky abstractions as a result)
> I love what you're saying, but running a normal modern development and deployment cycle on code stored in databases is hell.
Can you be more specific? What exactly are you missing?
As long as you put your code and data into separate schemas and follow good technical practices, it shouldn't be too different from other technologies.
> I mean, in most databases you can't even rename a column without causing an enormous blocking migration.
Changing a column name is just a metadata change, so it shouldn't take too long in Postgres.
> The very basics of developer happiness aren't covered, let alone the harder bits, like versioning stored procedures, switching schema and code when switching to a new git branch, and so on.
Does Tomcat or any other application server version your WAR files for you or does it manage git for you?
Also in my experience DB Schemas often (but not always) outlast logic implemented in applications. The more the business assumptions reside in the database, the easier it becomes to rewrite applications on top of it later.
Declarative programming expresses the logic of a computation without describing its control flow.
This new feature in PostgreSQL is a great example of that: generated columns (declarative logic) were introduced to reduce the need for triggers (imperative logic).
In SQL, declarative logic consists of constraints, indexes, views, and prepared statements. They can significantly increase the efficiency and reliability of the entire system. Imperative logic is mostly triggers and stored procedures, both of which can become hard to maintain and scale.
Clearly the feature should not be abused, but calculated columns are great for stuff that's obvious and always true (i.e. independent from applications), e.g. a trade value computed from a trade unit price and a trade quantity.
This way I can select the top N trades for a given key without having to do the computation in the application, or storing redundant information in the DB.
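For instance, something along these lines (a sketch with a hypothetical table):

    CREATE TABLE trade (
      id         bigserial PRIMARY KEY,
      unit_price numeric NOT NULL,
      quantity   numeric NOT NULL,
      -- true by construction, independent of any application
      value      numeric GENERATED ALWAYS AS (unit_price * quantity) STORED
    );

    -- top N trades without redoing the arithmetic in the app:
    SELECT * FROM trade ORDER BY value DESC LIMIT 10;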
Not PostgreSQL, but I saw a start-date column and an age-in-days column in MS SQL Server. The age gets updated daily. It didn't sound right to me. I'm pretty sure I'd have failed my database class in college if I did that. What is different in real life, and why didn't they teach me this in college?
I obviously can't answer for this particular case, but my first thought on why I would consider doing such a thing is if I had an app in which number of days old was something that had to be queried, displayed, and/or used in other functions/queries a massive number of times per day in the course of normal application usage. If the application had low usage, or number of days old was infrequently queried/displayed/used, I wouldn't consider it. The moment I found that significant time & resources were spent calculating the value in normal/regular usage, I'd start looking at ways to reduce that time & resource usage. How to go about it varies, but the win of that value being immediately available without computation could mean a lot to an app/business and its users.
There's no clear separation in databases between data definition and application logic, and this separation is very useful. You can change application logic very easily: just stop the old application and start the new one. You can use multiple application instances to balance load in many cases. You can often roll back a bad application update. There's absolutely no problem using myriads of development tools, from Git to CI systems.
Changing a database schema is a big deal. It might take a lot of time, or it must be done with great caution to keep the database online. It's hard to properly version it, and it's often hard or just impossible to roll back a bad update.
Generally database is state and application is stateless. You can couple it, but decoupling works better.
Incorrect separation of data from logic is a human problem, not a Postgres problem. Put your logic in one schema[1], let's call it code schema and your data in another schema, the data schema. There you have your separation. Now you can:
* Change your application logic very easily.
* Use a transaction to deploy new code! Zero downtime! [2]
* Use read replicas to balance load in many cases.
* Rollback bad application updates
* Test anything [3]
* Use myriads of development tools from Git to CI systems.
* Use row level security, so that every user can only see his own data [4]
* Only allow applications to call your code schema, never let them touch your data directly.
You probably wouldn't use multiple tables or multiple columns either and just have a single table that stores document-like rows... which vaguely reminds me of something.
Foreign key constraints are not business logic, they're part of a sound and sane database architecture (just like basic indices on heavily-queried columns). For an overwhelming majority of use cases, tables should have them; one needs a very good reason why a table shouldn't have them.
The last time I tried to do that, I struggled a lot with error handling. The errors your database gives you for schema violations aren't really user-friendly, so I have to convert them to proper errors ("you must not enter negative amounts"). When using SQLite there didn't seem to be a way to know which column caused the error without parsing the message string. That often meant I had to implement the business logic twice: one check in the code for error handling, and another in the database. That was also often quite inefficient and prone to race conditions (if using improper isolation). For example, if the table has two unique fields, I cannot just check for a unique constraint violation, because the error data structure doesn't tell me the column.
Looking at the Postgres driver it seems easier, but is there any good tutorial on how to do error handling properly for databases?
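Postgres does expose the violated constraint (and, where applicable, the column and table) as separate diagnostics fields on the error, and most drivers surface them. A sketch from PL/pgSQL, with a hypothetical users table:

    DO $$
    DECLARE
      cname text;
    BEGIN
      INSERT INTO users (email) VALUES ('dup@example.com');
    EXCEPTION WHEN unique_violation THEN
      -- the constraint name tells you which unique field was violated
      GET STACKED DIAGNOSTICS cname := CONSTRAINT_NAME;
      RAISE NOTICE 'violated constraint: %', cname;
    END $$;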
All sorts of things are possible but it misses a fantastic opportunity to compartmentalise the data away from the implementation. If it doesn't make sense to compartmentalise data and logic, why compartmentalise anywhere? Do the whole project in one big file. Of all the surprises a project is going to face 'oh, this data is useful for [new thing]' is one of the most likely. And everyone expects to find a boundary drawn there because it is such an obvious place to draw one; so it saves on confusion.
Putting complex business logic in the database is opening up all sorts of interesting new ways for data to be unavailable, corrupted or complicated to access. It is easy to imagine it working out well for simple requirements where there just needs to be something that works.
PostgreSQL is a piece of software that takes data and enforces the relational data model on it. Great idea. But the relational model of data is really only tuned to relational algebra. Put complex logic in there and all that is really being accomplished is now you can't migrate away from PostgreSQL. Relational databases already have great integration with every other programming language in current use.
> All sorts of things are possible but it misses a fantastic opportunity to compartmentalise the data away from the implementation.
That's what schemas are for. You have a schema for your code and a schema for your data. You can redeploy the code schema independently of the data schema and set up permissions so that higher layers can only use objects from the code schema and never touch the data.
> Put complex logic in there and all that is really being accomplished is now you can't migrate away from PostgreSQL.
Say you wrote your code in PHP, now you want to migrate to Ruby or NodeJS. You can't, you have to rewrite everything. How often do you plan to migrate to another database? In my experience this almost never happens in reality, but the layers above come and go.
> You can redeploy the code schema independently of the data schema and set up permissions so that higher layers can only use objects from the code schema and never touch the data.
That all just sounds a touch complicated compared to data in the database, code somewhere else (like git). I'm circling back to the same point in a couple of different ways, but PostgreSQL specialises in the relational model of data. That isn't a great model for code, and there are already better ways to manage code than shoehorning it into the database. It's cool that it's possible, and I'm not saying that someone who does it is making a mistake. But I also don't think they're gaining an advantage, and there is a really easy opportunity to separate out where bugs can occur from where data lives.
> How often do you plan to migrate to another database? In my experience this almost never happens in reality, but the layers above come and go.
If you are using a database for its advanced general purpose programming capabilities? The chances probably start to get more likely.
Databases that are a store of data don't need to change because they already do their one job (disk <-> relational model translation) really well. If they are pressured to do 2 or 3 tasks like business logic then suddenly it is a lot more likely that there will be pressure to swap databases.
If I were using SQLite and someone wants to do fancy triggers then maybe I need to swap to PostgreSQL. Avoiding that sort of decision making is a great reason to seal the data completely away from the code.
> That all just sounds a touch complicated compared to data in database, code somewhere else (like git).
You use Postgres as a deployment target and not as a replacement for git. It's not complicated at all. You even get features like transactional deployments and the ability to prohibit applications from directly touching the data.
> pgSQL specialises in the relational model of data
The relational model is SQL:92; Postgres does much more than that. Postgres has JSON support, recursive CTEs, row-level security, and window functions, where a single OVER in a SELECT clause can do what would otherwise require dozens of lines of procedural code.
> If I were using SQLite and someone wants to do fancy triggers then maybe I need to swap to PostgreSQL.
If you want to put your business logic into an RDBMS, you wouldn't be using an embedded DB anyway, but rather Oracle, Postgres, SQL Server or DB2, which are designed for this type of architecture.
This thinking really intrigues me, as it seems like it fell out of fashion (at least for greenfield projects), but it might make a comeback.
Could you share a few cases where it is better to use a custom type as opposed to a relation? The only things I can think of are generic things like uuid. Would there be a need to create an Employee type vs an Employee relation?
Also, how is the experience of using Python for stored procedures? One reason stored procedures are underused is that PL/pgSQL is unfamiliar to most. If anyone has experience with that, could you share some of your thoughts?
It could just be something more pedestrian like not yet being able to handle database changes well for deployments. Things like blue/green, canary, reverting, etc. It's a bit of work to get that functioning well.
> But after contemplating it, is this really necessary?
No, you could always do the materialized equivalent via triggers, so it's not necessary for correctness. And since this feature is limited to the materialized form, too, it doesn't offer much change other than simpler expression of what is going on (which, to be fair, is a big win.)
> I fear for putting business logic and meanings into the wrong the layer
If you don't have some mechanism for calculated columns, you are forced to put domain logic that belongs in the DB in the app layer, which is especially problematic in the (increasingly out of fashion, apparently) circumstance where there are multiple consumers of a DB, as it promotes inconsistency.
If your database contains objects like "customer ID", "address" and the like, you already have business considerations (if not logic) inside it. In fact, businesses using databases for business stuff is probably 95% of DB use cases.
If it's something that's not application-specific and fits nicely into the query language (like in the example given) it makes sense to have it inside the DB, because for queries involving sorting, limiting, etc. you can do all that stuff inside the DB and return the few values that interest you, instead of transferring all the data and having to write code to do that inside your app.
Say you suddenly don't have data for a column anymore, because the API that you read from has stopped providing it. Now you can add a computed column that uses a heuristic as an interim until you have time to remove all the reads from that column.
Or say that you have an object with a separate display name, and product management decides that the display name now can't be chosen freely anymore, but must be generated from name and title. A computed column can do that trick pretty well.
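That second case is nearly a one-liner with this feature (a sketch with hypothetical columns):

    ALTER TABLE accounts
      ADD COLUMN display_name text
      GENERATED ALWAYS AS (name || ' (' || title || ')') STORED;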
As an analyst, either for reporting or machine learning tasks, it scratches my itch.
I'm often creating wide, flat views with extended attributes. Many columns are nearly identical, for example year as an integer, year as a date type, etc. This is trivial with views, but ...
For fast query performance, I usually materialize those views. But that's a multi-step and often manual process, requiring schedules or triggers. This feature combines both of those worlds, with fewer total database objects. Especially with ALTER TABLE.
It's great for speeding up queries. Optimizations frequently require giving up on absolute purity. We use computed columns a lot for speeding up specific, frequently used queries.
One example is where we need to store a special identifier code. The identifier code is required by law to have a certain format, and part of it is the date the code was generated.
Users frequently want to view items with codes generated on a specific day. For making presentation and reporting easier, as well as significantly faster, we added a computed date column that extracts the date from the code and indexed it.
Keeping an extra field in sync in code would inevitably have led to bugs, especially as we have several different codebases which could update the ID code.
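A sketch of that kind of extraction, assuming a hypothetical code format like 'INV-20191001-00042'. Note that the expression must be immutable, which is why this uses make_date() and integer casts rather than to_date(), which is only "stable":

    ALTER TABLE item
      ADD COLUMN code_date date GENERATED ALWAYS AS (
        make_date(substring(code FROM 5 FOR 4)::int,   -- year
                  substring(code FROM 9 FOR 2)::int,   -- month
                  substring(code FROM 11 FOR 2)::int)  -- day
      ) STORED;
    CREATE INDEX item_code_date_idx ON item (code_date);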
I agree that it should be used sparingly. But for more than 5 years I've wished I had a computed column feature because you are often thrown into code bases that need refactoring and refactoring 100% in 1 commit is not always the safest way to go. Sometimes you want to migrate a codebase in several steps. Computed/virtual columns are an absolutely fantastic feature and it allows refactoring and migrations at your own pace, or as a fallback.
Putting business logic in the database has been a boon to large companies for decades - while the development process can be trickier, it helps ensure that every team is using the same logic (incl version) to access the data.
With PostgreSQL, you can define business logic in everyday languages, including JavaScript (plv8).
You'd hate my work: we shoehorn everything we can into the database, with a nice simple API layer of versioned stored procedures in front. It's easy to test and enforce the business logic, and then it can be easily utilized across web, mobile apps, custom apps, and third parties.
Works great if you're sure you won't ever need to scale out horizontally in a big way. One of my RoR apps connects to a schema of views, with INSTEAD OF triggers calling procedures for updates. Something like 30 normalized tables are rolled up into 5 denormalized views, leaving hardly any ActiveRecord woes and still excellent performance.
Here's one of my current use cases: we have a folder tree table (formerly nested sets, then a closure table, now materialized path), and we cache a materialized version of the path with the folder names. There are some places in the app where you can look up a node by its full path, and this column should conceptually be unique. Obviously an unindexed VARCHAR(2048) is bad to filter on, but MySQL indexes can only cover the first 767 bytes (we use utf8mb4). So we have another column, PathHash CHAR(40) AS (SHA1(Path)) VIRTUAL, and then a unique index on that.
This is indeed very nice for maintaining things such as indices for full text search. Solutions based on triggers never felt easier to maintain because triggers were one step removed from the table itself.
IMHO "computed columns" generally implies columns that are computed on-demand when read, as opposed to this case when they're computed when the row is altered and persisted in the table.
> There are two kinds of generated columns: stored and virtual. A stored generated column is computed when it is written (inserted or updated) and occupies storage as if it were a normal column. A virtual generated column occupies no storage and is computed when it is read. Thus, a virtual generated column is similar to a view and a stored generated column is similar to a materialized view (except that it is always updated automatically).
I've been using this for a while in a hobby project, great feature which makes for neat declarative schemas.
Unfortunately some tooling is choking on this; for instance, DataGrip crashes on introspection. But hey, that's what we can expect for being early adopters. :-)
Looking through the docs I cannot find the limitations on what you can generate (e.g. only simple column * column, or more "advanced" expressions like you can use in an update statement)?
Yes. A generated column can be included in a foreign key and be referenced by a foreign key. Nothing too weird happens. It's basically like having every update and insert specify the column value.
1% of the people that want this feature are surely thrilled. It is a cool feature. The other 99% of us are looking at this like a potential land mine that will show up unsuspectingly when trying to refactor old code, or migrating from postgres.
Generated/computed columns are a fairly common RDBMS feature; MySQL has it since 5.7; SQL Server, Oracle, and DB2 have them. So, Postgres having them makes migrating into Postgres easier and migrating out no more difficult except maybe if you are migrating to SQLite or a nonrelational store.
I doubt 99% of the community is so humorously pessimistic. This functionality can already be replicated in several databases (with triggers and the like). Generated columns make a common pattern simpler and more explicit. Maybe it's not a useful feature for everyone, but it's certainly not the instant tech debt you're implying.
As a developer I always hated this feature in other rdbms. The idea of your app saving a row and retrieving it only to have more data in it.
IMO these types of calcs are better done in your app, where you can at least write a test and assert it's doing the right thing. It also makes it a lot easier to reason about your code if the logic is in the app rather than bits of it stuck in column definitions.
Perhaps there are legitimate uses of this, maybe for DBs that aren't just repositories for apps?
> IMO these types of calcs are better done in your app, where you can at least write a test and assert it's doing the right thing
1) the assertion that code in the DB can’t be tested is a bizarre and unfounded one
2) in any serious organisation there may be dozens of apps in a dozen different languages talking to the DB. Do you seriously propose implementing the same thing in each one, or doing it once in the DB and knowing it’s correct for everyone?
I disagree with the parent comment, but tucking logic away in different parts of your DB does come at a cost. It increases the burden on the engineer who's trying to understand the system. "Read the code, then look at the schema" is no longer enough. Now you also have to know where all of the strongly coupled business logic lives inside the DB. Triggers and views can be especially dangerous in a complex system, and having to keep a mental view of how all these discrete resources work together can lead to all kinds of failure.
Triggers tend to generate all kinds of surprises when you change something in a part of your database and suddenly other seemingly unrelated things begin to change. Views can be hard to understand if you have views that use views, use views, and so on. But I don't see any problems with functions and stored procedures as long as you put them in a separate schema from your tables. A function in SQL shouldn't be more tightly coupled than a function in Ruby or Java.
Why is a trigger any more surprising than any callback style interface? Or using inotify (Linux) or reparse points (Windows)? Triggers are very easily discoverable, they are attached along with their source code to the table!
A view is just a named select statement, that’s all it is.
> Why is a trigger any more surprising than any callback style interface?
A procedure call is explicit, a trigger is implicit. You don't call a trigger, it just happens as a side effect of something else. People tend to forget implicit things. Suddenly you notice that something is acting strangely or slowly in your application. You can look at your functions and procedures and try to find the problem. But if your application is full of triggers, how do you know what is going on? A trigger can change a dozen rows, which in turn can change other rows, so changing a single row can trigger thousands or millions of triggers. Also, triggers are not fired in a particular order, the database is free to change the query plan according to what it thinks is best at the moment, so triggers are not deterministic. Triggers can sometimes work and sometimes not.
In summary, triggers are implicit, have side effects and are not deterministic. They are confusing and surprising. Almost everything that can be done with a trigger can be done with a procedure, but explicitly, deterministically and in most cases even without side effects.
I don’t think there’s anything wrong with any of these features, I’ve used all of them myself at various times. The problem though is that whenever you use them, you’re introducing additional complexity to your application, so you have to decide every time whether it’s worth it. If you use these features without proper consideration, you can easily end up with a mess of interdependent schema objects. It’s easy enough to get to a state where it’s difficult to visualize code flows, and in that case making changes will become riskier because all of those features can produce major gotchas.
Also great for BI.
Most BI tools perform a lot of calculations in their queries just to get the data the way they want it.
With this, you can get those calculations pre-made and persisted (except in backups: pg_dump doesn't include generated values, so they're recomputed on restore).
> IMO these types of calcs are better done in your app, where you can at least write a test and assert it's doing the right thing.
I can write a test for the db logic, too.
> It also makes it a lot easier to reason about your code if the logic is in the app rather than bits of it stuck in column definitions.
It's a lot easier to be confident that all consumers of the DB, regardless of whether they are coming through a particular app, have the correct view of the data if the non-app-specific domain logic is in the database.
If you're the one implementing the functions (or even if you're not), there are tools such as pgTAP that can help you with testing. I've used pgTAP successfully on a number of projects. There's also no reason you can't test the behavior of functions through a driver in some other language, though you're then one step removed.
I’m not aware of any coverage tools, though it’s been a while since I’ve looked.
Or you could test behaviour, and not code. For example, have a test as part of your app's general test suite - insert a row, read it back and check the generated column value is what you expect.
It really depends on what your app does, and how software development is structured at your company.
In general though, materialized views and computed columns are much more useful for reporting than replacing application logic.
They can also be useful to smooth out upgrading line-of-business software, where, say, the old system needs several separate values but the new system only needs one and derives the others. You would use something like this to make the two systems play nice until the migration is complete.
I’m not saying this is the right solution, but we do face the problem of application-calculated attributes that have meaning in an API and in the application, but are not persisted to disk. A trivial example from our education domain might be % correct, assuming we’ve persisted the possible score and an achieved score on a question or test.
When we ELT to the data warehouse, analysts and internal users want to report on and reason about these calculations. But then we face the quandary: do we re-implement that business logic in the ELT? Or do we go back and make sure we persist everything, even these values that are so quick and light to calculate in the application? Or do we (shudder) make the ELT load from a bulk API, rather than just copying across the DB?
I could imagine someone talking themselves into using these calculated columns as a solution. I wouldn’t recommend it, but I can see the ELT problem prompting it.
> where you can at least write a test and assert it's doing the right thing
SQL is code. PL/pgSQL is code. Code can be tested. Code should be tested. You don't even need to leave the database to write tests for SQL or PL/pgSQL[1].
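A minimal pgTAP sketch (assumes CREATE EXTENSION pgtap, plus a hypothetical purchase table whose vat column should be 25% of value):

    BEGIN;
    SELECT plan(1);

    INSERT INTO purchase (value) VALUES (10);
    SELECT is( (SELECT vat FROM purchase WHERE value = 10),
               2.5::numeric,
               'vat is 25% of value' );

    SELECT * FROM finish();
    ROLLBACK;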
I'm not a db-engine expert, but if I'd have to write this functionality myself, I'd perform the computation only when any of the involved columns change, not on every read.
EDIT: sorry, I read your comment on the opposite way. I see we are saying the exact same thing :)
Yes, but increasingly for us we tend to have several "apps" accessing our data: The mobile, desktop and API 'apps' are all different code bases for us.
And then someone at support comes along, updates the data via SQL, and forgets to update the derived field, because he got woken up at 2am for an emergency.
Using a computed column ensures the data is consistent.