
Just out of curiosity, what sort of requirements compel a person to write a postgres extension? I know of things like PostGIS but I'm not well versed enough in this world to know the totality of the problem.



Author of `pgx` here.

We developed pgx so that we could rewrite "ZomboDB" (https://github.com/zombodb/zombodb) in Rust. ZDB is a custom "Index Access Method" for Postgres. Think btree index, but stored in Elasticsearch.
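
To make the "index access method" idea concrete, here's roughly what using ZomboDB looks like, going by its README (the table name and Elasticsearch URL are illustrative, and details may differ by version):

    CREATE EXTENSION zombodb;

    -- a zombodb index looks like any other index, but lives in Elasticsearch
    CREATE INDEX idxproducts ON products USING zombodb ((products.*))
        WITH (url='http://localhost:9200/');

    -- full-text search through the index, via ZomboDB's query operator
    SELECT * FROM products WHERE products ==> 'sports box';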

So that's definitely a thing.

Other ideas might be custom data analytics/transformation/processing engines that you'd prefer to run in the database instead of externally.

Custom data types to represent, for example, street addresses or genetic information.

The only limit is yourself! ;)


> The only limit is yourself! ;)

For folks who don't get the joke, please visit zombo.com before Flash is finally dead :).


Alternatively, there's https://html5zombo.com


For folks who still didn't get it, like me: https://en.wikipedia.org/wiki/Zombo.com


Have you evaluated Tantivy/Toshi as a possible alternative to ES? I only ask because of your native use of Rust. PGX seems like a more robust offering than pg-extend-rs. This is a great addition to the community!

Also, v0.0.3 doesn't inspire much confidence about stability! I'll keep an eye on the project and wish the team the best.


re: v0.0.3 -- sure. I just published it last night.

We've been working on it since November last year, and have now fully ported ZomboDB to it.

It's proving out nicely, but keep in mind that Postgres' internals are infinitely complex. Getting safe wrappers around all its "things" is going to take a very, very long time.

I'd rather get something that seems very stable now, and continue to iterate on it over time.


I think the comment was less about your code and more about your v0.0.3 version number. I guess a v0.99 would imply more stability.

That's why lots of companies have switched to date-based versioning, as Ubuntu and others do.


I've looked into them. It seems they're designed to work within a single process, and it's not quite clear to me how sharing the underlying data files across postgres backends (even with proper Postgres locking) would work.

That's not to say they aren't good frameworks. I'm sure they are. It just seems like they're designed for different use cases.

That said, I have other ideas on this front that I can't talk about today. ;)


> The only limit is yourself! ;)

You have no idea how significant of a limit that is... I'm kidding.

Thanks for the response! And zombodb looks very cool. I might bring it up since we have a couple of heavy analytics endpoints right now.


The most common reason to write a PG extension is to add native support for a structure/data type that PG doesn't already have native support for.

For example, despite Postgres having a (base-10000) numeric/decimal type, there are certain things that can be done much more cheaply in base-256 (e.g. ser/des from bit-packed and/or hex strings), and thus pgmp (https://github.com/dvarrazzo/pgmp) exists as a Postgres libgmp binding.
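
As a hedged sketch of what that buys you — mpz is pgmp's arbitrary-precision integer type; see the pgmp docs for the full set of operators and conversions:

    CREATE EXTENSION pgmp;

    -- arbitrary-precision integer math backed by libgmp (base-256 limbs)
    SELECT '123456789012345678901234567890'::mpz
         * '987654321987654321987654321'::mpz;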

There are also "data service" type extensions—exposing some other data "into" Postgres so it can participate in a query. PG's own pg_stat_statements extension is an example. Postgres's support for Foreign Data Wrappers has mostly obviated writing these as extensions, though; it's much easier (and safer for your RDBMS-side data!) to just write these as standalone daemons that Postgres mounts as FDW servers.


A foreign data wrapper is an extension.

To install it, you run "create extension my_fdw"


Yes, a given Foreign Data Wrapper—that is, the module of functionality that describes the connection strategy for some remote resource, like an ORM "adapter" module—is a Postgres extension. To get PG to speak a new protocol, you need to write a new FDW extension.

But a Foreign Data Wrapper server is just an instance of that class. You don't need to write a Postgres extension just to use an existing Foreign Data Wrapper.

And, crucially, many of these FDWs are written to be backed by standard (or at least well-documented) network protocols.

Which means that, a large percentage of the time, people who think "I'll take this third-party thing and integrate its data into Postgres" don't write a Foreign Data Wrapper these days. Instead, they pick an existing network protocol that already has an FDW, make their third-party thing speak that protocol, and then they (and anyone else who likes) can mount it as a server using the existing FDW.

Specifically, many systems that want to be "Postgres-compatible" these days (CockroachDB, Materialize, etc.) speak the Postgres wire protocol as if they were Postgres itself. This is not (mainly) so that real clients can talk to them through a libpq binding (as they tend to have their own, more idiomatic client APIs), but rather so that actual Postgres instances can mount them using `postgres_fdw`, the FDW that assumes the other end is another Postgres server and uses libpq to talk to it.
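
Concretely, mounting any Postgres-wire-compatible server through postgres_fdw looks like this (host, credentials, and schema names are illustrative):

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER metrics_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'metrics.internal', port '5432', dbname 'metrics');

    CREATE USER MAPPING FOR CURRENT_USER SERVER metrics_srv
        OPTIONS (user 'reporter', password 'secret');

    -- pull the remote tables in; they can now participate in local queries
    CREATE SCHEMA metrics;
    IMPORT FOREIGN SCHEMA public FROM SERVER metrics_srv INTO metrics;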


This is a sort of broad question, but if I were to try and answer I would say that if there is anything you want to be handled by the database you would gravitate towards an extension.

PostGIS is a great example since one of the most well-known additions it makes is a datatype, but that's not the only thing it adds. A datatype on its own wouldn't be that useful without spatial indexing (https://postgis.net/workshops/postgis-intro/indexing.html) or spatial utilities to make spatial joins more expressive (https://postgis.net/workshops/postgis-intro/joins_exercises....).
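
To make that concrete, here's a minimal sketch of the datatype-plus-index-plus-operator combination (table and coordinates are made up):

    CREATE EXTENSION postgis;

    CREATE TABLE places (
        id   serial PRIMARY KEY,
        geom geometry(Point, 4326)
    );

    -- the GiST index is what makes spatial predicates usable at scale
    CREATE INDEX places_geom_idx ON places USING gist (geom);

    -- everything within 1km of a point (distances in meters via geography)
    SELECT id
    FROM places
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326)::geography,
                     1000);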

Rich data types aren't the only candidates for extensions: automated partition management (https://github.com/pgpartman/pg_partman), data sharding (https://www.citusdata.com/), and even database cron scheduling (https://github.com/citusdata/pg_cron) are also good examples of things that are well suited to be extensions.
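
As a sketch of how those compose (the table is hypothetical, and the create_parent/schedule signatures follow the projects' docs of roughly this era):

    -- pg_partman is conventionally installed into its own schema
    CREATE SCHEMA partman;
    CREATE EXTENSION pg_partman SCHEMA partman;
    CREATE EXTENSION pg_cron;

    -- daily time-based partitions managed by pg_partman
    SELECT partman.create_parent('public.gps_history', 'recorded_at',
                                 'native', 'daily');

    -- pg_cron runs partman's maintenance (creating new partitions) hourly
    SELECT cron.schedule('0 * * * *', $$SELECT partman.run_maintenance()$$);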

To be honest, I have a hard time imagining alternative paths where some of this functionality _isn't_ an extension. You might imagine an external daemon for things like pg_cron and pg_partman, but if PostGIS wasn't an extension you would probably be using a different database/tool (or a fork) if you had geospatial requirements. It's worth noting that Citus was a fork of PostgreSQL before it was refactored to be an extension.

I know this isn't a direct answer to your question, but hopefully you find it somewhat useful.

Edit: I totally forgot to add 1 more thing about extensions. When functionality is packaged as an extension there is a much higher likelihood you can mix and match them. For example, I currently run a PostgreSQL cluster using Citus, with pg_partman for automatic time-based partitioning, pg_cron regularly scheduling partition creation, and PostGIS for geospatial datatypes. You could extrapolate the various ways you might use this kind of setup...one that jumps to my mind is a scalable GPS history that could be queried by time period as well as by geographic region.


Yeah, "rich data types" is a good point.

pgx provides a #[derive(PostgresType)] macro that lets you represent any normal Rust struct (that serde_cbor can (de)serialize) as a Postgres type. You write no code.

It even generates all the boring SQL boilerplate for you.
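
For reference, that boilerplate is Postgres' standard custom-type recipe, which looks something like this (the type name is illustrative; pgx emits the equivalent for your struct):

    -- declare a shell type so the I/O functions can reference it
    CREATE TYPE street_address;

    CREATE FUNCTION street_address_in(cstring) RETURNS street_address
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

    CREATE FUNCTION street_address_out(street_address) RETURNS cstring
        AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

    -- now fill in the shell type with its I/O functions
    CREATE TYPE street_address (
        INPUT   = street_address_in,
        OUTPUT  = street_address_out,
        STORAGE = extended
    );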

I plan on putting together an example about this and doing a twitch stream this week to discuss it in detail.


I think “extension” covers any custom code (even functions that need to import a C library), not just major db features. I have had to do it recently [1] to customize the partitioning function to be the way I like. It would be great to do it in a friendlier language.

[1] https://harikb.github.io/posts/postgres-11-hash-partitioning...
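
For context, stock declarative hash partitioning in Postgres 11 looks like the following (table is illustrative); the linked post is about customizing the hashing behavior underneath:

    CREATE TABLE events (
        id      bigint NOT NULL,
        payload jsonb
    ) PARTITION BY HASH (id);

    -- one child table per remainder bucket
    CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);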


> It would be great to do it in a friendlier language

Whether Rust is "friendlier" than C depends a lot on your definition of what that means.


You're not wrong, but barring bugs in `pgx` (of which I'm sure there are plenty right now), at least Rust gives you compile-time guarantees around not crashing.

And when running inside your database process, that's a huge win.


As a follow-up. Here's what Postgres' "generate_series" function looks like with pgx: https://github.com/zombodb/pgx/blob/0803bd3aa2f8ef2bd80d2e03...

And here's what you'd have to do to implement it in C: https://github.com/postgres/postgres/blob/dad75eb4a8d5835ecc...


One more follow-up...

The top one is pgx, the bottom is Postgres. So there's a little room for improvement here with pgx, but that's okay for a v0.0.3 release.

    test=# select count(*) from srf.generate_series(1, 10000000);
    Time: 1552.115 ms (00:01.552)

    test=# select count(*) from generate_series(1, 10000000);
    Time: 1406.357 ms (00:01.406)


The largest part of the time executing the above query isn't spent inside the function, so this isn't that meaningful a comparison. The reason is that SRFs in FROM have to be materialized into a tuplestore, which isn't free:

    postgres[607045][1]=# SELECT count(*) FROM generate_series(1, 10000000);
    ┌──────────┐
    │  count   │
    ├──────────┤
    │ 10000000 │
    └──────────┘
    (1 row)

    Time: 1249.224 ms (00:01.249)

    postgres[607045][1]=# SELECT count(*) FROM (SELECT generate_series(1, 10000000)) s;
    ┌──────────┐
    │  count   │
    ├──────────┤
    │ 10000000 │
    └──────────┘
    (1 row)

    Time: 460.206 ms

For mostly historical reasons, SRFs in the target list can use the non-materializing SRF query protocol, but SRFs in the FROM list can't.

Any chance you could show the timings for the pgx version of the second query?


Sure! Top is Postgres, bottom is pgx, after running each 5 times...

    test=# SELECT count(*) FROM (SELECT generate_series(1, 10000000)) s;
      count   
    ----------
     10000000
    (1 row)

    Time: 399.630 ms
    test=# SELECT count(*) FROM (SELECT srf.generate_series(1, 10000000)) s;
      count   
    ----------
     10000000
    (1 row)

    Time: 478.194 ms

Thanks for the reply. I'm not surprised there's room for optimization in pgx, especially in the Iterator-->SRF path.

edit: PS, thanks for the idea of putting what I assume is the backend pid in psql's prompt. I need to go figure out how to do that right now!

edit edit: hmm, I guess that's not the PID.


Thanks for the update.

> edit: PS, thanks for the idea of putting what I assume is the backend pid in psql's prompt. I need to go figure out how to do that right now!

Here's my ~/.psqlrc:

    andres@awork3:~/src/postgresql$ cat ~/.psqlrc
    \set QUIET 1

    \pset pager always
    \set VERBOSITY verbose
    \pset null '(null)'
    \set COMP_KEYWORD_CASE upper
    \pset linestyle unicode
    \pset border 2
    \set PROMPT1 '%/[%p][%l]%x%R%# '
    \set PROMPT2 '%/[%p][%l]%x%R%# '
    \set PROMPT3 'c:%/[%p][%l]%x%R%# '
    \set HISTCONTROL ignoredups
    \set HISTSIZE 100000

    \timing on
    \set QUIET 0


In the function definition of `generate_series`, what is going on with the `default!` macro?

That isn't some sort of way of actually getting real default param values is it? I thought that wasn't possible in Rust.


That is certainly nicer.


Does Rust give you compile-time guarantees around not crashing? I thought any function could panic without so much as a peep from the compiler.


Fair. With pgx, however, Rust "panic!"s are translated into standard Postgres "ERROR"s, such that instead of crashing, only the current transaction aborts.

So while you're pretty much correct in general, pgx handles it in the way a PG extension author would expect.


One might say that Rust is friendlier downstream, to the ops people running the compiled binary :)


Really? I struggle to imagine a case for claiming that C is friendlier.


The ecosystem of preexisting C extensions to generalize from, for one.


It just depends on the type of problem you'd like to solve.

I've got a whole bunch of ETL helper functions and patterns I've written for my business that could be packaged up as an extension if I were so inclined, for example. Those are all SQL or PL/PGSQL functions, so no need for native code there though...

For PostGIS, a whole lot of that code is just SQL or PL/PGSQL functions...but there are quite a few C functions as well, to take care of the heavy lifting that would be less efficient to implement in SQL or PL/PGSQL.


A few years ago I wrote an experimental Logical Decoding plugin that translates database transactions to DNS UPDATE messages. I was super impressed by how nice it was coding to PostgreSQL's plugin APIs in C. I'm currently not sure if a C plugin is the best approach for this particular task: it might be better to use one of the more popular Logical Decoding to JSON plugins and munge the JSON into DNS UPDATEs in something more high-level than C.


We have to write a JSON logger to replace the default log format.


`pgx` would let you do that.

You could hook Postgres' "emit_log_hook" and probably just use serde to transform the provided "ErrorData" pointer right to JSON and ship it off wherever you want.

(edit: typos)



