> I have to say, it's very perplexing to me that people (especially in HN) would...

skytreader · on June 12, 2021

Actually, if you read the rest of my previous reply (as opposed to an offhand comment at the end) you will see that I'm considering the logistics of operating an in-DB queue as well.

Using an in-DB queue doesn't give you zero OpEx does it? Maybe you can make the argument that it's cheaper but it's cheaper for a reason: the job wasn't done properly, so to speak.

Both options introduce new failure modes and op costs into your system. Might as well do it properly if it only costs slightly more (and that is a huge IF, in my opinion). When you run into a failure, it's a standard one; maybe the solution is even one Google search away, as opposed to realizing, one hour into a late-night debugging session, that, shit, that table should be treated as a queue!

derefr · on June 12, 2021

I read your whole reply. I didn't see any mention of anything I'd consider an OpEx cost.

Re: Monitoring

Presuming you're already monitoring anything at all, adding monitoring for a new piece of infrastructure is a pure CapEx cost. You presumably already have monitoring infrastructure running. Its ingestion, by design, will be 99% idle — i.e. it won't need to be scaled horizontally proportionally to the number of components. The only thing needed, then, will be careful up-front design for the monitoring.

Which is usually also a cost already paid for you in advance when you use an abstract job-queue library. They all know how to expose Prometheus metrics endpoints, and they do the accounting to serve those endpoints efficiently (usually using process-local per-job-producer and per-job-consumer counters, which you must then roll up yourself at the PromQL level, taking the irate() to find spikes.)
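As a rough illustration of that accounting, here is a minimal stdlib-only Python sketch of process-local per-producer counters rendered in the Prometheus text exposition format. The `QueueMetrics` class, the `jobqueue_jobs_enqueued_total` metric name and the `producer` label are all made up for this example; a real library would more likely lean on an official Prometheus client.

```python
import threading
from collections import defaultdict


class QueueMetrics:
    """Process-local counters, one per job producer (hypothetical names)."""

    METRIC = "jobqueue_jobs_enqueued_total"  # assumed metric name

    def __init__(self):
        self._lock = threading.Lock()
        self._enqueued = defaultdict(int)

    def record_enqueue(self, producer):
        # Bumping an in-memory counter is cheap; no DB query is involved.
        with self._lock:
            self._enqueued[producer] += 1

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        with self._lock:
            snapshot = sorted(self._enqueued.items())
        lines = [f"# TYPE {self.METRIC} counter"]
        for producer, count in snapshot:
            lines.append(f'{self.METRIC}{{producer="{producer}"}} {count}')
        return "\n".join(lines) + "\n"


# The roll-up across producers and the spike detection happen at query
# time, with a PromQL expression along these lines:
#   sum(irate(jobqueue_jobs_enqueued_total[5m]))
```

The string returned by `render()` is what a `/metrics` endpoint would serve; the scrape itself never has to touch the job tables.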

Re: AAA

> Also, why does no one consider the added "surface for human error" that using a DB for a queue introduces.

Because the type of person who reaches for their DB first to solve a novel problem, is the same type of person who understands and makes use of their DB's security features, to the point that doing more DB-security-config isn't a complex one-off problem for them, but something they can whip off in a few minutes.

For a simple option, you can create two service-account DB roles: a queue_producer, and a queue_consumer. You can put all the job-queue implementation tables in a schema owned by the queue_producer; and then grant the queue_consumer SELECT privileges on all the tables, and UPDATE privileges on some of them. Then, nobody but the queue_producer (or a DB superuser) can create or destroy jobs; and nobody but the queue_producer, the queue_consumer, or a superuser can read or modify jobs. (Your job-queue abstraction library within your service usually maintains its own DB connection pool anyway, so it's no sweat to have those connections use their own credentials specific to their job-queue role.)
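A sketch of what that two-role setup might look like, assuming PostgreSQL-flavoured DDL. The schema name `jobq`, the table name `jobs` and the helper `queue_role_ddl` are hypothetical; the function only builds the statements, which you would then run once as a superuser through whatever driver you use.

```python
def queue_role_ddl(schema="jobq", updatable_tables=("jobs",)):
    """Build PostgreSQL-style DDL for the two-role layout (names assumed)."""
    stmts = [
        "CREATE ROLE queue_producer LOGIN",
        "CREATE ROLE queue_consumer LOGIN",
        # The producer owns the schema, so only it (or a superuser) can
        # create or drop objects inside it.
        f"CREATE SCHEMA {schema} AUTHORIZATION queue_producer",
        # The consumer may look inside the schema and read every table.
        # Note: this grant only covers tables that already exist, so run
        # it after the queue tables have been created.
        f"GRANT USAGE ON SCHEMA {schema} TO queue_consumer",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO queue_consumer",
    ]
    # UPDATE only on the tables a consumer must touch (e.g. to mark a job
    # as done). No INSERT or DELETE: consumers cannot create or destroy jobs.
    for table in updatable_tables:
        stmts.append(f"GRANT UPDATE ON {schema}.{table} TO queue_consumer")
    return stmts
```

Calling `queue_role_ddl()` returns six statements, ending with `GRANT UPDATE ON jobq.jobs TO queue_consumer`.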

For a more complex — but perhaps useful? — option, the tables themselves can be "hidden" behind stored procedures (DDLed into existence by the abstract job-queue library), where nobody has any privileges (incl. SELECT) on the tables, only EXECUTE rights on the sprocs. And the sprocs are carefully designed to never do anything that could have an unbounded CPU time. Then anyone can "check up on" or even "insert into" the job-queue, but nobody can do anything "funny" to it. (My god, it's an encapsulated API!)

Once again — the libraries that abstract this away, already think about concerns like this, and choose one or the other of these options. That's why libraries like this exist, when the "core premise" is so simple: it's so there's a place to put all the fine details derived from man-years of thought on how to make this approach robust.

-----

On a completely different track, though: having a queue in the DB can sometimes be the optimal (i.e. the "if we had infinite man-hours for design + implementation") engineering decision. This case comes when the thing the queue is operating upon is data in the DB. In such a case, the DB data modification, and the job's completion, can succeed or fail together atomically, as part of a single DB transaction.

To accomplish the same thing when your queue lives outside the DB, you usually end up either needing some really gnarly distributed-locking logic that both your DB and your app layer need to know everything about (leaky abstraction!); or you need to completely remodel your data and your job queue into an event-streaming paradigm, so that you can "rewind" one side when the other side fails.

skytreader · on June 12, 2021

> For a simple suggestion, the job queue can live in a schema that is owned by a job-queue service-account role. Nobody other than that user (or a DB superuser) can issue any query in there.

Yes, service accounts are fine and dandy but [a] isn't that more hoops to jump through, brought on by choosing the wrong tool for the job (gotta stay mindful of those permissions, you can have an Ansible typo, etc) and [b] as I said, proper queue services would not even allow tampering with enqueued data, for no extra cost other than installing the service. In contrast, your service account falling into the hands of a malicious actor is still a threat vector.

> For a more complex — but perhaps useful? — option ...

Now this is just a queue service with extra steps running in a relational DB instead of natively as an OS process. You did cite it as just an option but I don't see why this is an attractive option.

> I didn't see any mention of anything I'd consider an OpEx cost.

It seems to me we have differing definitions of OpEx and, judging by your previous comment, you value this almost exclusively in terms of "the number of ops salaries you have to pay". Even if I play by that definition, I can tell you from experience (corroborated by other commenters here, in other subthreads of this discussion) that operating a Redis cluster does not need extra warm bodies. The people who already monitor that your app responds and that your servers aren't running out of memory could take on the task with little to no additional training.

The experience I want to cite, bearing in mind of course that everyone's mileage varies: in my previous job, customer base of 2M and growing across two countries, we operated and monitored a redundant Redis cluster with an ops team of 3. In my first job I talked about in another subthread here, we managed three high-traffic Redis queues (and a few other smaller-traffic special cases) with a team of 7. In both jobs, we also did active across-the-stack development; we weren't dedicated to monitoring, nor to the components related to the queues.

In fact I would argue running a DB (queue or no queue) is just more complex than running a queue service. Rather telling is the need for "DB Administrators/Specialists" but no such equivalent for Redis or other proper queueing solutions.

> Which is usually also a cost already paid for you in advance when you use an abstract job-queue library. They all know how to expose Prometheus metrics endpoints...

Honestly, this sounds fine but I'm rather wary of the caveats that might be present. How do they do it efficiently? What do I need to keep in mind to keep things efficient? As such, unless you can point me to your preferred job-queue library which does all these wonders, I have to reserve judgment.

Edit: I just remembered mentioning this in another subthread. Even if you have a fantastic library/middleware abstracting all those queue concerns, that ties you in with that library. If someone wants to build integrations into your queue, they have to play by that library. If the majority of your app is in Java (and so is the library), and Data Science (who, in this example, use Python, which is not improbable) wants to share your data pipeline but that library isn't available for them, tough luck I guess? More dev time for someone.

And also, whatever underlying DB features or semantics your library might rely on in order to enforce a queue structure, you can't be assured that future versions of your DB would support those features because, you know, your DB isn't really in the business of queues. It opens you up to running an outdated DB version just to keep that queue functionality.

derefr · on June 12, 2021

> Now this is just a queue service with extra steps running in a relational DB instead of natively as an OS process. You did cite it as just an option but I don't see why this is an attractive option.

Your DB then shares a WAL log with your queue. Meaning a single managed physical replication pipeline for them both. Meaning only one set of leader-election issues to debug, not two. Meaning one canonical way to do geographic high-latency async replication. Meaning disaster recovery brings back a whole-system consistent snapshot state. Etc.

Honestly, if I had my way, every stateful component in the stack would all share a single WAL log. That’s what FoundationDB and the like get you.

> In fact I would argue running a DB (queue or no queue) is just more complex than running a queue service.

Well, yeah, but you usually need a DB. So, if you’re going to be paying the OpEx costs of the DB either way, then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.

(I feel the same way about Redis, as it happens: if you need it, and are locking your code into its model anyway, then you may as well take advantage of its more arcane features, like Redis Streams, Lua scripting, etc.)

However, maybe our company is uncommon in how much our service literally is doing fancy complex DB queries that use tons of DB features. We’re a data analytics company. Even the frontend people know arcane SQL here :)

> that ties you in with that library

The difference between what you / apps / abstracting libraries do in Redis, and what they do in an SQL DB, is that in the DB, the shape of everything has to be explained in a vendor-neutral manner: SQL DDL.

Sometimes Redis-based solutions converge on conventional schemas; see e.g. Sidekiq’s informal schema, which several other queuing systems are implemented in terms of. But when they don’t, there’s nothing you can really do — beyond hacking on the libraries involved — to bring them into sync.

In an SQL DB, anything can be adapted into the expected shape of anything else, by defining SQL views. (Heck, in an SQL DB with Redis support, like Postgres with redis_fdw, the Redis data can be adapted into any shape you like using SQL views.)
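A toy version of that adaptation, again with `sqlite3` standing in for a real DB. Both schemas here are invented: `lib_a_jobs` plays the table one library created, and `lib_b_queue` is the shape some other consumer expects.

```python
import sqlite3


def make_adapter_view(conn):
    conn.executescript(
        """
        -- Shape that (hypothetical) library A writes.
        CREATE TABLE lib_a_jobs (
            job_id  INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            state   TEXT NOT NULL      -- 'pending' or 'finished'
        );

        -- Shape that (hypothetical) consumer B wants to read.
        CREATE VIEW lib_b_queue AS
            SELECT job_id  AS id,
                   payload AS body,
                   CASE WHEN state = 'finished' THEN 1 ELSE 0 END AS done
            FROM lib_a_jobs;
        """
    )


def pending_bodies(conn):
    """Read through the view as if it were B's native table."""
    rows = conn.execute("SELECT body FROM lib_b_queue WHERE done = 0 ORDER BY id")
    return [body for (body,) in rows]
```

This only covers reads. Whether and how you can write through a view differs between DBMSes, so treat the write path as a separate design question.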

And that’s further enabled by the fact that the DB has received from the app, through DDL, a schema that you can examine, manipulate, and refactor; or even synthesize together with other schemas.

> you can't be assured that future versions of your DB would support those features

You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.

skytreader · on June 12, 2021

> Your DB then shares a WAL log with your queue...

I feel like the understanding we can come to here is that we have differing definitions of necessary complexity.

You did mention you work in data analytics and I have worked and am working in more traditional product-oriented gigs. Everything you mentioned is nice, impressive even, but to me they are toys. I don't need a queue with a WAL stream, or PITR backups. Queues, to me, are inter-process communication mechanisms, or a means to concurrency. In fact, worst case, you can delete the queues or restart Redis even without triggering a disk write (though I note this is another feature that comes free with Redis); it would inconvenience our customers but they can always just retry.

Of all the benefits you mentioned, leader-election is the only one I could make a good case for.

> then you may as well understand it deeply in order to wring the most productive use you can out of each OpEx dollar/man-hour spent.

Understanding it is one thing but using that understanding to create/develop (CapEx) and then maintain (OpEx) an in-DB queue feels like unnecessary costs. CapEx to install Redis is practically nil, and you seem to agree that running it is far simpler OpEx than PG too ("Well, yeah, but...").

Of course, I keep in mind your point about libraries doing the dirty work. But if it all boils down to a third party in the end, the point I'm replying to above is rather moot, no? And if it comes down to a third-party dependency anyway, I've already reasoned out my preference for an actual queue solution.

> explained in a vendor-neutral manner: SQL DDL

> You can if those features are in the SQL standard. I’ve never heard of a DBMS regressing on its level of SQL standard support.

While this is an ideal case, my experience so far shows that every major DB package relies on its own tweaks to the SQL standard, and so "If those features are in the SQL standard" turns out to be a pretty big "if" to ask. I don't worry about vendors regressing on SQL standard support but rather that the libraries for DB queues are relying on the non-SQL-standard features/behavior to mock out a full queue functionality---non-standard behavior that has no guarantee of consistency across versions.

I mean, if there's a DB queue middleware/library that works across DB vendors, be it Postgres, Oracle, Maria, MySQL (with a reasonable storage engine, perhaps) then that's a convincing argument that the SQL standard can support queuing sufficiently. But otherwise, this counterpoint is a hard sell for me.

derefr · on June 12, 2021

> it would inconvenience our customers but they can always just retry.

We have very different things in our queues :) An MQ, to me, is a place to put stuff that needs reliable, at-least-once delivery. Like, say, invoices. User never receives that, bad things happen.

There's very little difference between that type of data, and the data in your DB, while it's "active." It's only once the data in the queue is fully ACKed and no longer relevant that the data behaves more ephemerally than first-class DB state.
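For what "active until ACKed" can look like in a table, here is a minimal lease-based sketch, with `sqlite3` as a stand-in and time passed in explicitly so the behaviour is easy to follow. The `queue` table, the `locked_until` column and the `claim`/`ack` helpers are all hypothetical.

```python
import sqlite3


def create_queue(conn):
    conn.execute(
        "CREATE TABLE queue ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " locked_until INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def claim(conn, now, lease=30):
    """Lease the oldest visible message; returns (id, body) or None."""
    with conn:
        row = conn.execute(
            "SELECT id, body FROM queue WHERE locked_until <= ? "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Only take the lease if nobody else grabbed it in the meantime.
        cur = conn.execute(
            "UPDATE queue SET locked_until = ? "
            "WHERE id = ? AND locked_until <= ?",
            (now + lease, row[0], now),
        )
        return row if cur.rowcount == 1 else None


def ack(conn, msg_id):
    """Only an explicit ACK removes the message for good."""
    with conn:
        conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
```

If a consumer claims a message and dies before calling `ack`, the lease runs out and the next `claim` hands the same message out again. That is the at-least-once part; the price is that handlers have to tolerate seeing a message twice.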

> and you seem to agree that running it is far simpler OpEx than PG too

I did not say that. In our stack, we run a Redis cluster on top of Kubernetes, because it's nominally semi-stateless. In reality, it's our biggest SPOF. It fights the container daemon around memory allocations (why does a Redis container that uses 2GB memory steady-state need a 64GB memory limit to finish starting up!?) and has weird dynamics around startup readiness (i.e. Redis started from a large AOF-with-RDB-prolog file will start serving requests for just a moment, then immediately stop again to finish loading. This confuses k8s into thinking it's done with startup-probes and is ready for readiness-probes. Which it's not.)

Redis, when configured to be both Highly Available and durable (which is not your use-case, I know) is really just less-robust than a DB, with less management tooling, fewer docs, fewer people who know how to scale it, fewer third-party solutions pre-adapted to expect it, etc. Because, to paraphrase: Redis isn't really in the business of being a database. And yet that's exactly what you end up with, when you try to use Redis as a reliable at-least-once MQ: a database. Just, a kinda sucky one. (One that doesn't even have a concept of users with different privileges; let alone RBAC!)

And yet, that actually is the niche of Redis. It's the durable, client-server cousin to in-process peer-mesh state-sync solutions like Hazelcast/Mnesia. Which makes it kind of weird that it's not particularly tuned for solving the problems in the one niche unique to it.

(I have nothing against Redis! It has non-unique use-cases where it works great, e.g. when it's being used as "memcached with more data structures", or as a coordination/locking server instead of Zookeeper, etc. A robust reliable MQ it is not. My argument is that your average RDBMS is closer to being a robust, reliable MQ than Redis is. Though, of course, you'll get the best robustness+reliability for that use-case out of a "real" MQ — Kafka or NATS or something.)

> but rather that the libraries for DB queues are relying on the non-SQL-standard features/behavior to mock out a full queue functionality

They don't have to, because everything you need to implement reliable queuing semantics is already part of [the lowest-common-denominator part of the] SQL standard [that everything that calls itself an RDBMS implements.]

These libs have DB-specific adapters for two reasons:

1. There's no such wire protocol as "SQL" — every RDBMS speaks its own wire protocol to convey your SQL statements to the database and convey result-sets back; and so, unless you're using a language with a common API abstraction for RDBMS client libs to be built against (e.g. JDBC), then each such client lib ends up presenting its own distinct API to callers, which needs a separate adapter written for it.

2. With SQL, reliable queuing is "easy", but knowing about enqueued items (with as minimal an overhead as possible, scanning as few additional items as possible) is hard. SQL says nothing about how to get push-notification-to-the-client type information out of the DB, and so every DBMS implements its own mechanism for this. These are pure optimizations using pure-ephemeral code, though; the schema of the queue in the DB remains the same either way, such that the data could be exported from one RDBMS to a .sql file, imported from that .sql file to a different RDBMS, and the only thing needed on the client side would be switching adapters.

(In theory, presuming a common wire-protocol abstraction interface lib like JDBC, you could have a "standard SQL" adapter. It'd just be annoyingly low-throughput. Like using HTTP polling when you could be using WebSockets.)
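A sketch of what such a lowest-common-denominator adapter might look like in Python, written against the DB-API (PEP 249, roughly Python's counterpart to JDBC) with `sqlite3` as the example driver. The `PollingAdapter` class and the `jobs` table layout are assumptions for illustration, and the `?` placeholder style is itself driver-specific.

```python
import sqlite3
import time


class PollingAdapter:
    """Fallback adapter: learns about new jobs only by polling."""

    def __init__(self, conn, interval=0.05):
        self.conn = conn
        self.interval = interval

    def wait_for_work(self):
        # A DB-specific adapter would block on the vendor's push
        # mechanism here; the portable fallback can only sleep and retry.
        time.sleep(self.interval)

    def try_claim(self):
        cur = self.conn.cursor()
        cur.execute("SELECT MIN(id) FROM jobs WHERE status = 'ready'")
        job_id = cur.fetchone()[0]
        if job_id is None:
            self.conn.commit()
            return None
        # Compare-and-set: the WHERE clause makes the claim race-safe.
        cur.execute(
            "UPDATE jobs SET status = 'running' "
            "WHERE id = ? AND status = 'ready'",
            (job_id,),
        )
        claimed = cur.rowcount == 1
        self.conn.commit()
        if not claimed:
            return None  # another consumer won the race
        cur.execute("SELECT payload FROM jobs WHERE id = ?", (job_id,))
        return job_id, cur.fetchone()[0]

    def run(self, handler, max_idle_polls=3):
        """Handle jobs until the queue stays empty for a few polls."""
        idle = handled = 0
        while idle < max_idle_polls:
            job = self.try_claim()
            if job is None:
                idle += 1
                self.wait_for_work()
                continue
            idle = 0
            handler(*job)
            self.conn.cursor().execute(
                "UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],)
            )
            self.conn.commit()
            handled += 1
        return handled
```

A vendor-specific adapter would override only `wait_for_work` (for example, blocking on LISTEN/NOTIFY in PostgreSQL) and leave the table schema and the claim logic alone, which is the "just switch adapters" property described above.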