We have our own queue, because it was easy, fun and has been exceedingly reliabl...

mr-karan · on Sept 25, 2023

> The workers ask for work when they want it, rather than being constantly listening

Can you elaborate more on this? How do the workers know when they have to process a new job?

Also, am I right in assuming this is typically a single node setup only, as all the files are mounted on a non "share-able" XFS disk?

donatj · on Sept 25, 2023

They ask for work after they finish the previous job (or jobs, they can ask for more than one). Each worker is a single process built just for one task.

If there's no work for them there's a small timeout and they ask for more. Simple loop. It's all part of a library we built for building workers. For better or worse, it's all done over http.

You are right, though, it is one XFS volume per queue instance.

We just run multiple instances (EC2) on a load balancer. Each instance of the queue gets it's own set of workers though so the workers know the right server to report done to.

We want a way to have a single pool of workers, rather than a pool per queue instance, and have them talk to the load balancer rather than directly, but we haven't come up with a reasonable way to do that.

latchkey · on Sept 25, 2023

I like how GCP cloud tasks reverses the model. Instead of workers pinging the server asking for work, have the queue ping the worker and the worker is effectively a http endpoint. So you send a message to the server, it queues it and then pings a worker with the message.

https://cloud.google.com/tasks/docs/dual-overview

donatj · on Sept 25, 2023

Ooh, that's kind of interesting. Am I reading this right that it holds the HTTP connection open for up to thirty minutes waiting for the work to complete? That's kind of wild.

latchkey · on Sept 25, 2023

Indeed. If you're hitting AppEngine or GCP Functions, they auto scale workers up for you to manage long running tasks. Ideally though, you finish as quickly as possible by breaking the work down into more tasks. That way, you can parallelize as much as possible.

It is all configurable, but I've scaled up to hundreds of workers at a time to blast through tasks and it wasn't expensive at all.

Workers being an HTTP endpoint makes them super easy to implement and even better... write tests for.

lysecret · on Sept 25, 2023

I love Task Queues. We are using them extensively. Also, they give you deduplication for free and a lot of other nice features like delayed tasks storing tasks for up to 30 days extremely detailed rate limits etc.

latchkey · on Sept 25, 2023

GCP is really under rated in this regard.

Are there any open source implementations of Task Queues? It feels like something that has been missing for years.

lysecret · on Sept 25, 2023

Yea, this is the only thing I don't like about them, that I can't test them locally.

More generally, is there something like a "on prem cloud" which just replicates say Cloud Tasks (but also other Cloud Apis) using local compute as well as say a local db. For testing / development this would be very cool.

latchkey · on Sept 25, 2023

I implemented my tasks as cloud functions, so I just test Tasks the same way I tested functions... by calling the handler function directly.

pjmlp · on Sept 25, 2023

Just like using JMS, MSMQ or similar queues, I fail to see what is so great about it.

latchkey · on Sept 25, 2023

JMS is a specification, not an implementation.

pjmlp · on Sept 25, 2023

Playing pedantic?

latchkey · on Sept 25, 2023

Ok, if you can't see what is so great about it, then I'd suggest spending some time in the GCP documentation to figure it out.

pjmlp · on Sept 26, 2023

Likewise there are plenty of JMS implementations to play around.

latchkey · on Sept 26, 2023

Sure. Name one implementation that does exactly what Cloud Tasks does.

ahoka · on Sept 25, 2023

But this way you can lose messages, but I guess it’s fine for your use case. Having to provide redundancy is when things get usually complicated.

Karrot_Kream · on Sept 25, 2023

I'm curious what throughput are you moving? How many tasks per second on average and how long does each task take to be serviced, on average?

lijok · on Sept 25, 2023

Built very similar but on S3. Jobs have statuses, land in /jobs, indexed by status at /indexes-jobs/PENDING, etc. Scheduler polls for jobs in PENDING index, acquire lock, pass job to processor, change its status to COMPLETE or DEAD.

300~ LOC or so and fairly easy to test. Wouldn't take that approach every time, but definitely worth it when you're aiming for a simple architecture.

hknmtt · on Sept 25, 2023

Why files though, and why move them into different directories? You said billions a day. With files, the physical drive must be taking a beating. Not to mention potential issues with directory file limitations(based on OS and file system). Why not use some kvdb?

lijok · on Sept 25, 2023

As I understand (correct me if I'm wrong, it's been forever since I've worked with filesystems) - file renames are very cheap as the actual data does not get moved, simply a journal gets updated

amerine · on Sept 25, 2023

Sounds neat. Do you have the go code anywhere for folks to poke at?

donatj · on Sept 25, 2023

Nah, afraid not. I wish.

We always wanted to open source it, but we got bought out by a big and very IP protective company before we got the chance.

matja · on Sept 25, 2023

I'd just be aware that XFS has no data journaling, only metadata.

winrid · on Sept 25, 2023

What if a task fails/crashes?