In the case of tokio, multiple actors can run on a single thread. Tokio uses a worker pool of threads equal to the number of cores on your system. So spawning a new actor will run amongs other actors. This lets us perform io operations in an actor, such as http connection, and progress other actors whilst waiting for a response.
Kameo does have a `spawn_in_thread` function for CPU bound actors if needed.
Right. It's not a very widespread use case, to be honest. You'd find that most would be N actors for M threads (where N <= M ; an Actor in itself is never shared among multiple threads [So `Send` and not `Sync`, in theory] - an inner message handler _could_ have parallel processing but that's up to the user)
I think you should assume in Kameo that every Actor's message handler is going to be CPU-bound. For example, it means that your internal message dispatch and Actor management should be on a separate loop from the User's `async fn handle`. I don't know if it's already the case, but it's an important consideration for your design.
Nice library, BTW, I think it checks all the marks and I like your design. I've tried most of them but could not find one that I liked and/or that would not have a fatal design flaw (like async_traits, ...) :)
PS : Multi-threaded tokio runtime should be the default. Nobody wants a single-threaded actor runtime. It should be in capital letters in the readme.
> Right. It's not a very widespread use case, to be honest. You'd find that most would be N actors for M threads (where N <= M
What makes you think that? Having a large number of actors per thread is by far the most important use case. The Actor model is commonly used in communication systems where there are hundreds of thousands of actors per machine (often one for every single user). In this context, Actors are typically extremely lightweight and not CPU-bound. Instead, they mostly focus on network I/O and are often idle, waiting for messages to arrive or be sent.
- 1 actor on 2 thread = you are probably doing it wrong.
As for the rest, whether or not they are used in communication systems and whether or not they are cpu-bound, consider there are and run the handle on a separate loop from the main message dispatching. Otherwise you _will_ delay messaging if handles don't await.
stomata open, light interacts with chlorophyll, water is moved and creates a tiny depression in the xylem tube, more water moves up through capillarity
Not the OP or familiar with Hatchet, but generally ZeroMQ is a bit lower down in the stack -- it's something you'd build a distributed task queue or protocol on top of, but not something you'd usually reach for if you needed one for a web service or similar unless you had very special requirements and a specific, careful design in mind.
This tool comes with more bells and whistles and presumably will be more constrained in what you can do with it, where ZeroMQ gives you the flexibility to build your own protocol. In principle they have many of the same use cases, like how you can buy ready made whipped cream or whip up your own with some heavy cream and sugar -- one approach is more constrained but works for most situations where you need some whipped cream, and the other is a lot more work and somewhat higher risk (you can over whip your cream and end up with butter), but you can do a lot more with it.
ZeroMQ is a library that implements an application layer network protocol. Hatchet is a distributed job server with durability and transaction semantics. Two completely different things at very different levels of the stack. ZeroMQ supports fan-out messaging and other messaging patterns that could maybe be used as part of a job server, but it doesn't have anything to say about durability, retries, or other concerns that job servers take care of, much less a user interface.
I used the same approach while designing a lock-free bounded broadcast log (as in a "RWLock<Vec<T>>"; a MCMP, append-only Vec). It's quite easy to do because it's bounded. However, I could not find a way to make it both unbounded and efficient.
It's very hard to beat locking when the queue needs grow. It's the performance statistics - you're going to grow more often as the app warms up, reach a steady state, and only need to grow under heavier than expected load. And in that case you probably aren't going to see performance dominated by time spent waiting to acquire a write lock.
The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.
That's why locking in unbounded queues is popular, imo. It's not actually inefficient, and the alternatives are arcane enough to avoid. No one is going to understand or trust the code anyway.
It's worth mentioning that "lock free" means "one task does not prevent other tasks from making progress" and in the case of a bounded queue, you can trivially accomplish that by busy-waiting when the queue is full. This isn't appropriate if you need consumers to receive events in the order in which they happen, but you can kind of fix that using an atomic counter (to paraphrase Mike Acton, if you have a safe counter, you have a safe queue) and time stamp events/sort on which ones are consumed.
> The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.
Yes and I do think they are fundamentally incorrect or they cannot be translated to the log use case.
I also agree that, performance-wise, the lock make little difference. In fact my old benchmarks tend to point that locks were MORE EFFICIENT than atomics on ARM64 (MBP M1), for example. It's more like a fun little exercise, and also to confirm that I'm not completely dumb and that the problem is not solvable with the simple use of 2 atomics and a counter.
I'm no expert here, but I wonder if a linked list of bounded logs would work well.
So, everyone has a pointer to the current one, and there's an atomic pointer to the next.
When the log fills up, you allocate a new one and set the next pointer to it using compare_and_swap. If someone beat you to it, you can walk the list and add yours to the end so that the allocation isn't wasted.
This way, most of the time you're using the efficient bounded log.
in the case of a bounded log, readers are expected to give an offset at which they want to perfom the read (kafka-like).
So the linked list would involve going through all the links and following end tail refs/pointers. It would make reading O(n) and that's a Nope.
However, you could imagine having a Vec index that contains a ref to all allocated inner-logs, and query the index first in order to obtain the buffers' location. That works, but then the index has to go through a lock (either a RWLock or a mutex) as the CaS operation isn't enough if we get to the end of the index and it needs to be reallocated. It's fine, and I think that's the most appropriate solution.
PS : In fact, there is a sweet spot where you'd like to have a semi-unbounded log. If your index is big enough to contain something like 4Md entries, you'd probably end-up splitting the log in several pieces for archiving and performances purposes. Loading the log (fully or partially) from disk efficiently is then more important than having a real unbounded log. Then you would not necessarily use a lock and could CaS in the index.
I am not sure! Most of the data structures I design are for embedded systems without allocators. On the desktop, I mostly defer to others.
I've used tokio's broadcast channel quite a bit before, but it is also bounded. After talking to Eliza from the tokio project, I'm fairly convinced that unbounded queues are a scary thing to have around, operationally :).
But again - this is a bit out of my actual expertise!
Not sure if it could be extended here, but I've seen a lock free hash map that supported lock free reallocation by allocating new space and moving each entry one by one, either when the entry is accessed or in a separate thread concurrently. Accessing an entry during the reallocation would check the new region first, and if not found check the old region. Entries in the old region would be marked as moved and once all entries were moved the old allocation could be freed.
For the (un)bounded logs, the whole concept reside on the fact that the log isn't going to move once allocated, and that references to an item will never be invalided until the end of the program
I'm a 5y user of Firefly-III, and was very happy with it. Unfortunately, JC5, I discovered Actual recently.
It's not that FIII is bad at was it was doing, but simply Actual was a bit faster, a bit simpler, a bit more practical to use when importing. I'm mostly using rules, categories and reporting.
> ... What are you doing ? It's definitively unusual.
No it's not, it's extremely common. `Arc<Box<...>>` or `Arc<Mutex<Box<...>>>` or similar is used all the time when you want to share a mutable reference (on the heap, in the case of Box); especially in the case of interior mutability[1]. It's pretty annoying, but I have learned to love the borrow checker (although lifetime rules still confuse me). It really does make my code extremely clear, knowing exactly what parts of every struct is shareable (Arc) & mutable (Mutex).
It's like "I want a shared, immutable reference to something recursive, or unsized". The heap part makes no sense because it's already allocated there if you use an Arc : https://doc.rust-lang.org/std/sync/struct.Arc.html
> The type Arc<T> provides shared ownership of a value of type T, allocated in the heap.
Yeah, looks like Arc-Box is kind of pointless, but isn't there some thread locality reason why people wrap `Box` in `Arc`? I remember reading something about it a while ago but maybe I'm misremembering.
`Arc<T>` places the refcounts immediately before `T` in memory. If you are desperate to have `T` and `T`'s refcounts be on different cachelines to reduce false sharing in some contrived scenario, `Arc<Box<T>>` would technically accomplish this. I think a more realistic optimization would be to pad the start of `T` however.
`Box<Box<T>>` and `Arc<Box<T>>` are thin pointers (size ≈ size_of::<usize>()) even when `Box<T>` is a fat pointer (size ≈ 2*size_of::<usize>() because `T` is `str`, `[u8]`, `dyn SomeTrait`, etc.). While various "thin" crates are typically saner alternatives for FFI or memory density (prefixing lengths/vtables in the pointed-at data instead of incurring double-indirection and double-allocation), these double boxes are a quick and dirty way of accomplishing guaranteed thin pointers with only `std` / `alloc`.
I would not call either of these use cases "extremely common" however.
I don't exactly remember the exact thing I used that's why I said "I resorted to things like.. ". Also, I used actix web, which could be the reason of slow compiling code.
At that point a regular GC is probably faster (at least from my experience of doing memory management in C++ with ref-counted smart pointers, which has a 'death-by-a-thousand-cuts' performance profile, e.g. the refcounting overhead is smeared over the whole code base and doesn't show up as obvious hotspots in the profiler).
They run in separate tasks / threads anyway and they are cpu-bound. So, why would it be necessary to make them async ?