Persistence modules for the invesdwin-context module system (github.com/subes)
48 points by gsubes on Sept 4, 2016 | 19 comments



Yes, Linux pipes suck as an IPC mechanism. I'd like to see a benchmark using QNX message passing on the same hardware.


Doesn't Linux have something similar? System V messages, or even Unix datagram sockets using named paths for queue names?


System V messages weren't bad, but they aren't used much.

The real trick is tight IPC and CPU scheduling integration. You want a send from process A to process B to result in an immediate transfer of control from process A to process B, preferably on the same CPU. The data you just sent is in the CPU's cache. QNX is one of the few OSs where somebody thought about this.

With unidirectional or pipe-like IPC, the sender sends, which unblocks the receiver, but the sender doesn't block. So the OS can't just toss control to the receiver. The receiver goes on the ready-to-run list and, quite likely, another CPU starts running it. Meanwhile, the sending process runs for a short while longer and then typically blocks reading from some reply pipe/queue. It takes two extra trips through the scheduler that way. Worse, if the CPU is busy, sending a message can put you at the end of the line for CPU time, which makes for awful IPC latency under load.

It's one of the classic mistakes in microkernel design.
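
To make the round trip concrete, here is a minimal, hypothetical sketch in Java of the send-then-block-on-reply pattern over two named pipes (the FIFO paths and framing are made up; the point is where the caller actually blocks):

  // Hypothetical client side of a request/reply exchange over two pre-created
  // FIFOs (mkfifo /tmp/request.fifo /tmp/response.fifo). The write itself does
  // not block the sender once the pipe is open; the extra scheduler round trips
  // described above happen because the sender only blocks later, on the reply.
  import java.io.DataInputStream;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;

  public class PipeClient {
      public static void main(String[] args) throws IOException {
          try (OutputStream request = new FileOutputStream("/tmp/request.fifo");
               DataInputStream response = new DataInputStream(
                       new FileInputStream("/tmp/response.fifo"))) {
              request.write("GET ticks\n".getBytes("UTF-8"));
              request.flush();                    // sender keeps running after this
              int length = response.readInt();    // ...and only blocks here
              byte[] payload = new byte[length];
              response.readFully(payload);        // reply arrives via the second FIFO
          }
      }
  }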


> With unidirectional or pipe-like IPC, the sender sends, which unblocks the receiver, but the sender doesn't block. So the OS can't just toss control to the receiver.

Interesting. I remember looking at the API, but I just didn't have enough experience or context at the time to dig deeper and answer those questions. I stayed away from message queues and opted for shared memory, mostly because they seemed obscure and I was afraid I would hit some corner-case bug and be stuck debugging low-level kernel code on my own.

> You want a send from process A to process B to result in an immediate transfer of control from process A to process B, preferably on the same CPU.

I can see a message-passing-centric system having some specific optimizations in the scheduler. Say, once a few messages have been sent, a DAG forms of which senders send to which receivers. Sorting that DAG topologically might be interesting, and then making scheduling decisions based on it. That is, if sender1 sends a message to receiver1 and receiver1 then sends to receiver2, maybe it is more efficient to run them in that order -- sender1, receiver1, receiver2.

I saw that done in a realtime system that processed low-latency data. The graph was static, but this sorting trick sometimes allowed processing data with a latency of only one frame.
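
A toy sketch in Java of that ordering idea (not from any real scheduler): build the static sender-to-receiver graph and topologically sort it with Kahn's algorithm to get a run order such as sender1, receiver1, receiver2.

  import java.util.ArrayDeque;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Deque;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class PipelineOrder {
      // Returns a run order for a static "who sends to whom" DAG (Kahn's algorithm).
      static List<String> topoOrder(Map<String, List<String>> sendsTo) {
          Map<String, Integer> indegree = new HashMap<>();
          for (String sender : sendsTo.keySet()) {
              indegree.putIfAbsent(sender, 0);
          }
          for (List<String> receivers : sendsTo.values()) {
              for (String receiver : receivers) {
                  indegree.merge(receiver, 1, Integer::sum);
              }
          }
          Deque<String> ready = new ArrayDeque<>();
          indegree.forEach((node, degree) -> { if (degree == 0) ready.add(node); });
          List<String> order = new ArrayList<>();
          while (!ready.isEmpty()) {
              String node = ready.poll();
              order.add(node);
              for (String receiver : sendsTo.getOrDefault(node, new ArrayList<>())) {
                  if (indegree.merge(receiver, -1, Integer::sum) == 0) {
                      ready.add(receiver);
                  }
              }
          }
          return order;
      }

      public static void main(String[] args) {
          Map<String, List<String>> graph = new HashMap<>();
          graph.put("sender1", Arrays.asList("receiver1"));
          graph.put("receiver1", Arrays.asList("receiver2"));
          System.out.println(topoOrder(graph)); // [sender1, receiver1, receiver2]
      }
  }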


There's a discussion of this in this old QNX design document.[1] See the IPC section. Integration between scheduling and message passing is essential if you're making lots of little IPC calls. When control passes from one process to another via an IPC call, the receiving process temporarily inherits the priority and remaining CPU quantum of the caller. This makes it work like a subroutine call for scheduling purposes - nobody goes to the back of the CPU queue. So IPC calls don't incur a scheduling penalty.

[1] http://www.qnx.com/developers/docs/6.3.0SP3/neutrino/sys_arc...


Solaris has a mechanism called Doors that works like this. It's primarily used by gethostbyname()/getaddrinfo() to make a synchronous process-crossing RPC to nscd, the name service cache daemon.


TL;DR: If I understood it correctly, it describes the performance differences when one thread does the JNI calls into LevelDB while other threads serialize requests and hand them over via different IPC mechanisms.

I would like to see where the time is spent, e.g. whether pipe communication is slow because of small requests, because of how serialization is implemented (e.g. time spent on spin locks and mutexes), etc.


Our tests showed that even with larger messages (100k price ticks per request) pipes were still an order of magnitude slower.


We updated the title from “Memory Mapping 15x faster than Named Pipes and 2x faster than Queue (Java)”, which was referring to these benchmarks from the project page:

  ArrayDeque (synced)       Records:   127.26/ms  in  78579 ms    => ~50% slower than Named Pipes
  Named Pipes on TMPFS      Records:   263.80/ms  in  37908 ms    => why ~5% slower on TMPFS?
  Named Pipes               Records:   281.15/ms  in  35568 ms    => using this as baseline
  SynchronousQueue (fair)   Records:   924.90/ms  in  10812 ms    => ~3 times faster than Named Pipes
  LinkedBlockingQueue       Records:  1988.47/ms  in   5029 ms    => ~7 times faster than Named Pipes
  Mapped Memory             Records:  3214.40/ms  in   3111 ms    => ~11 times faster than Named Pipes
  Mapped Memory on TMPFS    Records:  4237.29/ms  in   2360 ms    => ~15 times faster than Named Pipes
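
For reference, here is a rough sketch (not the project's actual implementation) of what a memory-mapped transport like the "Mapped Memory on TMPFS" case boils down to in Java; the file name and framing below are made up:

  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class MappedWriter {
      public static void main(String[] args) throws Exception {
          try (RandomAccessFile file = new RandomAccessFile("/dev/shm/ticks.ipc", "rw");
               FileChannel channel = file.getChannel()) {
              MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
              // Made-up framing: a write index at offset 0, record bytes after it.
              // A reader process maps the same file and polls the index for new data.
              byte[] record = "tick:123.45".getBytes("UTF-8");
              buffer.position(8);
              buffer.put(record);
              buffer.putLong(0, 8 + record.length); // publish the new write index last
              // A real transport also needs memory-ordering guarantees, wrap-around
              // handling and a handshake with the reader; this only illustrates the mechanism.
          }
      }
  }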


Why these title changes?!? Does one really have to write separate blog posts with the same title so they don't get changed?


From the guidelines:

> Otherwise please use the original title, unless it is misleading or linkbait.

The long-standing policy is to represent the submitted content as accurately as possible and let readers pick out what's interesting to them, not what the submitter found interesting. A comment in the thread is a fine place to call such things out if a blog post is overkill.


Shared memory is fast but dangerous if not done right; there are lots of opportunities for data races. I used it for low-latency data access, and it took an inordinate amount of time to debug. With the hardware at the time there was no alternative given the constraints.

Two other things that would be fun to benchmark are Unix sockets and System V IPC messages (does anyone use those? probably the most obscure IPC around these days). Hmm, maybe some of those are already used behind the scenes by some of the Java methods described.


Where does the speed advantage come from? Is it because LinkedBlockingQueue doesn't spin-wait before giving up and rescheduling the consumer's thread?


I am wondering about this myself; even SynchronousQueue (which the spin wait was taken from) is slower than Mapped Memory. ASpinWait spins an order of magnitude longer before sleeping, though, so maybe that is the difference. Or the fact that Memory Mapping goes off-heap and instantiates no objects during the transfer.
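
Not the actual ASpinWait code, but the general pattern being described, sketched in Java: busy-spin on the condition for a while before falling back to parking, so the common case never touches the scheduler (the spin and park durations are made-up tuning values).

  import java.util.concurrent.locks.LockSupport;
  import java.util.function.BooleanSupplier;

  public final class SpinThenPark {
      // Waits until the condition becomes true, spinning first and parking afterwards.
      public static void awaitTrue(BooleanSupplier condition) {
          final int maxSpins = 10_000; // assumed value, not taken from the library
          for (int i = 0; i < maxSpins; i++) {
              if (condition.getAsBoolean()) {
                  return; // fast path: the producer caught up while we were spinning
              }
          }
          while (!condition.getAsBoolean()) {
              LockSupport.parkNanos(1_000L); // slow path: give up the CPU in short naps
          }
      }
  }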


I tried to figure out how to do memory mapping with Boost once... It was complicated. I wish there were a good cross-platform solution that was clear and easy to use.


Are you talking about just mapping a file into memory? Because there are only two function calls you need to do something like that in POSIX:

open(), and then mmap().

If you're talking about POSIX shared memory, you can do that with shm_open(). The only thing you have to do is have both processes use the same name for the shared memory area. Additionally, you can use POSIX named semaphores as a synchronization primitive.

It's pretty easy to wrap these functions up in a C++ class. You could conceivably share an entire C++ class between two processes using these primitives.


"You could conceivably share an entire C++ class between two processes using these primitives."

Not really. Maybe just memory-mapping a plain class/struct without virtual function tables, etc. (pointers are not necessarily valid from process to process).


The crazy thing is that it isn't that difficult to do. Just like with thread-local allocation, the easiest way seems to be to figure out the system calls yourself and wrap them up. All the solutions out there seem WAY overcomplicated.

The best solution I've found is whitedb, although it does leave something to be desired. The biggest flaw is that it is GPL, which is not a good license for something meant to be included as source. It also isn't thread safe without locking everything.


How about Qt's QFile memory-mapping functionality?



