Andrew knows this stuff better than I do, but one way to start thinking about this is to look at what the sources of latency are, where your data gets copied, and what controls allocation of storage (disk/RAM/etc). For example, mmapped disk IO cedes control of the lifetime of your data in RAM to the OS's paging code. JSON over HTTP over TCP is going to make a whole pile of copies of your data as the TCP payloads get reassembled, copied into user space, and then probably again as that gets deserialized (and oh, the allocations in most JSON decoders). As for latency, you're going to have some context switches in there, which doesn't help. One way you might be able to improve performance is to use shared memory to get data in and out of the workers and process it in place as much as possible. A big ring buffer and one of the fancy in-place serialization formats (Cap'n Proto, for example) could actually make it pretty pleasant to write clients in a variety of languages.
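Roughly the shape I have in mind, as a minimal Go sketch (not Pachyderm code): a single-producer/single-consumer ring buffer over a shared-memory segment, where the consumer reads each record in place instead of copying it through sockets and a JSON decoder. The segment path, sizes, and framing below are made up for illustration, and wrap-around/back-pressure handling is left out to keep it short.

```go
// Minimal sketch, assuming Linux: a shared-memory ring buffer that two
// processes can map, so records are written once and read in place.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"sync/atomic"
	"syscall"
	"unsafe"
)

const (
	hdrBytes  = 16                   // two uint64 cursors: bytes written, bytes read
	dataBytes = 1 << 20              // 1 MiB payload area (illustrative)
	shmPath   = "/dev/shm/demo-ring" // hypothetical segment name
)

type ring struct {
	mem        []byte  // the whole mapping: header followed by data
	wpos, rpos *uint64 // cursors live inside the mapping, shared across processes
	data       []byte
}

func openRing() (*ring, error) {
	f, err := os.OpenFile(shmPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if err := f.Truncate(hdrBytes + dataBytes); err != nil {
		return nil, err
	}
	mem, err := syscall.Mmap(int(f.Fd()), 0, hdrBytes+dataBytes,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	return &ring{
		mem:  mem,
		wpos: (*uint64)(unsafe.Pointer(&mem[0])),
		rpos: (*uint64)(unsafe.Pointer(&mem[8])),
		data: mem[hdrBytes:],
	}, nil
}

// produce appends one length-prefixed record; wrap-around is omitted here.
func (r *ring) produce(msg []byte) {
	w := atomic.LoadUint64(r.wpos)
	binary.LittleEndian.PutUint32(r.data[w:], uint32(len(msg)))
	copy(r.data[w+4:], msg)
	// Publish the new write cursor only after the payload is in place.
	atomic.StoreUint64(r.wpos, w+4+uint64(len(msg)))
}

// consume returns a slice that aliases shared memory: process it in place,
// don't hold onto it past the next wrap.
func (r *ring) consume() ([]byte, bool) {
	rd := atomic.LoadUint64(r.rpos)
	if rd == atomic.LoadUint64(r.wpos) {
		return nil, false // nothing new
	}
	n := binary.LittleEndian.Uint32(r.data[rd:])
	msg := r.data[rd+4 : rd+4+uint64(n)]
	atomic.StoreUint64(r.rpos, rd+4+uint64(n))
	return msg, true
}

func main() {
	r, err := openRing()
	if err != nil {
		panic(err)
	}
	r.produce([]byte("hello from the producer"))
	if msg, ok := r.consume(); ok {
		fmt.Printf("consumed in place: %s\n", msg)
	}
}
```

With an in-place format like Cap'n Proto on top, the consumer could read fields straight out of that slice without a deserialization pass or per-message allocations.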
Thanks for the tips :D. These all seem like very likely places to look for low-hanging fruit. We were actually early employees at RethinkDB, so we've been around the block with low-level optimizations like this. I'm really looking forward to getting Pachyderm into a benchmark environment and tearing it to shreds like we did at Rethink.