
Polyglot means not having to fight marshaling overheads when integrating bespoke compute functions into SQL, or when producing input for / consuming output from queries. This could radically change the way non-experts construct complex queries, make it efficient to push more logic into the database layer, and open the door to bypassing SQL as the main interface to the DBMS altogether.
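To make that concrete, here's a tiny pyarrow sketch (the table is fabricated in-process; imagine the query engine handing it over as Arrow buffers instead of serializing each row to a wire format first):

    import pyarrow as pa
    import pyarrow.compute as pc

    # Pretend this came straight out of the query engine as Arrow; in a
    # polyglot setup the engine hands over the columnar buffers directly
    # rather than serializing row by row.
    result = pa.Table.from_pydict({
        "user_id": [1, 2, 3, 4],
        "amount": [10.0, 3.5, 7.25, 1.0],
    })

    # A "bespoke compute function" working directly on the column buffers,
    # with no per-row marshaling in either direction.
    def discounted_total(t):
        return pc.sum(pc.multiply(t["amount"], 0.9)).as_py()

    print(discounted_total(result))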

Vector processing means improved mechanical sympathy. Even for OLTP, the row-at-a-time execution model of Postgres leaves a decent chunk of performance on the table because it doesn't align with how CPU & memory architectures have evolved.
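A toy illustration of the mechanical sympathy point (pure Python, nothing to do with Postgres internals, and most of the gap here is interpreter overhead, but the principle of amortizing per-row dispatch over contiguous, cache-friendly batches is the same one vectorized executors exploit):

    import time
    import pyarrow as pa
    import pyarrow.compute as pc

    values = list(range(5_000_000))
    col = pa.array(values, type=pa.int64())

    # Row-at-a-time: one dispatch per value, poor cache and SIMD utilization.
    t0 = time.perf_counter()
    total_rows = 0
    for v in values:
        total_rows += v * 2
    t1 = time.perf_counter()

    # Vectorized: a single call over a contiguous columnar buffer.
    total_vec = pc.sum(pc.multiply(col, 2)).as_py()
    t2 = time.perf_counter()

    print(total_rows == total_vec,
          f"loop={t1 - t0:.2f}s vectorized={t2 - t1:.2f}s")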




Thanks!

Honestly, I can't envision a near future where SQL is not the main interface. Happy to see the future proving me wrong here though!

While I can buy the argument that a better data structure for communicating between processes (on the same server) could help, it's a bit difficult to wrap my mind around how Arrow will help in distributed systems (compared to any other performant data structure). Do you have any resources for understanding the value proposition in that area?

Same for vector processing: it would be great to read a bit more about optimizations that would help improve Postgres outside of pure analytical use cases.


> it's a bit difficult to wrap my mind around how Arrow will help in distributed systems

Comparing it with the role of Protobuf is perhaps easiest; there's a good FAQ entry [0] which concludes: "Arrow and Protobuf complement each other well. For example, Arrow Flight uses gRPC and Protobuf to serialize its commands, while data is serialized using the binary Arrow IPC protocol".
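To sketch how the two fit together (the server address and command below are made up, but pyarrow's Flight client is real):

    import pyarrow.flight as flight

    # Hypothetical Flight endpoint -- placeholder address and command.
    client = flight.connect("grpc://localhost:8815")

    # Control plane: descriptors, flight info and tickets are Protobuf
    # messages carried over gRPC.
    descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM trades")
    info = client.get_flight_info(descriptor)

    # Data plane: the stream itself is the binary Arrow IPC format, so
    # batches arrive already in columnar layout, no re-decoding needed.
    for endpoint in info.endpoints:
        table = client.do_get(endpoint.ticket).read_all()
        print(table.num_rows, table.schema)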

This will be increasingly significant due to the hardware trends in network & memory (and ultimately storage too) compared with CPUs. I posted about that in a comment a few days ago [1], but it's worth sharing again:

> here’s a chart comparing the throughputs of typical memory, I/O and networking technologies used in servers in 2020 against those technologies in 2023

> Everything got faster, but the relative ratios also completely flipped

> memory located remotely across a network link can now be accessed with no penalty in throughput

The graphs demonstrate it very clearly: https://blog.enfabrica.net/the-next-step-in-high-performance...

> it would be great to read a bit more about optimizations that would help improve Postgres outside of pure analytical use cases

Unfortunately I don't have a good reference on that to hand but I'll take a look around and reply again soon.

[0] https://arrow.apache.org/faq/#how-does-arrow-relate-to-proto...

[1] https://news.ycombinator.com/item?id=37365816

[2] https://www.singlestore.com/comparisons/postgresql/


Okay, so on the Postgres question, this mailing list thread is interesting: https://www.postgresql.org/message-id/8181205c-69e5-bde7-15e...

I'm no expert on Postgres, but the thread seems to suggest that the default out-of-the-box JIT is actually more efficient than the custom vectorized executor built for the PoC. That probably rules out any low-hanging optimizations based purely on vectorization for OLTP specifically, but there are undoubtedly many wider ideas that could in principle be adopted to bring OLTP performance in line with a state-of-the-art research database like Umbra (memory-first design, low-latency query compilation, adaptive execution, etc.). As usual with databases, though, if the cost estimation is off and your query plan sucks, then worrying about fine-tuning peak performance is ~irrelevant.
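If anyone wants to poke at the JIT side themselves, something like this is enough to see whether it kicks in for a given query (the connection string and table are placeholders; the JIT only engages once the planner's cost estimate exceeds jit_above_cost):

    import psycopg2

    # Placeholder DSN and query -- adjust for your own setup.
    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    cur.execute("SET jit = on")
    cur.execute("EXPLAIN (ANALYZE, VERBOSE) SELECT sum(amount) FROM payments")

    # When the JIT engages, the plan output ends with a "JIT:" section
    # showing the number of functions compiled and the time spent.
    for (line,) in cur.fetchall():
        print(line)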





