
It's kind of funny how specific people's technology experience is becoming. He considers separate processes unworkable, but it wouldn't bother me in the slightest. Most code I've ever written was web server code, with the only serious runtime state living in the DB, so the processes would never talk to each other anyway, just to the DB. Then he claims this is a whole new set of problems and technology, but it sounds very similar to what I see whenever I work on an SOA setup. There, your technology either supports a transaction context across web service calls or it doesn't. When it does, the web service calls made during your transaction can be rolled back when you roll back; when it doesn't, you need specific compensating calls, or you need to buffer calls until the end, or break things up into reversible parts. That doesn't sound so different from his concerns about calling out of the process into the system and having to buffer if the system doesn't have a transaction context.
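Concretely, the buffering variant looks something like this (a minimal sketch; the class and method names are made up, not from any particular SOA stack):

    # Buffer outbound service calls until the local transaction commits,
    # so nothing external happens if we roll back.
    class BufferedCalls(object):
        def __init__(self):
            self.pending = []

        def call_later(self, func, *args):
            # Record the call instead of performing it immediately.
            self.pending.append((func, args))

        def commit(self):
            # Local transaction committed: replay the buffered calls.
            for func, args in self.pending:
                func(*args)
            self.pending = []

        def rollback(self):
            # Local transaction rolled back: drop the buffered calls.
            self.pending = []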



Separate processes are completely unworkable for the kind of work Armin has in mind. Web servers are very specific: there is very little data shared between processes, because the web is mostly stateless and the entire state can (typically) be encoded in an SQL database. However anything that requires some complex interprocess communication is essentially unworkable. You need to serialize and deserialize your data into wire-level protocols, which is often prohibitively expensive, even when you can get it to work.
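To make the cost concrete, here is a minimal sketch of the round trip every shared object pays on each hop, using pickle as a stand-in for whatever wire format you pick:

    import pickle

    # A modest object graph; real workloads are larger and cross the
    # process boundary far more often.
    graph = {'nodes': list(range(1000)),
             'edges': [(i, i + 1) for i in range(999)]}

    wire = pickle.dumps(graph)   # serialize before every send
    copy = pickle.loads(wire)    # deserialize on every receive; 'copy' is
                                 # a new object, not shared with the sender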


Honest question: what kind of work is that?

The more I think of it, the more I think that "high concurrency with local state on one machine" is a use case that is not going to get more popular in the future.

I believe a big part of the uptick in interest in concurrency is an uptick in the use of distributed systems. With distributed systems, all your authoritative state is stored outside the process anyway (at least if you want it to be fault tolerant, which you do if you need it to be of any significant size).

And of course, once you have more than one machine you must deal with serialization anyway. Also, I'm not sure what kind of work there is where "serialization is prohibitively expensive" but Python is still a natural choice.


Game servers are never going to be anything other than "high concurrency with local state on one machine", due to their intersecting requirements of low latency, authority, high computational demands, and relative indifference to uptime.

Relevantly, CCP say the idea of EVE is "incontheivable" without Stackless Python[0], which doesn't have STM, but does go in a similar direction (co-operative multitasking) to solve the same problem.

[0] http://www.slideshare.net/Arbow/stackless-python-in-eve


Well, it's not entirely true. If distributed systems were the future, we would never have seen multi-core systems at all; there are still tasks that are much easier to do if you have them done on a single shared-memory system than distributed. I don't think they're necessarily less complex, they just don't deal with any sort of "big data". PyPy's translation toolchain is one of those problems, but I can see a lot of situations where a large, mostly-read-only data set needs to be accessible all the time.

regarding "Python is still a natural choice" - don't confuse python as a language and python (CPython) as interpreter. You would not use CPython for any performance-critical tasks probably, at least not without spending significant time rewriting pieces to C, but PyPy is quite usable in some scenarios and the list is only to grow.


> there are still tasks that are much easier to do if you have them done on a single shared-memory system than distributed.

Such as...? A lot of the big users of Cray-type systems were scientific. AFAIK a lot of them are seriously looking at cloud or commodity-type clusters as the problems get bigger.

Anyway I am curious if PyPy is aiming at some specific problem that I don't know about. For "web stuff", I think what is proposed is perhaps overly complicated.

If you want a large read-only set of data to fit on one machine and need high performance, Java, Python and the like aren't great choices because they don't give you much control over the memory layout. Python is probably better because you could write a C extension. But PyPy itself is not optimized for memory size (in fact I think it uses more memory to get speed).
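To illustrate the layout point, compare a boxed list of Python floats with a packed array (a rough sketch, not a benchmark):

    import array
    import sys

    boxed = [float(i) for i in range(1000)]   # list of pointers to float objects
    packed = array.array('d', boxed)          # one contiguous block of C doubles

    # getsizeof on the list counts only the pointer array, not the
    # 1000 separately-allocated float objects it points to.
    print(sys.getsizeof(boxed), sys.getsizeof(packed))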


Well, imagine the database server itself. You want it to be able to use multiple cores. If the database server is multi-process, you want to be able to access the entire data set regardless of which process is handling your client connection.

Of course this problem becomes moot if the database supports auto-sharding over multiple machines, but SQL databases generally have bad support for sharding.
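A minimal sketch of that single-machine layout, assuming the data lives in one file (the path is hypothetical): every worker process maps the same file, so the OS gives them all views of the same physical pages and any process can serve any connection:

    import mmap
    import os

    fd = os.open('/var/db/datafile', os.O_RDWR)   # hypothetical data file
    size = os.fstat(fd).st_size
    pages = mmap.mmap(fd, size, mmap.MAP_SHARED)  # shared across processes
    os.close(fd)                                  # the mapping outlives the fd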


sessions, push notifications, XMPP


In the robotics domain you will have dozens of processes on 10+ machines interacting to accomplish very complex tasks: obstacle tracking, terrain classification, motion planning. For most of the data flows, serialization and comms overhead is minimal; this is especially true with efficient binary encodings (faster than Protobuf or JSON). For bulk data flows there are optimizations you can use to limit deserialization to once per machine. It is also entirely practical to split up workloads too large for one machine and aggregate the results.


> However anything that requires some complex interprocess communication is essentially unworkable.

I completely disagree. Unless you mean something very special by "complex".

IPC, even complex IPC, is definitely workable. For example, most web browsers today have a split process model. That's as real-world an example as you can have, and it's definitely complex. It works great though.

Furthermore, all this is is the no-shared-state/message-passing model (that the communicating entities are in separate processes rather than threads is not fundamental here). That's a very clean model of parallelism, much better than shared state. Lots of languages use the no-shared-state/message-passing model; it's definitely not "essentially unworkable."


"It works great" can I have your browser? It works great for doing completely independent tasks (like rendering unrelated websites) and this is as far as it goes. I would not call it "complex interprocess communication" since there is almost none communication involved - tasks are almost entirely parallelizable. In fact you would be as good just running two separate browsers for most of the time.


I am talking here about the communication between the parent process and the child processes, not between child processes (which by design should be isolated). With the parent there is actually a lot of communication and coordination. For example, graphics buffers are stored in shared memory (to avoid huge copies), caches of many forms need to be synchronized, etc. It takes a lot of work to make a multiprocess browser because of the complex communication, but this is a standard thing nowadays.

Are you really saying that the actor model and in general message passing/no shared state is "essentially unworkable"?


Well, maybe we have slightly different definitions of complex. Browser interprocess communication is IMO relatively simple and/or not performance-critical; all the more so because those structures must already have a serializable format, since they're persisted.

Sometimes you deal with something that is more time-critical, where serialization of objects (like a complex graph of objects) would kill a lot of the performance benefits. I'm sure it takes a lot of effort to make a browser (not to mention a multiprocess one), but a lot of that complexity is not because the task at hand is hard -- quite the opposite, it's just lots and lots of easy tasks crammed together in a big pile, usually a C++ pile with tons of history on top of it.


Fair enough, we just mean different things by "complex" I guess. Obviously the message-passing model isn't good for everything. It is good for a lot of real-world stuff though, which is why I was surprised by your statement before.

Note that serialization is not necessary in this model: for example, you can have actors in the same process and transfer ownership of objects. There is no shared state, and the objects are logically passed as messages, but there is no copying cost in doing so and no need to implement serialization.
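A minimal sketch of that, with two actors as threads in one process; the dict is handed over by reference, never copied or serialized, and by convention the sender stops touching it after the send:

    import threading
    import queue

    mailbox = queue.Queue()

    def producer():
        msg = {'payload': list(range(100000))}
        mailbox.put(msg)   # transfers the reference; no copy, no pickling
        # Ownership convention: don't touch msg after this point.

    def consumer():
        msg = mailbox.get()        # the very same object the producer built
        print(len(msg['payload']))

    for t in [threading.Thread(target=producer),
              threading.Thread(target=consumer)]:
        t.start()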


> However anything that requires some complex interprocess communication is essentially unworkable. You need to serialize and deserialize your data into wire-level protocols, which is often prohibitively expensive, even when you can get it to work.

Would you post more info about previous attempts at this? I was about to create precisely that. Literally today.

But if it's fundamentally unworkable...


What's fundamentally unworkable -- serialization?? There are plenty of options in Python: pickle, JSON (no graphs), or Protocol Buffers/Thrift if you want something language-agnostic.

Having a serialization layer in distributed applications is a pretty vanilla requirement... I am also confused about the "unworkable" comment.

The "multiprocessing" module (which I have heard some not-so-good things about) automatically serializes Python objects over byte streams using pickle or some such thing.


Serialization is very expensive compared to shared data structures. Even if not everybody is affected by that, it still makes serialization unsuitable as a general solution in this case.


I've been trying to tackle precisely that problem.

Would someone please help me out and point me at a way of efficiently exposing a read-only block of memory to Python?

EDIT: Looks like "memoryview", introduced in 2.7, is the answer.


2.7 introduced memoryview, and there are a slew of C API routines to interact with it ...

as for shared memory, have you tried just mmap'ing the node from /dev/shm ?
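Something along these lines, assuming a producer has already created the segment (/dev/shm/mydata is a made-up name):

    import mmap
    import os

    fd = os.open('/dev/shm/mydata', os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ)
    view = memoryview(buf)   # zero-copy, read-only access from Python
    os.close(fd)             # the mapping stays valid after closing the fd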


Aha! That sounds like just the ticket. I hope it's easy to use. A million thank-you's, and a Salami.

EDIT: I just saw your edit. That's what I'm doing --- shm_open(), ftruncate(), then mmap(). A little annoying that it doesn't automatically kill the shared memory when the program terminates, but still...
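One workaround, since on Linux the segment is just a file under /dev/shm: have the creating process unlink it on exit (a sketch; the path is made up):

    import atexit
    import os

    SHM_PATH = '/dev/shm/mydata'   # hypothetical segment name

    def cleanup():
        try:
            os.unlink(SHM_PATH)    # same effect as shm_unlink() on Linux
        except OSError:
            pass

    atexit.register(cleanup)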


It's pretty slick, and I've had good results using it in conjunction with a C program that generates data.


I'd advise specifically against using Protobuf -- Google's support for Python seems to be pretty bad.


I've worked with protobufs in Python a bit, although not in production, and it seems fine. What trouble have you run into?


Well, the Protobuf PyPI package (which is two versions out of date) has not been installable via pip for over three years now. And yes, this is a known bug.


MessagePack looks quite interesting (http://msgpack.org/) and seems to have quite a lot of bindings (I don't know the support level of each one, though).
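The Python binding is straightforward, for what it's worth (a quick sketch with the msgpack-python package):

    import msgpack

    data = {'user': 'alice', 'scores': [1, 2, 3]}
    packed = msgpack.packb(data)        # compact binary encoding
    restored = msgpack.unpackb(packed)  # back to Python objects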



