PyPy: Transactional Memory (II) (morepypy.blogspot.com)
122 points by megaman821 on Jan 14, 2012 | 39 comments



It's kind of funny how specific people's technology experience is becoming. He considers separate processes unworkable, but it wouldn't bother me in the slightest. Most code I've ever written was web server code where the only serious runtime state changes happened in the DB anyway, so the processes would never talk to each other, just to the DB. Then he claims this is a whole new set of problems and technology, but it sounds very similar to whenever I work on an SOA setup. There, your technology either supports transaction context across web service calls or it doesn't. When it does, your web service calls can be rolled back when you roll back; when it doesn't, you need specific compensating calls, or you need to buffer until the end, or break things up into reversible parts. That doesn't sound that different from his concerns about calling out of the process into the system and having to buffer if the system doesn't have a transaction context.


Separate processes are completely unworkable for the kind of work Armin has in mind. Web servers are a very specific case: there is very little data shared between processes, because the web is mostly stateless and the entire state can (typically) be kept in a SQL database. However, anything that requires complex interprocess communication is essentially unworkable. You need to serialize and deserialize your data into wire-level protocols, which is often prohibitively expensive, even when you can get it to work.


Honest question: what kind of work is that?

The more I think of it, the more I think that "high concurrency with local state on one machine" is a use case that is not going to get more popular in the future.

I believe a big part of the uptick in interest in concurrency is an uptick in the use of distributed systems. With distributed systems, all your authoritative state is stored outside the process anyway (at least if you want it to be fault tolerant, which you do if you need it to be of any significant size).

And of course once you have more than one machine you must deal with serialization. Also, I'm not sure what kind of work there is where "serialization is prohibitively expensive" but Python is still a natural choice.


Game servers are never going to be anything other than "high concurrency with local state on one machine", due to their intersecting requirements of low latency, authority, high computational demands, and relative indifference to uptime.

Relevantly, CCP say the idea of EVE is "inconceivable" without Stackless Python[0], which doesn't have STM, but does go in a similar direction (co-operative multitasking) to solve the same problem.

[0] http://www.slideshare.net/Arbow/stackless-python-in-eve


Well, it's not entirely true: if distributed systems were the whole future, we would never have seen multi-core systems at all. There are still tasks that are much easier to do on a single shared-memory system than distributed. I don't think they're necessarily less complex, they just don't deal with any sort of "big data". PyPy's translation toolchain is one of those problems, but I can see a lot of situations where a large, mostly-read-only data set needs to be accessible all the time.

regarding "Python is still a natural choice" - don't confuse python as a language and python (CPython) as interpreter. You would not use CPython for any performance-critical tasks probably, at least not without spending significant time rewriting pieces to C, but PyPy is quite usable in some scenarios and the list is only to grow.


> There are still tasks that are much easier to do on a single shared-memory system than distributed.

Such as...? A lot of the big users of Cray-type systems were scientific. AFAIK a lot of them are seriously looking at cloud or commodity-type clusters as the problems get bigger.

Anyway I am curious if PyPy is aiming at some specific problem that I don't know about. For "web stuff", I think what is proposed is perhaps overly complicated.

If you want a large read-only set of data to fit on one machine and need high performance, Java, Python and the like aren't great choices because they don't give you much control over the memory layout. Python is probably better because you could write a C extension. But PyPy itself is not optimized for memory size (in fact I think it uses more memory to get speed).


Well, imagine the database server itself. You want it to be able to use multiple cores. If the database server is multi-process, you want to be able to access the entire data set regardless of which process is handling your client connection.

Of course this problem becomes moot if the database supports auto-sharding over multiple machines, but SQL databases generally have bad support for sharding.


sessions, push notifications, XMPP


In the robotics domain you will have dozens of processes on 10+ machines interacting to accomplish very complex tasks: obstacle tracking, terrain classification, and motion planning. For most of the data flows, serialization and comms overhead is minimal; this is especially true with efficient binary encodings (faster than protobuf or JSON). For bulk data flows there are optimizations you can use to limit deserialization to once per machine. It is also entirely practical to split up workloads too large for one machine and aggregate the results.


> However anything that requires some complex interprocess communication is essentially unworkable.

I completely disagree. Unless you mean something very special by "complex".

IPC, even complex IPC, is definitely workable. For example, most web browsers today have a split process model. That's as real-world an example as you can have, and it's definitely complex. It works great though.

Furthermore, all that this is, is the model of no shared state/message passing (that the talking entities are in separate processes and not threads is not fundamental here). That's a very clean model of parallelism, much better than shared state. Lots of languages use the no shared state/message passing model, it's definitely not "essentially unworkable."


"It works great" can I have your browser? It works great for doing completely independent tasks (like rendering unrelated websites) and this is as far as it goes. I would not call it "complex interprocess communication" since there is almost none communication involved - tasks are almost entirely parallelizable. In fact you would be as good just running two separate browsers for most of the time.


I am talking here of the communication between the parent process and the child process, not between child processes (which by design should be isolated). For those processes, actually there is a lot of communication and coordination. For example, graphics buffers are stored in shared memory (to avoid huge copies), caches of many forms need to be synchronized, etc. It takes a lot of work to make a multiprocess browser because of the complex communication, but this is a standard thing nowadays.

Are you really saying that the actor model and in general message passing/no shared state is "essentially unworkable"?


Well, maybe we have slightly different definitions of complex. Browser interprocess communication is IMO relatively simple and/or not performance critical, especially since all those structures must already have a serializable format because they're persisted anyway.

Sometimes you deal with something that is more time-critical, where serialization of objects (like a complex graph of objects) would kill a lot of the performance benefits. I'm sure it takes a lot of effort to make a browser (not to mention a multiprocess one), but a lot of that complexity is not because the task at hand is hard; quite the opposite, it's lots and lots of easy tasks crammed together in a big pile, usually a C++ pile with tons of history on top of it.


Fair enough, we just mean different things by "complex" I guess. Obviously the message-passing model isn't good for everything. It is good for a lot of real-world stuff though, which is why I was surprised by your statement before.

Note that serialization is not necessary in this model, for example you can have actors in the same process, and transfer ownership of objects. So there is no shared state, and they are logically passed as messages, but there is no copying cost to doing so or need to implement serialization.
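
For illustration, a minimal Python sketch of that in-process ownership-transfer idea (the names here are made up): the sender pushes a reference onto a queue and drops its own, so nothing is copied or serialized.

    import queue
    import threading

    # In-process "message passing": the producer hands a reference to the
    # consumer via a queue and drops its own, so the object is never copied
    # or serialized. Ownership transfer is purely by convention here; Python
    # won't stop the sender from keeping an alias.
    mailbox = queue.Queue()

    def producer():
        big_object = {"pixels": bytearray(10 * 1024 * 1024)}
        mailbox.put(big_object)   # hand over the reference
        del big_object            # give up ownership (by convention)

    def consumer():
        msg = mailbox.get()       # same object, zero-copy
        msg["pixels"][0] = 255

    threading.Thread(target=producer).start()
    threading.Thread(target=consumer).start()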


> However anything that requires some complex interprocess communication is essentially unworkable. You need to serialize and deserialize your data into wire-level protocols, which is often prohibitively expensive, even when you can get it to work.

Would you post more info about previous attempts at this? I was about to create precisely that. Literally today.

But if it's fundamentally unworkable...


What's fundamentally unworkable, serialization? There are plenty of options in Python: pickle, JSON (no object graphs), or Protocol Buffers/Thrift if you want something language-agnostic.
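
A quick sketch of two of those options round-tripping the same record, just to show the trade-off between them:

    import json
    import pickle

    record = {"id": 42, "tags": ["a", "b"], "nested": {"x": 1.5}}

    # pickle: Python-only, but handles most object graphs (including cycles)
    blob = pickle.dumps(record)
    assert pickle.loads(blob) == record

    # JSON: language-agnostic, but limited to trees of basic types
    text = json.dumps(record)
    assert json.loads(text) == record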

Having a serialization layer in distributed applications is a pretty vanilla requirement... I am also confused about the "unworkable" comment.

The "multiprocessing" module (which I have heard some not-so-good things about) automatically serializes Python objects over byte streams using pickle or some such thing.


Serialization is very expensive compared to shared data structures. Even if not everybody is affected by that, it still makes serialization unsuitable as a general solution in this case.


I've been trying to tackle precisely that problem.

Would someone please help me out and point me at a way of efficiently exposing a read-only block of memory to Python?

EDIT: Looks like memoryview, introduced in 2.7, is the answer.


2.7 introduced memoryview, and there are a slew of C API routines to interact with it ...

as for shared memory, have you tried just mmap'ing the node from /dev/shm ?
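
For what it's worth, a minimal sketch of the reader side (the path and layout are made up): open the file under /dev/shm, mmap it read-only, and wrap it in a memoryview so slices don't copy.

    import mmap

    # /dev/shm is tmpfs on Linux; a writer process is assumed to have
    # created and filled this file already.
    with open("/dev/shm/mydata", "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    view = memoryview(buf)      # zero-copy, read-only view of the mapping
    header = bytes(view[:16])   # copying only happens when you ask for bytes()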


Aha! That sounds like just the ticket. I hope it's easy to use. A million thank-you's, and a Salami.

EDIT: I just saw your edit. That's what I'm doing --- shm_open(), ftruncate(), then mmap(). A little annoying that it doesn't automatically kill the shared memory when the program terminates, but still...


It's pretty slick, and I've had good results using it in conjunction with a C program that generates data.


I'd advise specifically against using Protobuf -- Google's support for Python seems to be pretty bad.


I've worked with protobufs in Python a bit, although not in production, and it seems fine. What trouble have you run into?


Well, the Protobuf PyPI package (which is two versions out of date) has not been installable with pip for over 3 years now. And yes, this is a known bug.


MessagePack looks quite interesting (http://msgpack.org/) and seems to have quite a lot of bindings (I don't know the support level of each one, though).
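
The Python binding at least is straightforward; a tiny sketch (assuming the msgpack package is installed, with invented payload data):

    import msgpack

    payload = {"sensor": "lidar", "ranges": [1.2, 1.3, 0.9]}
    packed = msgpack.packb(payload)                 # compact binary encoding
    assert msgpack.unpackb(packed, raw=False) == payload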


What surprised me most was discovering that Intel's upcoming Haswell microarchitecture may include Hardware Transactional Memory. That's pretty exciting!


I hadn't heard that, thanks. That's kind of a full-circle moment then, given that transactional memory started out as a hardware proposal!


As I understand it, they seem to say TM failed because it was positioned as an alternative to threads and events, so they propose to do events+TM.

That is, one big problem with TM is I/O, like doing network reads and writes. But if you are already writing event-driven programs, you return to the event loop whenever you do network I/O, and at that point the TM implementation can do smart things. If TM is sold as "write threads without locks" this can't work, but if TM is "write events but with more parallelism" this seems possible.

Actually I am quite surprised that this model was not heavily investigated already.


TM does work well in Haskell, though, where constraining IO effects during a transaction is simple.


I've done similar in a personal project before: a multiplayer text game (a MUD). In general, in these games, the actions you and creatures perform only interact with things that are locationally close to them. Each location belongs to an "area" - so each "area" had a thread to execute actions. There were also world "actions" that had to execute in isolation, but those were much rarer.

That said, a lot of people seem to dislike threads. The problem isn't with threads, but rather having to shoehorn every attempt to parallelize your program into a threaded model. I would feel similar ire if I had to shoehorn every attempt to parallelize my program into an event model.

Threads are a way you can reason about and simplify your program. I've written programs where threads make the program easier to reason about rather than harder. Because of this, telling me your method is better than threads makes no sense to me - for what class of problems?


> I've written programs where threads make the program easier to reason about rather than harder.

I want to hear about this.


The one that comes to mind in particular was something that sent email. We wanted to limit how many emails each of our servers sent to each ISP.

We went with a queue per ISP for outgoing emails; each server that could send email to that ISP then had its own thread. Each thread would take X emails out of the queue every 10 seconds, send them, then sleep for any leftover time (or not, if there wasn't any).

Most of the time is spent sleeping or on I/O - the point of threads wasn't for CPU usage, it was because it made everything so simple to implement. The only synchronization issues were taking emails out of the queue and putting them back somewhere if they failed to send.

All of the logic for each thread in its main loop read like a normal, sequential program. I guess you could call this an embarrassingly easy problem, but part of why it's so easy is that it maps so naturally onto threads.
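
A stripped-down Python sketch of that design (the batch size, interval, and send call are hypothetical):

    import queue
    import threading
    import time

    BATCH_SIZE = 50      # hypothetical per-server limit
    INTERVAL = 10.0      # seconds per batch window

    isp_queues = {"example-isp": queue.Queue()}   # one queue per ISP

    def sender_loop(isp, server):
        """One thread per (ISP, outbound server): drain a batch, send, sleep."""
        q = isp_queues[isp]
        while True:
            started = time.monotonic()
            batch = []
            while len(batch) < BATCH_SIZE:
                try:
                    batch.append(q.get_nowait())
                except queue.Empty:
                    break
            for email in batch:
                try:
                    server.send(email)    # hypothetical send call
                except Exception:
                    q.put(email)          # failed sends go back on the queue
            # sleep for whatever is left of the window, if anything
            time.sleep(max(0.0, INTERVAL - (time.monotonic() - started)))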


Okay, here's one. =)

At work we recently wrote something for handling automated translations of content via our external review providers. Occasionally some stuff-to-translate gets pulled from a database, we figure out what it needs to be translated to, and job orders get dropped into queues for each (provider, from-lang, to-lang) tuple. Once the queue is processed, the translated content gets pushed back into the database.

One fairly major catch: each of our translation providers supports a different number of concurrent translation requests - we can do two simultaneous English->Spanish translations, say, but four English->French ones. So I wrote it such that each (provider, from-lang, to-lang) queue is serviced by its own threadpool with an upper bound on active threads equal to the maximum the external translation provider can handle concurrently, so we get in-flight management for free. (This could have ugly consequences in overloaded cases with tons of providers/tons of in-flight requests, but the machine it runs on is dedicated to this process and, as we tend to do, the JVM it runs on is provisioned with approximately eight hojillion bytes of RAM.)

There are certainly other ways to accomplish this task, but threading and threadpools made it conceptually a lot simpler to reason through. (Though evented was the first thing I thought of, we couldn't really do it that way, both because Java's support for it is poor and because of the nature of some of our translation providers; for one of them, at least, we have to repeatedly hit an HTML page and scrape it to find out when our translation job is done!) We could treat each module as a discrete case - the simple queued interface let the people writing the translation handlers treat it as if it were a single-threaded application. That's actually exactly how we wrote it, too: I told the other person on the project to just drop the class they were writing into main() and make sure it worked, while I built the infrastructure in which it'd actually run as a daemon. He didn't have to care about the threading, while it gave us resource control and ease of expansion.
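
Their system is on the JVM, but the same bounded-pool-per-tuple pattern is easy to sketch in Python (the provider names and limits here are invented):

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical concurrency limits per (provider, from_lang, to_lang).
    LIMITS = {
        ("acme", "en", "es"): 2,
        ("acme", "en", "fr"): 4,
    }

    # One bounded pool per tuple: the pool size itself caps how many
    # requests can be in flight against that provider at once.
    pools = {key: ThreadPoolExecutor(max_workers=n) for key, n in LIMITS.items()}

    def submit_job(provider, from_lang, to_lang, job, translate):
        """Queue a translation job; it runs when a slot for that tuple frees up."""
        return pools[(provider, from_lang, to_lang)].submit(translate, job)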


I actually kind of like the GIL. It allows me to write threaded code in a "lazy" way sometimes, when I don't want to take the effort to do it properly (e.g. implementing a message queue as a simple list, or using a global bool or int as a state flag). Even though these aren't the correct structures for the job, it sometimes allows me to prototype a simple threaded approach quickly, and at least be sure it won't cause weird clashes in the interpreter. When I want to later port it to a proper solution, I'll switch to the multiprocess model, which forces you to use queues and other IPC mechanisms instead of shared data structures. That's more effort, but usually worth it, if not for efficiency then for robustness. But for one-off scripts and little proofs of concept it can be nice to just carefully share some variables and know with some assurance that you're not going to break anything in the runtime. That's not to say that transactional memory wouldn't be welcome, but sometimes the GIL allows a certain pragmatism in a prototyping scenario.
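
Something like this sketch, which leans on the GIL to keep list.append() and pop() from corrupting anything (queue.Queue and threading.Event are the proper tools once it grows up):

    import threading
    import time

    messages = []    # plain list as a "queue"; only tolerable under the GIL
    done = False     # bare bool as a state flag

    def producer():
        global done
        for i in range(5):
            messages.append(i)   # append is atomic under CPython's GIL
        done = True

    def consumer():
        while not done or messages:
            if messages:
                print("got", messages.pop(0))
            else:
                time.sleep(0.01)

    threading.Thread(target=producer).start()
    consumer()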

In Python, threads are "enough" to implement event-based concurrency, as well as asynchronous I/O and calling out to long-running C functions for number crunching.

Beyond that, the main reason to remove the GIL and allow true multicore parallelism is purely efficiency. The catch is, if I'm serious about efficiency, I probably won't be using Python, or at least I'll be porting my CPU-intensive functions to C.

Of course, if PyPy eventually provides Java-like speeds, maybe it makes sense to go this route, as Python could start to be depended on for efficiency, so the gains of true parallelism would be visible. But removing the GIL from CPython seems kind of pointless, since the interpreter has so much overhead regardless.


> I actually kind of like the GIL. It allows me to write threaded code in a "lazy" way sometimes

As far as I understand, this is exactly why the PyPy team went the STM route: they want to guarantee the same semantics as CPython with the GIL, but with more parallelism.

The idea here is that most of the time threads do not conflict, i.e. they do not mutate objects while other threads are accessing them. Based on this assumption, you can optimistically run a snippet of code assuming that no conflict will happen; if a conflict does occur, all the effects of that code (the "transaction") are rolled back and the code is run again.

Since the interpreter can decide where to put the transaction boundaries, it can put them where CPython would acquire/release the GIL. This would give basically GIL-equivalent semantics.
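
A toy illustration of those retry semantics for a single shared value (this is not PyPy's implementation, just the optimistic-commit idea sketched in plain Python):

    import threading

    class Cell(object):
        def __init__(self, value):
            self.value = value
            self.version = 0
            self.lock = threading.Lock()

    def atomic_update(cell, func, retries=100):
        """Run func speculatively; commit only if nothing changed underneath us."""
        for _ in range(retries):
            seen = cell.version
            new_value = func(cell.value)      # speculative work, no lock held
            with cell.lock:                   # short commit phase
                if cell.version == seen:
                    cell.value = new_value
                    cell.version += 1
                    return new_value
            # conflict: someone committed first; discard new_value and retry
        raise RuntimeError("too much contention")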


You're getting lucky. As I understand it, the GIL just protects certain pieces of interpreter state; you still need to lock to get correct behavior from your own code. The presence of the GIL doesn't remove the need to lock your data structures, and each thread created with the threading stdlib is actually a native thread on your platform.


Yep. The GIL means that individual Python bytecode operations are atomic, and so by extension single Python statements that translate to individual bytecode instructions can be treated as atomic (if you don't care about portability to alternate Python implementations that lack a GIL). But complicated Python statements that translate into multiple bytecode instructions (like dict.setdefault) can't. And, of course, neither can sequences of multiple statements. So you still need locks for most cases where you mutate objects that are shared across threads.
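
The classic demonstration, as a minimal sketch:

    import threading

    counter = 0
    lock = threading.Lock()

    def unsafe_increment():
        global counter
        counter += 1       # read, add, store: several bytecodes, so two
                           # threads can interleave and lose updates

    def safe_increment():
        global counter
        with lock:         # the whole read-modify-write is now atomic
            counter += 1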


Ah yes, that's a good point. I wouldn't go as far as to say I'm just getting lucky, though, as I do tend to reserve my lazy bad practices for very simple cases. But I would never use such code in production anyway.


The link to the previous entry is broken. Here it is: http://morepypy.blogspot.com/2011/06/global-interpreter-lock...





