How to share a NumPy array between processes (superfastpython.com)
67 points by jasonb05 on Aug 29, 2023 | 16 comments



Though now deprecated (probably in favor of a more database/DBMS-like approach such as DuckDB), the Arrow Plasma store runs as a separate process that holds handles to objects:

  $ plasma_store -m 1000000000 -s /tmp/plasma
Arrow arrays are like NumPy arrays, but they're designed for zero-copy use, e.g. IPC (interprocess communication). There's a dtype_backend="pyarrow" kwarg on the Pandas read_ methods (and convert_dtypes) for Arrow-backed DataFrames:

  df = pandas.read_csv("data.csv", dtype_backend="pyarrow")

The Plasma In-Memory Object Store > Using Arrow and Pandas with Plasma > Storing Arrow Objects in Plasma https://arrow.apache.org/docs/dev/python/plasma.html#storing...

Streaming, Serialization, and IPC > https://arrow.apache.org/docs/python/ipc.html

"DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB" (2021) https://duckdb.org/2021/12/03/duck-arrow.html


What's the latency delta of reading from a local database instance versus reading from memory - is there much additional overhead to slow things down?


With most databases, a `SELECT * FROM tblname` has to serialize a copy of the query result and let the ODBC (or similar) database driver deserialize it, so the query result unavoidably gets copied at least once at query time.

Plasma and e.g. DuckDB do zero copy, so there is no unmarshalling of the complete database file into RAM, no table scan to satisfy an unindexed query, and no marshal/unmarshal (serialize/deserialize) of the query result for the database driver (which might otherwise reach the DB/DBMS over a local pipe(), a TCP socket, HTTP(S), or protobufs over HTTPS).

If the memory is held in RAM by the Plasma store, I'm not sure which queries can be serviced just by handing back an object reference to where the full non-virtual table is allocated in RAM (on only one node). Presumably only when there's no filtering or transformation in the query, and the query doesn't touch "virtual" or "materialized" tables.
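
For example, with DuckDB running in-process you can register an Arrow table that's already in RAM and query it without any driver round trip (sketch; names are illustrative):

  import duckdb
  import numpy as np
  import pyarrow as pa

  # An Arrow table that already lives in this process's memory.
  table = pa.table({"x": np.arange(10_000_000)})

  con = duckdb.connect()      # in-process, no server or driver round trip
  con.register("tbl", table)  # exposes the Arrow table as a view, no copy

  # The scan over `tbl` reads the Arrow buffers directly; only the (small)
  # aggregate result gets materialized.
  print(con.execute("SELECT avg(x) FROM tbl WHERE x % 2 = 0").fetch_arrow_table())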


Good content, but over-SEO-optimized. Would be nice to hear about the actual efficiency of using these methods.

For instance, does fork() copy the page of memory containing the array? I believe it's Copy-on-Write semantics, right? What happens when the parent process changes the array?

Then, how do Pipe and Queue send the array across processes? Do they also pickle and unpickle it? Use shared memory?
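
For concreteness, here's the kind of round trip I mean (as far as I know, multiprocessing pickles whatever goes on the queue and unpickles it on the other side):

  import numpy as np
  from multiprocessing import Process, Queue

  def worker(q):
      arr = q.get()            # unpickled here, in the child
      print(arr.sum())

  if __name__ == "__main__":
      q = Queue()
      p = Process(target=worker, args=(q,))
      p.start()
      q.put(np.arange(1_000))  # pickled here, in the parent
      p.join()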


This is a super common occurrence in training loops for ML models because PyTorch uses multiprocessing for its dataloader workers. If you want to read more, see the discussion in this issue: https://github.com/pytorch/pytorch/issues/13246#issuecomment...

As you’ve pointed out, fork() isn’t ideal for a number of reasons, and in general it’s preferred to use torch tensors directly instead of numpy arrays so that you aren’t forced into using fork().

There’s also this write-up, which I found to be quite useful for details: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...
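
For completeness, a rough sketch of sharing a torch tensor across workers with torch.multiprocessing (illustrative only; real dataloader setups differ):

  import torch
  import torch.multiprocessing as mp

  def worker(t):
      t.fill_(1.0)             # in-place write on the shared storage

  if __name__ == "__main__":
      data = torch.zeros(1_000_000)
      data.share_memory_()     # move the tensor's storage into shared memory

      p = mp.Process(target=worker, args=(data,))
      p.start()
      p.join()
      print(data[:5])          # reflects the child's write, no copy came back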


“I believe it's Copy-on-Write semantics, right? What happens when the parent process changes the array?”

Yes, and you just answered your own question.


For a programming novice: Is CoW where a 'child' process operates on a copy of data stored elsewhere in memory, but you have some mechanism to always update the copy any time the parent process modifies/writes to the original data?


No: if I have stuff and I fork(), you, my child process, are expected to have a copy of that stuff in its current state. However, the kernel lies to you. It delays copying the stuff until I write to it, i.e. until I need to see a new state. Hence, "copy-on-write".

In this case, I am the parent python process, you are the child processes started with the `multiprocessing` module, and the data is the numpy array.

The implication is that the changes made by the parent process to the numpy array post fork() won't be visible to the child processes.
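
A small sketch of that behavior, assuming the "fork" start method is available (e.g. on Linux):

  import time
  import numpy as np
  import multiprocessing as mp

  def child(arr):
      time.sleep(1)                    # give the parent time to write
      print("child sees:", arr[0])     # still 0.0: the child kept its snapshot

  if __name__ == "__main__":
      ctx = mp.get_context("fork")
      arr = np.zeros(10)
      p = ctx.Process(target=child, args=(arr,))
      p.start()                        # fork() happens here
      arr[0] = 42                      # parent writes after the fork
      print("parent sees:", arr[0])    # 42.0
      p.join()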


I was searching for a similar article. I'm working on an AutoML Python package where I use different packages to train ML models on tabular data. Very often the memory is not properly released by the external packages, so the only way to manage memory is to execute training in separate processes.


Actually ran into this problem this week, toyed around with multiprocessing.shared_memory (which seems to also rely on mmapped files, right?) and decided to just embrace the GIL.

Multiprocessing is not needed when all of your handful of subprocesses are just calling NumPy code and releasing the GIL anyway.

Also, some/most NumPy functions are multithreaded (depending on the BLAS implementation they're linked against); take advantage of that, schedule huge operations, and just let the interpreter sit idle waiting for the result.
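
For reference, the multiprocessing.shared_memory route looks roughly like this (Python 3.8+; on Linux it's backed by a file under /dev/shm that gets mmapped):

  import numpy as np
  from multiprocessing import Process, shared_memory

  def worker(name, shape, dtype):
      shm = shared_memory.SharedMemory(name=name)           # attach by name
      arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # no copy
      arr *= 2                                              # visible to the parent
      shm.close()

  if __name__ == "__main__":
      src = np.arange(8, dtype=np.float64)
      shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
      arr = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
      arr[:] = src                                          # one copy into shared memory

      p = Process(target=worker, args=(shm.name, src.shape, src.dtype))
      p.start()
      p.join()

      print(arr)                                            # doubled by the child
      shm.close()
      shm.unlink()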


I have seen this movie before, it always ends the same way.

Look to go parallel for a speed increase, burn a bunch of time, decide it's way too hard, and just wait the extra few minutes instead.


It's actually the opposite. My current job is making a piece of code faster that my predecessor already tried to speed up using multiprocessing. I now try to pull stuff back into a single process.

The code, for example, loads a file from disk in a separate process to keep the main process available for number crunching. The problem: the data is then sent to the main process and another copy is created, which takes time. Doing the same with threads is way faster, since the I/O operations release the GIL anyway.


Actually, you get the speed increase if you use threads in Python with numpy, because numpy releases the GIL internally, so operations using numpy actually run in parallel (i.e. C libraries can release the GIL and do get the performance benefit of multiple processors). Although yes, once you're back to running Python code, that part won't be parallel due to the GIL.
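
A sketch of that pattern with plain threads (the matmuls run inside BLAS with the GIL released, so they can overlap on multiple cores):

  import numpy as np
  from concurrent.futures import ThreadPoolExecutor

  rng = np.random.default_rng(0)
  blocks = [rng.standard_normal((2000, 2000)) for _ in range(4)]

  def crunch(block):
      # The matmul drops into BLAS and releases the GIL for the duration,
      # so several of these can run on different cores at the same time.
      return (block @ block.T).trace()

  with ThreadPoolExecutor(max_workers=4) as pool:
      print(list(pool.map(crunch, blocks)))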


My genuine deep love for numpy extends to them doing these things without me knowing anything about it.


That's something I worry about when reading about future Python releases without the GIL.

Will people just throw more and more threads at unfathomably slow Python code to get somewhat fast results?


There is also Apache Arrow, which covers a similar use case. Maybe this is worth considering: https://arrow.apache.org/docs/python/memory.html#on-disk-and...
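
A rough sketch of the memory-mapped route from that link (file path and column name are placeholders):

  import numpy as np
  import pyarrow as pa

  # Write a NumPy-backed record batch to an Arrow IPC file once...
  data = pa.array(np.arange(1_000_000, dtype=np.float64))
  batch = pa.record_batch([data], names=["x"])
  with pa.OSFile("/tmp/data.arrow", "wb") as sink:
      with pa.ipc.new_file(sink, batch.schema) as writer:
          writer.write_batch(batch)

  # ...then any number of processes can memory-map it and hand the buffers
  # straight back to NumPy without copying them onto their own heap.
  with pa.memory_map("/tmp/data.arrow") as source:
      reader = pa.ipc.open_file(source)
      arr = reader.get_batch(0).column(0).to_numpy(zero_copy_only=True)
      print(arr[:5])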



