Writing C in Cython (honnibal.wordpress.com)
195 points by syllogism on Oct 20, 2014 | 54 comments



Python + Numpy: when the core loop is computationally intensive and benefits from vector ops.

Cython: when repeated invocations overpower any core-loop advantage.

In the case of NLP, which is the author's field, you'll find that not only do you have an intensive core task, but the number of times you'll call it is insane.

To give an idea: the time complexity for parsing is O(n^3) [n = words in a sentence]. For Combinatory Categorial Grammar (CCG), the formalism that the author Matthew and I commonly use, it's O(n^6). At that point the inner loop almost doesn't matter -- even if it were just a no-op, the function-call overheads and everything around it are performance killers. Both of those are what Cython works incredibly well for, especially when you get full control of the data structures.
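
To make that concrete, here's a minimal sketch (the scoring function is a made-up stand-in, not anyone's real model) of why the chart loops dominate: with cdef types, the triple loop and the inner call both compile down to plain C, so there is no per-iteration Python overhead.

    cdef float score(int start, int split, int end) nogil:
        # Hypothetical stand-in for a real scoring model
        return 0.0

    cdef void fill_chart(float* chart, int n) nogil:
        # Classic O(n^3) chart loop: every split of every span
        cdef int span, start, split
        for span in range(2, n + 1):
            for start in range(n - span + 1):
                for split in range(start + 1, start + span):
                    chart[start * n + (start + span - 1)] += \
                        score(start, split, start + span)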

This is one of those situations where the notion that premature optimization is evil is really quite far off! =]


I don't see where this is premature optimization. You, and the author, clearly know a lot about the field and the nature of the problem, so you're already past the 'premature' part. If I know from the start that I've got a function that will be called 10,000 times per second, there's no reason I should write the unoptimized version first. That's like starting every sort off with a bubble sort; I don't need to begin with a quicksort that switches to insertion sort as appropriate, but I know I can do better than bubble sort from the get-go.

Defining "premature optimization" as the less hyperbolic "optimization before we know we need it," I think it's clear from the problem you've described that you already know you need it!



That's amazing and makes me really sad that Tcl has not won the DSL world over.


Back in the late '90s I worked for a startup that used Tcl as its core language.

Quite a good experience.


Can't tell if serious.


Speaking of "really is just C with some syntactic sugar" -- is there really a point in putting effort into learning (and using) Cython over C? Reading the code in the article, I wasn't completely sure whether I should find it "cool" or "confusing". Not that there was anything I actually disliked, but it's somewhat awkward to see C written like Python, so I can't "rate" it myself right away.


Well, here's a less trivial example:

https://github.com/syllog1sm/redshift/blob/develop/redshift/...

The task/algorithm is roughly (but not precisely) the one described here:

http://honnibal.wordpress.com/2013/12/18/a-simple-fast-algor...

If you just scroll through parser.pyx, you might get the flavour of why this is better than straight C. The "core", or "skeleton", of the code is all C-typed, but I "fall back" to Python for workaday stuff like reading and writing config files, collecting the set of labels, etc.

There are also places in the algorithmic code where I use Python data structures. In Parser._count_feats, I use a dictionary of dictionaries. This would be quite annoying in C, and efficiency is quite irrelevant here.
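
Schematically, the mixed style looks something like this (names invented for illustration, not the actual redshift code):

    cdef class Parser:
        cdef int* weights  # C-level state for the hot path

        cdef int score(self, int* words, int n):
            # C-typed core: compiles to a plain C loop over raw arrays
            cdef int i, total = 0
            for i in range(n):
                total += self.weights[words[i]]
            return total

        def _count_feats(self, sentences):
            # Plain Python where speed is irrelevant:
            # a dictionary of dictionaries
            counts = {}
            for sent in sentences:
                for label, feat in sent:
                    counts.setdefault(label, {})
                    counts[label][feat] = counts[label].get(feat, 0) + 1
            return counts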

I think using Cython in this way has given me a competitive edge in my research. This library has by far the best published results on speech parsing[1], and its speed/accuracy trade-off on written text is as good as any other library's out there.

[1] http://www.transacl.org/wp-content/uploads/2014/04/46.pdf


Oh, thanks, that gives some insights indeed.

> I think using Cython in this way has given me a competitive edge in my research.

May I ask, have you ever considered using something like OCaml? From what I've heard, it's quite convenient to use ML-family languages for writing parsers, and OCaml generates pretty well-performing code while being more "Python-expressive" than "C-expressive".

> This library has by far the best published results on speech parsing

Ah, yes, I remember seeing it a while ago, when you had just released it. Never looked at the code, though. Didn't recognize it was you until this post. =)


Which profiling tool do you use to target your optimizations?


I use cProfile; for the Cython code you need to set profile=True as a compilation directive.
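
Concretely, the directive goes at the top of the .pyx file, and then cProfile works as usual from a normal Python driver script (module and function names here are hypothetical):

    # At the top of the .pyx file, the directive is itself a comment:
    # cython: profile=True

    # Then, in an ordinary Python script:
    import cProfile, pstats
    import my_module  # the compiled Cython module (hypothetical)

    cProfile.run('my_module.main()', 'out.prof')
    pstats.Stats('out.prof').sort_stats('cumulative').print_stats(20)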


Your heart is in the same place as mine. Mileage may vary, but I found that unless I am speeding up a legacy application that has already invested in Python and the Python ecosystem, Cython is less valuable. That may be because I can churn out C or C++ code productively enough. If I do need to add some scripting, I find adding Lua (now) or Tcl (earlier) lighter.

With Cython I have to remember which subset of Python syntax the version I am using supports, and the real time-saver that is Numpy is no longer a time-saver. The reason is that Cython has to call back into Numpy's C API for Numpy expressions; if you do that in a loop, it is considerable overhead. The alternative is to write the array-indexing code at a low level in Cython. This is significant to me, because avoiding that error-prone and tedious part is one of the reasons I use Python/Numpy in the first place.

Then there is the question of debugging C code that a code-generation tool has generated. Usually there are gobs and gobs of it. Cython does try to help you with debugging that, but I find myself better off debugging C code that I wrote myself. Not to mention the mature tooling that exists around C or C++.

For me, usually what happens is that I Cythonize a small part, then I have to Cythonize the loop containing it for performance reasons, and my Cython keeps swallowing the Python application bit by bit, mostly from the inside out. So there usually comes a time when I say screw it, I will just write this in C or C++.

As I said, this is a personal take; your experience may vary. I still happily use Cython if it's Python code that I have to speed up, or if I have to collaborate with someone in Python. I consider Cython the best tool for that. If, however, I were to write something from scratch, I typically just go ahead with C++ nowadays, and have lately been poking and prodding at Julia. I like what I see there, a lot.


Another example that I worked on last week is for the UI framework Kivy. Most of the core libraries are in Cython, and particularly for the graphics stuff I think it's a nice balance of execution speed and ease of development. In particular, it's trivial to use the Cython objects from pure Python, without having to mess with the CPython API on either side.

Here's an example of the experimental SVG parser/renderer we've been working on over the past few weeks, if you want to see some more "real world" Cython: https://github.com/kivy/kivy/blob/master/kivy/graphics/svg.p...


Cython is used to write extension modules for Python. If that is something you want to do, it is better than plain C: the code is more compact and less error-prone, because some things are taken care of behind the scenes. The point of using Cython is to combine the best of both worlds (Python and C).


> some things are taken care of behind the scenes

Like what?


Reference counting, type checking, converting iteration over lists or ranges to more efficient code. In general a lot of error checking.
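
A small illustration of the difference, with the same loop untyped and typed (hypothetical functions):

    def total_untyped(items):
        # Cython emits the reference counting and type checks
        # for you -- semantics identical to the interpreter
        s = 0
        for x in items:
            s += x
        return s

    def total_typed(int n):
        # With C types, this range() loop compiles to a plain
        # C for-loop: no refcounting or dispatch per iteration
        cdef int i, s = 0
        for i in range(n):
            s += i
        return s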


This is similar to how people are using the LuaJIT FFI, as a higher-level, garbage-collected C that lets you control data layout in memory.


That's an interesting comparison, because under the hood Cython and LuaJIT FFI could hardly be more different.

Cython is an optimizing Python (+ extensions) -> C compiler.

LuaJIT FFI is a library for Lua. It doesn't change the language at all (there are none of Cython's optional annotations), there is no build-time tool or need to run a C compiler.

But I can see how they could serve some of the same use cases.


"LuaJIT FFI is a library for Lua. It doesn't change the language at all ... "

This is literally true, of course, but in my ~5 yrs using it now I can say that writing LuaJIT+FFI-data-type code is, for me, both aesthetically and mechanically different from writing plain Lua code. The post captures it quite well:

  This is basically the inverse of the old promise that languages like Python came 
  with: that you would write your whole application in Python, optimise the “hot 
  spots” with C, and voila! C speed, Python convenience, and money in the bank.

  This was always much nicer in theory than practice. In practice, your data 
  structures have a huge influence on both the efficiency of your code, and how 
  annoying it is to write.
I often write my LuaJIT programs from the ground up, designed around C data types and a C-style memory-allocation discipline. But I can always ditch that in areas where I know I don't care (e.g., configuration loading etc., where 10 lines of C become a 1-liner in Lua).


In both cases a tool is being adapted for something it wasn't intended for: to write C-style code—specifically, code that works with memory the way C does—in a high-level scripting language.

This way of combining high and low levels within the same program seems qualitatively unlike, say, C++, traditional FFIs, and other answers to this problem. It's fascinating to see it emerge from more than one existing technology.


"Python with C data types" was somewhat of a design goal for Pyrex, the language that Cython followed/built on:

http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/ver...


That's fascinating because it is very much about the same problem.


I guess I'm not sure which tool you meant was adapted; Cython (and Pyrex) certainly adapt Python to work more directly with memory, but I would say the style of code in the article is an intended use case of them.

(I also might have undersold the connection between the two; Cython is arguably just a friendly fork of Pyrex -- I guess mostly they wanted to move faster.)


I just meant that the OP is adapting Python (via Cython) similarly to how FFI-driven LuaJIT adapts Lua. The resulting programming style is different enough from classic Python or Lua that it approaches being a different language—something like an embedded C DSL.


When I've tried to use Cython I've not enjoyed it. I feel like I'm invoking a load of barely comprehensible copy and paste magic if I need to use numpy arrays. I imagine it would be possible to understand this stuff, but I need it so rarely that I can't justify learning it. C or C++ feels so much simpler.


Can you give an example of the copy/paste you're worried about?

I agree that working with numpy arrays in Cython introduces some semantic overhead; then again, in newer versions of Cython they've introduced the more generic concept of typed memoryviews [0]. I had a lot of success recently using these to generate some very efficient multithreaded code with a minimum of hassle.

[0] http://docs.cython.org/src/userguide/memoryviews.html
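
For a flavour, a minimal memoryview routine might look like this (hypothetical, not taken from the linked docs):

    import numpy as np

    def row_sums(double[:, :] a):
        # Typed memoryview: indexing compiles to raw pointer
        # arithmetic -- no numpy C API calls inside the loop
        cdef int i, j
        cdef double[:] out = np.zeros(a.shape[0])
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i] += a[i, j]
        return np.asarray(out)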


It's things like the square brackets and triangular brackets, e.g. def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f, np.ndarray[DTYPE_t, ndim=2] g)

I know what this is for, but I have no idea how this is implemented - is it some sort of library, some sort of macro, something inside the compiler, or what? It feels very opaque. I don't understand the lifetime of these objects or what could go wrong.

Even with C++ templates, I can usually work out how something is happening, but not here without research. The Cython-specific syntax also feels pretty unpythonic. The feeling I get from using it is that it's an ugly Python/C/C++ mutant with a load of hidden rules.

What I really would like is a set of templates or even macros for C++ which would allow easy wrappers to be written for different Python implementations, handling the reference counting, etc. Does such a thing exist?


The example you give is the old way to specify the type of a numpy array. With typed memoryviews it's more straightforward, IMHO:

    cdef convolve(double [:, :] f, double [:, :] g)
I don't know if there are templates or macros for C++ which do the same thing, but if you look at the output of Cython and the complexity and amount of work it's doing, I don't think it's likely something like that could work.

The generated code is of course clunky and difficult to follow, but I found that when trying to understand specific things it's easy to follow what parts of the Python C API are being invoked and consider the overhead involved. The annotated HTML output is useful for this (cython -a).


And usually you want to avoid invoking the Python C API from Cython unless you have to use it for some reason. The whole point of typed memoryviews is that you get direct access to the raw memory buffer and can access it like a numpy array or even like a pointer array.

This blog post [0] incrementally shows how to get micro-optimizations out of your Cython code.

In the end, the autogenerated code is usually not too bad to follow, since it's a one-to-one transpilation from Cython to C. cython -a helps a ton, or, if you can use it, the cythonmagic [1] for the IPython notebook embeds the output of cython -a inline in a notebook, making the iteration process much quicker.

[0] https://jakevdp.github.io/blog/2012/08/08/memoryview-benchma...

[1] http://ipython.org/ipython-doc/2/config/extensions/cythonmag...


In the specific case you pasted, the [] are just syntactic sugar to specify information about the arrays that you are passing in to Cython.

I find that far more legible than complicated template expressions - but that's likely a matter of habit.

About object lifetimes: there is a lot of good information on the Cython webpage (although it is a bit spread out). Usually, unless I am writing a Cython-only application, I tend to write tight Cython numerical routines that take a pre-allocated input/output vector and just update the output vector.
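
A routine in that style might look like this (hypothetical sketch):

    def scale(double[:] x, double a, double[:] out):
        # The caller pre-allocates 'out'; this routine only
        # fills it in, so nothing is allocated inside the loop
        cdef int i
        for i in range(x.shape[0]):
            out[i] = a * x[i]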



We developed biicode (a C and C++ deps manager) in Python using Cython:

http://blog.biicode.com/bii-internals-compiling-your-python-...


So how would you do multithreaded apps with Cython while avoiding stuff like the GIL?

I work with Objective-C a lot, and Python was my first language, so seeing this kind of unification just blows my mind.


As the sibling comment suggests, multi-threading through Cython isn't as smooth as it could be. But, it doesn't seem too bad. I've used it in a rather rudimentary way to accelerate key computations for some NLP models that I've been playing with recently.

As an example in [1], you can check out the function at lines 293-314, which calls the function at 224-256 (or lines 259-291 depending on the machine architecture...). The actual dispatcher that initializes the threads is in [2]. I guess the key points are the use of "nogil" in the function declaration on lines 224-229, and the use of "with nogil" when that function is called on line 312. Note that the code is complicated a bit by my use of BLAS functions accessed via function pointers provided by SciPy.

[1]: http://github.com/Philip-Bachman/NN-Python/blob/master/nlp/C...

[2]: http://github.com/Philip-Bachman/NN-Python/blob/master/nlp/C...
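
Stripped of the BLAS details, the basic shape of that pattern is roughly this (hypothetical sketch, not the linked code):

    cdef double sum_squares(double* xs, int n) nogil:
        # No Python objects touched here, so the GIL can be
        # released while this runs
        cdef int i
        cdef double s = 0.0
        for i in range(n):
            s += xs[i] * xs[i]
        return s

    def run(double[:] xs):
        cdef double s
        with nogil:
            s = sum_squares(&xs[0], <int>xs.shape[0])
        return s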


In theory this is easy and good, in practice it's not.

http://docs.cython.org/src/userguide/parallelism.html

The problem is that it's really nasty to declare all your functions nogil, because that means you don't get exception handling. There may be a trick to this that I've missed --- I find the GIL handling stuff in Cython sort of confusing.

I think my applications are genuinely hard to multi-thread, so maybe it's usually easier than I've found it. It doesn't help that I use a MacBook Air, and OpenMP isn't available for Clang. Developing on OS X sucks sometimes.


I've had a lot of success with Cython's parallelism. The key is to release the GIL in tight inner functions of big loops. You can then parallelize the loop by iterating over calls to the nogil function or a "with nogil" block.

The OpenMP-only support kind of sucks, but I'm working on a Linux cluster, so it doesn't bother me except when I'm on my laptop.
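
A minimal sketch of that pattern (hypothetical; needs OpenMP flags when compiling):

    from cython.parallel import prange

    cdef double work(double x) nogil:
        # Tight inner function: no Python objects, so no GIL needed
        return x * x

    def parallel_sum(double[:] xs):
        cdef int i
        cdef double total = 0.0
        # prange releases the GIL and spreads iterations over
        # OpenMP threads; += on 'total' becomes a reduction
        for i in prange(xs.shape[0], nogil=True):
            total += work(xs[i])
        return total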


Sometimes it is possible to do long-running computations in pure C code, and then the GIL can be released. Mostly, though, you use Cython because it lets you combine Python code with C code, and any code that touches Python data structures must hold the GIL. Therefore the same advice as for Python in general applies: use multiprocessing instead of threading.
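
That is, something like this (module and function names hypothetical):

    from multiprocessing import Pool
    import my_ext  # compiled Cython extension (hypothetical)

    if __name__ == '__main__':
        # Separate processes, separate GILs: workers run the
        # extension code in parallel with no GIL contention
        pool = Pool(4)
        results = pool.map(my_ext.crunch, range(100))
        pool.close()
        pool.join()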


In C, you need header files so that variables are declared before they're used. How does Cython deal with this? I find headers painful, and would appreciate a memory foundry language more if it didn't suffer from this.


Unfortunately, you're dead right --- this sucks in C and it sucks in Cython!

I don't understand why .pxd files are necessary --- it should be possible to generate them from the .pyx files. At some point I'll have a go at this...

But, as of now, you need to maintain this separate file, which you must keep synchronised with changes in the original. And, yes --- this sucks!

The only mercy is that writing code in the headers is illegal.


It is not illegal to write code in .pxd files. For cdef functions to be inline-able across modules, they need to be defined in a .pxd file. But that is the exception to the rule.
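
For example (hypothetical module): ordinary cdef functions get a declaration in the .pxd and a body in the .pyx; inline ones are the exception and carry their body in the .pxd so other modules can inline them after cimport:

    # mymath.pxd -- what other modules see via cimport
    cdef int clip(int x, int lo, int hi)   # declaration only

    cdef inline int square(int x):         # the exception:
        return x * x                       # inline body lives here

    # mymath.pyx -- the implementation
    cdef int clip(int x, int lo, int hi):
        if x < lo: return lo
        if x > hi: return hi
        return x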


In Cython you can use Python modules/functions/variables as you normally would, dynamically and without declaring them. To integrate with C code you do need headers and type declarations. Other tools (ctypes, cffi) can load libraries dynamically, but I think that has its own tradeoffs.


It is a bit hard to read code with one-letter variable names.


I took them from the original post. I imagine they're the standard forms of the equations.


I'm not sure why position is "r", but F for force, m for mass, w for world (in a short function), N for, well, N - doesn't seem too bad. Math looks much worse if you spell everything out.


It's pretty standard to use r as a position vector in physics.


He's following the code in the original article: https://www.crumpington.com/blog/2014/10-19-high-performance...

Makes side by side comparisons easier.


I took a peek at Cymem just to see what was going on, as it was described as a "small" wrapper. A 3600-line .c file was... more than I expected.


https://github.com/syllog1sm/cymem/blob/master/cymem/cymem.p...

The .c file really shouldn't be in the repo...


That is the C code generated by Cython. The actual source is in the .pyx files.


While it is interesting to see all these efforts improving the Python ecosystem, I go straight to an ML language with a native compiler.


I wonder how much time is needed to learn C well enough to make proper use of Cython.


It depends what you mean by 'use Cython properly'. To speed up some numerical routines, you hardly need to know anything about pointers, malloc, etc.; you just move your Python function into its own file, rename it from .py to .pyx, add type annotations, and enjoy the free speed-up.

To do more complex things (low-level wrappers, etc.) you do need to know more about C.
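
As a minimal illustration of the first case (hypothetical function):

    # fib.py -- plain Python
    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a + b
        return a

    # fib.pyx -- the same function with C types added
    def fib(int n):
        cdef int i
        cdef long a = 0, b = 1, tmp
        for i in range(n):
            tmp = b
            b = a + b
            a = tmp
        return a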


It's good for job security.


Cylons write in Cython.



