Hacker News new | past | comments | ask | show | jobs | submit login
Armin Ronacher: Collections in C (pocoo.org)
77 points by jgalvez on Nov 23, 2010 | hide | past | favorite | 38 comments



Armin seems to have reinvented the BSD queue macros.

Having once had the pleasure of inheriting a codebase animated by CPP-macro collections, allow me to be a voice in favor of running, not walking, from programmers who embrace them. They're tricky, they blow up, they're extremely noisy in the code, and (worst of all) they don't encapsulate, offering new devs a myriad of ways to write tangly hard-to-understand subtly broken code that works directly with the collection structure.

Type safety is entirely overrated. C requires so much deliberation to deploy even the simplest collection that the likelihood of you picking up a foo where a bar was what was provided is minimal, and easily diagnosed. Just use voidstar.

One of the first things I did at that job was to port STLport's Red-Black tree (from <map>) to C code, specialized on voidstar. It worked beautifully. If there are times when you don't want to use a void* collection (or a gossamer-thin wrapper around one), those are also times when you don't want a generic collection library to begin with.


There's also a much cleaner way to have interfaces and implementations in C, without relying on macro hackery.

Why, there's even a whole book on that: http://www.amazon.com/Interfaces-Implementations-Techniques-...

Like Norvig's Paradigms of Artificial Intelligence: Case Studies in Common Lisp and Joshua Bloch's Effective Java, it's one of those books which despite having a specific programming language in the title, is really about programming in general.


This is my favorite C book ever, I recommend it wholeheartedly, and it recommends voidstars. =)


To those who don't yet have it on their shelves, source is available here:

http://code.google.com/p/cii/source/browse/#svn/trunk


Respectful of you but not of the book, I have to ask: what do you like about it? My initial reactions were very negative (http://news.ycombinator.com/item?id=1705332). I love K&R, but recoiled in horror at every page I managed to read of this book.

I've written entire macro-based systems similar to those recommended by the author here, and find them very useful. Void pointers are great where they can be used, but I love the efficiency and clarity of the generated code approach.


I found your objections to this book pedantic.

For instance, in a chapter on string atoms (symbols) --- a concept which virtually no C program in the world takes advantage of, despite the centrality of the concept to Lisp, Python, and Ruby --- and this is the first chapter in the book --- you were shocked by his use of a 43 byte string to hold a number string... because someone might be using that code on a 192 bit machine?

You missed the forest for the trees. If you don't want your code tainted by the number 43, don't write that code. The point of the book is how you structure your code, divide it into subsystems, and present coherent interfaces to the rest of your program.


I'm not bothered by the idea of a fixed length for strings, rather the explanation that it is best to use the number 43 instead of a named constant (MAX_STRING_LEN) to avoid namespace pollution. If the first paragraph I quoted had been standalone, or better as a short comment, I would have been in full approval.

Anyway, I appreciate your response. It's good try to appreciate what others see that I do not. Yes, if you ignore the details of the code and the explanations, there are some good parts. If you look _only_ at the big picture, it's probably a fine book. And I love his clearly prefixed naming conventions. But I think you'd do a lot better reading something like the SQLite source code rather than this book if you want to see examples of good C.

I guess I have to ask: do you feel that chapter 4 on using setjmp() and longjmp() plus some brittle macros for error handling is also good for learners? I thought it was technically very clear but about 40 years out of date as to good practice. Is this a forest or a tree?


As I recall, Meyer also recommends voidstars for generic collection implementations to deal with template bloat.


This is one of the last books that I'm still hoping will make it to a digital format somehow. I just can't see ordering a new paper copy again.


I mostly agree (and use void* myself), but I don't 100% agree on the last point. There are a bunch of algorithms that are conceptually generic, but get a significant speed-up if type-specialized to certain types, mostly numeric ones. For example, I think STL's type-specializing sort is a pretty reasonable default, while C suffers from having only a non-type-specializing sort built in, which ends up much slower if you're sorting, say, an array of integers. Same with a bunch of machine-learning algorithms that work on generic comparable types, but work much faster if you type-specialize them to integer or double, the way C++ templates would do. That's the main use case I see those terrible C macro approaches used for. An alternative I've seen is to actually generate the type-specialized variants ahead-of-time using some external templating or macro system, or even some hacked-up Perl script.


Fast atomic types support is indeed a good reason to do this sort of hackery - that's one of the reason why it is used in kernels. The difference of performance hashmap/vectors/sets between void* and int-specialized can be very significant (easily up to one order of magnitude if not more in some real cases).

I wished there was a language (not C++) which could help for this kind of things. Unfortunately, it becomes difficult very fast. One interesting approach is K as suggested by some FreeBSD hackers, but it never went into production AFAIK (http://wiki.freebsd.org/K)


I'd question how much of that impact comes from specializing the collection code, and how much of it comes from allocation and locality. The temptation with void* collections is to malloc everything, which is lethal for performance, and one of the very few non-bug things I've come across whose fix actually yields a 10x performance improvement.


Memory allocation can indeed be amortized with specialized allocator, but that's not what I had in mind. The context I usually operate with is numerical computation, and the indirection cost is very high in those cases, especially when you can access memory in blocks if you use specialized allocator. It would be very hard to do well for sure. It is well known that allocator is one of the main weakness of the STL (one of the reason for the existence of Electronic Arts STL).

I am certainly not advocating doing this in general - I think the need for atomic support in generic collections is quite low (I have been investigating the issue recently to add fast and generic support for sparse matrices in scipy). I am pretty sure the macro, specialized ones used in freebsd (tree/queue.h) and linux (rbtree, list) have been benchmarked to hell, though, and would trust them more than most STL implementations.


Seriously, why not C++? It's just C with some stricter type checking. Pretend it's C, then judiciously grab a pair of angle brackets when you need them. No macros required.


C++ has portability issue (where portability does not mean supporting g++ and MSVC), poor interoperability, and is much harder to maintain in a distributed team of people with varying ability in the language (e.g. open source).

When those are not issues, C++ is appropriate. Otherwise, it is a pain.


Why not just pick and choose what you need from C++? For all the hate that C++ gets, one nice thing is it doesn't force anything down your throat. Between simple not using features, and compile time options (disabling exceptions for example), you NEVER have to pay for a feature that you don't want or need.

He could have literally written C-style code except used C++ just to create a few generic data stores instead of using all of this preprocessor magic which I guarantee is more fragile/harder to debug than basic C++ templates.


FWIW, C++ does force some things on you, such as automatically typedefing structs instead of giving you a separate namespace, not warning about multiple declarations with different arguments (because function overloading is a feature), not warning about multiple definitions in different compilation units (linker just picks one, assumes they are equivalent), having to cast assignment of void* to a typed pointer (ugly and redundant), slower compilation, and generally later errors/warnings (thus usually more cryptic and less useful, although it does stricter type checking in some ways, so this point is not completely clear-cut).


Hi, just want to bounce on the "cast your void to a pointer"*, actually you don't need to do that if you don't use an object oriented paradigm à la Java, but go generic full speed and use return by value. No pointers, little or no casting.


> go generic full speed

What does this mean?

As a simple example, suppose I use malloc to allocate memory (because I don't want to allocate with new which would oblige me to handle exceptions and prevent me from handing ownership over to a C client or using an allocator provided by a C client (without extra work)). I have to cast the return value as in

  struct Foo *x = (struct Foo*)malloc(n*sizeof(*x));
instead of

  struct Foo *x = malloc(n*sizeof(*x));
This looks purely cosmetic, but in C, I can write the macro

  #define MallocA(n,p) (!(*(p) = malloc((n)*sizeof(**(p))))
which then works with a standard error checking convention

  struct Foo *arr;
  err = MallocA(n,&arr);CHK(err);
This doesn't work in C++ without non-portable typeof or evil and less safe (not conforming for function pointers)

  *(void**)(p) = malloc((n)*sizeof(**(p)))
Similarly, if a client registers a callback with a context, I store their context in a void* and pass it back to them

  int UserCallback(void *ctx,...) {
    struct User *user = (struct User*)ctx;
instead of

  int UserCallback(void *ctx,...) {
    struct User *user = ctx;
I understand that this is just cosmetic. I don't see how "going generic full speed" helps with this. Also note that aggregate returns are slower for all but trivially small structures, and downright bad for big structures.


If you really want to learn how to do this, and do it really really well, check out sglib:

http://sglib.sourceforge.net/

It's got nearly every data structure you can imagine, all implemented as CPP meta hackery. Brilliant.


I used this library a while back, and it was nice at first, until I noticed that his quicksort always picked the first element as the pivot, which makes it O(n^2) on already-sorted input. Then I started second guessing everything, and when you do that with a library you may as well write it yourself without the CPP hackery. At least then the compiler errors will make sense and you can debug it.

I went to go check out the code to see if this issue had been fixed, but the download link has gone bad.

I mentioned this library on HN some years back, when I was still excited about it, and the reply I got was something like "No! Not that macro boneyard!" He was right.


I don't agree with one of his premises:

"I normally like to avoid tools that generate new C files for the very simple reason that these usually generate ugly looking code I then have to look at which is annoying or at least require yet another tool in my toolchain which causes headaches when compiling code on more than on operating system. A simple Python script that generates a C file sounds simple, but it stops being simple if you also want that thing to be part of your windows installation where Python is usually not available or works differently."

I have moved most of my macro code generation to using another language, with a proper template engine, to generate code (in my case C++ but the same holds for C). If he doesn't trust Python across platforms, he can take a fixed version with known properties and code around it, or take another language (which presumably will have the same issues). I use PHP, I'm a bit careful in cross-platform features and it works great.

Apart from this, he can still write his code generator in C, so that on a new platforms it can be bootstrapped with a regular C compiler, then process his templates, then compile his actual code. It's painful to do string processing in C, but this generator only needs to be written once anyway, and it's not much work. Plus if he uses a small template engine, that'll take most of the pain away (most of the work will be in modifying the templates, not the code generation engine).


A good C library implementing common data structures (lists, trees, hash tables, etc) is GLib:

http://library.gnome.org/devel/glib/

It does many other things, provides abstractions for threads, files, etc. It is a general-purpose utility library originally written for GTK+. I never really understood why GLib is not more often used.


"This worked really well up to the point where I needed two kinds of lists. One that accepted floats and another one that works with arbitrary pointers."

Had he never heard of unions?

Even without knowledge of unions, you could treat a pointer as a, depending on the architecture, sequence of 32 bits. That sequence can be cast to whatever you want. So long as you have a clear way of describing what you've stored in that 32 bit sequence, you should never get confused.


Another way is probably to use a Linux Kernel-like list: http://isis.poly.edu/kulesh/stuff/src/klist/


The kernel list approach also has the advantage that there's only one implementation of the list operation functions, whereas the one outlined in the article generates a new set of functions for each type.


There is something to be said for not taking DRY to extremes. This appears to be one of those extremes.


Oh man, with that kind of macro "metaprogramming" (if it can be called that), you're bringing on a world of error message pain. It's almost not worth it.


Painful as it may look, it's used in the while a lot.

Most of the time it actually works well.


I find myself reminded of the famous Dr. Sagan quote for some reason: “If you want to make an apple pie from scratch, you must first create the universe.” (From "cloning Minecraft" to "implementing a list" in only five sentences!)


This is both awesome and very terrible at the same time.

This proves that you can do magic in C but it also proves that C is really a low level assembler :-)

For me, stuff like this has been one of the prime reasons to prefer C++ over C. You can totally get the same performance and compiled code when using templates. But you gain compile-time checking and much more robust code.

(These preprocessor tricks are also popular in the BSD kernel)


`#define _CAT(A,B) A##B` is going to invoke undefined behaviour, I think:

""" None of these macro names, nor the identifier `defined`, shall be the subject of a #define or a #undef preprocessing directive. Any other predefined macro names shall begin with a leading underscore followed by an uppercase letter or a second underscore. """ (s6.10.8)


Not really, all that says is that it risks the name colliding with a built-in macro defined by the compiler. At worst that seems like implementation-defined behaviour.


I was too hasty and confused implementation-defined with undefined.


I give thanks every day that there are people like the author and others who can write and understand code like that so that I don't have to. I'll be damned if that doesn't look like trying to build a boat using a chainsaw and popsicle sticks with instructions written in Farsi.


this kind of stuff has always scared me. Its seems like it would be really easy to get confused just trying to use this stuff. Generally if i need a list i either use the void* interface and cast as neccesary, or if i want to do something like store a list of floats, or anything where i want to store the value of the item in the list (not just a pointer) then i use a 'fat' datastructure like this

https://github.com/lukesandberg/Regex/blob/master/src/util/f...

when you construct it you just tell it how big each item is then it copies the value into the stack element for storage.

Sometimes it a few extra copy/cast statements but at least its very clear what you are doing.



For those who are curious, this solution has a name:

Variable Queue




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: